Building a Reusable Firestore Migration Runner in Go

A reusable Go library for Firestore collection migrations with cursor-based pagination, batch writes, checkpoint resume, rate limiting, and dry-run. Built for real production use.

The migration script worked perfectly. It transformed 847 documents in 30 seconds. Then it crashed at document 848.

I restarted it. It re-processed all 847 documents I’d already migrated, because there was no checkpoint. It crashed again at 848 - same network timeout. I fixed the timeout, restarted. Another 847 wasted reads. Another 847 wasted writes.

My fourth one-off migration script had the exact same problem as the first three. Each time I told myself “it’s just a one-time script, I don’t need resume support.” Each time I was wrong.

The Problem with One-Off Firestore Migration Scripts

If you’ve worked with Firestore long enough, you’ve written this code:

iter := client.Collection("users").Documents(ctx)
for {
    doc, err := iter.Next()
    if err == iterator.Done {
        break
    }
    if err != nil {
        log.Fatal(err) // without this check, a failed Next spins forever
    }
    // transform doc...
    doc.Ref.Set(ctx, newData, firestore.MergeAll)
}

It works on 50 documents. It works on 500. At 5,000, your context times out and you lose all progress. At 50,000, you hit Firestore’s rate limits and get throttled into oblivion.

Every production Firestore migration eventually needs the same things:

  • Cursor-based pagination so you don’t load everything into memory
  • Batch writes so you don’t pay per-document RPC overhead (Firestore caps batches at 500 operations)
  • Checkpointing so crashes don’t restart from zero
  • Rate limiting so you don’t overwhelm Firestore quotas
  • Dry-run mode so you can preview before committing
  • Idempotent transforms so re-runs are safe

The Go SDK makes this harder than it should be. Python and Node.js have long had BulkWriter - a high-level API that handles batching and throttling automatically. The Go client only gained a BulkWriter in v1.9, and it still leaves pagination, checkpointing, and dry-run to you. For atomic writes you get client.Batch() with a 500-operation cap and no auto-retry. Everything else is manual.

After writing six migration scripts with the same boilerplate - and watching each one fail differently - I extracted the pattern into a library.

What We’re Building

A standalone Go library called firestoremigrate that handles the migration machinery. You provide a transform function. The library handles scanning, batching, checkpointing, rate limiting, and error tracking.

Here’s the complete API:

// Your transform function: receives a doc, returns fields to write.
// Return nil to skip. Return error to record a failure.
type TransformFunc func(docID string, data map[string]any) (map[string]any, error)

// Configure the migration
cfg := firestoremigrate.Config{
    ProjectID:     "my-project",
    DatabaseID:    "my-database",    // empty = default database
    Collection:    "users",
    MigrationName: "add-email-index",
    BatchSize:     200,              // default 200, max 500
    WritesPerSecond: 50,             // rate limit (negative = unlimited)
    DryRun:        false,
    Resume:        true,             // pick up where you left off
}

// Run it
runner := firestoremigrate.New(cfg, myTransform)
result, err := runner.Run(ctx)

The result tells you exactly what happened:

type Result struct {
    Processed  int
    Skipped    int
    Written    int
    Errors     int
    FailedDocs []string  // capped at 1000
    Duration   string
    Resumed    bool
}

Let’s build it piece by piece.

Step 1: Checkpoint State

The checkpoint is what makes crash-resume possible. After each scanned page, the runner writes its progress to a Firestore doc. If it crashes, the next run reads the checkpoint and picks up where it left off.

// checkpoint.go
package firestoremigrate

const maxFailedDocs = 1000

type CheckpointState struct {
    LastDocID           string    `firestore:"last_doc_id"`
    Processed           int       `firestore:"processed"`
    Skipped             int       `firestore:"skipped"`
    Written             int       `firestore:"written"`
    Errors              int       `firestore:"errors"`
    FailedDocs          []string  `firestore:"failed_docs"`
    FailedDocsTruncated bool      `firestore:"failed_docs_truncated"`
    StartedAt           time.Time `firestore:"started_at"`
    UpdatedAt           time.Time `firestore:"updated_at"`
    Status              string    `firestore:"status"`
}

func (s *CheckpointState) RecordFailure(docID string) {
    if len(s.FailedDocs) < maxFailedDocs {
        s.FailedDocs = append(s.FailedDocs, docID)
    } else {
        s.FailedDocsTruncated = true
    }
}

Two design decisions worth noting:

Failed docs are capped at 1,000. If more than 1,000 documents fail, something is fundamentally wrong with your transform - you need to debug, not retry. The cap prevents the checkpoint doc from growing unboundedly (Firestore docs max out at 1MB).

The checkpoint collection is configurable (defaults to migration_state). Each migration gets its own doc: migration_state/{migration-name}. This means you can run multiple migrations concurrently on different collections without conflict.

Step 2: Cursor-Based Pagination

Firestore has no efficient OFFSET. The Go SDK does expose Query.Offset, but you’re billed a read for every skipped document, so “skip the first 10,000 documents” costs 10,000 reads. Instead, you use cursor-based pagination: order by document ID, fetch a page, then start the next page after the last document you saw.

runnerScanPage = func(ctx context.Context, c firestoreClient,
    collection string, cursor string, limit int) ([]docSnapshot, error) {

    fc := c.(*firestore.Client)
    q := fc.Collection(collection).
        OrderBy(firestore.DocumentID, firestore.Asc).
        Limit(limit)

    if cursor != "" {
        q = q.StartAfter(cursor)
    }

    docs, err := q.Documents(ctx).GetAll()
    if err != nil {
        return nil, err
    }

    result := make([]docSnapshot, len(docs))
    for i, d := range docs {
        result[i] = docSnapshot{ID: d.Ref.ID, Data: d.Data()}
    }
    return result, nil
}

The cursor is just a string - the last document ID we processed. This is what gets saved in the checkpoint. When resuming, we pass the cursor to StartAfter and Firestore picks up exactly where we left off.

This is cheaper than it sounds. Firestore charges per document read, and StartAfter jumps straight to the cursor position without reading the documents before it. You only pay for the documents you actually fetch.

Step 3: Batch Writes with Rate Limiting

The Go SDK’s client.Batch() gives you an atomic (all-or-nothing) WriteBatch, capped at 500 operations.

runnerWriteBatch = func(ctx context.Context, c firestoreClient,
    collection string, writes []batchWrite) error {

    fc := c.(*firestore.Client)
    batch := fc.Batch()
    for _, w := range writes {
        ref := fc.Collection(collection).Doc(w.DocID)
        batch.Set(ref, w.Data, firestore.MergeAll)
    }
    _, err := batch.Commit(ctx)
    return err
}

MergeAll is critical. It only writes the fields you provide, leaving everything else untouched. This is what makes migrations safe - you’re adding fields, not replacing documents.

Rate limiting uses Go’s x/time/rate token bucket:

if r.cfg.WritesPerSecond > 0 {
    limiter = rate.NewLimiter(rate.Limit(r.cfg.WritesPerSecond), 1)
}

// Before each batch commit:
if limiter != nil {
    if err := limiter.Wait(ctx); err != nil {
        return err // context cancelled or deadline exceeded
    }
}

The default is 50 writes per second. Firestore’s hard limit is 10,000 writes/sec, but you rarely want to hit that - other services are probably reading from the same collection.

Step 4: The Run Loop

The core loop ties everything together. Here’s the structure (simplified from the actual implementation):

func (r *Runner) Run(ctx context.Context) (*Result, error) {
    client, err := runnerNewClient(ctx, r.cfg.ProjectID, r.cfg.DatabaseID)
    if err != nil {
        return nil, err
    }
    defer runnerCloseClient(client)

    // Resume from checkpoint if requested
    var state *CheckpointState
    if r.cfg.Resume {
        state = runnerReadCheckpoint(ctx, client, ...)
        if state != nil && state.Status == "completed" {
            return previousResult(state), nil  // already done
        }
    }
    if state == nil {
        state = &CheckpointState{Status: "in_progress", StartedAt: time.Now()}
    }

    cursor := state.LastDocID
    var pendingWrites []batchWrite

    // Scan loop
    for {
        page, err := runnerScanPage(ctx, client, r.cfg.Collection, cursor, r.cfg.BatchSize)
        if err != nil {
            return nil, err // don't mark the migration completed on a scan failure
        }
        if len(page) == 0 {
            break
        }

        for _, doc := range page {
            state.Processed++
            cursor = doc.ID

            result, err := r.transform(doc.ID, doc.Data)
            if err != nil {
                state.RecordFailure(doc.ID)
                continue
            }
            if result == nil {
                state.Skipped++
                continue
            }

            state.Written++
            pendingWrites = append(pendingWrites, batchWrite{DocID: doc.ID, Data: result})

            if len(pendingWrites) >= r.cfg.BatchSize {
                r.flushBatch(ctx, client, pendingWrites, limiter)
                pendingWrites = nil
            }
        }

        // Periodic checkpoint
        writeCheckpoint(ctx, client, state)

        if len(page) < r.cfg.BatchSize {
            break  // last page
        }
    }

    // Flush remaining + final checkpoint
    r.flushBatch(ctx, client, pendingWrites, limiter)
    state.Status = "completed"
    writeCheckpoint(ctx, client, state)

    return &Result{...}, nil
}

Key behaviors:

  • Transform returns nil: document is skipped (already migrated). This is how idempotency works - your transform function checks if the migration has already been applied.
  • Transform returns error: document is recorded as failed, but processing continues. One bad document doesn’t abort 10,000 good ones.
  • Dry-run mode: flushBatch skips the actual write. Everything else runs normally - you see realistic counts.
  • Completed migration: if you run with Resume: true and the checkpoint says “completed,” it returns immediately. No wasted reads.

Step 5: Testing Without Firestore

The library uses injectable package-level variables for all Firestore operations. Tests swap them with fakes:

func setupFakes(t *testing.T, docs []testDoc) *int {
    t.Helper()
    writeCalls := 0

    // Save originals
    origScan := runnerScanPage
    origWrite := runnerWriteBatch
    // ... save all originals

    t.Cleanup(func() {
        runnerScanPage = origScan
        runnerWriteBatch = origWrite
        // ... restore all
    })

    // Inject fakes
    runnerNewClient = func(...) (firestoreClient, error) {
        return &fakeClient{}, nil
    }

    runnerScanPage = func(ctx context.Context, c firestoreClient,
        collection, cursor string, limit int) ([]docSnapshot, error) {
        // Simulate pagination over in-memory docs
        startIdx := 0
        if cursor != "" {
            for i, d := range docs {
                if d.id == cursor {
                    startIdx = i + 1
                    break
                }
            }
        }
        end := startIdx + limit
        if end > len(docs) {
            end = len(docs)
        }
        // convert the in-memory testDocs (id + data fields) to snapshots
        out := make([]docSnapshot, 0, end-startIdx)
        for _, d := range docs[startIdx:end] {
            out = append(out, docSnapshot{ID: d.id, Data: d.data})
        }
        return out, nil
    }

    runnerWriteBatch = func(...) error {
        writeCalls++
        return nil
    }

    return &writeCalls
}

Injecting fakes through package-level variables is a common pattern in Go codebases - it avoids the complexity of running a Firestore emulator in CI while still exercising the real control flow. The 13 tests in the library cover: defaults, config validation, basic scan/transform/write, skip handling, dry-run, error recording, resume from checkpoint, completed-migration detection, and retry-failures mode.

A Real Migration: 11,532 Documents in 74 Seconds

Here’s the migration that motivated building this library. We had 11,532 Supreme Court case documents in Firestore with party names stored as flat strings:

{
  "pet_name": "GANESH KUMAR",
  "res_name": "STATE OF INDIA"
}

We needed to add structured arrays matching the format used by other court types:

{
  "pet_name": "GANESH KUMAR",
  "res_name": "STATE OF INDIA",
  "petitioners": [{ "name": "GANESH KUMAR" }],
  "respondents": [{ "name": "STATE OF INDIA" }]
}

The transform function is 20 lines:

// sc_parties.go
func transformSCParties(docID string, data map[string]any) (map[string]any, error) {
    // Skip if already migrated
    if _, ok := data["petitioners"]; ok {
        return nil, nil
    }

    petName, _ := data["pet_name"].(string)
    resName, _ := data["res_name"].(string)

    if petName == "" && resName == "" {
        return nil, nil
    }

    result := make(map[string]any)
    if petName != "" {
        result["petitioners"] = []map[string]any{{"name": petName}}
    }
    if resName != "" {
        result["respondents"] = []map[string]any{{"name": resName}}
    }
    return result, nil
}

The CLI wiring:

func main() {
    dryRun := flag.Bool("dry-run", false, "Preview without writing")
    resume := flag.Bool("resume", false, "Resume from checkpoint")
    flag.Parse()

    runner := migrate.New(migrate.Config{
        ProjectID:     "my-project",
        DatabaseID:    "my-database",
        Collection:    "sc_cases",
        MigrationName: "sc-parties-normalize",
        DryRun:        *dryRun,
        Resume:        *resume,
    }, transformSCParties)

    result, err := runner.Run(context.Background())
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Processed: %d  Written: %d  Skipped: %d  Errors: %d  Duration: %s\n",
        result.Processed, result.Written, result.Skipped, result.Errors, result.Duration)
}

Dry run first:

$ go run sc_parties.go --dry-run
Processed: 11535  Written: 11532  Skipped: 3  Errors: 0  Duration: 32.4s

Then the real thing:

$ go run sc_parties.go
Processed: 11535  Written: 11532  Skipped: 3  Errors: 0  Duration: 1m14s

74 seconds. 11,532 documents transformed. Zero errors. The 3 skipped documents either already had the new format or had no party data. The rate limiter stretched it from 32 seconds (read-only) to 74 seconds (with writes at 50/sec).

If it had crashed at document 6,000, I’d run go run sc_parties.go --resume and it would pick up from the last checkpoint - no wasted work.

When This Doesn’t Fit

This library is designed for schema migrations - adding fields, reformatting data, backfilling computed values. It has real limitations:

Not for deletes. The library uses MergeAll exclusively. If you need to remove fields, you’ll need to use firestore.Delete sentinels in your transform output, which works but isn’t the primary use case.

Not for cross-collection migrations. The runner scans one collection and writes back to the same collection. If you need to copy data between collections, you’ll need to handle the destination writes yourself.

Not for real-time. This is a batch tool. It scans the collection once and exits. If new documents arrive during the migration, they won’t be processed. Run it again (idempotent transforms handle this safely) or integrate the new format into your write path first.

Go SDK only. If you’re using Python or Node.js, the built-in BulkWriter already handles the batching and throttling. The Go client has since gained a BulkWriter as well, but it covers only the write side - scanning, checkpointing, and resume are still on you, which is why this library exists.

The library works best when:

  • You have thousands to millions of documents to transform
  • The transform is idempotent (safe to re-run)
  • You need crash-resume for large collections
  • You want to preview changes before committing

The Key Design Decision: Injectable Functions

The most important architectural decision isn’t the pagination or the checkpointing - it’s making every Firestore operation an injectable function:

var (
    runnerNewClient       = func(...) (firestoreClient, error) { ... }
    runnerScanPage        = func(...) ([]docSnapshot, error) { ... }
    runnerWriteBatch      = func(...) error { ... }
    runnerReadCheckpoint  = func(...) *CheckpointState { ... }
    runnerWriteCheckpoint = func(...) error { ... }
)

This single decision enabled:

  • 13 unit tests with zero infrastructure - no Firestore emulator, no Docker, no test project
  • Predictable test behavior - fake scan returns exactly the documents you define
  • Edge case testing - simulate batch write failures, checkpoint corruption, mid-scan errors

The trade-off is that these are package-level variables, which means tests can’t run in parallel. For a migration library that runs sequentially by design, this is the right trade-off.

Source Code

The full library is at github.com/ashishthedev/firestoremigrate. It’s 400 lines of Go with 13 tests. Use it directly or use it as a starting point for your own migration tooling.

go get github.com/ashishthedev/firestoremigrate

The migration script that crashed four times before this library existed? It ran once, completed, and the checkpoint proves it.



About the Author

Ashish Anand


Founder & Lead Developer

Full-stack developer with 10+ years of experience in Python, JavaScript, and DevOps. Creator of DevGuide.dev. Previously worked at Microsoft. Specializes in developer tools and automation.