Core Data Migration Incident Analysis: The Hidden Traps We Overlooked


Core Data and SwiftData are officially supported by Apple, but compared with some open-source frameworks they behave like a “black box”: when something goes wrong, developers often struggle to locate the problem and find an effective fix. This article documents an app startup timeout incident caused by a Core Data model migration, shares the solution, and analyzes the underlying cause in depth.

Multiple User Complaints Within a Week

A few days ago, my developer friend Zhang reached out to me. He had received numerous user complaints within just one week: after updating the app, some long-time users saw only a white screen and could not reach any interface, which made the app completely unusable.

Zhang’s NotingPro is a note-taking app designed specifically for iPadOS. Many long-term users have accumulated massive amounts of data, with individual accounts containing anywhere from several GB to nearly 20GB of data.

What was particularly troubling was that the problems appeared precisely after this update, and those affected were mostly the app’s most loyal users, which left Zhang quite anxious.

The Real Cause of the White Screen

NotingPro uses Core Data with CloudKit as its local and cloud data persistence solution. In this update, Zhang modified the data model by adding two entities and a new attribute to an existing entity.

Since the problem occurred right after the model modifications, we immediately suspected that data migration was causing the issue. However, Zhang’s changes fully complied with lightweight migration rules, and most users were unaffected, so we initially ruled out model incompatibility as the cause.

Considering that the affected users had enormous amounts of local data, we wondered whether Core Data couldn’t efficiently handle super-large database migrations. Based on experience, although 10-20GB of data isn’t small, it’s completely within SQLite’s capabilities. Zhang had also built GB-level test data locally but couldn’t reproduce the issue.

Finally, IPS (iOS Problem Summary) crash reports provided by some users revealed the truth: the Core Data migration took too long and exceeded the iOS watchdog’s 20-second threshold, so the system forcibly terminated the app.

Simply put, the data migration process executed on the main thread, causing white screens due to prolonged blocking.
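One way to catch this kind of stall before users do is to time the store-loading step in a test build and compare it against a budget well below the watchdog limit. The following is only a minimal sketch: `measureSeconds` and the 5-second `launchBudget` are illustrative names and values, not part of Zhang's code, and the `Thread.sleep` call stands in for a synchronous `loadPersistentStores` call.

```swift
import Foundation
import Dispatch

/// Runs `work` and returns its result together with the elapsed time in seconds,
/// measured with a monotonic clock.
func measureSeconds<T>(_ work: () throws -> T) rethrows -> (result: T, seconds: Double) {
    let start = DispatchTime.now().uptimeNanoseconds
    let result = try work()
    let end = DispatchTime.now().uptimeNanoseconds
    return (result, Double(end - start) / 1_000_000_000)
}

// Hypothetical budget, chosen to leave plenty of room under the watchdog threshold.
let launchBudget = 5.0

// Stand-in for the real store-loading call.
let timing = measureSeconds { Thread.sleep(forTimeInterval: 0.05) }
if timing.seconds > launchBudget {
    print("Store loading took \(timing.seconds)s, over the \(launchBudget)s budget")
}
```

A check like this only helps if the test data resembles what long-term users actually have on disk, including the state of any journal files.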

Temporary Solution

After identifying the problem, to help affected users recover quickly, Zhang adopted an emergency fix: move database initialization to a background thread and switch to the normal interface only after the migration completes, so the main thread is never blocked for long.

I organized this approach into a SwiftUI-compatible version (simplified):

Swift
import CoreData
import SwiftUI

@MainActor
final class Stack: ObservableObject {
    @Published var status = LoadingStatus.loading
    private let container: NSPersistentContainer

    init() {
        self.container = NSPersistentContainer(name: "ActorStack")
        loadStores()
    }

    // Called from the error screen to try loading the stores again.
    func retryLoading() {
        status = .loading
        loadStores()
    }

    private func loadStores() {
        // By default loadPersistentStores runs synchronously on the calling thread,
        // so calling it here keeps a slow migration off the main thread.
        DispatchQueue.global().async { [weak self] in
            guard let self else { return }

            self.container.loadPersistentStores { _, error in
                // Return to the main thread before touching published state.
                DispatchQueue.main.async {
                    if let error = error as NSError? {
                        self.status = .failed(error)
                        print("Core Data loading failed: \(error)")
                    } else {
                        self.configureContexts()
                        self.status = .success
                    }
                }
            }
        }
    }

    private func configureContexts() {
        container.viewContext.automaticallyMergesChangesFromParent = true
        container.viewContext.mergePolicy = NSMergeByPropertyObjectTrumpMergePolicy
    }

    var viewContext: NSManagedObjectContext {
        container.viewContext
    }

    static let shared = Stack()
}

enum LoadingStatus {
    case loading
    case success
    case failed(NSError)
}

@main
struct ActorStackApp: App {
    @StateObject var stack = Stack.shared

    var body: some Scene {
        WindowGroup {
            switch stack.status {
            case .success:
                ContentView()
                    .environment(\.managedObjectContext, stack.viewContext)
            case .failed(let error):
                ErrorView(error: error) {
                    stack.retryLoading()
                }
            case .loading:
                LoadingView()
            }
        }
    }
}

This way, regardless of how long the migration takes, the main thread won’t be blocked, and the app will only enter the normal interface after migration completion.

After the updated version went live, all affected users were able to use the app again, although the first launch after updating may involve a longer wait.

The Root Cause

Although the problem was resolved, we still needed to explore why a lightweight migration would take so long. The mystery was only solved when I examined Zhang’s Core Data Stack configuration code.

Several versions earlier, to improve write performance (note-taking apps often generate large amounts of data in short periods), Zhang had adjusted SQLite’s configuration:

Swift
storeDescription.setValue("WAL" as NSString, forPragmaNamed: "journal_mode")
storeDescription.setValue("PASSIVE" as NSString, forPragmaNamed: "wal_checkpoint")
storeDescription.setValue("100000000" as NSString, forPragmaNamed: "journal_size_limit")

Among these, setting wal_checkpoint to PASSIVE mode was the culprit behind the slow migration.

WAL Mode and Checkpoint Mechanism

WAL (Write-Ahead Logging) mode improves read-write concurrency performance by writing all modifications to WAL log files first. Since iOS 7, Core Data has used WAL mode by default.

To prevent WAL files from growing indefinitely, SQLite needs to periodically merge data back to the main database through Checkpoint operations. Core Data’s default strategy automatically executes this operation at appropriate times.

Zhang’s configured PASSIVE mode posed potential risks:

  • SQLite won’t actively execute checkpoints; it only merges the WAL when triggered by other connections.
  • For single-process mobile apps, this means the checkpoint mechanism is essentially useless.
  • WAL files keep growing, and even journal_size_limit does little to cap their size.
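A quick way to see whether a store is affected is to look at the size of the WAL file on disk. The sketch below assumes SQLite's usual naming, where the WAL file sits next to the database with a `-wal` suffix; the helper name `walFileSize(forStoreAt:)` is made up for illustration.

```swift
import Foundation

/// Returns the size in bytes of the `-wal` file that sits next to the SQLite
/// store at `storeURL`, or 0 if there is no such file or it cannot be read.
func walFileSize(forStoreAt storeURL: URL) -> Int64 {
    // SQLite names the write-ahead log "<database file>-wal".
    let walPath = storeURL.path + "-wal"
    guard let attributes = try? FileManager.default.attributesOfItem(atPath: walPath),
          let size = attributes[.size] as? NSNumber else {
        return 0
    }
    return size.int64Value
}
```

Logging this value at launch (for example, `walFileSize(forStoreAt: storeURL) / 1_048_576` for megabytes) could reveal a WAL that has grown to hundreds of MB or more long before a migration exposes it.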

Problem Chain Review

  1. As users accumulate data over time, the WAL file grows unchecked to several GB.
  2. When the app starts, Core Data must execute a checkpoint before it can migrate.
  3. Merging that much WAL data takes an extremely long time and blocks the main thread.
  4. The watchdog detects that the main thread is unresponsive and terminates the app.

No Absolute Right or Wrong

Some readers might think that leaving the WAL settings alone would have avoided this problem. Indeed, the defaults are more broadly applicable and largely prevent such situations. But some developers do have special requirements. In Zhang’s case, even with PASSIVE mode, there would have been no problem as long as the app regularly performed manual merge (checkpoint) operations. The key is a clear understanding of each setting’s details, applicable scenarios, and scope of impact.
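How such a manual merge would be implemented is not part of Zhang's code. As far as I know, Core Data does not expose a dedicated checkpoint call, so one possible approach is to use SQLite's C API directly. The sketch below is an assumption, not an endorsed Core Data technique: `truncateWAL(ofDatabaseAt:)` and `CheckpointError` are hypothetical names, and it is safest to run it off the main thread while the store is not loaded by any persistent store coordinator.

```swift
import Foundation
import SQLite3

enum CheckpointError: Error {
    case openFailed(code: Int32)
    case checkpointFailed(code: Int32)
}

/// Opens the SQLite file at `url` on its own connection and runs a TRUNCATE
/// checkpoint, which merges the WAL into the main database file and then
/// truncates the WAL file to zero bytes.
func truncateWAL(ofDatabaseAt url: URL) throws {
    var db: OpaquePointer?
    let openCode = sqlite3_open_v2(url.path, &db, SQLITE_OPEN_READWRITE, nil)
    defer { sqlite3_close(db) } // closing a nil handle is a harmless no-op
    guard openCode == SQLITE_OK else {
        throw CheckpointError.openFailed(code: openCode)
    }

    // Passing nil for the database name checkpoints all attached databases.
    let code = sqlite3_wal_checkpoint_v2(db, nil, SQLITE_CHECKPOINT_TRUNCATE, nil, nil)
    guard code == SQLITE_OK else {
        // SQLITE_BUSY here usually means another connection is reading or writing.
        throw CheckpointError.checkpointFailed(code: code)
    }
}
```

Because the work is proportional to the amount of WAL data, running this regularly keeps each run short; running it for the first time against a multi-GB WAL would take just as long as the checkpoint that caused the incident.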

This incident reminds us: any optimization must be carefully implemented after evaluating long-term impacts. For most applications using Core Data, I recommend:

  1. Prioritize Core Data default configurations.
  2. If custom WAL settings are needed:
    • Avoid PASSIVE mode.
    • Set a reasonable journal_size_limit (for example, 10-20 MB).
    • Actively run checkpoints on a regular basis.
  3. Move database initialization to background threads, especially in scenarios with large data volumes.
  4. Conduct edge case testing before release to ensure stability.
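As a rough illustration of the second recommendation, a store description with a bounded WAL might be set up as follows. This is a sketch under assumptions: `makeStoreDescription(for:)` is a hypothetical helper, the 20 MB figure is just an example from the range above, and the pragma values are passed as strings in the same style as Zhang's original configuration.

```swift
import CoreData

/// Builds a store description that keeps WAL journaling but caps how much
/// WAL file is left on disk, and deliberately sets no wal_checkpoint pragma.
func makeStoreDescription(for storeURL: URL) -> NSPersistentStoreDescription {
    let description = NSPersistentStoreDescription(url: storeURL)
    // WAL is already Core Data's default journal mode; it is set here only to be explicit.
    description.setValue("WAL" as NSString, forPragmaNamed: "journal_mode")
    // journal_size_limit is expressed in bytes: roughly 20 MB in this example.
    description.setValue("20000000" as NSString, forPragmaNamed: "journal_size_limit")
    return description
}
```

The description would then be assigned before loading, for example `container.persistentStoreDescriptions = [makeStoreDescription(for: storeURL)]`, with the loading itself still done off the main thread.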

Conclusion

Although most developers may never run into a similar problem, I hope this write-up serves as a useful reference for the community. Core Data isn’t a complete “black box” - the key is whether we’re willing to explore and understand how it works.

While performance optimization is important, stability should always come first - any configuration changes must be accompanied by thorough testing to ensure they won’t trigger unexpected “chain reactions”.
