Database Design for Document Processing Pipeline: Audit Trail + Schema Evolution

I’m building an automated immigration form filling service that processes 200+ documents per application using AI/OCR to extract structured data. Looking for architectural advice on handling audit trails and schema evolution.

Stack: TypeScript, PostgreSQL, Prisma

Current Architecture Problems

Current approach: Store extracted data as merged JSONB objects

CREATE TABLE applications (
  id UUID PRIMARY KEY,
  -- Merged extracted data
  beneficiary JSONB,  -- ~30 fields
  petitioner JSONB,   -- ~25 fields  
  employment JSONB,   -- ~20 fields
  -- ... more sections
);

Processing flow:

  1. Upload 5-15 documents per application (passports, forms, certificates)
  2. AI extracts structured data from each document separately
  3. Merge all extractions into final JSONB objects (first-wins strategy)
  4. Store merged result

Key Problems:

  • No audit trail: Can’t tell which document provided which field value
  • Data loss: When the same field is extracted multiple times (e.g., different passport numbers), only the first value is kept
  • Schema brittleness: Adding new fields breaks validation on existing data
  • No conflict resolution: Can’t surface competing values to users

Example Scenario

Document A (passport): passportNumber = "P123456" 
Document B (form):     passportNumber = "P654321"
Document C (resume):   passportNumber = "P123456"

Current result: { passportNumber: "P123456" }
Desired: Show user all 3 values + sources for manual resolution

Solutions Considered

Option 1: Field-level storage (EAV-like)

CREATE TABLE field_extractions (
  id UUID PRIMARY KEY,
  application_id UUID,
  field_path VARCHAR,     -- e.g., "beneficiary.passportNumber"
  value JSONB,
  source_document_id UUID,
  confidence DECIMAL,
  extracted_at TIMESTAMP
);

Option 2: Document-first storage

CREATE TABLE document_extractions (
  id UUID PRIMARY KEY,
  application_id UUID,
  document_id UUID,
  extracted_data JSONB,   -- Full extraction result per document
  schema_version VARCHAR
);

Questions

  1. Which approach better handles audit trails and conflict resolution?
    • EAV-style field tracking vs document-centric storage?
  2. How to handle schema evolution gracefully?
    • Version schemas in application code vs database?
    • Migrate old data vs maintain multiple schema versions?
  3. Performance considerations for reconstruction?
    • Building final objects from 100s of field extractions
    • Query patterns for conflict detection
  4. Are there established patterns for this problem domain?
    • Document processing pipelines with provenance tracking
    • Similar architectures in ETL/data integration systems

The system processes ~1000 applications/month with plans to scale significantly. Any architectural guidance would be greatly appreciated!

Which approach is better for audit trails and conflict resolution?

  • EAV-style field tracking (Option 1) is the better fit because:
    • Each field extraction is stored as its own row, together with its source document, confidence, and timestamp.
    • You can always trace which document supplied which value.
    • All conflicting values are kept and can be surfaced to users for manual resolution (see the sketch after this list).
    • This enables fine-grained auditing and straightforward conflict detection.
  • Document-centric storage (Option 2) keeps full per-document extractions, but merging and field-level conflict tracking must happen in extra logic at query time, which makes audit trails harder.
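
As a concrete sketch (reusing the field_extractions table from Option 1, with placeholder UUIDs), the passport-number conflict from the example scenario is simply three rows, each carrying its own provenance:

-- Each extraction is its own row, so conflicting values coexist instead of being lost.
-- gen_random_uuid() is built in from PostgreSQL 13; earlier versions need pgcrypto.
INSERT INTO field_extractions
  (id, application_id, field_path, value, source_document_id, confidence, extracted_at)
VALUES
  (gen_random_uuid(), '00000000-0000-0000-0000-0000000000a1', 'beneficiary.passportNumber', '"P123456"', '00000000-0000-0000-0000-0000000000d1', 0.98, now()),
  (gen_random_uuid(), '00000000-0000-0000-0000-0000000000a1', 'beneficiary.passportNumber', '"P654321"', '00000000-0000-0000-0000-0000000000d2', 0.91, now()),
  (gen_random_uuid(), '00000000-0000-0000-0000-0000000000a1', 'beneficiary.passportNumber', '"P123456"', '00000000-0000-0000-0000-0000000000d3', 0.75, now());

-- Surface every candidate value with its source for manual resolution.
SELECT value, source_document_id, confidence, extracted_at
FROM field_extractions
WHERE application_id = '00000000-0000-0000-0000-0000000000a1'
  AND field_path = 'beneficiary.passportNumber'
ORDER BY confidence DESC;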

How to handle schema evolution gracefully?

  • Version your schemas in application code, not just the database.
  • Store a schema_version with each extraction (per field or per document).
  • Use a deliberate migration plan (see the sketch after this list):
    • New fields: make them optional or give them defaults so existing data keeps validating.
    • Avoid strict validation that rejects unknown fields.
    • Gradually migrate old data, or support multiple versions side by side with backward-compatible reads.
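
One possible shape for this, sketched against the Option 1 table (field_path_aliases is an illustrative name, not an established schema): tag every row with the schema version that produced it, and treat field renames as data instead of destructive migrations.

-- Tag extractions with the schema version that produced them.
ALTER TABLE field_extractions
  ADD COLUMN schema_version VARCHAR NOT NULL DEFAULT 'v1';

-- Optional alias table so old field_paths stay readable under the new schema
-- without rewriting historical rows.
CREATE TABLE field_path_aliases (
  old_path VARCHAR PRIMARY KEY,   -- e.g. 'beneficiary.passportNo'
  new_path VARCHAR NOT NULL       -- e.g. 'beneficiary.passportNumber'
);

-- Reads resolve the canonical path on the fly.
SELECT COALESCE(a.new_path, fe.field_path) AS field_path,
       fe.value,
       fe.schema_version
FROM field_extractions fe
LEFT JOIN field_path_aliases a ON a.old_path = fe.field_path
WHERE fe.application_id = '00000000-0000-0000-0000-0000000000a1';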

Performance & query patterns

  • For EAV (query sketches after this list):
    • Build final objects by aggregating and merging fields at query time (e.g., JSONB aggregation).
    • Index on (application_id, field_path) and (source_document_id) for fast lookups.
    • Conflict detection: find field_paths with more than one distinct value per application.
  • At ~1000 applications/month, EAV with proper indexes and batched writes will perform comfortably; each application amounts to at most a few hundred rows.
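
Rough query sketches for the patterns above (placeholder UUID; tune the tie-breaking rules to your data). Note that jsonb_object_agg yields a flat object keyed by the dotted field_path; re-nesting into beneficiary/petitioner/employment sections would typically happen in application code.

-- Indexes for per-application reads and provenance lookups.
CREATE INDEX idx_field_extractions_app_field
  ON field_extractions (application_id, field_path);
CREATE INDEX idx_field_extractions_source
  ON field_extractions (source_document_id);

-- Conflict detection: field_paths with more than one distinct value.
SELECT field_path,
       COUNT(DISTINCT value)     AS distinct_values,
       jsonb_agg(DISTINCT value) AS candidate_values
FROM field_extractions
WHERE application_id = '00000000-0000-0000-0000-0000000000a1'
GROUP BY field_path
HAVING COUNT(DISTINCT value) > 1;

-- Reconstruction: keep one value per field (highest confidence, then most
-- recent) and fold the result into a single flat JSONB object.
SELECT jsonb_object_agg(field_path, value) AS merged
FROM (
  SELECT DISTINCT ON (field_path) field_path, value
  FROM field_extractions
  WHERE application_id = '00000000-0000-0000-0000-0000000000a1'
  ORDER BY field_path, confidence DESC, extracted_at DESC
) best;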

Established patterns

  • This problem is common in ETL, data integration, and document processing pipelines.
  • Provenance tracking (source, confidence, timestamp) is the key ingredient.
  • Use an immutable event log or append-only store so the full extraction history is preserved (see the sketch after this list).
  • Conflict resolution UIs are standard in these systems.
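
A minimal sketch of that idea, assuming a separate table (field_resolutions is an illustrative name): extraction rows stay immutable and append-only, while user decisions are recorded next to them.

-- User decisions live apart from the raw extractions, which are never mutated.
CREATE TABLE field_resolutions (
  application_id UUID NOT NULL,
  field_path     VARCHAR NOT NULL,
  resolved_value JSONB NOT NULL,
  resolved_by    UUID,                             -- reviewing user
  resolved_at    TIMESTAMP NOT NULL DEFAULT now(),
  PRIMARY KEY (application_id, field_path)
);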

Summary recommendation

  • Go with EAV-style field extraction storage.
  • Keep full provenance metadata per field.
  • Store schema version per extraction.
  • Handle schema evolution in app code and with soft migration.
  • Build final merged data at read/query time, allowing manual conflict resolution (see the query after this list).
  • Index heavily for performance.
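
Putting it together, the read path could look roughly like this (again using the illustrative field_resolutions table): a manually resolved value wins, otherwise the highest-confidence, most recent extraction is used.

SELECT jsonb_object_agg(field_path, value) AS merged
FROM (
  SELECT DISTINCT ON (fe.field_path)
         fe.field_path,
         COALESCE(r.resolved_value, fe.value) AS value
  FROM field_extractions fe
  LEFT JOIN field_resolutions r
    ON r.application_id = fe.application_id
   AND r.field_path = fe.field_path
  WHERE fe.application_id = '00000000-0000-0000-0000-0000000000a1'
  ORDER BY fe.field_path, fe.confidence DESC, fe.extracted_at DESC
) resolved;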

This approach offers the best flexibility, auditability, and user transparency for your automated immigration form filling service.