Database Design for Document Processing Pipeline: Audit Trail + Schema Evolution

I’m building an automated immigration form filling service that processes 200+ documents per application using AI/OCR to extract structured data. Looking for architectural advice on handling audit trails and schema evolution.

Stack: TypeScript, PostgreSQL, Prisma

Current Architecture Problems

Current approach: Store extracted data as merged JSONB objects

CREATE TABLE applications (
  id UUID PRIMARY KEY,
  -- Merged extracted data
  beneficiary JSONB,  -- ~30 fields
  petitioner JSONB,   -- ~25 fields  
  employment JSONB,   -- ~20 fields
  -- ... more sections
);

Processing flow:

  1. Upload 5-15 documents per application (passports, forms, certificates)
  2. AI extracts structured data from each document separately
  3. Merge all extractions into final JSONB objects (first-wins strategy)
  4. Store merged result

Key Problems:

  • No audit trail: Can’t tell which document provided which field value
  • Data loss: When the same field is extracted multiple times (e.g., different passport numbers), only the first value is kept
  • Schema brittleness: Adding new fields breaks validation on existing data
  • No conflict resolution: Can’t surface competing values to users

Example Scenario

Document A (passport): passportNumber = "P123456" 
Document B (form):     passportNumber = "P654321"
Document C (resume):   passportNumber = "P123456"

Current result: { passportNumber: "P123456" }
Desired: Show user all 3 values + sources for manual resolution

Solutions Considered

Option 1: Field-level storage (EAV-like)

CREATE TABLE field_extractions (
  id UUID PRIMARY KEY,
  application_id UUID,
  field_path VARCHAR,     -- e.g., "beneficiary.passportNumber"
  value JSONB,
  source_document_id UUID,
  confidence DECIMAL,
  extracted_at TIMESTAMP
);

Option 2: Document-first storage

CREATE TABLE document_extractions (
  id UUID PRIMARY KEY,
  application_id UUID,
  document_id UUID,
  extracted_data JSONB,   -- Full extraction result per document
  schema_version VARCHAR
);

Questions

  1. Which approach better handles audit trails and conflict resolution?
    • EAV-style field tracking vs document-centric storage?
  2. How to handle schema evolution gracefully?
    • Version schemas in application code vs database?
    • Migrate old data vs maintain multiple schema versions?
  3. Performance considerations for reconstruction?
    • Building final objects from 100s of field extractions
    • Query patterns for conflict detection
  4. Are there established patterns for this problem domain?
    • Document processing pipelines with provenance tracking
    • Similar architectures in ETL/data integration systems

The system processes ~1000 applications/month with plans to scale significantly. Any architectural guidance would be greatly appreciated!

Which approach is better for audit trails and conflict resolution?

  • EAV-style field tracking (Option 1) is the better fit because:
    • Each field extraction is stored as its own row, together with its source document, confidence, and timestamp.
    • You can always trace which document supplied which value.
    • All conflicting values are kept and can be surfaced to users for manual resolution (see the sketch after this list).
    • This enables fine-grained auditing and straightforward conflict detection.
  • Document-centric storage (Option 2) keeps full per-document extractions, but merging and field-level conflict tracking must happen in extra logic at query time, which makes audit trails harder.
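
As a concrete sketch (reusing the field_extractions table from Option 1, with placeholder UUIDs), the passport-number conflict from the example scenario is simply three rows, each carrying its own provenance:

-- Each extraction is its own row, so conflicting values coexist instead of being lost.
-- gen_random_uuid() is built in from PostgreSQL 13; earlier versions need pgcrypto.
INSERT INTO field_extractions
  (id, application_id, field_path, value, source_document_id, confidence, extracted_at)
VALUES
  (gen_random_uuid(), '00000000-0000-0000-0000-0000000000a1', 'beneficiary.passportNumber', '"P123456"', '00000000-0000-0000-0000-0000000000d1', 0.98, now()),
  (gen_random_uuid(), '00000000-0000-0000-0000-0000000000a1', 'beneficiary.passportNumber', '"P654321"', '00000000-0000-0000-0000-0000000000d2', 0.91, now()),
  (gen_random_uuid(), '00000000-0000-0000-0000-0000000000a1', 'beneficiary.passportNumber', '"P123456"', '00000000-0000-0000-0000-0000000000d3', 0.75, now());

-- Surface every candidate value with its source for manual resolution.
SELECT value, source_document_id, confidence, extracted_at
FROM field_extractions
WHERE application_id = '00000000-0000-0000-0000-0000000000a1'
  AND field_path = 'beneficiary.passportNumber'
ORDER BY confidence DESC;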

How to handle schema evolution gracefully?

  • Version your schemas in application code, not just the database.
  • Store a schema_version with each extraction (per field or per document).
  • Use a deliberate migration plan (see the sketch after this list):
    • New fields: make them optional or give them defaults so existing data keeps validating.
    • Avoid strict validation that rejects unknown fields.
    • Gradually migrate old data, or support multiple versions side by side with backward-compatible reads.
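
One possible shape for this, sketched against the Option 1 table (field_path_aliases is an illustrative name, not an established schema): tag every row with the schema version that produced it, and treat field renames as data instead of destructive migrations.

-- Tag extractions with the schema version that produced them.
ALTER TABLE field_extractions
  ADD COLUMN schema_version VARCHAR NOT NULL DEFAULT 'v1';

-- Optional alias table so old field_paths stay readable under the new schema
-- without rewriting historical rows.
CREATE TABLE field_path_aliases (
  old_path VARCHAR PRIMARY KEY,   -- e.g. 'beneficiary.passportNo'
  new_path VARCHAR NOT NULL       -- e.g. 'beneficiary.passportNumber'
);

-- Reads resolve the canonical path on the fly.
SELECT COALESCE(a.new_path, fe.field_path) AS field_path,
       fe.value,
       fe.schema_version
FROM field_extractions fe
LEFT JOIN field_path_aliases a ON a.old_path = fe.field_path
WHERE fe.application_id = '00000000-0000-0000-0000-0000000000a1';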

Performance & query patterns

  • For EAV (query sketches after this list):
    • Build final objects by aggregating and merging fields at query time (e.g., JSONB aggregation).
    • Index on (application_id, field_path) and (source_document_id) for fast lookups.
    • Conflict detection: find field_paths with more than one distinct value per application.
  • At ~1000 applications/month, EAV with proper indexes and batched writes will perform comfortably; each application amounts to at most a few hundred rows.
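
Rough query sketches for the patterns above (placeholder UUID; tune the tie-breaking rules to your data). Note that jsonb_object_agg yields a flat object keyed by the dotted field_path; re-nesting into beneficiary/petitioner/employment sections would typically happen in application code.

-- Indexes for per-application reads and provenance lookups.
CREATE INDEX idx_field_extractions_app_field
  ON field_extractions (application_id, field_path);
CREATE INDEX idx_field_extractions_source
  ON field_extractions (source_document_id);

-- Conflict detection: field_paths with more than one distinct value.
SELECT field_path,
       COUNT(DISTINCT value)     AS distinct_values,
       jsonb_agg(DISTINCT value) AS candidate_values
FROM field_extractions
WHERE application_id = '00000000-0000-0000-0000-0000000000a1'
GROUP BY field_path
HAVING COUNT(DISTINCT value) > 1;

-- Reconstruction: keep one value per field (highest confidence, then most
-- recent) and fold the result into a single flat JSONB object.
SELECT jsonb_object_agg(field_path, value) AS merged
FROM (
  SELECT DISTINCT ON (field_path) field_path, value
  FROM field_extractions
  WHERE application_id = '00000000-0000-0000-0000-0000000000a1'
  ORDER BY field_path, confidence DESC, extracted_at DESC
) best;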

Established patterns

  • This problem is common in ETL, data integration, and document processing pipelines.
  • Provenance tracking (source, confidence, timestamp) is the key ingredient.
  • Use an immutable event log or append-only store so the full extraction history is preserved (see the sketch after this list).
  • Conflict resolution UIs are standard in these systems.
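
A minimal sketch of that idea, assuming a separate table (field_resolutions is an illustrative name): extraction rows stay immutable and append-only, while user decisions are recorded next to them.

-- User decisions live apart from the raw extractions, which are never mutated.
CREATE TABLE field_resolutions (
  application_id UUID NOT NULL,
  field_path     VARCHAR NOT NULL,
  resolved_value JSONB NOT NULL,
  resolved_by    UUID,                             -- reviewing user
  resolved_at    TIMESTAMP NOT NULL DEFAULT now(),
  PRIMARY KEY (application_id, field_path)
);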

Summary recommendation

  • Go with EAV-style field extraction storage.
  • Keep full provenance metadata per field.
  • Store schema version per extraction.
  • Handle schema evolution in app code and with soft migration.
  • Build final merged data at read/query time, allowing manual conflict resolution (see the query after this list).
  • Index heavily for performance.
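
Putting it together, the read path could look roughly like this (again using the illustrative field_resolutions table): a manually resolved value wins, otherwise the highest-confidence, most recent extraction is used.

SELECT jsonb_object_agg(field_path, value) AS merged
FROM (
  SELECT DISTINCT ON (fe.field_path)
         fe.field_path,
         COALESCE(r.resolved_value, fe.value) AS value
  FROM field_extractions fe
  LEFT JOIN field_resolutions r
    ON r.application_id = fe.application_id
   AND r.field_path = fe.field_path
  WHERE fe.application_id = '00000000-0000-0000-0000-0000000000a1'
  ORDER BY fe.field_path, fe.confidence DESC, fe.extracted_at DESC
) resolved;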

This approach offers the best flexibility, auditability, and user transparency for your automated immigration form filling service.