I’m building an automated immigration form-filling service that uses AI/OCR to extract structured data from 200+ documents per application. I’m looking for architectural advice on handling audit trails and schema evolution.
Stack: TypeScript, PostgreSQL, Prisma
Current Architecture Problems
Current approach: Store extracted data as merged JSONB objects
CREATE TABLE applications (
id UUID PRIMARY KEY,
-- Merged extracted data
beneficiary JSONB, -- ~30 fields
petitioner JSONB, -- ~25 fields
employment JSONB, -- ~20 fields
-- ... more sections
);
Processing flow:
- Upload 5-15 documents per application (passports, forms, certificates)
- AI extracts structured data from each document separately
- Merge all extractions into final JSONB objects (first-wins strategy, roughly as sketched after this list)
- Store merged result
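For reference, the merge step today is roughly equivalent to the sketch below (simplified; the `Extraction` type and the null check are illustrative, not our actual code):

```typescript
// Simplified sketch of the current first-wins merge.
type Extraction = Record<string, unknown>;

function mergeFirstWins(extractions: Extraction[]): Extraction {
  const merged: Extraction = {};
  for (const extraction of extractions) {
    for (const [field, value] of Object.entries(extraction)) {
      // Keep the first non-null value; later conflicting values are silently dropped.
      if (!(field in merged) && value != null) {
        merged[field] = value;
      }
    }
  }
  return merged;
}
```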
Key Problems:
- No audit trail: Can’t tell which document provided which field value
- Data loss: when multiple documents yield different values for the same field (e.g., different passport numbers), only the first one is kept
- Schema brittleness: Adding new fields breaks validation on existing data
- No conflict resolution: Can’t surface competing values to users
Example Scenario
Document A (passport): passportNumber = "P123456"
Document B (form): passportNumber = "P654321"
Document C (resume): passportNumber = "P123456"
Current result: { passportNumber: "P123456" }
Desired: show the user all three values with their sources for manual resolution (a possible data shape is sketched below)
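Concretely, the review UI would need something like a per-field conflict set. One possible shape (the type and field names are illustrative, not an existing API):

```typescript
// Sketch of a per-field conflict set surfaced to a reviewer.
interface FieldCandidate {
  value: unknown;
  sourceDocumentId: string;
  confidence?: number;
}

interface FieldConflict {
  fieldPath: string;            // e.g. "beneficiary.passportNumber"
  candidates: FieldCandidate[]; // every extracted value with its source
  resolvedValue?: unknown;      // set once a user picks a winner
}

const passportConflict: FieldConflict = {
  fieldPath: "beneficiary.passportNumber",
  candidates: [
    { value: "P123456", sourceDocumentId: "doc-a-passport" },
    { value: "P654321", sourceDocumentId: "doc-b-form" },
    { value: "P123456", sourceDocumentId: "doc-c-resume" },
  ],
};
```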
Solutions Considered
Option 1: Field-level storage (EAV-like)
CREATE TABLE field_extractions (
id UUID PRIMARY KEY,
application_id UUID,
field_path VARCHAR, -- e.g., "beneficiary.passportNumber"
value JSONB,
source_document_id UUID,
confidence DECIMAL,
extracted_at TIMESTAMP
);
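With this layout, building the final object and detecting conflicts is essentially a group-by over `field_path`. A rough sketch of that read path (the row type mirrors the table above; fetching the rows, e.g. via Prisma, is omitted):

```typescript
// One row per extracted field, mirroring field_extractions.
interface FieldExtractionRow {
  fieldPath: string;          // e.g. "beneficiary.passportNumber"
  value: unknown;
  sourceDocumentId: string;
  confidence: number;
  extractedAt: Date;
}

interface ResolvedField {
  value: unknown;                   // provisional winner (highest confidence here)
  conflicting: boolean;             // true if distinct values disagree
  candidates: FieldExtractionRow[];
}

function resolveFields(rows: FieldExtractionRow[]): Map<string, ResolvedField> {
  // Group all extractions by field path.
  const byPath = new Map<string, FieldExtractionRow[]>();
  for (const row of rows) {
    const group = byPath.get(row.fieldPath) ?? [];
    group.push(row);
    byPath.set(row.fieldPath, group);
  }

  // Pick a provisional winner per field and flag disagreements for review.
  const result = new Map<string, ResolvedField>();
  for (const [path, group] of byPath) {
    const distinctValues = new Set(group.map((r) => JSON.stringify(r.value)));
    const best = [...group].sort((a, b) => b.confidence - a.confidence)[0];
    result.set(path, {
      value: best.value,
      conflicting: distinctValues.size > 1,
      candidates: group,
    });
  }
  return result;
}
```

The `conflicting` flag is what would drive the manual-resolution UI from the example scenario above.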
Option 2: Document-first storage
CREATE TABLE document_extractions (
id UUID PRIMARY KEY,
application_id UUID,
document_id UUID,
extracted_data JSONB, -- Full extraction result per document
schema_version VARCHAR
);
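Here each document's extraction stays intact, so the field-level view has to be derived at read time by flattening the per-document JSONB. A sketch of that flattening step (the row type mirrors the table above; arrays are treated as leaf values for simplicity):

```typescript
// One row per document, mirroring document_extractions.
interface DocumentExtractionRow {
  documentId: string;
  extractedData: Record<string, unknown>; // full extraction result per document
  schemaVersion: string;
}

interface FlattenedField {
  fieldPath: string;        // e.g. "beneficiary.passportNumber"
  value: unknown;
  sourceDocumentId: string;
}

// Flatten a nested extraction into (fieldPath, value, source) triples.
function flattenExtraction(row: DocumentExtractionRow): FlattenedField[] {
  const fields: FlattenedField[] = [];
  const walk = (obj: Record<string, unknown>, prefix: string): void => {
    for (const [key, value] of Object.entries(obj)) {
      const path = prefix ? `${prefix}.${key}` : key;
      if (value !== null && typeof value === "object" && !Array.isArray(value)) {
        walk(value as Record<string, unknown>, path);
      } else {
        fields.push({ fieldPath: path, value, sourceDocumentId: row.documentId });
      }
    }
  };
  walk(row.extractedData, "");
  return fields;
}
```

The flattened triples could then feed the same kind of conflict resolution as in Option 1, at the cost of flattening on every read (or materializing the result).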
Questions
- Which approach better handles audit trails and conflict resolution?
  - EAV-style field tracking vs. document-centric storage?
- How to handle schema evolution gracefully? (see the versioning sketch after this list)
  - Version schemas in application code or in the database?
  - Migrate old data, or maintain multiple schema versions?
- Performance considerations for reconstruction?
  - Building final objects from hundreds of field extractions
  - Query patterns for conflict detection
- Are there established patterns for this problem domain?
  - Document-processing pipelines with provenance tracking
  - Similar architectures in ETL/data-integration systems
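On schema evolution specifically, one direction I'm considering is keeping raw extractions tagged with `schema_version` and upgrading them in application code at read time. A minimal sketch, assuming stepwise migrations between versions (the version names and the renamed/added fields are hypothetical):

```typescript
type RawExtraction = Record<string, unknown>;

// Known schema versions, oldest to newest.
const versions = ["v1", "v2", "v3"] as const;

// Each entry upgrades a payload by exactly one version step.
const migrations: Record<string, (data: RawExtraction) => RawExtraction> = {
  // v1 -> v2: hypothetical field rename.
  v1: ({ passportNo, ...rest }) => ({ ...rest, passportNumber: passportNo }),
  // v2 -> v3: hypothetical new field with a default.
  v2: (data) => ({ countryOfBirth: null, ...data }),
};

function upgradeToLatest(data: RawExtraction, from: string): RawExtraction {
  const start = versions.indexOf(from as (typeof versions)[number]);
  if (start === -1) throw new Error(`Unknown schema version: ${from}`);

  // Apply each stepwise migration until the payload reaches the latest version.
  let current = data;
  for (let i = start; i < versions.length - 1; i++) {
    current = migrations[versions[i]](current);
  }
  return current;
}
```

Old rows would never be rewritten in place; they are upgraded on read (and could be backfilled later if the migration chain gets long).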
The system processes ~1000 applications/month with plans to scale significantly. Any architectural guidance would be greatly appreciated!