Automated Legal Document Validation at Scale

How an Ops Compliance team eliminated manual review of thousands of home-equity loan agreements — building an AI-powered pipeline that extracts over a hundred structured fields from each document, runs every ten minutes, and never processes the same loan twice.

Note: This case study describes a real production pipeline. Specific identifiers — table names, bucket names, webhook URLs, queries, and credentials — have been intentionally omitted so the design can be shared safely.

Executive Summary


Team	Ops Compliance
Workflow	Main Orchestrator for Legal Document Validation
Status	Active
Stack	n8n, data warehouse, object storage, multimodal LLM
Output	~100 structured fields extracted per loan agreement, written to a queryable warehouse table
Throughput	A few hundred agreements every ten minutes, across ten parallel workers

The Ops Compliance team had a backlog problem. Roughly a hundred structured fields needed to come out of every executed home-equity loan agreement — APRs, fee schedules, loan terms, maturity rules, billing-error timelines, and so on — so the data could be checked against what borrowers were actually offered. There were thousands of these agreements sitting in storage, and more arriving all the time.

Reading them by hand wasn’t really a workable option. Too slow, too inconsistent from one reviewer to the next, and no clean way to prove an audit had actually covered everything it claimed to. So the team built a pipeline in n8n instead. It runs on its own, pulls unprocessed agreements from a data warehouse and object storage, hands each PDF to a multimodal LLM with instructions to return a strict JSON record, and writes the result back into a warehouse table. That same table doubles as the pipeline’s memory of what it’s already done.

What came out of that is a loop that runs itself — duplicate-run protection built in, Slack alerts when something breaks, and a few hundred agreements processed every ten minutes without anyone needing to babysit it.

The Problem

Volume, Complexity, and Stakes

Home-equity loan agreements aren’t short. Each one is a dense legal PDF covering interest rates, fee structures, draw periods, repayment periods, billing-error procedures, and maturity terms, written in the kind of precise, cross-referential language that takes a trained reader real time to get through correctly. For the audit, the team needed roughly a hundred discrete data points out of each agreement, pulled the same way every time, across a backlog running into the thousands.

A few things made this harder than it sounds.

Volume was the first issue — thousands of agreements sitting in object storage, with new ones arriving constantly as loans hit “offered” status. Complexity was the bigger problem, though. Each agreement has around a hundred fields buried in legal prose, and the easy ones (interest rate, fee amounts, payment due dates) aren’t where the difficulty actually lives. Maturity logic is the field that causes the real trouble: it can show up as an explicit calendar date, a one-year renewable term, or a relative offset clause like “measured from sixty days after account opening” — three representations that aren’t interchangeable and don’t even always refer to the same thing.

Stakes mattered more than either of those. This data feeds a compliance audit, so a mis-extracted value is a corrupted finding. A hallucinated value, one the model invented because it looked plausible, is worse: it is a fabricated finding that could misrepresent what a borrower was actually offered, sitting in a regulatory record. Research into LLM hallucinations in financial services puts hallucination rates for financial AI applications somewhere between 3% and 8% on compliance-type queries, which in an audit context isn’t a number anyone can shrug off. Extraction had to be literal. Never inferred.

On top of all that, the pipeline had to run unattended around the clock, never double-process a loan, never collide with a second copy of itself, and fail loudly enough that nobody had to go hunting for a quiet bug three weeks later.

[Figure: architecture diagram]

The Architecture

The solution is three cooperating n8n workflows, each with a clearly defined responsibility:

Workflow	Role
Main Orchestrator	Schedules runs, queries candidates, fans work out to workers
Guard — Check Running Executions	Prevents overlapping runs
Batch Processor (×10)	Downloads, extracts via LLM, writes results

The Orchestrator

The orchestrator wakes up every ten minutes. Before it touches anything, it checks in with a Guard sub-workflow and asks whether a run is already in progress. If it finds more than one live execution, it throws a hard error and stops right there — it doesn’t queue itself up for later or quietly retry, it just halts. Manual runs get a pass on this check, on the theory that a person triggering a run on purpose is a different kind of risk than two scheduled jobs accidentally overlapping.
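The guard's decision can be sketched in a few lines. This is a minimal illustration, not the production sub-workflow: the execution shape (workflowId, mode, status) is hypothetical rather than taken from the n8n API, and it assumes the current run is itself counted among the live executions, which is why the threshold is "more than one".

```javascript
// Hypothetical execution record: { id, workflowId, mode, status }.
// These field names are assumptions made for illustration.
function shouldHalt(currentRun, executions) {
  // Manual runs get a pass: a person triggering a run on purpose always proceeds.
  if (currentRun.mode === "manual") return false;

  // Count live executions of the same workflow, including the current one.
  const live = executions.filter(
    (e) => e.workflowId === currentRun.workflowId && e.status === "running"
  );
  return live.length > 1;
}

function guard(currentRun, executions) {
  if (shouldHalt(currentRun, executions)) {
    // Hard stop: no queueing and no quiet retry.
    throw new Error("Overlapping run detected; halting this execution");
  }
  return true;
}
```

Throwing, rather than returning a flag, matches the "halt, don't retry" behavior described above: the failed execution shows up in the log instead of disappearing quietly.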

Once the guard clears, it runs two warehouse queries back to back. The first pulls every loan ID already processed under the current schema version straight from the output table — that becomes the “already done” list. The second query is doing the actual work: finding home-equity loans with a stored agreement document, status of “offered,” not flagged as internal or test accounts, inside the right regulatory action-year window, and not already on that done list. Results get ordered by most recent action date and capped at a few hundred per run.

Put those two queries together and you get a work list with no separate job table sitting behind it. Nothing to manage by hand, and nothing that can drift out of sync with reality, because the queue and the results are the same table.
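The two-query pattern can be illustrated in memory. The real pipeline does this in warehouse queries, which are intentionally omitted from this case study, so everything below is a sketch under assumptions: the field names (loanId, schemaVersion, documentKey, isInternalOrTest, actionDate) and the options object are hypothetical, and the product-type filter is left out for brevity.

```javascript
// In-memory illustration only; all field names are hypothetical.
function buildWorkList(loans, outputRows, opts) {
  const { schemaVersion, actionYears, cap } = opts;

  // "Query 1": loan IDs already processed under the current schema version.
  const done = new Set(
    outputRows
      .filter((row) => row.schemaVersion === schemaVersion)
      .map((row) => row.loanId)
  );

  // "Query 2": eligible loans that are not on the done list.
  return loans
    .filter(
      (loan) =>
        loan.status === "offered" &&
        Boolean(loan.documentKey) &&
        !loan.isInternalOrTest &&
        actionYears.includes(Number(loan.actionDate.slice(0, 4))) &&
        !done.has(loan.loanId)
    )
    // Most recent action date first (ISO dates sort correctly as strings).
    .sort((a, b) => (a.actionDate < b.actionDate ? 1 : a.actionDate > b.actionDate ? -1 : 0))
    .slice(0, cap);
}
```

Because the done list is keyed on schema version, calling this with a higher schemaVersion puts every previously processed loan back on the work list without touching any other state.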

Fan-Out: Ten Workers in Parallel

A few hundred documents is too much to get through one at a time inside a ten-minute window, so a code node splits the candidate list round-robin into ten roughly equal groups (sizes differ by at most one document). Ten HTTP request nodes each post one group off to a dedicated webhook on a waiting batch processor, with generous timeouts attached. At the end, a merge node acts as a barrier and waits for all ten to report back before calling the run complete, which gives each cycle a clean, logged start and finish. It’s a fairly standard fan-out/fan-in pattern for n8n: run several independent things in parallel without losing track of any of them.
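A rough sketch of the split and the barrier, written as plain JavaScript rather than as n8n nodes. The function names and the injected postGroup callback are hypothetical; in the actual workflow the posting is done by HTTP request nodes and the barrier by a merge node.

```javascript
// Round-robin split: item i goes to group i % groupCount.
function splitRoundRobin(items, groupCount) {
  const groups = Array.from({ length: groupCount }, () => []);
  items.forEach((item, i) => groups[i % groupCount].push(item));
  return groups;
}

// Fan-out / fan-in: post every group, then wait for all of them to settle.
// postGroup(workerIndex, group) is a hypothetical async function standing in
// for the HTTP request to one worker's webhook.
async function fanOut(groups, postGroup) {
  const results = await Promise.allSettled(
    groups.map((group, i) => postGroup(i, group))
  );
  // One entry per worker, so a failed worker is visible but does not
  // prevent the others from being reported.
  return results.map((r, i) => ({ worker: i, ok: r.status === "fulfilled" }));
}
```

Promise.allSettled is used here instead of Promise.all so the barrier waits for every worker even when one of them fails, which is the closest plain-code analogue to a merge node that waits for all ten branches.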

The Worker Loop

Every one of the ten batch processors runs the identical loop, working through its assigned documents one at a time:

1. Receive the batch via webhook and parse the items
2. Run its own duplicate guard (independent of the orchestrator’s check)
3. For each loan in the batch:
- Download the agreement PDF from object storage using the document key from the loan record
- Extract the loan ID from the filename
- Send the PDF to the LLM with the extraction prompt
- Parse the model’s JSON response (extract the {...} block, run JSON.parse())
- On parse failure → notify Slack, move to next document
- Insert the structured row into the warehouse output table
- On insert failure → notify Slack, move to next document
- Wait briefly before the next document (rate-limit cushion)

One bad document doesn’t take the whole batch down. The worker just logs it to Slack and moves on. A malformed LLM response or a flaky database write isn’t allowed to stop everything else behind it.
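The isolation property can be sketched as a loop with a per-document try/catch. This is a simplified, synchronous illustration: in n8n each step is a separate node doing real I/O, and the step functions here (download, extract, parse, insert, notify) are hypothetical callbacks passed in by the caller. The sketch also reports a failure at any stage, whereas the source only describes Slack alerts for parse and insert failures, and it leaves out the wait between documents.

```javascript
// steps = { download, extract, parse, insert, notify }, all hypothetical.
function processBatchSketch(items, steps) {
  const summary = { inserted: [], failed: [] };

  for (const item of items) {
    let stage = "download";
    try {
      const pdf = steps.download(item.documentKey);
      stage = "extract";
      const reply = steps.extract(pdf);
      stage = "parse";
      const record = steps.parse(reply);
      stage = "insert";
      steps.insert({ loanId: item.loanId, ...record });
      summary.inserted.push(item.loanId);
    } catch (err) {
      // Report the failure and move on; the rest of the batch still runs.
      steps.notify(`${stage} failed for loan ${item.loanId}: ${err.message}`);
      summary.failed.push({ loanId: item.loanId, stage });
    }
  }
  return summary;
}
```

Tracking the current stage in a variable keeps the alert specific enough to act on without needing a separate handler for each step.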

The Extraction Prompt: Where the Real Work Happens

If you stripped this system down to one artifact worth protecting, it wouldn’t be the n8n canvas. It would be the prompt. Five things about it are worth walking through.

1. A Strict No-Inference Contract

The prompt opens with a rule the model isn’t allowed to break:

“Do NOT infer or guess. If a field’s information is not explicitly stated in the document, return an empty string.”

This is probably the single most important line in the whole system. A model that fills in reasonable-looking values when the document is ambiguous isn’t being helpful — it’s introducing risk. EY’s research on managing hallucination risk in LLM deployments makes a similar point: an LLM that fabricates a regulatory detail or misstates a compliance term can expose an organization to real legal and reputational risk, not just a data-quality headache. An honest blank beats a confident wrong answer every time in this context.

2. Hard JSON-Safety Rules

Downstream, a code node runs a strict JSON.parse() on whatever the model sends back. One malformed character — an unescaped quote, a stray backslash, a literal newline sitting inside a string — and the whole document’s extraction fails, no partial credit given. So the prompt is explicit about it: escape internal quotes, avoid control characters, double-check that every bracket and quote mark actually closes before answering. None of this is theoretical. It’s the kind of instruction you only write after watching a batch fail because of one stray character.
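A minimal sketch of what such a parse step might look like, assuming the model's reply is a single string that may carry extra text around the JSON object. The function name and the return shape are illustrative and not taken from the production code node.

```javascript
// Extract the outermost {...} block from the reply and parse it strictly.
// Returns { ok: true, record } or { ok: false, error } instead of throwing,
// so the caller can alert and move on to the next document.
function parseModelReply(reply) {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return { ok: false, error: "no JSON object found in reply" };
  }
  try {
    return { ok: true, record: JSON.parse(reply.slice(start, end + 1)) };
  } catch (err) {
    // JSON.parse rejects unescaped quotes, stray backslashes, and raw
    // control characters (such as a literal newline) inside strings.
    return { ok: false, error: err.message };
  }
}
```

There is deliberately no repair attempt: a reply that fails to parse is treated as a failed extraction rather than patched into something that merely looks valid.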

3. Formatting Normalization at the Source

Rather than cleaning up messy formatting after the fact, the prompt pushes normalization into the extraction step itself. Percentages come back as 5.99%, currency as $500, and every date — no matter how it’s written in the source — gets coerced into MM/DD/YYYY. By the time anything lands in the warehouse it’s already clean and queryable, with no separate transformation step needed afterward.
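Since the formats are fixed, they can also be checked cheaply after extraction. The sketch below is an optional safeguard that the source does not describe; the patterns are assumptions built from the three examples above (5.99%, $500, MM/DD/YYYY), including the guess that currency may carry thousands separators and cents.

```javascript
// Assumed patterns, inferred from the example formats only.
const FORMAT_CHECKS = {
  percent: /^\d+(\.\d+)?%$/,
  currency: /^\$\d+(,\d{3})*(\.\d{2})?$/,
  date: /^(0[1-9]|1[0-2])\/(0[1-9]|[12]\d|3[01])\/\d{4}$/,
};

// kind is one of "percent", "currency", "date".
function matchesFormat(kind, value) {
  // An empty string is always acceptable: it is the no-inference blank.
  if (value === "") return true;
  return FORMAT_CHECKS[kind].test(value);
}
```

The date pattern only checks the shape of the string; it would still accept an impossible date such as 02/31/2045, so it complements the prompt rule rather than replacing it.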

4. The Maturity Logic Waterfall

This is the most complicated piece of prompt engineering in the system, and honestly it earns the complexity. Loan maturity can show up several different ways in a home-equity agreement, so the prompt encodes an ordered decision tree to sort through them:

Step 1 — Explicit maturity date. If the document states a calendar date (or something directly computable, like “twenty years after the Account Opening Date”) and labels it as the maturity or loan-end date, that’s extracted directly.

Step 2 — One-year renewable fallback. No explicit date? Check for a renewable-term pattern instead — language saying the loan renews automatically each year unless cancelled. If that’s there, maturity gets derived from the renewal pattern.

Step 3 — Relative-date clause disambiguation. If there’s a relative-date clause sitting in the text (“sixty days after account opening,” say), the model has to figure out whether it’s actually talking about maturity or about something else entirely, most often the draw period. Two worked examples make the distinction concrete:

Positive: “This Agreement shall mature on the Maturity Date, which is the date twenty (20) years after the Account Opening Date.” The clause is clearly modifying maturity. Extract it.
Negative: “The Draw Period shall end sixty (60) days after the Account Opening Date, unless earlier terminated.” That same kind of phrase, but modifying the draw period instead. Maturity field stays blank.

This decision tree is basically business policy written as prompt logic, and it lives in the prompt rather than a separate rules engine because the context needed to resolve it is linguistic — you can’t regex your way out of “which noun does this clause attach to.”
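The order of precedence itself is simple enough to write down, even though the hard part is not. The sketch below only shows the ordering of the three steps. It assumes the linguistic work has already been done, with each relative-date clause labeled by what it modifies, which in the real system happens inside the model and not in code. The findings shape and its field names are hypothetical.

```javascript
// findings = {
//   explicitMaturity: string | null,   // Step 1 evidence
//   renewableTerm: string | null,      // Step 2 evidence
//   relativeClauses: [{ text, modifies }]  // modifies: "maturity", "draw_period", ...
// }
function resolveMaturity(findings) {
  // Step 1: an explicit (or directly computable) maturity date wins.
  if (findings.explicitMaturity) {
    return { step: 1, value: findings.explicitMaturity };
  }
  // Step 2: otherwise fall back to a one-year renewable term.
  if (findings.renewableTerm) {
    return { step: 2, value: findings.renewableTerm };
  }
  // Step 3: a relative-date clause counts only if it modifies maturity.
  const clause = (findings.relativeClauses || []).find(
    (c) => c.modifies === "maturity"
  );
  if (clause) return { step: 3, value: clause.text };

  // Nothing qualifies: the maturity field stays blank.
  return { step: null, value: "" };
}
```

With the negative example above, a clause labeled as modifying the draw period falls through all three steps and the function returns an empty value, matching the no-inference rule.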

5. A Fixed, Closed Output Schema

The prompt names roughly a hundred exact keys the model is allowed to return and says, in effect, use only these. Those keys map one-to-one onto the columns of the warehouse insert. If the model ever decided to rename a field, or add a helpful extra one, the insert downstream would break, or a value would go missing without anyone noticing, so the schema gets treated as a closed contract, not a loose suggestion.
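A closed schema is easy to enforce in code before the insert. The following is a small sketch of such a check, offered as one possible safeguard and not as a description of the production workflow; the function name and the key names in the usage comment are hypothetical.

```javascript
// Compare a parsed record's keys against the fixed list of allowed keys.
function checkClosedSchema(record, allowedKeys) {
  const allowed = new Set(allowedKeys);
  const present = new Set(Object.keys(record));
  return {
    // Keys the model invented or renamed.
    unexpected: [...present].filter((k) => !allowed.has(k)),
    // Keys the model dropped entirely (a blank value is fine, a missing key is not).
    missing: allowedKeys.filter((k) => !present.has(k)),
  };
}

// Usage with hypothetical key names:
//   const { unexpected, missing } = checkClosedSchema(record, ["apr", "annual_fee"]);
//   if (unexpected.length || missing.length) { /* alert and skip the insert */ }
```

Distinguishing a missing key from an empty string matters here: the prompt asks for an empty string when the document is silent, so an absent key signals a schema problem and not an honest blank.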

Reliability and Safety Features

Concern	How It’s Handled
Overlapping scheduled runs	Guard sub-workflow checks live executions; more than one active run → immediate hard stop
Worker-level overlap	Each batch processor carries its own independent duplicate guard
Re-processing already-done loans	Output table is read back each cycle; candidate query explicitly excludes completed IDs
Bad model output	JSON-safety prompt rules + try/catch parse + Slack alert on parse failure
Database errors	Post-insert status check + Slack alert
Rate limits	Work split into 10 parallel batches; per-item wait between documents
Manual runs	Guard never suppresses manual executions — human-triggered runs always proceed

Results and Impact

The throughput numbers are straightforward: a few hundred agreements every ten minutes, spread across ten parallel workers — a pace that would take a meaningful number of human reviewers to match, and they still wouldn’t match it on consistency.

Consistency is really where this pays off, though. Every agreement gets read against the same rubric of roughly a hundred fields. There’s no drift in interpretation between one reviewer and the next, no fatigue setting in by document two hundred, no question later about what was actually checked versus skipped.

Auditability was the other non-negotiable, and it’s handled by storing the raw source sentence alongside every extracted value. An auditor — or a regulator, if it comes to that — can take any single number sitting in the warehouse and trace it back to the exact line in the original agreement that produced it.

And the system is mostly self-maintaining. It finds its own remaining work by reading its own output, so there’s no dashboard to check for “what’s left,” and no manual intervention needed after a crash. It just picks back up where the output table leaves off.

Notable Engineering Decisions

Webhook fan-out over in-process sub-workflow calls. Triggering the ten batch processors over HTTP webhooks, rather than as in-process sub-workflows, decouples the orchestrator from worker execution. Each batch runs as its own independent execution with its own timeout and its own log trail, so a failure can be isolated and debugged inside one worker without touching the orchestrator’s own run.

Output table as queue state. Instead of keeping a separate job or queue table next to the results table, the pipeline just treats its destination as the single source of truth for what’s done. That removes a whole category of synchronization bugs, since the queue and the results literally can’t disagree — they’re the same table. This is idempotent design in practice: run the operation as many times as you want, and you always land on the same correct outcome.

Schema versioning. There’s a schema-version column on the output table, and it’s a small thing that pays off disproportionately. Bump the version, and the pipeline treats every previously-processed loan as unprocessed under the new schema, working back through the backlog automatically — no manual migration, nothing lost from before.

Defensive prompt engineering. Nothing in this system trusts the LLM further than it can verify. Every output has to survive a strict parse before it’s allowed anywhere near the production table, and every insert gets a status check afterward, with Slack wired into both checkpoints. The intelligence doing the reading is real. The trust placed in it is not automatic.

Risks and Possible Improvements

No design like this is free of trade-offs, and a few are worth flagging honestly.

The static per-run cap assumes a fairly steady backlog. If intake volume spikes, the pipeline doesn’t have a built-in way to notice it’s falling behind — it just keeps processing its fixed few hundred per cycle while the backlog quietly grows underneath it. A dynamic cap tied to backlog depth, or even just a backlog-size metric somewhere visible, would catch that early.
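The backlog metric suggested here comes down to simple arithmetic. The sketch below is a hypothetical helper, not something the pipeline currently has, and the numbers in the usage note are invented for illustration.

```javascript
// Estimate whether a fixed per-cycle cap is keeping up with intake.
// All four inputs are assumed to be measured elsewhere.
function backlogOutlook({ backlog, arrivalsPerCycle, capPerCycle, cycleMinutes }) {
  // Documents cleared per cycle after accounting for new arrivals.
  const netDrain = capPerCycle - arrivalsPerCycle;
  if (netDrain <= 0) {
    // Intake matches or exceeds the cap: the backlog never shrinks.
    return { draining: false, minutesToDrain: Infinity };
  }
  return {
    draining: true,
    minutesToDrain: Math.ceil(backlog / netDrain) * cycleMinutes,
  };
}
```

For example, with an invented backlog of 3,000 documents, 50 arrivals per cycle, a cap of 300, and a ten-minute cycle, the net drain is 250 per cycle, so the backlog clears in 12 cycles, or 120 minutes. Alerting when draining turns false would surface an intake spike long before anyone noticed it by eye.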

Round-robin batching also doesn’t account for document size. A batch that happens to land several unusually large PDFs could brush up against the ten-minute timeout while the other nine workers finish with time to spare. Size-aware batching would spread that load more evenly.
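One way size-aware batching could work is a greedy "largest first, lightest group next" assignment. This is a sketch of that general technique, not something the pipeline implements today, and it assumes each document's size in bytes (sizeBytes, a hypothetical field) is known before the split.

```javascript
// Greedy balancing: sort documents from largest to smallest, then always
// hand the next document to the group with the smallest running total.
function splitBySize(docs, groupCount) {
  const groups = Array.from({ length: groupCount }, () => ({
    totalBytes: 0,
    docs: [],
  }));
  const sorted = [...docs].sort((a, b) => b.sizeBytes - a.sizeBytes);

  for (const doc of sorted) {
    let lightest = groups[0];
    for (const g of groups) {
      if (g.totalBytes < lightest.totalBytes) lightest = g;
    }
    lightest.docs.push(doc);
    lightest.totalBytes += doc.sizeBytes;
  }
  return groups;
}
```

The greedy approach does not guarantee a perfectly even split, but it keeps any single group from collecting all of the largest files, which is the failure mode described above. File size is also only a proxy for processing time, so page count might turn out to be a better weight.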

The harder one to catch is silent under-extraction. Hard failures — a parse error, a failed database write — trigger Slack immediately. But a field that quietly comes back empty when it shouldn’t have doesn’t trip anything at all. As research on enterprise AI risk points out, outputs that are wrong but still structurally valid tend to be more dangerous than outputs that fail loudly, precisely because nothing announces that they happened. A field-level completeness check, flagging unusual empty-rate spikes across a run, would close that gap.
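A field-level completeness check of the kind proposed here might look like the following. It is a sketch under assumptions: the baseline empty rates would have to come from earlier runs that were reviewed and trusted, and the tolerance is an arbitrary threshold chosen for illustration.

```javascript
// Fraction of records in a run where each field came back empty.
function emptyRates(records, fields) {
  const rates = {};
  for (const field of fields) {
    const empty = records.filter(
      (r) => r[field] === "" || r[field] === undefined
    ).length;
    rates[field] = records.length ? empty / records.length : 0;
  }
  return rates;
}

// Fields whose empty rate exceeds the baseline by more than the tolerance.
// baseline maps field name to an expected empty rate; unknown fields default to 0.
function flagSpikes(rates, baseline, tolerance) {
  return Object.keys(rates).filter(
    (field) => rates[field] - (baseline[field] ?? 0) > tolerance
  );
}
```

Comparing against a per-field baseline matters because some fields are legitimately blank most of the time; a flat threshold would either drown the channel in alerts or miss a field that is normally always populated.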

And the maturity waterfall — along with every other business rule — currently lives inside each of the ten workers’ prompts independently. Change the policy and you’re editing ten places, with real risk of them drifting apart from each other over time unless they’re pulled from one shared source.

Still Reviewing Documents by Hand?

Every week spent manually reading loan agreements, lease documents, or title commitments is a week of compounding inconsistency, and a compliance risk that grows a little with each new document added to the pile.

The pipeline in this case study processes hundreds of agreements every ten minutes, applies the same rubric every time, and keeps the source evidence behind every value it extracts. It runs in the background while your team spends its time on the parts of the job that actually need a person.

If you’ve got the backlog, we can build the pipeline.

→ Get a free consultation with LnP Infotech’s engineering team

No commitment. Just a conversation about whether automation is the right fit for your document workflow.