How We Built a Multilingual Invoice Parser That Can Read Handwritten Receipts
AI Automation
Computer Vision
Document Processing
Travel Tech

Introduction
Invoice automation sounds simple until real-world documents show up. Travel and expense teams do not receive clean, uniform PDFs. They receive German hotel invoices, Indian tax invoices, taxi slips, booking confirmations, photographed receipts, and handwritten notes captured from mobile phones. We built a multilingual invoice parser to handle that exact mess.
The goal was not generic OCR. The goal was structured extraction that a production travel workflow could actually use: invoice date, vendor, amount, currency, VAT, hotel stay dates, and meal flags, across multiple document formats and languages.
The Problem With Traditional OCR Pipelines
Classic OCR pipelines work well on high-quality printed text, but they break down quickly when receipts are photographed at an angle, handwriting appears on the document, tax fields are split across multiple blocks, or the invoice language changes. In travel expense workflows, those edge cases are not rare. They are the norm.
- Printed OCR struggled with handwritten taxi receipts.
- Different invoice layouts produced inconsistent field positions.
- German tax labels such as MwSt and USt added variation.
- Booking confirmations and payment slips were easy to misclassify as invoices.
That is why we moved from an OCR-first architecture to a vision-language model architecture.
Why We Switched to a Vision Model
We needed a model that could reason over the whole page, not just recognized text fragments. A vision model can interpret layout, understand relationships between labels and values, and recover more signal from noisy or handwritten inputs. That matters when a VAT line is handwritten, when a total appears in a different block than the service description, or when the same page mixes hotel stay data with payment information.
Instead of asking the model to summarize the page, we forced it to return a narrow structured schema. That kept the output useful for the product and made testing practical.
What the Parser Needed to Extract
The parser was designed for TravelPro expense ingestion, so the output had to match operational fields rather than produce a free-form description. The core fields were:
- Invoice date and invoice time when explicitly present.
- Payment date only when the document clearly showed one.
- Total amount and currency.
- Vendor name.
- VAT amount and VAT percentage.
- Hotel stay start and end dates.
- Meal inclusion flags for breakfast, lunch, and dinner.
- Whether the document was actually a hotel receipt.
This forced us to solve document understanding, not just text recognition.
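To make that schema concrete, here is a minimal sketch of the output structure as a Python dataclass. The field names are illustrative, not the exact production schema; the key design point is that every field is optional and defaults to null or false, so the model is never forced to invent a value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InvoiceExtraction:
    # Core identification
    invoice_date: Optional[str] = None    # ISO 8601, e.g. "2024-03-15"
    invoice_time: Optional[str] = None    # only when explicitly printed
    payment_date: Optional[str] = None    # only when the document shows one
    vendor_name: Optional[str] = None
    # Amounts
    total_amount: Optional[float] = None
    currency: Optional[str] = None        # ISO 4217, e.g. "EUR"
    vat_amount: Optional[float] = None
    vat_percentage: Optional[float] = None
    # Hotel-specific fields
    stay_start_date: Optional[str] = None
    stay_end_date: Optional[str] = None
    breakfast_included: bool = False
    lunch_included: bool = False
    dinner_included: bool = False
    is_hotel_receipt: bool = False
```

Keeping the schema this narrow is what made automated testing practical: each field can be checked against a labeled ground truth without any free-text comparison.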
Architecture Overview
The final system was built as an asynchronous parsing service that receives an uploaded invoice image or PDF, normalizes the input, forwards it to a hosted vision model, and returns structured JSON back into the product workflow.
- Input Layer: image and PDF invoice uploads from the product.
- Preprocessing: page rendering, image resizing, and file normalization.
- Vision Inference: a hosted vision-language model tuned through prompt and schema constraints.
- Post-Processing: field validation, null handling, and invoice-vs-non-invoice guardrails.
- Delivery: asynchronous status polling and structured expense updates back to the application.
We deployed the inference layer on RunPod so we could test heavier vision models without tying the application directly to local hardware limits.
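The pipeline stages above can be sketched as a small async flow. This is a simplified illustration, not the production service: the preprocessing, inference, and validation steps are stubbed out, and in the real system the inference call would POST to the hosted model endpoint and the result would be surfaced through status polling.

```python
import asyncio
import json
from typing import Any

async def render_and_normalize(upload: bytes) -> bytes:
    """Stub: render PDF pages and resize images to the model's input size."""
    return upload

async def run_vision_inference(image: bytes) -> str:
    """Stub: call the hosted vision-language model with the prompt and
    schema constraints. Production code would call the RunPod endpoint."""
    return json.dumps({"vendor_name": "Hotel Example", "currency": "EUR",
                       "payment_date": None})

def validate_fields(raw: dict[str, Any]) -> dict[str, Any]:
    """Stub: null handling and invoice-vs-non-invoice guardrails."""
    return {k: v for k, v in raw.items() if v is not None}

async def parse_invoice(upload: bytes) -> dict[str, Any]:
    """End-to-end flow: normalize input, run inference, validate output."""
    normalized = await render_and_normalize(upload)
    model_output = await run_vision_inference(normalized)
    return validate_fields(json.loads(model_output))
```

Because every stage is async, the application can accept an upload, return a job ID immediately, and let the client poll for the structured result instead of blocking on model inference.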
Model Selection: Speed vs Accuracy
We tested multiple local and hosted models because invoice parsing quality did not line up neatly with model size alone. Some models were faster but too eager to invent fields like payment dates. Others performed better on handwritten receipts but had longer inference times.
Two patterns became clear:
- Smaller vision models were faster but more likely to confuse due dates with payment dates.
- Stronger document-oriented models produced better handwritten VAT and vendor extraction.
That testing loop mattered. We did not pick a model family based on marketing claims. We picked the model based on the actual invoices we needed to parse.
Handling Multiple Languages
Multilingual parsing was not just about supporting translated text. It required teaching the system that different invoice conventions still map into the same business schema. German invoices, for example, often expose tax and billing terms differently from English-language invoices. Indian hotel invoices carry another pattern entirely.
We addressed this by keeping the output schema language-independent and pushing the model prompt toward semantic interpretation rather than keyword matching. The model was instructed to find the real invoice fields even when labels changed across locales.
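Locale differences show up in values, not just labels. As a hedged illustration of the kind of normalization this requires, here is a small helper for localized amount strings: German invoices write "1.234,56" where English and Indian formats write "1,234.56". This helper is an example of the problem space, not the actual production code.

```python
def normalize_amount(raw: str) -> float:
    """Parse a localized amount string into a float.

    German convention: '.' as thousands separator, ',' as decimal mark.
    English/Indian convention: the reverse. We treat whichever separator
    appears last as the decimal mark.
    """
    s = raw.strip().replace(" ", "")
    if "," in s and s.rfind(",") > s.rfind("."):
        # Comma is the decimal mark (German style)
        s = s.replace(".", "").replace(",", ".")
    else:
        # Period is the decimal mark; commas are thousands separators
        s = s.replace(",", "")
    return float(s)
```

The vision model was responsible for finding the right value on the page; normalization like this kept the schema's numeric fields consistent regardless of the source locale.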
Why Handwritten Receipts Were the Real Test
The hardest document in our test set was a handwritten taxi invoice. That single receipt told us more than a dozen clean PDFs. It forced the model to read messy handwriting, separate vendor information from the amount, identify whether VAT existed, and avoid incorrectly classifying the page as a hotel receipt.
If the parser could not handle that document, it was not ready for production expense ingestion. Handwriting became the benchmark that separated a demo pipeline from a usable system.
Prompt Design and Guardrails
Prompting mattered as much as model choice. We tightened instructions repeatedly to stop the model from fabricating fields that were not explicitly present.
- Do not treat due dates as payment dates unless payment is explicitly stated.
- Do not infer meals unless breakfast, lunch, or dinner is written on the document.
- Prefer the actual merchant or service provider over intermediaries.
- Return null when the field is unclear instead of guessing.
- Reject non-invoice documents such as booking confirmations and payment slips where possible.
That discipline improved production usefulness more than broad prompt verbosity ever did.
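Guardrails like these live partly in the prompt and partly in post-processing. The sketch below shows how the post-processing side might look; the field names and heuristics are illustrative, assuming the model also reports whether a payment was explicitly stated and the page text is available for cross-checking.

```python
from typing import Any, Optional

# Hypothetical phrases used to reject non-invoice documents
NON_INVOICE_HINTS = ("booking confirmation", "payment slip", "reservation")

def apply_guardrails(fields: dict[str, Any],
                     page_text: str) -> Optional[dict[str, Any]]:
    """Post-process model output: drop fabricated fields, reject non-invoices.

    Returns None when the document should not be treated as an invoice.
    """
    lowered = page_text.lower()
    # Reject documents that look like confirmations rather than invoices
    if any(hint in lowered for hint in NON_INVOICE_HINTS):
        return None
    # A due date must never masquerade as a payment date
    if fields.get("payment_date") and not fields.get("payment_explicit"):
        fields["payment_date"] = None
    # Meal flags require an explicit mention on the page
    for meal in ("breakfast", "lunch", "dinner"):
        key = f"{meal}_included"
        if fields.get(key) and meal not in lowered:
            fields[key] = False
    return fields
```

The pattern is consistent with the prompt rules above: when the document does not explicitly support a value, the safe output is null or false, never a guess.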
What We Learned From Production-Oriented Testing
Three lessons stood out from the implementation:
- Handwritten support changes the evaluation criteria. A parser that works on clean hotel invoices is not automatically ready for receipts in the field.
- Vision context beats OCR fragments on messy documents. Layout understanding made a measurable difference.
- Field discipline matters. Returning null is often better than returning a wrong business value.
We also learned that throughput, warm-start behavior, and infrastructure choices matter once the model is good enough. A working parser still needs to fit into a real asynchronous product flow.
Business Outcome
The result is a multilingual invoice parser that can extract structured data from printed invoices, photographed receipts, and handwritten travel expenses with far better resilience than a traditional OCR-only flow. For TravelPro, that means less manual expense processing, better downstream automation, and more realistic coverage of the documents teams actually upload.
This is the kind of AI implementation that delivers value: narrow schema, measurable output, real document constraints, and a deployment path designed around production use rather than only laboratory demos.
Final Thoughts
If your workflow depends on invoices, receipts, or expense documents across languages and formats, the model decision should be driven by your real inputs, especially the ugly ones. Clean PDFs do not tell the whole story. Handwritten receipts do.
At Craftnotion, we design and deploy AI systems like this around actual business workflows, from document understanding and structured extraction to application integration and production rollout.
Need a document parser or vision workflow for your product? Talk to Craftnotion and we can build the right architecture for it.
