Invoice parsing has long been one of the most challenging tasks for the enterprise AI community. Even with tools like Tesseract OCR, parsing an invoice remains complicated because every vendor uses a different format. But what if we could harness the power of a multimodal AI system like Google’s Gemini Vision model to process invoices? Well, that’s what we are going to do in this article.
This prototype was built as part of Skcript’s S1 AI Suite, where you can get pre-built AI solutions for your enterprise and run them at scale in days.
Before you jump into this article, remember that what we are showing here is just the beginning. We are implementing complex invoice processing at near-real-time speeds for enterprise AI use cases, and if you are looking for a pre-built AI model that is specific to invoice processing at scale, reach out to us.
What is Google’s Gemini Vision?
Gemini Vision is a multimodal model that accepts both text and images as input and returns text as output. It is a pre-trained model that can be used for various use cases such as invoice processing and document processing. It is available through Google’s MakerSuite (AI Studio) and can be used for free.
What are prompts?
These are custom instructions that guide Gemini Vision towards specific data points within an invoice. Just like pointing on a map, prompts direct the AI to find the vendor name, due date, or any other desired information. Think of them as personalized cheat codes for unlocking precise data extraction.
Contents of an Invoice
Even though invoices come in different formats and designs, they share common parameters that we usually want to extract and then post to other systems or record in databases. The common ones are as follows:
- Invoice Number
- Invoice Date
- Due Date
- Vendor Name
- Vendor Address
- Vendor Phone Number
- Vendor Email
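As a rough sketch, the parameters listed above could be held in a small Python data structure before they are stored or posted anywhere. The class name `InvoiceFields`, the snake_case field names, and the sample values are illustrative assumptions, not part of any Google API.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class InvoiceFields:
    """Common invoice parameters; each is optional because a given
    invoice may simply not contain it."""
    invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None  # kept as text; date formats vary by vendor
    due_date: Optional[str] = None
    vendor_name: Optional[str] = None
    vendor_address: Optional[str] = None
    vendor_phone_number: Optional[str] = None
    vendor_email: Optional[str] = None


# Made-up values, for illustration only.
example = InvoiceFields(invoice_number="INV-001", vendor_name="Acme Corp")

# A plain dict is convenient for writing to a database or posting as JSON.
record = asdict(example)
```

Keeping the dates as plain strings avoids guessing a vendor's date format at extraction time; they can be normalized later in the pipeline.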
For this article I will use the following invoice as an example:
So our goal is to extract these parameters from the invoice and store them in a database or post them to other systems. But before that, we need to understand how to use Google’s Gemini Vision model to extract them.
Talking to the model
Because Gemini Vision is multimodal, we give it both the invoice image and a text prompt describing what we want back. We can simply use the playground to test the model and see how it works.
- Navigate to Google’s AI Studio
- Create an account and navigate to https://makersuite.google.com/app/prompts/new_freeform
- Then choose the model as Gemini Pro Vision
- Insert your invoice image in the playground and add a prompt; in this case we’ll use the prompts below
As you can see, I’m telling the model exactly what I want from the invoice and how I want it to be structured. This is the power of prompts.
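To make the idea concrete, here is one way a prompt of this kind could be assembled from the list of invoice parameters. The wording, the JSON output format, and the `build_prompt` helper are assumptions for illustration; they are not the exact prompt used in the playground.

```python
# Field labels taken from the "Contents of an Invoice" list.
FIELDS = [
    "Invoice Number",
    "Invoice Date",
    "Due Date",
    "Vendor Name",
    "Vendor Address",
    "Vendor Phone Number",
    "Vendor Email",
]


def build_prompt(fields):
    """Return a prompt that names each field and asks for a JSON reply."""
    # Turn "Invoice Number" into a machine-friendly key like "invoice_number".
    keys = [label.lower().replace(" ", "_") for label in fields]
    lines = [
        "You are given an image of an invoice.",
        "Extract the following fields and reply with a single JSON object only.",
        "Use exactly these keys, and null for any field that is not present:",
    ]
    lines += [f"- {key} ({label})" for key, label in zip(keys, fields)]
    return "\n".join(lines)


prompt = build_prompt(FIELDS)
print(prompt)
```

Stating both the fields and the output structure in the prompt is what lets the reply be consumed by code later, rather than read by a person.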
Now when you run the model, you’ll get the output as follows:
And that’s how simple it is to use the model. You don’t need to train the model or anything; just feed in the prompts and the model will do the rest.
How do we do it in our systems instead of the playground? Well, Google has provided a simple API to do the same.
Building an API out of it
Developing an API around Gemini Vision unlocks its full potential. Developers can leverage Google’s pre-built functionalities or tailor the API to their specific needs. This empowers integration with diverse systems, streamlining invoice processing and unlocking valuable data insights.
- In Google’s studio, click on the “Get API Key” button
- Copy the API key and use it in the below code
- Create a simple Python project and install the below packages
- Make sure you place your invoice image in the same folder as the Python file under the name invoice.png
- Create a new index.py file and paste the below code
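A minimal sketch of what index.py could look like, assuming the `google-generativeai` and `Pillow` packages are installed (for example with `pip install google-generativeai pillow`). The prompt wording, the `GOOGLE_API_KEY` environment variable, and the `extract_invoice` helper are illustrative assumptions; the model name is the Gemini Pro Vision model chosen in the playground and may be named differently in current releases.

```python
import os

# Assumption: the key is read from an environment variable if set;
# otherwise replace the placeholder with the key copied from the studio.
API_KEY = os.environ.get("GOOGLE_API_KEY", "YOUR_API_KEY")
IMAGE_PATH = "invoice.png"
MODEL_NAME = "gemini-pro-vision"  # assumed identifier for Gemini Pro Vision

# Illustrative prompt; adjust the fields and structure to your needs.
PROMPT = (
    "Extract the invoice number, invoice date, due date, vendor name, "
    "vendor address, vendor phone number and vendor email from this invoice. "
    "Reply with a single JSON object only."
)


def extract_invoice(image_path=IMAGE_PATH, api_key=API_KEY, prompt=PROMPT):
    """Send the prompt and the invoice image to the model, return its text reply."""
    # Imported here so the file still loads before the packages are installed.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel(MODEL_NAME)
    image = Image.open(image_path)
    # The model accepts a list mixing text and image parts.
    response = model.generate_content([prompt, image])
    return response.text


if __name__ == "__main__":
    if os.path.exists(IMAGE_PATH):
        print(extract_invoice())
    else:
        print(f"Place your invoice image at {IMAGE_PATH} first.")
```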
Just replace the API Key and the image path and run the code. You’ll get the output as follows:
Now you can simply use this data to feed your invoice processing pipeline and do the rest of the processing.
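Before the reply reaches a pipeline, it usually helps to turn the model's text into a predictable structure. The sketch below assumes the prompt asked for a JSON object with snake_case keys; the key names, the `parse_model_reply` helper, and the sample reply are hypothetical, and a real reply may need more defensive handling.

```python
import json
import re

# Assumed keys, mirroring the common invoice parameters.
REQUIRED_KEYS = [
    "invoice_number",
    "invoice_date",
    "due_date",
    "vendor_name",
    "vendor_address",
    "vendor_phone_number",
    "vendor_email",
]


def parse_model_reply(reply_text):
    """Turn the model's text reply into a dict containing every expected key."""
    # The reply may wrap the JSON in a Markdown code fence or extra prose,
    # so take everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))
    # Missing fields become None so downstream code sees a stable shape.
    return {key: data.get(key) for key in REQUIRED_KEYS}


# Made-up reply for illustration, wrapped in a code fence as models often do.
fence = "`" * 3
sample_reply = (
    fence + 'json\n{"invoice_number": "INV-001", "vendor_name": "Acme Corp"}\n' + fence
)
parsed = parse_model_reply(sample_reply)
print(parsed)
```

Filling absent keys with `None` means the database or downstream system always receives the same set of columns, whatever a particular invoice happened to contain.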
A New Era of Invoice Processing
Google Gemini Vision marks a paradigm shift in invoice parsing. Its multimodal approach promises to:
- Reduce manual processing: Automation significantly minimizes tedious manual data entry, saving time and resources.
- Boost accuracy: Multimodal processing minimizes errors and inconsistencies, leading to more reliable financial and inventory data.
- Unlock deeper insights: Extracted data readily integrates with existing systems, enabling insightful analysis and reporting.
- Streamline workflows: Automated invoice processing fosters faster approvals, payments, and other workflow steps.
Need to implement this invoice processing using AI at enterprise scale? We’re here to help. Reach out to us and we’ll get back to you in 24 hours.