Azure AI Document Intelligence for RAG use cases

Anurag Chatterjee
10 min read · Jun 8, 2024


Microsoft Azure’s AI Document Intelligence (previously Azure Form Recognizer) can extract “rich” content from multiple file formats, including PDFs and Microsoft Office files. It can be used as a service to extract content from different types of files for Retrieval Augmented Generation (RAG) use cases; the extracted content can then be chunked and embedded for retrieval. Alternatives include APIs like Unstructured, and language-specific libraries like pypdf for extracting text from PDF files, python-docx for extracting text from DOCX files, etc.
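In a pipeline that mixes these libraries, the first decision is simply which extractor handles which file type. A minimal sketch of that routing, assuming an extension-based dispatch (the `pick_extractor` helper and its mapping are my own illustration, not part of any of these libraries):

```python
from pathlib import Path

# Hypothetical mapping from file extension to the extraction library to use.
# The values only name the libraries discussed above; nothing is imported here.
EXTRACTORS = {
    ".pdf": "pypdf",
    ".docx": "python-docx",
    ".html": "unstructured",  # assumption: Unstructured handles HTML pages
}


def pick_extractor(filename: str) -> str:
    """Return the name of the library to use for this file, or raise for unknown types."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"No extractor configured for {suffix!r} files")
```

A service like Azure AI Document Intelligence replaces this per-format plumbing with a single endpoint, which is part of its appeal for RAG ingestion.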

I present below some of my learnings from using this service and also notes from a Microsoft course in italics.

Azure AI Document Intelligence and Azure AI Vision

As an Azure AI Service, Azure AI Document Intelligence is a high-level AI service that enables developers to access data in forms quickly. It’s built on the lower level Azure AI Services, including Azure AI Vision.

If you use Azure AI Vision with its Optical Character Recognition (OCR) feature, you can submit photographed or scanned documents and extract their words and text in JSON format. This functionality is similar to Azure AI Document Intelligence and can make it difficult to choose from these services.

If you want to extract simple words and text from a picture of a form or document, without contextual information, Azure AI Vision OCR is an appropriate service to consider. You might want to use this service if you already have your own analysis code, for example. However, Azure AI Document Intelligence includes a more sophisticated analysis of documents. For example, it can identify key/value pairs, tables, and context-specific fields. If you want to deploy a complete document analysis solution that enables users to both extract and understand text, consider Azure AI Document Intelligence.

In my experience, an OCR service alone falls short of what is typically expected in RAG use cases, where the chunked content should have a coherent meaning that can later be used to answer users’ queries. If we extract a “bag of words” without an understanding of the document layout, the performance of the vector-based retrieval and the final answer may not be up to the mark.

Using models with Azure AI Document Intelligence

Use a model to inform Azure AI Document Intelligence about the type of data you expect to be in the documents you’re analyzing. If your forms have a common structure or layout, you can increase the accuracy of the results and control the structure of the output data by using the most appropriate model. Azure AI Document Intelligence outputs data in JSON format, which is widely compatible with many databases, other storage locations, and programming languages.

Azure AI Document Intelligence includes several prebuilt models for common types of forms and documents. If your forms are of one of these types, you can extract information from them without training your own custom models. It’s very quick to create and deploy an Azure AI Document Intelligence solution when you use prebuilt models.

In Azure AI Document Intelligence, three of the prebuilt models are for general document analysis:

  • Read
  • General document
  • Layout

The other prebuilt models expect a common type of form or document:

  • Invoice
  • Receipt
  • W-2 US tax declaration
  • ID Document
  • Business card
  • Health insurance card

If you have an unusual or unique type of form, you can use the general document analysis prebuilt models above to extract information from it. However, if you want to extract more specific information than the prebuilt models support, you can create a custom model and train it by using examples of completed forms.

You can also associate multiple custom models, trained on different types of document, into a single model, known as a composed model. With a composed model, users can submit forms of different types to a single service, which identifies them and selects the most appropriate custom model to use in their analysis.

Azure AI Document Intelligence includes Application Programming Interfaces (APIs) for each of the model types you’ve seen. The following languages are supported:

  • C#/.NET
  • Java
  • Python
  • JavaScript

If you prefer to use another language, you can call Azure AI Document Intelligence by using its RESTful web service.
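For languages without an SDK, the analyze operation is a plain HTTP POST. A sketch of building that request with only the standard library, assuming the `documentModels/{modelId}:analyze` path and the `2024-02-29-preview` API version mentioned later in this article (check the REST reference for your service version; the `build_analyze_request` helper is my own):

```python
import json
import urllib.request


def build_analyze_request(endpoint: str, key: str, model_id: str, doc_url: str,
                          api_version: str = "2024-02-29-preview") -> urllib.request.Request:
    """Build the POST request that starts a Document Intelligence analysis of a URL."""
    url = (f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
           f"{model_id}:analyze?api-version={api_version}")
    body = json.dumps({"urlSource": doc_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Ocp-Apim-Subscription-Key": key,  # access key authentication
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending the request returns 202 Accepted with an Operation-Location header,
# which you then poll until the analysis completes:
# resp = urllib.request.urlopen(build_analyze_request(endpoint, key, "prebuilt-layout", doc_url))
# operation_url = resp.headers["Operation-Location"]
```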

Connect to Azure AI Document Intelligence using Python

When you write an application that uses Azure AI Document Intelligence, you need two pieces of information to connect to the resource:

  • Endpoint. This is the URL where the resource can be contacted.
  • Access key. This is a unique string that Azure uses to authenticate the call to Azure AI Document Intelligence.
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, AnalyzeResult

endpoint = "<your-endpoint>"
key = "<your-key>"

docUrl = "<url-of-document-to-analyze>"

document_analysis_client = DocumentIntelligenceClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

poller = document_analysis_client.begin_analyze_document(
    "prebuilt-document", AnalyzeDocumentRequest(url_source=docUrl)
)
result: AnalyzeResult = poller.result()
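Once the poller completes, the returned result exposes the extracted content through attributes such as `content`, `pages`, `tables`, and `key_value_pairs`. A small sketch of summarising what came back (the `summarize_result` helper is my own, not part of the SDK, and works on any object with those attributes):

```python
def summarize_result(result) -> dict:
    """Count the main elements the general document model returns.

    Accepts any object shaped like AnalyzeResult; collection attributes
    may be None when the model found nothing of that kind.
    """
    return {
        "pages": len(result.pages or []),
        "tables": len(result.tables or []),
        "key_value_pairs": len(result.key_value_pairs or []),
        "characters": len(result.content or ""),
    }
```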

Prebuilt models

Document types such as invoices and receipts vary in different businesses and industry but have similar structures and key-value pairs. For example, a “Total cost” value is likely to appear on almost all invoices although it might be called “Total”, “Sum”, or some other name. Microsoft has provided a set of prebuilt models with Azure AI Document Intelligence to handle the most common types of documents. You don’t have to train these models and you can create solutions using them very quickly.

General document analysis models

Three of the prebuilt models are designed to handle general documents and extract words, lines, structure and other information such as the language the document is written in:

  • Read. Use this model to extract words and lines from both printed and hand-written documents. It also detects the language used in the document.
  • General document. Use this model to extract key-value pairs and tables in your documents.
  • Layout. Use this model to extract text, tables, and structure information from forms. It can also recognize selection marks such as check boxes and radio buttons.

For RAG use cases, the Layout model is very useful for extracting text, tables, and selection marks.

Screenshots in Azure AI Document Intelligence Studio show the data that the Layout model extracts.

The code snippet below shows how to extract content with layout information using Azure AI Document Intelligence. It is adapted from the popular azure-search-openai-demo repository.

import html
import logging
from abc import ABC
from typing import IO, AsyncGenerator, Union

from azure.ai.documentintelligence.aio import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import DocumentTable
from azure.core.credentials import AzureKeyCredential
from azure.core.credentials_async import AsyncTokenCredential


class Page:
    """
    A single page from a document.

    Attributes:
        page_num (int): Page number
        offset (int): If the text of the entire document was concatenated into a
            single string, the index of the first character on the page. For example,
            if page 1 had the text "hello" and page 2 had the text "world", the
            offset of page 2 is 5.
        text (str): The text of the page
    """

    def __init__(self, page_num: int, offset: int, text: str):
        self.page_num = page_num
        self.offset = offset
        self.text = text


class SplitPage:
    """
    A section of a page that has been split into a smaller chunk.
    """

    def __init__(self, page_num: int, text: str):
        self.page_num = page_num
        self.text = text


class Parser(ABC):
    """
    Abstract parser that parses content into pages.
    """


logger = logging.getLogger("ingester")


class DocumentAnalysisParser(Parser):
    """
    Concrete parser backed by Azure AI Document Intelligence that can parse many
    document formats into pages.
    To learn more, please visit https://learn.microsoft.com/azure/ai-services/document-intelligence/overview
    """

    def __init__(
        self, endpoint: str, credential: Union[AsyncTokenCredential, AzureKeyCredential], model_id="prebuilt-layout"
    ):
        self.model_id = model_id
        self.endpoint = endpoint
        self.credential = credential

    async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
        logger.info("Extracting text from '%s' using Azure Document Intelligence", content.name)

        async with DocumentIntelligenceClient(
            endpoint=self.endpoint, credential=self.credential
        ) as document_intelligence_client:
            poller = await document_intelligence_client.begin_analyze_document(
                model_id=self.model_id, analyze_request=content, content_type="application/octet-stream"
            )
            form_recognizer_results = await poller.result()

            offset = 0
            for page_num, page in enumerate(form_recognizer_results.pages):
                tables_on_page = [
                    table
                    for table in (form_recognizer_results.tables or [])
                    if table.bounding_regions and table.bounding_regions[0].page_number == page_num + 1
                ]

                # mark all positions of the table spans in the page
                page_offset = page.spans[0].offset
                page_length = page.spans[0].length
                table_chars = [-1] * page_length
                for table_id, table in enumerate(tables_on_page):
                    for span in table.spans:
                        # replace all table spans with "table_id" in table_chars array
                        for i in range(span.length):
                            idx = span.offset - page_offset + i
                            if idx >= 0 and idx < page_length:
                                table_chars[idx] = table_id

                # build page text by replacing characters in table spans with table html
                page_text = ""
                added_tables = set()
                for idx, table_id in enumerate(table_chars):
                    if table_id == -1:
                        page_text += form_recognizer_results.content[page_offset + idx]
                    elif table_id not in added_tables:
                        page_text += DocumentAnalysisParser.table_to_html(tables_on_page[table_id])
                        added_tables.add(table_id)

                yield Page(page_num=page_num, offset=offset, text=page_text)
                offset += len(page_text)

    @classmethod
    def table_to_html(cls, table: DocumentTable):
        table_html = "<table>"
        rows = [
            sorted([cell for cell in table.cells if cell.row_index == i], key=lambda cell: cell.column_index)
            for i in range(table.row_count)
        ]
        for row_cells in rows:
            table_html += "<tr>"
            for cell in row_cells:
                tag = "th" if (cell.kind == "columnHeader" or cell.kind == "rowHeader") else "td"
                cell_spans = ""
                if cell.column_span is not None and cell.column_span > 1:
                    cell_spans += f" colSpan={cell.column_span}"
                if cell.row_span is not None and cell.row_span > 1:
                    cell_spans += f" rowSpan={cell.row_span}"
                table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
            table_html += "</tr>"
        table_html += "</table>"
        return table_html
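The Page objects yielded by the parser can then be chunked for embedding, which is what the SplitPage class above exists for. A minimal sketch of a character-based splitter (the `split_pages` function and its `max_chars` parameter are my own, not from the demo repo; a real splitter would also try to break on sentence and table boundaries):

```python
class SplitPage:
    """A section of a page split into a smaller chunk (mirrors the article's class,
    repeated here so the snippet is self-contained)."""

    def __init__(self, page_num: int, text: str):
        self.page_num = page_num
        self.text = text


def split_pages(pages, max_chars: int = 1000):
    """Yield SplitPage chunks of at most max_chars characters from each page.

    Splitting on fixed character counts keeps the idea visible; production
    splitters usually overlap chunks and respect sentence boundaries.
    """
    for page in pages:
        for start in range(0, len(page.text), max_chars):
            yield SplitPage(page_num=page.page_num, text=page.text[start:start + max_chars])
```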

Specific document type models

The other prebuilt models are each designed to handle, and trained on, a specific and commonly used type of document. Some examples include:

  • Invoice. Use this model to extract key information from sales invoices in English and Spanish.
  • Receipt. Use this model to extract data from printed and handwritten receipts.
  • W-2. Use this model to extract data from the United States government’s W-2 tax declaration form.
  • ID document. Use this model to extract data from United States driver’s licenses and international passports.
  • Business card. Use this model to extract names and contact details from business cards.

Custom models

If the prebuilt models don’t suit your purposes, you can create a custom model and train it to analyze the specific type of document users will send to your Azure AI Document Intelligence service. The general document analyzer prebuilt models can extract rich information from these forms and you might be able to use them if your requirements are to obtain general data. However, by using a custom model, trained on forms with similar structures and key-value pairs, you will obtain more predictable and standardized results from your unusual form types.

To train a custom model, you must supply at least five examples of the completed form but the more examples you supply, the greater the confidence levels Azure AI Document Intelligence will return when it analyzes input. The more varied your documents are in terms of structure and terminology, the greater the number of example documents you will need to supply to train a reliable model. You can either supply a labeled dataset to describe the expected data or allow the model to identify key-value pairs and table data based on what it finds in the example forms. Also, make sure your training forms include examples that span the full range of possible input. For example, if you are expecting both hand-written and printed entries, include them both in your training.

Once you have trained a custom model in this way, Azure AI Document Intelligence can accurately and predictably identify information in your unique forms.

There are two kinds of custom model:

  • Custom template models. A custom template model is most appropriate when the forms you want to analyze have a consistent visual template. If you remove all the user-entered data from the forms and find that the blank forms are identical, use a custom template model. Custom template models support 9 different languages for handwritten text and a wide range of languages for printed text. If you have a few different variations of the form templates, train a model for each of the variations and then compose the models together into a single model. The service will invoke the model best suited to analyze the document.
  • Custom neural models. A custom neural model can work across the spectrum of structured to unstructured documents. Documents like contracts with no defined structure or highly structured forms can be analyzed with a neural model. Neural models work on English with the highest accuracy and a marginal drop in accuracy for Latin based languages like German, French, Italian, Spanish, and Dutch. Try using the custom neural model first if your scenario is addressed by the model.

Composed models

A composed model is one that consists of multiple custom models. Typical scenarios where composed models help are when you don’t know the submitted document type and want to classify and then analyze it. They are also useful if you have multiple variations of a form, each with a trained individual model. When a user submits a form to the composed model, Document Intelligence automatically classifies it to determine which of the custom models should be used in its analysis. In this approach, a user doesn’t have to know what kind of document it is before submission. That can be helpful when you’re using lots of similar forms or when you want to publish a single endpoint for all your form types.

The results from a composed model include the docType property, which indicates the custom model that was chosen to analyze each form.
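Reading that property off the analysis result is straightforward. A sketch, assuming the result is shaped like AnalyzeResult with a list of recognized documents (the `doc_types_in` helper is my own, not part of the SDK):

```python
def doc_types_in(result):
    """Return the docType the composed model chose for each recognized document,
    along with its confidence score.

    Accepts any object whose .documents items have doc_type and confidence attributes.
    """
    return [(doc.doc_type, doc.confidence) for doc in (result.documents or [])]
```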

In conclusion, while Azure AI Document Intelligence (previously Azure Form Recognizer) is a very powerful service for extracting text from a variety of files, it can be quite expensive to run over a large document library. Take particular note of the API and client versions (e.g. 2024–02–29-preview) to understand which file types are supported; for example, the older Microsoft Office file formats like .ppt, .doc, and .xls are currently not supported. Also check how “pages” are counted for different file types, so that you can estimate the total cost of extracting text from the supported documents in your library before actually using the service.
