Extracting chunks from a PDF and storing them in a locally hosted ChromaDB using Langchain utilities (as of April 2024)
As the ecosystem for building Retrieval Augmented Generation (RAG) applications evolves, it can be challenging to know which tutorial to follow for what are now the basic steps of any RAG application. The indexing step, where text chunks are extracted from documents, embeddings are generated for those chunks, and the content is stored together with the embeddings and optional metadata in a vector database (DB) like Chroma, is a prerequisite for most RAG use cases, where the answer generated by the LLM is grounded in the context retrieved from the vector DB.
To keep the example close to real-world use cases, we will run ChromaDB as a separate process configured via docker-compose as shown below. A production deployment would need additional configuration for authentication, authorization and monitoring of the database, which is outside the scope of this article.
version: '3'
services:
  chromadb:
    image: chromadb/chroma
    container_name: chromadb
    ports:
      - "8005:8000"
    volumes:
      - ./data/chroma_data/:/chroma/chroma
    restart: always
Here we can see that ChromaDB will be exposed on host port 8005 (mapped to port 8000 inside the container) and that the content of the DB will be persisted under ./data/chroma_data/.
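Once the container is up (docker compose up -d), we can, for instance, verify that the server is reachable with a simple heartbeat call. This is just a small sketch and not required for the rest of the article; it uses the host and port from the compose file above.

import chromadb

# Connect to the locally hosted ChromaDB (port mapping defined in docker-compose)
client = chromadb.HttpClient(host="localhost", port=8005)

# heartbeat() returns a timestamp in nanoseconds if the server is reachable
print(client.heartbeat())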
The values needed to connect to the hosted ChromaDB are defined as environment variables as below and will be used in the scripts further down.
CHROMA_HOST=localhost
CHROMA_PORT=8005
CHROMA_COLLECTION_NAME=reports
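These can either be exported in the shell or kept in a .env file. If you go the .env route, one option is python-dotenv; note that it is not in the package list below, so treat this as an optional extra rather than part of the required setup.

import os

from dotenv import load_dotenv  # requires the python-dotenv package

# Reads key=value pairs from a .env file in the current directory into the process environment
load_dotenv()

print(os.getenv("CHROMA_HOST"), os.getenv("CHROMA_PORT"), os.getenv("CHROMA_COLLECTION_NAME"))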
The relevant Python packages are pinned to the following versions:
python = ">=3.11,<3.13"
chromadb = "0.4.24"
langchain = "0.1.16"
pypdf = "4.1.0"
langchain-chroma = "0.1.0"
sentence-transformers = "2.7.0"
The below script uses the above environment variables and packages to perform the indexing step for a specific PDF file, whose path needs to be provided. Please refer to the Langchain documentation for details on some of the steps used below.
import os

import chromadb
from chromadb.config import Settings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Connect to the ChromaDB instance started via docker-compose
chroma_client = chromadb.HttpClient(
    host=os.getenv("CHROMA_HOST"),
    port=int(os.getenv("CHROMA_PORT")),
    settings=Settings(),
)

# Embeddings are generated locally with a sentence-transformers model
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# TODO: put the path to the PDF file to index here
file_path = ""

# Load the PDF (one document per page) and split it into overlapping chunks
loader = PyPDFLoader(file_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunked_documents = text_splitter.split_documents(documents)

# Embed the chunks and store them in the configured collection
Chroma.from_documents(
    documents=chunked_documents,
    embedding=embedding_function,
    collection_name=os.getenv("CHROMA_COLLECTION_NAME"),
    client=chroma_client,
)
print(f"Added {len(chunked_documents)} chunks to ChromaDB")
Finally, once the documents are indexed and added to the vector database, we can verify that the relevant documents can be retrieved from ChromaDB via an approximate nearest neighbour (ANN) search. We can use a plain ChromaDB Python client to check the retrieval as below.
import os

import chromadb

chroma_client = chromadb.HttpClient(
    host=os.getenv("CHROMA_HOST"),
    port=int(os.getenv("CHROMA_PORT")),
)
collection = chroma_client.get_collection(name=os.getenv("CHROMA_COLLECTION_NAME"))

# Retrieve the two chunks closest to the query text
results = collection.query(
    query_texts=["What are the forecasts for 2024?"],
    n_results=2,
)
print(results)
As can be seen above, the chunks nearest to the query are returned by the client, and the content in the result set can be added to the prompt sent to the LLM to generate answers grounded in the data stored in the database.
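To make that last step concrete, here is a minimal sketch of how the retrieved chunks could be stitched into a prompt, continuing from the query above. The prompt wording is my own illustration rather than anything prescribed by Langchain or Chroma, and the actual LLM call is left out.

# results["documents"] is a list of result lists, one per query text
retrieved_chunks = results["documents"][0]
context = "\n\n".join(retrieved_chunks)

question = "What are the forecasts for 2024?"
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

# The prompt can now be sent to the LLM of your choice to generate a grounded answer
print(prompt)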