A Python package for semantic retrieval of WordPress blog articles

Anurag Chatterjee
4 min read · Nov 4, 2023


Results from the library for the WordPress website with sitemap https://learnwoo.com/post-sitemap1.xml and the user query “What was mentioned on web scraping or crawling?”

Some background on the project

As the craze for the Retrieval Augmented Generation (RAG) paradigm built on Large Language Models (LLMs) hit a peak, I decided to build a Python package that retrieves semantically relevant articles from a WordPress blog or website, given its sitemap URL and a user query. This library could be used as a RAG chat add-on for a WordPress website, or on its own as a standalone search/recommendation service.

The package is published to the Python Package Index (PyPI), and the source code is available on GitHub.

WordPress is a popular platform among bloggers, so it is no surprise that both authors and readers want to find existing content in a blog — authors to avoid writing duplicate posts, readers to avoid re-reading familiar topics. Searching with exact keyword matches is often not enough for this. Hence the idea of a semantic retriever: retrieve previous articles that are contextually similar to a query.

In simple words, given a textual query from the user, the package returns the list of articles from the WordPress blog that are most similar to that query. These articles may or may not share the same words as the query, but they are closely related to it in meaning.
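To make “similar in meaning” concrete: the query and each article are turned into embedding vectors, and similarity is measured geometrically, typically with cosine similarity. The sketch below uses made-up three-dimensional vectors purely to illustrate the comparison (real embedding models produce vectors with hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction, near 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" invented for illustration only.
query_vec = [0.9, 0.1, 0.0]          # e.g. "web scraping or crawling"
article_scraping = [0.8, 0.2, 0.1]   # an article about crawling
article_recipes = [0.0, 0.1, 0.9]    # an unrelated article

# The scraping article scores far higher than the unrelated one,
# even if it shares no exact keywords with the query.
print(cosine_similarity(query_vec, article_scraping))
print(cosine_similarity(query_vec, article_recipes))
```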

Solution design

The Python ecosystem has many tools that make it easy to develop such packages by tying together libraries that provide quite complicated functionality under the hood. The availability of remote compute, e.g. inference endpoints from platforms like Hugging Face, lets developers offload some of the heavy AI work to those platforms. Also, open source databases like ChromaDB, which store and efficiently retrieve content based on embeddings/vectors and can run locally as well as a managed service, make it easy to try them out for different use cases.

Tools and libraries used

The major tools and libraries that enabled this package include:

  1. Poetry — Python package management and publishing to PyPI
  2. Typer — create a feature-rich CLI application without a lot of code; its integration with Poetry makes running scripts after installing the library a breeze
  3. BeautifulSoup — parse content from the HTML web pages of a blog
  4. LangChain — although no Large Language Model feature is used in the current version of the library, LangChain has connectors for Hugging Face’s hosted embedding models, Chroma vector index creation, and retrieval of data from vector indices. These make interaction with those services consistent and easy
  5. ChromaDB — store the embedding index locally so that reindexing is not required for each query
  6. Pandas — process intermediate data, which is stored as CSV files

Part 1: Data pipeline / offline index building flow

Once the library is installed, running the two commands below performs the data flow shown in the diagram. The only parameter for each command is the sitemap URL of the WordPress website.

wordpress-recommender download-content "https://learnwoo.com/post-sitemap1.xml"
wordpress-recommender generate-index "https://learnwoo.com/post-sitemap1.xml"
Offline index building from blog articles
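The first step of this flow is discovering which article URLs to download, which is exactly what the sitemap provides. The sketch below shows how a WordPress-style sitemap can be parsed with Python’s standard library; the XML fragment and the `extract_urls` helper are illustrative assumptions, not the package’s actual code:

```python
import xml.etree.ElementTree as ET

# A minimal WordPress-style sitemap fragment (sitemaps use this XML namespace).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://learnwoo.com/article-one/</loc></url>
  <url><loc>https://learnwoo.com/article-two/</loc></url>
</urlset>"""

def extract_urls(sitemap_xml: str) -> list[str]:
    # Each <url><loc> entry points at one blog article to download and index.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

print(extract_urls(SITEMAP_XML))
# → ['https://learnwoo.com/article-one/', 'https://learnwoo.com/article-two/']
```

Each extracted page would then be fetched, parsed with BeautifulSoup, and embedded into the Chroma index.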

Part 2: Query pipeline / online user query

Once the content is downloaded and the index is created, the diagram below shows how user queries are executed and relevant documents are retrieved. The parameters for the command are the sitemap URL of the WordPress website, the user query for which results should be retrieved, and an optional “top” parameter that controls how many results are returned.

wordpress-recommender query-index "https://learnwoo.com/post-sitemap1.xml" --query "What was mentioned on web scraping or crawling?"
Getting relevant articles for the user
Results for the user query “What was mentioned on web scraping or crawling?” for the https://learnwoo.com/ website
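Conceptually, answering a query means embedding it with the same model used for the articles and ranking the stored vectors by similarity. The sketch below implements this “top-k” ranking by hand over a hypothetical in-memory index (URLs and vectors are invented); in the actual package this lookup is delegated to ChromaDB via LangChain:

```python
import math

def top_k(query_vec, index, k=3):
    # Rank all indexed articles by cosine similarity to the query embedding
    # and return the URLs of the k closest ones.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    ranked = sorted(index.items(), key=lambda item: cos(query_vec, item[1]),
                    reverse=True)
    return [url for url, _ in ranked[:k]]

# Hypothetical index mapping article URLs to toy embeddings.
index = {
    "https://learnwoo.com/scraping-guide/": [0.9, 0.1],
    "https://learnwoo.com/theme-review/": [0.1, 0.9],
    "https://learnwoo.com/crawler-tips/": [0.8, 0.3],
}

print(top_k([1.0, 0.0], index, k=2))
# → ['https://learnwoo.com/scraping-guide/', 'https://learnwoo.com/crawler-tips/']
```

The optional “top” parameter of `query-index` plays the same role as `k` here.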

Conclusion

We can scale the above idea by periodically rebuilding the index and exposing query result retrieval as a REST API. Indeed, that is the design of most Retrieval Augmented Generation (RAG) systems, where the retrieval system is paired with a Large Language Model that acts as a reasoning engine over the retrieved search results and generates an answer to the user’s query in a more natural way.
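As a sketch of that REST idea, the minimal service below wraps a placeholder `retrieve` function (standing in for the real Chroma index lookup) behind an HTTP endpoint using only the standard library; a production service would more likely use a framework such as FastAPI:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def retrieve(query: str, top: int = 3) -> list[str]:
    # Placeholder: a real service would query the embedding index here.
    return [f"result-{i}-for-{query}" for i in range(top)]

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve GET /query?q=<user query> as a JSON response.
        params = parse_qs(urlparse(self.path).query)
        query = params.get("q", [""])[0]
        body = json.dumps({"query": query, "results": retrieve(query)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the demo quiet instead of logging every request to stderr.
        pass

if __name__ == "__main__":
    # Bind to port 0 so the OS picks a free port, then issue one test query.
    server = HTTPServer(("127.0.0.1", 0), QueryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/query?q=scraping"
    with urllib.request.urlopen(url) as resp:
        print(json.loads(resp.read()))
    server.shutdown()
```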

Feel free to explore the code on GitHub to see the finer details of how the library works. I hope the library or the code proves useful for your next big idea — and if not, it might still lead to a useful learning journey!
