Workflow to use Great Expectations with Pandas

Anurag Chatterjee
3 min read · Feb 24, 2024


Great Expectations is a robust package for building data quality checks into your data or ML pipelines. However, it requires some setup and familiarization before it can be incorporated into even a simple pipeline.


Consider a scenario where you need to parse incoming CSV files and process them to build a vector index for a Retrieval-Augmented Generation (RAG) application.

This article shows the minimal components required to create expectations for a Pandas dataframe built from the incoming CSVs, and then to use those expectations in a script inside the ML/data pipeline so that each incoming CSV file is validated before any further processing.

Step 1: Create the expectations using a notebook

Based on the contract between the upstream and downstream systems, you probably already have an idea of what the expected data should look like. You can then use a notebook and the Great Expectations package, as shown below, to generate the initial set of expectations. Just like unit tests, the expectations will evolve as the project progresses.
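A minimal sketch of such a notebook cell is shown below. It assumes the legacy Pandas dataset API (available in Great Expectations up to the 0.18 releases); the file path and column names (document_id, text) are hypothetical placeholders for your own data contract.

```python
import json
import great_expectations as ge

# Wrap a sample CSV in a Great Expectations PandasDataset
# (legacy API; the path and columns below are illustrative)
df = ge.read_csv("data/sample_incoming.csv")

# Encode the data contract as expectations
df.expect_column_values_to_not_be_null("document_id")
df.expect_column_values_to_be_unique("document_id")
df.expect_column_values_to_not_be_null("text")
df.expect_column_value_lengths_to_be_between("text", min_value=1, max_value=10_000)

# Collect the expectations into a suite, keeping even the ones that
# failed on the sample data, and save the suite for later use
suite = df.get_expectation_suite(discard_failed_expectations=False)
with open("ge_config/expectation_suite.json", "w") as f:
    json.dump(suite.to_json_dict(), f, indent=2)
```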

Step 2: Have a look at the generated expectations using a JSON viewer

Open the generated expectations in the ge_config folder. Most of the fields in the generated JSON are self-explanatory.
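For reference, an abridged suite generated this way might look roughly like the following (the values here are illustrative):

```json
{
  "expectation_suite_name": "default",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": { "column": "document_id" },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_be_unique",
      "kwargs": { "column": "document_id" },
      "meta": {}
    }
  ],
  "meta": { "great_expectations_version": "0.18.x" },
  "data_asset_type": "Dataset"
}
```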

It is always a good idea to save these expectations in your source control so that, in case of test failures, you can clearly identify which version of the expectation suite was used.

Step 3: Use the generated expectations in your inference/data pipeline

Here we would like to apply the expectations to the incoming data and assess whether it meets the “expectations” before we start any further processing. For automated systems this is a necessity: without such tests, “bad” data propagates downstream without any obvious failures, and end-users might start to see unexpected behaviour from the apps.

We can add a step to our inference/data pipeline where a script similar to the one shown below applies the previously generated expectations to the incoming data.
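A sketch of that validation step, again assuming the legacy Pandas API and hypothetical file paths:

```python
import great_expectations as ge

def validate_incoming_csv(csv_path: str, suite_path: str) -> None:
    """Validate an incoming CSV against a saved expectation suite,
    failing fast so bad data never reaches downstream steps."""
    df = ge.read_csv(csv_path)
    # Passing the suite's file path lets Great Expectations load it directly
    results = df.validate(expectation_suite=suite_path, result_format="SUMMARY")
    if not results.success:
        failed = [
            r.expectation_config.expectation_type
            for r in results.results
            if not r.success
        ]
        raise ValueError(f"Data validation failed for: {failed}")

validate_incoming_csv("data/incoming.csv", "ge_config/expectation_suite.json")
```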

If Great Expectations detects that the data test has failed, the script raises an exception and the entire pipeline should fail without any further processing.

In that case, the data pipeline team has to investigate the reason for the data anomaly. If there is indeed a change in the upstream system producing the data, the data producers have to clarify what changed, and the change then needs to propagate through the pipeline after the expectation suite is updated. In other words, steps 1 to 3 are repeated on the updated data.
