3 ways to integrate machine learning into your systems
Consuming the outputs of machine learning models in batch mode is one of the most widely used patterns in the industry for bringing machine learning into software solutions.
Custom solutions for batch prediction are engineered in many ways, but in most cases they share 4 main components:
- The data source from which the input features are read.
- The compute service responsible for transforming the data so it can be fed to a trained model, for running the model or models to obtain predictions, and, where needed, for converting the outputs into a format that a downstream system can consume.
- The data sink where the model outputs are stored, along with other factual data, for use by downstream systems.
- A workflow management platform that schedules the different compute steps and ensures that all dependencies of a step are met before it runs.
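The interaction between these four components can be sketched as a minimal batch job. This is an illustrative skeleton using only the standard library: the CSV source, the `transform` and `predict` functions, and the column names are all assumptions, and a real system would load a serialized model and write to an actual sink.

```python
import csv
import io

def transform(row):
    """Feature-engineering step: cast raw source fields into model inputs."""
    return [float(row["feature_a"]), float(row["feature_b"])]

def predict(features):
    """Stand-in for a trained model; here just a dummy score (the sum)."""
    return sum(features)

def run_batch(source_csv):
    """Read from the data source, transform, predict, and collect rows for the sink."""
    results = []
    for row in csv.DictReader(io.StringIO(source_csv)):
        results.append({"id": row["id"], "score": predict(transform(row))})
    return results

# The workflow management platform would invoke this step on a schedule.
raw = "id,feature_a,feature_b\n1,0.5,1.5\n2,2.0,3.0\n"
print(run_batch(raw))  # [{'id': '1', 'score': 2.0}, {'id': '2', 'score': 5.0}]
```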
The following sections discuss 3 ways in which these 4 components are realized in practice, along with their pros and cons.
Hadoop, Spark and Airflow
Apache Hadoop serves as the data lake, storing data as partitioned Parquet files. Apache Spark performs the transformations and, depending on the data volume, separately trained SparkML or Python-based models/frameworks run the batch inference; the outputs are written back to the data lake, also as partitioned Parquet files. The whole process is orchestrated by Airflow, which also provides operational monitoring of the pipelines.
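The Hive-style partitioned layout that Spark reads and writes can be illustrated with a small path helper; the base path and date-based partition columns here are assumptions for illustration, not prescribed by the pattern.

```python
from datetime import date
from pathlib import PurePosixPath

def partition_path(base, run_date):
    """Build the Hive-style partition directory that Spark produces with
    df.write.partitionBy("year", "month", "day").parquet(base)."""
    return (
        PurePosixPath(base)
        / f"year={run_date.year}"
        / f"month={run_date.month:02d}"
        / f"day={run_date.day:02d}"
    )

# Each daily Airflow run reads yesterday's source partition and writes
# a matching output partition in the data lake.
p = partition_path("/data/lake/predictions", date(2023, 4, 7))
print(p)  # /data/lake/predictions/year=2023/month=04/day=07
```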
Pros
- Scales to big-data volumes
- Model management with MLflow integration
- Reuses existing cluster setups (for example, Cloudera)
Cons
- Python environment management is difficult when there are many dependencies
- A model update requires redeploying the whole pipeline
Public cloud — Azure
Data stored as partitioned Parquet files can be accessed by Azure Machine Learning pipelines via datastores and datasets. Azure Machine Learning provides environment and model management, and these building blocks can be combined into machine learning pipelines. A pipeline can then be published with parameters, which yields a unique URL referencing it. The published pipeline endpoint can be connected to Azure Data Factory via a linked service for scheduling and operational monitoring.
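As a rough sketch, the REST call that Azure Data Factory (or any HTTP client) makes against a published pipeline endpoint can be constructed as below. The endpoint URL and token are placeholders, and the exact body schema (`ExperimentName`, `ParameterAssignments`) is an assumption based on the published-pipeline REST contract; check the Azure Machine Learning documentation for your SDK version.

```python
import json
import urllib.request

def build_pipeline_request(endpoint_url, aad_token, experiment_name, parameters):
    """Construct (but do not send) the POST request that triggers a
    published Azure ML pipeline run with parameter assignments."""
    body = json.dumps({
        "ExperimentName": experiment_name,
        "ParameterAssignments": parameters,
    }).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Hypothetical endpoint and token, for illustration only.
req = build_pipeline_request(
    "https://ml.example.invalid/pipelines/<pipeline-id>/submit",
    "<aad-token>",
    "batch-scoring",
    {"run_date": "2023-04-07"},
)
print(req.get_method(), req.get_header("Content-type"))  # POST application/json
```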
Pros
- Fewer moving parts/points of failure, as all components are managed services
- Easy Python environment management
- Standardized MLOps available from Azure
Cons
- Vendor lock-in with Azure SDK usage
- A model update requires redeploying the whole pipeline
PaaS service as compute, model available as API
Pivotal Cloud Foundry (PCF), a popular PaaS, runs Java Spring Batch applications that do the data processing; PCF's scheduler handles scheduling and monitoring. Maven libraries provide connectivity to Hadoop and MySQL storage. The models are deployed on enterprise Kubernetes (e.g. OpenShift) as REST services. The Spring Batch application sends a batch request to the model endpoint, retrieves the batch response, and writes the outputs to MySQL tables from where downstream applications can consume them.
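The batching logic of that step can be sketched as follows (in Python here for brevity, even though the article's processing app is Java): records are chunked into fixed-size batch requests for the model's REST endpoint. The payload shape (`"instances"`) and field names are illustrative assumptions.

```python
import json

def chunked(records, batch_size):
    """Split records into fixed-size batches, as the batch app would
    before calling the model endpoint."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def build_batch_payloads(records, batch_size=2):
    """One JSON payload per batch request to the model's REST service."""
    return [json.dumps({"instances": batch}) for batch in chunked(records, batch_size)]

rows = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.0}, {"id": 3, "x": 1.5}]
payloads = build_batch_payloads(rows)
print(len(payloads))  # 2 requests for 3 rows with batch_size=2
```

Choosing the batch size is the main knob here: larger batches mean fewer round trips to the model API, but each request places a bigger spike of load on the model service.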
Pros
- Separation of responsibilities: data transformation can be done in JVM languages while the model inference script and environment remain Python-based
- Complete decoupling allows blue-green deployments without any operational downtime
- Easy deployment of model inference with MLflow integration
Cons
- Complex, with many points of failure
- The model API needs to handle spikes in load
- Costlier, since 2 different compute services must be kept available
While this list is not exhaustive, batch machine learning in most enterprise software systems will resemble one of the above patterns.