3 ways to integrate machine learning into your systems
Consuming the outputs of machine learning models in batch mode is one of the most widely used patterns in the industry for bringing machine learning into software solutions.
Custom solutions for batch prediction are engineered in many ways, but in most cases they share 4 main components:
- The data source from which the input features are read.
- The compute service responsible for transforming the data so it can be fed to a trained model, for running the model or models to obtain predictions, and, where needed, for converting the outputs into a format that a downstream system can consume.
- The data sink where the model outputs are stored, along with other factual data, for use by downstream systems.
- A workflow management platform that schedules the different compute steps and ensures that all dependencies of a step are met before it runs.
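The interaction between these four components can be sketched as a minimal batch job. This is an illustrative skeleton using only the standard library: the CSV source, the `transform` and `predict` functions, and the column names are all assumptions, and a real system would load a serialized model and write to an actual sink.

```python
import csv
import io

def transform(row):
    """Feature-engineering step: cast raw source fields into model inputs."""
    return [float(row["feature_a"]), float(row["feature_b"])]

def predict(features):
    """Stand-in for a trained model; here just a dummy score (the sum)."""
    return sum(features)

def run_batch(source_csv):
    """Read from the data source, transform, predict, and collect rows for the sink."""
    results = []
    for row in csv.DictReader(io.StringIO(source_csv)):
        results.append({"id": row["id"], "score": predict(transform(row))})
    return results

# The workflow management platform would invoke this step on a schedule.
raw = "id,feature_a,feature_b\n1,0.5,1.5\n2,2.0,3.0\n"
print(run_batch(raw))  # [{'id': '1', 'score': 2.0}, {'id': '2', 'score': 5.0}]
```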
The following sections discuss 3 ways in which these 4 components are realized in practice, along with their pros and cons.
Hadoop, Spark and Airflow
Apache Hadoop serves as the data lake, storing data as partitioned Parquet files. Apache Spark performs the transformations and, depending on the data volume, separately trained SparkML or Python-based models/frameworks run the batch inference; the outputs are written back to the data lake, also as partitioned Parquet files. The whole process is orchestrated by Airflow, which also provides operational monitoring of the pipelines.
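The Hive-style partitioned layout that Spark reads and writes can be illustrated with a small path helper; the base path and date-based partition columns here are assumptions for illustration, not prescribed by the pattern.

```python
from datetime import date
from pathlib import PurePosixPath

def partition_path(base, run_date):
    """Build the Hive-style partition directory that Spark produces with
    df.write.partitionBy("year", "month", "day").parquet(base)."""
    return (
        PurePosixPath(base)
        / f"year={run_date.year}"
        / f"month={run_date.month:02d}"
        / f"day={run_date.day:02d}"
    )

# Each daily Airflow run reads yesterday's source partition and writes
# a matching output partition in the data lake.
p = partition_path("/data/lake/predictions", date(2023, 4, 7))
print(p)  # /data/lake/predictions/year=2023/month=04/day=07
```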
Pros
- Scales to big-data volumes
- Model management with MLflow integration
- Reuses existing cluster setups (for example, Cloudera)
Cons
- Python environment management is difficult when there are many dependencies
- A model update requires redeploying the whole pipeline
Public cloud — Azure
Data stored as partitioned Parquet files can be accessed by Azure Machine Learning pipelines via datastores and datasets. Azure Machine Learning provides environment and model management, and these building blocks can be combined into machine learning pipelines. A pipeline can then be published with parameters, which yields a unique URL referencing it. The published pipeline endpoint can be connected to Azure Data Factory via a linked service for scheduling and operational monitoring.
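As a rough sketch, the REST call that Azure Data Factory (or any HTTP client) makes against a published pipeline endpoint can be constructed as below. The endpoint URL and token are placeholders, and the exact body schema (`ExperimentName`, `ParameterAssignments`) is an assumption based on the published-pipeline REST contract; check the Azure Machine Learning documentation for your SDK version.

```python
import json
import urllib.request

def build_pipeline_request(endpoint_url, aad_token, experiment_name, parameters):
    """Construct (but do not send) the POST request that triggers a
    published Azure ML pipeline run with parameter assignments."""
    body = json.dumps({
        "ExperimentName": experiment_name,
        "ParameterAssignments": parameters,
    }).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Hypothetical endpoint and token, for illustration only.
req = build_pipeline_request(
    "https://ml.example.invalid/pipelines/<pipeline-id>/submit",
    "<aad-token>",
    "batch-scoring",
    {"run_date": "2023-04-07"},
)
print(req.get_method(), req.get_header("Content-type"))  # POST application/json
```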
Pros
- Fewer moving parts/points of failure, as all components are managed services
- Easy Python environment management
- Standardized MLOps available from Azure
Cons
- Vendor lock-in with Azure SDK usage
- A model update requires redeploying the whole pipeline
PaaS service as compute, model available as API
Pivotal Cloud Foundry (PCF), a popular PaaS, runs Java Spring Batch applications that do the data processing; PCF's scheduler handles scheduling and monitoring. Maven libraries provide connectivity to Hadoop and MySQL storage. The models are deployed on enterprise Kubernetes (e.g. OpenShift) as REST services. The Spring Batch application sends a batch request to the model endpoint, retrieves the batch response, and writes the outputs to MySQL tables from where downstream applications can consume them.
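The batching logic of that step can be sketched as follows (in Python here for brevity, even though the article's processing app is Java): records are chunked into fixed-size batch requests for the model's REST endpoint. The payload shape (`"instances"`) and field names are illustrative assumptions.

```python
import json

def chunked(records, batch_size):
    """Split records into fixed-size batches, as the batch app would
    before calling the model endpoint."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def build_batch_payloads(records, batch_size=2):
    """One JSON payload per batch request to the model's REST service."""
    return [json.dumps({"instances": batch}) for batch in chunked(records, batch_size)]

rows = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.0}, {"id": 3, "x": 1.5}]
payloads = build_batch_payloads(rows)
print(len(payloads))  # 2 requests for 3 rows with batch_size=2
```

Choosing the batch size is the main knob here: larger batches mean fewer round trips to the model API, but each request places a bigger spike of load on the model service.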
Pros
- Separation of responsibilities: data transformation can be done in JVM languages while the model inference script and environment remain Python-based
- Complete decoupling allows blue-green deployments without any operational downtime
- Easy deployment of model inference with MLflow integration
Cons
- Complex, with many points of failure
- The model API needs to handle spikes in load
- Costlier, since 2 different compute services must be kept available
While this list is not exhaustive, batch machine learning in most enterprise software systems will resemble one of the above patterns.