AzureML — Connecting batch ML pipelines via Datasets

Anurag Chatterjee
2 min read · May 8, 2024


In most batch machine learning workflows there are several data and/or ML pipelines. Each pipeline takes an input, usually a directory with multiple files, does some processing, which might involve a trained model or “curated” prompts with an LLM, and generates outputs, which might also be a directory with multiple files. When there are multiple pipelines, the outputs of the producer pipelines need to be fed as inputs to the consumer pipelines. An example would be a pipeline where labelled outputs from a curation/labelling platform are fed into another pipeline, which uses those outputs to create prompts with dynamic few-shot examples that can then be leveraged for NLP tasks like NER and relevance classification without training specialized models.

AzureML

In such scenarios, it is useful to leverage the Datasets feature, so that we can register a new version of the dataset whenever the producer pipeline executes successfully. We can then use this “latest” version as input to the consumer pipeline. A Dataset is a reference to data in a datastore, which is itself an abstraction over an Azure storage account. A dataset can therefore be mounted as a path on local storage, and read and write file operations can be performed as if on the local file system.
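To make the idea concrete before looking at the pipelines, here is a minimal sketch (separate from the producer/consumer flow below) that registers a FileDataset from a folder in a datastore. The folder and dataset names are made up for the illustration.

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()

# Use the workspace's default datastore for the illustration
datastore = ws.get_default_datastore()

# A FileDataset is just a reference to the files under a path in the datastore;
# "labelled-outputs/" is a hypothetical folder
file_dataset = Dataset.File.from_files(path=(datastore, "labelled-outputs/"))

# create_new_version=True bumps the dataset version on every registration
file_dataset.register(
    workspace=ws,
    name="exampledataset",  # hypothetical dataset name
    create_new_version=True,
)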

The sample below shows how an OutputFileDatasetConfig can be registered as a dataset whenever files are written to the path it references in the producer pipeline. The first run creates a new Dataset with version 1, and each subsequent run simply increments the version. The most recently generated version is always available as version = “latest”.

from azureml.data import OutputFileDatasetConfig

# Files written to this path land under outputs/{run-id} in the
# "outputdatastore" datastore; on successful completion the output is
# registered as a new version of the "outputdataset" dataset
output_path = OutputFileDatasetConfig(
    name="output_path",
    destination=("outputdatastore", "outputs/{run-id}")
).register_on_complete(name="outputdataset")
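For context, here is a sketch of how that output could be wired into a step of the producer pipeline. The script name, source directory and compute target are placeholders for this example.

from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# The step script receives the resolved output directory via --output_path
producer_step = PythonScriptStep(
    name="produce_labels",
    script_name="produce.py",        # hypothetical script
    source_directory="./producer",   # hypothetical folder
    arguments=["--output_path", output_path],
    compute_target="cpu-cluster",    # placeholder compute target
)

producer_pipeline = Pipeline(workspace=ws, steps=[producer_step])
Experiment(ws, "producer-pipeline").submit(producer_pipeline)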

The sample below shows how the “latest” version of the dataset can be referenced and then mounted as a local path in a consumer pipeline. Note that these examples use the AzureML SDK v1.

from azureml.core import Dataset

# Get the workspace object (ws) using config or from the current run

input_dataset = Dataset.get_by_name(
    ws,
    name="outputdataset",
    version="latest"
)

with input_dataset.mount() as mount_context:
    input_path = mount_context.mount_point
    print(input_path)
    # All the files generated by the producer will be here
    # You can use os.listdir to list the contents inside this path
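Instead of mounting the dataset manually inside the script, the same dataset can also be passed to a consumer pipeline step as a mounted input, which AzureML resolves to a local path at runtime. A sketch, again with placeholder script and compute names:

from azureml.core import Dataset, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
input_dataset = Dataset.get_by_name(ws, name="outputdataset", version="latest")

# The dataset is mounted for the step; the script reads the resolved
# directory from its --input_path argument
consumer_step = PythonScriptStep(
    name="build_prompts",
    script_name="consume.py",        # hypothetical script
    source_directory="./consumer",   # hypothetical folder
    arguments=["--input_path", input_dataset.as_named_input("labels").as_mount()],
    compute_target="cpu-cluster",    # placeholder compute target
)

consumer_pipeline = Pipeline(workspace=ws, steps=[consumer_step])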

To orchestrate the producer and the consumer, one pattern I have used is an Azure Data Factory pipeline that sets the dependency between the producer and the consumer and then schedules the complete workflow on a trigger.
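For the Data Factory “Machine Learning Execute Pipeline” activities to invoke the producer and the consumer, each AzureML pipeline first has to be published. A minimal sketch, assuming producer_pipeline and consumer_pipeline are Pipeline objects like the ones in the sketches above:

# Publish both pipelines so ADF can reference them by their published pipeline IDs
published_producer = producer_pipeline.publish(
    name="producer-pipeline",
    description="Writes labelled outputs and registers 'outputdataset'",
)
published_consumer = consumer_pipeline.publish(
    name="consumer-pipeline",
    description="Builds prompts from the latest version of 'outputdataset'",
)

# In ADF, chain two Machine Learning Execute Pipeline activities that point at
# these IDs, with the consumer activity depending on the producer succeeding
print(published_producer.id, published_consumer.id)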


Written by Anurag Chatterjee

I am an experienced professional who likes to build solutions to real-world problems using innovative technologies and then share my learnings with everyone.
