Introduction
Last time, we introduced many questions that arise when creating machine learning applications that conform to the original 12-factor app.
We could modify some of the original factors to make them broader, but that runs into potentially dangerous territory: these new, broader factors may not guide development enough to be helpful. We could go the constitutional route and have Amendments! Still, raising more questions is generally not the direction we want to go.
So we answer this one by choosing to create new factors. First, let's go over the manifesto of the original 12-factor app. Our main goal will be to keep its core principles:
Use declarative formats for setup automation, to minimize time and cost for new developers joining the project;
Have a clean contract with the underlying operating system, offering maximum portability between execution environments;
Are suitable for deployment on modern cloud platforms, obviating the need for servers and systems administration;
Minimise divergence between development and production, enabling continuous deployment for maximum agility;
And can scale up without significant changes to tooling, architecture, or development practices.
What are the minimal changes we need to transfer these principles to machine learning?
Use declarative formats for setup automation to minimize time and cost for new developers joining the project;
Have a clean contract with the underlying operating system and data systems, offering maximum portability and reproducibility between execution and training environments;
Are suitable for deployment and training on modern cloud platforms, obviating the need for servers and systems administration;
Minimise divergence between development, training, and production, enabling continuous deployment for maximum agility;
And can scale up without significant changes to tooling, architecture, or development practices.
In keeping with the KISS principle, simply adding the terms "training" and "data" appears to accomplish the task!
Factor 1: Single Source of Truth
In the original 12 factors, the first factor is the Codebase: “One codebase tracked in revision control, many deploys.” A model, and the results of inferencing with that model, depend on internal data and model parameters that may be stored separately from the codebase. Furthermore, how the data is combined with the codebase to create those parameters depends on the version of the data used and on the code-specific settings and techniques applied. These settings are generally stored externally to the model to facilitate experimentation and thus need to be versioned themselves.
One single source of truth for our code, data, and artifacts tracked in revision control, many deploys. With regard to training, each training session is associated with a variation in at least one of these objects (code, data, or initial configurations), leading to individually versioned artifacts.
Figure 1. The objects that need to be versioned so our model can produce reproducible results in deployment and other environments, assuming the model is not allowed to change its state during deployment.
Code Base
First comes the code base! If only the code base were all that made a model; unfortunately, the weights we use during inference are artifacts that result from a training step where model code and training data come together. Not only is the code evolving (bug fixes, improvements, etc.), but so is the data. So we require the code, the data, and all the artifacts to be versioned if we want a versioned model during deployment.
Tons of questions pop up from this simple diagram, especially as one tries to create a systematic way to compare models in production and to build governance frameworks around them for overall risk management.
We very quickly move from sequential versioning to one that considers all the aspects of experimental trial and error associated with modeling and with finding the correct product/model fit for the market. Imagine a deployment pipeline where you want to test multiple trained models concurrently. Here, hashes are helpful and almost required, especially in a largely automated framework. One can look at AutoML frameworks, where converging to the correct model could require multiple branches of the model architecture and the associated initial configurations.
Nonetheless, from the user's point of view, a particular semantic denomination is generally preferred. One can achieve this by keeping the information associated with the version of the data in the code base itself, or in a configuration artifact stored alongside the end model, and generating a semantic version from those.
Whether one uses semantic versioning or hashes over an unstructured dataset, a standard is better than none. Young teams tend to use git-like tools as much as possible to keep track of changes, something we are all familiar with. As the team and the deployment evolve, other choices are required to deal with the divergent nature of experimentation without slowing down development.
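As a rough illustration of the hash-based approach, here is a minimal Python sketch that derives a deterministic experiment identifier from the current git commit, a precomputed data fingerprint, and the training configuration; the names and the hashing scheme are illustrative, not a prescribed standard.

```python
import hashlib
import json
import subprocess


def experiment_id(data_fingerprint: str, config: dict) -> str:
    """Derive a deterministic identifier from code, data, and configuration.

    Assumes the code lives in a git repository and that the data already has
    a fingerprint (e.g., a content hash or a DVC-style checksum).
    """
    code_version = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()
    combined = f"{code_version}:{data_fingerprint}:{config_hash}"
    return hashlib.sha256(combined.encode()).hexdigest()[:12]


# Two runs that share code, data, and configuration map to the same identifier,
# which is exactly the property an automated pipeline needs to compare models.
print(experiment_id("a1b2c3", {"lr": 1e-3, "epochs": 10}))
```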
Data
For data versioning, the simplest answer is to start with a single source of truth. While data can be delivered in functionally different ways, we don't want the data to diverge or change: if I call it in production, it should be the same as when I called it during development.
There is one caveat: real-time data used as input. We don't get to see it before deployment, mostly because it comes from user input or, in the case of forecasting, from the stochastic nature of our universe: things like weather, individual decisions, or even unknown behavior from unknown actors. While we may not be able to see it ahead of time, we should ideally record it so that issues observed in deployment can be reproduced.
Data versioning is a critical component for ensuring reproducibility in data science projects. The growing awareness of this need has led to the development of various solutions that tackle the problem with completely different approaches.
Having said that, data versioning can be achieved at multiple levels (i.e., at the object level, at the project level, or by introducing a versioning layer between your object/file storage and the data consumption layer). Sometimes one can be in a situation where an existing tool might not suit one’s use-case as-is; being creative is the solution here.
We will delve into a comprehensive list of implementations for each of the solutions below in future articles; if you have a suggestion or would like to contribute, feel free to reach out to us.
However, for now, let's take a quick look at a few additional options worth considering:
Temporal Tables
An early way to achieve this was to manually wrap or reroute database CRUD calls. This can be implemented using attributes such as `created_at`, commonly found in most auditable databases. A `delete` turns into an `update` call that sets an `expired_at` timestamp. Likewise, an `update` call sets `expired_at` on the existing item and creates a new one, essentially duplicating it. This allows a manual filter to be applied during `read`. Note that all unstructured data will be duplicated, since a link to it will always exist, and that no data point is ever deleted until it reaches a TTL.
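Below is a minimal sketch of this manual pattern, assuming a simple in-memory record store; the field names `created_at` and `expired_at` follow the description above, while everything else is illustrative.

```python
from datetime import datetime, timezone


def _now() -> str:
    # ISO 8601 timestamps in UTC compare correctly as strings.
    return datetime.now(timezone.utc).isoformat()


class VersionedStore:
    """Append-only store: updates expire the old row and insert a new one."""

    def __init__(self):
        self.rows = []  # each row: {"id", "value", "created_at", "expired_at"}

    def create(self, item_id, value):
        self.rows.append(
            {"id": item_id, "value": value, "created_at": _now(), "expired_at": None}
        )

    def update(self, item_id, value):
        # Expire the currently active row, then insert the new version.
        self.delete(item_id)
        self.create(item_id, value)

    def delete(self, item_id):
        # A delete only sets expired_at; nothing is removed until a TTL job runs.
        for row in self.rows:
            if row["id"] == item_id and row["expired_at"] is None:
                row["expired_at"] = _now()

    def read(self, item_id, as_of=None):
        # Manual filter: the active row, or the row that was valid at `as_of`.
        for row in self.rows:
            if row["id"] != item_id:
                continue
            if as_of is None and row["expired_at"] is None:
                return row
            if as_of is not None and row["created_at"] <= as_of and (
                row["expired_at"] is None or as_of < row["expired_at"]
            ):
                return row
        return None
```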
The feature mentioned above is found in Temporal tables, also known as system-versioned temporal tables in SQL Server. This type of SQL table records the history of data changes directly within the table. This functionality allows you to easily maintain a full change history, including when the change occurred and what the data looked like before and after the change. Temporal tables are a powerful tool for the governance of data, auditing, temporal data analysis, and monitoring.
In both of the cases above, one can use a tool like 1 to archive.
DVC2, LakeFS3, and Git-LFS4 are good examples of data versioning as an intermediate layer between the object storage and the corresponding codebase.
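As an example of the intermediate-layer approach, DVC exposes a small Python API for reading a file as it existed at a particular revision of the repository; the path, repository URL, and revision below are placeholders.

```python
import dvc.api

# Read a versioned copy of a dataset that DVC tracks alongside the codebase.
# The path, repository URL, and revision are hypothetical placeholders.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.2.0",  # a git tag, branch, or commit pinning both code and data
) as f:
    header = f.readline()
```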
Druid is a database for event-based data. Thus it is particularly useful for data that is created often (e.g., streams of events). Furthermore, it has features to quickly aggregate data over time with user-defined methods. This is great for data that needs to be aggregated before it can be analyzed or trained on, e.g., through federated methods.
ML Artifacts and Configurations
In the context of ML, an artifact is the byproduct of the model training process. This could include the trained model itself or just the trained weights/parameters, evaluation metrics, visualizations, and other outputs. These artifacts are critical for understanding the performance of the model and making decisions about its deployment, use, and reproducibility.
Configurations, on the other hand, refer to the settings and parameters used in the model training process. This could include hyperparameters, feature selection criteria, preprocessing steps, and other settings. These configurations can significantly impact the performance of the model and hence need to be tracked and managed. It is important to note that we need to version configurations to reproduce our training results.
The Importance of Versioning Artifacts and Configurations
Versioning artifacts and configurations is crucial for several reasons:
Reproducibility: Just like with data, the ability to reproduce an experiment or a model requires knowing exactly which configurations were used and what artifacts were produced. This is crucial for validating results and building upon previous work.
Collaboration: In a team setting, multiple data scientists may be working on the same model, each experimenting with different configurations. Versioning allows them to work independently without overwriting each other's work.
Experiment Tracking: Data scientists often experiment with different configurations to optimize performance during the model development process. Versioning allows them to easily track these experiments and compare results.
Deployment and Monitoring: When a model is deployed, it's important to know exactly what version of the model is in production, along with the associated configurations. This is crucial for monitoring the model's performance and diagnosing any issues.
A few commonly known options to achieve this are Airflow5, MLflow6, and Kubeflow7.
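As a hedged sketch of what such tracking can look like with MLflow, the snippet below records a run's configuration, an evaluation metric, and a file artifact; the configuration values and the metric are placeholders standing in for a real training loop.

```python
import json

import mlflow

config = {"lr": 1e-3, "epochs": 10, "model": "resnet18"}  # illustrative settings

with mlflow.start_run(run_name="baseline"):
    # Configurations: hyperparameters and other settings for this run.
    mlflow.log_params(config)

    # ... training would happen here ...
    val_accuracy = 0.91  # placeholder result

    # Metrics, tracked per run so experiments can be compared later.
    mlflow.log_metric("val_accuracy", val_accuracy)

    # Artifacts: any file produced by the run (weights, plots, reports).
    with open("config.json", "w") as f:
        json.dump(config, f)
    mlflow.log_artifact("config.json")
```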
Models
Model versioning is one of the last pieces of this puzzle and the one most closely tied to the data science/machine learning field. It works best when you have already solved the code and data versioning problems, as you cannot get much information from the question:
Which model version is this?
without knowing the answer to the question:
How did I end up with this model?
A well-known and much-discussed solution is to treat the model as just another (read: special) artifact of your experiment process and then use a pipelining/experiment management tool to save it to object storage, as discussed above.
Another very interesting approach to this problem comes from a project called ORMB8, which lets you wrap your model and related metadata as an OCI image (think Docker image) and re-use your existing or pre-setup container registry.
If one doesn't have the bandwidth to set up, tweak, develop, and maintain a full-blown experiment management system, or if the scale and complexity are still very low, one could start with tools like ORMB or directly utilize existing object storage, and move towards a more complex solution as the need arises. (Remember KISS?9)
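For the plain object-storage route, a bare-bones sketch might look like the following; the bucket name, key layout, and the choice of joblib and boto3 are assumptions rather than a prescription.

```python
import hashlib

import boto3
import joblib


def publish_model(model, bucket: str, name: str, version: str) -> str:
    """Serialize a trained model and store it under a versioned key in S3."""
    local_path = f"{name}-{version}.joblib"
    joblib.dump(model, local_path)

    # A content hash doubles as an integrity check and an unambiguous identifier.
    with open(local_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]

    key = f"models/{name}/{version}/{digest}.joblib"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key


# Usage (the bucket, model name, and version are placeholders):
# key = publish_model(trained_model, "my-ml-artifacts", "churn-classifier", "1.4.0")
```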
Model Deployment
Note that how models are versioned and the methodology used are incredibly important and will greatly impact how you deploy and serve your model.
For example: if the model is saved as separate weights, then unless you already have a machine with the model loaded, you will need to pull/load the code image and pull the weights before you are ready to serve any requests. That adds latency, especially as models grow in size (that is the 'L' in LLMs) or when your image contains large libraries, e.g., GPU-specific libraries.
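To make the cold-start cost concrete, here is a hypothetical warm-up step for a serving replica whose weights live in object storage; the bucket, key, and the use of boto3 and joblib are illustrative.

```python
import time

import boto3
import joblib

MODEL = None  # the replica cannot serve until the weights are pulled and loaded


def warm_up(bucket: str, key: str, local_path: str = "weights.joblib") -> None:
    """Pull the versioned weights and load them before accepting traffic.

    Every second spent here is cold-start latency: the larger the weights
    (the 'L' in LLMs) or the heavier the libraries baked into the image,
    the longer a new replica takes to answer its first request.
    """
    global MODEL
    start = time.monotonic()
    boto3.client("s3").download_file(bucket, key, local_path)
    MODEL = joblib.load(local_path)
    print(f"ready to serve after {time.monotonic() - start:.1f}s")
```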
The Fun Is Yet to Come
The fun part is that this doesn’t end here, as versioning is just one part of the puzzle; code quality, well-written tests, and the design of code components are factors of equal importance. To set all these in place, one must have one codebase (one source of truth) for each project while maintaining the lifecycle of all the artifacts produced.
This might sound very simple from a software engineer's perspective, but it can be very convoluted to achieve in the context of machine learning; if executed well, it can lead to significant progress.
In the next section, we will be discussing dependencies.