In the previous post, we talked about a single source of truth. Following that thread, the natural next question is how best to manage downstream dependencies and keep them on a short leash.

Building an ML system is a lot like stacking wood. As seen in the picture above, every dependency plays an important role in ensuring the stability of the pieces above it. It’s possible to build similar stacks that achieve a similar purpose, but this introduces risk. One could argue that the pieces don’t all need to be perfectly stacked for the system to do its job, but one unfortunate event can bring the whole stack down.
Similarly, in ML, you need to take your entire stack into account to be able to reproduce issues and get consistent inference with predictable results. Often, issues arise not from individual pieces but from the relationships between them.
Data
Data dependencies are a really important piece of the ML puzzle; much of the time, your app will be shaped by constraints around the availability and legality of data, and how easily you can experiment with it. It is important to build a clear understanding of data dependencies through data governance tools, which help surface issues that may lead to data leakage, bias, and risk to customers and the company. Data governance tools have been proliferating, letting teams know how data was collected, and for this reason, most of them include data profiling capabilities, which let you create associated data unit tests with tools like Great Expectations, among others.
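As a minimal sketch of such a data unit test (assuming the classic `from_pandas` API of Great Expectations, with hypothetical column names):

```python
import great_expectations as ge
import pandas as pd

# Hypothetical training data; the column names are illustrative only.
raw = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 41, 33]})

# Wrap the DataFrame so it gains expect_* assertion methods.
df = ge.from_pandas(raw)

# Data unit tests: fail fast when an upstream data dependency drifts.
assert df.expect_column_values_to_not_be_null("user_id").success
assert df.expect_column_values_to_be_between("age", 0, 120).success
```

Running checks like these at the boundary where data enters your pipeline turns silent upstream changes into loud test failures.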
A lot of tools are available today for data versioning (as we talked about in our last write-up, linked below), which ties very closely to data dependencies as well.
If a data explorer can freely play around with the data, create multiple versions of it, pick one of those versions, and seamlessly take the ML app to production with it, we can say the ML project is handling its data dependencies pretty well.
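DVC is one tool in this space; as a minimal sketch (the repo URL, file path, and tag below are hypothetical), picking a specific data version from Python can look like this:

```python
import dvc.api

# Read one specific, versioned snapshot of the training data.
# `rev` can be any Git tag, branch, or commit in the tracking repo.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-app",  # placeholder repo
    rev="v1.2.0",                              # placeholder data version
) as f:
    train_csv = f.read()
```

Because the version is an explicit parameter, promoting an experiment’s data version to production is a one-line change rather than a migration.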
Tooling
Tools can be a big constraint when looking at dependencies for an ML app. A key consideration before committing to a tool should be how easily you can migrate away from it, either wholesale or by moving your resources out of it (there are obvious exceptions to this).
Let’s take the example of a pipeline orchestration tool used to generate feature sets for training and to save artifacts from a pipeline. You are forced to use the tool’s SDK in the source code of every pipeline component that emits an artifact, and the SDK serializes your artifacts into a special object that only the tool can read.
There are a few dependency issues to be handled here:
The first one is obvious: the artifacts become unreadable outside this tool.
What if you need to export an artifact in a format the tool doesn’t support (yet 🧐)?
Another frequently overlooked issue: what happens if you want to run the same pipeline in a different tool (as part of an experiment or for any other reason)? Changes need to be made in the source code of every pipeline component, along with the DAG specification of how these components interact.
There are numerous other examples of such tooling dependencies; it pays to ponder them a bit longer and have an action plan for handling them efficiently, so you can move with both speed and robustness. One common mitigation is to confine the vendor SDK behind a thin, tool-agnostic interface, sketched below.
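Here is that idea as a hedged sketch (the `ArtifactStore` interface and JSON format are hypothetical choices of ours, not any particular tool’s API):

```python
import json
from pathlib import Path
from typing import Any, Protocol


class ArtifactStore(Protocol):
    """Tool-agnostic interface that pipeline components depend on."""

    def save(self, name: str, payload: dict[str, Any]) -> None: ...
    def load(self, name: str) -> dict[str, Any]: ...


class LocalJSONStore:
    """Open-format implementation: artifacts stay readable without the tool."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, payload: dict[str, Any]) -> None:
        (self.root / f"{name}.json").write_text(json.dumps(payload))

    def load(self, name: str) -> dict[str, Any]:
        return json.loads((self.root / f"{name}.json").read_text())


def build_features(store: ArtifactStore) -> None:
    # The component never imports the vendor SDK directly, so moving to a
    # different orchestration tool means writing one new adapter class,
    # not rewriting every component and the DAG specification.
    store.save("feature_stats", {"mean_age": 33.0})
```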
CodeBase
When talking about the codebase of an ML application, it is fair to assume, looking at the current state of affairs, that Python is the primary language of choice.
(Psst!!! Rust is catching up soon, especially with Cargo 🤫)
Python brings a ton of dependency issues and a ton of solutions for them. I would recommend reading The Nine Circles of Python Dependency Hell.
A few notable mentions for Python dependency management include Poetry and PDM (pipenv is another option in the same space); all of them resolve your dependencies and pin them in a lockfile. Sprinkling in a bit of the Nix package manager makes handling OS-level dependencies simple and easy. Mixing this all into a container, you get yourself a reproducible build.
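As a minimal sketch of what that pinning looks like with Poetry (the project name, author, and version constraints are purely illustrative), dependencies are declared in pyproject.toml and the generated poetry.lock freezes the exact resolved versions:

```toml
[tool.poetry]
name = "ml-app"                      # illustrative project name
version = "0.1.0"
description = "Example ML app with locked dependencies"
authors = ["Jane Doe <jane@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"
pandas = "^2.0"                      # caret constraint; poetry.lock pins exactly

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```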
Deployment
Deployment dependencies could be defined as tooling dependencies plus some extras. The extra dependencies are the other components the ML deployment needs to interact with. To describe it in a one-liner:
Consumption defines the serving…
So, depending on how the output of the ML system is being consumed, the complete picture of your deployment can be easily painted.
These requirements can be versioned alongside your Terraform files, or with whichever infrastructure-as-code setup you use. For example, check out this example of a FastAPI app with all the tooling and deployment dependencies.
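Separately from that linked example, here is a minimal sketch of consumption-driven serving (the request schema and scoring logic are hypothetical stand-ins for a real model):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PredictRequest(BaseModel):
    # Hypothetical feature payload; the real schema comes from the consumer.
    features: list[float]


class PredictResponse(BaseModel):
    score: float


@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Stand-in for a real model call. The consumer (a JSON API client here)
    # dictates the request/response contract: consumption defines the serving.
    score = sum(req.features) / max(len(req.features), 1)
    return PredictResponse(score=score)
```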
People/Process
Effectively managing people and process dependencies in an ML project involves defining clear objectives, assembling a diverse cross-functional team, fostering open communication, and identifying and prioritizing dependencies across multiple teams (which could include Data Engineering, MLEs, Data Scientists, product teams, and many more). Depending on the size of your organization, this may range from one or two people to 20 or more.
You would think that once we’ve noted down the long list of stakeholders and aligned the many processes and people needed for our ML product, development and deployment should run like a well-oiled machine! This is where things usually fall apart: company hierarchies, organizational silos, and decision-by-committee create huge bottlenecks.
A working product, in this case code, reigns. Only a close-knit group of data engineers, ML engineers, frontend engineers, and owners can iterate fast across the multiple layers of services required to build high-quality ML products. The larger the organization, the more freedom this team needs to cut through bureaucracy like butter.
However, it is important to note that not all bureaucracy is bad; regulation generally exists for a reason, and that is why the shift-left movement is so important. The risks and their associated mitigations need to come from the developers and owners, so the risk in ML products can be quickly understood as a whole. If you don’t have a working product to experiment with, you won’t understand the requirements needed to mitigate the risk.
Unfortunately, there is never a straight answer to “How do you handle PEOPLE dependencies?” But handled well, they can do wonders. Experience is king here, so go forth, fail, and learn. 🚀