Hidden Technical Debt in Machine Learning Systems

The software engineering industry leads other industries in its tooling and in its understanding of the importance of dealing with technical debt. For example, software developers use version control, Markdown for fast documentation, text-to-diagram tools, and continuous deployment; these, among many other tools, are things you don't and sometimes cannot see in other industries.

One place where the software industry is lagging far behind is the ability to create software in a well-organized form; instead it uses processes that try to make sense of coding on the go, like Scrum. As Cal Newport said, just imagine a car company where someone runs over with a part and puts it in the car; then another email arrives and they decide to change the color to blue; people move around; decisions about how many cars to ship this week happen over email; then something blows up in one part of the manufacturing area, so they open the graphs and see that indeed they over-utilized their robots, so they add new ones. You get the idea.

Having said all the above, the software industry has the lead: the ability to store revisions of diagrams alongside comments, amazing tools for creating documentation, an understanding of the importance of technical debt, and active measures to fix or avoid it.

Hidden Technical Debt in Machine Learning Systems is the title of a great, clear, straight-to-the-point paper published by Google. The paper shows how quick wins in machine learning, taken without proper engineering practices, lead to technical debt, which in turn leads to slower development and slower release of features, as a result of exponentially increasing complexity and unpredictable effects between services and systems.

Remember that in ML systems only a small fraction of the code actually deals with learning and prediction; in many cases only about 5% of the code is for learning and prediction, while the vast majority of code in ML systems is plumbing, and plumbing code is highly susceptible to technical debt.

The paper analyzes multiple practices common today, for example the separation of research and engineering teams, along with many additional practices we will review here, and then analyzes the problems these practices cause in the scope of machine learning.

It goes on to suggest some potential remedies for the technical debt it uncovers.

They start by summing up the main kinds of technical debt they describe: boundary erosion, pipeline jungles, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.

I really wish there were more papers like this one dealing with the boundary between engineering and research, so thank you D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison; you have really created a masterpiece of a paper.

According to the paper, developing and deploying ML systems should have technical-debt-reduction mechanisms similar to those software engineering has. The technical debt metaphor, to remind us, was coined by Ward Cunningham in 1992: it is the long-term cost of taking technical shortcuts, just like any debt. Mitigating technical debt has two parts: one is to take existing technical debt and reduce it; the other is to invest actively in the present in order to reduce technical debt in the future.

Technical debt is usually paid down with refactoring, improving unit tests, deleting dead code, reducing dependencies, tightening APIs, and improving documentation.

Deferring any such payment works the same way as delaying a money debt: the compounding interest is dangerous, because technical debt compounds silently until you can't make any progress anymore.

ML systems tend to have a lot of glue code around generic ML packages, and moreover tend to be affected by the external world, through the inputs to the models, more than standard systems. The paper focuses on technical debt in ML systems; the claim is that these systems can accumulate even more technical debt, and that it is even more dangerous, due to their nature. In general, debt in ML systems tends to accumulate at the system level, which is harder to notice, because we have pipelines of data and separate processes, and the debt lives between and among them. So the key point here is that technical debt in ML mainly occurs at the system level, which is harder to deal with than the project level.

ML systems tend to have the CACE property:

Changing Anything Changes Everything.

This is where a lot of the complexity comes from.

ML systems tend to be entangled, meaning we have a mix of signals coming in as input to our model; sometimes the whole purpose of the model is to take in features x1, x2, x3, and so on. In addition, in the same box or model you have hyperparameters, settings, sampling methods, and thresholds, and all of these are often not isolated from one another. So where is your best-practice isolation?
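To make the entanglement concrete, here is a minimal sketch (my own illustration, not from the paper): with correlated input signals, changing just one thing, here dropping x3, shifts all the learned weights, not only the one attached to x3.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)

# Three correlated input signals, as real-world features usually are.
z = rng.normal(size=(1000, 1))                     # shared hidden factor
X = z + 0.5 * rng.normal(size=(1000, 3))           # x1, x2, x3 all correlate
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(scale=0.1, size=1000)

w_all, *_ = lstsq(X, y, rcond=None)

# Change ONE thing: feature x3 is removed (say its upstream source went away).
w_without_x3, *_ = lstsq(X[:, :2], y, rcond=None)

print("weights for x1, x2 with x3:   ", w_all[:2])
print("weights for x1, x2 without x3:", w_without_x3)  # both shift, not just x3's slot
```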

One mitigation to this entanglement is to serve ensembles of models, where each model handles a specific aspect of the problem, but this is only useful in cases where the sub-problems decompose naturally, and that is not always the case.
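A hedged sketch of what that decomposition could look like (the names and scoring logic here are hypothetical, and it assumes the sub-problems really do separate): each sub-model owns its own signals, and the outputs are combined only at one declared boundary.

```python
from dataclasses import dataclass

# Hypothetical sub-models: each owns a naturally separate sub-problem, so
# fixing or retraining one does not entangle the inputs of the others.
@dataclass
class SubModel:
    name: str
    weight: float

    def predict(self, features: dict) -> float:
        # Placeholder scoring logic; a real trained model would go here.
        return self.weight * features.get(self.name, 0.0)

ensemble = [SubModel("price_signal", 0.7), SubModel("volume_signal", 0.3)]

def predict(features: dict) -> float:
    # Sub-model outputs are combined only here, at the boundary.
    return sum(m.predict(features) for m in ensemble)

print(predict({"price_signal": 1.2, "volume_signal": 0.4}))
```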

There are many cases where correcting a problem in system A causes a problem in system B. For example, say you have system A and you correct it because it bought too many stocks; now the fact that system A buys fewer stocks affects model B, which takes the purchased stocks as input. B's data has changed even though all you fixed was a problem in model A.

The system depends on data coming in, both for regenerating the model and for prediction. This means any instability in the input signals causes instability in the output; sometimes this is good, but sometimes it is just additional instability in the system, and as we said, an improvement in system A can destabilize the signals for system B.

A mitigation that appears in industry is to assign versions to the data and the models, so that you change the model only after the whole system has been tested as a whole, just like a software release. When you release software projects you have a specific version that crosses the projects, you tested it, and you know it's good; why are models being shipped without such tested snapshots of the whole? This can be at least a partial mitigation.
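A minimal sketch of such a tested snapshot (the field names are my own illustration, not from the paper): the model hash, the data version, and the code version are pinned together, and the bundle is promoted only after the full-system test passes.

```python
import hashlib

def make_release_manifest(model_path: str, data_version: str, code_version: str) -> dict:
    """Pin the model, its data version, and the code version together
    as one snapshot that is tested (and shipped) as a whole."""
    with open(model_path, "rb") as f:
        model_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "model_sha256": model_hash,
        "data_version": data_version,   # e.g., a dataset snapshot tag
        "code_version": code_version,   # e.g., a git commit hash
        "tested_as_a_whole": False,     # set to True only by the full-system test
    }

# manifest = make_release_manifest("model.bin", "stocks-2024-01", "a1b2c3d")
```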

In standard software engineering we can have modules or code paths that are dead; when reducing technical debt we delete those. In ML systems these can simply be features that don't contribute enough to the output to justify the complexity they induce. They can also surface as bugs: if the input changes, say all stock symbols now carry a prefix, it would not be a good day for the models.

This is especially true for ML systems, because you tend to do a lot of experimentation, and experiments often end up as dead code you need to clean up.
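One simple way to find features that no longer pay for their complexity is a leave-one-feature-out check; here is a minimal sketch of the idea, assuming a scikit-learn stack (my own illustration, not the paper's method):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Re-score the model with each feature removed; a near-zero drop suggests
# the feature may not justify the pipeline complexity it induces.
for i in range(X.shape[1]):
    X_without = np.delete(X, i, axis=1)
    score = cross_val_score(LogisticRegression(max_iter=1000), X_without, y, cv=5).mean()
    print(f"feature {i}: accuracy drop without it = {baseline - score:+.4f}")
```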

We have more hidden feedback loops in ML systems, where one system affects another: one model puts a stock on top, another model picks it up and decides to reduce some other number, which in turn feeds back into the first model.

There is also the concept of the Pipeline Jungle. The pipeline jungle is a special case of glue code that involves all the data preparation. The problem is how it evolves: you add another signal, another data source, another join, another step, and you end up with an untested, unverified pipeline jungle, a pile of data flowing in and out in steps, far from a well-designed, clean piece of a project.

The way to mitigate pipeline jungles is the same as avoiding a jungle in engineering projects: by taking a holistic approach to the ML system, designing it and reevaluating it as a WHOLE.

Reevaluate the ML system as a whole to avoid pipeline jungles.
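As a sketch of what "reevaluated as a whole" can look like in code (my own illustration, assuming a scikit-learn stack), every preparation step is declared in one pipeline object that can be versioned and tested as a single unit:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Every preparation step is declared in one place, so the whole chain
# can be versioned, tested, and reevaluated as a single unit.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# pipeline.fit(X_train, y_train); pipeline.predict(X_new)
```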
When you work with ML code you usually see data frames, but there also has to be data modelling and declared relationships between the objects; otherwise you have abstraction debt, which makes it harder both to understand the system and to evolve it. There is a well-known smell for this, the plain-old-data-type smell, where you see a lot of integer and string data types used instead of proper modelling.
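A minimal sketch of what that modelling can look like (the types here are hypothetical examples of mine, not from the paper), replacing bare floats and strings with declared objects:

```python
from dataclasses import dataclass
from enum import Enum

# Instead of passing around bare values like 0.87, "AAPL", "buy"...

class Side(Enum):
    BUY = "buy"
    SELL = "sell"

@dataclass(frozen=True)
class DecisionThreshold:
    value: float          # the cutoff itself
    calibrated_on: str    # the dataset snapshot it was tuned against

@dataclass(frozen=True)
class TradeSignal:
    symbol: str
    side: Side
    confidence: float     # model output in [0, 1]
```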

An additional smell for general ML technical debt is the multiple-language smell, where you start with one programming language for the data scientists or for experimentation, but when you get to production you move to another language that the engineers use.

ML systems have configuration: hyperparameters and settings that you tune. Once you have a lot of this configuration, it actually becomes coding with configuration, and then you must have the same proper tooling you have for programming languages: validation, tests, CI/CD, versioning, code review for configurations, and so on.
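A minimal sketch of treating configuration as code (the field names are hypothetical): the configuration is typed and validated, so a bad value fails in a unit test or in CI instead of in an expensive training run.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float
    batch_size: int
    sampling_method: str

    def validate(self) -> None:
        # Fails fast in CI, long before an expensive training run.
        assert 0.0 < self.learning_rate < 1.0, "learning_rate out of range"
        assert self.batch_size > 0, "batch_size must be positive"
        assert self.sampling_method in {"uniform", "stratified"}, "unknown sampler"

config = TrainingConfig(learning_rate=0.01, batch_size=64, sampling_method="uniform")
config.validate()  # run this in a unit test and in CI, like any other code
```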

Many ML systems contain fixed decision thresholds. This can be fine in some systems, namely static ones, but it is not good in dynamic systems: the thresholds are tweaked manually, and whenever new data arrives there is a time-consuming effort of re-updating them.
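The paper suggests learning such thresholds from held-out validation data instead of fixing them by hand; here is a minimal sketch of that idea (the implementation is my own):

```python
import numpy as np

def pick_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Choose the decision threshold that maximizes accuracy on held-out data.
    Re-run this whenever the model or the data is refreshed, instead of
    hand-tweaking a hardcoded constant."""
    candidates = np.unique(scores)
    accuracies = [np.mean((scores >= t) == labels) for t in candidates]
    return float(candidates[int(np.argmax(accuracies))])

# Example: validation scores from a model and their true labels.
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])
labels = np.array([0, 0, 1, 1, 1, 1])
print(pick_threshold(scores, labels))
```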


To sum up, ML systems tend to incur technical debt even more than standard systems. That is both because of the lack of education about ML technical debt and because of the specific nature of ML systems, where it is harder to reproduce state since it changes as the external world changes, and also because of the cultural division between research and engineering practices. If you don't think you are ready yet to adopt technical-debt reduction in ML systems, at least start by measuring it: how hard is it to change the system? How much time do you invest in thresholds? How easy is it to change configurations without surprises? Does improving one model have a downside on another? How quickly do new team members become effective and understand the system?






