The software engineering industry is ahead of other industries in its tooling and in its understanding of the importance of dealing with technical debt. For example, software developers use version control, Markdown for fast documentation, text-to-diagram tools, and continuous deployment; these, among many other tools, are things you don't, and sometimes cannot, see in other industries.
One place where the software industry lags far behind is the ability to create software in a well-organized form; instead it uses processes like Scrum that try to make sense of coding on the go. As Cal Newport said, just imagine a car company where someone runs over with a part and puts it in the car, then another email arrives and they decide to change the color to blue, people move around, decisions about how many cars to ship this week happen over email, then something blows up in one part of the manufacturing area, so they open the graphs and see that indeed they over-utilized their robots, so they add new ones; you get the idea.
Having said all the above, the software industry still has the edge: the ability to store and revise diagrams alongside comments, amazing tools for creating documentation, an understanding of the importance of technical debt, and active measures to fix or avoid it.
Remember that in ML systems, only a small fraction of the code actually deals with learning and prediction; in many cases it is about 5% of the code, while the vast majority is plumbing, and plumbing code is highly susceptible to technical debt.
The paper analyzes several practices that are common today, for example the separation of research and engineering teams, among many others which we will review here, and then examines the problems these practices cause in the context of machine learning.
It goes on to suggest some potential remedies for the technical debt it uncovers.
The authors start by summarizing the main kinds of technical debt they describe: boundary erosion, pipeline jungles, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.
I really wish there were more papers like this one dealing with the boundary between engineering and research, so thank you Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison; you have really created a masterpiece of a paper.
According to the paper, developing and deploying ML systems should have technical debt reduction mechanisms similar to those software engineering has. The technical debt metaphor, to remind us, was coined by Ward Cunningham in 1992: like any debt, it is the long-term cost of short-term technical shortcuts. Mitigating technical debt has two parts: one is to take existing technical debt and reduce it; the other is to invest actively in the present in order to reduce technical debt in the future.
Technical debt is usually paid down by refactoring, improving unit tests, deleting dead code, reducing dependencies, tightening APIs, and improving documentation.
Deferring any such payment works the same way as delaying a monetary debt: the compounding interest is dangerous, because the technical debt compounds silently until you can't make any progress anymore.
ML systems tend to have a lot of glue code around generic ML packages, and moreover tend to be affected by the external world, through the inputs to their models, more than standard systems. The paper focuses on the technical debt of ML systems; the claim is that here we can accumulate even more technical debt, and that it is even more dangerous, due to the nature of ML systems. In general, debt in ML systems tends to accumulate at the system level, which is more difficult to notice: we have pipelines of data and separate processes, and the debt lives in between and among them. So the most important point here is that technical debt in ML occurs mainly at the system level, which is harder to deal with than the project level.
ML systems tend to have the CACE property:
Changing Anything Changes Everything.
This property introduces a lot of complexity.
ML systems tend to be entangled, meaning we have a mix of signals coming in as input to our model; often the whole purpose of the model is to take in features x1, x2, x3, and so on. In addition, the same box or model has hyperparameters, settings, sampling methods, and thresholds, and all of these are often not isolated from one another. So where is your best-practice isolation?
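To make CACE concrete, here is a minimal sketch on a hypothetical three-feature linear model, using only NumPy and synthetic data I made up: removing a single correlated feature shifts the learned weights of the features that remain, because no input is ever really independent.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 0.7 * x1 + 0.3 * rng.normal(size=1000)   # x2 is correlated with x1
x3 = -0.5 * x1 + 0.5 * rng.normal(size=1000)  # so is x3
y = 1.0 * x1 + 2.0 * x2 - 1.0 * x3 + rng.normal(scale=0.1, size=1000)

# Fit with all three features: weights come out near [1.0, 2.0, -1.0].
X_full = np.column_stack([x1, x2, x3])
w_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
print("with x1:   ", w_full.round(2))

# "Just remove one feature": the remaining weights shift to absorb
# x1's signal, so changing one thing changed everything downstream.
X_drop = np.column_stack([x2, x3])
w_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=None)
print("without x1:", w_drop.round(2))
```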
One mitigation is to serve ensembles of models where each model handles a specific aspect of the problem, but this is only useful in cases where the sub-problems decompose naturally, which is not always the case.
There are many cases where correcting a problem in system A can cause a problem in system B. For example, say you correct system A because it bought too many stocks; the fact that system A now buys fewer stocks affects model B, which takes the purchased stocks as input. Its data has now changed, even though all you did was fix a problem in model A.
The system depends on incoming data, both for retraining the model and for prediction. This means any instability in the input signals causes instability in the output. Sometimes this is fine, but sometimes it adds instability to the system: an improvement, as we said, in system A can destabilize the signals feeding system B.
A mitigation that appears in industry is to assign a version to the data and models, so that you change the model only after the whole system has been tested as a whole, just like a software release. If, when you release software projects, you have a specific version that crosses the projects, and you tested it and you know it's good, why are models shipped without such tested snapshots? This can be at least a partial mitigation.
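A hedged sketch of what that could look like: one snapshot id pins the data and the model together, so serving never mixes an old model with new data. The paths and the pickled-model setup are invented for illustration; substitute whatever artifact store you actually use.

```python
import pickle
import pandas as pd

SNAPSHOT = "2024-06-01"  # one version id that crosses data AND model

# Frozen inputs for this release, tested together with the model below.
features = pd.read_parquet(f"features/{SNAPSHOT}/stocks.parquet")

# The model that was trained and validated against exactly those inputs.
with open(f"models/{SNAPSHOT}/model.pkl", "rb") as f:
    model = pickle.load(f)

predictions = model.predict(features)
```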
In standard software engineering we can have modules or code paths that are dead code, and when reducing technical debt we delete them. In ML systems, these can simply be features that don't contribute enough to the output to justify the complexity they induce. They can also surface as bugs: if the input changes, say all stock symbols suddenly get a prefix, it's not going to be a good day for the models.
This is especially true for ML systems, since you tend to do a lot of experimentation, and experimentation often ends up as dead code you need to clean up.
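One way to hunt for such dead features is a leave-one-feature-out sweep: retrain without each feature and flag the ones whose removal barely moves the validation score. This is a minimal sketch; the model choice and the tolerance of 0.001 are illustrative assumptions, not the paper's prescription.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def dead_feature_candidates(X, y, names, tol=1e-3):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
    base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_va, y_va)
    candidates = []
    for i, name in enumerate(names):
        # Retrain with feature i removed and compare validation accuracy.
        X_tr_i = np.delete(X_tr, i, axis=1)
        X_va_i = np.delete(X_va, i, axis=1)
        score = LogisticRegression(max_iter=1000).fit(X_tr_i, y_tr).score(X_va_i, y_va)
        if base - score < tol:  # removing it costs (almost) nothing
            candidates.append(name)
    return candidates
```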
We have more hidden feedback loops in ML systems, where one system affects another: one model puts a stock on top, another model picks that up and decides to reduce some other number, which in turn affects the first model.
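A toy simulation of the mechanism, with made-up coefficients: each model consumes the other's last output, and the coupled system drifts week over week even though neither model's code changed.

```python
# Two "models" whose scores feed each other through the world.
score_a, score_b = 1.0, 1.0
for week in range(10):
    score_a = 0.8 * score_a + 0.3 * score_b  # A consumes B's last output
    score_b = 0.8 * score_b + 0.3 * score_a  # B consumes A's fresh output
    # The scores inflate each other: the loop, not either model, drives the trend.
    print(week, round(score_a, 3), round(score_b, 3))
```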
There is a concept called the Pipeline Jungle. The pipeline jungle is a special case of glue code involving all the data preparation. The problem is that the pipeline evolves: you add another signal, another data source, another join, another step, and you end up with an untested, unverified jungle, a pile of data going in and out in steps, far from a well-designed, clean piece of a project.
The way to mitigate pipeline jungles is the same way you avoid a jungle in any engineering project: by taking a holistic approach to the ML system, designing and reevaluating it; the main point is to reevaluate it as a WHOLE! See the sketch after the next line.
Reevaluate the ML System as a whole to avoid pipeline jungles.
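One hedged way to move in that direction is to declare the whole data path as a single, testable object instead of scattered ad-hoc scripts. The steps below are generic placeholders using scikit-learn's Pipeline, one possible tool, not the paper's specific recommendation.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Every preparation step is named, ordered, inspectable, and unit-testable,
# and the whole path is one object to version and reason about as a WHOLE.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
# pipeline.fit(X_train, y_train); pipeline.predict(X_new)
```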
When you work with ML code you usually see data frames, but there also has to be data modelling and declared relationships between the objects; otherwise you have abstraction debt, which makes it harder both to understand the system and to evolve it. There is a well-known smell for this, the plain-old-data-type smell, where you see a lot of integer and string data types used instead of proper modelling.
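A small sketch of paying down that debt: replace bare floats and strings with a declared domain object. The field names here are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class StockQuote:
    symbol: str     # instead of an anonymous string column
    close: float    # instead of "column 3 is a float, trust me"
    traded_on: date

quote = StockQuote(symbol="ACME", close=101.25, traded_on=date(2024, 6, 1))
```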
An additional smell for general ML technical debt is the multiple-language smell, where you start with one language for data science or experimentation, but when you get to production you move to another language that engineering uses.
ML systems have configuration, hyperparameters and settings, which you tune. Once you start having a lot of this configuration, it effectively becomes coding with configuration, and if you have configuration like that, you must have the same tooling you have for programming languages: validation, tests, CI/CD, versioning, code review for configurations, and so on.
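A hedged sketch of treating configuration as code: a typed config object that validates itself, so a bad setting fails in review or CI rather than in production. The parameter names and ranges are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float = 0.01
    num_trees: int = 200
    sample_fraction: float = 1.0

    def __post_init__(self):
        # Validation runs at construction time, so tests and CI catch
        # a broken config before it ever reaches training or serving.
        assert 0 < self.learning_rate < 1, "learning_rate out of range"
        assert self.num_trees > 0, "num_trees must be positive"
        assert 0 < self.sample_fraction <= 1, "sample_fraction in (0, 1]"

# Lives in version control, gets code review, and is covered by tests.
config = TrainingConfig(learning_rate=0.05, num_trees=300)
```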
In many ML systems there are fixed decision thresholds. While this can be fine in some systems, namely static ones, it is not good in dynamic systems: the thresholds are tweaked manually, and with new data there is a time-consuming effort of re-updating them.
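One possible alternative, sketched here under assumptions: instead of a hand-tuned constant, re-derive the threshold from held-out data every time the model or data changes. The target precision of 0.9 stands in for some business requirement.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_scores, min_precision=0.9):
    """Lowest score threshold that still meets the precision target."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # thresholds has one fewer entry than precision; align and search.
    ok = np.where(precision[:-1] >= min_precision)[0]
    return thresholds[ok[0]] if len(ok) else thresholds[-1]
```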
To sum up, ML systems tend to incur even more technical debt than standard systems, both because of the lack of education about ML technical debt, because of the specific nature of ML systems, where state is harder to reproduce as it changes with the external world, and because of the cultural division between research and engineering practices. If you don't think you are ready to adopt technical debt reduction in ML systems yet, at least start by measuring it: how hard is it to make a change? How much time do you invest in thresholds? How easy is it to change configurations without surprises? Does improving one model have a downside on another? How quickly do new team members become effective and understand the system?