Data Science Smells

ML models, data engineering pipelines, and AI systems can all run without bugs and still carry traces (smells) that will eventually break everything and wreak havoc.

Jun 26, 2024

Here’s my proposed list; feel free to add/remove whatever you have! FWIW, data science smells are different from what Robert Martin calls code smells because data flows through the system all the time. The whole point of data science systems is to make sense of data, so those traces should relate to the data and code, not just the code itself.

Duplicate-fragility. The system cannot handle receiving a data row twice.
Data-rigidity. The system cannot handle data of a slightly different type.
Data-immobility. I cannot use parts of the system to handle different types of data easily.
Missing-control-groups. The system doesn’t check whether it has an influence on anything.
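
To make three of these smells concrete, here is a minimal Python sketch of what guarding against them might look like. Everything in it is an assumption for illustration: the function names (`ingest`, `to_float`, `in_control_group`), the `id` key, and the 10% holdout are made up, not taken from any particular system.

```python
import hashlib
from typing import Any, Dict, Optional


def ingest(store: Dict[Any, dict], row: dict, key: str = "id") -> None:
    """Duplicate-fragility: upsert by key, so the same row arriving twice is harmless."""
    store[row[key]] = row


def to_float(value: Any) -> Optional[float]:
    """Data-rigidity: accept ints, floats, and numeric strings; return None otherwise."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def in_control_group(user_id: str, holdout: float = 0.1) -> bool:
    """Missing-control-groups: deterministically hold out a fraction of users.

    Hashing the id gives a stable assignment, so the same user always lands
    in the same group and the system's effect can be measured against it.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < holdout


store: Dict[Any, dict] = {}
row = {"id": 42, "amount": "19.90"}
ingest(store, row)
ingest(store, row)  # second delivery of the same row changes nothing
```

Data-immobility is harder to show in a few lines, but note that `to_float` and `in_control_group` know nothing about the rest of the pipeline, which is what makes them easy to reuse on other data.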

P.S. FWIW, if you haven’t read “Clean Code” by Robert C. Martin, I suggest you read at least a quick summary. Here’s one, including the list of his code smells.

Image: "That's Stinky Man" meme, captioned "YOUR DATA SCIENCE PROJECT".

Datacisions
