The Thing, made back in 1982, is my second favorite horror movie of all time, second only to Alien. Without giving too much away, it’s about a shape-shifting alien creature discovered in the frozen wastes of Antarctica. Because it can change its shape to that of its victims, paranoia and fear take hold amongst the survivors of the Antarctic science station as they get hunted down one by one. When you can’t tell the difference between friend and murderous shape-shifting alien beast, everyone is suspect and nothing is safe.
This is the perpetual nightmare of the data team and its stakeholders. No…data teams and their stakeholders don’t have to contend with actual shape-shifting alien beasts, but they do have to contend with something equally difficult – bucketloads of data.
When automated insight is the product, things start getting dicey. Who is to say that an insight is right or wrong? Who is to say if the input data was correct? When input datasets are measured in terabytes or petabytes, who is going to spot an undercount or a skewed statistic? What safeguards exist to prevent a flawed statistical analysis from getting into production?
As Arthur C. Clarke said: “Any sufficiently advanced technology is indistinguishable from magic.” One can also say the same about ML models or complex statistical methods, which can be as inscrutable as a magical incantation to any data consumers and possibly even to the data scientists themselves in some cases.
When you can’t easily tell the difference between good insights and bad, every result can be suspect and nothing may feel safe. Stakeholders can live in suspicion and fear, just like the scientists in The Thing, unsure who is human and who is the monster. One could argue that data teams have it even harder than Kurt Russell: at least one can kill an alien beast, but the battle against bad data and inscrutable ML models never ends. It is the day-to-day reality that data teams have to deal with. There is no end.
In the end, it comes down to trust. Trust that your data teams have a handle on data quality, are choosing the right statistical methods and are able to detect failures before they end up causing real damage to the business. The loss of trust in one insight damages the credibility of all other insights, past, present and future. This all means that trust must come first.
How do we as an industry confront this challenge? Insights without trust are worthless, and insights which are trusted but wrong can be dangerous or even cause a company to lay off 20% of its staff. The only option is to build processes and practices to safeguard against bad things happening, and even then there can be cases where an upstream methodology or software error produces subtly bad data that no one detects for several months. This is the stuff of nightmares, and there will be data team leaders across the world who sleep restlessly at night when they consider the challenges they face.
So when trust takes so long to build but can be destroyed through a single mistake or one bad input dataset – how do data teams gain trust and retain it over the long term? This, I believe, will be one of the central themes of 2023. Data observability, data contracts, data testing, and better abstractions are all part of the solution, as is the general maturation of the craft. As data analytics and algorithmic automation become more and more prominent, we need to find answers to these problems. Software has a long history of disasters, and computer science students around the world study the famous ones as case studies. More and more of these case studies are going to be based on analytics and machine learning; it’s just a matter of time. Whatever the case, one thing is clear: trust comes first, and how we deliver that trust is the challenge of our industry in 2023 and beyond.
The OCTOlog is a weekly publication of noteworthy concepts, trends, and technologies relevant to the data streaming landscape—by Confluent’s Office of the CTO (OCTO).