At Yotpo, we have a rich and busy data lake consisting of thousands of data sets ingested and digested by different engines, the main one being Spark. We built our data infrastructure to enable our users to produce and consume data via self-service tooling, giving them the utmost freedom.
This freedom came at a cost.
We struggled with poor standardization, little data reusability, a lack of data lineage, and flaky data sets. Meanwhile, the landscape in which we built our platform changed dramatically, and so did our analytics needs and expectations.
We came to understand that the modeling layer should be decoupled from the execution layer in order to remove the limitations we were bound by:
- Batch and stream should be no more than attributes of a wider abstraction.
- A Kafka topic and a data lake table are no different and should be treated the same way.
- Observability of our data pipelines should have the same quality and depth across all execution engines, storage methods, and formats.
- Governance should be an implicit part of our ecosystem, serving as a basis for both exploration and automation/anomaly detection.
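As a rough illustration of the first two principles, and not YODA's actual API, here is a minimal Python sketch in which a Kafka topic and a data lake table share one description and batch/stream is just a field. The names `Dataset`, `Mode`, and `describe`, and the example data sets, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    BATCH = "batch"
    STREAM = "stream"


@dataclass(frozen=True)
class Dataset:
    """Hypothetical engine-agnostic description of a data set."""
    name: str
    location: str  # a table path or a topic name
    storage: str   # e.g. "kafka" or "lake_table"
    mode: Mode     # batch/stream is an attribute, not a separate type


def describe(ds: Dataset) -> str:
    # One code path for every data set, whatever its storage or mode.
    return f"{ds.name}: {ds.storage} at {ds.location} ({ds.mode.value})"


# Illustrative data sets only; the names and locations are made up.
orders_topic = Dataset("orders_events", "orders", "kafka", Mode.STREAM)
orders_table = Dataset("orders_daily", "lake/orders_daily", "lake_table", Mode.BATCH)

for ds in (orders_topic, orders_table):
    print(describe(ds))
```

Because both data sets are instances of the same type, tooling for lineage, observability, or governance could be written once against `Dataset` instead of once per engine or storage format.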
That's when we started building YODA (soon to be open sourced), which gives us a killer dev experience with the level of abstraction we always dreamed of. By combining DBT, Databricks, lakeFS, and a multitude of streaming engines, we started seeing our vision come to life. In this talk, we'll share our journey of redesigning the data lake and how to best address organizational needs without giving up on high-end tooling and technology. We are taking this to the next level.