How Instacart Upscaled Its Data Pipelines to Handle 10 Years of Growth in 6 Weeks

Instacart’s core mission, to create a world where everyone has access to the food they love and more time to enjoy it together, took on a new level of importance and urgency when the pandemic hit in March 2020. At the time, Instacart partnered with about 25,500 stores across the U.S. Now, we partner with more than 65,000 stores across the U.S. and Canada, and serve more than 85% of U.S. households. We had to rethink our tech stack and adopt new technologies quickly to connect disparate systems and applications, update inventory in real time, and support our end-to-end order fulfillment process to get groceries to any family that needs them. With Apache Kafka® and Confluent Cloud, we were able to achieve this goal quickly, introduce new capabilities, and bring event processing to the core of our IT architecture.

New normal requires a new way of thinking about data

You might be able to estimate how much data and computing power is required to get a bag of groceries—with the correct items in it—onto someone’s porch within a specified time frame. But think about multiplying that by about 10 million daily customers, and you get a sense of just how much data Instacart deals with.

At the start of the pandemic, our platform grew enormously. Virtually overnight, we became the primary fulfillment mechanism for most major grocery store chains in the country.

This rapid growth created some data issues. Typically, as a company grows, it accumulates decisions. Those decisions layer on top of one another, and a lot of those layers are reflected in the data. Two of the biggest pain points for any rapidly growing organization are data quality and data discoverability. As a data scientist charged with creating a prediction or even just a simple dashboard, questions arise, including, “Well, what table is that data in?”; “What is the quality of the data in that table?”; and “Can we rely on that data?”

The central problem we faced at Instacart was how to improve our data quality to ensure our decisions were rooted in deep confidence in the data underlying them, and then make that data more discoverable across the organization to provide more context for better, faster decisions.

Getting started with Kafka and event streaming

In early 2020, the rapid spread of COVID-19 across the globe, combined with government stay-at-home orders, created a surge in demand for Instacart’s services. Each night we had to figure out how to create sufficient headroom in our systems for the inevitable daily rush of new signups by both new customers and our shoppers. Our data systems were under tremendous load, with daily new user signups reaching an all-time high.

Given the speed and scale challenges presented by our rapid growth, I knew that switching from batch to stream processing would be critical for our infrastructure going forward, both to support the scale of the data systems we needed and the systems integration that we envisioned doing later on.

In early 2020, we began to use Kafka for an initial Customer 360 use case, using event streaming to create a complete picture of what was happening with our customers and with the software we were running. The movement and processing of those events can get very expensive at scale, so I wanted to be sure that we had a very reliable backbone to push the events through and a platform upon which to assemble non-batched processing pipelines.

So we took our client streams from our web clients and mobile clients and migrated them onto the Kafka platform. At that point in time, the quality of the data was relatively low because the clients were changing and moving so fast, but the scale and the volume of the data were very high. This gave us the opportunity to experiment with how we design our topics, how we create a more resilient infrastructure, and how we partition work to limit the blast radius of any failure.
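As a rough illustration of what partitioning work to limit the blast radius can look like, here is a minimal sketch, not Instacart's actual design. It assumes events are keyed by a customer identifier and that each key maps deterministically to one partition, so a problem on one partition touches only the keys that hash to it. The partition count, key format, and function names are hypothetical, and MD5 is a stand-in for the hash a real Kafka client would use (the Java client's default partitioner uses murmur2).

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 6  # hypothetical partition count for an illustrative topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map an event key to a partition index.

    MD5 is only a stand-in hash for this sketch, not what Kafka clients use.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def keys_on_partition(keys, partition, num_partitions=NUM_PARTITIONS):
    """Return the keys whose events would land on the given partition."""
    return [k for k in keys if partition_for(k, num_partitions) == partition]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]  # hypothetical keys
    # How evenly the keys spread across partitions.
    print(dict(Counter(partition_for(c) for c in customers)))
    # Only this subset of keys is affected if partition 0 has a problem.
    print(len(keys_on_partition(customers, 0)), "of", len(customers), "keys on partition 0")
```

Because the mapping is deterministic, all events for one customer stay in order on a single partition, and a stalled or failing partition affects only a fraction of the keys.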

The first Customer 360 use case helped us develop patterns that we’re now applying to subsequent use cases. This initiative also provided us with experience for applying event streaming to our business as a whole, to solve some bigger issues that were starting to hit us as we rapidly grew. We also wanted to use Kafka and event streaming to help us solve product catalog and inventory struggles. For example, as a consumer, if you order bread, you just want bread and you don’t really care how hard that is for us to accomplish. At Instacart, we use predictive models that try to figure out whether bread is on the shelf or not when consumers make an order, and those models require real-time data to work.

Integrating Confluent Cloud to revamp the architecture

At Instacart, we use Kafka as the broker of our event streams. In particular, we use Confluent Cloud for high performance, scalability, and durability, which protects the integrity of our core streams. Confluent Cloud is serverless, which allows us to easily scale up or down, and we never have to worry about outgrowing our storage capacity because Confluent Cloud has Infinite Storage and retention in its revamped Kafka storage engine. This means if we need more scale, it’s there.

Data ownership and streamlined usage are key parts of our data strategy. In a distributed organization, it can be difficult to determine which team owns certain data domains, which creates barriers to exploration, discovery, and usage. Reducing these barriers and standardizing ownership is essential to us. Our data strategy also includes incorporating real-time data into software development and operational systems, so we can react to events within our business as they occur.

To address these areas, Confluent Cloud has become the home, a central nervous system, for all data in motion across Instacart. For example, this includes change data capture from our Postgres databases and natively generated events coming from core services. This enables internal customers, such as our advertising team, to consume and use these events as necessary for their business purposes. Using Confluent Cloud allows us to focus on delivering groceries to our millions of customers rather than deploying, maintaining, and securing Kafka.
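To make the change data capture idea concrete, here is a minimal sketch of how a downstream consumer might replay database change events into its own in-memory view. The event shape is an assumption for illustration only: the `op`, `before`, and `after` fields resemble the envelope some CDC tools emit, and the row fields are hypothetical. None of this describes Instacart's actual schemas or tooling.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_change(view: Dict[int, Row], event: Dict[str, Any]) -> None:
    """Apply one change event to a local view keyed by row id.

    Assumed envelope: {"op": "c" | "u" | "d", "before": row or None, "after": row or None}.
    """
    op = event["op"]
    if op in ("c", "u"):  # create or update: keep the latest row image
        row = event["after"]
        view[row["id"]] = row
    elif op == "d":  # delete: drop the row if it is present
        view.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported operation: {op!r}")


def replay(events: Iterable[Dict[str, Any]]) -> Dict[int, Row]:
    """Rebuild the view by applying events in the order they were produced."""
    view: Dict[int, Row] = {}
    for event in events:
        apply_change(view, event)
    return view


if __name__ == "__main__":
    # Hypothetical change events for an illustrative "items" table.
    events = [
        {"op": "c", "before": None, "after": {"id": 1, "name": "bread", "in_stock": True}},
        {"op": "u", "before": {"id": 1, "name": "bread", "in_stock": True},
         "after": {"id": 1, "name": "bread", "in_stock": False}},
        {"op": "c", "before": None, "after": {"id": 2, "name": "milk", "in_stock": True}},
        {"op": "d", "before": {"id": 2, "name": "milk", "in_stock": True}, "after": None},
    ]
    print(replay(events))
```

In a real deployment the events would arrive from a Kafka topic rather than a list, but the consuming team's logic stays the same: apply each change in order to keep a local copy current.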

Essentially, all data across the company that is shared outside of the originating service flows through the shared substrate of Kafka. This gives us one place to hook in all of our tools to consume, process, and route to the products that need it across the organization.

Future use of Confluent within Instacart

Instacart’s current data initiatives revolve around making data easier to use, including reducing the time it takes both people and systems to use that data. For people, this means making it easy to find trustworthy, reliable, and high-quality data. For systems, this means making it easy for services to couple on that data, ingest it, transform it, and use it for their own purposes, all in real time.

To accomplish this new level of data visibility and discoverability, we are creating a data mesh strategy for Instacart. This means rethinking our relationship with the important data that we want to externalize across our organization. We are focusing on delineating team and product boundaries in a consistent and uniform way, so that data production and ownership responsibilities are clearly defined. By treating our data as first-class products, we empower other teams to find and consume the data they need for their business use cases.

Our catalog team has already seen the benefits of this strategy. Tasked with managing inventory from stores across the country, they can leverage event data to ensure accurate inventory computations.

We are using this opportunity to improve our recommendation algorithms by incorporating these real-time signals, for better search personalization and stronger fraud detection. The easy access to data provided by the data mesh also lets us streamline the customer experience significantly. There are countless opportunities across all the different domains of our product to uplevel the customer experience.

In the end, if we can provide great tooling that lets teams work with data without having to be experts in the underlying technology, that’s great. And if we can give them tools that help them own that data and improve the quality of that data, even better.

Confluent gives us a stable data and integration platform upon which to expand rapidly and make event streaming the core of our IT architecture.


  • Dusty Pearce joined Instacart in January 2020 as the first Vice President of Infrastructure, overseeing the company’s Infrastructure Engineering, Security, and Data Warehousing teams. Before joining Instacart, Dusty was the Director of Service Engineering at Slack where he scaled the company’s software infrastructure globally. While at Slack, Dusty founded the company’s SRE engineering and resilience engineering practices. Prior to Slack, Dusty was the Head of Platform at location technology company Life360, where he grew the engineering team tenfold.
