In the past 12 months, games and other forms of content made with the Unity platform were installed 33 billion times reaching 3 billion devices worldwide. Apart from our real-time 3D development platform, Unity has a monetization network that is among the largest in the world.
The centralized data team that supports the Unity development platform and the monetization network, my team, is relatively small. With only about a dozen people, we carry large responsibilities, including managing the data infrastructure that underpins the Unity platform and helping make Unity a data-driven company. As a result, each of us has to be very productive and do quite a bit with the resources we have. That’s one of the reasons we built our data infrastructure on Confluent Platform and Apache Kafka®. Today, this infrastructure handles about half a million events per second on average, with peaks of about a million events per second. It also reliably handles millions of dollars of monetary transactions. In fact, since we went live with Confluent Platform and Kafka a year ago, we have had zero outages that resulted in financial loss.
Though our company-wide data infrastructure is running smoothly now, it was not always so well integrated. Like many companies, Unity has a range of departments (analytics, R&D, monetization, cloud services, and others), each with its own data pipelines running on its own technology stack. Some were using Amazon Redshift and Snowflake, while others had been using Kafka for quite some time (dating back to version 0.7). We wanted to bring all of this together and build a common data pipeline for all of Unity. We went with Kafka because our previous experience showed us it was stable, easy to deploy, and high-performing.
Then last year, Unity undertook a major cloud provider migration from AWS to GCP. At that point we reached out to Confluent, seeing the benefits of better control, additional support, and guidance on our Kafka architecture and best practices. Engineers from the Confluent Professional Services team met with us at our offices. We spent about a week going through our use cases and discussing the best way to set up our architecture, both for the migration and for long-term operation. They gave us advice on the type of cluster and nodes that would work best for our specific needs.
We started running a new Kafka cluster on GCP and began mirroring traffic between it and our AWS Kafka cluster with Confluent Replicator, which actively synchronizes and replicates data between public clouds and datacenters. It was critically important for us to finish this migration in a safe way because even one day of outage could cost the company considerably.
On the other hand, we needed to move as quickly as was reasonably possible, since every month we spent running a company as large as Unity on both AWS and GCP was essentially doubling our cloud costs. We had a huge incentive to move fast, but an equally huge incentive to make absolutely certain it was done right. With Confluent support, we successfully completed the massive migration over the course of a few months, moving petabytes of data—and our operations—from AWS to GCP.
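To make the mirroring step described above more concrete, here is a rough sketch of what a Replicator connector definition for Kafka Connect might look like, built as a Python dictionary and serialized to JSON. This is not our actual configuration: the connector name, broker hostnames, topic pattern, and task count are hypothetical placeholders, and the property names follow the Confluent Replicator documentation as we understand it, so check them against the version you run.

```python
import json

# Hypothetical connector definition; hostnames, connector name, topic
# pattern, and task count are illustrative assumptions, not Unity's values.
replicator_connector = {
    "name": "aws-to-gcp-replicator",
    "config": {
        "connector.class": "io.confluent.connect.replicator.ReplicatorSourceConnector",
        # Source cluster (the existing AWS Kafka cluster).
        "src.kafka.bootstrap.servers": "kafka-aws.example.internal:9092",
        # Destination cluster (the new GCP Kafka cluster).
        "dest.kafka.bootstrap.servers": "kafka-gcp.example.internal:9092",
        # Replicate every topic whose name matches this pattern.
        "topic.regex": ".*",
        # Copy keys and values as raw bytes, without deserializing them.
        "key.converter": "io.confluent.connect.replicator.util.ByteArrayConverter",
        "value.converter": "io.confluent.connect.replicator.util.ByteArrayConverter",
        "tasks.max": "4",
    },
}

# The JSON payload one would submit to a Kafka Connect worker's REST API.
payload = json.dumps(replicator_connector, indent=2)
print(payload)
```

Keeping the definition as data makes it easy to review and to version alongside the rest of the migration plan before anything is submitted to a Connect worker.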
Having a well-proven data infrastructure based on Confluent Platform has opened a lot of new possibilities for product teams across Unity. Many of them have reached out to us for help in building new solutions based on the central Kafka cluster that we support.
Our Confluent Platform-based data infrastructure enabled us to move from batch-oriented thinking to an event streaming model. For example, the previous data lake at Unity had a two-day latency: an ETL job ran once a day, and its results were ready the next day. Now, latency is down to 15 minutes. Shortening the time between an event and an informed decision based on that event has opened up a variety of business improvement opportunities.
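The latency figures above can be worked through as a small calculation. The sketch below is illustrative Python, not code from our pipeline. The function name is hypothetical, and splitting the two-day figure into a one-day wait for the next run plus a one-day wait for results is an assumption based on the description above.

```python
from datetime import timedelta


def worst_case_batch_latency(run_interval: timedelta, result_delay: timedelta) -> timedelta:
    """Worst-case delay between an event and a usable result in a batch model.

    Assumes the event arrives just after a run has started, so it waits a
    full interval for the next run, then waits again for that run's results.
    """
    return run_interval + result_delay


# Old data lake: ETL once a day, results ready the next day (assumed split).
old_latency = worst_case_batch_latency(timedelta(days=1), timedelta(days=1))

# Current latency as stated in the text.
new_latency = timedelta(minutes=15)

# Dividing one timedelta by another yields a plain float ratio.
improvement = old_latency / new_latency

print(f"old: {old_latency}, new: {new_latency}, reduction factor: {improvement:.0f}x")
```

Under these assumptions, going from two days to 15 minutes is a reduction of roughly two orders of magnitude in how long a decision has to wait for its data.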
Aside from reduced latency, the improved reliability we’ve seen since deploying Confluent Platform has also led to significant shifts in the way our internal developers think about our data platform. It’s important to note here that we don’t have anyone whose job it is to simply look after Kafka. We expect it to work reliably on its own, and today it is rock solid. But it wasn’t always that way.
We were on Kafka 0.10 for some time in the past, and although we knew that new versions likely had fixes that we needed, we put off upgrading because we didn’t know if the update would have other unforeseen effects. If something went wrong in our environment back then, Kafka was typically on our list of possible suspects. Now that we are on Confluent Platform and staying current with the latest releases, Kafka is not even on that list. And when we perform an update, we can use Ansible Playbooks from Confluent to test the upgrade. Additionally, we know that Confluent support will respond within the hour if we run into anything unexpected.
The fact that we completed the migration from AWS into GCP with zero downtime and zero data loss was a huge win for Unity, as was changing how Kafka and event streaming are perceived by our teams. Now that those teams have seen how stable it is, they are starting to build more and more event streaming systems on top of Confluent Platform. In fact, we’re exploring usage of the platform to support faster training of machine learning models. Kai Waehner gave an interesting talk on this, and we see several initiatives that would benefit from using Kafka to feed machine learning models in real time.
If you’d like to know more, you can download the Confluent Platform to get started with a complete event streaming platform built by the original creators of Apache Kafka.
Oguz Kayral is the engineering manager at Unity Technologies. Oguz leads the centralized data team that supports the Unity development platform and the monetization network.