How many times per day do you unlock your mobile phone? Statistics published by SlickText indicate that the average smartphone user unlocks their phone 150 times per day and checks it for messages or emails 63 times per day.
Smartphone users are constantly receiving notifications and checking their phones for the time, weather, games, and news. With all of this going on, how do you deliver notifications that actually stand out from the noise? Some organizations are tackling this by delivering personalized notifications to the lock screen or home screen based on the individual's behavior. In fact, you might have already noticed such apps in the market today.
In this blog post, we’ll explore the journey of one such organization that we’ve worked with to build a modern data infrastructure. They’re now successfully delivering personalized content to their customers through their mobile application, using the power of Confluent Cloud.
Today, there are roughly 5 billion smartphones in the world, a number projected to keep growing in the coming years. If this organization captures just 5% of the smartphone market, that’s 250 million users, which means handling a huge volume of data every second of every day. Users come from various parts of the world and need content in different languages. Multiplying this by the various categories of content (news, fun facts, live sports, games, etc.) means a significant volume of content to serve up to users. In turn, users generate billions to trillions of events each day as they interact with the served content.
The challenge for this organization is that their content must be attractive and relevant enough for the user to click, watch, or play, and they have only a fraction of a second to achieve this before the user opens some other application. Delivering such highly personalized, relevant content to each and every user at huge scale is a massive task. The content delivered to each user differs depending on factors such as age, gender, location, preferences, time of day, previous history, and local trends. How does this organization ensure they’re delivering the right content to all users within such a short time and at scale?
The organization’s older platform used a combination of technologies. Content aggregation services were hosted on one of the major cloud service providers (CSPs). These services bring together content that changes frequently, categorize it, and store it in message queues that are part of the same cloud provider’s native services. However, this platform was unable to scale during critical events such as live games or major breaking news across the globe. The platform also went down at times, so real-time content was not delivered to users.
There was also a feedback system, hosted on another major CSP, that collected clickstream events whenever a user clicked on, liked, or shared content. All of these events were captured and stored in a cloud-specific pub-sub (publish-subscribe) messaging system. This system, however, lacked the ability to deliver the same data to multiple consumers simultaneously, so data was duplicated multiple times, resulting in higher cloud storage costs.
In some scenarios, this system could not scale up to the business’s expectations. User recommendations were delayed by 15 to 30 minutes, leaving users looking at stale content. Because the machine learning (ML) models were trained only once a day, recommendations were not accurate enough. When content was uninteresting or irrelevant, some users disabled or uninstalled the app or switched over to a competitor, resulting in customer churn.
Some of the business outcomes that the organization was looking to achieve (and which they weren't able to do with traditional approaches discussed above) included:
Delivering exceptional customer experiences with relevant content to users in real time, thereby avoiding customer churn
Faster time to market: Quick and easy way to add new business use cases or categories through partner integrations/mergers and acquisitions
Efficient operations with maximum platform uptime to ensure minimal to no business impact for delivering the content
Scale seamlessly to many times the current volumes as the user base grows
Future-looking solution that can work across multiple clouds and also assist with business continuity/disaster recovery plans
On the technical side of things, developers were spending a considerable amount of time scaling, managing, and troubleshooting the infrastructure of point-to-point tools instead of focusing on new innovations. Different teams chose different systems without an organization-wide vision for data streaming. Often, the tools lacked the capability to connect with their existing systems, so developers ended up writing custom integrations. Some of the alternatives they considered included open source technologies. However, the team realized those were too cumbersome to manage due to lack of internal expertise.
The team narrowed down the list to the following capabilities they would need to address the technical challenges:
Offload infrastructure provisioning, management, and day-2 operations such as patching and upgrades
Provider-managed scaling up and down during critical events
Replace multiple siloed tools such as message queues, pub-sub, event streaming, and data pipelining products
Enable developers to spend time building new business use cases instead of managing and troubleshooting open source or point tools
Broad integration support with existing external systems and products
Help the company embark on a multicloud/hybrid cloud journey and avoid getting locked into cloud-specific technologies
The cloud architects’ group at this organization came across Apache Kafka when attempting to solve their challenges. They eventually chose Confluent’s data streaming platform, finding it much more powerful, cloud-native, and complete than open source Kafka.
Some of the capabilities that stood out for choosing Confluent Cloud were:
SLA-backed managed multi-zone Kafka clusters available on all major cloud provider regions
Managed connectors to build streaming pipelines quickly instead of writing custom producers and consumers
Ability to scale up and scale down per demand and also pay per use instead of paying for unused provisioned capacity
Support for fanout use cases: Independent consumers can read the same message simultaneously without duplication of data, saving storage costs
Support for stream processing on the same platform as data streaming
Expert technical support for troubleshooting reactive issues
Professional services to assist with the migration, design, and implementation of the new solution
Here is what the Confluent Cloud data streaming platform architecture looks like:
Content for users is generated both in-house and from external sources. All content is aggregated by the content aggregation service, categorized, and produced in real time to Confluent Cloud. From there it streams to the content delivery service, which, in conjunction with the intelligence provided by the content recommendation service, serves interesting and relevant content to each user. A copy of the content is streamed into Cloud Storage for training ML models and for archival purposes. The ML models now have access to all the content in real time, and the organization can choose to retrain them in as little as a few minutes. There is always a trade-off between the cost of retraining and accuracy, but the organization now has the flexibility to retrain as and when needed.
A beacon service keeps track of the endpoints where the application is actively being used. This information is needed to track app usage in real time for dashboards and reporting, and the recommendation service also requires it to prepare the next piece of relevant content for active users. Using the beacon’s data, the recommendation service computes content only for active users, resulting in efficient resource usage and cost savings. The data captured by the beacon service is minimal, so previously the organization had to enrich it by looking up user data in a Postgres DB, which slowed the database down in performing its normal functions. This improved greatly on Confluent Cloud once the organization started streaming Users data from the Postgres DB into Kafka using a managed CDC Source connector. ksqlDB, Confluent’s real-time stream-processing engine, made it possible to maintain a materialized view of the Users data as a ksqlDB table, so the latest user info is always available. Using stream-table joins, Confluent Cloud enriches the Active Users data with the required User attributes, all in real time. The team is also exploring how Apache Flink can help expand the scope of their real-time stream processing.
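To make the mechanics concrete, here is a minimal, plain-Python sketch of the stream-table join semantics described above: a "table" materialized from a stream of CDC change events (latest value per key wins), joined against a stream of beacon events. All field and variable names here are illustrative assumptions, not the organization's actual topics or schemas.

```python
# Materialize a latest-value "table" from CDC change events, keyed by
# user_id. This mirrors a ksqlDB TABLE built over a CDC topic: later
# change events overwrite earlier ones for the same key.
def materialize_table(cdc_events):
    table = {}
    for event in cdc_events:
        table[event["user_id"]] = event
    return table

# Stream-table join: enrich each beacon event with the user's latest
# attributes from the materialized table.
def enrich(active_user_events, users_table):
    for event in active_user_events:
        user = users_table.get(event["user_id"])
        if user is not None:
            yield {**event, "language": user["language"], "country": user["country"]}

# Illustrative CDC stream: user 1's record is updated by a second event.
cdc = [
    {"user_id": 1, "language": "en", "country": "US"},
    {"user_id": 1, "language": "es", "country": "MX"},  # update for user 1
    {"user_id": 2, "language": "de", "country": "DE"},
]
beacons = [{"user_id": 1, "active": True}, {"user_id": 2, "active": True}]

users = materialize_table(cdc)
enriched = list(enrich(beacons, users))
```

Note how user 1's beacon event picks up the *latest* CDC values, which is exactly why the materialized view matters: the join always sees current user attributes, not the state at some earlier batch run.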
All user interactions with the content, such as clicks, watches, and scrolls, are captured by the mobile app and relayed to the clicks tracker service, which publishes them to Confluent Cloud. These events also need additional user attributes, which are joined in using similar ksqlDB stream-table joins. One significant architectural advantage of Confluent Cloud is the ability for individual consumers to read the same data multiple times for various use cases. This is called fanout, and it is supported inherently in the platform. It helped the organization save costs by not duplicating the same message into multiple queues. Real-time dashboarding and reporting became possible with the managed Google BigQuery Sink Connector, which collects data from the required topics and sinks it into Google BigQuery, which in turn is integrated with visualization tools.
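The fanout property can be sketched with a toy, plain-Python model of Kafka's log-plus-offsets design: one stored log, and each consumer group tracking its own read position. The class and group names below are illustrative assumptions only.

```python
# Toy model of Kafka-style fanout: a single stored log, with each
# consumer group keeping its own offset. The data exists once, yet
# every group independently reads every message.
class Topic:
    def __init__(self):
        self.log = []      # single copy of the data
        self.offsets = {}  # group_id -> next offset to read

    def produce(self, msg):
        self.log.append(msg)

    def consume(self, group_id):
        """Return all messages this group has not yet seen."""
        start = self.offsets.get(group_id, 0)
        msgs = self.log[start:]
        self.offsets[group_id] = len(self.log)
        return msgs

clicks = Topic()
for i in range(3):
    clicks.produce({"click": i})

# Two independent consumers both receive all 3 messages,
# but the log itself is stored only once.
dashboard = clicks.consume("dashboard-service")
recommender = clicks.consume("recommendation-service")
```

Contrast this with a traditional queue, where serving two consumers the same data would mean writing it into two queues, which is exactly the duplication and storage cost the organization eliminated.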
Confluent Cloud’s Stream Governance capabilities have served the company well. With the help of Stream Lineage, the teams are able to visualize the real-time data streams comprising the producers, topics, consumers, connectors, and ksqlDB queries. Stream Lineage is helping them discover where data came from, where it’s going, and where, when, and how it was transformed.
There are a few standout scenarios to highlight from our time working with this organization. Once, a team reported that they had produced 9 million messages to Kafka but only 7 million were available in the downstream application. By looking at Stream Lineage, the Confluent admin was able to point out that the producer was publishing to 45 topics but the consumer was subscribed to only 40 of them. Figuring this out without Stream Lineage would have taken a lot of time and effort.
In another instance, the admin team noticed a 3x increase in client connections to their prod cluster. Stream Lineage helped them identify the consumers, and their client IDs, that were creating new connections to a set of topics. By combining the client IDs with the audit logs available on Confluent Cloud, they discovered that a specific team was trying to consume data from those topics using improper client configs. After this incident, the team started adding metadata such as tags to Confluent Cloud resources to call out PII data, team names, use cases, and owner names. Stream Catalog, another part of Stream Governance, now helps them search for the data they need, find its owners, and contact them.
With the help of Confluent Professional Services, the team was able to replace all of their siloed messaging tools with Confluent Cloud. They also expanded their use of Confluent Cloud with additional use cases, such as windowed aggregations in ksqlDB, which let them retire a custom-built internal tool that was batch-based and couldn’t scale well. Windowed aggregates capture data such as the most popular content consumed in the last minute, the content with the highest impressions (likes and shares), and more. Supplying this information to the recommendation systems has improved their accuracy in delivering relevant, personalized content to users.
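A one-minute tumbling-window count, the kind of windowed aggregate the team runs in ksqlDB, can be sketched in plain Python: each event is bucketed by the window its timestamp falls into, and counts are kept per (window, content) pair. The event fields and content IDs below are illustrative assumptions.

```python
from collections import Counter

WINDOW_MS = 60_000  # one-minute tumbling windows

def windowed_counts(events):
    """Count impressions per (window_start, content_id).

    Each event falls into exactly one non-overlapping window,
    which is what makes the windows "tumbling" rather than hopping.
    """
    counts = Counter()
    for e in events:
        window_start = (e["ts"] // WINDOW_MS) * WINDOW_MS
        counts[(window_start, e["content_id"])] += 1
    return counts

events = [
    {"ts": 5_000,  "content_id": "news-42"},
    {"ts": 30_000, "content_id": "news-42"},
    {"ts": 61_000, "content_id": "news-42"},  # falls into the next window
    {"ts": 62_000, "content_id": "game-7"},
]
counts = windowed_counts(events)
```

In the streaming version the aggregate updates continuously as events arrive, so "most popular content in the last minute" is always current, rather than waiting on a batch job, which is what made the old batch-based tool retirable.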
After implementing this architecture, the organization no longer has to handle day-to-day operations, because they are leveraging multiple fully managed components on Confluent Cloud such as Kafka, connectors, ksqlDB, and Schema Registry. Each module, such as the producer systems, Kafka, and the consumer systems, can scale up and down independently. They have integrated Confluent Cloud with a popular monitoring tool to observe important metrics and can scale the cluster with an API call. They use role-based access control to govern access to resources within Confluent Cloud. Next up, they want to leverage Confluent’s Terraform Provider to automate management activities such as creating, updating, and deleting topics, schemas, and connectors.
With Confluent Cloud, this organization has been able to deliver exceptional customer experiences through real-time personalization at scale. They no longer worry about cluster downtime, and their teams are focused on bringing in new business use cases, such as games on the lock screen. They were also able to onboard the resources and use cases of a recently acquired company without much hassle. The team is confident that Confluent Cloud can handle their scale as it continues to grow beyond today’s volumes.
Check out the real-time customer personalization webpage for more resources.