
Should You Put Several Event Types in the Same Kafka Topic?

Martin Kleppmann

If you adopt a streaming platform such as Apache Kafka, one of the most important questions to answer is: what topics are you going to use? In particular, if you have a bunch of different events that you want to publish to Kafka as messages, do you put them in the same topic, or do you split them across different topics?

The most important function of a topic is to allow a consumer to specify which subset of messages it wants to consume. At the one extreme, putting absolutely all your data in a single topic is probably a bad idea, since it would mean consumers have no way of selecting the events of interest—they would just get everything. At the other extreme, having millions of different topics is also a bad idea, since each topic in Kafka has a cost, and thus having a large number of topics will harm performance.

Actually, from a performance point of view, it’s the number of partitions that matters. But since each topic in Kafka has at least one partition, if you have n topics, you inevitably have at least n partitions. A while ago, Jun Rao wrote a blog post explaining the cost of having many partitions (end-to-end latency, file descriptors, memory overhead, recovery time after a failure). As a rule of thumb, if you care about latency, you should probably aim for (order of magnitude) hundreds of topic-partitions per broker node. If you have tens of thousands, or even thousands of partitions per node, your latency will suffer.

That performance argument provides some guidance for designing your topic structure: if you’re finding yourself with many thousands of topics, it would be advisable to merge some of the fine-grained, low-throughput topics into coarser-grained topics, and thus reduce the proliferation of partitions.

However, performance is not the end of the story. Even more important, in my opinion, are the data integrity and data modelling aspects of your topic structure. We will discuss those in the rest of this article.

Topic = collection of events of the same type?

The common wisdom (according to several conversations I’ve had, and according to a mailing list thread) seems to be: put all events of the same type in the same topic, and use different topics for different event types. That line of thinking is reminiscent of relational databases, where a table is a collection of records with the same type (i.e. the same set of columns), so we have an analogy between a relational table and a Kafka topic.

The Confluent Schema Registry has traditionally reinforced this pattern, because it encourages you to use the same Avro schema for all messages in a topic. That schema can be evolved while maintaining compatibility (e.g. by adding optional fields), but ultimately all messages have been expected to conform to a certain record type. We’ll revisit this later in the post, after we’ve covered some more background.

For some types of streaming data, such as logged activity events, it makes sense to require that all messages in the same topic conform to the same schema. However, some people are using Kafka for more database-like purposes, such as event sourcing, or exchanging data between microservices. In this context, I believe it’s less important to define a topic as a grouping of messages with the same schema. Much more important is the fact that Kafka maintains ordering of messages within a topic-partition.

Imagine a scenario in which you have some entity (say, a customer), and many different things can happen to that entity: a customer is created, a customer changes their address, a customer adds a new credit card to their account, a customer makes a customer support enquiry, a customer pays an invoice, a customer closes their account.

The order of those events matters. For example, we might expect that a customer is created before anything else can happen to a customer, and we might expect that after a customer closes their account nothing more will happen to them. When using Kafka, you can preserve the order of those events by putting them all in the same partition. In this example, you would use the customer ID as the partitioning key, and then put all these different events in the same topic. They must be in the same topic because different topics mean different partitions, and ordering is not preserved across partitions.
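
As a small illustration (not from the original article), here is a minimal sketch of what this looks like with the plain Java producer API: every event type goes to the same topic with the customer ID as the key, so Kafka hashes the key to the same partition and preserves the order of the events. The topic name, event types, JSON payloads, and the eventType header convention are all hypothetical.

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CustomerEventProducer {
    // Hypothetical convention: the event type travels in a record header, and the
    // payload is a JSON string (a schema-based encoding such as Avro is discussed later).
    private static ProducerRecord<String, String> event(String customerId, String type, String payload) {
        ProducerRecord<String, String> record = new ProducerRecord<>("customer-events", customerId, payload);
        record.headers().add("eventType", type.getBytes(StandardCharsets.UTF_8));
        return record;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-42";
            // Same key and same topic for every event type: Kafka assigns all of these
            // records to the same partition, so they keep the order they were sent in.
            producer.send(event(customerId, "customerCreated", "{}"));
            producer.send(event(customerId, "customerAddressChanged", "{\"address\": \"1 New Street\"}"));
            producer.send(event(customerId, "customerInvoicePaid", "{\"invoiceId\": \"inv-7\"}"));
        }
    }
}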

Ordering problems

If you did use different topics for (say) the customerCreated, customerAddressChanged, and customerInvoicePaid events, then a consumer of those topics may see the events in a nonsensical order. For example, the consumer may see an address change for a customer that does not exist (because it has not yet been created, since the corresponding customerCreated event has been delayed).

The risk of reordering is particularly high if a consumer is shut down for a while, perhaps for maintenance or to deploy a new version. While the consumer is stopped, events continue to be published, and those events are stored in the selected topic-partition on the Kafka brokers. When the consumer starts up again, it consumes the backlog of events from all of its input partitions. If the consumer has only one input, that’s no problem: the pending events are simply processed sequentially in the order they are stored. But if the consumer has several input topics, it will pick input topics to read in some arbitrary order. It may read all of the pending events from one input topic before it reads the backlog on another input topic, or it may interleave the inputs in some way.

Thus, if you put the customerCreated, customerAddressChanged, and customerInvoicePaid events in three separate topics, the consumer may well see all of the customerAddressChanged events before it sees any of the customerCreated events. And so it is likely that the consumer will see a customerAddressChanged event for a customer that, according to its view of the world, has not yet been created.

You might be tempted to attach a timestamp to each message and use that for event ordering. That might just about work if you are importing events into a data warehouse, where you can order the events after the fact. But in a stream process, timestamps are not enough: if you get an event with a certain timestamp, you don’t know whether you still need to wait for some previous event with a lower timestamp, or if all previous events have arrived and you’re ready to process the event. And relying on clock synchronisation generally leads to nightmares; for more detail on the problems with clocks, I refer you to Chapter 8 of my book.

When to split topics, when to combine?

Given that background, I will propose some rules of thumb to help you figure out which things to put in the same topic, and which things to split into separate topics:

  1. The most important rule is that any events that need to stay in a fixed order must go in the same topic (and they must also use the same partitioning key). Most commonly, the order of events matters if they are about the same entity. So, as a rule of thumb, we could say that all events about the same entity need to go in the same topic. The ordering of events is particularly relevant if you are using an event sourcing approach to data modelling. Here, the state of an aggregate object is derived from a log of events by replaying them in a particular order. Thus, even though there may be many different event types, all of the events that define an aggregate must go in the same topic (see the replay sketch after this list).
  2. When you have events about different entities, should they go in the same topic or different topics? I would say that if one entity depends on another (e.g. an address belongs to a customer), or if they are often needed together, they might as well go in the same topic. On the other hand, if they are unrelated and managed by different teams, they are better put in separate topics. It also depends on the throughput of events: if one entity type has a much higher rate of events than another entity type, they are better split into separate topics, to avoid overwhelming consumers who only want the entity with low write throughput (see point four). But several entities that all have a low rate of events can easily be merged.
  3. What if an event involves several entities? For example, a purchase relates a product and a customer, and a transfer from one account to another involves at least those two accounts. I would recommend initially recording the event as a single atomic message, and not splitting it up into several messages in several topics. It’s best to record events exactly as you receive them, in a form that is as raw as possible. You can always split up the compound event later, using a stream processor—but it’s much harder to reconstruct the original event if you split it up prematurely. Even better, you can give the initial event a unique ID (e.g. a UUID); that way, when you later split the original event into one event for each entity involved, you can carry that ID forward, making the provenance of each event traceable.
  4. Look at the number of topics that a consumer needs to subscribe to. If several consumers all read a particular group of topics, this suggests that maybe those topics should be combined. If you combine the fine-grained topics into coarser-grained ones, some consumers may receive unwanted events that they need to ignore. That is not a big deal: consuming messages from Kafka is very cheap, so even if a consumer ends up ignoring half of the events, the cost of this overconsumption is probably not significant. Only if the consumer needs to ignore the vast majority of messages (e.g. 99.9% are unwanted) would I recommend splitting the low-volume event stream from the high-volume stream.
  5. A changelog topic for a Kafka Streams state store (KTable) should be separate from all other topics. In this case, the topic is managed by the Kafka Streams process, and it should not be shared with anything else.
  6. Finally, what if none of the rules above tell you whether to put some events in the same topic or in different topics? Then by all means group them by event type, by putting events of the same type in the same topic. However, I think this rule is the least important of all.
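
To make rule 1 concrete, here is a minimal sketch of an event-sourcing-style replay: a consumer reads one topic-partition from the beginning and folds every event, whatever its type, into the state of a customer aggregate. The topic name, the eventType header, and the CustomerState fields are hypothetical (matching the producer sketch earlier in this article), and a real system would deserialize proper event objects rather than switch on a raw header.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.header.Header;

public class CustomerAggregateReplay {

    // Hypothetical aggregate state, rebuilt purely from the event log.
    static class CustomerState {
        boolean created;
        String address;
        boolean closed;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        CustomerState state = new CustomerState();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Read one partition from the beginning: all events for a given customer live
            // in the same partition, so the replay order matches the original order.
            TopicPartition partition = new TopicPartition("customer-events", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seekToBeginning(Collections.singletonList(partition));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                Header typeHeader = record.headers().lastHeader("eventType");
                if (typeHeader == null) continue;
                switch (new String(typeHeader.value(), StandardCharsets.UTF_8)) {
                    case "customerCreated":        state.created = true; break;
                    case "customerAddressChanged": state.address = record.value(); break;
                    case "customerAccountClosed":  state.closed = true; break;
                    default: break; // ignore event types this consumer does not care about
                }
            }
            System.out.println("created=" + state.created + ", address=" + state.address + ", closed=" + state.closed);
        }
    }
}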

Schema management

If you are using a data encoding such as JSON, without a statically defined schema, you can easily put many different event types in the same topic. However, if you are using a schema-based encoding such as Avro, a bit more thought is needed to handle multiple event types in a single topic.

As mentioned above, the Avro-based Confluent Schema Registry for Kafka currently relies on the assumption that there is one schema for each topic (or rather, one schema for the key and one for the value of a message). You can register new versions of a schema, and the registry checks that the schema changes are forward and backward compatible. A nice thing about this design is that you can have different producers and consumers using different schema versions at the same time, and they still remain compatible with each other.

More precisely, when Confluent’s Avro serializer registers a schema in the registry, it does so under a subject name. By default, that subject is <topic>-key for message keys and <topic>-value for message values. The schema registry then checks the mutual compatibility of all schemas that are registered under a particular subject.

I have recently made a patch to the Avro serializer that makes the compatibility check more flexible. The patch adds two new configuration options: key.subject.name.strategy (which defines how to construct the subject name for message keys), and value.subject.name.strategy (how to construct the subject name for message values). The options can take one of the following values (a configuration sketch follows the list):

  • io.confluent.kafka.serializers.subject.TopicNameStrategy (default): The subject name for message keys is <topic>-key, and <topic>-value for message values. This means that the schemas of all messages in the topic must be compatible with each other.
  • io.confluent.kafka.serializers.subject.RecordNameStrategy: The subject name is the fully-qualified name of the Avro record type of the message. Thus, the schema registry checks the compatibility for a particular record type, regardless of topic. This setting allows any number of different event types in the same topic.
  • io.confluent.kafka.serializers.subject.TopicRecordNameStrategy: The subject name is <topic>-<type>, where <topic> is the Kafka topic name, and <type> is the fully-qualified name of the Avro record type of the message. This setting also allows any number of event types in the same topic, and further constrains the compatibility check to the current topic only.
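
For example, a producer using Confluent’s Avro serializer could be configured along these lines to pick the RecordNameStrategy; this is only a sketch, and the bootstrap server and schema registry URLs are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class MultiTypeAvroProducerConfig {
    public static KafkaProducer<String, Object> buildProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("schema.registry.url", "http://localhost:8081");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Register value schemas under the record's fully-qualified name instead of
        // <topic>-value, so several Avro record types can share one topic.
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.RecordNameStrategy");
        return new KafkaProducer<>(props);
    }
}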

With this new feature, you can easily and cleanly put all the different events for a particular entity in the same topic. Now you can freely choose the granularity of topics based on the criteria above, and not be limited to a single event type per topic.


Comments

  1. Great post! Really answered some of the questions we’ve been having as we build out our services using Kafka.

    When do you think your patch will make it to a release?

  2. Awesome post, Martin. How many topics and how many events seems to be a common topic in most of the forums for architects/designers. Your post is great guidance for people like me to learn and apply these principles.

    The section on when to split topics and when to combine is a great decision-tree model, providing a crystal-clear approach.

    Thank you for the post
    BMK

  3. Thanks for the post! According to this, the developer already has to think about possible performance bottlenecks when designing the domain model. This sounds like premature optimization, and might call for an additional layer between the domain and Kafka: perhaps a “dispatcher” layer that makes all of this transparent.

    -Gyula

  4. Is it possible to use the KStream API to consume these different events for a particular entity in the same topic? Or must one use the lower-level Processor API to do this?

  5. Hi,

    Excellent post. Great guidance on topic modelling. I’m interested in your statement regarding performance in the third paragraph.

    “As a rule of thumb, if you care about latency, you should probably aim for (order of magnitude) hundreds of topic-partitions per broker node. If you have tens of thousands, or even thousands of partitions per node, your latency will suffer.”

    Can you give some additional context on the specifications of a Kafka cluster here like CPU, memory, network, storage type, etc…

    Thanks,
    Paul

  6. The events could be out of order even if they are put into a single topic, when they arrive from different applications at slightly different times. In that case the consuming applications need to have some logic to handle it, correct? I think some of this might be addressed when using the Kafka Streams APIs.

  7. This is great.

    Do you know if this made the 4.1 release?

    Added value.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy to connect-avro-standalone.properties but get default TopicNameStrategy.

  8. Great post! It’d be great to show the deserializing bit, in order to be able to deserialize all the types that live in the topic without getting serialization exceptions, and to be able to treat the messages of the different types differently. At the moment my config looks like this:
    schema.registry.url = [http://localhost:18081]
    auto.register.schemas = true
    max.schemas.per.subject = 1000
    basic.auth.credentials.source = URL
    schema.registry.basic.auth.user.info = [hidden]
    specific.avro.reader = true
    value.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicRecordNameStrategy
    key.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicRecordNameStrategy

    But when trying to deserialize, I am using a specific record type from one of the messages, and then RecordType2 throws a serialization exception.

    How could we make a consumer app deserialize just the messages of one type, or deserialize both types while being able to tell which one is which, in order to do different processing or even ignore some?

    Thanks

  9. Thanks for the post and your book – extremely helpful for learning. This may be more of an Avro question, but I haven’t found a good example of defining the schema to put multiple event types onto a single topic. Would the unioned event type look something like this?

    {
      "namespace": "com.example.customer.events",
      "type": "record",
      "name": "CustomerEvent",
      "fields": [
        {"name": "customerId", "type": "int"},
        {"name": "event", "type": ["CustomerCreated", "CustomerAddressChanged", "CustomerInvoicePaid"]}
      ]
    }

    The customerId would probably also be used as topic key. And then it would be up to the application to do an instanceof/typecheck on the “event” field to know which type it is, correct?

    1. Does this work with confluent 4.1.1 community edition?

      It seems that the strategy always defaults to TopicNameStrategy. I tried passing it to both the standalone connector properties and my connector properties.

      Is there another way to override this behavior?

  10. Thanks for this post !

    What about a use case where you want to queue events of the same type, but you would like to avoid possible starvation within this topic?

    For example, let’s say you have a fraud detection entity that issues an alert upon each failed user sign-in, and upon every 3 such failed attempts you would like to take an action.

    Having all those alerts queued on the same topic may cause starvation, as it is possible that the fraud detection entity produces 300 failed sign-ins for user A while the alert for 3 failed sign-ins of user B is delayed.

    Can you please recommend a possible topic strategy here?

  11. Thanks for this post !

    What about use cases where you would like to avoid starvation between different instances of events of the same type?

    For example, let’s say you have a fraud detection entity that issues a new event upon each failed sign-in, and an action takes place every X failed sign-ins. Now, assume that user A gets 100X failed sign-ins (some attack??) and user B gets 2X failed sign-ins. In such a case, the notification for user B will get delayed on account of user A.

    What would be a good topic strategy here, in your opinion?

    Thanks
    Hanan

  12. Great post and cool patch! I’m curious about an issue you might encounter when using multiple Types per Topic and wanting to maintain backward compatibility:

    If I already have Consumers of my Topic that know about message Types X and Y (call them Domain Events), how might I introduce a new message Type Z (a new Domain Event to expose) without breaking those existing Consumers? It seems like introducing Z would break deserialization for those Consumers if they don’t yet “know” about this new message Type Z. Would this rather be considered a backward incompatible change, and therefore necessitate a new Topic?

  13. I was exploring this same issue and found the article very useful. While having a single topic for related event types resolves the issue of ordering in consumption, do you have any suggestions on how to handle this if the different event types are simply data from different tables sourced using the Confluent JDBC connector? For example, if there is a parent and a child table, each read independently using the JDBC connector, is there any option to sequence them so they go into the topic in the right order?

  14. Hi Martin. I was greatly inspired by this blog post and I architected my system accordingly.

    I have a `User` topic and a separate `Friendship` topic. The User topic is hashed by userId and the Friendship topic is hashed by a sorted tuple where firstUserId < secondUserId.

    In rare but possible cases, user1 and user2 are both created and immediately one adds the other as a friend. Since `User` and `Friendship` go to separate topics and are processed by different consumers, it could happen that the `FriendshipCreatedEvent` is processed earlier than the `UserCreatedEvent`.

    Now I am in a dilemma: in order to make sure user1's and user2's created events are processed earlier than their friendship event, I have to put all events into the same topic and the same partition, which leaves me the only option of creating a single-partition topic for all messages, which makes it impossible to parallelize consumers.

    I would be very grateful if you can provide some insights.

  15. Hi,

    Nice tutorial.
    Is it possible to attach a schema to a topic in Kafka, so that any message published to the topic gets filtered out if it does not satisfy the schema?
    In short, how do you attach a schema to a particular topic?

  16. Dear Martin,

    I like your article and your way of thinking about ordering and event types. For me it’s about use cases that cover more than one message at a time.

    I would like to see you extend these thoughts to how use cases change over time. From time to time, I guess, you have to reorganise or migrate topics if your business cases are developing or shifting direction, especially if you develop in an agile process.

    Greg Young covered this topic in the context of versioning event-sourced systems, but I guess you could apply a lot of his principles, like copy and replace, split, merge, filter, etc., in the Kafka log as well.

    Any thoughts on that?
