Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now

Top 5 Best Practices for Building Event-Driven Architectures Using Confluent and AWS Lambda

Written By

Confluent and AWS Lambda can be used for building real-time, scalable, fault-tolerant event-driven architectures, ensuring that your application logic is executed reliably in response to specific business events.

Confluent provides a streaming SaaS solution based on Apache Kafka® and built on Kora: The Cloud Native Apache Kafka Engine, allowing you to focus on building event-driven applications without operating the underlying infrastructure.

AWS Lambda is a serverless compute service that provides you with a code execution framework that abstracts the need to provision, operate, and scale any of the underlying infrastructure. 

In this blog post, we will discuss using two different patterns—the Fully managed AWS Lambda Sink Connector and the Native event source mapping (ESM)—to integrate Confluent with AWS Lambda to build event-driven applications (EDAs). We'll also highlight the top five best practices developers and architects should know when building EDAs with Confluent and Lambda.

Pattern 1: Using the fully managed AWS Lambda Sink Connector for Confluent Cloud

In this first pattern, the fully managed AWS Lambda Sink Connector can read data from Confluent Cloud input topics and invoke Lambda functions synchronously or asynchronously. Synchronous mode ensures sequential batch processing for data consistency, while asynchronous mode prioritizes speed but may affect ordering guarantees.

Benefits of using the connector

The benefits of using this pattern include:

  • High throughput and low latency: Configuring the connector in asynchronous mode prioritizes high throughput and low latency but provides at-most-once processing without retries on function failures. On the other hand, configuring the connector in Synchronous mode ensures data consistency by waiting for each function to complete before processing the next batch, but it may limit overall throughput. Configuration is done with the aws.lambda.invocation.type parameter.

  • Multiple topics and multiple Lambda functions: The connector supports one or more Confluent topics and can invoke one or more Lambda functions. This is useful if you have a single topic that triggers multiple Lambda functions. Configuration is done with the  aws.lambda.topic2function.map configuration. 

  • Simple transformations before sending to the Lambda function: Single Message Transformations (SMTs) is a Kafka Connect feature that allows you to modify events within the connector before they are sent to Lambda. For example, filtering messages by field to reduce costs.

  • Native out-of-the-box integration with Schema Registry: The connector supports various input data formats for Confluent topic keys and values. These formats include Avro, JSON Schema (JSON_SR), Protobuf, JSON (schemaless), and Bytes. The key advantage here is the elimination of any need to write any code or manage Lambda layer libraries for the deserialization process.

Considerations

The following are a few considerations to take into account when using the connector:

  • Added complexity: The connector is available as a fully managed offering from Confluent, but it remains an additional component in the architecture that requires scaling and monitoring

  • Auto-scaling is not available: The connector supports multiple tasks, it lacks auto-scaling functionality. As a result, manual adjustments are necessary to match varying workloads. Therefore, it might not be the ideal choice for handling unpredictable and highly spiky workloads. 

Pattern 2: Using native event source mapping

In this pattern, the AWS Lambda service continuously polls events from the event source (Confluent) and invokes the Lambda function synchronously, optionally using batching for added benefits.

Benefits of using native event source mapping

There are a couple of benefits to choosing this method:

  • Auto-scaling: Lambda initially assigns one function to handle all partitions in the input topic. It continuously monitors consumer lag, scaling functions as needed when the lag exceeds a threshold, up to one function per partition. This scaling occurs every three minutes to minimize consumer rebalances.

  • Polling logic is done for free: With ESM, AWS takes full responsibility for managing the infrastructure required to poll events from Confluent, and the best part is that the polling logic is provided at no additional cost to you. You only get charged for running your actual business logic.

Considerations

The following are a few considerations to take into account when using the native event source mapping:

  • Concurrent batches per partition are unsupported: ESM ensures ordering guarantees by sending one batch at a time per partition, but this can limit throughput and increase end-to-end latency, making it less suitable for high-throughput, low-latency use cases.

  • Lack of retry logic: With ESM, you are unable to customize essential parameters like the number of retry attempts or define specific actions for handling events that expired and failed all retry attempts.

  • Added complexity and dependency management for data with schemas: When input data contains a schema, Lambda function code must handle deserialization, increasing complexity and requiring dependency management, which can be managed through Lambda layers or in the code zip archive. 

  • No native support for fan-out to multiple Lambda functions: For fanout to multiple Lambda functions, consider using Amazon EventBridge which offers two integration methods: EventBridge Pipes and the recently introduced EventBridge Sink Connector. In forthcoming blog posts, we will delve into both integration methods. 

Note that EventBridge introduces its own associated costs and adds an extra component to the pipeline.

Best practices when using Confluent and AWS Lambda

Here are some best practices for running an event-driven Confluent and Lambda solution. 

Batching calls to Lambda

Typically, Lambda polls the Kafka topic frequently for new records, and if records exist, it invokes the function runtime. Since you pay for compute (number of invocations + processing), this could become inefficient as your workloads and throughput requirements scale. 

Use AWS Lambda batching controls for high throughput workloads to save on lambda invocation costs. Batching allows you to send a batch of events to a Lambda invocation, reducing the overhead of spinning up and shutting down a function on every single event. 

The Lambda batching controls are compatible with Kafka protocol brokers like your Confluent Cloud cluster!

Error handling in the Lambda code: Avoiding duplicates and stalled partitions

Lambda polls records from Confluent in batches, if one event in the batch fails the entire batch is retried. This approach presents three potential drawbacks. Firstly, because the entire batch is retried, any events from the start of the batch up to the failing event will be processed multiple times, leading to at-least-once processing guarantees.

Secondly, events from the failing event to the end of the batch are left unprocessed, resulting in at-most-once processing guarantees. These events will not be attempted again, potentially leading to data loss.

Finally, this method may lead to a stalled partition, where subsequent events following the failing batch will remain unprocessed until the batch either succeeds or expires. This situation can cause delays in processing new incoming data and increase latency.

To illustrate this scenario, consider an example where there are 300 events in the input Confluent partition, and the batch size is set to 200 events. If event 100 fails, the entire Lambda batch will be retried, causing events 1 to 100 to be processed by function A again. Furthermore, events 101 to 200 will remain unprocessed if event 100 continues to fail and will remain permanently unprocessed if the Lambda batch expires. Finally, records 200 to 300 from the Kafka topic will not be processed until the first Lambda batch is either successfully processed or reaches its expiration, leading to increased latency.

To fix this, the best practice is to always catch any errors inside the code, log them to an external service (like Amazon CloudWatch), and always return to the Lambda service successfully. However, it's worth noting that there are exceptional cases where maintaining order processing is important. In such instances, implementing head-of-line blocking and persistently continual failures becomes necessary to avoid potential data inconsistencies and undesirable states.

Doing this will significantly reduce the chances of reprocessing events, eliminate the chances of events not being processed, and avoid any stalled partition.

Use Lambda functions idempotently

Ideally, records are produced to and consumed only once from a given topic. However, in cloud-native systems, we must build fault-tolerant systems for the non-ideal scenarios (see the CAP theorem). Given that the Lambda sink connector and the native event source mapping already offer at-least-once guarantees, it's crucial to protect your workloads from potential duplicate processing, configuration errors, or operational failures. To achieve this, we highly recommend implementing the idempotent consumption pattern.

By adopting the idempotent consumption pattern, you can ensure that processing the same event multiple times will have the same outcome as processing it just once. To learn more about how to implement idempotent readers/consumers, check out this Confluent Developer article.

Keep long-lived connections to Kafka (initialize outside the handler) for better efficiency

AWS Lambda provides a secure and isolated runtime environment to invoke your function. If a Lambda instance is not available to respond to an invocation, a new instance must be created and initialized. This cold start introduces a delay between the invocation and runtime processing, causing delays in the E2E processing.

When the Lambda function has to work with Apache Kafka or Confluent, establishing a connection with the broker introduces additional delays in the processing.

One way to get around these cold start issues is to establish long-lived connections outside the function handler to ensure that it can be reused across different Lambda invocations, thus improving the overall performance of the Lambda function and resource utilization on your Kafka cluster, while also reducing delays in processing.

Here is a code sample showcasing where the connection to the Kafka cluster must be established: 

from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.serialization import StringSerializer
import boto3
import os

# Initialize the producer outside the handler
# This method avoids instantiation on each invocation of the Lambda

 producer = SerializingProducer({
'bootstrap.servers': <bootstrap_servers>,
'sasl.mechanisms': 'PLAIN',
'security.protocol': 'SASL_SSL',
'sasl.username': <sasl_username>,
'sasl.password': <sasl_password>,
'key.serializer': StringSerializer(),
'value.serializer': AvroSerializer(SCHEMA_REGISTRY_CLIENT, schema),
'client.id': 'aws-lambda',
'batch.size': 5000)

def lambda_handler(event, context):
	# If producer is initialized here -
# The step will be done in each invocation

	try: 
# Process any events before producing to Confluent

Processed_event = process_event(event)
PRODUCER.produce('final', key, message)
except Exception as e:
      print(f"Exception while producing message - {message} : {e}")
PRODUCER.flush()

PRODUCER.produce('final', key, Processed_event)

def process_event(event):
...
...
...
...
...

In the above sample, we instantiate a producer instance outside the Lambda handler, within the lambda_handler(event, context) method. This approach ensures that the step is performed only once during the startup of the Lambda function, rather than on every invocation.

Use schemas for rich data validation

In event-driven architectures with Lambda and Confluent, multiple Lambda functions (representing different microservices owned by distinct teams) communicate through events. To ensure seamless communication and collaboration between these microservices, the use of schemas and Schema Registry is crucial.

Schemas provide data integrity by enforcing a strict format for the data. They define key-names, value-types, data format, mandatory and optional inclusions, default values, and range constraints, accompanied by human and machine-readable documentation. Implementing schemas results in a substantial reduction in errors and data inconsistencies. By enforcing these rules, the system gains a strong level of validation and ensures that data adheres to the predefined structure and constraints. 

Typically, schemas are defined in a Schema Registry, which acts as a centralized repository for managing and versioning schemas. This repository establishes data contracts between producers and consumers, ensuring seamless data exchange. 

Adding to this, by utilizing a Schema Registry, producers and consumers can establish a clear understanding of schema evolution guidelines based on a specific compatibility check. As long as producers adhere to this compatibility check, they are permitted to make changes to their schemas without causing conflicts with existing consumers. This approach fosters a flexible and scalable architecture, allowing for iterative improvements and updates to individual microservices without negatively impacting the overall system's stability and functionality.

Confluent offers a few functionalities that facilitate the adoption and enforcement of schema checks, streamlining your data processing and ensuring data consistency:

  1. Confluent offers a fully managed Schema Registry for each Confluent Cloud environment. For each environment, you have the option to choose from two Stream Governance Packages: Essentials and Advanced.

  2. Confluent offers Broker-Side Schema Validation, which enforces strict adherence to data schemas registered in the Schema Registry. When producers attempt to write data to Confluent, the Broker-Side Schema Validation steps in to verify whether the data is accompanied by a corresponding schema in the Schema Registry. This validation ensures that only data adhering to the registered schemas for a specific topic is accepted and processed within the system. Any data lacking a valid schema will be rejected, preventing the possibility of incompatible or erroneous data entering the topic.

  3. Confluent offers governance controls that allow you to set up rules to validate data quality using data validation rules. These data rules or policies can enforce integrity constraints that govern the structure and content of your data. For instance, you can enforce that a field containing sensitive information must be encrypted, ensuring data security and privacy. Similarly, you can require that an age field must only contain positive integers, or an email address must include the "@" sign, guaranteeing data consistency and validity. 

Conclusion

We hope this was a helpful guide to get you started with integrating AWS Lambda and Confluent Cloud into your workloads and employing the best practices.

Confluent, with its suite of services including Confluent Cloud, Confluent Schema Registry and REST API, utilizes the de facto streaming standard Apache Kafka along with the next generation Kora engine to provide a solution that is cloud native, complete, and everywhere. Sign up for a free trial of Confluent Cloud in AWS Marketplace.

  • Ahmed Zamzam is a staff partner solutions architect at Confluent, specializing in integrations with cloud service providers (CSPs). Leveraging his expertise, he collaborates closely with CSP service teams to develop and refine integrations, empowering customers to leverage their data in real time effectively. Outside of work, Zamzam enjoys traveling, playing tennis, and cycling.

  • Ashish Chhabria is a staff product manager at Confluent, where he focuses on Apache Kafka REST API and Client SDKs. Prior to Confluent, he held product roles at Microsoft, working on Azure Service Bus. He started his career in the financial services industry building real-time systems for risk management and reconciliation. Ashish has a MS in Computer Science from Columbia University, and lives in the Seattle area with his family.

Did you like this blog post? Share it now