
Aggregate and Stream Your Security Data to Amazon Security Lake at Scale with Confluent


Data breaches and cyber attacks are a growing concern among many companies today. With the ever-increasing cost of data breaches, businesses are constantly searching for ways to secure their data and infrastructure better.

The Problem with Secure Data Management

According to IBM's annual Cost of a Data Breach report, data breach costs reached an all-time high in 2022, averaging USD 4.35 million, and 83% of the organizations surveyed had experienced more than one data breach. Given the sheer volume of data generated by security tools and infrastructure, businesses need help managing and securing their workloads, applications, and data, especially in the face of increasingly sophisticated bad actors. That's where Confluent and Amazon Security Lake, a new purpose-built data lake for security-related data, come in. This blog post explores how the partnership can help businesses tackle the challenges of securing their data and infrastructure.

Amazon Security Lake is a new purpose-built data lake for security-related data. It can automatically aggregate data from cloud and on-premises infrastructure, firewalls, and endpoint security solutions. It helps enterprises centralize all of their security data in a single data lake, using a standards-based format, and manage the life cycle of this data.

Amazon Security Lake aggregates data from AWS services like CloudTrail and Lambda, as well as its security tools like AWS Security Hub, GuardDuty, or the AWS Firewall Manager, in addition to many third-party log sources from SaaS and on-premises. It supports the new Open Cybersecurity Schema Framework (OCSF), which facilitates a common way to store telemetry, making it far easier to integrate tools. In addition, tools can pass information to one another. The schema is consistent and data flows seamlessly into data lakes and analytics tools. Confluent helps you quickly aggregate data and send it to Amazon Security Lake, wherever it is and at any scale.

The Challenges of Moving Data to a Streaming Model

Amazon Security Lake is a powerful security platform that ingests data from AWS native services as well as custom enterprise data with the help of partners. Confluent offers data governance features, massive scaling, and a connector ecosystem that complement Amazon Security Lake, making it easier to ingest and process data from various locations like on-prem, at the edge, or in a co-location, into S3, ensuring a streamlined and efficient data pipeline. As a data streaming platform, Confluent can scale to millions of events per second, making it an ideal layer for enterprises with massive data estates. 

So how should you get data to Confluent?

The first option is to produce events directly using one of our client libraries (Java, C/C++, Python, Go, .NET) to send relevant events to a topic, which is a logical collection of events. For Amazon Security Lake, this might be a microservice that pulls events from network devices, then generates Kafka events.
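As a rough illustration of the first option, here is a minimal Python sketch of a service that turns device log lines into events and hands them to a producer. It assumes a producer object with `produce(topic, key=..., value=...)` and `flush()` methods, which is the shape of the `Producer` class in the confluent-kafka Python client. The topic name, the event fields, and the `RecordingProducer` stand-in are hypothetical and exist only so the sketch runs without a cluster.

```python
import json
import time


def build_event(device_id, message):
    """Build a small JSON-serializable event for one device log line."""
    return {
        "device_id": device_id,  # hypothetical field name
        "message": message,      # hypothetical field name
        # Epoch milliseconds at the time the event was collected.
        "time": int(time.time() * 1000),
    }


def publish_events(producer, topic, events):
    """Send each event to `topic`, keyed by device ID, and return the count sent."""
    sent = 0
    for event in events:
        producer.produce(
            topic,
            key=event["device_id"].encode("utf-8"),
            value=json.dumps(event).encode("utf-8"),
        )
        sent += 1
    # Block until everything queued above has been handed off.
    producer.flush()
    return sent


class RecordingProducer:
    """Stand-in for a real client; it only records what would be sent."""

    def __init__(self):
        self.messages = []

    def produce(self, topic, key=None, value=None):
        self.messages.append((topic, key, value))

    def flush(self):
        return 0
```

Keying by device ID is a design choice rather than a requirement: events with the same key land in the same partition, so one device's events keep their relative order.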

Another option would be to use our connector ecosystem. Confluent, with help from our partners, has over 120 connectors that will allow you to pull data from various disparate sources. You can also sink Confluent events into data destinations, but we will cover that a bit later. So if you are interested in security events in any flavor of relational database (MSSQL, MySQL, PostgreSQL, Oracle, etc.), you can use connectors to pull data directly from these incumbent systems to generate new events in Confluent.

Once your data is in Confluent, it must conform to the Open Cybersecurity Schema Framework (OCSF), as Amazon Security Lake requires. OCSF is a collaborative, open-source effort by AWS and leading partners in the cybersecurity industry. It provides a standard schema for everyday security events, defines versioning criteria to facilitate schema evolution, and includes a self-governance process for security log producers and consumers. Confluent can help with OCSF conformity in two ways. The first is through our Data Governance features, which include Schema Registry. Schema Registry allows you to set up and enforce specific schemas, like OCSF, at a topic level, which means events will be rejected if they do not conform to OCSF. The second is transformation: Confluent can format events into OCSF using ksqlDB or our future Flink offering (more on that in a future blog post). The last step is getting that data to an S3 bucket managed by Amazon Security Lake.
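To make "conforms to OCSF" a little more concrete, the following Python sketch checks an event for a handful of OCSF base attributes and for the OCSF convention that `type_uid` equals `class_uid * 100 + activity_id`. The field list is an illustrative subset, not the full schema, and the function name is hypothetical. In the pipeline described here, enforcement is done by Schema Registry at the topic level, not by application code like this.

```python
# Illustrative subset of OCSF base-event attributes; the OCSF schema
# itself is the authority on what each event class requires.
REQUIRED_FIELDS = (
    "activity_id",
    "category_uid",
    "class_uid",
    "severity_id",
    "time",
    "type_uid",
)


def ocsf_problems(event):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event
    ]
    if not problems:
        # OCSF derives type_uid from the class and the activity.
        expected = event["class_uid"] * 100 + event["activity_id"]
        if event["type_uid"] != expected:
            problems.append(f"type_uid should be {expected}")
    return problems
```

A check like this can be useful in a producer's unit tests, so that malformed events are caught before they are rejected at the topic.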

The Solution to Siloed Data

Remember when we were talking about Confluent sink connectors? Our S3 sink connector is one of our most popular and does all of the neat things Amazon Security Lake requires. Here’s how to get started. First, deploy a Confluent Connect worker somewhere in AWS. Many options exist, including EC2, ECS, and even EKS Fargate. In this example, we’ll use EC2 to simplify our initial setup. Use Amazon Security Lake to set up a source S3 bucket and associated IAM role. You’ll need an EC2 instance profile that allows the instance to assume the role that Amazon Security Lake created. Next, use this example Connect worker configuration:

bootstrap.servers={BOOTSTRAP_ENDPOINT}
security.protocol=SASL_SSL
sasl.jaas.config={AUTHENTICATION_CREDENTIALS}
sasl.mechanism=PLAIN
consumer.security.protocol=SASL_SSL
consumer.sasl.jaas.config={AUTHENTICATION_CREDENTIALS}
consumer.sasl.mechanism=PLAIN
producer.security.protocol=SASL_SSL
producer.sasl.jaas.config={AUTHENTICATION_CREDENTIALS}
producer.sasl.mechanism=PLAIN
# Required for correctness in Apache Kafka clients prior to 2.6
client.dns.lookup=use_all_dns_ips
group.id=distributed-connect
config.storage.topic=distributed-connect-config
offset.storage.topic=distributed-connect-offsets
status.storage.topic=distributed-connect-status
plugin.path=/usr/local/share/kafka/plugins

# Best practice for higher availability in Apache Kafka clients prior to 3.0
session.timeout.ms=45000

# Best practice for Kafka producer to prevent data loss
acks=all

# Required connection configs for Confluent Cloud Schema Registry
schema.registry.url={SCHEMA_REGISTRY_URL}
basic.auth.credentials.source=USER_INFO
basic.auth.user.info={SCHEMA_REGISTRY_AUTHENTICATION}

# Schema Registry specific settings
value.converter.basic.auth.credentials.source=USER_INFO
value.converter.schema.registry.basic.auth.user.info={SCHEMA_REGISTRY_AUTHENTICATION}
value.converter.schema.registry.url={SCHEMA_REGISTRY_URL}
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter

Once you have the worker up and running, the next step is to deploy an S3 Sink task to the worker. The S3 Sink will watch all of the topics that hold your OCSF events and send those events to S3 as Parquet objects. Amazon Security Lake has specific settings and partitioning requirements; the following S3 Sink task configuration includes those requirements to get you started:

{
	"connector.class": "io.confluent.connect.s3.S3SinkConnector",
	"s3.credentials.provider.class": "io.confluent.connect.s3.auth.AwsAssumeRoleCredentialsProvider",
	"s3.credentials.provider.sts.role.arn": "arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}",
	"s3.credentials.provider.sts.role.session.name": "{SESSION_NAME}",
	"s3.credentials.provider.sts.role.external.id": "{EXTERNAL_ID}",
	"topics": "{TOPIC_NAME}",
	"output.data.format": "PARQUET",
	"s3.bucket.name": "{BUCKET_NAME}",
	"s3.region": "{REGION}",
	"input.data.format": "AVRO",
	"time.interval": "DAILY",
	"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
	"partition.duration.ms": "1",
	"locale": "en-US",
	"timezone": "America/Los_Angeles",
	"timestamp.extractor": "Record",
	"topics.dir": "ext",
	"path.format": "'region=us-west-2'/'accountId={ACCOUNT_ID}'/'eventDay='YYYYMMdd",
	"rotate.interval.ms": "600000",
	"format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
	"storage.class": "io.confluent.connect.s3.storage.S3Storage",
	"value.converter.enhanced.avro.schema.support": "true",
	"flush.size": "1000",
	"tasks.max": "1"
}
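One way to submit a configuration like this is through the Kafka Connect REST API, which a worker serves on port 8083 by default; `PUT /connectors/{name}/config` creates the connector if it does not exist and updates it otherwise. The Python sketch below only builds that request, so it can be inspected before anything is sent. The connector name `security-lake-s3-sink`, the worker URL, and the function names are assumptions for illustration.

```python
import json
import urllib.request
from urllib.parse import quote


def build_deploy_request(connect_url, connector_name, config):
    """Build a PUT request that creates or updates a connector's configuration."""
    base = connect_url.rstrip("/")
    name = quote(connector_name, safe="")
    return urllib.request.Request(
        f"{base}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def deploy(request):
    """Send the request to the worker and return its parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `deploy(build_deploy_request("http://localhost:8083", "security-lake-s3-sink", config))` would submit the configuration, where `config` is the JSON object above loaded into a dictionary with its placeholders filled in.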

And you’re ready to go!

In addition to sending these events to Amazon Security Lake, you could also have native Confluent consumers using these topics for notifications, business logic, or even firing off AWS Lambda functions to kick off remediation actions. With Confluent and Amazon Security Lake, any organization at any scale can start deriving security insights in near real-time.
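As a sketch of what such a consumer's logic might look like, the Python below picks out high-severity events and passes each one to a callback. It assumes events have already been deserialized into dictionaries and carry the OCSF `severity_id` attribute, where 4 is High, 5 is Critical, 6 is Fatal, and 99 means Other. The `invoke` callback is hypothetical; in practice it might wrap a call that triggers an AWS Lambda function.

```python
HIGH_SEVERITY_ID = 4   # OCSF severity_id for "High"
FATAL_SEVERITY_ID = 6  # OCSF severity_id for "Fatal"


def select_for_remediation(events, threshold=HIGH_SEVERITY_ID):
    """Keep events whose severity is between `threshold` and Fatal, inclusive.

    Events with no severity_id, or with 99 ("Other"), are skipped.
    """
    return [
        event
        for event in events
        if threshold <= event.get("severity_id", 0) <= FATAL_SEVERITY_ID
    ]


def dispatch(events, invoke, threshold=HIGH_SEVERITY_ID):
    """Call invoke(event) for each qualifying event; return how many were sent."""
    selected = select_for_remediation(events, threshold)
    for event in selected:
        invoke(event)
    return len(selected)
```

Keeping the selection separate from the side effect makes the filter easy to test without touching Kafka or AWS.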

Additional contributions from Michael Worthington, Sr. Product Marketing Manager

  • Joseph Morais started early in his career as a network/solution engineer working for FMC Corporation and then Urban Outfitters (UO). At UO, Joseph joined the e-commerce operations team, focusing on agile methodology, CI/CD, containerization, public cloud architecture, and infrastructure as code. This led to a greenfield AWS opportunity working for a startup, Amino Payments, where he worked heavily with Kafka, Apache Hadoop, NGINX, and automation. Before joining Confluent, Joseph helped AWS enterprise customers scale through their cloud journey as a senior technical account manager. At Confluent, Joseph serves as cloud partner solutions architect and Confluent Cloud evangelist.

  • Geetha Anne is a solutions engineer at Confluent with previous experience in executing solutions for data-driven business problems on cloud, involving data warehousing and real-time streaming analytics. She fell in love with distributed computing during her undergraduate days and followed her interest ever since. Geetha provides technical guidance, design advice, and thought leadership to key Confluent customers and partners. She also enjoys teaching complex technical concepts to both tech-savvy and general audiences.

  • Weifan Liang is a Senior Partner Solutions Architect at AWS. He works closely with AWS top strategic data analytics software partners to drive product integration, build optimized architecture, develop long-term strategy, and provide thought leadership. Innovating together with partners, Weifan strives to help customers accelerate business outcomes with cloud powered digital transformation.
