Getting raw documents, scanned PDFs, and complex images ready for RAG and agentic AI in real time remains an engineering bottleneck. Structured data pipelines have matured. But unstructured data processing too often relies on brittle synchronous API chains or legacy batch scripts that break under pressure.
This article gives AI engineers, data platform architects, and infrastructure leads a blueprint for designing durable, event-driven streaming pipelines that transform messy binary blobs into structured, AI-ready data.
The focus is on the core mechanics of production engineering: system architecture, state transitions, fault tolerance, and resiliency. You'll learn how to design streaming control planes that manage cost variations, protect downstream rate limits, and ensure data freshness so your AI applications get the structural context they need to avoid hallucinations.
Trigger streaming workflows on every document upload: Build unstructured data pipelines (PDFs, scans, images) as event-driven streaming workflows, not batch ETL or synchronous API chains, to keep RAG and agent context fresh and reduce hallucinations
Use the Claim Check pattern: Store binaries in object storage and put only URIs, metadata, and processing state on Apache Kafka®
Model processing as staged topics and state transitions: Flow data through raw_documents → refined_documents → curated_ai_assets, emitting a new event after each stage
Protect brittle OCR, LLM, and embedding APIs with buffering and backpressure: Use throttling strategies like Kafka's consumer pause()/resume() API to handle bursts and rate limits
Ensure production correctness at every layer: Use application-level content hashes to detect duplicate payloads, Kafka's exactly-once transactions to prevent partial writes, and DLQs to isolate corrupt or unsupported files
Publish curated outputs via managed sink connectors: Route to vector databases, search indexes, and warehouses using incremental upserts for cost-efficient freshness
Data freshness directly determines RAG quality. Recent system benchmarks, like the RAGPerf framework, show that systems ignoring new data updates suffer from poor context recall and accuracy. When your pipeline operates on stale context, your AI applications hallucinate. Keeping documents immediately searchable as they arrive is what restores retrieval accuracy to near-perfect levels, according to independent research on recency priors in real-time systems.
For structured data, achieving that freshness is a solved problem. A JSON event representing a financial transaction arrives ready to index. The pipeline reads a field, transforms it, and writes it downstream in milliseconds with predictable cost and compute. Unstructured data breaks every one of those assumptions.
A multi-page scanned invoice or complex legal contract requires significant compute just to extract baseline text, let alone infer semantic structure. The extraction itself is lossy: naive tools destroy table layouts, sever key-value relationships, and flatten multi-column reading order. The pipeline must therefore be both fast and accurate.
Processing times are unpredictable because a digital-native PDF with an embedded text layer takes a fraction of the time and cost of a degraded scan that requires GPU-accelerated OCR. That unpredictability extends directly to cost.
As of early 2026, basic text extraction costs roughly $1-2 per 1,000 pages on major cloud platforms. Extracting structured forms or key-value pairs can cost more than $50 per 1,000 pages on those same platforms. The pipeline must route each document to the correct extraction tier, or costs spiral out of control.
External API rate limits compound the problem. Azure Document Intelligence defaults to around 15 transactions per second (check current documentation for your tier), and Google Document AI strictly limits online processing requests. Hit these endpoints synchronously with a sudden burst of uploaded documents, and you'll get cascading failures. A decoupled, buffered architecture isn't optional here.
The engineering challenge specific to unstructured data is achieving real-time freshness despite variable compute costs, lossy extraction, and hard API throughput ceilings. The three common architectural approaches handle these constraints very differently.
| Dimension | Legacy batch ETL | Synchronous API chains | Event-driven streaming |
|---|---|---|---|
| Data freshness | Stale (hours to days) | Real-time (if successful) | Real-time (milliseconds to seconds) |
| Scalability | Vertical, brittle | Fails under burst loads | Horizontal, elastic |
| Rate limit handling | Manual throttling scripts | Throws HTTP 429 errors | Topic-based buffering with consumer-side throttling |
| Cost control | All-or-nothing processing | Over-provisions resources | Tiered, granular routing |
Event-driven streaming addresses each of the challenges above. Durable topics act as buffers between producers and consumers, absorbing document bursts without overwhelming rate-limited extraction APIs. Decoupled consumer groups let fast digital-PDF extractors and slow GPU-accelerated OCR engines run at their own pace from the same input stream, handling unpredictable compute times without blocking each other.
Routing logic inside the stream processor directs each document to the appropriate extraction tier, giving the pipeline granular cost control. Because every document triggers processing the moment it arrives, downstream RAG applications always work with fresh context.
Every upload, document scan, or image arrival should be treated as an asynchronous event that kicks off a continuous pipeline. The streaming platform acts as the control plane, routing data through specialized worker pools and managing each document's processing state.
Handling unstructured data requires a crucial architectural rule: don't send large binary files through the message broker.
Event streaming platforms like Apache Kafka® are optimized for high-throughput, low-latency transmission of small to medium-sized messages. Push a 500MB PDF or high-resolution image directly into Kafka topics and you'll degrade cluster performance, consume excessive memory, and routinely violate default message size limits.
The solution is the Claim Check pattern. Heavy binary blobs live in durable object storage. The event stream carries only the claim ticket: the object URI, security metadata, pipeline processing state, and eventually, the lightweight extracted downstream outputs.
This pattern works well for real-time use cases because major object stores now provide strong read-after-write consistency. When your application writes a PDF to Amazon S3 or Google Cloud Storage, the object is immediately available for reading the moment the success response comes back. A downstream Kafka consumer triggered by that upload event won't encounter a missing file error.
To maintain a reliable control plane, design your architecture so each major processing step emits a distinct new event:
Document received? Trigger an event.
OCR completes? Trigger another.
Text chunked? Publish a final event.
This staged emission makes your pipeline observable, decoupled, and replayable.
Because event payloads evolve differently across stages (a raw event carries only a URI while a curated event carries embeddings and chunk metadata), register and enforce schemas for each topic using a schema registry. Schema enforcement prevents a producer bug in the extraction stage from publishing malformed events that silently break downstream chunking or embedding consumers. Catching incompatible changes at write time is far cheaper than debugging corrupted vectors in production.
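As an illustration, a registered schema for the raw_documents claim-check event might look like the following Avro definition. The field names are hypothetical; use whatever metadata your ingestion layer actually captures.

```json
{
  "type": "record",
  "name": "RawDocumentEvent",
  "namespace": "com.example.pipeline",
  "fields": [
    {"name": "correlation_id", "type": "string", "doc": "Unique ID that follows the document through every stage"},
    {"name": "object_uri", "type": "string", "doc": "Claim check: location of the binary in object storage"},
    {"name": "mime_type", "type": "string", "doc": "Detected type from magic-byte inspection"},
    {"name": "uploaded_by", "type": ["null", "string"], "default": null},
    {"name": "upload_timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "processing_state", "type": {"type": "enum", "name": "ProcessingState", "symbols": ["RECEIVED", "VALIDATED", "REJECTED"]}}
  ]
}
```

Later-stage topics (refined_documents, curated_ai_assets) get their own schemas that add extracted text, layout metadata, chunk IDs, and embeddings on top of these identifiers.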
Transitioning unstructured data from a messy blob into a production-ready vector requires a methodical, staged approach. Map distinct data states to dedicated stream topics and you can build fault-tolerant pipelines that scale horizontally. The reference architecture below uses three topics (raw_documents, refined_documents, curated_ai_assets) to represent the data at each stage of processing. Name them whatever fits your domain, but keep the separation.
The first stage defines the Raw Layer: the immutable record of the original upload. The payload contains no extracted text, only the original object storage URI, critical source metadata including upload timestamp and originating user, and a unique processing correlation ID.
The pipeline begins the moment a storage bucket or upstream application emits an event. A lightweight stream processor captures this event and performs immediate validation:
Inspects metadata and detects the MIME type or magic bytes to verify file type, identifying it as a digital PDF, PNG image, or DOCX file
Determines the correct processing path before any heavy, expensive computation begins
Rejects malformed files or unsupported formats immediately, or sends them to an error queue to save compute cycles
Once validated, the event is published to the raw_documents topic, serving as the definitive starting point for extraction.
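A minimal sketch of that validation step, assuming Apache Tika for magic-byte detection; the topic names and supported-type list are illustrative:

```java
import org.apache.tika.Tika;

import java.util.Set;

public class UploadValidator {

    private static final Set<String> SUPPORTED_TYPES = Set.of(
            "application/pdf",
            "image/png",
            "image/jpeg",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document");

    private final Tika tika = new Tika();

    /**
     * Detects the real file type from the leading bytes (not the file extension)
     * and decides whether the document enters raw_documents or an error queue.
     */
    public String routeByContentType(byte[] leadingBytes) {
        String mimeType = tika.detect(leadingBytes);
        if (!SUPPORTED_TYPES.contains(mimeType)) {
            // Unsupported or malformed upload: divert before any expensive extraction
            return "unsupported_documents";
        }
        return "raw_documents";
    }
}
```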
The second stage defines the Refined Layer: the extracted content and deep structural information pulled from the raw blob. Key outputs mapped into this JSON payload include raw text, vital layout data such as page numbers and bounding boxes, OCR confidence scores, and normalized key-value fields.
A stream processing application (built with Apache Flink®, Kafka Streams, or an equivalent framework) reads continuously from the raw_documents topic and acts as a router. It evaluates file type and business requirements, then sends the job to the most appropriate asynchronous processing service:
Digital-native PDFs with easily readable embedded text layers get routed to fast, inexpensive, CPU-bound text extractors
Flat scanned documents or complex multi-column layouts go to highly accurate, GPU-accelerated cloud OCR engines
This routing logic maps directly to the Event Router pattern. In Flink SQL, routing documents by type is a single CREATE TABLE AS SELECT (CTAS) statement per destination:
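A sketch, assuming the raw_documents table exposes a detected mime_type and a has_text_layer flag set during validation (adjust names to your schema):

```sql
-- Digital-native PDFs with an embedded text layer: cheap, CPU-bound extraction
CREATE TABLE pdf_text_extraction_jobs AS
  SELECT correlation_id, object_uri, mime_type
  FROM raw_documents
  WHERE mime_type = 'application/pdf' AND has_text_layer = TRUE;

-- Scans and images: route to the GPU-accelerated OCR tier
CREATE TABLE ocr_extraction_jobs AS
  SELECT correlation_id, object_uri, mime_type
  FROM raw_documents
  WHERE mime_type IN ('image/png', 'image/jpeg')
     OR (mime_type = 'application/pdf' AND has_text_layer = FALSE);
```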
With Kafka Streams, use a TopicNameExtractor to route each document dynamically based on its detected file type:
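A sketch of that approach; the event type RawDocumentEvent, its accessors, and the destination topic names are illustrative, and serdes are omitted for brevity:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.processor.TopicNameExtractor;

public class DocumentRouter {

    public static void buildTopology(StreamsBuilder builder) {
        // Configure Consumed.with(...) for your key/value serdes in a real topology
        KStream<String, RawDocumentEvent> rawDocuments = builder.stream("raw_documents");

        // Choose the destination topic per record based on the detected file type
        TopicNameExtractor<String, RawDocumentEvent> byFileType =
                (key, value, recordContext) -> switch (value.getMimeType()) {
                    case "application/pdf" -> value.hasTextLayer()
                            ? "pdf_text_extraction_jobs"   // cheap CPU-bound extraction
                            : "ocr_extraction_jobs";       // GPU-accelerated OCR
                    case "image/png", "image/jpeg" -> "ocr_extraction_jobs";
                    default -> "unsupported_documents";    // DLQ-style topic
                };

        rawDocuments.to(byFileType);
    }
}
```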
Regardless of which extraction path handles the job, every worker normalizes its output into a standard enterprise schema and publishes the structured payload to the single refined_documents topic. All paths converge on one topic, preserving the clean staged-topic model and decoupling the fragile, rate-limited extraction APIs from downstream consumers.
The third stage defines the Curated Layer: the final, domain-ready data prepared specifically for downstream RAG or analytics use cases. The payload includes context-aware text chunks, structured business entities like invoice line items, generative summaries, and dense vector embeddings.
Flink jobs or Kafka Streams applications consume heavy JSON payloads from the refined_documents topic. Rather than splitting text blindly by character count, they use layout metadata to chunk along page boundaries, keep tabular data intact, and respect section hierarchies.
Following chunking, the processor performs enrichments: calling a large language model API to extract named entities, or hitting an embedding model to vectorize the chunks.
On Confluent Cloud, teams have two options for embedding generation. Write the query yourself using Flink SQL and the AI_EMBEDDING function:
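A sketch, assuming an embedding model or connection named text_embedding_model has already been registered (the exact CREATE MODEL / CREATE CONNECTION options depend on your provider; check the Confluent Cloud Flink documentation) and that a refined_document_chunks table carries the chunked text:

```sql
-- Vectorize each chunk and publish the results to the curated topic.
-- Table and column names here are illustrative.
CREATE TABLE curated_ai_assets AS
  SELECT
    document_id,
    chunk_id,
    chunk_text,
    AI_EMBEDDING('text_embedding_model', chunk_text) AS embedding
  FROM refined_document_chunks;
```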
Or skip the SQL entirely and use the Create Embeddings Action, which generates and manages the Flink job through the Console UI without custom code.
Either way, the final, application-ready assets are published to the curated_ai_assets topic.
The final step moves curated data from the event stream into serving layers where applications can query it. This relies on managed sink connectors that continuously consume from the curated_ai_assets topic and reliably publish data to its final destination without custom integration code.
The pipeline routes different representations of the data to specific systems:
Vector databases (Pinecone, Milvus, Weaviate, or Qdrant) receive dense embeddings for semantic RAG retrieval
Search indexes (Elasticsearch or Azure Cognitive Search) receive normalized text and metadata for traditional keyword and hybrid search
Cloud data warehouses (Snowflake or BigQuery) ingest structured entities for classical reporting and analytics
Confluent provides pre-built sink connectors for each of these destination types.
Because curated_ai_assets is a standard Kafka topic, multiple sink connectors can consume from it independently. A single document update publishes once to the curated topic and fans out to every downstream system: the vector database, the search index, and the warehouse each receive the update through their own connector, with no coordination required between them.
Each of these sinks must handle incremental updates efficiently. By publishing events equipped with content-hashed chunk identifiers, the sink connectors issue targeted upserts and deletes rather than wiping and rebuilding an entire document index when a single paragraph changes. Production research on streaming vector updates shows this pattern reduces content reprocessing from 100% of the document to just 10-15%, drastically cutting database compute costs while maintaining data freshness across all downstream systems.
Routing documents to the right processing service is only half the problem. For simple single-column text, basic extraction works fine. But for structurally complex documents like invoices, tax forms, and multi-column reports, treating extraction as mere text retrieval is an engineering error: the spatial and structural context of the data is just as important as the words themselves.
The problem shows up when naive parsing tools flatten complex documents from left to right, top to bottom. This destroys table structures, ruins reading order in multi-column layouts, and severs relationships between key-value pairs.
In vendor-published benchmarks using structural evaluation frameworks, layout-aware parsers achieved table extraction scores of 0.844 on complex documents. Without that awareness, an AI agent answering a question from a flattened table encounters merged columns and produces hallucinated associations.
By explicitly preserving structural metadata, including strict page boundaries, element confidence scores, and coordinate bounding boxes, in the Refined event payload, you dictate the quality of downstream retrieval. This layout data allows chunking algorithms in the Curated layer to keep related concepts grouped together. When the vector database returns a chunk to the LLM, the context is accurate and spatially coherent.
The four-stage pipeline above assumes document input, but modern AI architectures also require multimodal processing for images, photographs, and visual assets. The same staged model applies (raw, refined, curated, serving), but the extraction and enrichment stages require a flexible architecture that supports parallel event branches.
When an image arrives in the object store, treating it as a single linear process limits its utility. Instead, the ingestion event should fan out, triggering multiple specialized consumer groups simultaneously.
The extraction outputs required for the Refined layer in a multimodal pipeline are significantly broader than for document parsing. A complete image processing pipeline must:
Extract visible text via OCR
Generate classifications and safety labels using zero-shot vision models
Map bounding boxes for object detection
Generate dense semantic image embeddings for visual similarity search
To satisfy the diverse needs of downstream consumers, use a dual-output strategy within the Curated layer. The pipeline should produce two distinct representations of the same asset:
A human-readable, normalized JSON payload containing text, classifications, and metadata, which is critical for auditability, debugging, and user interfaces
A machine-optimized representation containing high-dimensional vectors
Publish both representations to the stream. Analytical data warehouses consume the structured JSON while vector databases consume the embeddings, all from the same real-time control plane.
Unstructured data pipelines interact heavily with brittle, third-party machine learning (ML) APIs, making them uniquely susceptible to timeouts, partial failures, and throughput bottlenecks. Designing for resiliency is now the core requirement rather than an afterthought.
Extraction APIs and embedding models are computationally expensive, so processing the same document or chunk twice due to a network hiccup creates real cost and correctness problems. Two layers of protection work together here:
Content hashing (application level): Compute a SHA-256 hash of each raw blob or refined text chunk and use it as the record key or deduplication lookup. If a consumer encounters a hash it has already processed, it skips the record. This prevents duplicate processing of the same asset during consumer retries or intentional pipeline replays.
Platform-level guarantees: At the infrastructure level, enforce idempotency using your messaging platform's native guarantees. Configure clients as idempotent producers so network retries don't result in duplicate messages appended to the log. For the Kafka-side commit path, where a consumer reads from one topic and writes results to another, transactional APIs guarantee exactly-once read-process-write semantics. If the process fails mid-execution, the transaction aborts and no partial data commits downstream.
Note that exactly-once covers the Kafka log itself. It does not roll back calls already made to external OCR or embedding APIs. That is where content-hash-based idempotency fills the gap: even if an external API call executes twice, the duplicate result is detected and discarded before it reaches the next topic.
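A minimal sketch of the application-level content-hash check described above. The processed-hash store is shown as an in-memory set for brevity; in production it would typically be a Kafka Streams state store or an external key-value store.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ContentHashDeduplicator {

    // In production, back this with a state store or external KV store, not local memory
    private final Set<String> processedHashes = ConcurrentHashMap.newKeySet();

    /** SHA-256 hash of a raw blob or refined chunk, usable as the record key and dedup key. */
    public static String contentHash(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required", e);
        }
    }

    /** Returns true only the first time a given payload is seen, so retries and replays are skipped. */
    public boolean shouldProcess(String content) {
        return processedHashes.add(contentHash(content));
    }
}
```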
Cloud-based AI models and OCR endpoints enforce aggressive rate limits. If your pipeline tries pushing 500 documents per second into an API configured for 15 transactions per second, the external service will throw HTTP 429 Too Many Requests errors, causing cascading task failures.
Decoupled stream topics inherently protect downstream systems by buffering incoming document bursts. But consumers must still throttle their reads to match external API capacity.
For granular control without destabilizing the cluster, use Kafka's consumer pause() and resume() API to implement backpressure. If API latency spikes or rate limits get hit, the consumer calls pause() on specific partitions to temporarily stop fetching from them.
This halts ingestion for that specific worker without shutting down the consumer or triggering a costly, cluster-wide consumer group rebalance. When the external API recovers, the consumer calls resume() and processing continues from where it left off.
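A simplified sketch of that backpressure loop. ExtractionClient and RateLimitedException stand in for your API client, the 30-second backoff is arbitrary, and offset commits are omitted for brevity:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.time.Instant;

public class ThrottledExtractionWorker {

    private Instant pausedUntil = Instant.MIN;

    public void pollLoop(KafkaConsumer<String, String> consumer, ExtractionClient client) {
        while (true) {
            // Resume fetching once the backoff window has elapsed
            if (!consumer.paused().isEmpty() && Instant.now().isAfter(pausedUntil)) {
                consumer.resume(consumer.paused());
            }

            // poll() keeps running while paused; it returns no records for paused
            // partitions but keeps the consumer healthy in its group
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));

            for (ConsumerRecord<String, String> record : records) {
                try {
                    client.extract(record.value());
                } catch (RateLimitedException e) {
                    // External API returned HTTP 429: stop fetching from the assigned
                    // partitions without leaving the group, so no rebalance is triggered
                    consumer.pause(consumer.assignment());
                    // Rewind this partition so the failed record is retried after resume
                    consumer.seek(new TopicPartition(record.topic(), record.partition()), record.offset());
                    pausedUntil = Instant.now().plus(Duration.ofSeconds(30));
                    break;
                }
            }
            // Commit offsets here in a real worker, paired with the content-hash
            // idempotency above so retried records are not double-processed
        }
    }
}
```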
Unstructured data is inherently unpredictable. Corrupt PDFs, password-protected files, or illegible handwriting will inevitably cause parsing failures or return unacceptably low confidence scores. Silently dropping these files degrades data integrity, and infinite retries will permanently block the partition. Neither option is acceptable.
Implement robust Dead-Letter Queues (DLQs) to handle these cases cleanly. When a worker encounters an unparseable file or persistent API failure, it wraps the original event along with the stack trace and error metadata, then publishes it to a dedicated DLQ topic. The main pipeline continues processing healthy documents uninterrupted.
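A sketch of that DLQ handoff, assuming an illustrative DLQ topic name and attaching the failure context as record headers:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;

public class DeadLetterPublisher {

    private static final String DLQ_TOPIC = "refined_documents_dlq"; // illustrative name

    private final KafkaProducer<String, byte[]> producer;

    public DeadLetterPublisher(KafkaProducer<String, byte[]> producer) {
        this.producer = producer;
    }

    /** Wraps the original event with error context and publishes it to the DLQ topic. */
    public void publish(ConsumerRecord<String, byte[]> failedRecord, Exception error) {
        ProducerRecord<String, byte[]> dlqRecord =
                new ProducerRecord<>(DLQ_TOPIC, failedRecord.key(), failedRecord.value());

        // Preserve where the record came from and why it failed, for triage and replay
        dlqRecord.headers()
                .add("source.topic", failedRecord.topic().getBytes(StandardCharsets.UTF_8))
                .add("source.partition", Integer.toString(failedRecord.partition()).getBytes(StandardCharsets.UTF_8))
                .add("source.offset", Long.toString(failedRecord.offset()).getBytes(StandardCharsets.UTF_8))
                .add("error.class", error.getClass().getName().getBytes(StandardCharsets.UTF_8))
                .add("error.message", String.valueOf(error.getMessage()).getBytes(StandardCharsets.UTF_8));

        producer.send(dlqRecord);
    }
}
```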
DLQ topics serve a dual purpose: they alert engineering teams to systemic parser failures, and they act as the routing mechanism for human-in-the-loop review systems where operations teams can manually correct edge cases.
Processing unstructured data at production scale incurs large cloud bills if managed poorly. Treat cost as a dynamic engineering variable instead of a fixed infrastructure expense.
Pipelines should employ smart, tiered processing that escalates cost with complexity. The routing logic in Stage 2 already separates cheap text extraction from expensive OCR. Extend that principle further by having the stream processor dynamically split exceptionally large PDFs into page groups and parallelize extraction across concurrent workers. Keep pages that share cross-page context together (tables that span pages, continued paragraphs, footnote references) rather than splitting blindly at every page boundary. Expensive multimodal or generative APIs should only trigger when the extraction stage explicitly flags a document as requiring them.
This article covered the end-to-end architecture for processing unstructured documents and images in real time: ingesting and validating raw uploads, extracting text and layout with structure-aware parsing, chunking and embedding content for RAG, and serving curated outputs to vector databases, search indexes, and warehouses. Along the way, we addressed the production concerns that make unstructured data uniquely difficult: rate limit protection, idempotency across external APIs, failure isolation with DLQs, and cost control through tiered routing.
Unstructured data pipelines succeed or fail based on durable, observable system design. Transition data through explicitly defined Raw, Refined, and Curated states using a streaming control plane and you can eliminate batch staleness, protect against rate limits, and provide AI models with the exact structural context they need.
When adopting this architecture, start small. Map out the complete event lifecycle and Claim Check pattern of a single, well-understood document type before attempting to scale into complex multimodal architectures. Establish your idempotency and backpressure mechanics early.
Building this infrastructure from scratch requires significant operational effort. Confluent's complete data streaming platform, combining the resilience of Apache Kafka, the stateful stream processing of Apache Flink, and an ecosystem of fully managed connectors, provides the exact control plane you need. The platform lets you build, scale, and monitor real-time AI data pipelines without bearing the heavy operational burden of managing complex DIY infrastructure.
Don't put large files on Kafka itself. Store binaries in object storage and publish only a claim check message (URI, metadata, correlation ID, and processing state) to Kafka to avoid message-size limits and cluster performance issues.
The Claim Check pattern is an architecture where the event stream carries a "ticket" (object URI + metadata) while the file stays in object storage. Consumers fetch the file when needed and emit new events as processing completes.
Raw = immutable ingestion record (URI + metadata). Refined = extracted text + layout/structure (OCR output, bounding boxes, confidence). Curated = RAG-ready chunks, entities/summaries, and embeddings for downstream serving.
Buffer work in Kafka topics and throttle consumers to match external capacity. Use backpressure techniques like Kafka's pause()/resume() API and controlled concurrency so bursts don't overload downstream APIs.
Use content hashes (e.g., SHA-256) for documents and chunks plus idempotent producers. For read-process-write stages, use transactions/exactly-once semantics so partial writes don't commit.
A Dead-Letter Queue is a separate topic for events that repeatedly fail (corrupt PDFs, unsupported formats, persistent API errors). DLQs prevent partitions from being blocked and enable human review or specialized reprocessing.
Detect file type and complexity early (MIME/magic bytes, presence of embedded text, layout complexity). Route digital-native PDFs to low-cost extractors and reserve OCR for scans/images or complex layouts.
Use stable chunk IDs derived from content hashes and perform targeted upserts/deletes. Only changed chunks are re-embedded and updated, reducing reprocessing and cost while maintaining freshness.
Produce both (1) human-readable structured JSON (OCR text, labels, bounding boxes, metadata) and (2) machine-optimized vectors (image embeddings and/or text embeddings) so different sinks can consume the right representation.
Send embeddings to a vector DB for semantic retrieval, normalized text/metadata to a search index for keyword/hybrid search, and structured entities to a warehouse/data lake for analytics. Ideally, use managed sink connectors from the curated topic.