From the KSQL command prompt, we’ll first declare the stream of events (orders), using the Kafka topic populated from the CSV files:
ksql> CREATE STREAM orders WITH (KAFKA_TOPIC='orders',VALUE_FORMAT='avro');
It’s easy to check that we’ve got data. Note the use of SET 'auto.offset.reset' = 'earliest'; to tell KSQL to process all of the data in the topic, rather than the default behaviour of reading only new data:
ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' from 'null' to 'earliest'
ksql> SELECT * FROM orders LIMIT 1;
1521209173235 | - | 1 | 288 | 2018-03-05T07:30:55Z | Wine - Sherry Dry Sack, William | 1.53
LIMIT reached for the partition.
Query terminated
Now let’s look at the customers. We are going to declare this as a table in KSQL, because it is a set of values for a given key (as opposed to a stream of events, which is what the orders are). But we need to be careful with the key. Before declaring the object (CREATE TABLE) we can use the PRINT command to inspect the topic:
ksql> PRINT 'asgard.demo.customers' FROM BEGINNING;
Format:AVRO
16/03/18 14:01:27 GMT, +�, {"id": 245, "first_name": "Sergent", "last_name": "Greenmon", "email": "sgreenmon6s@wordpress.com", "gender": "Male", "comments": "Synergized optimizing pricing structure", "messagetopic": "asgard.demo.customers", "messagesource": "Debezium CDC from MySQL on asgard"}
Note the special characters in the record key (the , +�, after the timestamp, before the message payload). This is because the key is actually an Avro key, and KSQL can’t handle Avro keys yet, so it blanks it out. We can verify that this is the case by looking at the raw message and its key:
$ kafka-avro-console-consumer \
--bootstrap-server localhost:9092 \
--property schema.registry.url=http://localhost:8081 \
--topic asgard.demo.customers --property print.key=true --max-messages=1 --from-beginning
{"id":1} {"id":1,"first_name":{"string":"Kania"},"last_name":{"string":"Eggleson"},"email":{"string":"keggleson0@tripadvisor.com"},"gender":{"string":"Female"},"comments":{"string":"Multi-channelled web-enabled ability"},"messagetopic":{"string":"asgard.demo.customers"},"messagesource":{"string":"Debezium CDC from MySQL on asgard"}}
Processed a total of 1 messages
We could work around this upstream in the Kafka Connect configuration, by setting "key.converter":"org.apache.kafka.connect.storage.StringConverter" for the connector, or we could just use KSQL!
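For reference, the Connect-side fix is a single converter override in the connector configuration, shown here as a fragment only (the surrounding settings are whatever your Debezium source connector already uses, and the property can equally be set worker-wide in the Connect worker config):

"key.converter": "org.apache.kafka.connect.storage.StringConverter"

Here, though, we’ll take the KSQL route: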
ksql> SET 'auto.offset.reset' = 'earliest';
ksql> CREATE STREAM CUSTOMERS_SRC WITH (KAFKA_TOPIC='asgard.demo.customers',VALUE_FORMAT='AVRO');
ksql> CREATE STREAM CUSTOMERS_REKEYED AS SELECT * FROM CUSTOMERS_SRC PARTITION BY ID;
The above statement takes the source topic, which is flowing through from MySQL via Debezium, and explicitly partitions it on the supplied key, the ID column. The resulting topic is keyed as we want, and this time uses a simple string for the key:
ksql> PRINT 'CUSTOMERS_REKEYED' FROM BEGINNING;
Format:AVRO
02/03/18 23:48:05 GMT, 5, {"ID": 5, "FIRST_NAME": "Marion", "LAST_NAME": "Micklem", "EMAIL": "mmicklem4@mail.ru", "GENDER": "Male", "COMMENTS": "Reactive clear-thinking functionalities", "MESSAGETOPIC": "asgard.demo.customers", "MESSAGESOURCE": "Debezium CDC from MySQL on asgard"}
Let’s inspect the message. We can’t use kafka-avro-console-consumer here, because it assumes that the key is also Avro-encoded, which it isn’t this time. Instead, we’ll use the excellent kafkacat tool:
$ kafkacat -C -K: -b localhost:9092 -f 'Key: %k\nValue: %s\n\n' -t CUSTOMERS_REKEYED -c1
Key: 5
Value:
MarionMicklem"mmicklem4@mail.rMaleNReactive clear-thinking functionalities*asgard.demo.customersBDebezium CDC from MySQL on asgard
We can now use the correctly-keyed topic for our KSQL table:
ksql> CREATE TABLE CUSTOMERS WITH (KAFKA_TOPIC='CUSTOMERS_REKEYED', VALUE_FORMAT='AVRO', KEY='ID');
Check that the table’s declared key (ID) matches that of the Kafka message key:
ksql> SELECT ROWKEY,ID FROM CUSTOMERS LIMIT 5;
5 | 5
6 | 6
10 | 10
12 | 12
15 | 15
LIMIT reached for the partition.
Query terminated
Now the bit we’ve all been waiting for…enrich the stream of inbound orders data with customer data from MySQL!
ksql> SELECT O.ORDER_TS, O.PRODUCT, O.ORDER_TOTAL_USD, \
C.ID, C.FIRST_NAME, C.LAST_NAME, C.EMAIL \
FROM ORDERS O \
LEFT OUTER JOIN CUSTOMERS C \
ON O.CUSTOMER_ID = C.ID \
LIMIT 5;
2018-03-13T01:50:53Z | Onions - Spanish | 9.44 | 115 | Alexandr | Willcot | awillcot36@facebook.com
2018-03-04T07:58:10Z | Halibut - Whole, Fresh | 5.11 | 929 | Ulick | Dumberell | udumberellps@ucla.edu
2018-02-09T19:11:15Z | Beef Wellington | 7.33 | 632 | Jennie | McMichell | jmcmichellhj@miitbeian.gov.cn
2018-03-11T15:39:49Z | Chocolate Eclairs | 1.45 | 270 | Margareta | Kerfod | mkerfod7h@nhs.uk
2018-03-04T23:27:04Z | Wine - George Duboeuf Rose | 6.68 | 117 | Duky | Raden | draden38@marketwatch.com
LIMIT reached for the partition.
Query terminated
(I’m using the \ line-continuation character to make it easier to read the KSQL statements, but you can put it all on one line if you want.)
We can persist this streaming query with a CREATE STREAM statement:
ksql> CREATE STREAM ORDERS_ENRICHED AS \
SELECT O.ORDER_TS, O.PRODUCT, O.ORDER_TOTAL_USD, \
C.ID, C.FIRST_NAME, C.LAST_NAME, C.EMAIL \
FROM ORDERS O \
LEFT OUTER JOIN CUSTOMERS C \
ON O.CUSTOMER_ID = C.ID;
Message
Stream created and running
This is a continuous query that executes in the background until explicitly terminated by the user. In effect, these are stream processing applications, and all we need to create them is SQL! Here all we’ve done is an enrichment (joining two sets of data), but we could easily add predicates to the data (simply include a WHERE clause), or even aggregations.
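For example, here is a filter and a simple aggregation against the enriched stream (the predicate and grouping chosen here are illustrative; the column names are those of ORDERS_ENRICHED as created above):

ksql> SELECT * FROM ORDERS_ENRICHED WHERE PRODUCT LIKE 'Wine%';

ksql> SELECT ID, FIRST_NAME, LAST_NAME, COUNT(*) \
FROM ORDERS_ENRICHED \
GROUP BY ID, FIRST_NAME, LAST_NAME;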
You can see which queries are running with the SHOW QUERIES; statement. All queries will pause if the KSQL server stops, and restart automagically when the KSQL server starts again.
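A persistent query can be stopped by passing its ID to TERMINATE. For example (the ID is the one KSQL generated for our CREATE STREAM above; don’t actually run the TERMINATE if you’re following along, since we use this stream below):

ksql> SHOW QUERIES;
ksql> TERMINATE CSAS_ORDERS_ENRICHED;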
The DESCRIBE EXTENDED command can be used to see information about a derived stream such as the one created above. As well as the columns involved, we can see information about the underlying topic, and runtime stats such as the number of messages processed and the timestamp of the most recent one.
ksql> DESCRIBE EXTENDED ORDERS_ENRICHED;
Type : STREAM
Key field : O.CUSTOMER_ID
Timestamp field : Not set - using <ROWTIME>
Key format : STRING
Value format : AVRO
Kafka output topic : ORDERS_ENRICHED (partitions: 4, replication: 1)
Field | Type
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
ORDER_TS | VARCHAR(STRING)
PRODUCT | VARCHAR(STRING)
ORDER_TOTAL_USD | VARCHAR(STRING)
ID | INTEGER
FIRST_NAME | VARCHAR(STRING)
LAST_NAME | VARCHAR(STRING)
EMAIL | VARCHAR(STRING)
Queries that write into this STREAM
id:CSAS_ORDERS_ENRICHED - CREATE STREAM ORDERS_ENRICHED AS SELECT O.ORDER_TS,O.PRODUCT,O.ORDER_TOTAL_USD,C.ID, C.FIRST_NAME, C.LAST_NAME, C.EMAIL FROM ORDERS O LEFT OUTER JOIN CUSTOMERS C ON O.CUSTOMER_ID = C.ID;
For query topology and execution plan please run: EXPLAIN <QueryId>
Local runtime statistics
messages-per-sec: 15.08 total-messages: 1500 last-message: 14/03/18 16:15:07 GMT
failed-messages: 0 failed-messages-per-sec: 0 last-failed: n/a
(Statistics of the local KSQL server interaction with the Kafka topic ORDERS_ENRICHED)
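As the output suggests, we can pass the query ID shown above to EXPLAIN to see the full topology and execution plan of the stream processing application that KSQL is running on our behalf:

ksql> EXPLAIN CSAS_ORDERS_ENRICHED;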
Underneath every persisted KSQL stream or table query (i.e. CSAS or CTAS) is a Kafka topic. This is just a Kafka topic like any other:
$ kafka-avro-console-consumer \
--bootstrap-server localhost:9092 \
--property schema.registry.url=http://localhost:8081 \
--topic ORDERS_ENRICHED --max-messages=1 --from-beginning|jq '.'
{
"ORDER_TS": {
"string": "2018-03-13T01:50:53Z"
},
"PRODUCT": {
"string": "Onions - Spanish"
},
"ORDER_TOTAL_USD": {
"string": "9.44"
},
"ID": {
"int": 115
},
"FIRST_NAME": {
"string": "Alexandr"
},
"LAST_NAME": {
"string": "Willcot"
},
"EMAIL": {
"string": "awillcot36@facebook.com"
}
}
Processed a total of 1 messages