Streaming Data from Oracle using Oracle GoldenGate and the Connect API in Kafka

The Connect API in Kafka is part of the Confluent Platform, providing a set of connectors and a standard interface with which to ingest data to Apache Kafka, and store or process it the other end. Initially launched with a JDBC source and HDFS sink, the list of connectors has grown to include a dozen certified connectors, and twice as many again ‘community’ connectors. These cover technologies such as MongoDB, InfluxDB, Kudu, MySQL – and of course as with any streaming technology, twitter, the de-facto source for any streaming how-to. Two connectors of note that were recently released are for Oracle GoldenGate as a source, and Elasticsearch as a sink. In this article I’m going to walk through how to set these up, and demonstrate how the flexibility and power of the Kafka Connect platform can enable rapid changes and evolutions to the data pipeline.

Oracle GoldenGate Kafka Connect

The above diagram shows an overview of what we’re building. Change Data Capture (CDC) on the database streams every single change made to the data over to Kafka, from where it is streamed into Elasticsearch. Once in Elasticsearch it can be viewed in tools search as Kibana, for search and analytics:

Oracle GoldenGate Elastic Search

Oracle GoldenGate Kibana

Oracle GoldenGate (OGG) is a realtime data replication tool, falling under the broad umbrella of Change Data Capture (CDC) software, albeit at the high end in terms of functionality. It supports multiple RDBMS platforms, including – obviously – Oracle, as well as DB2, MySQL, and SQL Server. You can find the full certification list here. It uses log-based technology to stream all changes to a database from source, to target – which may be another database of the same type, or a different one. It is commonly used for data integration, as well as replication of data for availability purposes.

In the context of Kafka, Oracle GoldenGate provides a way of streaming all changes made to a table, or set of tables, and making them available to other processes in our data pipeline. These processes could include microservices relying on an up-to-date feed of data from a particular table, as well as persisting a replica copy of the data from the source system into a common datastore for analysis alongside data from other systems.

Elasticsearch is an open-source distributed document store, used heavily for both search, and analytics. It comes with some great tools including Kibana for data discovery and analysis, as well as a Graph tool. Whilst Elasticsearch is capable of being a primary data store in its own right, it is also commonly used as a secondary store in ordier to take advantage of its rapid search and analytics capabilities. It is the latter use-case that we’re interested in here – using Elasticsearch to store a copy of data produced in Oracle.

Confluent’s Elasticsearch Connector is a source-available connector plug-in for the Connect API in Kafka that sends data from Kafka to Elasticsearch. It is highly efficient, utilising Elasticsearch’s bulk API. It also supports all Elasticsearch’s data types which it automatically infers, and evolves the Elasticsearch mappings from the schema stored in Kafka records.

Oracle GoldenGate can be used with Kafka to directly stream every single change made to your database. Everything that happens in the database gets recorded in the transaction log (OGG or not), and OGG takes that and sends it to Kafka. In this blog we’re using Oracle as the source database, but don’t forget that Oracle GoldenGate supports many sources. To use Oracle GoldenGate with Kafka, we use the “Oracle GoldenGate for Big Data” version (which has different binaries). Oracle GoldenGate has a significant advantage over the JDBC Source Connector for the Connect API in that it is a ‘push’ rather than periodic ‘pull’ from the source, thus it :

Has much lower latency
Requires less resource on the source database, since OGG mines the transaction log instead of directly querying the database for changes made based on a timestamp or key.
Scales better, since entire schemas or whole databases can be replicated with minimal configuration changes. The JDBC connector requires each table, or SQL statement, to be specified.

Note that Oracle Golden Gate for Big Data also has its own native Kafka Handler, which can produce data in various formats directly to Kafka (rather than integrating with the Kafka Connect framework).

Environment

I’m using the Oracle BigDataLite VM 4.5 as the base machine for this. It includes Oracle 12c, Oracle GoldenGate for Big Data, as well as a CDH installation which provides HDFS and Hive for us to also integrate with later on.

On to the VM you need to also install:

Confluent Plaform 3.0
Oracle GoldenGate Kafka Connect plugin
Elasticsearch Kafka Connect plugin
Elasticsearch 2.4

To generate the schema and continuous workload, I used Swingbench 2.5.

For a step-by-step guide on how to set up these additional components, see this gist.

Starting Confluent Platform

There are three processes that need starting up, and each retains control of the session, so you’ll want to use screen/tmux here, or wrap the commands in nohup [.. command ..] & so that they don’t die when you close the window.

On BigDataLite the Zookeeper service is already installed, and should have started at server boot:

[oracle@bigdatalite ~]$ sudo service zookeeper-server status
zookeeper-server is running

If it isn’t running, then start it with sudo service zookeeper-server start.

Next start up Kafka:

# On BigDataLite I had to remove this folder for Kafka to start
sudo rm -r /var/lib/kafka/.oracle_jre_usage
sudo /usr/bin/kafka-server-start /etc/kafka/server.properties

and finally the Schema Registry:

sudo /usr/bin/schema-registry-start /etc/schema-registry/schema-registry.properties

Note that on BigDataLite the Oracle TNS Listener is using port 8081 – the default for the Schema Registry – so I amended /etc/schema-registry/schema-registry.properties to change

listeners=http://0.0.0.0:8081

listeners=http://0.0.0.0:18081

Configuring Oracle GoldenGate to send transactions to the Connect API in Kafka

Oracle GoldenGate (OGG) works on the concept of an Extract process which reads the source-specific transaction log and writes an OGG trail file in a generic OGG format. From this a Replicat process reads the trail file and delivers the transactions to the target.

In this example we’ll be running the Extract against Oracle database, specifically, the SOE schema that Swingbench generated for us – and which we’ll be able to generate live transactions against using Swingbench later on.

The Replicat will be sending the transactions from the trail file over to Kafka Connect.

I’m assuming here that you’ve already successfully defined and set running an extract against the Swingbench schema (SOE), with a trail file being delivered to /u01/ogg-bd/dirdat. For a step-by-step guide on how to do this all from scratch, see here.

You can find information about the OGG-Kafka Connect adapter in the README available as part of the download.

To use it, first configure the replicat and supporting files as shown.

Replicat parametersCreate /u01/ogg-bd/dirprm/rconf.prm with the following contents:

REPLICAT rconf
TARGETDB LIBFILE libggjava.so SET property=dirprm/conf.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 1000
MAP *.*.*, TARGET *.*.*;

Handler configurationEdit the existing /u01/ogg-bd/dirprm/conf.props and amend gg.classpath as shown below. The classpath shown works for BigDataLite – on your own environment you need to make the necessary jar files available per the dependencies listed in the README (available as part of the download).
```
gg.handlerlist=confluent
```

#The handler properties gg.handler.confluent.type=oracle.goldengate.kafkaconnect.KafkaConnectHandler gg.handler.confluent.kafkaProducerConfigFile=confluent.properties gg.handler.confluent.mode=tx gg.handler.confluent.sourceRecordGeneratorClass=oracle.goldengate.kafkaconnect.DefaultSourceRecordGenerator

#The formatter properties gg.handler.confluent.format=oracle.goldengate.kafkaconnect.formatter.KafkaConnectFormatter gg.handler.confluent.format.insertOpKey=I gg.handler.confluent.format.updateOpKey=U gg.handler.confluent.format.deleteOpKey=D gg.handler.confluent.format.treatAllColumnsAsStrings=false gg.handler.confluent.format.iso8601Format=false gg.handler.confluent.format.pkUpdateHandling=abend

goldengate.userexit.timestamp=utc goldengate.userexit.writers=javawriter javawriter.stats.display=TRUE javawriter.stats.full=TRUE

#Set the classpath here gg.classpath=dirprm/:/u01/ogg-bd/ggjava/resources/lib*:/usr/share/java/kafka-connect-hdfs/:/usr/share/java/kafka/

javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm

Note the gg.log.level setting – this can be very useful to switch to DEBUG if you’re investigating problems with the handler.

Now we can add the replicat. If not already, launch ggsci from the ogg-bd folder:

Note that on BigDataLite 4.5 VM there are two existing replicats configured, RKAFKA and RMOV. You can ignore these, or delete them if you want to keep things simple and clear.

Testing the Replication

We’ll run Swingbench in a moment to generate some proper throughput, but let’s start with a single transaction to check things out.

Connect to Oracle and insert a row, not forgetting to commit the transaction (he says, from frustrating experience 😉 )

Now if you list the topics defined within Kafka, you should see a new one has been created, for the SOE.LOGON table:

Hit Ctrl-C to cancel the consumer — otherwise it’ll sit there and wait for additional messages to be sent to the topic. Useful for monitoring when we’ve got lots of records flowing through, but not so useful now.

You can then pipe the output of kafka-console-consumer through jq to pretty-print it:

or even show just sections of the message using jq’s syntax (explore it here):

So we’ve got successful replication of Oracle transactions into Kafka, via Oracle GoldenGate. Now let’s bring Elasticsearch into the mix.

Configuring Elasticsearch

We’re going to use Elasticsearch as a destination for storing the data coming through Kafka from Oracle. Each Oracle table will map to a separate Elasticsearch index. In Elasticsearch an ‘index’ is roughly akin to an RDBMS table, a ‘document’ to a row, a ‘field’ to a column, and a ‘mapping’ to a schema.

Elasticsearch itself needs no configuration out of the box if you want to just get up and running with it, you simply execute it:

Note that this wouldn’t suffice for a Production deployment, in which you’d want to allocate heap space, check open file limits, configure data paths, and so on.

With Elasticsearch running, you can then load Kopf, which is a web-based admin plugin. You’ll find it at http://<server>:9200/_plugin/kopf

From Kopf you can see which nodes there are in the Elasticsearch cluster (just the one at the moment, with a random name inspired by Marvel), along with details of the indices as they’re created – in the above screenshot there are none yet, because we’ve not loaded any data.

Setting up the Elasticsearch Connect API handler

Create a configuration file for the Elasticsearch Kafka Connect API handler. I’ve put it in with the Elasticsearch configuration itself at /opt/elasticsearch-2.4.0/config/elasticsearch-kafka-connect.properties; you can use other paths if you want.

The defaults mostly suffice to start with, but we do need to update the topics value:

Because Elasticsearch indices cannot be uppercase, we need to provide a mapping from the Kafka topic to the Elasticsearch index, so add a configuration to the file:

If you don’t do this you’ll get an InvalidIndexNameException. You also need to add

Note that the global key.ignore is currently ignored if you are also overriding another topic parameter. If you don’t set this flag for the topic, you’ll get org.apache.kafka.connect.errors.DataException: STRUCT is not supported as the document id..

Now we can run our connector. I’m setting the CLASSPATH necessary to pick up the connector itself, as well as the dependecies. I also set JMX_PORT so that the metrics are exposed on JMX for helping with debug/monitoring.

You’ll not get much from the console after the initial flurry of activity, except:

But if you head over to Elasticsearch you should now have some data. In Kopf you’ll see that there are now ‘documents’ in the index:

In addition the header bar of Kopf has gone a yellow/gold colour, because your Elasticsearch cluster is now in “YELLOW” state – we’ll come back to this and the cause (unassigned shards) shortly.

Interactions with Elasticsearch are primarily through a REST API, which you can use to query the number of records in an index:

This is looking good! But … there’s a wrinkle. Let’s fire up Kibana, an analytical tool for data in Elasticsearch, and see why.

Go to http://<server>:5601/ and the first thing you’ll see (assuming this is the first time you’ve run Kibana) is this:

Elasticseach, Index Mappings, and Dynamic Templates

Kibana is a pretty free-form analysis tool, and you don’t have to write SQL, define dimensions, and so on — but what you do have to do is tell it where to find the data. So let’s specify our index name, which in this example is soe.logon:

Note that the Time-field name remains blank. If you untick Index contains time-based events and then click Create you’ll see the index fields and their types:

Columns that are timestamps are coming across as strings – which is an issue here, because Time is one of the dimensions by which we’ll pretty much always want to analyse data, and if it’s not present Kibana (or any other user of the Elasticsearch data) can’t do its clever time-based filtering and aggregation, such as this example taken from another (time-based) Elasticsearch index:

As a side note, the schema coming through from OGG Kafka Connect connector is listing these timestamp fields as strings, as we can see with a bit of fancy jq processing to show the schema entry for one of the fields (op_ts):

This string-based schema is actually coming through from the OGG replicat itself – whilst the Connect API handler interprets and assumes the datatypes of columns such as numbers, it doesn’t for timestamps.

So – how do we fix these data types in Elasticsearch so that we can make good use of the data? EnterDynamic Templates. These enable you to specify the mapping (similar to a schema) of an index prior to it being created for a field for the first time, and you can wildcard field names too so that, for example, anything with a _ts suffix is treated as a timestamp data type.

To configure the dynamic template we’ll use the REST API again, and whilst curl is fine for simple and repeated command line work, we’ll switch to the web-based Elasticsearch REST API client, Sense. Assuming that you installed it following the process above you can access it at http://<server>:5601/app/sense.

Click Get to work to close the intro banner, and in the main editor paste the following JSON (gist here)

Put the cursor over each statement and click the green play arrow that appears to the right of the column.

For the DELETE statement you’ll get an error the first time it’s run (because the index template isn’t there to delete), and the PUT should succeed with

Now we’ll delete the index itself so that it can be recreated and pick up the dynamic mappings. Here I’m using curl but you can run this in Sense too if you want.

Watch out here, because Elasticsearch will delete an index before you can say ‘oh sh….’ — there is no “Are you sure you want to drop this index?” type interaction. You can even wildcard the above REST request for real destruction and mayhem – action.destructive_requires_name can be set to limit this risk.

So, to recap – we’ve successfully run Kafka Connect to load data from a Kafka topic into an Elasticsearch index. We’ve taken that index and seen that the field mappings aren’t great for timestamp fields, so have defined a dynamic template in Elasticsearch so that new indices created will set any column ending _ts to a timestamp. Finally, we deleted the existing index so that we can use the new template from now on.

Let’s test out the new index mapping. Since we deleted the index that had our data in (albeit test data) we can take advantage of the awesomeness that is Kafka by simply replaying the topic from the start. To do this change the name value in the Elasticsearch connector configuration (elasticsearch-kafka-connect.properties), e.g. add a number to its suffix:

If you’re running Kafka Connect in standalone mode then you could also just delete the offsets file to achieve the same.

Whilst in the configuration file you need to also add another entry, telling the connector to ignore the schema that is passed from Kafka and instead dynamically infer the types (as well as honour the dynamic mappings that we specified)

Now restart the connector (make sure you did delete the Elasticsearch index per above, otherwise you’ll see no difference in the mappings)

Note that the two timestamp columns are now date type. If you still see them as strings, make sure you’ve set the topic.schema.ignore configuration as shown above in the Kafka Connect properties for the Elasticsearch connector.

Looking at the row count, we can see that all the records from the topic have been successfully replayed from Kafka and loaded into Elasticsearch. This ability to replay data on demand whilst developing and testing the ingest into a subsequent pipeline is a massive benefit of using Kafka!

Over in Kibana head to the Index Patterns setting page (http://<server>:5601/app/kibana#/settings/indices), or from the Settings -> Indices menu buttons at the top. If you already have the index defined here then delete it – we want Kibana to pick up the new shiny version we’ve created because it includes the timestamp columns. Now configure a new index pattern:

Note that the Time-field name field is now populated. I’ve selected op_ts. Click on Create and then go to the Discover page (from the option at the top of the page). You may well see “No results found” – if so use the button in the top-right of the page to change the time window to broaden it to include the time at which you inserted record(s) to the SOE.LOGON table in the testing above.

To explore the data further you can click on the add button that you get when hovering over each of the fields on the left of the page, which will add them as columns to the main table, replacing the default _source (which shows all fields):

In this example you can see that there was quite a few testing records inserted (op_type = I), with nothing changing between than the LOGON_DATE.

Connector errors after adding dynamic templates

then check the Elasticsearch log/stdout, where you’ll find more details. This kind of thing that can cause problems would be an index not deleted before re-running it with the new template, as well as a date format in the template that doesn’t match the data.

Running a Full Swingbench Test

Configuration

If you’ve made it this far, congratulations! Now we’re going to set up the necessary configuration to run Swingbench. This will generate a stream of changes to multiple tables, enabling us to get a feel for how the pipeline behaves in ‘real world’ conditions.

Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

TABLE_NAME

WAREHOUSES PRODUCT_INFORMATION PRODUCT_DESCRIPTIONS ORDER_ITEMS ORDERS ORDERENTRY_METADATA LOGON INVENTORIES CUSTOMERS CARD_DETAILS ADDRESSES

The OGG replication is already defined with a wildcard, to pick up all tables in the SOE schema:

The OGG Kafka Connect handler will automatically create a topic for every table that it receives from OGG. So all we need to do now is add each table to the Elasticsearch Sink configuration. For this, I created a second version of the configuration file, at /opt/elasticsearch-2.4.0/config/elasticsearch-kafka-connect-full.properties

Having created the configuration, run the connector. If the previous connector from the earlier testing is running then stop it first, otherwise you’ll get a port clash (and be double-processing the ORCL.SOE.LOGONtopic).

Running Swingbench

Each of the columns with abbreviated headings are different transactions run, and as soon as you see numbers above zero in them it indicates that you should be getting data in the Oracle tables, and thus through into Kafka and Elasticsearch.

Auditing the Pipeline

But, this includes the records that were pre-seeded by Swingbench before we set up the OGG extract. How do we know how many have been read by GoldenGate since, and should therefore be downstream on Kafka, and Elasticsearch? Enter logdump. This is a GoldenGate tool that gives a commandline interface to analysing the OGG trail file itself. You can read more about it here, here, and here.

And then launch logdump (optionally, but preferably, with rlwrap to give command history and search):

and then filter to only show records relating to the table we’re interested in:

Here we can see that there are a total of 45 insert/update records that have been captured.

So from OGG, the data flows via the Kafka Connect connect into a Kafka topic, one per table. We can count how many messages there are on the corresponding Kafka topic by running a console consumer, redirecting the messages to file (and using & to return control to the console):

and then issue a wc to count the number of lines currently in the resulting file:

Since the console consumer process is still running in the background (type fg to bring it back to the foreground if you want to cancel it), you can re-issue the wc as required to see the current count of messages on the topic.

Finally, to see the number of documents on the corresponding Elasticsearch index:

Here we’ve proved that the number of records written by Oracle are making it all the way through our pipeline.

Monitoring the Pipeline

Kafka and Kafka Connect expose metrics through JMX. There’s a variety of tools for capturing, persisting, and visualising this, such as detailed here. For now, we’ll just use JConsole to inspect the metrics and get an idea of what’s available.

You’ll need a GUI for jconsole, so either a desktop session on the server itself, X11 forwarded, or you can also run JConsole from a local machine (it’s bundled with any JDK) and connect to the remote JMX. In this example I simply connected to the VM’s desktop and ran JConsole locally there. You launch it by running it from the shell prompt:

From here I connected to the ‘Remote Process’ on localhost:4243 to access the Kafka server process (because it’s running as root the jconsole process (running as a non-root user) can’t connect to it as a ‘Local Process’). The port 4243 is what I specified as an environment variable as part of the kafka process launch.

On the MBeans tab there are a list of MBeans under which the bespoke application metrics (as opposed to JVM ones like heap memory usage) are found. For example, the rate at which data is being received and sent from the cluster:

By default when you see an attribute for an MBean is it point-in-time – doubleclick on it to make it a chart that then tracks subsequent changes to the number.

By connecting to localhost:4243 (press Ctrl-N for a new connection in the same JConsole instance) you can inspect the metrics from the Kafka Connect elasticsearch sink

You can also access JMX metrics for the OGG Kafka handler by connecting to the local processs (assuming you’re running JConsole locally). To find the PID for the RCONF replicat, run:

Then select that PID from the JConsole connection list – note that the process name may show as blank.

The producer stats show metrics such as the rate at which topic is being written to:

Conclusion

In this article we’ve seen how to stream transactions from a RDBMS such as Oracle into Kafka and out to a target such as Elasticsearch, utilising the Kafka Connect platform and its standardised connector framework. We also saw how to validate and audit the pipeline at various touchpoints, as well as a quick look at accessing the JMX metrics that Kafka provides.

Streaming Data from Oracle using Oracle GoldenGate and the Connect API in Kafka

Get started free with Confluent

Watch demo: Kafka streaming in 10 minutes

Written By

Environment

Starting Confluent Platform

Configuring Oracle GoldenGate to send transactions to the Connect API in Kafka

Testing the Replication

Configuring Elasticsearch

Setting up the Elasticsearch Connect API handler

Elasticseach, Index Mappings, and Dynamic Templates

Connector errors after adding dynamic templates

Running a Full Swingbench Test

Configuration

TABLE_NAME

Running Swingbench

Auditing the Pipeline

COUNT(*)

Monitoring the Pipeline

Conclusion

Get started free with Confluent

Watch demo: Kafka streaming in 10 minutes

Did you like this blog post? Share it now

Introducing KIP-848: The Next Generation of the Consumer Rebalance Protocol

How to Query Apache Kafka® Topics With Natural Language

Environment

Starting Confluent Platform

Configuring Oracle GoldenGate to send transactions to the Connect API in Kafka

Testing the Replication

Configuring Elasticsearch

Setting up the Elasticsearch Connect API handler

Elasticseach, Index Mappings, and Dynamic Templates

Connector errors after adding dynamic templates

Running a Full Swingbench Test

Configuration

TABLE_NAME

Running Swingbench

Auditing the Pipeline

COUNT(*)

Monitoring the Pipeline

Conclusion

Get started free with Confluent

Watch demo: Kafka streaming in 10 minutes

Did you like this blog post? Share it now

Subscribe to the Confluent blog

Introducing KIP-848: The Next Generation of the Consumer Rebalance Protocol

How to Query Apache Kafka® Topics With Natural Language