Observability is the ability to measure the internal states of a system by examining its outputs.
For application performance monitoring (APM), observability means analyzing telemetry data – logs, traces, and metrics. This post will concentrate mostly on observability in the context of APM.
Also rising in popularity is data observability: the ability to observe the flow of business data through a system from end to end.
“Observability is about getting the right information at the right time into the hands of the people who have the ability and responsibility to do the right thing. Helping them make better technical and business decisions driven by real data, not guesses, or hunches, or shots in the dark. Time is the most precious resource you have — your own time, your engineering team’s time, your company’s time.”
Observability is important because it allows you to spot bottlenecks, resolve outages, and glean valuable insights about how your software behaves.
It is possible to observe complex distributed systems by correlating telemetry data – traces, metrics, and logs.
Here is a simple example where you can observe the flow of execution that happens when you call a method called requestStarted, with the entire trace broken down into its constituent spans:
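As a rough, untested sketch of how a trace like this is produced in code (this assumes the OpenTelemetry Java API is on the classpath; the child span names here are illustrative, not the app's actual ones), each unit of work becomes a child span nested under the parent:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Fragment, not the lab's actual code.
Tracer tracer = GlobalOpenTelemetry.getTracer("hello-app");

Span parent = tracer.spanBuilder("requestStarted").startSpan();
try (Scope ignored = parent.makeCurrent()) {
    // Work done while the parent is "current" shows up
    // as child spans in the trace waterfall.
    Span child = tracer.spanBuilder("loadData").startSpan(); // illustrative name
    try (Scope inner = child.makeCurrent()) {
        // ... do work ...
    } finally {
        child.end();
    }
} finally {
    parent.end();
}
```

Each span records its own start and end timestamps, which is what lets the trace view show how the total latency breaks down across the constituent operations.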
With this information, it becomes possible to find bugs, isolate performance bottlenecks, or set intelligent alerts.
Sometimes metrics go wonky and make you raise an eyebrow, at which time you’ll check the traces. Sometimes traces show increased latency, which makes you check on the metrics to see what they might tell you. The application may be emitting logs that provide more information about what was going on at the time of degraded performance. In this way, we start to see how analyzing metrics, traces, and logs together using an observability platform makes it much easier to understand and troubleshoot your complex distributed systems. Here are some popular application performance monitoring (APM) observability platforms:
Several of these companies use Confluent to move and process telemetry data at high throughput and low latency, and you can use Confluent’s fully managed connectors to more easily integrate with the observability backend of your choice. Confluent has fully managed sink connectors for Datadog, Splunk, and Elastic, with many more being added all the time.
Confluent also offers a metrics API to give users observability into their own Confluent Cloud usage, with first-class integrations to Datadog and Grafana Cloud.
Logs, metrics, and traces pertain to observability as it relates to application performance monitoring (APM). However, businesses are also highly interested in observing how business data flows end-to-end. This is called data observability and is often spoken about in the context of “data governance”.
For example, Confluent Cloud offers a powerful Stream Lineage interface to observe data as it flows throughout a business.
The purpose of this lab is to explore a working example of how OpenTelemetry enables metrics and traces in Java using the OpenTelemetry Java agent. What is nice about the Java agent is that it sends telemetry data automatically; you simply instantiate Meter and Tracer objects and set a few environment variables.
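As a hedged illustration of that pattern (an untested fragment assuming the OpenTelemetry API dependency; the instrumentation-scope name and endpoint value are placeholders, not necessarily what this lab uses), the setup amounts to:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Tracer;

// The Java agent wires up OTLP exporters from environment variables such as:
//   OTEL_SERVICE_NAME=hello-app
//   OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
// so application code only needs handles to a Tracer and a Meter:
Tracer tracer = GlobalOpenTelemetry.getTracer("hello-app");
Meter meter = GlobalOpenTelemetry.getMeter("hello-app");
```

With the agent attached, GlobalOpenTelemetry returns fully configured instances; without it, the same calls return no-op implementations, so instrumented code runs unchanged either way.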
There is a Spring Boot Java application that exposes an endpoint at http://localhost:8888/hello. App metrics and request traces are sent via OTLP (the OpenTelemetry Protocol) to an observability backend (Elastic Observability APM in this case).
Launch the lab environment by clicking [https://gitpod.io/#https://github.com/riferrei/otel-with-java](https://gitpod.io/#https://github.com/riferrei/otel-with-java). On launch, all services are built and started with docker-compose.
Inspect the source code of the HelloApp Java application. Specifically, look at src/main/java/riferrei/otel/java/HelloAppController.java. This is where OpenTelemetry tracing and custom metrics are implemented.
Send GET requests to the Hello app, for example: `curl http://localhost:8888/hello`.
Repeat the previous curl command several times against the /hello endpoint, as well as against other endpoints (requests to other endpoints are expected to result in error responses).
Execute the following echo command and Ctrl+Click the resulting URL to open the traces for the hello-app in the Kibana UI.
NOTE: The URL will look something like https://5601-aquamarine-python-rsq28cwb.ws-us17.gitpod.io/app/apm/services/hello-app/transactions
Scroll down to the bottom of the page and select the /hello endpoint from the Transactions section.
Scroll down again to see the trace sampling, which shows latency measurements at various stages of execution. TIP: These trace samples are a great tool for understanding what is happening in a transaction. In more complex applications, a trace sample would show the flow of execution across many microservices, helping you identify bugs and performance bottlenecks much more quickly.
Execute the following echo command and Ctrl+Click the resulting URL to open the "discover" area of the Kibana UI, where OpenTelemetry metrics will be automatically discovered.
NOTE: The URL will look something like https://5601-aquamarine-python-rsq28cwb.ws-us17.gitpod.io/app/discover. Ignore warnings.
Investigate custom.metric.heap.memory and custom.metric.number.of.exec, the custom metrics defined in Constants.java and HelloAppController.java.
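The exact definitions live in those two source files; as a hedged sketch of how metrics like these are typically registered with the OpenTelemetry metrics API (an untested fragment, not the lab's actual code), a counter covers the execution count and an asynchronous gauge covers heap usage:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

Meter meter = GlobalOpenTelemetry.getMeter("hello-app");

// Counter incremented on each handled request.
LongCounter numberOfExec =
    meter.counterBuilder("custom.metric.number.of.exec").build();
numberOfExec.add(1);

// Asynchronous gauge that samples current heap usage
// each time metrics are collected.
meter.gaugeBuilder("custom.metric.heap.memory")
     .ofLongs()
     .buildWithCallback(measurement ->
         measurement.record(Runtime.getRuntime().totalMemory()
                          - Runtime.getRuntime().freeMemory()));
```

The callback style for the gauge means the value is read on the exporter's schedule rather than pushed on every request, which is the usual choice for resource-level measurements like heap memory.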
NOTE: This lab comes from [https://github.com/riferrei/otel-with-java](https://github.com/riferrei/otel-with-java), created by Ricardo Ferreira. There is a sibling repository at [https://github.com/riferrei/otel-with-golang](https://github.com/riferrei/otel-with-golang). The main difference between the Java and Go implementations is that the OpenTelemetry Java agent creates trace spans automatically, while Go requires more manual instrumentation. There is an associated in-depth video walkthrough from SREcon 2021.
In this lab, you explored how to instrument a Java application using OpenTelemetry and send those app metrics and request traces to an observability backend for analysis.
From the original creators of Apache Kafka, learn why Confluent’s data streaming technologies are used by 70% of the Fortune 100. Build real-time data pipelines, unlock real-time data governance, and stream data from infinite sources for seamless data observability, monitoring, and metrics on any cloud.