The 2015 State of DevOps Report, published by Puppet Labs, found that high-performing teams can resolve incidents 168 times faster than other teams. Their monitoring systems enable ops to identify and resolve issues quickly. Containers are synonymous with production these days, but monitoring this new technology is not much different from monitoring other infrastructure components.
High-level monitoring that is implemented and consistently applied makes it easier for engineers to diagnose and resolve issues. Choosing a strategy involves understanding the technical requirements, desired outcomes, and tradeoffs. Let’s start by looking at what to monitor before moving on to possible implementations.
What to Monitor
Containers are no different from traditional processes, except for the extra machinery that isolates them from other processes. This means you’ll need common metrics like memory utilization and CPU usage. You’ll also need container-specific metrics such as the CPU limit (how much of the host CPU the container is allowed to use) and the memory limit (how much host memory the container is allowed to use). Together, these four metrics yield important utilization ratios (usage divided by limit), which indicate when to scale up, out, or in.
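To make the utilization ratios concrete, here is a minimal Python sketch. The `ContainerSample` type, its field names, and the 0.8 and 0.2 thresholds are illustrative assumptions, not values taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ContainerSample:
    """One reading of the four metrics (hypothetical field names)."""
    cpu_used_cores: float
    cpu_limit_cores: float
    mem_used_bytes: int
    mem_limit_bytes: int


def utilization(sample):
    """Return (cpu_ratio, mem_ratio): usage divided by the matching limit."""
    cpu_ratio = sample.cpu_used_cores / sample.cpu_limit_cores
    mem_ratio = sample.mem_used_bytes / sample.mem_limit_bytes
    return cpu_ratio, mem_ratio


def scaling_hint(sample, high=0.8, low=0.2):
    """Suggest a scaling direction from the busier of the two ratios.

    The thresholds are assumed defaults for illustration only.
    """
    peak = max(utilization(sample))
    if peak >= high:
        return "scale up or out"
    if peak <= low:
        return "scale in"
    return "hold"
```

A container using 1.5 of its 2 allowed cores and 256 MiB of a 1 GiB memory limit has ratios of 0.75 and 0.25, so this sketch would suggest holding steady.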
There’s more to the story than just CPU and memory, though. Consider your container infrastructure. If you are using Kubernetes, then you’ll also need telemetry on the cluster itself. The same goes for something like DC/OS or Docker Swarm. You’ll need ratios of cluster-allocatable memory and CPUs, as well as other orchestration-specific metrics. At this level, the ratios tell you when to scale the cluster itself.
Step inside the container for a moment. Let’s assume a container runs an HTTP server. You’ll need to collect standard metrics like request counts, counts of 2xx, 4xx, and 5xx responses, and latency. Assume you have another container processing jobs off a queue. You’ll need to collect metrics such as the number of processed jobs, failed jobs, or retries. These are application-layer metrics, separate from the infrastructure layer, and they are impossible to collect from the container runtime. Instead, the monitoring system must pull them from the process itself or the process must push them to the monitoring system. A good monitoring strategy accounts for different needs at different layers throughout the stack.
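As an illustration of the push model, the following Python sketch formats HTTP metrics in the StatsD line format (`name:value|type`) and sends them over UDP. The metric names, the `web` prefix, and the helper functions are hypothetical; the host and port assume a StatsD-compatible server listening locally on the conventional port 8125.

```python
import socket


def counter(name, value=1):
    """Format a StatsD counter line."""
    return f"{name}:{value}|c"


def timer(name, millis):
    """Format a StatsD timing line, in milliseconds."""
    return f"{name}:{millis}|ms"


def http_metrics(status, latency_ms, prefix="web"):
    """Build the metric lines for one handled HTTP request.

    The metric names here are illustrative, not a standard.
    """
    status_class = f"{status // 100}xx"  # 503 -> "5xx"
    return [
        counter(f"{prefix}.requests"),
        counter(f"{prefix}.responses.{status_class}"),
        timer(f"{prefix}.latency", latency_ms),
    ]


def push(lines, host="127.0.0.1", port=8125):
    """Send each line as its own UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
```

An HTTP handler could call `push(http_metrics(200, 12))` after each response. Because UDP is fire and forget, a monitoring outage does not block the application, at the cost of possibly dropping metrics.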
There is also nonnumeric data coming from multiple layers in the stack. It may be an error or event like “could not connect to database,” “container restart,” or “process thrashing.” These usually arrive as text-based logs and tend to contain genuinely useful information. A good monitoring solution takes this information into account as well.
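One simple way to fold text logs into a monitoring system is to count occurrences of known messages so they can be charted and alerted on like any other metric. The Python sketch below assumes plain substring matching and uses hypothetical event names; real log pipelines typically offer richer parsing.

```python
from collections import Counter

# Event name -> text to look for (both are illustrative assumptions).
PATTERNS = {
    "db_connection_error": "could not connect to database",
    "container_restart": "container restart",
    "process_thrashing": "process thrashing",
}


def count_events(lines):
    """Count how many log lines mention each known event."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()  # match case-insensitively
        for event, needle in PATTERNS.items():
            if needle in lowered:
                counts[event] += 1
    return counts
```

The resulting counts can then be emitted as counters alongside the numeric metrics from the other layers.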
Comparing Monitoring Systems
Good monitoring systems have a few things in common. You can identify which system is right for you by considering the following:
- How easy is it to add instrumentation to existing code? How good are the existing libraries for the languages in use? Are different data types supported?
- How easy is it to configure alarms on different data? How do alarms connect with the on-call team?
- How well does the visualization system allow you to explore the data? Can you create ad-hoc charts and/or saved dashboards?
- How many existing infrastructure components, especially container orchestration, are supported?
These questions provide a strong framework for assessing how telemetry works across the stack, along with visualization and alerting. Here are five options for container monitoring that do well in all areas.
Datadog uses an agent approach to pull metrics from various components. It supports Docker, Kubernetes, DC/OS, Docker Swarm, and a host of other common components. Datadog provides all the key metrics out of the box and integrates easily with different layers in the stack. The Datadog web application features dashboards and alerts, and the agent is open source and highly configurable. Engineers can write custom integrations to create events from text-based logs and custom metrics. Datadog also provides DogStatsD, a StatsD server that reports to Datadog. This is great for teams already using StatsD or looking for an easy application-layer telemetry solution.
3. ELK (Elasticsearch, Logstash, Kibana)
ELK is a flexible, open-source solution that handles text streams and time-series data. The stack comprises three components (Elasticsearch, Logstash, and Kibana), each responsible for a different stage in the data pipeline. This makes it easier to integrate with a variety of infrastructure components without imposing requirements on them.
The biggest tradeoff with this setup is the maintenance and manual work required to run all these components in large environments. Scaling and maintaining Elasticsearch can be a full-time job. Adding a buffer or queuing mechanism, such as Kafka or Redis, is a must to persist data. Archiving, alerting, and security all need to be added to the stack if you want a production-grade monitoring system.
Hosted ELK solutions such as Logz.io or Elastic Cloud are a good option if you want to save on resources, as they will operate your Elasticsearch cluster and configure all components for you. Your only responsibility is pushing data to the cluster, which is easy.
Sysdig is a general-purpose monitoring solution, and Sysdig Cloud provides a cloud-hosted monitoring system. The Sysdig CLI may be installed and run on any Linux system to collect and analyze data for specific time windows. Sysdig Cloud uses an agent approach similar to Datadog’s. The product offers Docker monitoring, alerting, and troubleshooting, with intelligent Kubernetes, Mesos, and Swarm integration, as well as visualization.
5. Roll Your Own!
There’s always the option of rolling your own solution. You may choose this if none of the existing options fits your requirements or if paid solutions are too costly. Rolling your own solution will likely require some of the components already discussed. You’ll need a way to collect and aggregate time-series data across multiple layers in the stack. Tools like collectd handle infrastructure data. Something flexible like StatsD fits into the application layer. The data must be stored for analysis, visualization, and alerting. You may opt for something like Graphite (and hosted variations) for use with StatsD. InfluxDB is another common choice for storing time-series data. You’ll also need a way to visualize data and create alerts. There are different architectures for this as well. One approach is to push everything to a system like Riemann, which can handle aggregations, alerting, and data forwarding. Rolling your own solution requires careful planning at all layers of the stack and more maintenance than other solutions.
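To give a sense of the plumbing involved, here is a minimal Python sketch that writes data points to Graphite using its plaintext protocol, in which each line carries a metric path, a value, and a Unix timestamp. The metric path and function names are hypothetical, and the host and port assume a Carbon plaintext listener running locally on its conventional port 2003.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as 'path value timestamp' plus a newline."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="127.0.0.1", port=2003):
    """Send pre-formatted lines over a single TCP connection."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

For example, `send_to_graphite([graphite_line("containers.web.cpu_ratio", 0.75)])` would record one sample. Storage is only one piece; aggregation, visualization, and alerting still have to be built or integrated around it.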
How to Choose
Everything is a tradeoff in engineering, and choosing your monitoring approach is no different. You must consider which aspects are most important, such as real-time log streams, time-series data visualization, off-the-shelf integrations, or flexibility for custom integrations. Ultimately, your solution should provide, at a minimum:
- key container metrics for CPU, memory, I/O;
- container orchestration integration such as Kubernetes, DC/OS, or Docker Swarm;
- off-the-shelf integrations with common components like Redis (don’t spend time reinventing the wheel for common cases); and
- time-series visualization and alerting.
All the options discussed here meet the above criteria. It’s up to you to decide which combination of tradeoffs fits your team. Beware of rolling your own if this is your first time; you’re likely to create more problems than you solve, and that’s the exact opposite of what you need in a monitoring system.
Put the Monitoring System into Production as Soon as Possible
Strong production environments revolve around a well-oiled monitoring system. But your job doesn’t end there. Your monitoring system keeps production safe, so experiment and improve it as your stack evolves. Your on-call team will thank you for it.