Okay, so you’ve successfully deployed your new or legacy application to a Docker container. You have also wired up its dependencies and are ready to deploy to an integration environment, so now what? Building your application through continuous integration is a great way to keep a project on track, but how does your application deal with failure? If chaos engineering is an unfamiliar concept, be sure to read up on the Principles of Chaos Engineering. The idea is to uncover weaknesses within a distributed system by injecting failures in a controlled environment, so that you can confirm the system behaves predictably.
There are a few tools available specifically for chaos testing, including Chaos Monkey, Pumba, and Gremlin. Since we are working with containers, this post focuses on Pumba. In their own words, “Pumba is a resilience testing tool, that helps applications tolerate random Docker container failures: process, network and performance.”
Getting Started
To follow along, you can clone the associated git repository and initialize the project containers. It is important that you ensure that the application is functioning properly prior to executing any chaos testing.
$ git clone https://github.com/n3integration/dockerize-java.git
$ cd dockerize-java && git checkout chaos
$ ./gradlew clean build
$ docker-compose up -d db
$ ./gradlew flywayMigrate
$ docker-compose up -d www
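One way to confirm that the application is functioning before introducing any chaos is to poll it until it answers. The following is a minimal sketch, not part of the repository: `wait_until` is a hypothetical helper that retries any command a fixed number of times, and the URL in the usage comment is an assumption about where the `www` service listens.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Returns 0 on the first success and 1 if every attempt fails.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example usage (the URL is an assumption; substitute your service's address):
#   wait_until 30 2 curl -fsS http://localhost:8080/ || echo "service is not ready"
```

Passing the probe as a command keeps the helper independent of curl, so the same loop can wrap a database check or any other readiness test.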
Traffic Generation
We’ll also need a tool to generate a constant rate of traffic. For this purpose, although it may be overkill, we’ll use Vegeta. I’ve provided a Vegeta Docker image which supports two environment variables: DURATION and TARGET. DURATION is the amount of time for the traffic generation to run, which can be given in seconds or minutes (e.g., 1s, 1m). TARGET is the URL of the web service under test. To begin generating traffic to the web service, you can run the following command.
$ docker-compose up -d generator
Chaos Testing with Pumba
Pumba provides five commands that are used to simulate error conditions within a controlled environment.
kill
The kill command can be used to simulate containers being randomly terminated. The command allows users to provide a custom signal. If left unspecified, SIGKILL is sent to the container process.
netem
The netem command is one of the more advanced testing capabilities provided by Pumba, and it is used to emulate misbehaving networks. It supports rate limiting, delaying, reordering, duplicating, and dropping packets. This matters because real networks are highly variable, and an application that behaves well on a fast local network may not cope with latency or packet loss.
pause
The pause command pauses running containers whose names match the supplied expression. It works by periodically pausing and then resuming one or more containers for a user-specified duration. This command is especially helpful when you are not running within an orchestration framework, such as Kubernetes or Amazon’s Container Service.
stop
The stop command stops running containers whose names match the supplied expression. First, Pumba will attempt to gracefully terminate the process running within the target container(s). If unable to stop the process, the process is terminated with a SIGKILL signal.
rm
The rm command removes running containers. It is especially destructive because, by default, it also removes any volumes attached to the target container(s).
For more information about Pumba and its capabilities, refer to the project’s source repository.
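To make the commands above more concrete, here is a rough sketch of what a few invocations might look like. The flags are intended to match Pumba’s command-line help, but they may differ between versions, so verify them with `pumba <command> --help` before relying on them. The `chaos` wrapper and the `re2:_db_` container filter are hypothetical and exist only for illustration. The wrapper prints each command instead of running it unless `RUN=1` is set, which is a sensible default given how destructive some of these commands are.

```shell
# Hypothetical wrapper: prints the Pumba command unless RUN=1 is set.
chaos() {
  if [ "${RUN:-0}" = "1" ]; then
    pumba "$@"
  else
    echo "pumba $*"
  fi
}

# Assumed RE2 filter for the database container; adjust to your container names.
DB='re2:_db_'

# Every 30s, send SIGTERM to one randomly chosen matching container.
chaos --random --interval 30s kill --signal SIGTERM "$DB"

# Every 30s, pause the matching containers for 10s.
chaos --interval 30s pause --duration 10s "$DB"

# Ask the containers to stop, waiting up to 10s before they are killed.
chaos stop --time 10 "$DB"

# For one minute, drop roughly 20% of outgoing packets.
chaos netem --duration 1m loss --percent 20 "$DB"

# For one minute, delay outgoing packets by 3000ms with 30ms of jitter.
chaos netem --duration 1m delay --time 3000 --jitter 30 "$DB"
```

The rm command is left out on purpose, since a mistyped filter there can delete data volumes.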
Putting it All Together
To run through a set of tests, you will need at least two separate terminal windows open. In one terminal, run the Chaos Test. Since all tests are being executed without an orchestration framework, we want to run Pumba’s pause command.
$ docker-compose up chaos-pause
In another window, start the traffic generator.
$ docker-compose up generator
You should see output similar to the following once the test has completed.
generator_1 | Requests [total, rate] 3000, 50.02
generator_1 | Duration [total, attack, wait] 1m25.386995253s, 59.979999887s, 25.406995366s
generator_1 | Latencies [mean, 50, 95, 99, max] 2.990878328s, 1.025975081s, 10.16703107s, 14.752185309s, 30.01735129s
generator_1 | Bytes In [total, mean] 103777, 34.59
generator_1 | Bytes Out [total, mean] 0, 0.00
generator_1 | Success [ratio] 61.43%
generator_1 | Status Codes [code:count] 500:1146 0:11 200:1843
generator_1 | Error Set:
generator_1 | 500 Server Error
As you can see, Pumba had a measurable effect on the results: only 61.43% of the requests succeeded. You can also see that the service returns a 500 error when it is unable to connect to the Postgres database, which is less than ideal. Although this is a simplistic case, adding caching to the service could let it keep answering requests while the database is unreachable, which would yield a much higher success rate.
Let’s see how Pumba’s network emulation commands can vary the test results. In one window, run an instance of the chaos-loss container.
$ docker-compose up chaos-loss
Again, in another window, let’s run the traffic generator.
$ docker-compose up generator
Once the test completes, you should again see results similar to the following.
generator_1 | Requests [total, rate] 3000, 50.02
generator_1 | Duration [total, attack, wait] 1m1.017726406s, 59.979999917s, 1.037726489s
generator_1 | Latencies [mean, 50, 95, 99, max] 557.375498ms, 19.780692ms, 3.008849215s, 6.938442708s, 26.792086014s
generator_1 | Bytes In [total, mean] 114851, 38.28
generator_1 | Bytes Out [total, mean] 0, 0.00
generator_1 | Success [ratio] 52.97%
generator_1 | Status Codes [code:count] 200:1589 500:1411
generator_1 | Error Set:
generator_1 | 500 Server Error
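The success ratios in both reports follow directly from the status-code counts, which is useful if you want to compare runs from a script. The sketch below uses a hypothetical `success_ratio` helper that takes the `code:count` pairs from the Status Codes line. It treats 2xx and 3xx responses as successes, which is an assumption about how Vegeta classifies them; in these reports only the 200 responses succeed, so the assumption does not affect the result.

```shell
# success_ratio "CODE:COUNT CODE:COUNT ..."
# Prints the percentage of requests whose status code is in the 2xx-3xx range.
success_ratio() {
  printf '%s\n' "$1" | tr ' ' '\n' | awk -F: '
    NF == 2 {
      total += $2
      if ($1 >= 200 && $1 < 400) ok += $2
    }
    END { if (total > 0) printf "%.2f\n", 100 * ok / total }'
}

# Pause test: 1843 of 3000 requests succeeded.
success_ratio "500:1146 0:11 200:1843"   # prints 61.43

# Packet-loss test: 1589 of 3000 requests succeeded.
success_ratio "200:1589 500:1411"        # prints 52.97
```

Note that the `0` status code in the first run is counted in the total but not as a success, which is why 1843 successes out of 3000 requests gives 61.43%.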
Hopefully, you’ve gained an appreciation for Chaos Engineering and have a better understanding of how it can help to improve the resiliency of your production applications. I encourage you to learn more about the capabilities of both Vegeta and Pumba.