Verifying benchmarks with ministat
Most people know to benchmark to verify performance improvements, but comparing a handful of results isn't enough. You need to be sure your results are statistically significant. ministat is a tool written by Poul-Henning Kamp that determines if two result sets are statistically significant and display the difference in performance between the two.
I recently used ministat to measure performance changes for my GVL release patch for Ruby's Zlib extension. The patch will allow multiple zlib or gzip streams to be deflated or inflated in parallel. I was asked to show any performance difference because the GVL release/acquire can be expensive depending upon the work done without the GVL.
I posted a video showing the difference between inflating very large files, but what about single-threaded use of Zlib? What about the worst case? While parallel processing of large streams is great, it shouldn't overly penalize single-thread or small-stream processing. I ran several benchmarks of single-threaded use to show minimal difference, but in the worst-case benchmarks I couldn't be sure there was any significant performance change.
For simplicity I'll only be mentioning inflate from here on, but the same loop handles inflate and deflate of gzip, zlib and raw deflate streams identically.
While the Zlib extension is written in C, I've written this in roughly equivalent ruby while hand-waving over some of the details of how libz works since they're unimportant to the performance comparison. Before my patch run loop looked like this:
setup_input_buffer stream setup_output_buffer stream loop do status = inflate stream schedule_another_thread return output_buffer if status == STREAM_END if status == BUF_ERROR then expand_output_buffer stream next else # handle other errors end end
First the input and output buffers are set up. The input buffer contains the deflated string to be inflated and the output buffer will contain the inflated string. The default output buffer size is 1024 bytes.
Inside the loop inflate is called which will expand as much of the input stream as will fit in the output buffer, then ruby is given the chance to schedule another thread to provide fairness, then errors are checked.
BUF_ERROR indicates there isn't enough room in the output buffer and more space needs to be added. The expand_output_buffer function adds up to another 16KB to the output buffer. Combined with scheduling another thread each time inflate returned, fairness is provided.
With my patch the run loop looks something like this:
setup_input_buffer stream setup_output_buffer stream release_GVL do until interrupted do status = inflate stream return output_buffer if status == STREAM_END if status == BUF_ERROR then expand_output_buffer_without_GVL stream next else # … end end end
The differences are releasing the GVL, the interrupted flag and the change in expanding the output buffer.
To reduce copying the Zlib extension uses a ruby String for the output buffer, so expanding it requires GC interaction. The new function uses
realloc(3) directly to avoid GVL contention and mirrors
The interrupted flag allows processing of timeouts, a killed thread or clean shutdown to proceed smoothly. Like the no-GVL version the buffer is only expanded by 16KB to allow interruptions to be serviced quickly. Without this flag a large stream would need to complete before the thread it is running in could be shut down which may take several seconds. (In the C version of release_GVL,
rb_thread_blocking_region(), you provide an unblocking function that sets the interrupted flag when ruby wants to regain control of the thread.)
Releasing the GVL allow other threads to run the Ruby VM so while a stream is being inflated other threads can run on other CPUs. For large streams this can result in a performance increase when other Ruby threads are running.
Worst Case Benchmark
After inspecting the two implementations we have enough information to create a worst-case scenario for testing the performance of the patch.
Since GVL release/acquire can be expensive we want to do that as much as possible which means inflate should do as little as possible. The buffer should not be expanded since that allows more than one trip through inflate() per GVL release/acquire, so inflated strings must fit in the 1024 byte output buffer.
I wrote a benchmark that inflates many strings with an inflated size of 1000 bytes using multiple threads, but not so many strings as to prevent a quick return from the benchmark section. (While inflating strings of zero bytes would probably be worse, that seems unrealistic.)
Here's the benchmark program I wrote:
r = Random.new 0
file_count = 10_000
deflated = (0..file_count).map do
input = r.bytes 1000
times = Benchmark.measure do
deflated.each do |input|
end.each do |t|
Instead of using time,
Benchmark.measure is used, but only around the inflate part. This reduces the amount of noise in the benchmark. By choosing only 10,000 files multiple benchmarks can be run (on my machine the inflate portion takes a little over ½ second). By using the same random seed the benchmark is repeatable.
Before using ministat I ran the same benchmark but with 100,000 files I had the following results without the patch:
$ for f in `jot 5`; do ruby20 test.rb; done 5.420000 5.970000 11.390000 ( 8.162893) 5.400000 6.270000 11.670000 ( 8.263046) 5.460000 5.920000 11.380000 ( 8.133742) 5.410000 6.290000 11.700000 ( 8.289913) 5.500000 6.620000 12.120000 ( 8.478085)
and with the patch:
$ for f in `jot 5`; do make -s runruby; done 5.120000 6.240000 11.360000 ( 8.039715) 5.240000 6.260000 11.500000 ( 8.097961) 5.280000 5.940000 11.220000 ( 8.004246) 5.210000 6.360000 11.570000 ( 8.171124) 5.240000 6.200000 11.440000 ( 8.054929)
So while running with the patch seemed faster, the results overlap by a small margin. I wanted to be reasonably sure there was an improvement which is where ministat comes in. In order to get useful data out of ministat I would need more data (this is easy to verify by running ministat on these results and seeing that it says "No difference").
To capture the data I used a shell for loop to run the benchmark 50 times and redirect the output to a file: for f in `jot 50`; do make -s runruby; done > with, and again without the patch applied directed to a "without" file.
To determine if the results were statistically significant I ran the data through ministat with a confidence interval of 99%. From the ministat man page:
The ministat command was written by Poul-Henning Kamp out of frustration over all the bogus benchmark claims made by people with no understanding of the importance of uncertainty and statistics.
So through ministat we will be able to determine if the suspected difference from the previous benchmark is an actual difference, and what the certainty of that difference is.
$ ministat -c 99 -s without with x without + with +------------------------------------------------------------------------------+ | + + + x * xx x | | + x + + + + + + x **+xx*xx x * x + x | |+ ++++ x + x*+ * + ++x+x++x*++x* ***+x**xx * +x**** * + x x xx| | |_____________A_____________| | | |______________A_______________| | +------------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 50 0.61660767 0.71641994 0.66669798 0.66647618 0.022548872 + 50 0.59242272 0.69294882 0.64941692 0.64917859 0.02509235 Difference at 99.0% confidence -0.0172976 +/- 0.0125332 -2.59538% +/- 1.88051% (Student's t, pooled s = 0.0238545)
Looking at the last four lines, you can see there is a statistically significant difference in the benchmark results at 99% confidence. Running the benchmark with the patch shows a 17ms decrease in time ± 13ms or a 2.6% decrease in time ± 1.9% for this sample set, so I can say "minor improvement".
The box shows a plot of the with and without values along with the average (and median) values, and standard deviation bars. For full details see the ministat man page.
There are a couple of likely reasons for the slight increase in performance such as the GVL release/acquire being inexpensive enough that inflate takes longer allowing parallelism or that GVL release/acquire isn't as expensive as thought, but without an easy way to measure GVL contention it's only speculation.
To my knowledge ministat only ships with FreeBSD, but fortunately it is really easy to add to nearly any system with a C compiler. You can find a copy of ministat in this repository (which, ironically, uses autotools) or you can grab the C file directly and run cc ministat.c -lm -o ministat and move the ministat executable into your PATH like I did. comments
Comments are disabled