Verifying benchmarks with ministat
drbrain |
Most people know to benchmark to verify performance improvements, but comparing a handful of results isn't enough. You need to be sure your results are statistically significant. ministat is a tool written by Poul-Henning Kamp that determines if two result sets are statistically significant and display the difference in performance between the two.
I recently used ministat to measure performance changes for my GVL release patch for Ruby's Zlib extension. The patch will allow multiple zlib or gzip streams to be deflated or inflated in parallel. I was asked to show any performance difference because the GVL release/acquire can be expensive depending upon the work done without the GVL.
I posted a video showing the difference between inflating very large files, but what about single-threaded use of Zlib? What about the worst case? While parallel processing of large streams is great, it shouldn't overly penalize single-thread or small-stream processing. I ran several benchmarks of single-threaded use to show minimal difference, but in the worst-case benchmarks I couldn't be sure there was any significant performance change.
Zlib background
For simplicity I'll only be mentioning inflate from here on, but the same loop handles inflate and deflate of gzip, zlib and raw deflate streams identically.
While the Zlib extension is written in C, I've written this in roughly equivalent ruby while hand-waving over some of the details of how libz works since they're unimportant to the performance comparison. Before my patch run loop looked like this:
setup_input_buffer stream
setup_output_buffer stream
loop do
status = inflate stream
schedule_another_thread
return output_buffer if status == STREAM_END
if status == BUF_ERROR then
expand_output_buffer stream
next
else
# handle other errors
end
end
First the input and output buffers are set up. The input buffer contains the deflated string to be inflated and the output buffer will contain the inflated string. The default output buffer size is 1024 bytes.
Inside the loop inflate is called which will expand as much of the input stream as will fit in the output buffer, then ruby is given the chance to schedule another thread to provide fairness, then errors are checked.
A BUF_ERROR indicates there isn't enough room in the output buffer and more space needs to be added. The expand_output_buffer function adds up to another 16KB to the output buffer. Combined with scheduling another thread each time inflate returned, fairness is provided.
With my patch the run loop looks something like this:
setup_input_buffer stream
setup_output_buffer stream
release_GVL do
until interrupted do
status = inflate stream
return output_buffer if status == STREAM_END
if status == BUF_ERROR then
expand_output_buffer_without_GVL stream
next
else
# …
end
end
end
The differences are releasing the GVL, the interrupted flag and the change in expanding the output buffer.
To reduce copying the Zlib extension uses a ruby String for the output buffer, so expanding it requires GC interaction. The new function uses realloc(3) directly to avoid GVL contention and mirrors rb_str_resize().
The interrupted flag allows processing of timeouts, a killed thread or clean shutdown to proceed smoothly. Like the no-GVL version the buffer is only expanded by 16KB to allow interruptions to be serviced quickly. Without this flag a large stream would need to complete before the thread it is running in could be shut down which may take several seconds. (In the C version of release_GVL, rb_thread_blocking_region(), you provide an unblocking function that sets the interrupted flag when ruby wants to regain control of the thread.)
Releasing the GVL allow other threads to run the Ruby VM so while a stream is being inflated other threads can run on other CPUs. For large streams this can result in a performance increase when other Ruby threads are running.
Worst Case Benchmark
After inspecting the two implementations we have enough information to create a worst-case scenario for testing the performance of the patch.
Since GVL release/acquire can be expensive we want to do that as much as possible which means inflate should do as little as possible. The buffer should not be expanded since that allows more than one trip through inflate() per GVL release/acquire, so inflated strings must fit in the 1024 byte output buffer.
I wrote a benchmark that inflates many strings with an inflated size of 1000 bytes using multiple threads, but not so many strings as to prevent a quick return from the benchmark section. (While inflating strings of zero bytes would probably be worse, that seems unrealistic.)
Here's the benchmark program I wrote:
require 'zlib'
require 'benchmark'
r = Random.new 0
file_count = 10_000
deflated = (0..file_count).map do
input = r.bytes 1000
Zlib::Deflate.deflate input
end
times = Benchmark.measure do
(0..3).map do
Thread.new do
deflated.each do |input|
Zlib::Inflate.inflate input
end
end
end.each do |t|
t.join
end
end
puts times.real
Instead of using time, Benchmark.measure is used, but only around the inflate part. This reduces the amount of noise in the benchmark. By choosing only 10,000 files multiple benchmarks can be run (on my machine the inflate portion takes a little over ½ second). By using the same random seed the benchmark is repeatable.
Before using ministat I ran the same benchmark but with 100,000 files I had the following results without the patch:
$ for f in `jot 5`; do ruby20 test.rb; done 5.420000 5.970000 11.390000 ( 8.162893) 5.400000 6.270000 11.670000 ( 8.263046) 5.460000 5.920000 11.380000 ( 8.133742) 5.410000 6.290000 11.700000 ( 8.289913) 5.500000 6.620000 12.120000 ( 8.478085)
and with the patch:
$ for f in `jot 5`; do make -s runruby; done 5.120000 6.240000 11.360000 ( 8.039715) 5.240000 6.260000 11.500000 ( 8.097961) 5.280000 5.940000 11.220000 ( 8.004246) 5.210000 6.360000 11.570000 ( 8.171124) 5.240000 6.200000 11.440000 ( 8.054929)
So while running with the patch seemed faster, the results overlap by a small margin. I wanted to be reasonably sure there was an improvement which is where ministat comes in. In order to get useful data out of ministat I would need more data (this is easy to verify by running ministat on these results and seeing that it says "No difference").
To capture the data I used a shell for loop to run the benchmark 50 times and redirect the output to a file: for f in `jot 50`; do make -s runruby; done > with, and again without the patch applied directed to a "without" file.
ministat
To determine if the results were statistically significant I ran the data through ministat with a confidence interval of 99%. From the ministat man page:
The ministat command was written by Poul-Henning Kamp out of frustration over all the bogus benchmark claims made by people with no understanding of the importance of uncertainty and statistics.
So through ministat we will be able to determine if the suspected difference from the previous benchmark is an actual difference, and what the certainty of that difference is.
$ ministat -c 99 -s without with
x without
+ with
+------------------------------------------------------------------------------+
| + + + x * xx x |
| + x + + + + + + x **+xx*xx x * x + x |
|+ ++++ x + x*+ * + ++x+x++x*++x* ***+x**xx * +x**** * + x x xx|
| |_____________A_____________| |
| |______________A_______________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 50 0.61660767 0.71641994 0.66669798 0.66647618 0.022548872
+ 50 0.59242272 0.69294882 0.64941692 0.64917859 0.02509235
Difference at 99.0% confidence
-0.0172976 +/- 0.0125332
-2.59538% +/- 1.88051%
(Student's t, pooled s = 0.0238545)
Looking at the last four lines, you can see there is a statistically significant difference in the benchmark results at 99% confidence. Running the benchmark with the patch shows a 17ms decrease in time ± 13ms or a 2.6% decrease in time ± 1.9% for this sample set, so I can say "minor improvement".
The box shows a plot of the with and without values along with the average (and median) values, and standard deviation bars. For full details see the ministat man page.
There are a couple of likely reasons for the slight increase in performance such as the GVL release/acquire being inexpensive enough that inflate takes longer allowing parallelism or that GVL release/acquire isn't as expensive as thought, but without an easy way to measure GVL contention it's only speculation.
Getting ministat
To my knowledge ministat only ships with FreeBSD, but fortunately it is really easy to add to nearly any system with a C compiler. You can find a copy of ministat in this repository (which, ironically, uses autotools) or you can grab the C file directly and run cc ministat.c -lm -o ministat and move the ministat executable into your PATH like I did.
Replace your test helpers with reusable API
drbrain |
test/test_helper.rb is a great idea Rails brought to the Ruby world as a place for functionality that helps you write better tests. There's now a standard place for you to implement common setup/teardown, shortcuts and custom assertions. However, a test helper is not the best place to store this functionality for a Ruby library.
One of the benefits you get out of writing tests is knowing where your API is clumsy and inadequate. If you have a test helper file full of methods to make your library easy to use in a test why is that not part of your library's API? Wouldn't your users also want a bunch of methods that make your library easier to use in their applications?
For example, in RubyGems we have a test helper that does this: gem_file = Gem::Builder.new(spec).build which is a little silly. Every time you create a Gem::Builder you want to build a gem. You don't create a Gem::Builder object for fun! To help out RubyGems users I added a new method: gem_file = Gem::Builder.build spec which immediately creates and builds them gem which is much nicer for everyone (but really, you should use Gem::PackageTask when building gems).
Whether you're writing a library or a Rails app, this kind of functionality belongs in your library (or application) code, not in the test helper where only your tests can benefit from it.
Even after you improve your API by moving helpful functionality back into your library there's still going to be some things that only make sense for tests. For example, you probably don't want to type t = Some::Deeply::Namespaced::Thing.new 1, 2, 3 many, many times in your tests, so you write a short wrapper method you can call like this: t = thing 1, 2, 3. Your tests may need setup and teardown to maintain a clean environment between tests, custom assertions for readability or you may want to include a pre-built stub or mock.
While this having this functionality in a test helper is fine for a Rails app, it shouldn't go in a library's test helper. When you keep testing functionality hidden in the test directory a user who wants to write a third-party extension for your gem can't access them. Why force a happy user to re-implement (possibly poorly or incorrectly) the work you've done to have nice, clean tests that are easy to read and write?
Instead of having a private test helper I have a public test case like MyGem::TestCase that lives in lib/my_gem/test_case.rb. This gives anyone who wants to extend my libraries a documented, ready-to-go API for writing tests for their extension.
My gem-specific test case typically contains all the requires needed to load the library (ideally require 'my_gem'), proper setup and teardown to sandbox the tests, any utility methods that don't belong in the library itself and possibly some custom assertions. This makes a brand new test easy to start:
require 'my_gem/test_case'
class TestMyGemSomeClass < MyGem::TestCase
def setup
super
# …
end
def test_something
# …
end
end
There is the minor downside that an extension writer must use minitest (my preferred testing library) to test their extension. Perhaps this inconvenience could be solved by a module providing setup, teardown and shortcuts that is included in the proper place for the extension writer's favorite testing library.
PS: Actually, Gem::Builder.build is Gem::Package.build since Gem::Format, Gem::Builder and Gem::Package are getting merged into one convenient class that deals with reading and writing gem files for RubyGems 2.0. This means there will only be one place to look for the API of messing with packages and it reduces the implementation of Gem::Installer a bit.
How to Sleep in Tests
drbrain |
And Use Other Kernel Methods
Ruby provides many methods for you in Kernel including such favorites as #sleep, #fork, #open, #system and #` that can be hard to test reliably and quickly.
When you write code that uses these methods you want your tests to be reliable and fast but not need to set up too much state, like having files to open, before running (most importantly because that's more typing, but also because it's more stuff that may break). You also don't want the tests to be slow so actually sleeping or running a command are undesirable.
Let's start with some sample code. Imagine you've got a severe case of NIH and you're implementing cron (poorly):
def run command, every
loop do
status = system command
raise "#{command} failed" unless status
sleep every
end
end
And here's a test:
def test_run
e = assert_raises RuntimeError do
@cron.run 'false', 0
end
assert_equal 'false failed', e.message
end
How do we test the sleep behavior though? We'd need a command that failed on the second invocation. We'd also want to check the sleep duration somehow which would involve waiting.
There's another way though, since both #system and #sleep are in Kernel we can avoid calling the real implementations through the power of inheritance! If we have our tests inject an implementation of #sleep and #system into the implementation so the real methods never get called:
def test_run
def @cron.sleep time
@sleep = time
end
def @cron.system command
@commands ||= []
@commands << command
@commands.length <= 1
end
e = assert_raises RuntimeError do
@cron.run 'any old command', 2**30
end
assert_equal 'any old command failed', e.message
assert_equal 2**30, @cron.instance_variable_get(:@sleep)
assert_equal ['any old command', 'any old command'],
@cron.instance_variable_get(:@commands)
end
This test never calls Kernel#sleep nor Kernel#system, problem solved! As an additional benefit, this test will work on windows where the first will not because false may not exist.
Of course, you can always use a mocking or stubbing framework, but ruby is powerful enough on its own that you don't really need one.
Testing Effectively with Rails
drbrain |
I wrote a blog post for AT&T Interactive's Engineering Blog on how to test effectively with rails.
I know you all know how to write tests, so this blog post is not about that. I want to move your testing to the next level. Your current testing practices are OK, but some changes in how you approach testing will allow you to refactor and improve your code even faster and more easily than you currently do.
Check it out!
How I Use Autotest
drbrain |
During the Q&A portion of my talk, I was asked a very important question about autotest, something like “How often do you save when using autotest?” I save all the time when using autotest. There was a followup question, something like “Don’t you get syntax errors?” and I don’t.
Before I wrote autotest I was making fine-grained saves that were syntactically correct. I wrote autotest to automate the running of tests so I wouldn’t have to choose which tests to run. My changes were so small that I spent an annoying fraction of my time editing my command line.
When I’m using autotest, rather than writing all of a method’s tests, I start by performing the setup (which should cause an error), add a flunk, and save, which is usually four lines:
def test_blah
result = @thingy.blah
flunk
end
When the tests run they’ll see the method blah doesn’t exist and fail, so I define the method, (the def blah and end lines), and save. Then I write an assertion and save the test, which will fail, so I implement what I need to make the assertion pass, and save. Now I repeatedly perform the minimum changes to go from failure to flunking until my test is complete, remove the flunk, and start over with the next method.
Each save I do is a handful of lines, so it’s easy to keep them syntactically correct. I know I’m going from working state to working state as I work towards my end goal so I can easily roll back my changes with undo.
Firebrigade Home Page Fixed
drbrain |
I denormalized a bit and beat the tests back into shape and the Firebrigade home page is back to life! Next I’m going to sand down a few rough edges on RubyGems to get a beta shipped suitable for feedback.
Firebrigade is now fully vladified using perforce.
Here’s the two setup steps you need for perforce:
- Require ‘vlad/perforce’ at the top of config/deploy.rb.
- Your .p4config goes in the scm/ directory of the server’s checkout.
- Run p4 client in the scm/ directory and set your View to //path/to/project/… //clientname/…
Hopefully we can fully automate this so the vlad:setup_app task can handle this automatically.
I also had to do some custom setup for firebrigade because it uses RubyInline and needs the INLINEDIR set. I just added extra stuff to the setup_app task:
namespace :vlad do
remote_task :setup_app do
cmds = [
"mkdir #{inline_dir}",
"sudo chown www:www #{inline_dir}",
]
end
Update: Automatic p4 setup is done!
Notes on Heckle
drbrain |
When I started heckle I was hoping for somehing better than rcov that would tell you that your code is poorly tested… But code so poorly engineered that changing anything as bound to make everything fall apart looks like well tested (read: heckle-proof) code in a sense. We both think that running just the unit tests that directly test the MUT (method under test) would help, but really it wouldn’t help enough. […]
Suggestions? How would you differentiate between “well tested” and “tightly coupled crap”?
—Notes on Heckle via Polishing Ruby
I’m puzzled by this too. How can we detect well-tested vs tightly-coupled (and poorly tested) code automatically?
tinderbox version 1.0.0 has been released!
drbrain |
tinderbox version 1.0.0 has been released!
http://seattlerb.rubyforge.org/tinderbox
Description
Tinderbox tests projects and tries to make them break by running them on as
many different platforms as possible.
Features & Problems
- Tests gems in a sandbox
- Submits gem test results to http://firebrigade.seattlerb.org
- Understands test/unit and RSpec
Changes:
1.0.0 / 2007-01-30
- Tests gems in a sandbox
- Submits results to Firebrigade
- Birthday!
firebrigade_api version 1.0.0 has been released!
drbrain |
http://seattlerb.rubyforge.org/firebrigade_api
firebrigade_api is an API wrapper for http://firebrigade.seattlerb.org
Changes
- Birthday!
- Ruby interface to http://firebrigade.seattlerb.org
- User-friendly cache for implementing Tinderbox
Test Profiling by Lines Logged
drbrain |
At work I've been cleaning up the tests and trying to make them run faster. One way of doing this is profiling the tests and fixing the slow spots. That only works so well, especially if there's lots of duplication or extra work in the tests. With Rails you can go about this a different way, since you have a second source of information on your tests' operation, the log file.
I added the following code to test/test_helper.rb:
raise 'require\'d test/test_helper twice!, you broke it!' if
ENV['RAILS_ENV'] == 'test'
At the very top, even above ENV["RALIS_ENV"] = "test" to make sure that the hack to Test::Unit::TestCase below happens only once, then the magic to figure out which test outputs which log lines:
class Test::Unit::TestCase
alias unlogged_run run
def run(result, &block)
RAILS_DEFAULT_LOGGER.debug "RUNNING #{self.class} #{@method_name}"
unlogged_run result, &block
end
end if ENV['PROFILE_LOG']
Each test run will be prefixed with text like RUNNING SomeTest test_blah which I can then run this script on:
#!/usr/local/bin/ruby -w
test = nil
tests = Hash.new 0
File.open 'log/test.log' do |fp|
fp.each_line do |line|
if line.strip =~ /^RUNNING (.*)/ then
test = $1
else
tests[test] += 1
end
end
end
tests.sort_by { |test, count| -count }.each do |test, count|
puts "%5d %s" % [count, test]
end
To get the profile information, I run:
rake log:clear; PROFILE_LOG=y rake && script/count_logs
When run on Firebrigade, I get the following output (top 10 only):
130 ProjectViewTest test_show 74 DummyControllerTest test_error_500 59 OwnerTest test_class_owner_count 42 ProjectControllerTest test_index 39 HomeControllerTest test_index_no_builds 38 RestControllerTest test_add_build 35 ProjectControllerTest test_show_no_versions 35 ProjectControllerTest test_search_many_matches 34 RestControllerTest test_add_project 33 RestControllerTest test_add_version
Now I can quickly discover good candidates for refactoring. #test_error_500 above dumps an email into the logs, so it is a bogus result. That leaves ProjectViewTest#test_show as a candidate for simplification or refactoring.

