Streaming zlib processing for Ruby

drbrain | Tue, 10 Jul 2012 19:52:18 GMT

Earlier today I checked in a patch that adds streaming zlib processing to Ruby. This allows you to process a stream without needing to allocate space to hold the entire result. To add the streaming support I changed #inflate and #deflate to accept a block. A handful of other methods (such as #finish) accept a block as well due to the internals of Zlib. Here's an example:

require 'zlib'

# dd if=/dev/zero of=/dev/stdout bs=1m count=1024 | gzip -c > 1G.gz
gzipped = File.read '1G.gz'

# auto-detect gzip
z = Zlib::Inflate.new Zlib::MAX_WBITS + 32

# This prints NULs
z.inflate gzipped do |chunk|
  puts chunk
end

# Don't forget to finish the stream, there may be bytes leftover.
puts z.finish

If the input was a zlib stream instead of a gzip stream, this would work:

Zlib.inflate deflated do |chunk|
  puts chunk
end

But Zlib.inflate doesn't allow specification of window bits for gzip auto-detection.

The biggest difference is in memory used. Here's 1GB of gzip-compressed NULs:

$ ll 1G.gz
-rw-r--r--  1 drbrain  staff  1042069 Jul  9 17:09 1G.gz

Here's the memory used inflating this stream:

$ /usr/bin/time -l make -s runruby > /dev/null
        3.56 real         3.25 user         0.29 sys
  46682112  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     12644  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         1  signals received
         3  voluntary context switches
        14  involuntary context switches

Only 46MB used!

Here's without streaming:

$ /usr/bin/time -l make -s runruby > /dev/null
        3.82 real         3.21 user         0.60 sys
1079824384  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
    264889  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         1  signals received
         3  voluntary context switches
        18  involuntary context switches

Over 1GB used!

As you may have noticed, despite all the extra calls back to ruby to execute the block, the streaming output was 250ms faster, but does that hold up?

Benchmarks

For all these benchmarks I chose files of NULs as the inflate to large files as this exercises the growth of the output buffer the most, which is what this new feature changes. In general, these benchmarks show that in single-thread use streaming improves inflate speed especially as the output size increases while in multi-thread use streaming reduces inflate speed unless the output size grows very large.

I chose the input sizes of 10KB, 100KB and 1GB as tests of the internals of the zlib extension. At 10KB buffer expansion will only happen once for streaming mode, but several times for non-streaming. 100KB is in the realm of compressed HTTP responses and has a reasonable balance between the way streaming and non-streaming work are handled. 1GB is extremely large and when multi-threaded would be taxing on my laptop's memory capacity for non-streaming and taxing on the GVL while streaming.

Using files that are less compressible would be a better real-world example and may change the speedups shown. For single-thread tests, the speedup would likely increase while for multi-thread tests the slowdown would likely decrease.

With a 1GB file of NULs, ministat shows a 6% speedup:

x stream
+ no-stream
+------------------------------------------------------------------------------+
|                                                                   +          |
|                                                                   +          |
|                                                                   ++         |
|                                                                 ++++         |
|              x                                                 +++++         |
|         x  xxx                                                +++++++        |
|  x x xxxx  xxx x                                              +++++++        |
|  x xxxxxxx xxxxx          x                                  ++++++++   +    |
|x x xxxxxxx xxxxxx  x      x   x       x     x            x   ++++++++++ +++ +|
|  |_________MA__________|                                                     |
|                                                                |__A__|       |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  50     3.4484119      3.690871      3.497622     3.5034132   0.045816556
+  50     3.7066622     3.7709568     3.7281418     3.7290156   0.013754228
Difference at 95.0% confidence
	0.225602 +/- 0.013422
	6.4395% +/- 0.383111%
	(Student's t, pooled s = 0.0338255)

I believe this is due to the non-streaming execution needing to realloc() the output buffer during expansion. The streaming version yields the buffer to the block and creates a new buffer which avoids the realloc() overhead for large strings.

For 100KB of NULs inflated 100 times, no speedup is shown:

x stream
+ no-stream
+------------------------------------------------------------------------------+
|     +   xxxxx                                                                |
|    +++ +xxxxx                                                                |
|    +++++*xxxxx x                                                             |
|  +++++++*xxxxxxx                  + +  +                +                    |
|+ +++++++**xxxxx* * xxx   +       +*++ ++   +  + +       ++ +  x    x     +  +|
|    |_______M__A__________|                                                   |
||________M____________A____________________|                                  |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  50   0.036626816   0.045912027   0.037170887   0.037605586  0.0017813789
+  50   0.035252094   0.047364235   0.036613226   0.038649669  0.0033843024
No difference proven at 95.0% confidence

At 10KB of NULs inflated 1000 times each run, the improvement drops to 6%:

x stream
+ no-stream
+------------------------------------------------------------------------------+
|              ++                                                              |
|    xxx       ++                                                              |
|    xxx       +++                                                             |
|    xxx       +++                                                             |
|   xxxx x     +++                                                             |
|   xxxx x    ++++                                                             |
|   xxxx x    ++++                                                             |
|   xxxxxx    ++++ +                                                           |
|  xxxxxxx   +*+++++  +                                                        |
|  xxxxxxxx  **++++++ ++  *                + +  *   +                         +|
||____M_A______|                                                               |
|       |_______M___A___________|                                              |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  50   0.046477079    0.05746603   0.047408819   0.047745066  0.0016545295
+  50   0.049071074   0.064721823   0.049803019   0.050768676   0.002927008
Difference at 95.0% confidence
	0.00302361 +/- 0.000943385
	6.33282% +/- 1.97588%
	(Student's t, pooled s = 0.00237748)

In this case the improvement is likely due to the way the buffer is expanded. Streaming mode jumps from the initial buffer size (1KB) to the maximum buffer size (16KB) immediately while non-streaming mode increments the buffer in steps. I'll need further performance tests to see if eliminating stepping output buffer growth is worthwhile for both modes.

When multiple zlib streams are being processed in parallel streaming is 20% slower when expanding a 10KB streams of NULs 1000 times in four threads:

x stream
+ no-stream
+------------------------------------------------------------------------------+
|              +                                         xx                    |
|           +  +                                       x xx                    |
|           ++ +                                       x xx                    |
|          +++ +                                       x xx x                  |
|          ++++++                                      x xx x                  |
|        ++++++++                                     xxxxx x x                |
|   +    ++++++++ +                                   xxxxxxx x x              |
|++ +  + ++++++++ ++    +         +                  xxxxxxxxxxxx xxx  x      x|
|                                                     |___MA____|              |
|       |____A____|                                                            |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  50    0.19220901    0.21440268    0.19658995     0.1977196  0.0041590721
+  50    0.14678812    0.17618799    0.15705991    0.15724883  0.0045392638
Difference at 95.0% confidence
	-0.0404708 +/- 0.0017274
	-20.4688% +/- 0.87366%
	(Student's t, pooled s = 0.00435332)

With 100KB of NULs expanded 100 times in four threads there's a 14% slowdown:

x stream
+ no-stream
+------------------------------------------------------------------------------+
|        + +++ +                                      xxxx x                   |
|        + +++ +                                      xxxxxx  x                |
| + ++ +++++++ ++  +    +                   +       xxxxxxxx  x +            x |
|++ ++++++++++ ++  +  + +   + +++          ++    xxxxxxxxxxxxxx *xxx   x    xxx|
|                                                   |____M_A______|            |
|  |________M____A_____________|                                               |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  50   0.069380999   0.076375008   0.071210146   0.071699142  0.0016866267
+  50   0.057549715   0.073060989   0.060353994   0.061419401  0.0034865206
Difference at 95.0% confidence
	-0.0102797 +/- 0.0010867
	-14.3373% +/- 1.51564%
	(Student's t, pooled s = 0.00273866)

These slowdowns are likely due to increased GVL contention. As the output size grows the slowdown decreases.

With 1G of NULs expanded once in four thread's there's a 1% speedup:

x stream
+ no-stream
+------------------------------------------------------------------------------+
|    x x +                                                                     |
|    x x++x++                                                                  |
|    x ***x++                                                                  |
|    xx******                                                                  |
|    xx******                                                                  |
|  xx********+++                                                               |
|  x**********++   +x  x +   +                 +                              +|
|    |__A___|                                                                  |
||________M__A__________|                                                      |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  50     5.1492789     5.3822649     5.2116029     5.2147137   0.043439149
+  50     5.1704421     6.0051818     5.2339363     5.2634319     0.1322479
Difference at 95.0% confidence
	0.0487181 +/- 0.0390566
	0.934244% +/- 0.748968%
	(Student's t, pooled s = 0.0984288)

The speedup is due to realloc() needing to copy the string multiple times.

Posted in  | 2 comments | no trackbacks