Streaming zlib processing for Ruby
drbrain |
Earlier today I checked in a patch that adds streaming zlib processing to Ruby. This allows you to process a stream without needing to allocate space to hold the entire result. To add the streaming support I changed #inflate and #deflate to accept a block. A handful of other methods (such as #finish) accept a block as well due to the internals of Zlib. Here's an example:
require 'zlib'
# dd if=/dev/zero of=/dev/stdout bs=1m count=1024 | gzip -c > 1G.gz
gzipped = File.read '1G.gz'
# auto-detect gzip
z = Zlib::Inflate.new Zlib::MAX_WBITS + 32
# This prints NULs
z.inflate gzipped do |chunk|
puts chunk
end
# Don't forget to finish the stream, there may be bytes leftover.
puts z.finish
If the input was a zlib stream instead of a gzip stream, this would work:
Zlib.inflate deflated do |chunk|
puts chunk
end
But Zlib.inflate doesn't allow specification of window bits for gzip auto-detection.
The biggest difference is in memory used. Here's 1GB of gzip-compressed NULs:
$ ll 1G.gz -rw-r--r-- 1 drbrain staff 1042069 Jul 9 17:09 1G.gz
Here's the memory used inflating this stream:
$ /usr/bin/time -l make -s runruby > /dev/null
3.56 real 3.25 user 0.29 sys
46682112 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
12644 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
1 signals received
3 voluntary context switches
14 involuntary context switches
Only 46MB used!
Here's without streaming:
$ /usr/bin/time -l make -s runruby > /dev/null
3.82 real 3.21 user 0.60 sys
1079824384 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
264889 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
1 signals received
3 voluntary context switches
18 involuntary context switches
Over 1GB used!
As you may have noticed, despite all the extra calls back to ruby to execute the block, the streaming output was 250ms faster, but does that hold up?
Benchmarks
For all these benchmarks I chose files of NULs as the inflate to large files as this exercises the growth of the output buffer the most, which is what this new feature changes. In general, these benchmarks show that in single-thread use streaming improves inflate speed especially as the output size increases while in multi-thread use streaming reduces inflate speed unless the output size grows very large.
I chose the input sizes of 10KB, 100KB and 1GB as tests of the internals of the zlib extension. At 10KB buffer expansion will only happen once for streaming mode, but several times for non-streaming. 100KB is in the realm of compressed HTTP responses and has a reasonable balance between the way streaming and non-streaming work are handled. 1GB is extremely large and when multi-threaded would be taxing on my laptop's memory capacity for non-streaming and taxing on the GVL while streaming.
Using files that are less compressible would be a better real-world example and may change the speedups shown. For single-thread tests, the speedup would likely increase while for multi-thread tests the slowdown would likely decrease.
With a 1GB file of NULs, ministat shows a 6% speedup:
x stream
+ no-stream
+------------------------------------------------------------------------------+
| + |
| + |
| ++ |
| ++++ |
| x +++++ |
| x xxx +++++++ |
| x x xxxx xxx x +++++++ |
| x xxxxxxx xxxxx x ++++++++ + |
|x x xxxxxxx xxxxxx x x x x x x ++++++++++ +++ +|
| |_________MA__________| |
| |__A__| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 50 3.4484119 3.690871 3.497622 3.5034132 0.045816556
+ 50 3.7066622 3.7709568 3.7281418 3.7290156 0.013754228
Difference at 95.0% confidence
0.225602 +/- 0.013422
6.4395% +/- 0.383111%
(Student's t, pooled s = 0.0338255)
I believe this is due to the non-streaming execution needing to realloc() the output buffer during expansion. The streaming version yields the buffer to the block and creates a new buffer which avoids the realloc() overhead for large strings.
For 100KB of NULs inflated 100 times, no speedup is shown:
x stream
+ no-stream
+------------------------------------------------------------------------------+
| + xxxxx |
| +++ +xxxxx |
| +++++*xxxxx x |
| +++++++*xxxxxxx + + + + |
|+ +++++++**xxxxx* * xxx + +*++ ++ + + + ++ + x x + +|
| |_______M__A__________| |
||________M____________A____________________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 50 0.036626816 0.045912027 0.037170887 0.037605586 0.0017813789
+ 50 0.035252094 0.047364235 0.036613226 0.038649669 0.0033843024
No difference proven at 95.0% confidence
At 10KB of NULs inflated 1000 times each run, the improvement drops to 6%:
x stream
+ no-stream
+------------------------------------------------------------------------------+
| ++ |
| xxx ++ |
| xxx +++ |
| xxx +++ |
| xxxx x +++ |
| xxxx x ++++ |
| xxxx x ++++ |
| xxxxxx ++++ + |
| xxxxxxx +*+++++ + |
| xxxxxxxx **++++++ ++ * + + * + +|
||____M_A______| |
| |_______M___A___________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 50 0.046477079 0.05746603 0.047408819 0.047745066 0.0016545295
+ 50 0.049071074 0.064721823 0.049803019 0.050768676 0.002927008
Difference at 95.0% confidence
0.00302361 +/- 0.000943385
6.33282% +/- 1.97588%
(Student's t, pooled s = 0.00237748)
In this case the improvement is likely due to the way the buffer is expanded. Streaming mode jumps from the initial buffer size (1KB) to the maximum buffer size (16KB) immediately while non-streaming mode increments the buffer in steps. I'll need further performance tests to see if eliminating stepping output buffer growth is worthwhile for both modes.
When multiple zlib streams are being processed in parallel streaming is 20% slower when expanding a 10KB streams of NULs 1000 times in four threads:
x stream
+ no-stream
+------------------------------------------------------------------------------+
| + xx |
| + + x xx |
| ++ + x xx |
| +++ + x xx x |
| ++++++ x xx x |
| ++++++++ xxxxx x x |
| + ++++++++ + xxxxxxx x x |
|++ + + ++++++++ ++ + + xxxxxxxxxxxx xxx x x|
| |___MA____| |
| |____A____| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 50 0.19220901 0.21440268 0.19658995 0.1977196 0.0041590721
+ 50 0.14678812 0.17618799 0.15705991 0.15724883 0.0045392638
Difference at 95.0% confidence
-0.0404708 +/- 0.0017274
-20.4688% +/- 0.87366%
(Student's t, pooled s = 0.00435332)
With 100KB of NULs expanded 100 times in four threads there's a 14% slowdown:
x stream
+ no-stream
+------------------------------------------------------------------------------+
| + +++ + xxxx x |
| + +++ + xxxxxx x |
| + ++ +++++++ ++ + + + xxxxxxxx x + x |
|++ ++++++++++ ++ + + + + +++ ++ xxxxxxxxxxxxxx *xxx x xxx|
| |____M_A______| |
| |________M____A_____________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 50 0.069380999 0.076375008 0.071210146 0.071699142 0.0016866267
+ 50 0.057549715 0.073060989 0.060353994 0.061419401 0.0034865206
Difference at 95.0% confidence
-0.0102797 +/- 0.0010867
-14.3373% +/- 1.51564%
(Student's t, pooled s = 0.00273866)
These slowdowns are likely due to increased GVL contention. As the output size grows the slowdown decreases.
With 1G of NULs expanded once in four thread's there's a 1% speedup:
x stream
+ no-stream
+------------------------------------------------------------------------------+
| x x + |
| x x++x++ |
| x ***x++ |
| xx****** |
| xx****** |
| xx********+++ |
| x**********++ +x x + + + +|
| |__A___| |
||________M__A__________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 50 5.1492789 5.3822649 5.2116029 5.2147137 0.043439149
+ 50 5.1704421 6.0051818 5.2339363 5.2634319 0.1322479
Difference at 95.0% confidence
0.0487181 +/- 0.0390566
0.934244% +/- 0.748968%
(Student's t, pooled s = 0.0984288)
The speedup is due to realloc() needing to copy the string multiple times.

