
GzipWriter vs. Zlib::deflate in Ruby

I spent the better part of yesterday afternoon trying to figure out why I was getting different results when using GzipWriter versus using Zlib::deflate and putting my own headers on it. Lo and behold, while both gzip and zlib use DEFLATE, their headers are incompatible. In hindsight this is obvious, but a cursory glance at the compressed data would seem to indicate the algorithms are different.

Consider the following binary data, represented in hex:

16f45f1372d0bb5b176d2428fa6c451716f45f1372d0bb5b176d2428fa6c451716f45f1372d0bb5b176d2428fa6c451716f45f1372d0bb5b176d2428fa6c451716f45f1372d0bb5b176d2428fa6c4517


Nothing special: SecureRandom.hex provided 16 bytes and I copied and pasted it a few times.
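For anyone who wants to reproduce this, here is a minimal sketch of how an equivalent sample could be generated. The helper names (make_sample_hex, hex_to_binary, write_sample_file) are made up for illustration and are not part of the script below; the random bytes will of course differ from the ones shown above.

```ruby
require 'securerandom'

# 16 random bytes as 32 hex characters, repeated to mimic the copy and paste.
def make_sample_hex(repeats = 5)
  SecureRandom.hex(16) * repeats
end

# Turn a hex string into the raw bytes it describes.
def hex_to_binary(hex)
  [hex].pack('H*')
end

# Hypothetical helper: write the hex text where the script expects to find it.
def write_sample_file(path = 'sample_file', hex = make_sample_hex)
  File.write(path, hex)
end
```

Calling write_sample_file once creates a sample_file that the script can read.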

Running the following script produces two files for inspection:

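#!/usr/bin/env ruby

require 'zlib'

# sample_file holds the hex string above; turn each hex pair into a raw byte.
bin_data = File.read("sample_file").scan(/../).map { |byte| byte.hex.chr }.join

puts "Creating 1.file.writer via GzipWriter..."

# GzipWriter produces the header, the deflate stream and the footer.
File.open("1.file.writer", "wb") do |f|
  gz = Zlib::GzipWriter.new(f)
  gz.write(bin_data)
  gz.close
end


puts "Creating 2.file.deflate via Deflate..."

File.open("2.file.deflate", "wb") do |f|
  # Normalized gzip header: magic 1f 8b, method 8 (deflate), no flags,
  # zero modification time, no extra flags, OS 3 (Unix).
  header = [0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03].map { |item| item.chr }.join
  size = bin_data.bytesize
  crc = Zlib::crc32(bin_data)
  f.write(header)
  f.write(Zlib::deflate(bin_data))
  # Footer: CRC-32, then uncompressed size, each as 4 little-endian bytes.
  4.times { |i| f.write(((crc >> (8*i)) & 0xFF).chr) }
  4.times { |i| f.write(((size >> (8*i)) & 0xFF).chr) }
end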



In the first, we let Ruby handle everything. In the second, we create a normalized gzip header, deflate the data, and then put a footer on it.

Before we look at the hex dumps, a crash course on gzip/zlib.

gzip
Offset   Length   Contents
  0      2 bytes  magic header  0x1f, 0x8b (\037 \213)  
  2      1 byte   compression method
                     0: store (copied)
                     1: compress
                     2: pack
                     3: lzh
                     4..7: reserved
                     8: deflate
  3      1 byte   flags
                     bit 0 set: file probably ascii text
                     bit 1 set: continuation of multi-part gzip file, part number present
                     bit 2 set: extra field present
                     bit 3 set: original file name present
                     bit 4 set: file comment present
                     bit 5 set: file is encrypted, encryption header present
                     bit 6,7:   reserved
  4      4 bytes  file modification time in Unix format
  8      1 byte   extra flags (depend on compression method)
  9      1 byte   OS type



Nicely formatted info from here.

The footer is the CRC-32 of the uncompressed data followed by the length of the uncompressed data (modulo 2^32), each stored as four little-endian bytes.

All we are really concerned about:
Magic number: 1F 8B
Compression: 08
Flags, modification time, extra flags: all zeros
OS: 03 //unix
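
The layout above can be read back with String#unpack. This is only a sketch: parse_gzip_envelope and gzip_footer_matches? are names I made up, and the code assumes a header with no optional fields (flags of zero), which is the case for the files in this post.

```ruby
require 'zlib'

# Hypothetical helper: decode the fixed 10-byte gzip header and 8-byte footer.
# Assumes no optional fields (flags == 0).
def parse_gzip_envelope(bytes)
  id1, id2, cm, flags, mtime, xfl, os = bytes.byteslice(0, 10).unpack('CCCCVCC')
  crc32, isize = bytes.byteslice(-8, 8).unpack('VV')
  {
    magic: [id1, id2],   # expect [0x1f, 0x8b]
    method: cm,          # 8 means deflate
    flags: flags,
    mtime: mtime,        # little-endian Unix time, 0 when normalized
    extra_flags: xfl,
    os: os,              # 3 means Unix
    crc32: crc32,        # CRC-32 of the uncompressed data
    isize: isize         # uncompressed length modulo 2**32
  }
end

# True when the footer agrees with the given uncompressed data.
def gzip_footer_matches?(bytes, data)
  info = parse_gzip_envelope(bytes)
  info[:crc32] == Zlib.crc32(data) && info[:isize] == (data.bytesize & 0xffffffff)
end
```

Something like parse_gzip_envelope(File.binread("1.file.writer")) should then show the magic number, method 8 and the footer values for a file written by GzipWriter.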

zlib

First two bytes: 78 9c //indicates default compression, other compression options change these two bytes
The rest is some bit twiddling that is beyond the scope of this post.

The footer is the Adler-32 checksum of the uncompressed data, stored as four big-endian bytes.
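
As a rough illustration of what the zlib wrapper contains, here is a sketch that splits a zlib stream into its two-byte header, the raw deflate body and the Adler-32 footer. The field meanings follow RFC 1950; parse_zlib_envelope is a hypothetical name, and the code assumes no preset dictionary is in use.

```ruby
require 'zlib'

# Hypothetical helper: split a zlib stream into header fields, body and footer.
# Assumes the FDICT bit is not set (no preset dictionary).
def parse_zlib_envelope(bytes)
  cmf, flg = bytes.byteslice(0, 2).unpack('CC')
  {
    method: cmf & 0x0f,                          # 8 means deflate
    window_bits: (cmf >> 4) + 8,                 # 0x78 gives 15, a 32 KiB window
    level_hint: flg >> 6,                        # 0..3; 0x9c gives 2 (default)
    header_ok: ((cmf << 8) | flg) % 31 == 0,     # header check from RFC 1950
    body: bytes.byteslice(2, bytes.bytesize - 6),
    adler32: bytes.byteslice(-4, 4).unpack('N').first # big-endian, unlike gzip
  }
end
```

For the default header, 0x789c is 30876, which is 31 * 996, so the header check passes.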

So knowing this, let's look at the hexdumps of the two files:

1.file.writer
(I normalized the header that GzipWriter puts on.)
00000000  1f 8b 08 00 00 00 00 00  00 03 13 fb 12 2f 5c 74  |............./\t|
00000010  61 77 b4 78 ae 8a c6 af  1c 57 71 31 0a f9 00 6d  |aw.x.....Wq1...m|
00000020  49 e4 9c 50 00 00 00                              |I..P...|
00000027



2.file.deflate
00000000  1f 8b 08 00 00 00 00 00  00 03 78 9c 13 fb 12 2f  |..........x..../|
00000010  5c 74 61 77 b4 78 ae 8a  c6 af 1c 57 71 31 0a f9  |\taw.x.....Wq1..|
00000020  00 21 11 1f ff 6d 49 e4  9c 50 00 00 00           |.!...mI..P...|
0000002d



See where they diverge? Both files contain the same deflate stream (13 fb 12 2f 5c 74 61 77 b4 78 ae 8a c6 af 1c 57 71 31 0a f9 00) and end with the same gzip footer (6d 49 e4 9c 50 00 00 00). The difference is what 2.file.deflate wraps around the deflate stream: the two-byte zlib header 78 9c in front of it and the four-byte Adler-32 checksum 21 11 1f ff behind it.
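
To check that reading of the dumps, here is a minimal sketch that drops the two-byte zlib header and the four-byte Adler-32 checksum from the Zlib::Deflate.deflate output and wraps what is left in a gzip header and footer. GZIP_HEADER, gzip_by_hand and gunzip are names invented for this example, and the header is the same normalized one used in the script (zero modification time, OS set to Unix).

```ruby
require 'zlib'
require 'stringio'

# Fixed 10-byte header: magic, deflate, no flags, zero mtime, no extra flags, Unix.
GZIP_HEADER = [0x1f, 0x8b, 0x08, 0x00, 0, 0, 0, 0, 0x00, 0x03].pack('C*')

def gzip_by_hand(data)
  zlib_stream = Zlib::Deflate.deflate(data)
  # Drop the 2-byte zlib header and the 4-byte Adler-32 footer.
  raw_deflate = zlib_stream.byteslice(2, zlib_stream.bytesize - 6)
  # gzip footer: CRC-32, then uncompressed size, both 32-bit little-endian.
  footer = [Zlib.crc32(data), data.bytesize & 0xffffffff].pack('VV')
  GZIP_HEADER + raw_deflate + footer
end

# Read the result back with Ruby's own gzip reader as a sanity check.
def gunzip(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end
```

If gunzip(gzip_by_hand(data)) returns the original bytes, the six wrapper bytes really were the only thing standing between the two formats.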


If we wanted to, we could strip the two-byte zlib header and the four-byte Adler-32 footer, but at that point, why not just use the existing tools? Interestingly enough, zlib.h has functions that let you write a gzip file [in C]. From the header comments:

/*
...
  The library also supports reading and writing files in gzip (.gz) format
  with an interface similar to that of stdio using the functions that start
  with "gz".  The gzip format is different from the zlib format.  gzip is a
  gzip wrapper, documented in RFC 1952, wrapped around a deflate stream.
...
*/



No such direct functionality is provided by the Zlib module in Ruby that I can find after digging through ext/zlib/zlib.c for a while.
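
For gzipping a string in memory without touching a file, the sketch below may help. It assumes Zlib.gzip is only available on Ruby 2.4 and later (my understanding, not something checked against every version) and otherwise falls back to GzipWriter over a StringIO. gzip_in_memory is a made-up name.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: gzip a string in memory.
# Uses Zlib.gzip when this Ruby has it, otherwise GzipWriter over a StringIO.
def gzip_in_memory(data)
  return Zlib.gzip(data) if Zlib.respond_to?(:gzip)

  io = StringIO.new(String.new)
  gz = Zlib::GzipWriter.new(io)
  gz.write(data)
  gz.close
  io.string
end
```

Either path lets Ruby write the header and footer, so there is nothing to assemble by hand.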

------

Pretend for a moment that all you knew was that gzip and zlib both used deflate and you could simulate the former by slapping on a header to the latter. Confusion abounds! Surely there must be an error somewhere in the underlying implementation! Nope, the problem existed between the computer and the chair.
