Make it possible to read and write xerial snappy #127

GregBowyer · 2014-02-19T23:10:25Z

Fixes #126

TL;DR

This makes it possible to read and write snappy-compressed streams that
are compatible with the Java and Scala Kafka clients (the xerial
blocking format).

Xerial Details

Kafka supports transparent compression of messages, both in transit and
at rest. One of the allowable compression algorithms is Google's snappy,
an algorithm that offers excellent speed at the cost of compression
ratio.

The specific implementation of snappy used in Kafka is xerial-snappy,
a readily available Java library for snappy.

As part of this implementation, there is a specialised blocking format
that is somewhat non-standard in the snappy world.

Xerial Format

The blocking mode of the xerial snappy library is fairly simple: a
magic header identifies the stream, followed by a size + block scheme.
Unless otherwise noted, all items in xerial's blocking format are
assumed to be big-endian.

A block size (xerial_blocksize in the implementation) controls how
frequently the blocking occurs; 32k is the default in the xerial library.
This block size sets the size of the uncompressed chunks that are
fed to snappy to be compressed.

The format winds up being

| Header | Block 1 Len | Block 1 Data | ... | Block n Len | Block n Data |
| --- | --- | --- | --- | --- | --- |
| 16 bytes | int32 | snappy output | ... | int32 | snappy output |

It is important to note that the blocksize is the amount of uncompressed
data presented to snappy for each block, whereas the blocklen is the
number of compressed bytes that will be present in the stream for that
block. The blocklen will therefore normally be <= blocksize, although
snappy can slightly expand incompressible input.
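
As a rough illustration of the block framing described above, here is a minimal sketch in Python. It is not the code from this pull request: `frame_blocks` and its `compress` parameter are hypothetical names, and the compressor is passed in as a callable so the framing logic stays independent of any snappy binding (with python-snappy installed, `snappy.compress` would be the natural choice). The 16-byte header is omitted here and would be written once, before the first block.

```python
import struct

DEFAULT_BLOCKSIZE = 32 * 1024  # xerial's default block size, per the text above


def frame_blocks(data, compress, blocksize=DEFAULT_BLOCKSIZE):
    """Return data as a sequence of length-prefixed compressed blocks.

    `compress` is any bytes -> bytes callable (hypothetical parameter;
    in real use this would be a snappy compressor).
    """
    out = []
    for start in range(0, len(data), blocksize):
        # Each chunk holds at most `blocksize` uncompressed bytes.
        chunk = data[start:start + blocksize]
        block = compress(chunk)
        # blocklen: the compressed length as a big-endian int32.
        out.append(struct.pack(">i", len(block)))
        out.append(block)
    return b"".join(out)
```

Note that the length written to the stream is that of the compressed block, not of the uncompressed chunk, which is the blocksize/blocklen distinction made above.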

Xerial blocking header

| Marker | Magic String | Null / Pad | Version | Compat |
| --- | --- | --- | --- | --- |
| byte | c-string | byte | int32 | int32 |
| -126 | 'SNAPPY' | \0 | variable | variable |

The pad appears to be there to ensure that SNAPPY is a valid C string,
and to align the header on a word boundary.

The version is the version of this format as written by xerial. In the
wild this is currently 1, so we only support v1.

Compat declares the minimum supported version that can read a
xerial block stream; presently in the wild this is 1.
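
The header layout above maps directly onto a `struct` format string. The following is a sketch under the assumption that the fields are packed big-endian with no padding beyond the explicit pad byte; `pack_header` and `parse_header` are illustrative names, not functions from this pull request.

```python
import struct

# marker (signed byte), 6-byte magic, pad byte, version, compat:
# 1 + 6 + 1 + 4 + 4 = 16 bytes, all big-endian.
HEADER_FORMAT = ">b6sBii"
MAGIC = b"SNAPPY"


def pack_header(version=1, compat=1):
    """Build the 16-byte xerial blocking header."""
    return struct.pack(HEADER_FORMAT, -126, MAGIC, 0, version, compat)


def parse_header(raw):
    """Validate the first 16 bytes of `raw` and return (version, compat)."""
    marker, magic, pad, version, compat = struct.unpack(HEADER_FORMAT, raw[:16])
    if marker != -126 or magic != MAGIC or pad != 0:
        raise ValueError("not a xerial blocking header")
    return version, compat
```

The marker -126, read as an unsigned byte, is 0x82, so a v1 header starts with the bytes `\x82SNAPPY\x00`.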

Implementation specific details

The implementation presented here follows the xerial implementation as
of its v1 blocking format; no attempt is made to check for future
versions. Since non-xerial-aware clients might have persisted
snappy-compressed messages to Kafka brokers, we allow clients to turn on
xerial compatibility for message sending, and we perform header sniffing
to detect xerial vs plain snappy payloads.
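
To make the header sniffing concrete, here is a minimal sketch of how a reader might tell the two payload kinds apart and decode either one. The names `is_xerial` and `decode` and the `decompress` callable are assumptions for illustration only (with python-snappy, `snappy.decompress` could be passed in); this is not the code from this pull request, and like the implementation described above it does not inspect the version or compat fields.

```python
import struct

XERIAL_PREFIX = b"\x82SNAPPY\x00"  # marker (-126 as a byte), magic, pad
HEADER_LEN = 16


def is_xerial(payload):
    """Sniff the leading bytes for the xerial blocking header."""
    return len(payload) >= HEADER_LEN and payload[:8] == XERIAL_PREFIX


def decode(payload, decompress):
    """Decode a xerial block stream or a plain snappy payload.

    `decompress` is any bytes -> bytes callable (hypothetical parameter).
    """
    if not is_xerial(payload):
        # Plain snappy: the whole payload is one compressed buffer.
        return decompress(payload)
    out = []
    pos = HEADER_LEN
    while pos < len(payload):
        # Read the big-endian int32 blocklen, then that many bytes.
        (blocklen,) = struct.unpack_from(">i", payload, pos)
        pos += 4
        out.append(decompress(payload[pos:pos + blocklen]))
        pos += blocklen
    return b"".join(out)
```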

Make it possible to read and write xerial snappy

dpkp added a commit that referenced this pull request Feb 25, 2014

Merge pull request #127 from GregBowyer/master

8811298

Make it possible to read and write xerial snappy

dpkp merged commit 8811298 into dpkp:master Feb 25, 2014

GregBowyer mentioned this pull request May 13, 2014

Kafka 0.8.0 Snappy #167

Closed
