Skip to content

Latest commit

 

History

History
165 lines (125 loc) · 5.42 KB

File metadata and controls

165 lines (125 loc) · 5.42 KB

Bitarray representations

The bitarray library offers many ways to represent bitarray objects. Here, we take a closer look at those representations and discuss their advantages and disadvantages.

Binary representation

The most common representation of bitarrays is its native binary string representation, which is great for interactively analyzing bitarray objects:

>>> from bitarray import bitarray
>>> a = bitarray('11001')
>>> repr(a)  # same as str(a)
"bitarray('11001')"
>>> a.to01()  # the raw string of 0's and 1's
'11001'

However, this representation is very large compared to the bitarray object itself, and it not efficient for large bitarrays.

Byte representation

As bitarray objects are stored in a byte buffer in memory, it is very efficient (in terms of size and time) to use this representation of large bitarrays. However, this representation is not very human readable.

>>> a = bitarray('11001110000011010001110001111000010010101111000111100')
>>> a.tobytes()  # raw buffer
b'\xce\r\x1cxJ\xf1\xe0'

Here, the number of pad bits within the last byte, as well as the bit-endianness, is not part of the byte buffer itself. Therefore, extra work is required to store this information. The utility function serialize() adds this information to a header byte:

>>> from bitarray.util import serialize, deserialize
>>> x = serialize(a)
>>> x
b'\x13\xce\r\x1cxJ\xf1\xe0'
>>> b = deserialize(x)
>>> assert a == b and a.endian == b.endian

The header byte is structured the following way:

>>> x[0]        # 0x13
19
>>> x[0] % 16   # number of pad bits (0..7) within last byte
3
>>> x[0] // 16  # bit-endianness: 0 little, 1 big
1

Hence, valid values for the header byte are in the ranges 0 .. 7 or 16 .. 23 (inclusive). Moreover, if the serialized bitarray is empty (x only consists of a single byte - the header byte), the only valid values for the header are 0 or 16 (corresponding to a little-endian and big-endian empty bitarray). The functions serialize() and deserialize() are the recommended and fasted way to (de-) serialize bitarray objects to bytes objects (and vice versa). The exact format of this representation is guaranteed to not change in future releases.

Hexadecimal representation

As four bits of a bitarray may be represented by a hexadecimal digit, we can represent bitarrays (whose length is a multiple of 4) as a hexadecimal string:

>>> from bitarray.util import ba2hex, hex2ba
>>> a = bitarray('1100 1110 0001 1010 0011 1000 1111')
>>> ba2hex(a)
'ce1a38f'
>>> hex2ba('ce1a38f')
bitarray('1100111000011010001110001111')

Note that the representation is different for the same bitarray if the endianness changes:

>>> a.endian
'big'
>>> b = bitarray(a, 'little')
>>> assert a == b
>>> b.endian
'little'
>>> ba2hex(b)
'3785c1f'

The functions ba2hex() and hex2ba() are very efficiently implemented in C, and take advantage of byte level operations.

Base 2, 4, 8, 16, 32 and 64 representation

The utility function ba2base() allows representing bitarrays by base n, with possible bases 2, 4, 8, 16, 32 and 64. The bitarray has to be multiple of length 1, 2, 3, 4, 5 or 6 respectively:

>>> from bitarray.util import ba2base
>>> a = bitarray('001010111111100000111011100110110001111100101110111110010010')
>>> len(a)          # divisible by 2, 3, 4, 5 and 6
60
>>> ba2base(2, a)   # binary
'001010111111100000111011100110110001111100101110111110010010'
>>> ba2base(4, a)   # quaternary
'022333200323212301330232332102'
>>> ba2base(8, a)   # octal
'12774073466174567622'
>>> ba2base(16, a)  # hexadecimal
'2bf83b9b1f2ef92'
>>> ba2base(32, a)  # base 32 (using RFC 4648 Base32 alphabet)
'FP4DXGY7F34S'
>>> ba2base(64, a)  # base 64 (using standard base 64 alphabet)
'K/g7mx8u+S'

Note that ba2base(2, a) is equivalent to a.to01() and that ba2base(16, a) is equivalent to ba2hex(a). Unlike ba2hex(), ba2base() does not take advantage of byte level operations and is therefore slower, although it is also implemented in C. The inverse function is called base2ba().

Variable length representation

In some cases, it is useful to represent bitarrays in a binary format that is "self terminating" (in the same way that C strings are NUL terminated). That is, when an encoded bitarray of unknown length is encountered in a stream of binary data, the format lets us know when the end of the encoded bitarray is reached. See variable length format for this representation.

Compressed sparse bitarrays

Another representation is compressed sparse bitarrays, whose format is also "self terminating". This, format actually uses different representations dependent on how sparsely the population of the bitarray (even sections of the bitarray) is. For large sparse bitarrays, the format reduces (compresses) the amount of data very efficiently, while only requiring a very tiny overhead for non-sparsely populated bitarrays.