Avro Binary Format


Last Updated on Sep 28, 2022

Avro specifies two serialization encodings: binary and JSON. Binary encoding is the default and used by most applications as it is smaller and faster. JSON encoding is primarily used for debugging and web-based applications.

Avro's binary encoding does not include field information: no field names, no type info, and no field or record separators. This keeps the data small, but it also means a reader cannot decode the data without the schema it was written with (the writer's schema).
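To get a rough feel for the size difference, here is a small sketch that compares the binary payload of the User record encoded later in this article against the same record written as plain JSON text. Plain `json.dumps` output is only a stand-in here (Avro's own JSON encoding wraps union values differently), so treat the comparison as an illustration rather than a benchmark.

```python
import json

# Avro binary encoding of the User record walked through later in this article.
# The final byte 0x3c is the one Python prints as '<'.
avro_payload = b"\x08John\x02\x06Doe(john.doe@example.com\x00\x3c"

# The same record as plain JSON text, which repeats every field name.
record = {
    "first_name": "John",
    "last_name": "Doe",
    "email": "john.doe@example.com",
    "age": 30,
}
json_payload = json.dumps(record).encode("utf-8")

print(f"Avro binary: {len(avro_payload)} bytes")
print(f"Plain JSON:  {len(json_payload)} bytes")
```

The binary payload carries only lengths, union branch indexes, and values. The field names live in the schema, not in the data.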

Primitive Types

The specification to encode primitive types is straightforward:

  • null is written as zero bytes
  • a boolean is written as a single byte with value 0 (false) or 1 (true)
  • int and long values are written using variable-length zig-zag coding.
  • a float is written as 4 bytes
  • a double is written as 8 bytes
  • bytes are encoded as a long followed by that many bytes of data
  • a string is encoded similarly to bytes: a long followed by that many bytes of UTF-8 encoded character data.
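The rules above can be sketched in a few lines of plain Python. This is a minimal illustration written for this article, not the avro library's implementation, and the helper names (`zigzag`, `encode_long`, and so on) are made up for the example. It assumes values fit in 64 bits.

```python
import struct


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Zig-zag the value, then emit it 7 bits at a time, least significant group first."""
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_boolean(value: bool) -> bytes:
    return b"\x01" if value else b"\x00"


def encode_float(value: float) -> bytes:
    return struct.pack("<f", value)  # 4 bytes, little-endian IEEE 754


def encode_double(value: float) -> bytes:
    return struct.pack("<d", value)  # 8 bytes, little-endian IEEE 754


def encode_bytes(data: bytes) -> bytes:
    return encode_long(len(data)) + data  # length prefix, then the raw bytes


def encode_string(text: str) -> bytes:
    return encode_bytes(text.encode("utf-8"))


print(encode_long(30).hex())    # 3c
print(encode_long(-1).hex())    # 01
print(encode_long(64).hex())    # 8001
print(encode_string("John"))    # b'\x08John'
```

Note that int and long share the same wire format, so a single `encode_long` covers both. The length prefix of "John" is 4, which zig-zags to 8; that is why the encoded string starts with `\x08`.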

The full specification for primitive types is in the Avro specification, under Binary Encoding.

Let's walk through an example to understand this better. We will use the same User schema as the one in the introduction.

# Imports
import avro.schema
import io
import json
from avro.io import BinaryEncoder, DatumWriter
# Define the write schema
schema_json_str = json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["null", "string"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ]
})
# Parse json string schema to an Avro schema object
schema = avro.schema.parse(schema_json_str)

A User record that matches the schema can be encoded with the help of the DatumWriter class. We will use a BinaryEncoder for this example:

# Construct Writer object and write encoded data
writer = io.BytesIO()
encoder = BinaryEncoder(writer)
datum_writer = DatumWriter(writers_schema=schema)
datum_writer.write({'first_name': 'John',
                    'email': 'john.doe@example.com',
                    'last_name': 'Doe',
                    'age': 30}, encoder)

writer now contains the encoded value. You can decode the data to verify its integrity with BinaryDecoder and DatumReader objects:

from avro.io import BinaryDecoder, DatumReader
# Wrap the encoded bytes in a fresh stream positioned at the start
reader = io.BytesIO(writer.getvalue())
decoder = BinaryDecoder(reader)
# Decode with the same schema that was used for writing
datum_reader = DatumReader(schema)
datum_reader.read(decoder)
{'first_name': 'John',
'last_name': 'Doe',
'email': 'john.doe@example.com',
'age': 30}

That looks alright!

We can break down the binary value to understand Avro encoding better. Let's start by printing the binary value as-is:

print(reader.getvalue())
b'\x08John\x02\x06Doe(john.doe@example.com\x00<'

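Yep, it isn't very readable: Python shows printable bytes as ASCII characters (so the byte 0x28 appears as '(' and 0x3c as '<') and everything else as \x escapes.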

Printing the array of bytes as a hex array looks better:

# Restart at the beginning
reader.seek(0)
byte_arr = reader.getvalue()
print([format(byte, 'x') for byte in byte_arr])
['8', '4a', '6f', '68', '6e', '2', '6', '44', '6f', '65', '28', '6a', '6f', '68', '6e', '2e', '64', '6f', '65', '40', '65', '78', '61', '6d', '70', '6c', '65', '2e', '63', '6f', '6d', '0', '3c']

Here is a visualization of the content for easier understanding.

[Figure: Binary Value]

You can also decompose the encoded value in Python step-by-step as a stream:

# Restart at the beginning
reader.seek(0)
# First Name
print(f"First Name: {decoder.read_bytes()}")
First Name: b'John'
# 0-based index position of Value Type in Union of ["null", "string"]
value_type = ['null', 'string'][decoder.read_int()]
# Last Name
if value_type == 'null':
    print("Last Name is blank")
else:
    print(f"Last Name: {decoder.read_bytes()}")
Last Name: b'Doe'
# Email
print(f"Email: {decoder.read_bytes()}")
Email: b'john.doe@example.com'
# 0-based index position of Value Type in Union of ["int", "null"]
value_type = ['int', 'null'][decoder.read_int()]
# Age
if value_type == 'null':
    print("Age has not been provided")
else:
    print(f"Age: {decoder.read_int()}")
Age: 30
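
To see what the decoder does with each byte, the same walk can be sketched without the avro library. The helpers below (`read_long`, `read_string`, `decode_user`) are hypothetical names written for this illustration and are hard-wired to the User schema. The second payload is hand-built here to show how the null branches look on the wire.

```python
import io


def read_long(stream) -> int:
    """Read a variable-length zig-zag integer (Avro int and long share this format)."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)[0]
        raw |= (byte & 0x7F) << shift   # low 7 bits carry data
        if not byte & 0x80:             # high bit clear: last byte of the number
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)      # undo the zig-zag mapping


def read_string(stream) -> str:
    length = read_long(stream)
    return stream.read(length).decode("utf-8")


def decode_user(payload: bytes) -> dict:
    stream = io.BytesIO(payload)
    user = {"first_name": read_string(stream)}
    # Union ["null", "string"]: branch 0 is null (no bytes follow), branch 1 is a string.
    branch = read_long(stream)
    user["last_name"] = read_string(stream) if branch == 1 else None
    user["email"] = read_string(stream)
    # Union ["int", "null"]: branch 0 is an int, branch 1 is null.
    branch = read_long(stream)
    user["age"] = read_long(stream) if branch == 0 else None
    return user


# The payload from this article; the final byte 0x3c is the one printed as '<'.
full = b"\x08John\x02\x06Doe(john.doe@example.com\x00\x3c"
# Hand-built payload for a user with no last name and no age.
sparse = b"\x08John\x00(john.doe@example.com\x02"

print(decode_user(full))
print(decode_user(sparse))
```

Reading the first payload byte by byte: 0x08 is the length 4 of "John", 0x02 selects union branch 1 (string), 0x06 is the length 3 of "Doe", 0x28 is the length 20 of the email, 0x00 selects union branch 0 (int), and 0x3c is the age 30. In the second payload a null value contributes only its branch index and no data bytes.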

We examined the encoded value of an object with primitive types. It gets more complicated when we use complex types like records, or collections like arrays and maps. Let's explore them next.


© 2022 Ambitious Systems. All Rights Reserved.