Avro Binary Format
Avro specifies two serialization encodings: binary and JSON. Binary encoding is the default and used by most applications as it is smaller and faster. JSON encoding is primarily used for debugging and web-based applications.
Avro's binary encoding does not include field information: no field names, type information, or field/record separators. This results in smaller data, but it also means that the data can only be read by a reader that has the schema used during encoding (the writer's schema).
Primitive Types
The specification to encode primitive types is straightforward:
- null is written as zero bytes
- a boolean is written as a single byte with value 0 (false) or 1 (true)
- int and long values are written using variable-length zig-zag coding
- a float is written as 4 bytes
- a double is written as 8 bytes
- bytes are encoded as a long followed by that many bytes of data
- a string is encoded similarly to bytes: a long followed by that many bytes of UTF-8 encoded character data
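To make the zig-zag and length-prefix rules concrete, here is a minimal pure-Python sketch of how an int/long and a string could be encoded by hand. The helper names (zigzag, encode_long, encode_string) are made up for this illustration and are not part of the avro library, which does the same work inside BinaryEncoder.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Encode an int or long: zig-zag first, then base-128 variable-length bytes."""
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)  # last byte, continuation bit clear
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


print(encode_long(30).hex())      # 3c
print(encode_long(-1).hex())      # 01
print(encode_long(64).hex())      # 8001
print(encode_string("John"))      # b'\x08John'
```

Small values of either sign fit in a single byte, which is why zig-zag coding is used instead of plain two's complement: -1 becomes 1 rather than a run of 0xff bytes.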
The full specification for primitive types is part of the Apache Avro specification.
Let's walk through an example to understand this better. We will use the same User schema as the one in the introduction.
# Imports
import avro.schema
import io
import json
from avro.io import BinaryEncoder, DatumWriter

# Define the write schema
schema_json_str = json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["null", "string"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ]
})

# Parse json string schema to an Avro schema object
schema = avro.schema.parse(schema_json_str)
A User record that matches the schema can be encoded with the help of the DatumWriter class. We will be using the BinaryEncoder for this session:
# Construct Writer object and write encoded data
writer = io.BytesIO()
encoder = BinaryEncoder(writer)
datum_writer = DatumWriter(writers_schema=schema)
datum_writer.write({
    'first_name': 'John',
    'last_name': 'Doe',
    # The schema requires an email; this address is a placeholder value
    'email': 'john.doe@example.com',
    'age': 30
}, encoder)
writer would now contain the encoded value. You can decode the data to verify its integrity with BinaryDecoder and DatumReader objects:
from avro.io import BinaryDecoder, DatumReader

# Copy the encoded bytes into a fresh stream for reading
reader = io.BytesIO(writer.getvalue())
decoder = BinaryDecoder(reader)
datum_reader = DatumReader(schema)
datum_reader.read(decoder)
That looks alright!
We can break down the binary value to understand Avro encoding better. Let's start by printing the binary value as-is:
print(reader.getvalue())
Yep, it doesn't look readable because the data is in binary.
Printing the array of bytes as a hex array looks better:
# Restart at the beginning
reader.seek(0)
byte_arr = reader.getvalue()
print([format(byte, 'x') for byte in byte_arr])
Here is a visualization of the content for easier understanding.
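As a textual stand-in for that layout, the sketch below builds the bytes of a User record by hand and prints them field by field. It does not use the avro library: encode_long and encode_string are helper names invented for this illustration, and the record values (including the placeholder email address) are assumptions.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag then base-128 varint encoding, as used for Avro int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Hypothetical record values; the email address is a placeholder.
# A union value is written as the branch index (a long) followed by the value.
chunks = [
    ("first_name: string", encode_string("John")),
    ("last_name: union branch 1 (string)", encode_long(1) + encode_string("Doe")),
    ("email: string", encode_string("john.doe@example.com")),
    ("age: union branch 0 (int)", encode_long(0) + encode_long(30)),
]

for label, chunk in chunks:
    hex_bytes = " ".join(f"{b:02x}" for b in chunk)
    print(f"{label:<36} {hex_bytes}")

# A record is simply its fields concatenated in schema order
encoded = b"".join(chunk for _, chunk in chunks)
print(len(encoded), "bytes in total")
```

Notice that nothing in the output marks where one field ends and the next begins; only the schema tells a reader how to split the bytes.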

You can also decompose the encoded value in Python step-by-step as a stream:
# Restart at the beginning
reader.seek(0)

# First Name
print(f"First Name: {decoder.read_bytes()}")
# 0-based index position of Value Type in Union of ["null", "string"]
value_type = ['null', 'string'][decoder.read_int()]

# Last Name
if value_type == 'null':
    print("Last Name is blank")
else:
    print(f"Last Name: {decoder.read_bytes()}")
print(f"Email: {decoder.read_bytes()}")
# 0-based index position of Value Type in Union of ["int", "null"]
value_type = ['int', 'null'][decoder.read_int()]

# Age
if value_type == 'null':
    print("Age has not been provided")
else:
    print(f"Age: {decoder.read_int()}")
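To see roughly what calls like read_int and read_bytes do underneath, here is a minimal sketch that walks the same kind of stream without the avro library. The functions read_long and read_string are hypothetical helpers written for this illustration, and the payload is a hand-built User record with a placeholder email address.

```python
import io


def read_long(stream) -> int:
    """Read a base-128 varint from the stream and undo the zig-zag mapping."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)[0]  # raises IndexError if the stream ends early
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # continuation bit clear: this was the last byte
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_string(stream) -> str:
    """Read a long length prefix, then that many bytes of UTF-8 data."""
    length = read_long(stream)
    return stream.read(length).decode("utf-8")


# Hand-built payload for the User schema (values are assumptions)
payload = (
    b"\x08" + b"John"                    # first_name
    + b"\x02\x06" + b"Doe"               # last_name: branch 1, then string
    + b"\x28" + b"john.doe@example.com"  # email
    + b"\x00\x3c"                        # age: branch 0, then int 30
)
stream = io.BytesIO(payload)

first_name = read_string(stream)

# Union ["null", "string"]: branch index first, value only for non-null
branch = read_long(stream)
last_name = read_string(stream) if branch == 1 else None

email = read_string(stream)

# Union ["int", "null"]: branch 0 carries an int, branch 1 carries nothing
branch = read_long(stream)
age = read_long(stream) if branch == 0 else None

print(first_name, last_name, email, age)
```

The order of reads is dictated entirely by the schema; reading the fields in a different order, or with different types, would silently produce garbage.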
We examined the encoded value of an object with primitive types. It gets more complicated when we use complex types like records, or collections like arrays and maps. Let's explore them next.