Let us walk through a simple example of using Apache Avro with Python.


Assuming you have Python 3 installed, you can install Avro's Python package with pip:

> pip install avro

You can verify that avro is installed properly with the packaged avro command-line script:

!avro --v

Schema Definition

Avro relies on schemas to understand the structure of the data it encodes to and decodes from its binary format. Unlike some other serialization systems, though, Avro always needs the schema to be present when reading data. Since a schema is always around to refer to, there is no need to tag each encoded value with field names or type information, which keeps serialization both fast and compact.
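
To make the "no field names in the payload" point concrete, here is a hand-rolled sketch of the binary encoding rules described in the Avro specification (zig-zag varints for integers, length-prefixed UTF-8 for strings, record fields concatenated in schema order). It is not the avro library's code; the helper names and the two-field record (a string followed by an int) are assumptions made purely for illustration.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an int/long as the Avro spec describes: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (encoded as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_record(name: str, age: int) -> bytes:
    """Hypothetical record with a string field followed by an int field.

    Fields are written back to back in schema order, with no names or type tags.
    """
    return encode_string(name) + zigzag_varint(age)


encoded = encode_record("John", 23)
print(encoded, len(encoded))
```

The result is six bytes: one length byte, the four bytes of "John", and one byte for the age. Nothing in the payload says which field is which, which is why a reader cannot make sense of it without the schema.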

Avro schemas are defined with JSON, allowing most languages with JSON libraries to implement its specification.

Let us define a simple User schema to begin with:

import json

# The schema is an ordinary dict, serialized to a JSON string for Avro to parse.
schema_json_str = json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ],
})

The schema defines four fields in a User record. Three of those fields (first_name, last_name, and email) are strings, while age is an integer. The last_name and age fields are declared as unions with null, which makes them nullable: a record is still valid when they are not supplied, and they are then stored as null.
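
Because the schema is plain JSON, you can inspect it with nothing but the standard library. The sketch below lists the nullable fields of the User schema by looking for union types that include "null". The helper name nullable_fields is hypothetical, and it only handles flat record schemas like this one.

```python
import json

schema_json_str = json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ],
})


def nullable_fields(schema_json: str) -> list:
    """Return the names of fields whose type is a union that includes "null"."""
    schema = json.loads(schema_json)
    return [
        field["name"]
        for field in schema["fields"]
        # In Avro's JSON syntax, a union is written as a list of types.
        if isinstance(field["type"], list) and "null" in field["type"]
    ]


print(nullable_fields(schema_json_str))
```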

schema_json_str is a plain JSON-formatted string, which Avro can now parse:

import avro.schema
schema = avro.schema.parse(schema_json_str)

Avro's schema.parse accepts a JSON string and constructs a schema.Schema object that can be used to encode data. You can cross-check the constructed schema by calling its to_json method or by simply printing the schema:

{'type': 'record',
'name': 'User',
'namespace': 'ambitious.app',
'fields': [{'type': 'string', 'name': 'first_name'},
{'type': ['string', 'null'], 'name': 'last_name'},
{'type': 'string', 'name': 'email'},
{'type': ['int', 'null'], 'name': 'age'}]}

All seems to be in order. Let us now attempt to write a few records using the schema.

from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"first_name": "John", "last_name": "Doe", "email": "john.doe@example.com"})
writer.append({"first_name": "Jane", "age": 23, "email": "jane.doe@example.com"})
# Close the writer to flush buffered records to disk before reading the file back.
writer.close()

That was easy!

A DataFileWriter object writes serialized items to a data file on disk. It accepts:

  1. The file to write to. Notice that we are initializing/opening a file in wb mode (write and binary) to accept Avro encoded data.
  2. A DatumWriter object, which is responsible for actually serializing items to Avro's binary format.
  3. The schema to use to encode the data.

DataFileWriter.append adds items, represented as Python dicts, to the file users.avro with the schema supplied earlier.

Remember that the data is in binary format, so printing the contents of the generated `users.avro` file directly will show mostly unreadable characters.
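
If you want to sanity-check the file without dumping binary to the terminal, one option is to look only at its first few bytes. According to the Avro specification, an object container file begins with the four bytes "Obj" followed by the byte 1. The sketch below checks for that marker; the helper name looks_like_avro_file is hypothetical and not part of the avro package.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file, per the spec


def looks_like_avro_file(path: str) -> bool:
    """Return True if the file at `path` starts with the Avro container magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(AVRO_MAGIC)) == AVRO_MAGIC
```

Calling looks_like_avro_file("users.avro") on the file written above should return True. This only confirms the header marker; it does not validate the records that follow.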

We can now get Avro to read the information from the file to verify its contents:

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()

{'first_name': 'John', 'last_name': 'Doe', 'email': 'john.doe@example.com', 'age': None}
{'first_name': 'Jane', 'last_name': None, 'email': 'jane.doe@example.com', 'age': 23}

Yep, the contents look alright!

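DataFileReader (similar to DataFileWriter) and DatumReader (analogous to DatumWriter) objects read the data back from the disk. As you can see, DataFileReader is an iterator that returns dicts corresponding to the serialized items. Note that no schema is passed to the reader: Avro data files store the writer's schema in the file header, and DataFileReader picks it up from there.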

Let us next explore the schema definition language a little deeper, and understand its capabilities and nuances.

