Encoding with Apache Avro
Let us walk through a simple example of using Apache Avro with Python.
Installation
Assuming you have a version of Python 3 installed, you can install Avro's Python package with pip:
> pip install avro
Collecting avro
  Using cached avro-1.10.1-py3-none-any.whl
Installing collected packages: avro
Successfully installed avro-1.10.1
You can verify that avro is installed properly with the packaged avro command-line script:
> avro --version
Schema Definition
Avro relies on schemas to understand the structure of the data it encodes to and decodes from its binary format. Unlike many other serialization systems, though, Avro always needs the schema to be present when reading data. Since a schema is always around to refer to, there is no need to attach schema information to the generated binary value, which keeps the encoded output both fast to process and small.
Avro schemas are defined in JSON, which allows most languages with a JSON library to implement the specification.
Let us define a simple User schema to begin with:
import json

schema_json_str = json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ],
})
The schema defines four fields in a User record. Three of those fields - first_name, last_name, and email - are strings, while age is an integer. last_name and age are declared as unions with null, which makes them optional: a record is still valid even if they are not supplied.
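For instance, because last_name and age are unions with null, both of the following records (the values are purely illustrative) satisfy the schema; the Python implementation simply treats the omitted keys as null when writing:

{"first_name": "Grace", "last_name": "Hopper", "email": "grace@example.com", "age": 85}
{"first_name": "Grace", "email": "grace@example.com"}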
schema_json_str is a simple JSON-formatted string, which can now be registered with Avro:
import avro.schema

schema = avro.schema.parse(schema_json_str)
Avro's schema.parse accepts a JSON string and constructs a schema.Schema object that can be used to encode data. You can cross-check the constructed schema with to_json or by simply printing the schema.
schema.to_json()
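You can also sanity-check a candidate record against the parsed schema before writing anything. A minimal sketch, assuming the validate helper exposed by avro.io in recent releases of the package (the record values are illustrative):

from avro.io import validate

print(validate(schema, {"first_name": "Ada", "email": "ada@example.com"}))  # True: optional fields may be omitted
print(validate(schema, {"age": 99}))                                        # False: the required string fields are missing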
All seems to be in order. Let us now attempt to write a few records using the schema.
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
# append a couple of records (the values here are illustrative)
writer.append({"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com", "age": 36})
writer.append({"first_name": "Alan", "email": "alan@example.com"})
writer.close()
That was easy!
A DataFileWriter object writes serialized items to a data file on disk. It accepts three arguments (a more compact variant is sketched after this list):
- The file to write to. Notice that we are initializing/opening the file in wb mode (write, binary) to accept Avro-encoded data.
- A DatumWriter object, which is responsible for actually serializing items to Avro's binary format.
- The schema to use to encode the data.
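If your installed version supports it (recent releases do), DataFileWriter can also be used as a context manager so the file is closed automatically. A minimal sketch under that assumption, writing one illustrative record to a hypothetical users_cm.avro so the file used above is left alone:

with DataFileWriter(open("users_cm.avro", "wb"), DatumWriter(), schema) as writer:
    writer.append({"first_name": "Ada", "email": "ada@example.com"})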
DataFileWriter.append adds items, represented as Python dicts, to the file users.avro with the schema supplied earlier.
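Conversely, a record that does not match the schema should be rejected at append time. A minimal sketch, assuming the library raises avro.io.AvroTypeException for an invalid datum, and using a hypothetical users_invalid.avro so the users.avro written above is untouched:

from avro.io import AvroTypeException

bad_writer = DataFileWriter(open("users_invalid.avro", "wb"), DatumWriter(), schema)
try:
    # email is missing and age has the wrong type, so this datum does not satisfy the schema
    bad_writer.append({"first_name": "Ada", "age": "not-a-number"})
except AvroTypeException as exc:
    print("rejected:", exc)
finally:
    bad_writer.close()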
We can now get Avro to read the information from the file to verify its contents:
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
Yep, the contents look alright!
DataFileReader (similar to DataFileWriter) and DatumReader (analogous to DatumWriter) objects read the data back from disk. As you can see, DataFileReader is an iterator that returns dicts corresponding to the serialized items.
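Note that the data file written by DataFileWriter embeds the schema in its header, which is why DataFileReader above did not need to be handed the schema again. When you encode individual records yourself, no schema is attached to the bytes at all, which is what keeps the payload small; the same schema must then be supplied on the reading side. A minimal sketch using avro.io's BinaryEncoder and BinaryDecoder directly (the record values are illustrative):

import io
from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

buffer = io.BytesIO()
DatumWriter(schema).write({"first_name": "Ada", "email": "ada@example.com"}, BinaryEncoder(buffer))
raw_bytes = buffer.getvalue()   # only the field values; the schema is not part of the payload
print(len(raw_bytes), "bytes")

buffer.seek(0)
print(DatumReader(schema).read(BinaryDecoder(buffer)))   # the same schema is needed to decode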
Let us next explore the schema definition language a little deeper, and understand its capabilities and nuances.