Encoding Data in Regular Python Classes with Avro


Last Updated on Sep 28, 2022

Previously, we saw how you can encode a complex JSON object (a User record) into Avro's binary format.

This is useful, but you seldom manipulate JSON data directly like this in your applications. Data is typically held in data containers, like plain Python classes or constructs like Data Classes. These containers are what we initialize, load from the database, update, persist, or destroy.

Wouldn't it be nice if the underlying code, which utilizes data containers, could automatically encode/decode to/from Avro format? We will explore this aspect with plain Python classes here and look at using Data Classes next.

Let's start by defining a simple User Python class.

A Simple Python Class

class PlainUser:
    def __init__(self, first_name, email, last_name=None, age=None):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email
        self.age = age

Initializing an instance of PlainUser is straightforward:

plain_user = PlainUser(first_name='Baby', email='[email protected]', last_name='Doe', age=4)

You can retrieve the object's data through its __dict__ attribute:

plain_user.__dict__
{'first_name': 'Baby',
'last_name': 'Doe',
'email': '[email protected]',
'age': 4}

Avro is strongly typed, so a regular Python class like this one won't do in our case: it carries no type information. We will annotate the class with type hints, which can later be translated into their corresponding Avro types.

Python Class with Type Hints

Let's add type hints to PlainUser's class attributes:

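# _UnionGenericAlias is a private typing class, so it may change between Python versions.
# It is imported here because it is used later to detect Optional fields.
from typing import Optional, _UnionGenericAlias


class PlainUser:
    first_name: str
    last_name: Optional[str] = None
    email: str
    age: Optional[int] = None

    def __init__(self, first_name, email, last_name=None, age=None):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email
        self.age = age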

Because we have annotated the class with type hints, we can extract type information from the __annotations__ special attribute:

PlainUser.__annotations__
{'first_name': str,
'last_name': typing.Optional[str],
'email': str,
'age': typing.Optional[int]}
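
As an alternative to reading __annotations__ directly, the standard library's typing.get_type_hints() returns the same mapping and can also resolve annotations that were written as strings. A minimal sketch, using a hypothetical AnnotatedUser class that mirrors PlainUser's annotations:

```python
from typing import Optional, get_type_hints


class AnnotatedUser:  # hypothetical stand-in that mirrors PlainUser's annotations
    first_name: str
    last_name: Optional[str] = None
    email: str
    age: Optional[int] = None


# Maps each annotated attribute name to its type, like __annotations__ does.
hints = get_type_hints(AnnotatedUser)
print(hints["first_name"])                  # <class 'str'>
print(hints["age"] == Optional[int])        # True
```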

Dynamic Generation of Avro Schema

Let's write a naive transformation method that reads these annotations and generates an Avro JSON schema.

Note: Bear with the code below, which looks clunky because of the effort it takes to gather type information from type hints. It should get much better when we use Data Classes.

An Avro record schema has a type, a name, an optional namespace, and a list of fields defined in a specific order. We can initialize the schema with these attributes to begin with:

json_schema = {
    "namespace": "ambitious.app",
    "type": "record",
    "name": PlainUser.__name__,
    "fields": [],
}

Next, let us define a naive (or simplistic) map that links Python datatypes to Avro's datatypes.

type_map = {
    str: "string",
    int: "int",
}

We also have a few optional fields in the PlainUser class, and their Optional[...] hints will appear as instances of _UnionGenericAlias in the annotations.
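
Since _UnionGenericAlias is a private class, the same check can also be written with the public typing.get_origin() and typing.get_args() helpers. The sketch below is one possible approach, not the method used in this article; unwrap_optional is a hypothetical helper name, and it only handles hints spelled as typing.Optional or typing.Union:

```python
from typing import Optional, Union, get_args, get_origin


def unwrap_optional(annotation):
    """Return (inner_type, is_optional) for a type hint."""
    if get_origin(annotation) is Union:
        args = get_args(annotation)
        non_none = [arg for arg in args if arg is not type(None)]
        # Optional[X] is Union[X, None]: one real type plus NoneType.
        if len(non_none) == 1 and len(non_none) < len(args):
            return non_none[0], True
    return annotation, False


print(unwrap_optional(Optional[str]))  # (<class 'str'>, True)
print(unwrap_optional(int))            # (<class 'int'>, False)
```

Hints written with the newer `str | None` syntax may report a different origin depending on the Python version, so a fuller implementation would need to account for that as well.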

With this information, we can cycle through the annotations and start populating the fields of Avro schema:

for attr_name, attr_type in PlainUser.__annotations__.items():
    if attr_type in [str, int]:
        json_schema['fields'].append({'name': attr_name, 'type': type_map[attr_type]})
    elif isinstance(attr_type, _UnionGenericAlias):
        json_schema['fields'].append({"name": attr_name, 'type': ['null', type_map[attr_type.__args__[0]]]})

We have managed to generate a JSON schema that matches the one we saw in the introduction:

json_schema
{'namespace': 'ambitious.app',
'type': 'record',
'name': 'PlainUser',
'fields': [{'name': 'first_name', 'type': 'string'},
{'name': 'last_name', 'type': ['null', 'string']},
{'name': 'email', 'type': 'string'},
{'name': 'age', 'type': ['null', 'int']}]}
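
The steps above can be gathered into a single reusable function. The following is a minimal sketch rather than a complete implementation: avro_schema_for and TypedUser are hypothetical names, the float, bool, and bytes entries in the map are assumptions that go beyond this example, and anything the map does not cover raises an error instead of being skipped silently.

```python
from typing import Optional, Union, get_args, get_origin, get_type_hints

# Assumed mapping. Avro's "int" is 32-bit, so "long" may suit large Python ints better.
TYPE_MAP = {str: "string", int: "int", float: "double", bool: "boolean", bytes: "bytes"}


def avro_schema_for(cls, namespace):
    """Build an Avro record schema (as a dict) from a class's type hints."""
    fields = []
    for attr_name, attr_type in get_type_hints(cls).items():
        optional = False
        if get_origin(attr_type) is Union:
            members = [arg for arg in get_args(attr_type) if arg is not type(None)]
            if len(members) != 1:
                raise TypeError(f"Unsupported union for {attr_name!r}: {attr_type}")
            attr_type, optional = members[0], True
        if attr_type not in TYPE_MAP:
            raise TypeError(f"No Avro mapping for {attr_name!r}: {attr_type}")
        avro_type = TYPE_MAP[attr_type]
        # Optional fields become a union with "null", as in the loop above.
        fields.append({"name": attr_name, "type": ["null", avro_type] if optional else avro_type})
    return {"namespace": namespace, "type": "record", "name": cls.__name__, "fields": fields}


class TypedUser:  # hypothetical stand-in that mirrors PlainUser's annotations
    first_name: str
    last_name: Optional[str] = None
    email: str
    age: Optional[int] = None


schema_dict = avro_schema_for(TypedUser, "ambitious.app")
print(schema_dict["fields"][1])  # {'name': 'last_name', 'type': ['null', 'string']}
```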

Encode/Decode data with Dynamic Schema

Next, we can use the schema just like before to write and read encoded data.

To begin, serialize the JSON schema to a string and parse it into an Avro schema object:

import avro.schema
import json
json_schema_str = json.dumps(json_schema)
schema = avro.schema.parse(json_schema_str)

Then you can use the DataFileWriter class to encode and write the data into a file, and DataFileReader to decode and read the data later:

from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

writer = DataFileWriter(open("users-encoded-from-python-class.avro", "wb"), DatumWriter(), schema)
writer.append(plain_user.__dict__)
writer.close()

reader_buffer = DataFileReader(open("users-encoded-from-python-class.avro", "rb"), DatumReader())
for user_record in reader_buffer:
    print(user_record)
reader_buffer.close()

{'first_name': 'Baby', 'last_name': 'Doe', 'email': '[email protected]', 'age': 4}

Success!
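
The reader yields plain dictionaries, so completing the round trip back to a class instance is a matter of unpacking a record into the constructor. A minimal sketch, assuming the record keys match the __init__ parameter names; the email address below is a placeholder, and the record is written out by hand so the snippet runs without the Avro file:

```python
class PlainUser:
    def __init__(self, first_name, email, last_name=None, age=None):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email
        self.age = age


# A record shaped like the one printed by the reader (placeholder email address).
user_record = {"first_name": "Baby", "last_name": "Doe", "email": "baby@example.com", "age": 4}

# Unpack the decoded record into keyword arguments to rebuild the object.
restored_user = PlainUser(**user_record)
print(restored_user.first_name, restored_user.age)  # Baby 4
```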

Of course, the type map was a crude implementation and only served our interests in this narrow example. But you get to experience one of the advantages of Avro over other systems such as Thrift, Protocol Buffers, etc. Because Avro does not need code generation to read or write data files or use or implement RPC protocols, Avro schemas can be generated on the fly, which is a significant benefit if you use dynamic languages. Code generation is treated as an optimization and can be optionally implemented for statically typed languages.

Extracting types from plain type hints is a bit clunky. Let's try to accomplish the same thing using Python's Data Classes, which come pre-packaged with utilities to handle type information in a class.

