Encoding Data in Regular Python Classes with Avro
Previously, we saw how to encode a complex JSON object (a User record) into Avro's binary format.
This is useful, but you seldom manipulate JSON data directly like this in your applications. Data typically lives in containers such as plain Python classes or Data Classes. These containers are what we initialize, load from the database, update, persist, or destroy.
Wouldn't it be nice if the underlying code, which utilizes data containers, could automatically encode/decode to/from Avro format? We will explore this aspect with plain Python classes here and look at using Data Classes next.
Let's start by defining a simple User Python class.
A Simple Python Class
    class PlainUser:
        def __init__(self, first_name, email, last_name=None, age=None):
            self.first_name = first_name
            self.last_name = last_name
            self.email = email
            self.age = age
Initializing an instance of PlainUser is straightforward.
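As a minimal sketch, the instance used in the rest of this section could be created as follows. The name, email, and age are placeholder values chosen only for illustration.

```python
class PlainUser:
    def __init__(self, first_name, email, last_name=None, age=None):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email
        self.age = age


# Placeholder values, chosen only for illustration
plain_user = PlainUser("John", "john.doe@example.com", last_name="Doe", age=35)
print(plain_user.first_name, plain_user.email)
```

Passing last_name and age as keyword arguments keeps the call readable, since both are optional in the constructor.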
You can retrieve the object's data through its __dict__ attribute:
plain_user.__dict__
Avro is strongly typed, so regular Python classes won't do in our case because they carry no type information. We will annotate the class with type hints, which can later be mapped to their corresponding Avro types.
Python Class with Type Hints
Let's add type hints to PlainUser's class attributes:
    from typing import Optional, _UnionGenericAlias

    class PlainUser:
        first_name: str
        last_name: Optional[str] = None
        email: str
        age: Optional[int] = None

        def __init__(self, first_name, email, last_name=None, age=None):
            self.first_name = first_name
            self.last_name = last_name
            self.email = email
            self.age = age
Because we have annotated the class with type hints, we can extract type information through the __annotations__ special attribute:
PlainUser.__annotations__
Dynamic Generation of Avro Schema
Let's write a naive transformation method that reads these annotations and generates an Avro JSON schema.
Note: The code below looks clumsy because we have to dig type information out of the type hints by hand. Bear with it; things get much better when we use Dataclasses.
An Avro record schema like ours has a namespace, a type, a name, and one or more fields defined in a specific order. We can initialize the schema with these attributes to begin with:
    json_schema = {
        "namespace": "ambitious.app",
        "type": "record",
        "name": PlainUser.__name__,
        "fields": [],
    }
Next, let us define a simplistic map that links Python data types to Avro data types.

    type_map = {
        str: "string",
        int: "int",
    }
We also have a few optional fields in the PlainUser class, and they appear as instances of _UnionGenericAlias in the annotations.
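Since _UnionGenericAlias is a private name in the typing module, it may help to see the same information through public helpers. The sketch below assumes Python 3.8 or later, where typing.get_origin and typing.get_args are available, and shows what an Optional hint looks like when taken apart.

```python
from typing import Optional, Union, get_args, get_origin

hint = Optional[str]

# Optional[str] is shorthand for Union[str, None]
is_union = get_origin(hint) is Union

# The union members: the wrapped type followed by NoneType
args = get_args(hint)

print(is_union)
print(args)
```

The first member of the union is the wrapped type, which is why the article's code can read it from __args__[0].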
With this information, we can cycle through the annotations and start populating the fields of the Avro schema:
    for attr_name, attr_type in PlainUser.__annotations__.items():
        if attr_type in [str, int]:
            json_schema['fields'].append(
                {'name': attr_name, 'type': type_map[attr_type]}
            )
        elif isinstance(attr_type, _UnionGenericAlias):
            json_schema['fields'].append(
                {"name": attr_name, 'type': ['null', type_map[attr_type.__args__[0]]]}
            )
This produces a JSON schema identical to the one we saw in the introduction:
json_schema
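The steps above can also be wrapped in a single helper. The following is a sketch under stated assumptions, not the article's own code: build_avro_schema and TYPE_MAP are hypothetical names, and the public typing.get_origin and typing.get_args helpers stand in for the private _UnionGenericAlias check. It only understands str, int, and their Optional variants.

```python
import json
from typing import Optional, Union, get_args, get_origin


class PlainUser:
    first_name: str
    last_name: Optional[str] = None
    email: str
    age: Optional[int] = None


# Same crude mapping as in the article; extend it for more types
TYPE_MAP = {str: "string", int: "int"}


def build_avro_schema(cls, namespace="ambitious.app"):
    """Hypothetical helper: derive an Avro record schema from class annotations."""
    fields = []
    for attr_name, attr_type in cls.__annotations__.items():
        if attr_type in TYPE_MAP:
            # Required field with a directly mapped type
            fields.append({"name": attr_name, "type": TYPE_MAP[attr_type]})
        elif get_origin(attr_type) is Union and type(None) in get_args(attr_type):
            # Optional[X] becomes the Avro union ["null", X]
            inner = [a for a in get_args(attr_type) if a is not type(None)][0]
            fields.append({"name": attr_name, "type": ["null", TYPE_MAP[inner]]})
        else:
            raise TypeError(f"Unsupported type hint for {attr_name!r}: {attr_type!r}")
    return {
        "namespace": namespace,
        "type": "record",
        "name": cls.__name__,
        "fields": fields,
    }


schema_dict = build_avro_schema(PlainUser)
print(json.dumps(schema_dict, indent=2))
```

Raising TypeError on an unmapped hint is a design choice made here so that unsupported fields fail loudly instead of being silently dropped from the schema.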
Encode/Decode data with Dynamic Schema
Next, we can use the schema just like before to write and read encoded data.
To begin, stringify the JSON schema and parse it into Avro format:
    import avro.schema
    import json

    json_schema_str = json.dumps(json_schema)
    schema = avro.schema.parse(json_schema_str)
Then you can use the DataFileWriter class to encode and write the data into a file, and DataFileReader to decode and read the data later:
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    writer = DataFileWriter(open("users-encoded-from-python-class.avro", "wb"), DatumWriter(), schema)
    writer.append(plain_user.__dict__)
    writer.close()

    reader_buffer = DataFileReader(open("users-encoded-from-python-class.avro", "rb"), DatumReader())
    for user_record in reader_buffer:
        print(user_record)
    reader_buffer.close()
Success!
Of course, the type map was a crude implementation that only served our interests in this narrow example. But it lets you experience one of Avro's advantages over other systems such as Thrift and Protocol Buffers. Because Avro does not need code generation to read or write data files, or to use or implement RPC protocols, Avro schemas can be generated on the fly, which is a significant benefit in dynamic languages. Code generation is treated as an optimization and can optionally be implemented for statically typed languages.
Using Python's type hints directly makes type inference a bit clunky. Let's try to accomplish the same thing using Python's Dataclasses, which come pre-packaged with utilities to handle type information in a class.