Encoding Data in Regular Python Classes with Avro
Previously, we saw how you can encode a complex JSON object (a User record) into Avro's binary format.
This is useful, but you seldom manipulate raw JSON data like this in your applications. Data is typically held in data containers, such as plain Python classes or constructs like data classes. These containers are what we initialize, load from the database, update, persist, or destroy.
Wouldn't it be nice if the code that works with these data containers could automatically encode to and decode from Avro? We will explore this with plain Python classes here and look at data classes next.
Let's start by defining a simple User Python class.
A Simple Python Class
    class PlainUser:
        def __init__(self, first_name, email, last_name=None, age=None):
            self.first_name = first_name
            self.last_name = last_name
            self.email = email
            self.age = age
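Initializing an instance of PlainUser is straightforward: pass the required first_name and email, and optionally last_name and age.
Once you have an instance (called plain_user here), you can retrieve its data through the __dict__ attribute: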
plain_user.__dict__
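The instantiation itself is not shown above, so here is a minimal sketch with made-up values (the name, email, and age are purely illustrative). It also shows vars(), the built-in that returns the same mapping as __dict__.

```python
class PlainUser:
    def __init__(self, first_name, email, last_name=None, age=None):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email
        self.age = age


# Hypothetical sample values, chosen only for illustration.
plain_user = PlainUser("John", "john.doe@example.com", last_name="Doe", age=30)

# __dict__ is an attribute holding a plain dict, so it is read without parentheses.
user_data = plain_user.__dict__
print(user_data)

# vars(obj) hands back the very same dictionary object.
print(vars(plain_user) is user_data)
```

The keys appear in the order the attributes were assigned in __init__, which is worth keeping in mind because Avro record fields are ordered too.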
Avro is strongly typed, so regular Python classes won't do as they are: they carry no type information. We will annotate the class with type hints, which can later be translated into their corresponding Avro types.
Python Class with Type Hints
Let's add type hints to PlainUser's class attributes:
    from typing import Optional, _UnionGenericAlias

    class PlainUser:
        first_name: str
        last_name: Optional[str] = None
        email: str
        age: Optional[int] = None

        def __init__(self, first_name, email, last_name=None, age=None):
            self.first_name = first_name
            self.last_name = last_name
            self.email = email
            self.age = age
Because we have annotated the class with type hints, we can extract the type information through the __annotations__ special attribute:
PlainUser.__annotations__
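As a rough illustration of what this lookup gives back, the sketch below repeats the annotated class (without __init__, which plays no role here) and inspects the mapping. The variable names annotations and field_names exist only for this example.

```python
from typing import Optional


class PlainUser:
    first_name: str
    last_name: Optional[str] = None
    email: str
    age: Optional[int] = None


# A mapping of attribute name to the annotated type.
annotations = PlainUser.__annotations__

# Keys follow the order in which the attributes were declared in the class body.
field_names = list(annotations)
print(field_names)

# Required fields map to the bare type; optional ones keep their Optional[...] wrapper.
print(annotations["first_name"] is str)
print(annotations["age"] == Optional[int])
```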
Dynamic Generation of Avro Schema
Let's write a naive transformation method that reads these annotations and generates an Avro JSON schema.
Note: The code below looks a little awkward because we have to dig the type information out of the type hints by hand. It gets much cleaner when we use data classes.
Every Avro record schema has a type, a name, an optional namespace, and a list of fields defined in a specific order. We can start by initializing the schema with these attributes:
    json_schema = {
        "namespace": "ambitious.app",
        "type": "record",
        "name": PlainUser.__name__,
        "fields": [],
    }
Next, let us define a naive map that links Python data types to Avro's data types.
    type_map = {
        str: "string",
        int: "int",
    }
The PlainUser class also has a few optional fields, which show up in the annotations as instances of _UnionGenericAlias.
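Note that _UnionGenericAlias is a private name in the typing module, so it may change between Python versions. As a hedged alternative (not what the code in this article uses), the public helpers typing.get_origin and typing.get_args, available since Python 3.8, can detect Optional fields. The helper name unwrap_optional is hypothetical.

```python
from typing import Optional, Union, get_args, get_origin


def unwrap_optional(tp):
    """Return (inner_type, True) for Optional[X], otherwise (tp, False).

    Only the simple Optional[X] case, i.e. Union[X, None], is recognized.
    """
    if get_origin(tp) is Union:
        args = get_args(tp)
        non_none = [arg for arg in args if arg is not type(None)]
        if len(args) == 2 and len(non_none) == 1:
            return non_none[0], True
    return tp, False


print(unwrap_optional(Optional[int]))
print(unwrap_optional(str))
```

Relying on public helpers keeps the detection logic independent of how the typing module represents unions internally.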
With this information, we can cycle through the annotations and start populating the fields of Avro schema:
    for attr_name, attr_type in PlainUser.__annotations__.items():
        if attr_type in [str, int]:
            # Required field: map the Python type directly.
            json_schema["fields"].append({"name": attr_name, "type": type_map[attr_type]})
        elif isinstance(attr_type, _UnionGenericAlias):
            # Optional[X] is Union[X, None]: emit an Avro union of null and X.
            json_schema["fields"].append(
                {"name": attr_name, "type": ["null", type_map[attr_type.__args__[0]]]}
            )
We have generated a JSON schema that is the same as the one we saw in the introduction:
json_schema
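Putting the steps of this section together, a self-contained sketch could look like the following. The function name build_avro_schema and the constant TYPE_MAP are hypothetical, and the sketch detects optional fields with the public typing.get_origin and typing.get_args helpers instead of _UnionGenericAlias. Like the code above, it only understands str, int, and their Optional variants.

```python
import json
from typing import Optional, Union, get_args, get_origin


class PlainUser:
    first_name: str
    last_name: Optional[str] = None
    email: str
    age: Optional[int] = None


# Same naive mapping as in the article: only str and int are supported.
TYPE_MAP = {str: "string", int: "int"}


def build_avro_schema(cls, namespace="ambitious.app"):
    """Build an Avro record schema (as a dict) from a class's type hints."""
    fields = []
    for attr_name, attr_type in cls.__annotations__.items():
        if attr_type in TYPE_MAP:
            # Required field: map the Python type directly.
            fields.append({"name": attr_name, "type": TYPE_MAP[attr_type]})
        elif get_origin(attr_type) is Union:
            # Optional[X] is Union[X, None]; take the first non-None member.
            inner = next(a for a in get_args(attr_type) if a is not type(None))
            fields.append({"name": attr_name, "type": ["null", TYPE_MAP[inner]]})
        else:
            raise TypeError(f"Unsupported annotation for {attr_name!r}: {attr_type!r}")
    return {
        "namespace": namespace,
        "type": "record",
        "name": cls.__name__,
        "fields": fields,
    }


json_schema = build_avro_schema(PlainUser)
print(json.dumps(json_schema, indent=2))
```

Raising on unknown annotations is a deliberate choice in this sketch: silently skipping a field would produce a schema that no longer matches the object's __dict__.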
Encode/Decode data with Dynamic Schema
Next, we can use the schema just like before to write and read encoded data.
To begin, serialize the JSON schema to a string and parse it into an Avro schema object:
    import json

    import avro.schema

    # avro.schema.parse expects the schema as a JSON string, not a dict.
    json_schema_str = json.dumps(json_schema)
    schema = avro.schema.parse(json_schema_str)
Then you can use the DataFileWriter class to encode and write the data into a file, and DataFileReader to decode and read the data later:
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    # Encode the object's attribute dict against the schema and write it to disk.
    writer = DataFileWriter(open("users-encoded-from-python-class.avro", "wb"), DatumWriter(), schema)
    writer.append(plain_user.__dict__)
    writer.close()

    # The schema is stored in the file, so the reader does not need it passed in.
    reader_buffer = DataFileReader(open("users-encoded-from-python-class.avro", "rb"), DatumReader())
    for user_record in reader_buffer:
        print(user_record)
    reader_buffer.close()
Success!
Of course, the type map was a crude implementation and only served our interests in this narrow example. But you get to experience one of the advantages of Avro over other systems such as Thrift, Protocol Buffers, etc. Because Avro does not need code generation to read or write data files or use or implement RPC protocols, Avro schemas can be generated on the fly, which is a significant benefit if you use dynamic languages. Code generation is treated as an optimization and can be optionally implemented for statically typed languages.
Extracting types from raw type hints makes the type inference a bit clunky. Let's try to accomplish the same thing using Python's data classes, which come with built-in utilities for handling the type information in a class.