Avro Writer's vs. Reader's Schema

Avro always needs a schema to read encoded data. There is always at least one schema present: the writer's schema, which is the schema the data was encoded and written with.

Most systems simply package the writer's schema along with the data in the container file. Optionally, a separate reader's schema can be supplied when the data is read.

For a reader's schema to read data encoded with a different schema, the two schemas must match according to a set of well-defined rules.

Let us walk through an example of this schema resolution in action, using data of the record type. We will focus on the following aspects of resolution:

The number of fields and their order can vary in the reader's schema.
Only fields present in the reader's schema are read from the data.
Fields can be introduced in the reader's schema with default values.

Note: The writer's schema is still needed even when a different reader's schema is supplied.
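
The writer's schema is needed because Avro's binary encoding carries no field names or type tags: a record is simply its field values written back to back, in the order the writer's schema declares them. The following is a minimal hand-written sketch of that encoding for a hypothetical two-field record (a string first_name and an ["int", "null"] age). It does not use the avro library, and the helper names encode_long, encode_string, and encode_record are made up for illustration.

```python
def encode_long(n: int) -> bytes:
    """Encode an Avro int/long: zigzag, then base-128 varint, low group first."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode an Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(first_name, age) -> bytes:
    """Hypothetical record: first_name is a string, age is ["int", "null"]."""
    if age is None:
        age_bytes = encode_long(1)  # union branch 1 ("null") has no payload
    else:
        age_bytes = encode_long(0) + encode_long(age)  # branch 0 ("int"), then value
    # Field values are concatenated in schema order; no names are written.
    return encode_string(first_name) + age_bytes


payload = encode_record("Jane", 23)
print(payload.hex())  # 084a616e65002e
```

Nothing in these seven bytes says where first_name ends or that a union follows it. Only the writer's schema tells a decoder how to split them, which is why it must be available even when a reader's schema is supplied.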

To begin, let's hand-define the writer's schema:

import json
import avro.schema

writer_schema = avro.schema.parse(json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name",  "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ]
}))

Next, write a couple of records to a datafile:

from avro.datafile import DataFileWriter
from avro.io import DatumWriter

writer = DataFileWriter(open("users-multiple-schemas.avro", "wb"), DatumWriter(), writer_schema)

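# Keys that are left out (age for John, last_name for Jane) are written as null,
# which the ["...", "null"] unions in the writer's schema allow.
writer.append({"first_name": "John", "last_name": "Doe", "email": "[email protected]"})
writer.append({"first_name": "Jane", "age": 23, "email": "[email protected]"})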

writer.close()

Now let us create a custom reader's schema that differs from the writer's schema in a few key ways:

reader_schema = avro.schema.parse(json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "email", "type": "string"},
        {"name": "first_name", "type": "string"},
        {"name": "age", "type": ["long", "null"]},
        {"name": "timezone", "type": "string", "default": "US/Eastern"}
    ]
}))

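You will observe that:

The ordering of fields has changed.
The type of the age field has changed from int to long.
last_name has been dropped.
A new field, timezone, has been introduced with a default value of US/Eastern.

Now read the datafile, passing the reader's schema to DatumReader. The writer's schema is picked up from the datafile itself: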

from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(
    open("users-multiple-schemas.avro", "rb"),
    DatumReader(readers_schema=reader_schema))

for user in reader:
    print(user)

reader.close()

Output:

{'first_name': 'John', 'email': '[email protected]', 'age': None, 'timezone': 'US/Eastern'}
{'first_name': 'Jane', 'email': '[email protected]', 'age': 23, 'timezone': 'US/Eastern'}

As you can see, the two schemas were successfully matched: the reader returned only the fields declared in its own schema, skipped last_name, and filled timezone from its default, even though the data was encoded with a different schema.
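
The record-level rules exercised here can be mimicked in a few lines of plain Python. This is only a simplified sketch, not how the avro library implements resolution. The function resolve_record and the NO_DEFAULT sentinel are hypothetical names, the email address is a placeholder, and the sketch works on already-decoded dictionaries while ignoring type promotion, unions, and nested records.

```python
NO_DEFAULT = object()  # hypothetical sentinel: the reader field declares no default


def resolve_record(writer_record, reader_fields):
    """Project a record decoded with the writer's schema onto the reader's fields.

    reader_fields is a list of (name, default) pairs in the reader's order.
    """
    resolved = {}
    for name, default in reader_fields:
        if name in writer_record:
            resolved[name] = writer_record[name]  # matched by name, not position
        elif default is not NO_DEFAULT:
            resolved[name] = default  # field unknown to the writer: use the default
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    # Fields that only the writer knows about (here last_name) are never copied.
    return resolved


reader_fields = [
    ("email", NO_DEFAULT),
    ("first_name", NO_DEFAULT),
    ("age", NO_DEFAULT),
    ("timezone", "US/Eastern"),
]

# A record as the writer's schema would decode it (placeholder email address).
written = {"first_name": "Jane", "last_name": None,
           "email": "jane@example.com", "age": 23}

print(resolve_record(written, reader_fields))
```

The resulting dictionary holds the same four keys as the reader's output shown earlier. The key order may differ from what the library prints, since this sketch iterates over the reader's fields.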

Refer to the full Schema Resolution rules in the Avro specification for the remaining matching rules and examples of each.
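
One of those rules is what allowed age to be written as int and read as long: the Avro specification lets a writer's primitive type be promoted to certain wider reader types. A minimal sketch of that check follows. The helper name primitive_types_match is hypothetical, and the sketch covers primitive types only.

```python
# Promotions permitted by the Avro specification: writer type -> reader types.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def primitive_types_match(writer_type: str, reader_type: str) -> bool:
    """Return True if data written as writer_type can be read as reader_type."""
    if writer_type == reader_type:
        return True
    return reader_type in PROMOTIONS.get(writer_type, set())


print(primitive_types_match("int", "long"))   # True
print(primitive_types_match("long", "int"))   # False
```

In the example above age is a union in both schemas, so the comparison applies to the branch the writer actually selected: an int value is resolved against the first branch of the reader's union that matches it, here long.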