Avro Writer's vs. Reader's Schema


Last Updated on Sep 28, 2022

Avro always needs a schema to read encoded data. There is always at least one schema present: the writer's schema, which is the schema used to encode and write the data.

Most systems simply package the writer's schema along with the data in the container file. Optionally, one can supply a separate reader's schema when reading.

For a reader's schema to read data encoded with a different schema, the two schemas must match according to a set of well-defined resolution rules.
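The record-level part of these rules can be approximated in a few lines of plain Python. The sketch below is a simplified model, not the avro library's implementation: `can_resolve` and `PROMOTIONS` are hypothetical names, schemas are reduced to dicts mapping a field name to a primitive type name, and unions, nested records, and aliases are ignored. The promotion table follows the type promotions listed in the Avro specification.

```python
# Writer type -> reader types it may be promoted to, per the Avro specification.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def can_resolve(writer_fields, reader_fields, reader_defaults=()):
    """Return True if every reader field can be filled from the writer's data.

    writer_fields / reader_fields: dict of field name -> primitive type name.
    reader_defaults: names of reader fields that declare a default value.
    (Hypothetical helper for illustration only.)
    """
    for name, reader_type in reader_fields.items():
        if name not in writer_fields:
            # A field unknown to the writer is only readable via its default.
            if name not in reader_defaults:
                return False
            continue
        writer_type = writer_fields[name]
        promotable = PROMOTIONS.get(writer_type, set())
        if writer_type != reader_type and reader_type not in promotable:
            return False
    # Writer fields missing from the reader are simply skipped, so no check is needed.
    return True


w_fields = {"first_name": "string", "email": "string", "age": "int"}
r_fields = {"email": "string", "age": "long", "timezone": "string"}

print(can_resolve(w_fields, r_fields, reader_defaults={"timezone"}))  # True
print(can_resolve(w_fields, r_fields))               # False: timezone has no default
print(can_resolve({"age": "long"}, {"age": "int"}))  # False: long cannot narrow to int
```

Field order plays no part in the check: fields are matched by name, which is why a reader's schema is free to reorder them.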

Let us walk through an example of this schema resolution in action with data of the record type. We will focus on the following aspects of resolution:

  • The number of fields and their order can vary in the reader's schema.
  • Only fields present in the reader's schema are read from the data.
  • Fields can be introduced in the reader's schema, provided they have default values.

Note: The writer's schema is still needed to decode the data, even when a different reader's schema is supplied.
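The reason is that Avro's binary encoding stores only values, in the writer's field order, with no field names or type tags. The sketch below hand-encodes a small record following the encoding rules in the Avro specification (zig-zag variable-length integers, length-prefixed UTF-8 strings). The helper names `encode_long` and `encode_string` and the two-field record are assumptions made for illustration; real code should rely on the avro library's encoder.

```python
def encode_long(n):
    """Encode an int/long as a zig-zag variable-length integer."""
    n = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small unsigned values
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_string(s):
    """Encode a string as its UTF-8 byte length followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# A record is just its field values concatenated in schema order.
# Hypothetical record: {"first_name": "Jane", "age": 23}, typed as plain string and int.
encoded = encode_string("Jane") + encode_long(23)
print(encoded)  # b'\x08Jane.'
```

Nothing in those six bytes says that the first value is a string named first_name or that the last byte is an int, so a decoder has to walk the writer's schema to know where each value begins and ends.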

To begin, let's hand-define the writer's schema:

import json
import avro.schema

writer_schema = avro.schema.parse(json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ]
}))

And write a couple of records to a datafile:

from avro.datafile import DataFileWriter
from avro.io import DatumWriter

writer = DataFileWriter(open("users-multiple-schemas.avro", "wb"), DatumWriter(), writer_schema)
# age is omitted here; the ["int", "null"] union lets it be written as null
writer.append({"first_name": "John", "last_name": "Doe", "email": "john.doe@example.com"})
# last_name is omitted here; the ["string", "null"] union lets it be written as null
writer.append({"first_name": "Jane", "age": 23, "email": "jane.doe@example.com"})
writer.close()

Now let us create a custom reader's schema and introduce a few key differences as compared to the writer's schema:

reader_schema = avro.schema.parse(json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "email", "type": "string"},
        {"name": "first_name", "type": "string"},
        {"name": "age", "type": ["long", "null"]},
        {"name": "timezone", "type": "string", "default": "US/Eastern"}
    ]
}))

You will observe that:

  • The ordering of fields has changed
  • The type of the age field has changed from int to long, a promotion that Avro allows
  • last_name has been dropped
  • A new field timezone has been introduced with a default value of US/Eastern

Now read the datafile back using the reader's schema:

from avro.datafile import DataFileReader
from avro.io import DatumReader

# The writer's schema comes from the file header; only the reader's schema is passed in.
reader = DataFileReader(
    open("users-multiple-schemas.avro", "rb"),
    DatumReader(readers_schema=reader_schema))
for user in reader:
    print(user)
reader.close()

This prints:

{'first_name': 'John', 'email': 'john.doe@example.com', 'age': None, 'timezone': 'US/Eastern'}
{'first_name': 'Jane', 'email': 'jane.doe@example.com', 'age': 23, 'timezone': 'US/Eastern'}

As you can see, the two schemas were successfully matched and the reader was able to pull values from data encoded with a different schema.

Refer to the full Schema Resolution rules for examples of the remaining matching rules.


© 2022 Ambitious Systems. All Rights Reserved.