Avro Writer's vs. Reader's Schema
Avro always needs a schema to read encoded data. At least one schema is always present: the writer's schema, which is used to encode and write the data.
Most systems simply package schema info along with data in the container file. Optionally, one can supply a separate reader's schema.
For the reader's schema to be able to read data encoded with a different schema, the two schemas must match according to a set of well-defined rules.
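One family of these rules covers primitive type promotion. The following is a minimal sketch, not the Avro library's own implementation: the `PROMOTIONS` table mirrors the promotions listed in the Avro specification, and `primitives_match` is a hypothetical helper name introduced only for illustration.

```python
# Primitive promotions permitted by the Avro specification:
# writer's type -> the reader's types it may be promoted to.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def primitives_match(writer_type: str, reader_type: str) -> bool:
    """Return True if data written as writer_type can be read as reader_type.

    Hypothetical helper: it only handles primitive type names, not records,
    unions, arrays, or other complex types.
    """
    if writer_type == reader_type:
        return True
    return reader_type in PROMOTIONS.get(writer_type, set())


print(primitives_match("int", "long"))  # True: int can be widened to long
print(primitives_match("long", "int"))  # False: narrowing is not allowed
```

Promotion only ever widens a type, which is why a reader may declare `long` where the writer used `int`, but not the other way around.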
Let us walk through an example of schema resolution in action with data of the record type. We will focus on the following aspects of resolution:
- The number of fields and their order can vary in the reader's schema.
- Only the fields present in the reader's schema are read from the data.
- Fields can be introduced in the reader's schema with default values.
Note: The writer's schema is still needed to decode the data, even when a different reader's schema is supplied.
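The three aspects listed above can be sketched in plain Python. This is a simplified illustration that works on already-decoded dictionaries, not how the Avro library resolves binary data; `resolve_record`, `MISSING`, and the sample field names are all hypothetical.

```python
# Sentinel meaning "no default declared", since None can be a valid default.
MISSING = object()


def resolve_record(written: dict, reader_fields: list) -> dict:
    """Project a decoded record onto the reader's fields (hypothetical helper)."""
    resolved = {}
    for field in reader_fields:  # the reader's field order is used
        name = field["name"]
        if name in written:
            # Field exists on both sides: take the written value.
            resolved[name] = written[name]
        elif field.get("default", MISSING) is not MISSING:
            # Field exists only in the reader's schema: use its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Written fields that the reader does not declare are simply skipped.
    return resolved


written = {"id": 7, "name": "Ada", "nickname": "ada"}
reader_fields = [
    {"name": "name"},
    {"name": "id"},
    {"name": "country", "default": "unknown"},
]
print(resolve_record(written, reader_fields))
# {'name': 'Ada', 'id': 7, 'country': 'unknown'}
```

A reader field that has neither a written value nor a default raises an error here, which reflects the rule that newly introduced fields need a default to be readable.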
To begin, let's hand-define the writer's schema:
```python
import json

import avro.schema

writer_schema = avro.schema.parse(json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ]
}))
```
And write a couple of records to a datafile:
```python
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

writer = DataFileWriter(
    open("users-multiple-schemas.avro", "wb"),
    DatumWriter(),
    writer_schema,
)
writer.append({"first_name": "John", "last_name": "Doe", "email": "john.doe@example.com"})
writer.append({"first_name": "Jane", "age": 23, "email": "jane.doe@example.com"})
writer.close()
```
Now let us create a custom reader's schema and introduce a few key differences as compared to the writer's schema:
```python
reader_schema = avro.schema.parse(json.dumps({
    "namespace": "ambitious.app",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "email", "type": "string"},
        {"name": "first_name", "type": "string"},
        {"name": "age", "type": ["long", "null"]},
        {"name": "timezone", "type": "string", "default": "US/Eastern"}
    ]
}))
```
You will observe that:
- The ordering of fields has changed.
- The type of the age field has changed from int to long.
- last_name has been dropped.
- A new field, timezone, has been introduced with a default value of US/Eastern.
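These differences can also be listed programmatically. The sketch below compares the two field lists as plain Python dictionaries, copied from the schemas above, so it does not need the Avro library; `diff_fields` is a hypothetical helper written for this illustration.

```python
writer_fields = [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": ["string", "null"]},
    {"name": "email", "type": "string"},
    {"name": "age", "type": ["int", "null"]},
]
reader_fields = [
    {"name": "email", "type": "string"},
    {"name": "first_name", "type": "string"},
    {"name": "age", "type": ["long", "null"]},
    {"name": "timezone", "type": "string", "default": "US/Eastern"},
]


def diff_fields(writer_fields: list, reader_fields: list) -> dict:
    """Summarize how the reader's fields differ from the writer's (hypothetical helper)."""
    writer = {f["name"]: f for f in writer_fields}
    reader = {f["name"]: f for f in reader_fields}
    return {
        # In the writer's schema but not declared by the reader.
        "dropped": [n for n in writer if n not in reader],
        # Declared only by the reader; these need a default value.
        "added": [n for n in reader if n not in writer],
        # Present on both sides but with a different declared type.
        "retyped": [
            n for n in reader
            if n in writer and reader[n]["type"] != writer[n]["type"]
        ],
    }


print(diff_fields(writer_fields, reader_fields))
# {'dropped': ['last_name'], 'added': ['timezone'], 'retyped': ['age']}
```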
Now read the datafile back, passing the reader's schema to the DatumReader:

```python
from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(
    open("users-multiple-schemas.avro", "rb"),
    DatumReader(readers_schema=reader_schema),
)
for user in reader:
    print(user)
reader.close()
```
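The two schemas are matched successfully, and the reader pulls values from data encoded with a different schema: each record comes back with only the email, first_name, age, and timezone fields, and timezone is filled in with its default value.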
Refer to the full Schema Resolution rules for examples of the rest of the matching rules.