Avro Schema Fingerprints
Unlike many other serialization systems, Avro does not embed type information in each encoded value. Instead, Avro assumes that the schema used to write the data is always available when the data is read. While this translates to faster serialization and smaller payloads, it also means that every schema used to encode data has to be kept available for as long as that data exists.
There are cases when the schema is packaged along with the encoded data. Avro supports a simple object container file format for encoding objects in bulk. The file's header includes a metadata attribute that holds the schema used to encode the objects.
But in many situations, Avro schemas have to be made available separately:
- They need to be stored in a centrally accessible location for use by multiple applications
- There may be variations of the same schema for optimized data consumption (reader schemas)
- Individual data records are serialized and stored for an extended period of time (messages in an Apache Kafka topic, for example)
- Reader and writer applications need to negotiate schemas quickly
In these instances, it is beneficial to identify a schema by a fingerprint derived from the schema itself instead of relying on an explicit marker or version identifier. A "fingerprint" is a short bit string, computed from the schema, that identifies the original schema with a very low probability of collision.
This section steps through an example of generating a fingerprint of an Avro schema.
Schema Fingerprints
To transform a writer schema into a fingerprint, we first need to convert it into its Parsing Canonical Form:
- Parsing because the transformation strips away parts of the schema that are irrelevant to readers, such as "doc" attributes.
- Canonical Form because the transformations normalize the JSON text in a way that eliminates unimportant differences between schemas.
So if the Parsing Canonical Forms of two schemas are textually equal, then those schemas are considered the "same" as far as a reader is concerned. In the Avro context, fingerprints are computed over these Parsing Canonical Forms, so schemas that are interchangeable for a reader share one fingerprint.
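To make the idea concrete, here is a minimal, simplified sketch of such a normalization in plain Python. It is not the library's implementation: the function names `canonicalize` and `to_canonical_text` are made up for this illustration, and the sketch only covers the common cases (primitives, records, enums, fixed, arrays, maps and unions). It expands names to full names, keeps only the attributes a reader needs, emits them in a fixed order, and removes whitespace.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def canonicalize(schema, namespace=""):
    """Return a simplified canonical form of a schema as plain Python data."""
    if isinstance(schema, list):
        # Union: canonicalize every branch, keep the branch order.
        return [canonicalize(branch, namespace) for branch in schema]
    if isinstance(schema, str):
        # Primitive name, or a reference to a named type.
        if schema in PRIMITIVES or "." in schema or not namespace:
            return schema
        return f"{namespace}.{schema}"
    kind = schema["type"]
    if not isinstance(kind, str):
        # Nested definition such as {"type": {"type": "array", ...}}.
        return canonicalize(kind, namespace)
    if kind in PRIMITIVES:
        # {"type": "int"} collapses to "int".
        return kind
    if kind == "array":
        return {"type": "array", "items": canonicalize(schema["items"], namespace)}
    if kind == "map":
        return {"type": "map", "values": canonicalize(schema["values"], namespace)}
    if kind not in ("record", "enum", "fixed"):
        # {"type": "SomeName"} is a reference to a named type.
        return canonicalize(kind, namespace)

    # Named types: fold the namespace into the full name.
    name = schema["name"]
    if "." in name:
        fullname, namespace = name, name.rsplit(".", 1)[0]
    else:
        namespace = schema.get("namespace", namespace)
        fullname = f"{namespace}.{name}" if namespace else name

    # Dicts keep insertion order, so attributes come out as name, type, then the rest.
    out = {"name": fullname, "type": kind}
    if kind == "record":
        out["fields"] = [
            {"name": field["name"], "type": canonicalize(field["type"], namespace)}
            for field in schema["fields"]
        ]
    elif kind == "enum":
        out["symbols"] = list(schema["symbols"])
    else:
        out["size"] = int(schema["size"])
    return out


def to_canonical_text(schema):
    """Serialize the canonical data without any insignificant whitespace."""
    return json.dumps(canonicalize(schema), separators=(",", ":"), ensure_ascii=False)


with_doc = {
    "namespace": "ambitious.app", "doc": "A simple user schema",
    "type": "record", "name": "User",
    "fields": [{"name": "first_name", "type": "string"},
               {"name": "age", "type": ["int", "null"]}],
}
reordered = {
    "name": "User", "type": "record", "namespace": "ambitious.app",
    "fields": [{"type": "string", "name": "first_name"},
               {"type": ["int", "null"], "name": "age"}],
}
print(to_canonical_text(with_doc) == to_canonical_text(reordered))  # True
```

The two input schemas differ in their "doc" attribute and in attribute order, yet they normalize to the same text, which is exactly the property a fingerprint relies on.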
There is a fundamental trade-off between the fingerprint's length and the probability of collisions when designing fingerprinting algorithms. While many fingerprinting algorithms are available, Avro recommends the following three variations that balance the length of the fingerprint with interoperability and ease of implementation:
- SHA-256: 256-bit fingerprints with the lowest probability of collisions, for applications that can tolerate longer fingerprints.
- Rabin (CRC-64-AVRO): compact 64-bit fingerprints that are long enough to identify schemas reliably in caches of up to about a million entries, which is sufficient for most implementations. Uniqueness is highly probable rather than guaranteed.
- MD5: 128-bit fingerprints, a middle ground that comes in handy when managing tens of millions of schemas.
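To get a feel for this trade-off, we can estimate the chance of at least one collision with the birthday approximation. This is a rough sketch that assumes fingerprints behave like uniformly random bit strings, and the helper name `collision_probability` is made up for this illustration.

```python
import math


def collision_probability(num_schemas: int, fingerprint_bits: int) -> float:
    """Approximate chance that at least two of num_schemas fingerprints collide.

    Birthday approximation: p ~= 1 - exp(-n * (n - 1) / 2 / 2**bits),
    assuming fingerprints are uniformly distributed.
    """
    pairs = num_schemas * (num_schemas - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / 2.0 ** fingerprint_bits)


for label, bits in (("Rabin", 64), ("MD5", 128), ("SHA-256", 256)):
    print(f"{label:8s} {bits:3d} bits: {collision_probability(1_000_000, bits):.3g}")
```

Under this assumption, a cache of a million schemas with 64-bit fingerprints has a collision chance of a few in a hundred million, and each longer fingerprint reduces that chance by many orders of magnitude.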
Rabin Fingerprints
The Python example that follows generates a fingerprint using the Rabin fingerprint algorithm.
Note: We will be using the deprecated avro-python3 package for this example, as the mainstream avro package does not have normalization and fingerprinting abilities yet. We will only be using the schemanormalization module from the avro-python3 package and will depend on the regular avro package for the rest of the functionality.
To start, install the avro-python3 package:
> pip install avro-python3
Let us initialize a writer schema with a User record for the purpose of this example:
import json

import avro.schemanormalization
from avro.schema import parse

writer_schema = parse(json.dumps({
    "namespace": "ambitious.app",
    "doc": "A simple user schema",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ],
}))
We will next generate the Parsing Canonical Form of the writer's schema:
writer_schema_canonical = avro.schemanormalization.ToParsingCanonicalForm(writer_schema)
json.loads(writer_schema_canonical)
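You will notice small differences in the canonicalized form: attributes that are irrelevant to readers, like "doc", are removed, and the "namespace" attribute is folded into the record's full name (ambitious.app.User).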
We can then generate a fingerprint of the Canonical Form:
digest = avro.schemanormalization.Fingerprint(writer_schema_canonical, "CRC-64-AVRO")
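For readers curious about what the "CRC-64-AVRO" option computes, the following is a pure-Python port of the 64-bit Rabin fingerprint routine published in the Avro specification. Treat it as an illustrative sketch rather than the library's code: the names `fingerprint64` and `FP_TABLE` mirror the specification's reference code, the function returns a plain integer, and it makes no assumption about the byte order in which a particular library returns its digest.

```python
# Fingerprint of the empty input; also used to build the lookup table.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the effect of each possible byte value on the fingerprint."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # Shift out one bit; fold in EMPTY whenever the dropped bit was 1.
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """Return the 64-bit Rabin fingerprint of data as an unsigned integer."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


# The input would normally be the UTF-8 bytes of a Parsing Canonical Form.
canonical_text = '{"name":"ambitious.app.User","type":"record","fields":[]}'
value = fingerprint64(canonical_text.encode("utf-8"))
print(f"{value:016x}")  # 16 hexadecimal digits, i.e. 64 bits
```

The rest of this walkthrough continues with the digest value returned by the library call.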
Trying to print the digest outputs binary encoded data:
digest
To visualize the fingerprint better, let's transform the binary data into hexadecimal form:
hex_fingerprint = "".join(format(b, "02x") for b in digest).zfill(16)
hex_fingerprint
That is more readable! We can clearly use this fingerprint like a unique identifier to store and retrieve schemas.
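As a rough illustration of that idea, here is a minimal in-memory schema store keyed by fingerprint. Everything in it is hypothetical: the `SchemaStore` class is invented for this sketch, and it uses SHA-256 from the standard library in place of the Rabin digest so that it runs without any Avro package installed.

```python
import hashlib


class SchemaStore:
    """Toy in-memory registry that maps fingerprints to canonical schema text."""

    def __init__(self):
        self._schemas = {}

    def register(self, canonical_form: str) -> str:
        """Store a canonical form and return its hexadecimal fingerprint."""
        fingerprint = hashlib.sha256(canonical_form.encode("utf-8")).hexdigest()
        existing = self._schemas.setdefault(fingerprint, canonical_form)
        if existing != canonical_form:
            # Two different schemas with one fingerprint: refuse to overwrite.
            raise ValueError(f"fingerprint collision for {fingerprint}")
        return fingerprint

    def lookup(self, fingerprint: str) -> str:
        """Return the canonical form registered under a fingerprint."""
        return self._schemas[fingerprint]

    def __len__(self):
        return len(self._schemas)


store = SchemaStore()
canonical = '{"name":"ambitious.app.User","type":"record","fields":[]}'
key = store.register(canonical)
same_key = store.register(canonical)  # registering again is a no-op
print(key == same_key, len(store))
print(store.lookup(key) == canonical)
```

A writer can then ship only the fingerprint alongside its data, and a reader can use it to fetch the matching schema from the store.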
Any schema that differs from this one only in details that are irrelevant to readers will produce the same fingerprint. Let us experiment to see whether changing the order of attributes within fields results in a different fingerprint:
# Same schema as before, but with "type" listed before "name" in every field
writer_schema_duplicate = parse(json.dumps({
    "namespace": "ambitious.app",
    "doc": "A simple user schema",
    "type": "record",
    "name": "User",
    "fields": [
        {"type": "string", "name": "first_name"},
        {"type": ["string", "null"], "name": "last_name"},
        {"type": "string", "name": "email"},
        {"type": ["int", "null"], "name": "age"},
    ],
}))

# Normalize and fingerprint the duplicate exactly as we did for the original
writer_schema_duplicate_canonical = avro.schemanormalization.ToParsingCanonicalForm(writer_schema_duplicate)
digest_duplicate = avro.schemanormalization.Fingerprint(writer_schema_duplicate_canonical, "CRC-64-AVRO")
hex_fingerprint_duplicate = "".join(format(b, "02x") for b in digest_duplicate).zfill(16)

print("Same Fingerprint!" if hex_fingerprint_duplicate == hex_fingerprint else "Different Fingerprint!")
Analysing the duplicate schema reveals why the same fingerprint is generated: converting it into the Parsing Canonical Form yields exactly the same structure as before!
json.loads(writer_schema_duplicate_canonical)
The Parsing Canonical Form ensures that the irrelevant details of the Avro schema are ignored, and two schemas that are structurally the same always generate the same fingerprint.