Avro Schema Fingerprints · Ambitious Systems

Unlike many other serialization systems, Avro does not embed type information in the encoded data. Instead, it requires that the schema used to write the data is available whenever the data is read. This makes serialization faster and payloads smaller, but it also means that the schemas used for encoding may have to be kept around for as long as the data exists.

There are cases when the schema is packaged along with encoded data. Avro supports a simple object container file format if you are encoding objects in bulk. The file's header includes a metadata attribute that encloses the schema used to encode the objects.

But in many situations, Avro schemas have to be made available separately:

They need to be stored in a centrally accessible location for use by multiple applications
There may be variations of the same schema for optimized data consumption (reader schemas)
Individual data records are serialized and stored for an extended period of time (messages in an Apache Kafka topic, for example)
Reader and writer applications need to negotiate schemas quickly

In these instances, it is beneficial to generate a fingerprint of the schema and use it as the identifier, instead of relying on an explicit marker or version number. A "fingerprint" is a bit string, much shorter than the schema itself, that for practical purposes uniquely identifies the original schema.
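As a rough illustration of the idea, a fingerprint can act as the lookup key in a schema store. The sketch below is a toy under stated assumptions: `SchemaStore` is a hypothetical in-memory class, SHA-256 from the standard library stands in for the fingerprint algorithm, and the schema text is assumed to be normalized already.

```python
import hashlib


class SchemaStore:
    """Hypothetical in-memory schema store keyed by fingerprint."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema_text: str) -> str:
        # Fingerprint the (already normalized) schema text and use it as the key.
        fingerprint = hashlib.sha256(schema_text.encode("utf-8")).hexdigest()
        self._schemas[fingerprint] = schema_text
        return fingerprint

    def lookup(self, fingerprint: str) -> str:
        # Raises KeyError if the fingerprint was never registered.
        return self._schemas[fingerprint]


store = SchemaStore()
schema_text = '{"name":"User","type":"record","fields":[]}'
schema_id = store.register(schema_text)
print(schema_id, store.lookup(schema_id) == schema_text)
```

Registering the same text twice returns the same key, so writers and readers can agree on a schema by exchanging only the fingerprint.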

This section will step through an example of generating a unique fingerprint of an Avro Schema.

Schema Fingerprints

To transform a Writer Schema into a fingerprint, we first need to convert the writer's schema into a Parsing Canonical Form:

Parsing because the transformation strips away parts of the schema that are irrelevant to readers, like "doc" attributes.
Canonical Form because the transformations normalize the JSON text in a way that eliminates unimportant differences between schemas.

So if the Parsing Canonical Forms of two schemas are textually equal, then those schemas are considered the "same" as far as a reader is concerned. In the Avro context, fingerprints are generated from these Parsing Canonical Forms, so that schemas that are equivalent for reading share the same fingerprint.
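To make the transformation more concrete, here is a simplified sketch of a few of the normalization rules: stripping irrelevant attributes, ordering the remaining ones, folding the namespace into the name, and removing whitespace. The `normalize` and `to_canonical_json` helpers are hypothetical names written for this illustration; they are not the Avro library's implementation and they skip several rules (for example, resolving short-name references to named types).

```python
import json

# Attributes retained by the canonical form, in their canonical order.
KEEP_IN_ORDER = ("name", "type", "fields", "symbols", "items", "values", "size")
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED_TYPES = {"record", "error", "enum", "fixed"}


def normalize(schema, namespace=None):
    """Return a simplified canonical structure for a schema (illustrative subset only)."""
    if isinstance(schema, str):
        # Primitive name or a reference to a named type; left as written here.
        return schema
    if isinstance(schema, list):
        # Union: normalize each branch, keep the branch order.
        return [normalize(branch, namespace) for branch in schema]

    schema_type = schema["type"]
    if isinstance(schema_type, str) and schema_type in PRIMITIVES:
        # {"type": "string"} collapses to "string".
        return schema_type

    out = dict(schema)
    if schema_type in NAMED_TYPES:
        name = schema["name"]
        if "." in name:
            # The name is already fully qualified.
            namespace = name.rpartition(".")[0]
        else:
            namespace = schema.get("namespace", namespace)
            if namespace:
                name = f"{namespace}.{name}"
        out["name"] = name
    if "fields" in out:
        # Keep only the name and type of each field.
        out["fields"] = [
            {"name": field["name"], "type": normalize(field["type"], namespace)}
            for field in out["fields"]
        ]
    for key in ("items", "values"):
        if key in out:
            out[key] = normalize(out[key], namespace)
    # Drop everything else ("doc", "namespace", "aliases", ...) and fix the order.
    return {key: out[key] for key in KEEP_IN_ORDER if key in out}


def to_canonical_json(schema):
    # Compact separators remove all insignificant whitespace.
    return json.dumps(normalize(schema), separators=(",", ":"), ensure_ascii=False)


schema_a = {
    "namespace": "ambitious.app",
    "doc": "A simple user schema",
    "type": "record",
    "name": "User",
    "fields": [{"name": "email", "type": "string"}],
}
schema_b = {
    "name": "ambitious.app.User",
    "type": "record",
    "fields": [{"type": {"type": "string"}, "name": "email", "doc": "Primary email"}],
}
print(to_canonical_json(schema_a))
print(to_canonical_json(schema_a) == to_canonical_json(schema_b))
```

The two inputs differ in attribute order, documentation, and how the namespace is written, yet they normalize to the same text, which is what makes the canonical form a suitable input for fingerprinting.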

There is a fundamental trade-off between the fingerprint's length and the probability of collisions when designing fingerprinting algorithms. While many fingerprinting algorithms are available, Avro recommends the following three variations that balance the length of the fingerprint with interoperability and ease of implementation:

SHA-256: A 256-bit fingerprint with the lowest probability of collisions.
Rabin (CRC-64-AVRO): A compact 64-bit fingerprint that is expected to stay unique in schema caches of up to a million entries, which is sufficient for most implementations.
MD5: A 128-bit fingerprint that comes in handy when dealing with tens of millions of schemas.
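The trade-off between length and collision probability can be estimated with the standard birthday-bound approximation. The snippet below is a back-of-the-envelope sketch, assuming fingerprints behave like uniformly random bit strings; `collision_probability` is a hypothetical helper, not part of any Avro library.

```python
import math


def collision_probability(num_schemas: int, bits: int) -> float:
    """Approximate chance that at least two of num_schemas fingerprints collide."""
    pairs = num_schemas * (num_schemas - 1) / 2
    # Birthday-bound approximation: 1 - exp(-pairs / 2**bits).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / 2**bits)


for bits in (32, 64, 128, 256):
    probability = collision_probability(1_000_000, bits)
    print(f"{bits:>3}-bit fingerprints, 1,000,000 schemas: {probability:.3g}")
```

Under this approximation, a cache of a million schemas has a collision probability of roughly 3 in 100 million with 64-bit fingerprints, while 32-bit fingerprints would almost certainly collide.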

Rabin Fingerprints

The Python example that follows generates a fingerprint using the Rabin fingerprint algorithm.

Note: We will be using the deprecated avro-python3 package for this example, as the mainstream avro package does not have normalization and fingerprinting abilities yet. We will only be using the schemanormalization module from the avro-python3 package and will depend on the regular avro package for the rest of the functionality.

To start, install the avro-python3 package:

> pip install avro-python3

Collecting avro-python3
  Downloading avro-python3-1.10.2.tar.gz (38 kB)
Building wheels for collected packages: avro-python3
  Building wheel for avro-python3 (setup.py) ... done
  Created wheel for avro-python3: filename=avro_python3-1.10.2-py3-none-any.whl size=44009 sha256=391618afb1d882f646b284754662c1a5f3cc7227d27f97761558e0d666cc1e98
  Stored in directory: /Users/subhashb/Library/Caches/pip/wheels/5a/29/4d/510c0e098c49c5e49519f430481a5425e60b8752682d7b1e55
Successfully built avro-python3
Installing collected packages: avro-python3
Successfully installed avro-python3-1.10.2

Let us initialize a writer schema with a User record for the purpose of this example:

import json

import avro.schemanormalization
from avro.schema import parse

writer_schema = parse(json.dumps({
    "namespace": "ambitious.app",
    "doc": "A simple user schema",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name",  "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ]
}))

We will next generate the Parsing Canonical Form of the writer's schema:

writer_schema_canonical = avro.schemanormalization.ToParsingCanonicalForm(writer_schema)

json.loads(writer_schema_canonical)

{'name': 'ambitious.app.User',
 'type': 'record',
 'fields': [{'name': 'first_name', 'type': 'string'},
  {'name': 'last_name', 'type': ['string', 'null']},
  {'name': 'email', 'type': 'string'},
  {'name': 'age', 'type': ['int', 'null']}]}

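You will notice small differences in the canonicalized form: attributes that are irrelevant to readers, like "doc", are removed, and the "namespace" is folded into the record's fully qualified name, ambitious.app.User.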

We can then generate a fingerprint of the Canonical Form:

digest = avro.schemanormalization.Fingerprint(writer_schema_canonical, "CRC-64-AVRO")

Printing the digest shows the raw bytes of the fingerprint:

digest

b'6\xf2\xfc\xd4\xf0=\xbd1'

To visualize the fingerprint better, let's transform the binary data into hexadecimal form:

hex_fingerprint = digest.hex()

hex_fingerprint

'36f2fcd4f03dbd31'

That is more readable! This fingerprint can now serve as a compact identifier for storing and retrieving schemas.
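For readers curious about what CRC-64-AVRO computes, the following is a pure-Python port of the reference algorithm published in the Avro specification. It is a sketch for illustration, not the code used by avro-python3: the function name `fingerprint64` mirrors the specification's Java example, and the result is returned as an integer, so the byte order a library uses when serializing it is not reproduced here.

```python
# Fingerprint of the empty input, as defined in the Avro specification.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the 256-entry lookup table used by the byte-wise update."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # Shift right by one bit; XOR in EMPTY when the dropped bit was set.
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """Return the 64-bit Rabin fingerprint of data as an unsigned integer."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


canonical = '{"name":"ambitious.app.User","type":"record","fields":[]}'
print(format(fingerprint64(canonical.encode("utf-8")), "016x"))
```

The input should be the UTF-8 encoding of the Parsing Canonical Form; the short `canonical` string above is only a stand-in for it.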

Any schema that differs from this one only in details that the Parsing Canonical Form ignores will produce the same fingerprint. Let us experiment to see whether changing the order of attributes within fields results in a different fingerprint:

writer_schema_duplicate = parse(json.dumps({
    "namespace": "ambitious.app",
    "doc": "A simple user schema",
    "type": "record",
    "name": "User",
    "fields": [
        {"type": "string", "name": "first_name"},
        {"type": ["string", "null"], "name": "last_name"},
        {"type": "string", "name": "email"},
        {"type": ["int", "null"], "name": "age"},
    ]
}))

writer_schema_duplicate_canonical = avro.schemanormalization.ToParsingCanonicalForm(writer_schema_duplicate)

json.loads(writer_schema_duplicate_canonical)

digest_duplicate = avro.schemanormalization.Fingerprint(writer_schema_duplicate_canonical, "CRC-64-AVRO")
hex_fingerprint_duplicate = digest_duplicate.hex()

print("Same Fingerprint!" if hex_fingerprint_duplicate == hex_fingerprint else "Different Fingerprint!")

Same Fingerprint!

Inspecting the duplicate schema reveals why the same fingerprint is generated: converting it into the Parsing Canonical Form yields exactly the same structure as before.

json.loads(writer_schema_duplicate_canonical)

{'name': 'ambitious.app.User',
 'type': 'record',
 'fields': [{'name': 'first_name', 'type': 'string'},
  {'name': 'last_name', 'type': ['string', 'null']},
  {'name': 'email', 'type': 'string'},
  {'name': 'age', 'type': ['int', 'null']}]}

The Parsing Canonical Form ensures that the irrelevant details of the Avro schema are ignored, and two schemas that are structurally the same always generate the same fingerprint.
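The same canonical text can also be fed to the other two recommended algorithms. The sketch below uses only Python's standard `hashlib` module, and it assumes that the canonical form is the whitespace-free JSON text corresponding to the structure printed earlier. It also shows that a structural change, such as adding a hypothetical "phone" field, produces a different fingerprint.

```python
import hashlib
import json

canonical_schema = {
    "name": "ambitious.app.User",
    "type": "record",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": ["string", "null"]},
        {"name": "email", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
    ],
}


def canonical_bytes(schema: dict) -> bytes:
    # Assumption: the canonical text is compact JSON with keys in the order given.
    return json.dumps(schema, separators=(",", ":")).encode("utf-8")


original = canonical_bytes(canonical_schema)
sha256_fingerprint = hashlib.sha256(original).hexdigest()  # 256 bits
md5_fingerprint = hashlib.md5(original).hexdigest()        # 128 bits

# A structurally different schema: one extra (hypothetical) field.
changed_schema = dict(canonical_schema)
changed_schema["fields"] = canonical_schema["fields"] + [{"name": "phone", "type": "string"}]
changed_fingerprint = hashlib.sha256(canonical_bytes(changed_schema)).hexdigest()

print(sha256_fingerprint)
print(md5_fingerprint)
print("Different Fingerprint!" if changed_fingerprint != sha256_fingerprint else "Same Fingerprint!")
```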