Binary encoding based on schemas is better than textual data formats


Last Updated on Jan 15, 2021

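Encoding libraries built into a programming language (such as Java's Serializable or Python's pickle) do not scale well for system integration. They are often slow, tie the data to one programming language, handle versioning poorly, and can introduce security problems because decoding may instantiate arbitrary classes.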

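Textual data formats like JSON, XML, and CSV are easier to understand and are used extensively for integrating systems. But these standardized encodings have weaknesses of their own. They are ambiguous about numbers (XML and CSV cannot distinguish a number from a string of digits, and JSON does not distinguish integers from floating-point values or specify a precision), they have no native support for binary strings, and schema support is weak in practice: CSV has no schema at all, and the optional schema languages for XML and JSON are complicated enough that many applications skip them, leaving validation rules unspecified.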
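
The number and binary-string problems can be illustrated with a minimal Python sketch that uses only the standard library. The variable names and the sample payload are assumptions made for illustration. The conversion to float stands in for any JSON parser that stores numbers as IEEE 754 doubles, as JavaScript does.

```python
import base64
import json

# JSON has a single number type. A parser that maps it to an IEEE 754
# double cannot represent every integer above 2**53 exactly.
big = 2**53 + 1
as_double = float(big)  # what a double-based parser would keep
precision_lost = int(as_double) != big

# JSON strings are Unicode text, so raw bytes cannot be encoded directly.
payload = bytes([0x00, 0xFF, 0x10, 0x80, 0x7F, 0x01])  # sample binary data
try:
    json.dumps({"blob": payload})
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

# The usual workaround is Base64, which grows the data by about a third
# and requires both sides to agree on the convention out of band.
encoded = base64.b64encode(payload).decode("ascii")
document = json.dumps({"blob": encoded})
round_tripped = base64.b64decode(json.loads(document)["blob"])
```

Nothing in the JSON document itself says that "blob" holds Base64 data; that knowledge lives only in the applications on either side.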

Moreover, applications that use JSON or XML seldom catalog their data structures for future reference. JSON and XML are generally treated as data transfer formats, not as schemas.

Binary encoding formats based on schemas like Thrift, Protocol Buffers, and Avro have emerged as good alternatives to JSON and XML. In particular, schemas bring a whole range of other advantages.

Schemas provide better data guarantees and tooling.

  • The schema languages of Thrift, Protocol Buffers, and Avro are much simpler than XML Schema or JSON Schema. Even so, they can express complex data types like arrays and dictionaries, and field markers like optional values and defaults.
  • Schemas can be generated dynamically from existing data definitions. Apache Avro, for example, is well suited to generating a schema from a relational database's table definitions.
  • Schemas can be a valuable form of documentation. They are guaranteed to be up to date because they are actively used for encoding/decoding data.
  • Schemas can evolve, and tooling can check the forward and backward compatibility of a schema change before anything is deployed.
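
The compatibility guarantees in the list above can be sketched with a toy tag-length-value encoding. This is not the actual wire format of Thrift, Protocol Buffers, or Avro; it only borrows the idea of numbered field tags. The schemas, field names, and the `encode`/`decode` helpers are hypothetical.

```python
import struct

# Hypothetical schemas: field tag -> (field name, type, default value).
SCHEMA_V1 = {
    1: ("user_name", str, ""),
    2: ("favorite_number", int, 0),
}
SCHEMA_V2 = {
    **SCHEMA_V1,
    3: ("email", str, "unknown"),  # new field, added with a default
}


def encode(record: dict, schema: dict) -> bytes:
    """Encode a record as a sequence of (tag, length, value) entries."""
    out = bytearray()
    for tag, (name, ftype, _default) in schema.items():
        if name not in record:
            continue  # absent fields are simply omitted
        value = record[name]
        raw = value.encode("utf-8") if ftype is str else struct.pack(">q", value)
        out += struct.pack(">BH", tag, len(raw)) + raw
    return bytes(out)


def decode(data: bytes, schema: dict) -> dict:
    """Decode with the reader's schema, skipping tags it does not know."""
    record = {name: default for name, _type, default in schema.values()}
    pos = 0
    while pos < len(data):
        tag, length = struct.unpack_from(">BH", data, pos)
        pos += 3  # one tag byte plus two length bytes
        raw = data[pos:pos + length]
        pos += length
        if tag not in schema:
            continue  # field written by newer code: skip it
        name, ftype, _default = schema[tag]
        record[name] = raw.decode("utf-8") if ftype is str else struct.unpack(">q", raw)[0]
    return record


new_bytes = encode(
    {"user_name": "Martin", "favorite_number": 1337, "email": "m@example.com"},
    SCHEMA_V2,
)
old_bytes = encode({"user_name": "Martin", "favorite_number": 1337}, SCHEMA_V1)

old_reader_view = decode(new_bytes, SCHEMA_V1)  # forward compatibility
new_reader_view = decode(old_bytes, SCHEMA_V2)  # backward compatibility
```

Old code reading new data ignores the tag it does not recognize, and new code reading old data fills in the default. A field can be renamed freely because only its tag is written, but changing a tag or adding a field without a default would break one of the two directions.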

Binary encodings based on schemas have emerged as a viable option for data formats.

The idea of binary encoding data is not new. Most of the newer encodings have a lot in common with ASN.1, a schema definition language first standardized in 1984. Network protocols have used ASN.1 extensively, and its binary encoding (DER) is still used to encode SSL certificates (X.509).

More commonly, many data systems (like relational databases) use a proprietary binary encoding for their data, for example in their network protocols. JSON and XML also have many binary variants (such as MessagePack, BSON, and WBXML), but none has been adopted as widely as the textual formats.

Combining the idea of schema evolution with binary encoding seems to be particularly beneficial, and that is where Thrift, Protocol Buffers, and Avro shine.

Data encoded by these libraries is more compact than that produced by the binary variants of JSON and XML because field names are omitted from the encoded data. All three can also generate code from the schema for statically typed languages, enabling type checking at compile time.
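
The effect of omitting field names can be seen in a rough size comparison. The sketch below assumes a hypothetical record and a hand-written encoder that writes values in a fixed, schema-defined order. It imitates the general approach, not the real byte layout of Thrift, Protocol Buffers, or Avro, and it compares against textual JSON for simplicity (binary JSON variants save some bytes but still carry the field names).

```python
import json
import struct

# Hypothetical record used only for this comparison.
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}


def put_str(text: str) -> bytes:
    """Length-prefixed UTF-8 string (2-byte big-endian length)."""
    raw = text.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw


def encode_with_schema(rec: dict) -> bytes:
    """Write values in schema order; no field names appear in the output."""
    out = put_str(rec["userName"])
    out += struct.pack(">q", rec["favoriteNumber"])  # 64-bit integer
    out += struct.pack(">H", len(rec["interests"]))  # array length
    for item in rec["interests"]:
        out += put_str(item)
    return out


json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")
binary_bytes = encode_with_schema(record)
print(len(json_bytes), len(binary_bytes))
```

The binary form is smaller mainly because the reader already knows from the schema which field comes next, so the names never have to travel with the data. Real libraries shrink it further with variable-length integers.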

In short, schema evolution offers the same kind of flexibility that schemaless (schema-on-read) JSON databases give large applications, while also providing better guarantees about the data and better tooling.


© 2022 Ambitious Systems. All Rights Reserved.