Standardized encodings do not accurately represent schemas


Last Updated on Jan 22, 2021

Encoding libraries built into a programming language do not scale well for system integrations. To integrate multiple systems or participate in a multi-application environment (which is what most businesses need nowadays), the byte sequences shared over the wire must be usable by every application involved.

Enter standardized encodings like JSON, XML, and CSV. They are textual formats, so they are somewhat human-readable. These encodings also tend to pack data and its structural information together, which makes them extremely useful: the structure is embedded within the message and does not have to be maintained separately.

But these standardized encodings cannot fully represent the data schema.

While JSON and XML have optional schema support, the schema languages are complicated to learn and implement, so many applications do not bother using them. Since the correct interpretation of data depends on the schema, applications that skip it have to hardcode the encoding/decoding logic. JSON also lacks namespaces, a feature (available in XML) that lets applications combine data from separate systems without name clashes.
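To make the hardcoding point concrete, here is a minimal Python sketch. The `Order` type, its field names, and the sample payload are hypothetical, invented only for illustration. The JSON message carries an amount and a date as plain strings, and nothing in the message says how to interpret them, so that knowledge ends up living in the decoding function.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class Order:
    # Hypothetical record type, used only for this illustration.
    order_id: int
    amount: Decimal
    placed_on: date


def decode_order(payload: str) -> Order:
    """Decode a JSON order using rules that exist only in this function."""
    raw = json.loads(payload)
    return Order(
        order_id=int(raw["order_id"]),
        # Nothing in the message says this string is a decimal amount.
        amount=Decimal(raw["amount"]),
        # Nothing in the message says this string is an ISO 8601 date.
        placed_on=date.fromisoformat(raw["placed_on"]),
    )


payload = '{"order_id": 42, "amount": "19.99", "placed_on": "2021-01-22"}'
order = decode_order(payload)
print(order)
```

Every other application that reads this message would need an equivalent function, and all of them would have to change together if the meaning of a field changed.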

There is also a lot of ambiguity around encoding numbers. XML and CSV cannot distinguish between a number and a string that happens to consist of digits, while JSON can. But JSON does not distinguish integers from floating-point numbers, and it does not specify a precision. This is a significant problem for applications dealing with large numbers: integers greater than 2^53 cannot be represented exactly as IEEE 754 double-precision floating-point numbers, so they become inaccurate when parsed in a language that stores all numbers as doubles (such as JavaScript).
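The large-number problem can be reproduced in a few lines. Python's own `json` module parses integers exactly, so this sketch passes `parse_int=float` to imitate a parser that stores every number as a double; that substitution is an assumption made for the demonstration, not how any particular system is configured.

```python
import json

big = 2**53 + 1  # 9007199254740993, one past the exactly representable range
text = json.dumps({"id": big})

# Python's default parser keeps arbitrary-precision integers.
exact = json.loads(text)

# Imitate a double-only parser by routing every integer through float.
as_double = json.loads(text, parse_int=float)

print(exact["id"])           # 9007199254740993
print(int(as_double["id"]))  # 9007199254740992
print(exact["id"] == int(as_double["id"]))  # False
```

The same bytes on the wire yield two different identifiers depending on the parser, which is why some APIs transmit large IDs as strings.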

Standardized encoding formats also lack support for binary strings (sequences of bytes without a character encoding). Applications can get around this limitation by encoding the binary data as text using Base64 and indicating in the schema that the value is Base64-encoded. This works, but it increases the data size by about 33% and needs careful handling in application code.
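The Base64 workaround and its size cost can be sketched as follows. The field name `data_base64` is a hypothetical convention chosen for this example; the receiver still has to know, from outside the message, that the string must be decoded.

```python
import base64
import json

blob = bytes(range(256)) + bytes(44)  # 300 bytes of arbitrary binary data

# Base64 maps every 3 input bytes to 4 ASCII characters.
encoded = base64.b64encode(blob).decode("ascii")
message = json.dumps({"data_base64": encoded})

# The receiver must know by convention that this field is Base64 text.
decoded = base64.b64decode(json.loads(message)["data_base64"])

print(len(blob), len(encoded))  # 300 400
print(decoded == blob)          # True
```

Four characters for every three bytes is where the roughly 33% overhead comes from.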

CSV does not support any schema, so it is left to the application to define the meaning of each row and column. It serves well as a lowest-common-denominator encoding format, but it can be hard to maintain as the application schema evolves.
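A short sketch of what "left to the application" means in practice, using Python's standard `csv` module. The column names, sample rows, and the `COLUMN_TYPES` mapping are hypothetical. Every value arrives as a string, and even treating the first row as a header is a convention the reader has to assume.

```python
import csv
import io

data = "id,name,price\n1,widget,9.99\n2,00123,4.50\n"

rows = list(csv.reader(io.StringIO(data)))
header, records = rows[0], rows[1:]
print(records[0])  # ['1', 'widget', '9.99']

# The application, not the file, decides what each column means.
COLUMN_TYPES = {"id": int, "name": str, "price": float}

typed = [
    {col: COLUMN_TYPES[col](value) for col, value in zip(header, record)}
    for record in records
]
print(typed[1])  # {'id': 2, 'name': '00123', 'price': 4.5}
```

The name `00123` survives only because the mapping says that column is a string; a reader that guessed types from the content would likely turn it into the number 123. If a column is added or reordered, every such mapping in every consuming application has to be updated by hand.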

Despite these shortcomings, JSON, XML, and CSV have proven valuable for many purposes, especially as data interchange formats. Efficiency and performance matter for an encoding format, but the most significant advantage of these three is that they get disparate, remotely located teams and applications to talk to each other.


© 2022 Ambitious Systems. All Rights Reserved.