Built-in encoding libraries do not scale well for system integrations
Applications seldom work in isolation these days. They are part of a network of systems that work in tandem to solve business problems.
These systems often share data over the wire to coordinate with each other. For reliable and accurate transfer, that data is best represented as a sequence of bytes. The bytes could simply mirror the data held in memory, in objects, structs, lists, arrays, hash tables, or trees, or they could be a custom representation designed specifically for communication.
Many programming languages ship with built-in libraries for encoding in-memory objects into byte sequences: Java has java.io.Serializable, Ruby has Marshal, Python has pickle, and so on.
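To make the convenience concrete, here is a minimal sketch using Python's pickle; the `session` dictionary is an illustrative payload, not from the original text.

```python
import pickle

# An arbitrary in-memory structure (illustrative example).
session = {"user_id": 42, "roles": ["admin", "editor"]}

blob = pickle.dumps(session)   # in-memory object -> byte sequence
restored = pickle.loads(blob)  # byte sequence -> equivalent object

print(restored == session)  # True: a faithful round trip in two calls
```

Two function calls and no schema definition: this is exactly the low-effort appeal the rest of this section argues against relying on.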
These libraries are convenient and require minimal code. They come packaged with the standard library, and the language implementation itself uses them extensively.
But inherent problems in how they are structured and used mean these libraries cannot reliably encode and transfer data to other systems.
The encoded data is tied to the specific programming language and its encoding implementation. Reading this data in applications built in other languages is difficult, which effectively means that every application on the network must use the same programming language. Integrating systems built on disparate technologies becomes complicated.
Also, byte sequences are usually decoded back into the original object types, so the object definitions must be shared across applications as well, creating dependencies between them. A simple change to an object's attribute structure ripples across systems.
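This coupling can be sketched with pickle: decoding fails outright unless the receiver has the exact class definition on hand. The `Order` class is a hypothetical example, and deleting it simulates a receiving application that lacks the sender's code.

```python
import pickle

class Order:  # illustrative class shared by sender and receiver
    def __init__(self, sku):
        self.sku = sku

blob = pickle.dumps(Order("ABC-123"))

# Simulate a receiver that does not have the Order class defined.
del Order

try:
    pickle.loads(blob)
    decoded = True
except AttributeError:
    # pickle cannot find the class it needs to reconstruct the object
    decoded = False

print("decoded:", decoded)
```

The byte stream records the class's module and name, so every consumer must carry a matching definition, and must keep it in sync forever.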
The ability to construct objects dynamically from a byte sequence is also a security risk. If an application decodes untrusted input, bad actors can instantiate arbitrary classes, allowing them to execute arbitrary code on the remote server.
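A safe sketch of the mechanism behind such attacks: pickle lets a class dictate, via `__reduce__`, which callable runs during decoding. A real exploit would name something like `os.system`; this example substitutes the harmless built-in `list` to show that the callable runs, and the `Payload` class is purely illustrative.

```python
import pickle

class Payload:
    # __reduce__ tells pickle how to rebuild the object on the decoding
    # side: call the given callable with the given arguments. A malicious
    # byte stream can name ANY importable callable here.
    def __reduce__(self):
        return (list, ((1, 2, 3),))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # decoding runs list((1, 2, 3))

print(type(result), result)  # note: not a Payload at all
```

The decoder executed a callable chosen by whoever produced the bytes. This is why the pickle documentation warns to never unpickle data from an untrusted source.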
These libraries are also not well equipped to handle versioning or backward and forward compatibility. They are built for quick and easy encoding of data, not for the intricacies of handling data model changes over time.
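The versioning problem can be sketched in a few lines, assuming a hypothetical `User` class whose attribute is renamed between releases. Pickle restores the old attribute dictionary verbatim against the new class, silently violating the new code's expectations.

```python
import pickle

class User:  # version 1 of the data model
    def __init__(self, name):
        self.name = name

blob = pickle.dumps(User("Ada"))  # bytes written by the old release

class User:  # version 2: the attribute was renamed
    def __init__(self, full_name):
        self.full_name = full_name

restored = pickle.loads(blob)
# The old __dict__ is restored as-is; __init__ is never called, so
# the object has `name` but not the `full_name` the new code expects.
print(hasattr(restored, "full_name"))  # False
print(restored.name)                   # "Ada"
```

There is no notion of a schema version in the byte stream, so neither side can detect, let alone negotiate, the mismatch.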
Another significant issue is that these libraries are not built for speed. They often encode and decode data inefficiently and bloat the size of the encoded output.
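As a rough illustration of output size, the sketch below compares pickle against UTF-8 JSON for one small record. The record is invented for the example, and results vary with payload shape and pickle protocol; for small string-keyed records like this one, the pickle framing and memo opcodes tend to make its output the larger of the two.

```python
import json
import pickle

record = {"id": 1, "name": "Ada", "active": True}

pickled = pickle.dumps(record)
encoded_json = json.dumps(record).encode("utf-8")

print("pickle bytes:", len(pickled))
print("json bytes:  ", len(encoded_json))
```

A one-off measurement like this is not a benchmark, but it shows that the "binary" built-in format is not automatically compact.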
Considering these problems, it is generally a bad idea to use built-in libraries for data encoding. Applications can still use them for transient and straightforward purposes, but the risks far outweigh the rewards. It is better to use standardized encodings like JSON, XML, and CSV instead.
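For contrast, a standardized encoding sidesteps the coupling problems above: the sketch below round-trips an illustrative record through JSON, and the intermediate text can be parsed by any JSON library in any language, with no shared class definitions and no code execution on decode.

```python
import json

event = {"order_id": 7, "status": "shipped", "items": ["book", "pen"]}

text = json.dumps(event)     # plain, language-neutral text
restored = json.loads(text)  # decoding only ever builds basic types

print(text)
print(restored == event)
```

The trade-off is that only basic types (objects, arrays, strings, numbers, booleans) survive the trip, which is precisely what makes the format safe and portable.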