System Design: Encoding and Evolution

Modern applications are constantly being updated, and as the code changes, the data it stores and reads changes with it. This creates the need for both backward compatibility (newer code can read data written by older code) and forward compatibility (older code can read data written by newer code), so that newer and older code can coexist without service downtime.
Image source: Google Gemini
Programs work with data in two main representations. In memory, data is kept as objects, hash tables, lists, arrays, structs, trees, and so on, whereas data written to a file or sent over a network is encoded as a self-contained sequence of bytes, most commonly JSON. Translating from the in-memory representation to a byte sequence is referred to as encoding, serialisation, or marshalling, and the reverse is decoding, deserialisation, or unmarshalling.
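
As a minimal sketch (in Python, with a made-up user record), this is what that round trip looks like with JSON:

```python
import json

# In-memory representation: a plain Python dict holding objects and lists.
user = {"userName": "Martin", "interests": ["daydreaming", "hacking"]}

# Encoding (serialisation / marshalling): in-memory object -> byte sequence.
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialisation / unmarshalling): byte sequence -> in-memory object.
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == user
```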

There are also language-specific encoding formats: Java has java.io.Serializable, Ruby has Marshal, Python has pickle, and so on. These allow in-memory objects to be saved and restored with minimal additional code. However, they are tied to a single programming language, they are usually not very efficient, and the decoding process needs to be able to instantiate arbitrary classes, which is a security concern.
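
For example, a rough sketch of the pickle behaviour described above (the Session class here is just an illustrative stand-in):

```python
import pickle

class Session:
    def __init__(self, user_id, roles):
        self.user_id = user_id
        self.roles = roles

# Saving an in-memory object takes almost no extra code...
blob = pickle.dumps(Session(42, ["admin"]))

# ...but the bytes only make sense to Python, and unpickling untrusted data
# is dangerous because the decoder can be made to instantiate arbitrary classes.
restored = pickle.loads(blob)
print(restored.user_id, restored.roles)
```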

The other most commonly used formats are JSON, XML, and CSV. These are more popular because they are language-independent, but they present their own challenges, such as ambiguity in how numbers are encoded (large integers and floating-point precision) and the lack of built-in support for binary strings.
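
Two of those JSON pitfalls, sketched in Python (the field names are hypothetical):

```python
import json, base64

# Large integers: parsers that treat JSON numbers as IEEE 754 doubles
# (JavaScript, for example) silently lose precision above 2**53,
# so large IDs are commonly sent as strings instead.
big_id = 2**53 + 1
print(json.dumps({"id": big_id}))       # may be read back inaccurately elsewhere
print(json.dumps({"id": str(big_id)}))  # common workaround

# Binary strings: JSON has no bytes type, so raw bytes are usually
# Base64-encoded, which inflates the payload by roughly a third.
photo = b"\x89PNG\r\n\x1a\n..."
print(json.dumps({"photo": base64.b64encode(photo).decode("ascii")}))
```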

To cope with huge datasets, typically terabytes or petabytes of data, binary encodings of these formats were introduced to shrink the byte sequences; as a bonus, they can extend the set of datatypes that can be encoded. The trade-off is that binary encodings are not human-readable.
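
As an illustration, assuming the third-party msgpack package (MessagePack is one binary encoding of the JSON data model), the same record gets smaller but can no longer be read by eye:

```python
import json
import msgpack  # third-party binary encoding of the JSON data model

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

as_json = json.dumps(record).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json), "bytes of JSON text")       # human readable
print(len(as_msgpack), "bytes of MessagePack")  # smaller, but not human readable
```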

Down the line, around 2007 to 2008, Google and Facebook (now Meta) open-sourced two schema-based encoding formats: Protocol Buffers and Apache Thrift. Both require a schema for any data that is encoded, and each comes with a code-generation tool that produces classes implementing the schema in various programming languages. On the book's example record, these formats reduce the byte sequence to 34 bytes (Thrift's CompactProtocol) and 33 bytes (Protocol Buffers). For backward and forward compatibility, Thrift and Protocol Buffers use field tags: numbers that act like aliases for field names in the encoded data. Once a schema is in use, existing field tags must never be changed, and new fields can only be added if they are optional or have a default value, ensuring that old code can read data written by new code and vice versa.
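
To make the field-tag idea concrete, here is a simplified sketch of a Protocol-Buffers-style wire format in Python. It is not the real protobuf library, and the record and tag numbers are made up, but it shows that the encoded bytes carry tag numbers and types rather than field names:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a variable-length integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def string_field(tag: int, value: str) -> bytes:
    """Length-delimited field: key = (tag << 3) | wire type 2."""
    data = value.encode("utf-8")
    return encode_varint((tag << 3) | 2) + encode_varint(len(data)) + data

def int_field(tag: int, value: int) -> bytes:
    """Varint field: key = (tag << 3) | wire type 0."""
    return encode_varint((tag << 3) | 0) + encode_varint(value)

# The record is identified purely by tags 1, 2, 3 (never by field names),
# which is why tags must not be reused or renumbered once the schema is in use.
record = string_field(1, "Martin") + int_field(2, 1337) + string_field(3, "hacking")
print(len(record), "bytes:", record.hex())
```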

Another encoding format is Apache Avro, started in 2009 as a subproject of Hadoop (if you want to read more about Hadoop, link here). It has two schema languages that specify the structure of the data to be encoded: Avro IDL, intended for human editing, and one based on JSON, which is more machine-readable. Avro encodes the book's example record in 32 bytes. There are no numbers (field tags); the encoding simply consists of the values concatenated together. Since data encoded with Avro has to be read in a particular order to make sense of it, the writer and the reader need to have compatible schemas, and Avro translates the data from the writer's schema into the reader's schema. With Avro, forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader. Conversely, backward compatibility means that you can have a new version of the schema as reader and an old version as writer.
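
A small sketch of writer/reader schema resolution, assuming the third-party fastavro library (the Person record and its fields are made up for illustration):

```python
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Old (writer's) schema: the version the data was encoded with.
writer_schema = parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin", "favoriteNumber": 1337})

# New (reader's) schema: adds an "interests" field with a default value.
reader_schema = parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}, "default": []},
    ],
})

buf.seek(0)
# Avro translates from the writer's schema to the reader's schema,
# filling in the default for the field the old writer did not know about.
print(schemaless_reader(buf, writer_schema, reader_schema))
```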

Now, having analysed the various encoding formats, the next question is how the data actually flows through the system. There are three main ways data flows through a system: the first is through databases, the second is through service calls, and the last is through asynchronous message passing.

With databases, the process that writes to the database encodes the data, and the process that reads from the database decodes it. Given that data can be written at any point in time, possibly by several processes running in parallel, there is the danger of old code overwriting and losing fields that were added by newer code. Schema evolution works around this by allowing the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with various historical versions of the schema.
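
A tiny sketch of that danger, using JSON and made-up field names: old code that does not know about a newer field reads a record, updates it, and writes it back without the field.

```python
import json

# A row written by NEW code, which added a "photoUrl" field.
stored = json.dumps({"id": 7, "name": "Ada", "photoUrl": "https://example.com/ada.png"})

# OLD code only models the fields it knows about...
class OldUser:
    def __init__(self, user_id, name):
        self.user_id, self.name = user_id, name

raw = json.loads(stored)
user = OldUser(raw["id"], raw["name"])   # "photoUrl" is silently dropped here
user.name = "Ada Lovelace"

# ...so when it writes the record back, the newer field is lost.
print(json.dumps({"id": user.user_id, "name": user.name}))
```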

In computer networks, there are two main roles: clients and servers. A server exposes an API over the network, and clients connect to the server to make requests to that API. The API exposed by the server is known as a service. Encoded data typically flows through these services via REST or RPC, where one process sends a request over the network to another process and expects a response as quickly as possible.
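
For instance, a REST-style call is just an HTTP request carrying encoded data; the endpoint below is hypothetical:

```python
import json
from urllib import request

# The client encodes its request body as JSON and decodes the JSON response.
req = request.Request(
    "https://api.example.com/v1/orders",   # hypothetical service endpoint
    data=json.dumps({"item": "book", "quantity": 1}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))
```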

Asynchronous message-passing systems sit somewhere between RPC and databases. They are similar to RPC in that a client's request (a message) is delivered to another process with low latency. They are similar to databases in that the message is not sent via a direct network connection, but goes via an intermediary called a message broker, which stores the message temporarily. The broker acts as a buffer if the recipient is unavailable or overloaded, it can automatically redeliver messages to a process that has crashed so that messages are not lost, and it decouples the sender from the recipient, all of which improves system reliability. Message brokers typically don't enforce any particular data model, so you can use any encoding format. If the encoding is backward and forward compatible, you have the greatest flexibility to change publishers and consumers independently and deploy them in any order.
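
As a sketch, here is a producer publishing a JSON-encoded message to a RabbitMQ broker via the third-party pika client (the queue name and message are made up); the consumer is a separate process that the producer never talks to directly:

```python
import json
import pika  # third-party RabbitMQ client, assumed to be installed

# The producer only knows the broker and a queue name, not the consumer.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="user_events")

# The broker does not care about the encoding; JSON is used here, but any
# backward/forward-compatible format would do.
message = json.dumps({"event": "signup", "userName": "Martin"}).encode("utf-8")
channel.basic_publish(exchange="", routing_key="user_events", body=message)
connection.close()
```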

Reference

Kleppmann, Martin. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media, 2017.

 

This blog post is a summary of my personal notes and understanding from reading "Designing Data-Intensive Applications" by Martin Kleppmann. All credit for the original ideas belongs to the author. 

 
