Protocol buffers may be the next data serialization game changer

What is data serialization and why should I be concerned?

Raw data as such is as good as no data at all. The only way we make sense of data is by putting it into a legible form such as a container, file format or data structure. However, as the size of the data grows, the memory overhead becomes a performance bottleneck. In such cases an efficient mechanism is to map the data to a byte array. When using byte arrays, an optimized serialization mechanism is critical to actually reducing memory consumption. The byte array is opaque to the core system.

Serialization is the conversion of an object to a series of bytes so that the object can be easily saved to persistent storage or streamed across a communication link. The byte stream can then be deserialized, that is, converted back into a replica of the original object. We need a serialization scheme that is deterministic across executions of a function, across platforms, and across versions of the serialization framework.
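
The round trip described above can be sketched in a few lines of Python. This is only an illustration using the standard library's JSON encoder, not Protobuf itself; the Account class and its values are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Account:
    """Hypothetical record used only for this illustration."""
    public_key: str
    acc_no: int
    acc_name: str


def serialize(account: Account) -> bytes:
    # Object -> dict -> JSON text -> bytes.
    return json.dumps(asdict(account)).encode("utf-8")


def deserialize(data: bytes) -> Account:
    # Bytes -> JSON text -> dict -> a new, equal Account object.
    return Account(**json.loads(data.decode("utf-8")))


original = Account(public_key="3123", acc_no=42, acc_name="alice")
payload = serialize(original)   # could now be written to disk or sent over a socket
replica = deserialize(payload)

print(type(payload).__name__, len(payload))
print(replica == original)
```

The replica is a separate object that compares equal to the original, which is all that deserialization promises.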

Data structures that don’t enforce ordered serialization (e.g. sets, maps, dicts) should be avoided. The requirement is to consistently produce the same byte array across space and time. This is essential in cases where byte arrays are linked together into a tree-like structure.
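
To see why ordering matters, here is a small sketch, again with the standard library rather than Protobuf. Two dicts hold the same content but were built in a different order; the record contents and the use of SHA-256 as the "link" between byte arrays are assumptions made for the example.

```python
import hashlib
import json

# Same logical content, different insertion order.
record_a = {"acc_no": 42, "acc_name": "alice"}
record_b = {"acc_name": "alice", "acc_no": 42}


def naive_bytes(record: dict) -> bytes:
    # Follows the dict's insertion order, so output depends on how it was built.
    return json.dumps(record).encode("utf-8")


def canonical_bytes(record: dict) -> bytes:
    # Sorted keys and fixed separators give one byte string per logical value.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest(data: bytes) -> str:
    # A hash over the bytes, as a tree of linked byte arrays might use.
    return hashlib.sha256(data).hexdigest()


print(record_a == record_b)                                    # equal as dicts
print(naive_bytes(record_a) == naive_bytes(record_b))          # bytes differ
print(canonical_bytes(record_a) == canonical_bytes(record_b))  # bytes match
print(digest(canonical_bytes(record_a)) == digest(canonical_bytes(record_b)))
```

If the bytes differ, any hash computed over them differs too, so two nodes holding the same data would disagree about the tree.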

So, what do Protocol buffers have to do with this?

“In the simplest sense, Protocol buffers (Protobufs) are a way to encode structured data in an efficient yet extensible format.”

Protobufs are Google’s language-independent, platform-independent method of serializing structured data. You define how you want your data to be structured once, then you can use the specially generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

Great, how do I use them?

We specify how we want the information to be structured when serialized by defining protocol buffer message types in .proto files. Each message is a logical record of information containing a set of name-value pairs. A basic example of a .proto file for an account is shown below.

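syntax = "proto3";
// One account record; the number after "=" is the field's unique number.
message Account
{
  string public_key = 1;
  int32 acc_no = 2;
  string acc_name = 3;
  string acc_email = 4;
}
// A container holding a list of accounts.
message AccountContainer
{
  repeated Account entries = 1;
}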

The format is visibly simple: each message has one or more uniquely named fields. Each field has an identifier and a value type. The value type can be integer, floating-point, boolean, string, bytes or even another message type, which enables you to create a hierarchy of data.

Field rules such as required and optional can be used for data validation, while repeated marks a collection of similar data [Note: the proto3 syntax used in the example above drops the required rule, although it is still described in the official proto2 documentation]. The number assigned to each field identifies it in the serialized bytes, so it must be unique within a message and should not change once the message type is in use.
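
To illustrate how field numbers show up on the receiving side, the following is a hand-written sketch of a reader for the Protobuf wire format. It is not the library's code, and it handles only the two wire types the Account example needs (varint and length-delimited). Each field starts with a tag, which packs the field number together with a wire type. The sample bytes are an assumption: they encode public_key = "3123" in field 1 and acc_no = 150 in field 2.

```python
from typing import List, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry data
        if not byte & 0x80:                # high bit clear: last byte
            return result, pos
        shift += 7


def read_fields(buf: bytes) -> List[Tuple[int, object]]:
    """Split a serialized message into (field_number, value) pairs."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:      # varint: int32, int64, bool, ...
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited: string, bytes, messages
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, value))
    return fields


message = bytes.fromhex("0a0433313233" "109601")
print(read_fields(message))  # [(1, b'3123'), (2, 150)]
```

Note that field names never appear in the bytes. Only the numbers do, which is why renumbering a field breaks compatibility with data that has already been written.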

Once defined, the .proto file must be compiled with the protocol buffer compiler for the preferred language to generate the data access classes. These include functions to serialize the structure to raw bytes and to parse it back. For example, if the chosen language is Python, the compiler generates a module named account_pb2.py, which is imported into the application wherever data is serialized or retrieved. The process of serializing looks somewhat like this:

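import account_pb2  # module generated from the .proto file
# acc_name, acc_no, acc_email and address come from the surrounding application
container = account_pb2.AccountContainer()
account = container.entries.add()  # append a new Account to the repeated field
account.public_key = "3123"
account.acc_name = acc_name
account.acc_no = acc_no
account.acc_email = acc_email
state_entries_send = {}
# store the serialized container under its address
state_entries_send[address] = container.SerializeToString()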

Similarly, the data is retrieved by:

entry = someFunction()  # placeholder for the call that returns the stored entry
container = account_pb2.AccountContainer()
container.ParseFromString(entry.data)  # rebuild the container from raw bytes

But why not just use JSON or XML?

Protocol buffers have many advantages over JSON or XML for serializing structured data.
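
One commonly cited advantage is compactness. As a rough sketch, the same account can be encoded once as compact JSON and once by hand in the Protobuf wire format, and the sizes compared. The helper functions below are written for this illustration and are not the Protobuf library. The account values are made up, and the field numbers follow the Account message defined earlier.

```python
import json


def varint(n: int) -> bytes:
    """Base-128 varint for a non-negative integer."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # 7 data bits, high bit means "more"
        n >>= 7
    out.append(n)
    return bytes(out)


def string_field(number: int, text: str) -> bytes:
    # tag (field number + wire type 2), then length, then UTF-8 bytes
    raw = text.encode("utf-8")
    return varint((number << 3) | 2) + varint(len(raw)) + raw


def int_field(number: int, value: int) -> bytes:
    # tag (field number + wire type 0), then the value as a varint
    return varint((number << 3) | 0) + varint(value)


account = {
    "public_key": "3123",
    "acc_no": 1001,
    "acc_name": "alice",
    "acc_email": "alice@example.com",
}

as_json = json.dumps(account, separators=(",", ":")).encode("utf-8")
as_binary = (
    string_field(1, account["public_key"])
    + int_field(2, account["acc_no"])
    + string_field(3, account["acc_name"])
    + string_field(4, account["acc_email"])
)

print(len(as_json), len(as_binary))
```

Most of the saving comes from replacing field names and punctuation with one-byte tags. The gap narrows when the payload is dominated by long strings, so it is worth measuring with your own data.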

Excited to get started?

Before embedding this in your latest project, learn how it stores and exchanges data. Understand the options and decide whether the versioned scheme works to your benefit. More often than not, Protobuf will pave the way for easier and more efficient data and memory usage.

To begin, download the Protocol Buffer package for your preferred language, or use the complete package, which covers languages such as Python, Java, and C++. Refer to the documentation to build and install the packages.

Refer to the official tutorial for an intuitive walk-through of the recommended usage and implementation conventions. Separate tutorials are available for each of the languages mentioned above. An example of its usage can be found in the implementation of Hyperledger Sawtooth, where data communication and storage in a decentralized setting work with efficiency and high scalability.