Data Serialization
Data serialization is the process of transferring data structures from the in-memory representation used for working with the data (e.g. structs) to a binary data stream (usually a sequence of octets) that can be transmitted over a network or stored on non-volatile memory (e.g. disk, flash). The process in the opposite direction, from disk or network back into memory, is called deserialization or parsing.
There are many ways to serialize data, each with individual advantages and drawbacks. In this post I want to focus on handling generic structured data. The most popular formats for structured data are probably XML and JSON, but there are also more advanced techniques for creating highly efficient binary data streams.
XML and JSON
XML and JSON are both generic file formats that can be used for almost any kind of application and structured data, except for BLOBs (binary large objects, e.g. pictures, video streams, audio data). Both can be used across a variety of platforms and can easily be extended later if necessary (e.g. a later software version with new features) without losing backward compatibility. Additionally, their representation is human readable, which makes these kinds of files easy to work with.
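For illustration, the same small record could look like this in both formats (the record and its field names are made up for this example):

    <device>
      <hostname>sensor-01</hostname>
      <port>8080</port>
    </device>

The same data in JSON:

    {
      "device": {
        "hostname": "sensor-01",
        "port": 8080
      }
    }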
Comparing XML to JSON, XML provides concepts for meta-data and the ability to restrict data values to certain ranges, which provides means for verification and validation of data. However, these features are rather heavyweight when it comes to their implementation and run-time impact. Nevertheless, powerful embedded systems like smartphones often rely on this technique.
For less powerful embedded systems XML is mostly out of scope, as RAM is limited and processing power may be insufficient to reach acceptable performance. JSON has much lower requirements when it comes to processing power, but it lacks some of the features that make XML interesting. Nevertheless, JSON libraries for embedded systems are widely available, and JSON is also useful when interacting with web servers and other generic data-processing peers.
Still, even JSON requires a good amount of code and processing time in comparison with binary file formats.
Google Protocol Buffers
There are many binary file formats available for all kinds of applications: images, videos, audio, archives, and many others. They use different concepts to encode and compress the data. Providing data serialization for generic data structures is quite a complex task. Therefore, Google provides Protocol Buffers for specifying structured data and generating code for parsing and serializing that data.
The Protocol Buffers language has concepts for extending the file format later. To this end, every individual data field gets a unique identifier that must not be changed over the lifetime of the file format. The associated data type, however, may be changed and extended when certain guidelines are followed. This concept makes it well suited for applications that may need later extensions and may want to retire certain functions without breaking compatibility.
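As a minimal illustration (the message and field names are invented for this example), a protocol buffers description assigns each field such an identifier, and identifiers of retired fields can be reserved so they are never reused:

    syntax = "proto3";

    message DeviceInfo {
      string hostname = 1;   // 1 and 2 are the unique field identifiers
      uint32 port     = 2;   // they must never change once in use
      reserved 3;            // field 3 was retired; its number stays blocked
    }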
Unfortunately, even their lite library implementation is rather heavyweight and not targeted at embedded systems. Therefore, I have implemented Wire Format Compiler, which extends the language of protocol buffers with options and concepts for embedded systems. It is also hosted on GitHub, where everybody can participate in the work and file bug reports.
Wire Format Compiler
Wire Format Compiler's language has been directly derived from protocol buffers. Of course, if WFC-specific extensions and features are used, the protocol description will no longer work with protocol buffers, but it is perfectly possible to write data structure specifications that work with both compilers.
In contrast to protocol buffers, Wire Format Compiler is designed with a strong focus on embedded systems. Therefore, many optimizations for reducing code size and increasing processing performance are implemented. Additionally, new data types have been added (e.g. fixed8, sfixed16), so that targeting even 8- and 16-bit controller families becomes feasible.
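A message could use these small fixed-width types directly. The message and field names below are only illustrative; the field layout follows the type/name/identifier pattern described in the example section at the end of this post:

    message Measurement {
      fixed8   channel = 1;   // 8-bit fixed-width integer
      sfixed16 value   = 2;   // signed 16-bit fixed-width integer
    }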
Furthermore, there is a new concept for specifying options to tailor and optimize one data structure definition for multiple applications on different targets. The serialized data will be usable among all applications, but individual tuning makes it possible to remove unsupported and unneeded data structures by telling the compiler with "used=false" that a specific member will not be used. Furthermore, the data types that will be used for strings and byte arrays can be adjusted, so that specialized classes can be used.
Additionally, many options are provided to generate even more optimized code for specific targets. E.g. there is support for optimizing for little-endian systems that allow unaligned access, while it is still possible to also target big-endian systems, which work completely differently when it comes to byte placement in memory.
A small example
This is a small example that shows a wfc description and its generated header file. The first part of the wfc description has three different option sets, which I describe below in more detail. After the option descriptions, a message description follows. This is a data structure that will be generated as a C++ class with methods for handling the data fields and methods for (de-)serialization.
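Since the original listing is not reproduced here, the following sketch only gives an idea of the structure. The option-set contents are merely summarized in comments (see the wfc documentation for the exact option syntax), and the message and field names (NodeConfig, hostname, mode, samples) are invented for this illustration:

    // Option sets (contents summarized, exact syntax omitted):
    //   - "embedded":     32-bit target, strings as plain C strings (char *)
    //   - "embedded_dyn": uses "astring" as the string data type
    //   - a third set (not detailed in this post)

    message NodeConfig {
      string   hostname = 1;          // optional member: has_hostname() is generated
      fixed8   mode     = 2;          // small fixed-width type for tiny targets
      repeated sfixed16 samples = 3;  // repeated member: a size accessor is generated
    }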
The description of a field requires three elements: a field type, the field name, and a unique identifier. The unique identifier is set once for every field name and must not change later. The associated data is serialized with this unique identifier, and therefore, it is needed for deserialization. Changing the type is only possible if the change is compatible: e.g. a variable-length integer might be changed from 8 to 16 bits, but the change of type must not affect the serialization concept of the member. Changing the name only conflicts with associated source code, so it is usually unproblematic, unless special concepts like JSON support or ASCII generation are used.
The first option set, called "embedded", is optimized for 32-bit embedded systems without string handling. I.e. the strings that are referenced in the data structure are regular C strings (char pointers), and their memory management must be done independently. In contrast, the option set "embedded_dyn" specifies "astring" as the string data type. You could also specify your own data type, as long as it has the basic interface of std::string, like the member functions size() and c_str().
The generated header file includes a class definition that reflects the message description. It includes a function for calculating the concrete size of a serialized object (calcSize). This can be used to allocate enough memory before using the toMemory function to serialize the object to memory. After that, the memory block can be written to a file, to flash memory, or sent via network, as needed. For deserializing the data on the receiver side, the fromMemory function can be used.
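In code, the typical round trip looks roughly like this. This is a sketch only: the exact signatures and return values of toMemory and fromMemory depend on the wfc version and the chosen options, and nodeconfig.h/NodeConfig refer to the illustrative message above:

    #include <cstdint>
    #include <vector>
    #include "nodeconfig.h"   // hypothetical generated header

    bool roundTrip(const NodeConfig &in, NodeConfig &out)
    {
        size_t n = in.calcSize();               // exact size of the serialized object
        std::vector<uint8_t> buf(n);            // allocate just enough memory
        in.toMemory(buf.data(), buf.size());    // serialize into the buffer
        // buf could now be written to a file/flash or sent over the network
        // assumption: a negative return value of fromMemory signals a parse error
        return out.fromMemory(buf.data(), buf.size()) >= 0;
    }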
In addition, wfc supports the generation of functions for using a string instead of a memory block for serialization. This is more convenient to use, but might not be feasible on all embedded systems.
For every message member, functions are generated to get and set its value and, for repeated members, to determine the number of elements (size). For optional members, functions are added to determine whether a certain member has been set (e.g. has_hostname). All generated functions employ statically linked core functions, but options can be used to share the core functionality among multiple message objects.
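Using the illustrative NodeConfig message from above, and assuming protobuf-style accessor naming (the names generated by wfc may differ depending on its options), working with the accessors could look like this:

    NodeConfig cfg;
    cfg.set_hostname("sensor-01");      // setter for the optional string member
    if (cfg.has_hostname()) {           // presence check for an optional member
        cfg.add_samples(-40);           // append to the repeated member
    }
    size_t count = cfg.samples_size();  // number of elements of the repeated member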
The options and concepts shown in this post are only the most basic ones. Please refer to the documentation of the available options to learn in what other ways the code generation can be influenced.
If you have questions or want to file a bug report, please refer to the project page on GitHub or get in touch with me directly.