Tuesday, February 11, 2020

Data Serialization - Getting started with Wire Format Compiler

Getting started with Wire Format Compiler

Data serialization is the process of turning an in-memory data structure into a stream of bytes that can be sent over a network connection or written to disk, and of restoring it afterwards. For PCs, servers, and smartphones there are many options for doing this. There is no right or wrong way to do it, but different concepts offer different advantages and disadvantages.

For embedded systems, Wire Format Compiler (WFC) is a tool that generates C++ source code that performs this task. The generated code is light-weight enough to run on small embedded systems. Its description language is inspired by Google's Protocol Buffers, but it offers numerous enhancements for use on embedded systems and provides about 3-4 times better performance when it comes to processing time and ROM footprint. Additionally, WFC even lets you target very small embedded systems, such as 16-bit or even 8-bit devices. For this, many options exist that allow the generated code to be tuned by taking advantage of specific aspects of small devices, such as a smaller address space, register width, or implementation concepts.

A first simple example

As a first example let's take a look at the configuration data of a potential embedded device. The data description in WFC source format looks like this:

message Config
{
        string     hostname   = 1;
        unsigned   baudrate   = 2;
        string     wifi_ssid  = 3;
        string     wifi_pass  = 4;
        bool       station    = 5;
        fixed32    ipv4       = 6;
        fixed8     netmask    = 7;
        fixed32    gateway    = 8;
}
This description defines a data structure with several members. Each line specifies the data type first, then the name of the member, and finally its unique identifier. The unique identifier must never be changed once data has been brought into real-world use. The data type can be modified in later software updates, with some restrictions that I may talk about in a later post. WFC-generated code provides strong backward and even forward compatibility, as long as you follow a few rules when extending messages for a later release. Forward compatibility is achieved by being able to skip over unknown fields while still extracting what is known.
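For example, a later firmware release could extend the message with an additional member, as long as all existing identifiers stay untouched and the new member gets a previously unused identifier (the new member's name and type below are made up for illustration):

message Config
{
        string     hostname   = 1;
        unsigned   baudrate   = 2;
        string     wifi_ssid  = 3;
        string     wifi_pass  = 4;
        bool       station    = 5;
        fixed32    ipv4       = 6;
        fixed8     netmask    = 7;
        fixed32    gateway    = 8;
        unsigned   log_level  = 9;
}
An older firmware revision that receives data written by this newer revision simply skips member 9 and still extracts members 1 through 8.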

The data types used in this example are:
  • string: this will be implemented as a std::string by default, but it can be changed to a more appropriate, target-specific type. How to do this is something I will write about in a different post.
  • unsigned: This data type will be generated as a platform-specific unsigned integer type. Take care that the default unsigned type has the appropriate range of values. This data type is serialized as a variable-width integer that requires at most 10 bytes of storage, plus usually 1 or 2 bytes for the message member tag (a sketch of this encoding follows after this list).
  • bool: This is just the regular boolean type. Its serialization also relies on the variable-length integer concept, but boolean data will of course only take 1 byte. Because it relies on the variable-length integer concept, a boolean member can later be changed to an enumerated data type or to some integer type that is not serialized in signed format.
  • fixed8 and fixed32: These are fixed-width integers that will be represented as uint8_t and uint32_t, respectively. fixed8 requires 1 byte and fixed32 needs 4 bytes in their serialized representation, plus the tag.
Of course there are many more types. I will take a look at what other types are available in a later post.
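To give an idea of what "variable-width integer" means here, the following sketch shows a Protocol-Buffers-style base-128 varint encoder: every byte carries 7 data bits, and the most significant bit flags that more bytes follow, so a 64-bit value needs at most 10 bytes. This is for illustration only; WFC's generated code ships its own implementation, and the function name is made up.

#include <stdint.h>
#include <stddef.h>

size_t encode_varint(uint64_t value, uint8_t *out)
{
        size_t n = 0;
        while (value >= 0x80) {
                out[n++] = (uint8_t) (value | 0x80);  // lower 7 bits, continuation bit set
                value >>= 7;
        }
        out[n++] = (uint8_t) value;                   // final byte, continuation bit clear
        return n;                                     // number of bytes written (1..10)
}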

Without any additional options, WFC will generate code for serializing and deserializing the data structure to and from memory. Additionally, helper functions are generated to calculate the serialized size of a specific data instance and to serialize the structure directly into a C++ std::string. Furthermore, options exist to also create ASCII or JSON output for human interaction or for interfacing with other tools.

Integrating the generated code into an application

To integrate the generated code into your application, you need to include the generated header file in your code and add the generated C++ source file to your build process. To make use of the generated code, simply create an instance of the classes defined in the generated header file. In this case the class is called Config, just like the message.

So let's take a look at how initializing, serializing, and deserializing work in practice. This example is also included in the WFC source distribution and contains multiple variants of how to use the generated code. Here we just take a look at the variant that uses C++ streams.

To initialize the structure with default values, just set the relevant members:
void initDefaults(Config &cfg)
{
        cout << "initializing defaults\n";
        cfg.set_hostname("host001");
        cfg.set_baudrate(115200);
        cfg.set_wifi_ssid("mywifi");
        cfg.set_wifi_pass("1234567890");
}
Next let's serialize the data to a file:
void saveConfig(const Config &cfg)
{
        string str;
        cfg.toString(str);      // serialize the data
        ofstream out;
        out.open(CfgName);      // open file
        out << str;             // write to file
        if (out.good())
                cout << "writing config successful\n";
        else
                cerr << "error writing config\n";
}
That was also quite easy: we just serialize the data to a string and write the string to a file. That's it. Now comes the last important step, deserializing the data:
int readConfig(Config &cfg, const char *CfgName)
{
        ifstream in;
        in.open(CfgName,ios::in);   // open file
        in.seekg(0,ios::end);       // seek to end...
        ssize_t n = in.tellg();     // to get the size.
        in.seekg(0,ios::beg);       // back to begin
        if (n <= 0)
                return 1;
        char *buf = new char[n];    // allocate buffer
        in.read(buf,n);             // read data
       
        cfg.fromMemory(buf,n);      // parse the data
        delete[] buf;               // clean-up
        // That's it! Ready to go.
       
        cout << "read config from " << CfgName << endl;
        return 0;
}
Reading back is more complex, because we need to know how much memory to allocate, and the C++ stream interface has no easy way to determine the size of a file. WFC even has a concept that enables it to find the end of the stream by itself, but this is an extension that I do not want to dive into here, because it needs some additional considerations. Restoring the serialized data is again just one line that asks the relevant class object to parse the data from memory. What is done behind the scenes can be found in the code generated by WFC, but you don't have to understand any of it to use the generated code.
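To round things off, the three helpers above might be tied together like this. This is only a sketch: the name of the generated header, the configuration file name, and the overall flow are assumptions, and the actual example in the WFC distribution may look different.

#include "config.h"                           // WFC-generated header for message Config (name assumed)

static const char *CfgName = "config.bin";    // storage location (assumed)

int main()
{
        Config cfg;
        if (readConfig(cfg, CfgName) != 0) {  // no stored configuration yet?
                initDefaults(cfg);            // fall back to defaults...
                saveConfig(cfg);              // ...and persist them for the next start
        }
        // cfg is now ready to be used by the application
        return 0;
}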

That's it for this post. The next posts will take a look at more data types, how to do some target-specific optimizations, and how to use custom string types to reduce the footprint even further.

Get Wire Format Compiler. Happy hacking!

Automatic Versioning with Make/CMake and Mercurial

Keeping Version Strings Up-To-Date

A software package and its binary executables should have a version number or a release name. This is more or less common practice, due to the benefits of being able to identify the version of a specific piece of software.

But what is the best way to bring the version tag into the software and keep it up-to-date? Traditionally, version information has been hard-coded into the source code. This is the simplest and most straightforward approach, but it comes with a big question: when and how do you update the version to, say, version 2.0?

It makes sense to do this in an exclusive commit to the revision control system of the project. Additionally, it should be done after performing all the release and regression tests. But if the version string is hard-coded, you will be unable to figure out from which commit of your repository a specific binary was built, unless you update the version string on every commit. Yet if you cannot ensure that only the latest and greatest binary will be used, it can be quite important during development to identify which commit a specific binary relates to.

Furthermore, if you change the hard-coded version string after release testing, how do you know you didn't break the code while changing it? So another release test of the version with the correct string might become necessary. And, as we all know, manual tasks are prone to error. Sooner or later, manually updating the version string will be forgotten, will be done incorrectly, or will introduce some bug. Maybe the code will not compile, or, even worse, the version string will be wrong.

So the simple hard-coded version string concept may work well in many cases, but might not make everybody happy. 

Automatic Version Generation

Therefore, it makes sense to think about a better solution: generating the version string at build time. This concept can also provide more information about the build itself, e.g. the hostname of the build server, the versions of the tools used for building, the time of the build, and so on.

For this, support from the build system and the version control system is necessary. The build system must trigger an update of the relevant information and link it to the binary.

But how can this information be gathered when it should be available at build time, yet not be hard-coded into the sources? One way to do it is to use the infrastructure of Mercurial as the revision control system and its tagging mechanism. The advantage of this concept is that it also works for archives that have been generated by Mercurial but have no reference to the repository. With other revision control systems you might have to come up with another approach, but you will probably find a similar solution.

Using GNU Make or CMake as the build system, Mercurial's infrastructure can easily be employed to provide the necessary information. Mercurial's tags provide the ability to associate a given commit with a version string. This way, a specific revision can be given a name or version after its commit has been submitted and tested. So you can do the commit, test it, and once you are sure all release prerequisites are fulfilled and it is ready for publication, you tag the revision with a version name without changing any code manually. This dramatically reduces the risk of breaking anything.

Furthermore, Mercurial also lets you query the distance to the latest tag. So if the latest tag is always the latest version number, the delta can be used as a patch level on top of the named version. For a Mercurial repository, the latest tag and the distance as patch level can be queried with 'hg log -r . --template "{latesttag}.{latesttagdistance}"'.

Now, if the repository gets exported to a zip or tgz archive, the repository cannot be queried anymore. The good thing is that Mercurial creates a file called .hg_archival.txt in the archive that contains just this information. To extract the version information from this file, some shell scripting with awk or grep and sed is necessary. All of this is demonstrated here in a sample repository with a small shell script, which should work on all UNIX-based systems like Linux, BSD, Solaris, or macOS.
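.hg_archival.txt is a small key-value file. For the template repository in the state shown in the example below, its content would look roughly like this (the full hashes are replaced by placeholders here, and the exact set of keys depends on the Mercurial version):

repo: <full hash of the repository's first changeset>
node: <full hash of the archived changeset>
branch: default
latesttag: V0.1
latesttagdistance: 1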

The version string itself must be written to a header or source file so that make and CMake are able to recognize a change in the version string and trigger the appropriate build steps, unless you want to do a full build every time. Passing the version string as a #define on the compiler command line will not yield the intended result, as GNU Make and CMake might overlook the resulting source delta. Therefore, just make sure to generate a source file that is picked up during the build process and contains all the version information you would like to have integrated into the binary.

Example

Let's take a look at a small example of how to use the template repository. The demo repository contains a file called hello.c that prints the version that has been compiled into its executable. Both CMake and GNU Make are supported. Support for BSD Make is missing, so on BSD and Solaris you will have to call gmake instead of make.

First, we start by cloning the template repository:
> hg clone mkversion.hg myproject 

Then we build the project with autoconf and make and take a look at what we get:

> ./configure
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking target system type... x86_64-pc-linux-gnu
checking for cc... cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether cc accepts -g... yes
checking for cc option to accept ISO C89... none needed
configure: creating ./config.status
config.status: creating Makefile
> make
cc -MM -MG hello.c -o .depend
sh mkversion.sh
creating version.h
cc -g -O2   -c -o hello.o hello.c
cc -g -O2  hello.o -o hello
> cat version.h
#ifndef VERSION_H
#define VERSION_H
#define VERSION         "V0.1.1 (hg:1/7906498bc6e3)"
#define HG_REV          "1"
#define HG_BRANCH       "default"
#define HG_NODE          "7906498bc6e36f95daf03ffce97a18c3000990fb"
#define HG_ID           "7906498bc6e3"
#define HG_TAGS         "tip"
#define HG_LATESTTAG    "V0.1"

#endif
> ./hello
version V0.1.1 (hg:1/7906498bc6e3)
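For reference, hello.c essentially just prints the VERSION macro from the generated version.h. A minimal version could look like the sketch below; the actual file ships with the template repository and may differ in detail.

#include <stdio.h>
#include "version.h"    /* generated by mkversion.sh during the build */

int main(void)
{
        printf("version %s\n", VERSION);
        return 0;
}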


What we see here is that we got version V0.1.1 after cloning the repository, although the latest tag is V0.1. So let's take a look at the log:
> hg log
changeset:   1:7906498bc6e3
tag:         tip
user:        Thomas Maier-Komor <thomas@maier-komor.de>
date:        Thu Jul 25 07:37:32 2019 +0200
summary:     Added tag V0.1 for changeset c6295b293642
 
changeset:   0:c6295b293642
tag:         V0.1
user:        Thomas Maier-Komor <thomas@maier-komor.de>
date:        Thu Jul 25 07:37:25 2019 +0200
summary:     initial checkin of build template with version generation
> hg id
7906498bc6e3 tip
As you can see, the clone updated the sandbox to the latest revision, which is the addition of the tag for changeset 0. Therefore, if we want to get the expected version information, we must update to the revision that we tested and tagged afterwards. I.e.:
> hg up -r V0.1
After that and rebuilding the binary, we get the expected result.
> ./hello
version V0.1.0 (hg:0/c6295b293642)
As written above, Mercurial also provides the infrastructure to determine whether the sandbox used for building has any uncommitted changes. This makes it easy to integrate this important piece of information into the version string: the template appends a plus character to the version string if uncommitted changes are detected at build time. Of course, you can change the plus character to something else, or even cancel the build if you want to make sure that only reproducible binaries are created. This kind of restriction could also be limited to a specific build server.

Let's see how it works. Just make a simple modification to one of the files that are tracked in the repository. E.g. add a newline at the end of hello.c:
> echo >> hello.c
After that, trigger a new build with make. The version string then looks like this:
% ./hello 
version V0.1.1+ (hg:1/7906498bc6e3)
There is no big magic in this template: just a shell script and its integration into the build infrastructure with GNU Make and CMake. You can easily expand it to include the username and/or hostname of the person who triggered the build, or whatever else you would like to see.

Get the template as a Mercurial repository here. I hope you like it. I am rolling this concept out to all of my software development projects.