Protobuf code generation in Rust

Today, I learned how to correctly use Cargo build scripts. Or, more precisely, I learned how to do one particular thing correctly, but it was significant enough for me that I decided to write it down. Of course, had I read the Cargo Book more carefully before, I would have saved myself some time, and there would be no dramatic revelation, and no reason to write this post either. I guess what I am trying to say is: thank goodness my reading sucks.

Keywords: cargo build scripts code generation

Context

My problem arose while implementing a little library that reads from and writes to a Protocol Buffer stream. As described by the authors themselves:

Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.

The very first step is to define a message format in Protobuf's dedicated grammar. The definition is parsed by the Protobuf library (available for various programming languages), and the corresponding code is generated for the target language (in our case, Rust). Each message is an object with a bunch of getters and setters (details differ depending on the language). These messages can then be read from or written to streams available in the Protobuf library. It is all quite straightforward. Take this simple message from the Protobuf web page:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

In Rust, we would use the following chunk of code to construct a Person:

let mut person = Person::new();
person.set_name("Butler".to_string()); // the generated setter takes an owned String

The details are not very important here. What is important is that this code needs to be generated before the project is compiled, or else there would be no Person to speak of. And, if possible, we would prefer if it was neatly integrated into our build system.

Cargo build scripts

Our situation is by no means unique. Probably the most canonical example is a crate that provides Rust bindings to some C library, such as libc, git2, and many others. Before compiling our Rust crate, we first need to compile the C code, and maybe even generate Rust FFI bindings from C headers (see the bindgen crate). Because this is a common pattern and the Rust ecosystem is fantastic, there is a standard solution for this: build scripts.

Simply put, Cargo allows us to define a build script, by default named build.rs and located at the root of the project. This is essentially a regular executable (with a main function and all), except that Cargo provides a bunch of environment variables with useful build information. The build script in turn communicates back to Cargo by writing instructions to the standard output. For example, printing cargo:warning=MESSAGE will instruct Cargo to print a warning to the terminal. More on that a little later, but for now here is a simple example that compiles a C source file using the cc crate:

fn main() {
    cc::Build::new()
        .file("src/example.c")
        .compile("example");
}

Note that there is a special section in Cargo.toml where you can define dependencies that are only used by the build script: [build-dependencies].
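To make the standard-output protocol a bit more concrete, here is a minimal sketch of how a build script could assemble the lines it prints. The helper names (rerun_if_changed, warning, emit_directives) are my own and purely illustrative. Besides cargo:warning, the sketch uses cargo:rerun-if-changed, another Cargo instruction, which asks Cargo to rerun the build script only when the given file changes.

```rust
/// Builds the line asking Cargo to rerun the build script when `path` changes.
/// (Hypothetical helper, named for illustration only.)
fn rerun_if_changed(path: &str) -> String {
    format!("cargo:rerun-if-changed={}", path)
}

/// Builds the line asking Cargo to display `message` as a build warning.
/// (Hypothetical helper, named for illustration only.)
fn warning(message: &str) -> String {
    format!("cargo:warning={}", message)
}

/// Prints both instructions to standard output, where Cargo picks them up.
/// In a real build.rs, this would be called from `main`.
fn emit_directives() {
    println!("{}", rerun_if_changed("protos/person.proto"));
    println!("{}", warning("regenerating Protobuf code"));
}
```

Keeping the formatting in small functions is just a convenience for the sketch; a real build script would typically call println! directly.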

Generating Protobuf code

Now that we have basic information about build scripts, we can take a shot at generating the Rust code for Protobuf messages. Luckily, there is already a crate that will make it very simple:

[build-dependencies]
protobuf-codegen-pure = "2.14" # Might be different by the time you read this

[dependencies]
protobuf = "2.14" # This will be needed to use the generated code as protobuf messages

A quick look at the documentation explains it all:

fn main() {
    protobuf_codegen_pure::Codegen::new()
        .out_dir("src/protos")
        .inputs(&["protos/person.proto"])
        .include("protos")
        .run()
        .expect("Codegen failed.");
}

Let's break it down. We first create a Codegen object, which implements a builder pattern. Then, we define where the generated files should be created. Finally, we need to point to the input files that contain the message definitions, and to the directory containing these files. That's it. Simple, right?

Well, not so fast. There is a catch. See, Cargo doesn't want us to write to the src directory:

Build scripts may save any output files in the directory specified in the OUT_DIR environment variable. Scripts should not modify any files outside of that directory.

But why would they care? Well, it is a security concern. If a crate is built remotely, we don't want to allow what is effectively a user-defined program to write anywhere it wants. A good example is Docs.rs, which hosts API documentation of all crates available on crates.io. It limits the program's write permissions to a single directory, whose path is passed in the OUT_DIR environment variable. In fact, if you follow the instructions from the protobuf-codegen-pure crate, your documentation on Docs.rs will fail to build (this is precisely how I found out about all of this!).

Correcting the build script

So how do we fix our build script? Let's try this:

use std::env;
use std::path::Path;

fn main() {
    // Cargo sets OUT_DIR to the only directory the build script should write to.
    let out_dir_env = env::var_os("OUT_DIR").unwrap();
    let out_dir = Path::new(&out_dir_env);
    protobuf_codegen_pure::Codegen::new()
        .out_dir(out_dir)
        .inputs(&["protos/person.proto"])
        .include("protos")
        .run()
        .expect("Codegen failed.");
}

But where is our file now and how do we use it? Rust provides the include! macro, which inserts the contents of a file at the point where it is invoked. For example, here is a little snippet from lib.rs showcasing this:

include!(concat!(env!("OUT_DIR"), "/person.rs"));

fn new_person(name: &str) -> Person {
    let mut person = Person::new();
    person.set_name(name.to_string());
    person
}

Does it work now? Unfortunately, not quite. protobuf-codegen-pure takes the liberty of adding some module-level comments and attributes that suppress certain warnings, and these now fail to compile:

error: an inner attribute is not permitted in this context
 --> /home/elshize/dev/ciff/target/debug/build/ciff-e8fd3067377fd4eb/out/common_index_format_v1.rs:5:1
  |
5 | #![allow(unknown_lints)]
  | ^^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: inner attributes, like `#![no_std]`, annotate the item enclosing them, and are usually found at the beginning of source files. Outer attributes, like `#[test]`, annotate the item following them.

This is caused by the indirection of include!. The good news is that it is a known problem, and chances are it will already be resolved by the time you read this. But since I don't have the luxury of travelling through time, I would like to find a workaround. Besides, it is a good opportunity to show how powerful build scripts really are. We are by no means limited to what the codegen library generates for us.

The objective is to get rid of those failing comments and attributes. On the other hand, I would still like to be able to suppress warnings from the generated code. To do that, we can create a person.rs module that simply defines the attributes and includes the generated code, which can later be re-exported by lib.rs. For example:

#![allow(unknown_lints)]
#![allow(clippy::all)]
#![allow(clippy::pedantic)]
#![allow(box_pointers)]
#![allow(dead_code)]
#![allow(missing_docs)]
#![allow(non_camel_case_types)]
#![allow(non_snake_case)]
#![allow(non_upper_case_globals)]
#![allow(trivial_casts)]
#![allow(unsafe_code)]
#![allow(unused_imports)]
#![allow(unused_results)]
include!(concat!(env!("OUT_DIR"), "/person.rs"));
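To see the wrapper module and the re-export working together, here is a self-contained sketch. It rests on two assumptions of mine: the inline mod person stands in for the src/person.rs file, and the hand-written Person struct stands in for the generated code, which in reality has many more fields and methods. The new_person helper mirrors the earlier lib.rs snippet.

```rust
// Stand-in for src/person.rs. Inner attributes are accepted here because
// they come first inside the module, before any item.
mod person {
    #![allow(dead_code)]
    #![allow(non_camel_case_types)]

    // In the real module, the next line would be:
    // include!(concat!(env!("OUT_DIR"), "/person.rs"));
    // A simplified, hand-written stand-in keeps this sketch self-contained.
    #[derive(Default)]
    pub struct Person {
        name: String,
    }

    impl Person {
        pub fn new() -> Self {
            Self::default()
        }

        pub fn set_name(&mut self, v: String) {
            self.name = v;
        }

        pub fn get_name(&self) -> &str {
            &self.name
        }
    }
}

// What lib.rs would do: re-export the message type from the wrapper module.
pub use person::Person;

/// Constructs a `Person` with the given name.
pub fn new_person(name: &str) -> Person {
    let mut person = Person::new();
    person.set_name(name.to_string());
    person
}
```

With this layout, the rest of the crate only ever sees Person and never needs to know that it comes from a generated file.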

Great, now the only thing that is left is to remove these from the generated file. This can be easily done in build.rs. Once the file is successfully generated, we can read it line by line and filter out any line that starts with #! or //!.

use std::env;
use std::fs::{read_to_string, File};
use std::io::{BufWriter, Write};
use std::path::Path;

fn main() {
    let out_dir_env = env::var_os("OUT_DIR").unwrap();
    let out_dir = Path::new(&out_dir_env);
    protobuf_codegen_pure::Codegen::new()
        .out_dir(out_dir)
        .inputs(&["protos/person.proto"])
        .include("protos")
        .run()
        .expect("Codegen failed.");
    // Resolve the path to the generated file.
    let path = out_dir.join("person.rs");
    // Read the generated code to a string.
    let code = read_to_string(&path).expect("Failed to read generated file");
    // Write filtered lines to the same file.
    let mut writer = BufWriter::new(File::create(path).unwrap());
    for line in code.lines() {
        // Skip inner doc comments (//!) and inner attributes (#!).
        if !line.starts_with("//!") && !line.starts_with("#!") {
            writer.write_all(line.as_bytes()).unwrap();
            writer.write_all(b"\n").unwrap();
        }
    }
    // Flush explicitly so that a write error is not silently ignored on drop.
    writer.flush().unwrap();
}
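The filtering step can also be pulled out into a pure function, which makes it easy to check without running the code generator. This is only a sketch: the name strip_inner_attributes is my own, and it follows the same rule as the build script above, dropping any line that starts with //! or #!.

```rust
/// Returns `code` without the lines that start with an inner doc comment
/// (`//!`) or an inner attribute (`#!`). Every kept line ends with '\n'.
/// (Hypothetical helper, named for illustration only.)
fn strip_inner_attributes(code: &str) -> String {
    let mut filtered = String::with_capacity(code.len());
    for line in code.lines() {
        if line.starts_with("//!") || line.starts_with("#!") {
            continue;
        }
        filtered.push_str(line);
        filtered.push('\n');
    }
    filtered
}
```

Note that, just like the loop in build.rs, this only matches lines that begin with the marker, so an indented inner attribute would be left in place.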

Result? See for yourself.

Conclusions

I have a few takeaways from my little experiment. First, the Rust ecosystem, although rich and powerful, is not yet fully mature. Certain details are still being ironed out, such as those in protobuf-codegen-pure. But this is to be expected. I think what is more important is that these libraries are out there, and that so many people are actively working on making them better each day. But most of all, I am often blown away by how well thought out many features of Rust and Cargo are, especially compared to those available, say, in C++. Build scripts are one of those gems that elevate Rust to the great piece of technology it is.


Questions? Comments? I am @elshize on Twitter and @siedlaczek at Mastodon.Social. Feel free to say hi.