Remote C/C++ compilation using Ccache & Distcc

Here at Zaleos, we are in favor of testing everything in our development process. In one of our larger projects, we are using C/C++, and we have been adding more tools to our CI pipeline to test more and more aspects of the code. We have unit tests, memory leak checkers, code format checkers, integration tests, API tests, documentation tests…

The problems start kicking in

All this testing is a great thing in general, but it has a major drawback: our CI builds sometimes take up to an hour on the longer jobs. This is not ideal, and since our CI provider did not scale adequately for our case, we had to look for a solution ourselves.

Obligatory xkcd: "Compiling" (https://xkcd.com/303/)

Giving Ccache & Distcc a spin

We were already using Ccache, a great, simple tool for caching build artifacts, in our build flow. Ccache has helped us a lot, but it isn't enough on its own to keep build times acceptably low, so we decided to try distcc as a possible mid-term solution. For those unfamiliar with it, distcc is a tool that distributes the compilation of C/C++ code across different machines.

Our goal was to trigger builds on our local machines that would compile on the build farm and then cache the results, so that other developers, or even the CI platform itself, could use the cache to speed up their build process. Distcc can be set up to use ssh to communicate with the build server, minimizing security risks.

Ccache+distcc setup overview

With this setup, there are three possible build scenarios:
1. The cache entry is not available in either the local or the remote Ccache
2. The cache entry is not available locally but is available on the build server
3. The cache entry is available locally, so the remote build server is not queried at all

We'll walk through each of these scenarios below. Let's get started!


Initial setup

On the build client side, to use distcc and Ccache together, we use Ccache's CCACHE_PREFIX setting, which prepends the given command (distcc in this case) to the actual compiler invocation. This is done with the following line in our build script:

export CCACHE_PREFIX=distcc
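
Putting the client side together, a minimal sketch could look like the following (the ccache wrapper paths are the CentOS-style ones used later in this post; adjust them for your distribution):

export CCACHE_PREFIX=distcc          # hand every cache miss to distcc
export CC=/usr/lib64/ccache/gcc      # route compiles through the ccache wrappers
export CXX=/usr/lib64/ccache/g++
make -j"$(distcc -j)"                # distcc -j prints the job slots available in DISTCC_HOSTS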

There are two ways for the build client to connect to the build server with distcc: a direct TCP connection (usually on port 3632), or SSH (in which the distcc client transparently connects over SSH and executes a distcc command on the remote end). Direct TCP connections are faster, but they are neither authenticated nor encrypted, so they should only be used on private, trusted networks; otherwise SSH should be used. Our setup uses SSH for the communication.
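
The mode is selected through the DISTCC_HOSTS syntax; a quick sketch, with example addresses:

# Direct TCP mode (default port 3632): host/limit - fast, but unauthenticated
export DISTCC_HOSTS="10.22.66.101/16"
# SSH mode: a leading @ tells the distcc client to tunnel over ssh
export DISTCC_HOSTS="@10.22.66.101/16"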

On the build server side, we also want distcc to save compilation results to its cache, so that identical builds are not repeated when avoidable. This is done with the DISTCC_CMDLIST variable: we create a file listing the compilers to use, specifying the Ccache wrappers:

[vagrant@buildserver1 ~]$ cat /home/vagrant/distcc_cmdlist.cfg
/usr/lib64/ccache/c++
/usr/lib64/ccache/cc
/usr/lib64/ccache/g++
/usr/lib64/ccache/gcc

Once this file is created, we need to set an environment variable pointing to this file:

export DISTCC_CMDLIST=/home/vagrant/distcc_cmdlist.cfg

The easiest way to do this is to add this line to the .bashrc file of the distcc user on the build server. An alternative is to enable the PermitUserEnvironment setting for SSH (you can check the sshd_config man page for details, but beware, enabling this option does pose a security risk).
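
For reference, the PermitUserEnvironment route would look roughly like this (a sketch; read the security notes in the sshd_config man page before enabling it):

# On the build server, in /etc/ssh/sshd_config:
PermitUserEnvironment yes

# Then in ~/.ssh/environment of the user distcc connects as:
DISTCC_CMDLIST=/home/vagrant/distcc_cmdlist.cfg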

We have provided a Vagrantfile that uses Ansible to set up the remote building configuration. First, install those two tools, then clone the repository and start the Vagrant provisioning:

git clone https://github.com/zaleos/post-ccache-distcc
cd post-ccache-distcc
vagrant up
Clone repo & initialize Vagrant boxes

Build path 1 - Cache is not available on client or server

This scenario will happen the first time we run a build for a given set of changes.

Compilation times:

[vagrant@buildclient src]$ time ./build-remote.sh
...

real	0m8.361s
user	0m0.553s
sys	0m0.368s

distccmon-text output (note that this command only outputs the state of distcc at the exact time the command is launched):

[vagrant@buildclient src]$ distccmon-text
 47025  Compile     main.cpp                 10.22.66.101[0]

Local Ccache:

[vagrant@buildclient src]$ ccache -s
cache directory                     /home/vagrant/.ccache
primary config                      /home/vagrant/.ccache/ccache.conf
secondary config      (readonly)    /etc/ccache.conf
stats updated                       Mon Oct 18 13:31:57 2021
stats zeroed                        Mon Oct 18 13:31:49 2021
cache hit (direct)                     0
cache hit (preprocessed)               0
cache miss                             4
cache hit rate                      0.00 %
called for link                        5
no input file                          2
cleanups performed                     0
files in cache                        10
cache size                          57.3 kB
max cache size                       5.0 GB

Remote Ccache:

[vagrant@buildserver1 ~]$ ccache -s
cache directory                     /home/vagrant/.ccache
primary config                      /home/vagrant/.ccache/ccache.conf
secondary config      (readonly)    /etc/ccache.conf
stats updated                       Mon Oct 18 13:37:43 2021
stats zeroed                        Mon Oct 18 13:37:06 2021
cache hit (direct)                     0
cache hit (preprocessed)               0
cache miss                             4
cache hit rate                      0.00 %
cleanups performed                     0
files in cache                         8
cache size                          32.8 kB
max cache size                       5.0 GB

From this, we can see that the cache hit rate on both the client and the build server is 0% - this is expected, as this build has never been triggered before.

Build path 2 - Cache is not available on the client but is on the server

This scenario may happen when we have already built a changeset and we have cleared the local cache (for space requirements, for example), or if some other developer starts working on the project.

For our tests, we have to clear the build client's local Ccache manually before executing the build script, to trigger this scenario:

[vagrant@buildclient src]$ ccache -Ccz
Cleared cache
Cleaned cache
Statistics zeroed
[vagrant@buildclient src]$ time ./build-remote.sh 
...

real	0m7.600s
user	0m0.532s
sys	0m0.345s
[vagrant@buildclient src]$ 

Local Ccache:

[vagrant@buildclient src]$ ccache -s
cache directory                     /home/vagrant/.ccache
primary config                      /home/vagrant/.ccache/ccache.conf
secondary config      (readonly)    /etc/ccache.conf
stats updated                       Mon Oct 18 13:42:15 2021
stats zeroed                        Mon Oct 18 13:40:59 2021
cache hit (direct)                     0
cache hit (preprocessed)               0
cache miss                             4
cache hit rate                      0.00 %
called for link                        5
no input file                          2
cleanups performed                     0
files in cache                        10
cache size                          57.3 kB
max cache size                       5.0 GB

Remote Ccache:

[vagrant@buildserver1 ~]$ ccache -s
cache directory                     /home/vagrant/.ccache
primary config                      /home/vagrant/.ccache/ccache.conf
secondary config      (readonly)    /etc/ccache.conf
stats updated                       Mon Oct 18 13:42:14 2021
stats zeroed                        Mon Oct 18 13:37:06 2021
cache hit (direct)                     0
cache hit (preprocessed)               4
cache miss                             4
cache hit rate                     50.00 %
cleanups performed                     0
files in cache                        10
cache size                          41.0 kB
max cache size                       5.0 GB

These results are slightly better than in the first scenario, and for larger projects the difference should be greater. Note that the remote Ccache hit rate has gone up to 50% in this case: the 4 cache misses from the first build and the 4 cache hits from this second build combine to give 4 / (4 + 4) = 50%.

Build path 3 - Cache is available on the client

This scenario will happen when a given changeset has already been built and the results are stored in the local Ccache.

The output will be similar to:

[vagrant@buildclient src]$ time ./build-remote.sh 
...

real	0m3.334s
user	0m0.322s
sys	0m0.284s

Local Ccache:

[vagrant@buildclient src]$ ccache -s
cache directory                     /home/vagrant/.ccache
primary config                      /home/vagrant/.ccache/ccache.conf
secondary config      (readonly)    /etc/ccache.conf
stats updated                       Mon Oct 18 13:45:42 2021
stats zeroed                        Mon Oct 18 13:40:59 2021
cache hit (direct)                     4
cache hit (preprocessed)               0
cache miss                             4
cache hit rate                     50.00 %
called for link                       10
no input file                          4
cleanups performed                     0
files in cache                        10
cache size                          57.3 kB
max cache size                       5.0 GB

Remote Ccache:

[vagrant@buildserver1 ~]$ ccache -s
cache directory                     /home/vagrant/.ccache
primary config                      /home/vagrant/.ccache/ccache.conf
secondary config      (readonly)    /etc/ccache.conf
stats updated                       Mon Oct 18 13:42:14 2021
stats zeroed                        Mon Oct 18 13:37:06 2021
cache hit (direct)                     0
cache hit (preprocessed)               4
cache miss                             4
cache hit rate                     50.00 %
cleanups performed                     0
files in cache                        10
cache size                          41.0 kB
max cache size                       5.0 GB

As you can see from the output, when the local Ccache is available, the build is faster. Also note that the local Ccache hit rate has gone up, while the remote Ccache stats remain constant (the build client has not needed to send any requests to the build server). Again, for larger projects, this really pays off.

Other interesting notes

Checking the build & cache activity

Some useful commands can be used to monitor the status of the distcc client and Ccache:

Show state of remote compilation jobs - this should be run on the build client

[vagrant@buildclient src]$ watch -n 0.5 distccmon-text

Show the state of Ccache usage - can be run on the build client or the build server

[vagrant@buildserver1 ~]$ watch -n 0.5 ccache -s

Additional build servers

If you want to test multiple build servers, you can edit the Vagrantfile to increment the build server counter in this line:

NUM_BUILD_SERVERS = 1 # Add more build servers by increasing this number

And also uncomment and/or add more lines in the build-remote.sh script file:

#init_distcc_build_server 10.22.66.102 16 # buildserver2

A note regarding this function: init_distcc_build_server populates the DISTCC_HOSTS environment variable with the ssh address of the build server passed to it, and adds the remote server's fingerprint to the list of known hosts. The second parameter is the maximum number of concurrent build tasks you want to send to that server. DISTCC_HOSTS is prefixed with a --randomize argument, which makes distcc select a host randomly to send each build task to. Unfortunately, we haven't found a way to dynamically send more or fewer tasks depending on the remote machine's build load, so a busy server quickly runs out of resources. A sketch of this logic is shown below.
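
For illustration, such a function could look roughly like this (a sketch only; the actual implementation lives in the repository's build-remote.sh):

init_distcc_build_server() {
    local host="$1"      # build server address, e.g. 10.22.66.102
    local max_jobs="$2"  # max concurrent tasks to send to this server
    # Trust the server's host key so the ssh connection is non-interactive
    ssh-keyscan -H "$host" >> ~/.ssh/known_hosts 2>/dev/null
    # A leading @ selects ssh mode; /N caps concurrent jobs on this host
    DISTCC_HOSTS="${DISTCC_HOSTS} @${host}/${max_jobs}"
    export DISTCC_HOSTS
}
# After registering all servers, prefix the host list once:
export DISTCC_HOSTS="--randomize ${DISTCC_HOSTS}"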

Additional distcc configuration

If for some reason distcc isn't being called on the build server, here are some environment variables that can be set on the build client to disable local compilation and enable verbose debugging messages:

export DISTCC_VERBOSE=1          # Enable verbose debug logs
export DISTCC_FALLBACK=0         # Disable local compilation fallback
export DISTCC_SKIP_LOCAL_RETRY=1 # Disable local compilation retry

Conclusion

Our initial impression of distcc was very positive, and it did allow us to use lower-end build clients to achieve results similar to those of our higher-end setups. Unfortunately, when we did some further testing, making different changes in our codebase and triggering several simultaneous builds to simulate what would be our daily workload (developers + Continuous Integration), our test build server ran out of RAM very quickly:

htop output of the build server while simulating a production scenario

Some of the issues that we have had with this setup are:

  • There is no deduplication of the build jobs. If identical build jobs are submitted simultaneously, they will all be processed, even if only one compilation is required.
  • Rapidly running out of resources on the build server. When many jobs are submitted by different clients simultaneously, there is no queuing mechanism, so they are all launched at once.
  • Only the compilation step is distributed. Preprocessing and linking happen on the local machine. Note that we're using the direct mode of Ccache, not the preprocessor mode. Distcc does have a way to distribute preprocessing tasks (pump mode), but unfortunately it's not compatible with Ccache; see the sketch after this list.
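
For completeness, pump mode is normally enabled by wrapping the build in distcc's pump script (an illustrative invocation; remember this cannot be combined with Ccache):

# Distribute preprocessing as well by wrapping the build with pump.
# Note: no ccache here - pump mode and Ccache don't mix.
pump make -j16 CC="distcc gcc" CXX="distcc g++"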

We could have scaled horizontally at this point, adding more build servers to spread the build load; services such as clouding.io can be used to spin up cloud servers for this type of task. However, much of the benefit of caching the build artifacts would be lost, because the cache is not shared between the build servers, and sharing a cache might be difficult and was out of the scope of this work, so we discarded this option.

What’s next?

Distcc may be good for some scenarios, like when fewer concurrent builds are required, but for our current use case, it isn’t the best fit.

This is an ongoing task, but the next step we are going to take is to explore Google's build tool Bazel (https://bazel.build), which looks very promising indeed, although it may require more upfront work to set up in our projects. Bazel uses the Remote Execution API, a standard API for spreading builds across multiple hosts, and it could also bring build consistency across all our products.


Cover image credit goes to Fabio