Implementing a Tor relay from scratch

The last few weeks I have been spending my free time on implementing a Tor OR (Onion Relay) in the Go language (golang.org). After a while I realized the language was not suited for the project, and stopped working on it. Today I am sharing what I learned, and will release the partially working code on github.com/tvdw/gotor.

In this post I will refer to the network protocol as ‘Tor’, the original implementation as ‘CTor’, and my implementation as ‘GoTor’, in an attempt to reduce confusion between the different versions. For consistency all network speeds are noted as a combined upload plus download, so divide by two where you think it is relevant.

Tor

Tor is a peer-to-peer network protocol that provides privacy to its users. It does so by encrypting the data several times, and sending it through a path (‘circuit’) of multiple servers before it reaches the final destination. Each layer of encryption can only be decrypted by the server it is intended for, so that any node in the circuit can only ever know its immediate neighbors. torproject.org has a better explanation of the network in case you’re interested.

The network has a specification document, but carefully following it will result in an implementation that does not work (this has now been corrected). This posed some challenges, but luckily the CTor source code is relatively easy to read.

The goal

For a while I have been running Tor nodes using CTor. In November 2014 those nodes relayed a total of about 400 terabytes, which is an average of roughly 1200 Mbit per second. Each machine has cheap hardware to push the costs down, but the lack of AES-NI instructions on the CPUs cause a significant slowdown. CTor’s implementation is essentially single-threaded, with only a very small amount of work being done in background threads, so the slowdown of having cheap hardware is immediately noticeable in throughput.

To maximize the amount of relayed data, it is normal to simply run multiple instances of the program, up to two per IP address. Running multiple instances of CTor on the same computer causes inefficiencies, such as needing more connections to other servers and trouble in balancing network throughput, so I set out to write a Tor relay in a way that uses all CPU cores. My final goal was to beat the Tor speed record, which was at roughly 200 megabytes per second.

The path

I started by just reading the specification documents, and quickly understood the basics of the network. I then decided on a language: for a while a colleague had been pushing me to write Go, so I picked Go. It took me about a day to pick up the basics and I got started.

First up: the TLS code. Instead of using OpenSSL, Go has its own TLS implementation called “crypto/tls“, apparently because agl__, one of the people working on the language, decided so. Lots of features are missing, from DH key exchange to extracting the ClientHello message. Next to that, a quick performance benchmark showed that this implementation was very slow compared to popular OpenSSL bindings for Go. So I went with OpenSSL. While I had to fork the code to add some extra bindings, it was mostly painless.

Then came the Tor handshake. There are four versions of the link protocol, and the specification requires implementing at least the first one, which is a real pain. Instead of that, I implemented the fourth handshake, and sent a mail to the Tor developers requesting to drop support for the first and second versions. They agreed, and hopefully this week I will be finishing the patch for that.

So up to this point everything was great. I had spent only a few days and even though both the Tor protocol and the Go language were new to me, I had made significant progress. My local TorBrowser was able to connect to the relay and I saw the commands it was sending while trying to establish a circuit to the next server.

As soon as I started implementing the circuit logic, the first implementation issues started appearing. Each connection had its own buffered channel (a queue) and a circuit would be a bridge between the channels, but when both ends of the circuit would be writing to the other end simultaneously, and both channels were full, the code would deadlock. I mitigated this by just increasing the buffer, and planned to refactor the entire system later on.

Fast-forward a bit to the moment I had written enough to have a nearly full implementation of the relay. Some minor features were still missing, but it was stable enough to put a node in the network, which I did. It very slowly started building consensus (it took more than a week to get any weight assigned at all) but data was flowing nicely and it sometimes hit a few megabit per second. A few fixes had to be done but I was happy.

To get the testing up to speed, I moved the configuration and keys from a node that had been running CTor for months, and had my node run it. Traffic quickly ramped up to 500 Mbit/s, which was already more than a single CTor node could handle, but it didn’t really grow much more than that.

memory-day

On top of that, memory usage grew very quickly every time I started the program. I had to restart the node twice a day or it would start swapping. The growth was about 6 megabytes per minute on average, which is a lot less than the node relayed, but definitely suggested that there was a memory leak somewhere.

I mentioned the fast relay on the Tor IRC, and a fellow relay operator TheCthulhu offered to help with some testing hardware. He set up a server with my relay and within a few days we had broken the Tor speed record with a nice 250 megabytes per second, effectively maxing out the network link. CPU usage was at a nice 60% across 12 cores. But his relay also suffered from the memory issues and had to be restarted every few days.

Meanwhile my own relay was still at only 500 Mbit/s, which was a quarter of the server’s capacity. After profiling the code and looking at some graphs, I realized that this was because of Go internals. I’ll talk more about this in a bit.

Then there was the overhead of using OpenSSL. During initial benchmarking it looked minimal, but it turns out Go assumes the worst of all C code and the language will refuse to run C code inside Go code. Instead, it will allocate an entirely new stack just for C calls, and tell the scheduler to create a new thread in case the C code hangs. If 80% of the CPU time spent in your program is on C calls, your Go program won’t hesitate to run 500 OS threads.

In the end I decided to drop the project. It was a fun learning process, but the Go language made the final product too slow. By sharing this blog post I hope that others will learn about these and maybe some day adjust the language.

Lessons learned

  • [cpu] Go cannot call C functions directly. Instead, with cgo it creates an entirely new stack and runs the C code from there, while assuming the code will block for locks or I/O. This causes a massive buildup of OS threads;
  • [cpu] To prevent the buildup of OS threads caused by cgo, you can lock concurrent calls using a lock or buffered channel. However, this causes a lot of futex system calls, slowing down the process;
  • [mem] For every OS thread, Go has its own memory allocation cache. This is good, as it avoids locks, but when combining this with all the threads spawned by cgo you will see quite a bit of memory overhead (this claim has been disputed in the comments, it’s possible I misread something in the Go source);
  • [mem] In Go’s I/O model, every connection must have its own goroutine with its own read buffer. Assuming normal TLS records of 16K and a goroutine stack size of 4K, this means that with only 10.000 active connections you will already be using 200 megabytes just because of these buffers;
  • [mem] Go’s default channel implementation uses an array and a mutex to control access. This means that a buffered queue with a capacity of 2000 slices will always be using 48KB, even if it’s empty. However, to be fair, this could be mitigated by using an alternate channel implementation that uses a linked list;

In the end, I concluded that while Go is an amazing language, it’s not that good if there is C code in performance critical loops, if you have to deal with lots of connections, or if you have a lot of data flowing between goroutines. For most projects these restrictions are not an issue at all, but in my case I hit all three.

Lots of thanks to everyone who helped me during this short project. It was a lot of fun, and I am looking forward to working on CTor.

Screen Shot 2015-01-24 at 22.12.29

14 thoughts on “Implementing a Tor relay from scratch”

    1. I’m not aware of anything called ggcgo. If you are talking about gccgo, then no, I don’t think it’ll make a difference at all. The runtime is essentially the same, and will have roughly the same characteristics.

  1. I would love to see the benchmarks you did for “crypto/tls” vs the openSSL bindings. It seems a lot of the issues you mentioned would be alleviated by patching “crypto/tls” a bit.

    1. While I didn’t spend a lot of time benchmarking this, it was a massive speed difference due to the lack of AES-NI on the chipsets I focused on: “crypto/aes” is super slow even compared to calling OpenSSL using cgo (when AES-NI is available, performance is roughly the same). iirc it was about five to ten times slower.

  2. > [cpu] Go cannot call C functions directly. Instead, with cgo it creates an entirely new stack and runs the C code from there, while assuming the code will block for locks or I/O. This causes a massive buildup of OS threads;
    > [cpu] To prevent the buildup of OS threads caused by cgo, you can lock concurrent calls using a lock or buffered channel. However, this causes a lot of futex system calls, slowing down the process;

    There is only one stack per thread. New stack is not created per cgo call.
    Both channels and mutexes do not have futex calls, they are cooperative. Futex calls are caused by something else.

    [mem] For every OS thread, Go has its own memory allocation cache. This is good, as it avoids locks, but when combining this with all the threads spawned by cgo you will see quite a bit of memory overhead;

    This is not true. Please remove this statement from the post. Such false statements spread very quickly.
    Go uses allocate cache per GOMAXPROC. Number of caches it not affected by number of threads.
    This is way more efficient both in terms of speed and memory consumption than what you get in C.

    [mem] In Go’s I/O model, every connection must have its own goroutine with its own read buffer. Assuming normal TLS records of 16K and a goroutine stack size of 4K, this means that with only 10.000 active connections you will already be using 200 megabytes just because of these buffers;

    This is true. But you can issue a 1-byte read first. And create a larger buffer only when you know there is data available.

    [mem] Go’s default channel implementation uses an array and a mutex to control access. This means that a buffered queue with a capacity of 2000 slices will always be using 48KB, even if it’s empty. However, to be fair, this could be mitigated by using an alternate channel implementation that uses a linked list;

    This is true. But this is a bad usage of channels. In a producer-consumer system buffer size needs to be only large enough to keep all consumers busy while producers prepare more data. Something like 4*GOMAXPROCS is usually enough.

  3. It really all depends on how you write Go code. I’m sure if most of Dimitry’s suggestions are implemented you’ll get very close to the C version (minus 150.000 lines of code compared to the C version).

Leave a Reply

Your email address will not be published.