Saving Tor bandwidth using merkle trees

When you start any Tor application, like the popular Tor Browser, it first needs to “connect to the network”. In this phase Tor will contact a directory in the network and download a document that is referred to as “the consensus”: a document that gets updated hourly, listing all relays on the network, currently roughly a megabyte in size. Tor needs to know which relays are in the network, or it would not be able to create paths with multiple hops. The way this works is that Tor will connect to some relay on the network, then tell that relay to connect to another relay, et cetera, using the information in the consensus to find a suitable path.

But does it really need to know about all relays on the network, or can we perhaps split the document into smaller pieces and have the client use only a subset? This approach is sometimes referred to as “directory partitioning”. There has been quite a bit of research into it, and the conclusion is consistently that it is a bad idea, mostly because the specific set of relays a client uses reveals information about that client. Solving the underlying problem still matters: consensus distribution currently accounts for roughly 2% of all Tor bandwidth, and that share can easily grow to 50% if many more people start using Tor.

Another solution that somewhat addresses this problem is Torsk, an alternative directory protocol. Torsk meets most of the requirements for the directory problem in Tor, but it moves critical bits of trust to single entities, which can be a problem for an adversary-resistant network like Tor.

A while ago I was thinking about alternative ways of solving this problem and came up with one. It does not solve everything either, but through this blog post I hope to find the right people who can solve this problem for good.

The idea

The idea I would like to propose is to have the directories on the network do the path selection, instead of the client.

We can achieve this by having the client simply tell the directory “give me a random relay”. This is obviously insecure, as the directory could easily influence the selection by returning a relay it controls, so we guard against this with a special query: together with the request for a random relay, the client also sends a randomly generated number, which on the directory side can correspond to only one relay.

For this to work, we have to define a scheme that resolves any random number in a certain range to a relay. This is easy: we take all relays on the network and scale their bandwidth estimates so that their sum equals the maximum value of our random number. Now all we need to do is assign a start and end position to every relay, and we have a number->relay mapping (we can do a lookup on this mapping in O(log(N)) time using a binary search).
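As a minimal sketch of that mapping, assuming relay records are simply (nickname, scaled bandwidth) pairs and using Python purely for illustration:

import bisect

# Illustrative relay records; the real ones carry far more fields.
relays = [
    ("TerokNor", 43800),
    ("IPredator", 143000),
]

# Assign each relay a half-open interval [start, end) by accumulating the
# scaled bandwidth values; the final end equals the bandwidth sum (186800).
ends = []
total = 0
for _, bandwidth in relays:
    total += bandwidth
    ends.append(total)

def lookup(number):
    """Map a number in [0, total) to a relay using a binary search."""
    return relays[bisect.bisect_right(ends, number)]

print(lookup(82475))   # ('IPredator', 143000), matching the query example further down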

Of course this process runs on the directory, not on the client, so the client still has no idea whether the directory answered truthfully.

To secure all this, our process starts at the directory authorities, the (currently 9) special Tor directories that generate the consensus documents every hour. We split the list of relays out of the consensus, so that clients do not need to download it, and sign this separate document by including its hash in the consensus. We also generate a merkle tree over this set of relays, including their start and end positions, and include the root hash of this tree in the consensus. Clients only download the consensus; relays on the network also download the list of relays, which they can verify against the hash published in the signed consensus.
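To make the tree construction concrete, here is a small sketch under assumptions of my own: sha256 as the hash, hex digests concatenated as strings, and odd nodes at the end of a level promoted unchanged; a real design would have to pin these choices down.

import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def build_levels(leaf_hashes):
    """Build the merkle tree bottom-up and return every level, leaves first."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for i in range(0, len(prev), 2):
            if i + 1 < len(prev):
                nxt.append(h((prev[i] + prev[i + 1]).encode()))
            else:
                nxt.append(prev[i])   # odd node, promoted unchanged
        levels.append(nxt)
    return levels

# A leaf is the hash of the relay line plus its start/end positions
# (an illustrative format, not the real descriptor syntax).
leaves = [
    h(b"relay TerokNor ip=5.79.68.161:443 bandwidth=43800 start=0 end=43800"),
    h(b"relay IPredator ip=197.231.221.211:9001 bandwidth=143000 start=43800 end=186800"),
]
root = build_levels(leaves)[-1][0]   # this root hash goes into the consensus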

Going back to our query, where the client sends a random number and the directory knows which relay that maps to, we now have the ability to protect against malicious directories: in its answer, the directory returns the relay's information plus the hashes of the siblings in the merkle tree, so the client can recompute the merkle tree root hash and match it against the hash from the consensus. Since there can be multiple consensuses active at any time, the client also includes in its query which consensus it wants a relay for.
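On the directory side, answering such a query amounts to walking from the relay's leaf up to the root and collecting the sibling at every level. A sketch, reusing build_levels() from above; the (side, hash) encoding is an assumption of this post, not an existing wire format:

def prove(levels, index):
    """Collect the sibling hashes needed to rebuild the root from one leaf.

    levels is the per-level list returned by build_levels() (leaves first,
    root last) and index is the position of the relay's leaf; nodes that
    were promoted without a sibling contribute nothing to the proof.
    """
    proof = []
    for level in levels[:-1]:
        if index % 2 == 0:
            if index + 1 < len(level):
                proof.append(("right", level[index + 1]))
        else:
            proof.append(("left", level[index - 1]))
        index //= 2
    return proof

With only two relays in the tree, prove(levels, 1) returns a single ("left", ...) entry: the hash of the other leaf, which is exactly what the directory sends back along with the relay's information in the example below.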

Example

I figured it may be a lot easier to understand this concept with an example of how I think it can work. There are two pieces to this puzzle: the consensus creation and the query.

During consensus creation, instead of the document we create now, we publish a list of relays:

relay TerokNor ip=5.79.68.161:443,[2001:1af8:4700:a012:1::1]:443 bandwidth=43800
relay IPredator ip=197.231.221.211:9001 bandwidth=143000
[...]

After publishing, we add start/end pairs to each relay. In the example above TerokNor could get start=0,end=43800 and IPredator could get start=43800,end=186800. We hash each relay's information together with its start/end pair, and create a merkle tree:

      07513d
      /    \
 5e0cb4    3f0c27
TerokNor  IPredator

Our consensus consists of the root hash (07513d), the bandwidth sum (186800), the directory authority signatures, some unique identifier (which can also be the hash of the document), and validity information:

consensus-id 32b92d09b5
valid-from 2016-04-01 00:00:00
valid-until 2016-04-01 01:00:00
merkle-tree-root 07513d
bandwidth-sum 186800
signature moria1 74kLVzMD
signature gabelmoo Cq4Sa4hn

All directories on the network download both documents, but clients only download the consensus, which is now a small document of essentially fixed size.

When a client wants to make a three-hop connection through the Tor network, it starts by connecting to its guard. Once connected, it sends a query that includes the consensus identifier and a random number between 0 and the bandwidth sum:

find 32b92d09b5 82475

The directory looks up that consensus, finds which relay corresponds to the random number, and collects the hashes the client needs to verify the result. It then returns this to the client:

found-relay IPredator ip=197.231.221.211:9001 bandwidth=143000
hash 5e0cb4

Note that the included hash is the hash of the sibling in the merkle tree, not the hash of the relay. Once the client receives this information, it can compute the hash of the relay itself, and by combining that with the hash values the directory included it can compute the root hash of the merkle tree and confirm that the directory did not lie.
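Client-side verification could then look roughly like the following; the leaf format, the (side, hash) proof encoding and the use of sha256 are carried over from the earlier sketches, and I am assuming the answer also carries the relay's start/end positions (which the example response above omits), so the client can check that its random number actually falls inside that range:

import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def verify(relay_line, start, end, number, proof, expected_root):
    """Recompute the merkle root from the returned relay and its sibling hashes."""
    # The directory must return the relay whose interval contains our number.
    if not (start <= number < end):
        return False
    node = h(f"{relay_line} start={start} end={end}".encode())
    for side, sibling in proof:
        # Each sibling hash is concatenated on the side the directory reported.
        node = h((sibling + node).encode()) if side == "left" else h((node + sibling).encode())
    return node == expected_root

For the response above, proof would be the single entry ("left", "5e0cb4...") and expected_root would be the merkle-tree-root value from the consensus the client already holds.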

On performance

There are a bunch of performance advantages to this scheme, starting with the obvious reduction in bandwidth: instead of downloading the full set of relays every hour, a client now only downloads two relays per circuit (middle + exit) plus some overhead for the hashes. Since the average client on the network is not making a thousand circuits per hour, this is a big win not only in global network bandwidth but also in Tor startup time.
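To put a rough number on that win, a back-of-the-envelope with assumed sizes (about 200 bytes per relay record, 32-byte hashes, roughly 10,000 relays, and a full consensus of about a megabyte today):

import math

relay_count = 10_000
proof_hashes = math.ceil(math.log2(relay_count))   # ~14 sibling hashes per relay
per_relay = 200 + proof_hashes * 32                 # relay record plus its proof
per_circuit = 2 * per_relay                         # middle + exit
print(100 * per_circuit)                            # ~130 KB at a generous 100 circuits per hour

Even a client building an unusually large number of circuits would stay well under the megabyte it downloads every hour today.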

On the directory side, including the directory authorities, the merkle tree scheme is low-cost: it can be computed quickly in linear time, and storing a sha256 merkle tree for ten thousand relays takes less than a megabyte (roughly 2N hashes of 32 bytes each, around 640 KB for N = 10,000).

Conclusion

In this post I wanted to show that an adversary-resistant network such as Tor is possible without requiring clients to be aware of the full network. A lot of details and optimizations are left out for the purpose of making it easy to understand, and this post is not meant as a proposal to change Tor. If the Tor community is interested in a proposal, I will be happy to write one, including a lot more details and optimizations.