Holepunching

As you probably know, iroh is in the business of holepunching, and gives you a QUIC connection on top. The typical scenario is establishing a direct connection between two devices, like laptops or phones, that sit on different home networks. Home networks tend to have a NAT router in front of them, and even with IPv6 the router tends to block new incoming connections in the same fashion as a NAT router would. To be fair, blocking random incoming connections to a home network is a sensible choice.

The simplified theory of how UDP holepunching works is that both endpoints send a packet to each other at the same time. Each router sees its own outgoing datagram first, so when the incoming datagram arrives it is considered to be part of the same connection and is allowed in. To achieve this in practice you need two things:

A means of communicating the coordination. Iroh uses the relay server as a network path between the two endpoints for this. We explained this in more detail in the iroh on QUIC Multipath post.

The address the other endpoint's NAT router is going to be using, because this is where you have to send your holepunching datagrams.

The second part is often called "address discovery", and it seems an impossible task. How are we supposed to predict how a random router on the internet is going to behave?
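The simplified theory above can be sketched as a toy simulation. This is only an illustration, not how iroh or a real router is implemented: the `Router` class and `holepunch` function are made-up names, the addresses come from documentation ranges, and each router is assumed to keep a stable external address.

```python
class Router:
    """Toy NAT/firewall: inbound traffic passes only from remotes we already sent to."""

    def __init__(self, external_addr):
        self.external_addr = external_addr
        self.contacted = set()  # remote addresses we have sent a datagram to

    def send(self, remote_addr):
        # An outgoing datagram opens the "hole" for traffic from remote_addr.
        self.contacted.add(remote_addr)

    def receive(self, remote_addr):
        # An inbound datagram is delivered only if the hole is already open.
        return remote_addr in self.contacted


def holepunch(router_a, router_b):
    """Both sides send at the same time; report which datagrams were delivered."""
    # Both datagrams leave their own router before either one arrives.
    router_a.send(router_b.external_addr)
    router_b.send(router_a.external_addr)
    a_to_b = router_b.receive(router_a.external_addr)
    b_to_a = router_a.receive(router_b.external_addr)
    return a_to_b, b_to_a


addr_a = ("203.0.113.10", 40000)
addr_b = ("198.51.100.20", 50000)

# An unsolicited datagram from A is dropped by a router that never sent to A.
unsolicited = Router(addr_b).receive(addr_a)
punched = holepunch(Router(addr_a), Router(addr_b))

print(unsolicited)  # False
print(punched)      # (True, True)
```

Note that both sides already need to know the other's external address before sending, which is exactly the address discovery problem.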

NAT Types

NAT routers have existed for a very long time, and as the world tried to understand them many words have been wasted classifying and naming them. It's a confusing mess. RFC 4787 can be used as a jumping-off point to explore the bewildering number of references to older RFCs as well as updates to it. Practical people today mostly classify NATs into two types, however:

Destination Endpoint Independent
Destination Endpoint Dependent

What does this mean? A NAT router's job is to map an internal IP + port to an external IP + port. When a new connection is created from inside the network, the endpoint decides on the source IP + port. The NAT router then creates a mapping and sends the datagram from some external IP address and port. Incoming datagrams to this external IP + port are then looked up in the mapping table and delivered back to the original source IP + port of the endpoint.

For a Destination Endpoint Independent mapping the mapping is very simple: each unique source IP + port pair is mapped to one external IP + port pair, independently of the destination IP + port of the datagram. That means a single source IP + port can send datagrams to many destinations on the internet, and they will all share the same external IP + port on the NAT router. This is very convenient for holepunching.

For a Destination Endpoint Dependent mapping there could be several variations. However, a home router typically has only one external IP address, so only the external port can change. The NAT router can then pick a new port for each destination, even if the source IP + port remains the same.
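The difference between the two mapping types comes down to what the router uses as the key of its mapping table. Here is a minimal sketch, assuming a router with a single external IP that hands out ports sequentially; the `Nat` class and all addresses are invented for illustration, and real routers pick ports in many different ways.

```python
import itertools


class Nat:
    """Toy NAT with a single external IP; only the external port varies."""

    def __init__(self, external_ip, endpoint_dependent, first_port=40000):
        self.external_ip = external_ip
        self.endpoint_dependent = endpoint_dependent
        self._ports = itertools.count(first_port)
        self._mappings = {}

    def map_outgoing(self, source, destination):
        """Return the external (ip, port) used for a datagram from source to destination."""
        # Endpoint independent: the key is the source only.
        # Endpoint dependent: the key is the source and destination together.
        key = (source, destination) if self.endpoint_dependent else source
        if key not in self._mappings:
            self._mappings[key] = (self.external_ip, next(self._ports))
        return self._mappings[key]


source = ("192.168.1.23", 5000)
server_1 = ("192.0.2.1", 443)
server_2 = ("192.0.2.2", 443)

independent = Nat("203.0.113.10", endpoint_dependent=False)
dependent = Nat("203.0.113.10", endpoint_dependent=True)

same_for_independent = (
    independent.map_outgoing(source, server_1) == independent.map_outgoing(source, server_2)
)
same_for_dependent = (
    dependent.map_outgoing(source, server_1) == dependent.map_outgoing(source, server_2)
)

print(same_for_independent)  # True
print(same_for_dependent)    # False
```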

Now think back to holepunching: you need to know the external IP + port the NAT router will map to in order to send the holepunching datagrams to each other at the same time. With Destination Endpoint Independent NAT you can use the information from another connection for this. Destination Endpoint Dependent NAT, however, makes this much harder. There are still tricks you can do, but iroh does not support them yet.

Reflexive Transport Address

This brings us to the fancy term "Reflexive Transport Address". Imagine you are a server sitting on the internet and you receive some datagrams from an endpoint behind a NAT router. The IP header of the received datagram will contain the source IP address, while the UDP header will contain the source port number. The IP + port the server will see is the external IP + port of the mapping the NAT router makes. To send a response you'd send a datagram addressed to this IP + port.

In other words, the source IP + port the server observes is the address it sends responses to. Thus you can build a server that informs a client endpoint about the client's address as observed by the server. To the client this is the Reflexive Transport Address.
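Such a server can be sketched in a few lines with plain UDP sockets. This is a toy text protocol invented for illustration, not STUN and not anything iroh speaks. It runs over loopback, where there is no NAT, so the reported address simply equals the client's local address.

```python
import socket


def serve_one(server_sock):
    """Answer a single datagram with the sender's address as seen by the server."""
    _data, observed = server_sock.recvfrom(1500)
    ip, port = observed[0], observed[1]
    # The observed source address is also where the response has to go.
    server_sock.sendto(f"{ip}:{port}".encode(), observed)


server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.settimeout(2.0)

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.bind(("127.0.0.1", 0))
client.settimeout(2.0)
local_addr = client.getsockname()

# The datagram is queued by the kernel, so one thread can play both roles.
client.sendto(b"what is my address?", server.getsockname())
serve_one(server)
reply, _ = client.recvfrom(1500)

ip, port = reply.decode().rsplit(":", 1)
reflexive = (ip, int(port))

server.close()
client.close()

# No NAT on loopback, so the reflexive address equals the local address.
print(reflexive == local_addr)  # True
```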

If the client is behind a NAT router this will be a different address than the one the client itself is sending from, so a client can use this to detect whether it is behind a NAT. A client can go even further and ask multiple such servers from the same socket. If they all report the same address the NAT router is Destination Endpoint Independent; if they report different addresses it is Destination Endpoint Dependent.
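The detection logic can be written down as a small function. This is a sketch with a hypothetical `classify_nat` helper; it assumes every probe was sent from the same local socket to servers with different addresses, and it ignores complications such as multiple layers of NAT.

```python
def classify_nat(local_addr, reflexive_addrs):
    """Classify the NAT from the reflexive addresses reported by different servers."""
    distinct = set(reflexive_addrs)
    if not distinct:
        raise ValueError("need at least one reflexive address")
    if distinct == {local_addr}:
        return "no NAT"
    if len(reflexive_addrs) < 2:
        # One server is enough to detect a NAT, but not to classify it.
        return "NAT, mapping type unknown"
    if len(distinct) == 1:
        return "destination endpoint independent"
    return "destination endpoint dependent"


local = ("192.168.1.23", 5000)

print(classify_nat(local, [("203.0.113.10", 40000), ("203.0.113.10", 40000)]))
# destination endpoint independent
print(classify_nat(local, [("203.0.113.10", 40000), ("203.0.113.10", 40001)]))
# destination endpoint dependent
```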

Session Traversal Utilities for NAT: STUN

Naturally such servers have existed for a while. As part of all the standardisation around audio-video calls in the form of SIP and WebRTC there was a need for endpoints to learn about their reflexive transport addresses. For this the STUN spec was created, initially in RFC 3489 and several versions later we are now at RFC 8489 if we didn't miss anything.[^rfc-numbers]

Not going to lie about it: I've never read the full STUN spec.[^spec-reading] It contains a lot and can do many things. And yet, the really useful part is surprisingly small. Until version 0.32 iroh used STUN exclusively. It worked pretty simply:

Generate a STUN transaction ID.
Send a STUN request in a UDP datagram to a STUN server (the iroh relay server).
Wait for a response from the server matching the transaction ID.

That's it.
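To show how small the useful part is, here is a sketch of the message handling behind those three steps. It is not iroh's code. It follows the RFC 8489 wire format as I understand it, handles only IPv4 and only the XOR-MAPPED-ADDRESS attribute, and skips everything else in the spec. The `fake_server_response` helper is a stand-in for a real server so the example runs offline.

```python
import ipaddress
import os
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020
FAMILY_IPV4 = 0x01


def build_binding_request(transaction_id=None):
    """Return (transaction_id, datagram) for a Binding request without attributes."""
    transaction_id = transaction_id or os.urandom(12)
    # Header: message type, attribute length (0), magic cookie, then the transaction ID.
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return transaction_id, header + transaction_id


def parse_binding_response(datagram, transaction_id):
    """Return the reflexive (ip, port), or None if the response does not match."""
    if len(datagram) < 20:
        return None
    msg_type, length, cookie = struct.unpack("!HHI", datagram[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    if datagram[8:20] != transaction_id:
        return None
    attrs = datagram[20:20 + length]
    offset = 0
    while offset + 4 <= len(attrs):
        attr_type, attr_len = struct.unpack("!HH", attrs[offset:offset + 4])
        value = attrs[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == FAMILY_IPV4:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            # Port and address are XORed with (parts of) the magic cookie.
            port = xport ^ (MAGIC_COOKIE >> 16)
            ip = ipaddress.IPv4Address(xaddr ^ MAGIC_COOKIE)
            return str(ip), port
        # Attribute values are padded to a multiple of 4 bytes.
        offset += 4 + ((attr_len + 3) & ~3)
    return None


def fake_server_response(transaction_id, ip, port):
    """Hypothetical stand-in for a STUN server: build a Binding success response."""
    xport = port ^ (MAGIC_COOKIE >> 16)
    xaddr = int(ipaddress.IPv4Address(ip)) ^ MAGIC_COOKIE
    value = struct.pack("!BBHI", 0, FAMILY_IPV4, xport, xaddr)
    attr = struct.pack("!HH", XOR_MAPPED_ADDRESS, len(value)) + value
    header = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE)
    return header + transaction_id + attr


txid, request = build_binding_request()
response = fake_server_response(txid, "203.0.113.10", 40000)
print(len(request))                            # 20
print(parse_binding_response(response, txid))  # ('203.0.113.10', 40000)
```

In a real client the request would go out over a UDP socket and the parser would run on whatever comes back, with a timeout and resend around it.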

So why change working systems? Let's look at what we don't get from this:

Encryption. While in theory you can encrypt STUN requests using DTLS, it's not something that is done much. It's also DTLS...
Reliability. It's a simple UDP-based protocol. If the request is lost you eventually time out and need to resend it, which is very primitive.
Congestion Control. You will be sending application traffic over the same sockets. STUN happens outside of this however, which makes packet loss much more likely if the application is busy.

All of these are things that are solved in QUIC: QUIC is a secure, reliable transport with advanced congestion control and loss detection. And we already use it for our application protocol so we won't have two different endpoints sending and receiving on the same socket.

[^rfc-numbers]: In between there was RFC 5389. RFC number cuteness tricks will never stop being cute.

[^spec-reading]: I have read many QUIC RFCs in their entirety, several times, so it's not like I'm averse to reading lengthy IETF specs.

QUIC Address Discovery

This is such an obvious idea that someone already wrote it down as an IETF draft (thanks Maarten and Christian!): https://quicwg.org/address-discovery/draft-ietf-quic-address-discovery.html

QUIC Address Discovery, or QAD as we call it, is an extension to the QUIC protocol that gets negotiated during the QUIC handshake. If negotiated the remote side will send you a new OBSERVED_ADDRESS frame containing the reflexive transport address it observed for you.

One of the cool things is that this can happen regardless of the application protocol being used, as it happens entirely in QUIC frames. So you can still use this connection to carry application data.

Another really nice feature flowing from this is that this isn't a request-response protocol anymore. QUIC supports connection migration for clients. When your NAT router updates the mapping for some reason, or when you move from a Wi-Fi network to mobile data, QUIC will detect this and migrate the connection to the new network, without losing any data or breaking the connection. Whenever that happens while the QAD extension is negotiated, a new reflexive transport address is observed and will be sent in a new OBSERVED_ADDRESS frame. Thus this becomes event-based rather than request-response.
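On the receiving side, event-based means the application reacts to reports as they arrive instead of polling. The sketch below shows one way to consume them. The `ReflexiveAddressTracker` class and its methods are hypothetical and not part of iroh or any QUIC library, and the sequence number on each report is an assumption made here to cope with reordered reports.

```python
class ReflexiveAddressTracker:
    """Keeps the latest observed address and notifies listeners when it changes."""

    def __init__(self):
        self.current = None
        self._last_seq = -1
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def on_observed_address(self, seq, addr):
        """Call this whenever the QUIC stack delivers an observed-address report."""
        if seq <= self._last_seq:
            return  # stale or duplicate report, ignore it
        self._last_seq = seq
        if addr != self.current:
            self.current = addr
            for callback in self._listeners:
                callback(addr)


changes = []
tracker = ReflexiveAddressTracker()
tracker.subscribe(changes.append)

tracker.on_observed_address(0, ("203.0.113.10", 40000))  # initial report
tracker.on_observed_address(1, ("203.0.113.10", 40000))  # same address, no event
tracker.on_observed_address(2, ("198.51.100.7", 61000))  # e.g. moved to mobile data
tracker.on_observed_address(1, ("203.0.113.10", 40000))  # reordered report, ignored

print(changes)
# [('203.0.113.10', 40000), ('198.51.100.7', 61000)]
```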

QAD in iroh Relay Servers

Since iroh 0.32 both iroh and the relay servers have supported, and used, QAD as well as STUN. Since the 0.90 release we have switched to QAD exclusively.

The work is not finished yet though. iroh still uses a special-purpose QUIC connection for QAD. At some point we would like to also support making the normal relay connection over QUIC when possible, in addition to the current HTTPS (HTTP/1.1) WebSocket connection. This would mean one fewer connection to the relay server and would truly allow us to benefit from the event-based nature of QAD. This is something for after the 1.0 release, however.
