Debugging IPv6 support
One of our customers runs a monitoring network, First2Know, and has one of its monitoring nodes hosted with us at one of our London sites. He chose us because we support both IPv4 and IPv6, and the monitoring network does full IPv4/IPv6 monitoring from every location. He kept seeing connectivity issues between his Raleigh node, hosted by RootBSD, and his London node, hosted by us.
Initial investigation indicated that only some IPv6 hosts on our network were affected. In particular, of two machines with IPv6 addresses in the same netblock, hosted on the same switch within our network, he could reliably ping only one. We escalated the issue both internally and with RootBSD, who helpfully gave me a VM on their network so I could do some end-to-end testing.
Analysing at both ends with tcpdump indicated that packets were only being lost on the return path from RootBSD to Mythic Beasts; on the outbound path they always arrived fine. More specific testing showed that the connectivity issue was reproducible and depended on the source and destination addresses and port numbers.
This connect command never succeeds:
# nc -p 41452 -6 2607:fc50:1:4600::2 22
This one reliably works:
# nc -p 41451 -6 2607:fc50:1:4600::2 22
SSH-2.0-OpenSSH_5.3
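The same test can be repeated over a range of source ports to see which flows get through. The following is only a minimal sketch, not the tooling we actually used: the function names probe and sweep, the port range and the timeout are all assumptions made for illustration. It only checks whether the TCP connection completes, not whether an SSH banner arrives.

```python
import socket


def probe(host, port, source_port, timeout=5.0, family=socket.AF_INET6):
    """Return True if a TCP connection from source_port to host:port succeeds.

    Hypothetical helper: binds the local end to a fixed source port,
    much like `nc -p`, then attempts to connect.
    """
    with socket.socket(family, socket.SOCK_STREAM) as s:
        # Allow quick re-use of the same source port between runs.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.settimeout(timeout)
        try:
            s.bind(("", source_port))
            s.connect((host, port))
            return True
        except OSError:
            # Covers timeouts, refusals and bind failures alike.
            return False


def sweep(host, port, source_ports, **kwargs):
    """Probe once per source port and return {source_port: succeeded}."""
    return {p: probe(host, port, p, **kwargs) for p in source_ports}


if __name__ == "__main__":
    # Port range chosen arbitrarily around the two ports tested above.
    results = sweep("2607:fc50:1:4600::2", 22, range(41440, 41460))
    for src, ok in sorted(results.items()):
        print(src, "ok" if ok else "FAILED")
```

If the failures line up with particular source ports and stay the same from run to run, that points at something on the path choosing a route per flow rather than at random packet loss.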
What’s probably happening is that somewhere along the path the packets are being shared across multiple links using a layer 3 hash (strictly a layer 3/4 hash, since it includes the port numbers). This means the link is chosen by something like
md5($source_ip . $source_port . $destination_ip . $destination_port) % (number of links)
This means that every packet in a given connection travels down the same physical link, minimising the risk of a performance loss due to out-of-order packet arrival, while each connection is effectively assigned to a link at random.
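To make the idea concrete, here is a small Python sketch of that kind of hash-based link selection. It is an illustration only: real routers use their own vendor-specific hash functions, and the name pick_link, the choice of MD5, the link count and the source address 2001:db8::1 (a documentation address standing in for a real host) are all assumptions.

```python
import hashlib


def pick_link(src_ip, src_port, dst_ip, dst_port, n_links):
    """Map a flow to a link index in the range 0..n_links-1.

    Hypothetical model of per-flow load sharing: hash the addresses and
    ports, then take the result modulo the number of links.
    """
    key = f"{src_ip}{src_port}{dst_ip}{dst_port}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest, "big") % n_links


if __name__ == "__main__":
    # Two flows differing only in source port may land on different links.
    for port in (41451, 41452):
        link = pick_link("2001:db8::1", port, "2607:fc50:1:4600::2", 22, 3)
        print(f"source port {port} -> link {link}")
```

The same four values always give the same link, which is why one source port can fail every time while its neighbour works every time.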
Statistically, we think that either 1 in 2 or 1 in 3 links at the affected point were throwing our packets away on this particular route. Nobody in general had noticed because dual-stack implementations fall back to IPv4 if the IPv6 connection fails. We only found it because this application is single stack: its IPv6 monitoring uses IPv6 only.
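One way to reason about the 1 in 2 versus 1 in 3 question: if flows are spread evenly and a fixed fraction of links drops everything, then the share of failing source ports should approach that fraction. The sketch below compares candidate fractions using a simple binomial likelihood. The function names and the example counts are hypothetical, not our measured data.

```python
from math import comb


def likelihood(failures, trials, p):
    """Binomial probability of seeing `failures` out of `trials` when each
    flow independently fails with probability p."""
    return comb(trials, failures) * p**failures * (1 - p) ** (trials - failures)


def best_fit(failures, trials, candidates):
    """Return the candidate failure fraction that best explains the data."""
    return max(candidates, key=lambda p: likelihood(failures, trials, p))


if __name__ == "__main__":
    # Hypothetical sample: 33 of 100 source ports failed to connect.
    print(best_fit(33, 100, [1 / 2, 1 / 3]))
```

With only a handful of test connections the two hypotheses are hard to tell apart, which is why we could not say which of them applied.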
Conversation with RootBSD confirmed that the issue is almost certainly within one of the Tier 1 providers on the path between our networks: neither of us has any layer 3 hashing options enabled on any equipment on the path taken by the packets.
In this case we also discovered that we had some suboptimal IPv6 routing. Once we’d fixed the faulty announcement, our inbound routes changed and became shorter via a different provider. All the problems went away and we were unable to reproduce the issue again.
However, as a result of this we’ve become a customer of First2Know, and we’re using their worldwide network to monitor our global IPv4 and IPv6 connectivity so we can be alerted to, and fix, issues like these well before our customers find them.
If this sounds like the sort of problem you’d like to work on, we’re always happy to accept applications at our jobs page.