LEdoian's Blog

Angry Eyeballs and a few things learned about tc

I was poking some Eyeballs today and I learned stuff. I wanted the Eyeballs to be less Happy.

Happy Eyeballs

Happy Eyeballs is a mechanism that tries to use the connection with the lowest latency when multiple ways of connecting are available – most often in the presence of both IPv6 and IPv4 connectivity. There are already several versions of the algorithm described in RFCs: v1 in RFC 6555 and v2 in RFC 8305; v3 is being worked on.

The basic idea is the same in all versions: try all the ways of reaching the destination (often a website) in parallel, staggered by several tens of milliseconds with the newest protocol tried first, and then use the one whose TCP handshake completes first. This way, IPv6 gets a slight head start, but the connection is established quickly regardless of which protocol actually works / is used / is fast.

It could be viewed as “Race condition as a feature”.

Poking the Eyeballs

However, there are some use cases where I don't want my endpoint to be reached quickly, especially over legacy IPv4. One such use case is migrating sites to be IPv6-only and returning an explanation page like this one. In that case, the domain needs to be reachable over both IPv6 (the actual thing) and IPv4 (the explanation), but it would be quite unfortunate if IPv4 won the race, for any reason.

One way to avoid this is to manually steer Happy Eyeballs towards the IPv6 version. We know it's a matter of latency: we probably cannot make the IPv6 connectivity any faster, but IPv4 can be slowed down.

At this point I thought: this is just a matter of basic network shaping, right?

Basic network shaping: a recap

Disclaimer first: I am using the word “shaping” quite liberally here, e.g. tc(8) only considers managing egress rates to be shaping. The general term is “Quality-of-Service” or QoS, but it's a bit confusing to use that in sentences. (The disclaimer will make sense shortly.)

There are two kinds of packets as seen by a network card (or any NIC – network interface controller): the incoming ones from the network, called ingress, and the ones the machine sends (for whatever reason, even the ones that came from another network but go out through this NIC), called egress.

The machine (router, server, …) cannot do anything about how or when the packets come in, so there is little to do on the ingress side of things. It is possible to drop packets, but there is little to gain from reordering them, because all the ones that were not dropped still need to be processed. So most of the shaping happens on egress.

Egress shaping consists of various sorts of reordering packets, dropping them, delaying them etc. This is generally the most flexible (and therefore confusing) part of QoS. From now on, unless explicitly stated, we are only talking about egress shaping, still meaning any sort of QoS imaginable.

Basic network shaping, in Linux

When Linux (the kernel) decides to send (or forward) a packet, it first chooses the network interface to use. There is a queuing discipline (or qdisc) associated with this interface. This is the algorithm that decides the order in which packets get sent. The kernel (the “software part”) adds packets into the qdisc, the qdisc hands the kernel (the “hardware part”) the packets that should be sent next, and those then go to the network card driver and onto the wire. (I am not sure whether packets can be modified by a qdisc, but that is irrelevant for this problem and it will get confusing even without that.)

In the basic case, the qdisc just manages the packets by itself. This is called a classless qdisc; it will become clear why shortly. Some examples are pfifo, which just hands out packets in their original order, pfifo_fast, which prioritises packets based on their Type of Service / priority, and fq_codel, which just tries to do the right thing in case of congestion.
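
Just to make that concrete (a generic sketch, with the interface name eth0 assumed, not part of the setup in this post): installing one of these classless qdiscs as the root of an interface is a one-liner.

# replace whatever root qdisc is currently installed with fq_codel
tc qdisc replace dev eth0 root fq_codel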

Some qdiscs are classful, meaning they manage several piles of packets and pick packets from the piles according to their algorithm. The piles are called classes. Other qdiscs can be attached to classes. Some qdiscs create their classes automatically, some can have classes added dynamically by a separate command. (If there is no qdisc attached to a class, the manpage says it behaves like pfifo.)

Some classful qdiscs have just one class (e.g. tbf), but often there are multiple classes. In that case, the packets need to be sorted into the classes according to filters. Filters are therefore attached to a qdisc and are basically an ordered list of rules for sorting packets into classes. Setting up the qdiscs and filters gives the administrator a way to police and shape the traffic. One example is the prio qdisc, which has different classes for different priorities and only sends traffic of some priority if there is no more important packet to be sent.
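
As an illustration (a generic sketch taken more or less from the manpages, not the setup used later in this post; eth0 and the port number are just placeholders), a prio-based hierarchy could look like this:

# prio creates its classes automatically: 1:1, 1:2 and 1:3
tc qdisc add dev eth0 root handle 1: prio
# attach another qdisc to the lowest-priority class
tc qdisc add dev eth0 parent 1:3 handle 30: sfq
# a filter steering SSH traffic into the highest-priority class
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 22 0xffff flowid 1:1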

[Diagram: how qdiscs, classes and filters interact – a complete-ish picture of the hierarchy.]

There are several naming rules. Qdiscs have handles of the form MAJ:, their classes then have handles MAJ:MIN (often generated automatically). Both MAJ and MIN are hex numbers, with 0 and ffff being special. The order of filter rules is given by priorities, which behave like “line numbers”, as usual in Linux networking.

I don't want to go into too many details here. I don't have very deep experience with QoS, so this is just a less technical summary of what the tc(8) manpage says – go read it for the technical details and an overview of the many qdiscs and filters available.

And yes, I keep waving tc(8) around – that is the userspace utility for configuring this part of the network stack. It comes in the iproute2 package, has a syntax similar to ip (with some rough edges I will talk about below), some of the qdiscs are explained in manpages like tc-fq_codel, and there is an overview in the manpage for tc itself. You very likely need root privileges to set anything with tc.
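
Just to show the flavour of the syntax (a generic example, nothing specific to this post): it is object, verb, options, much like ip.

tc qdisc show                # list qdiscs on all interfaces
tc -s qdisc show dev en0     # -s adds statistics for one interface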

How Angry Eyeballs should work

So, the idea is simple: pick the IPv4 traffic, delay it forcefully, then mix it back in using some classful qdisc and send it out.

Delaying is simple: there is the netem (“Network Emulator”) qdisc for that. According to the tldr-pages, it suffices to use something like … netem delay 200ms and that is it. [1]
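
For instance, delaying all egress traffic on an interface (a minimal sketch, with eth0 assumed) is just:

# delay every outgoing packet on eth0 by 200 ms
tc qdisc add dev eth0 root netem delay 200ms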

The tricky part is to only delay the right packets. We need a classful qdisc with two classes (one for fast traffic, one for IPv4), attach the netem qdisc to the slow class and set up filters for that.

It might seem more obvious to do this on ingress, but since the delay just has to occur somewhere during the handshake, egress is fine. It is better supported, possibly a bit easier to understand, and I was simply happy to make it work at all – using the default egress path was easier.

The rest of the article is me describing my struggles with tc and the lessons learned. Also, the explanations in the rest of this article may be wrong, because they are largely guesswork. [2]

Let's implement it

Last things first: I'll show the final form of my first attempt, because it shows how things are supposed to work (and serves as a cheat sheet), then I'll elaborate on the choices and pitfalls. Also, for now this is set up on an isolated network of two QEMU VMs booted from the Arch ISO. I named the interface en0 on both VMs for simplicity. We will configure just one of the VMs anyway.

The setup is as follows:

# Create the classful qdisc
# I picked ets, so I give the two classes quanta 60 and 40
tc qdisc add dev en0 handle 42: root ets quanta 60 40

# Now we have classes 42:1 and 42:2, with everything going into 42:2
# Add the filters
tc filter add dev en0 prio 40 protocol ipv4 matchall classid 42:2
tc filter add dev en0 prio 60 matchall classid 42:1

# Now we add the `netem` qdisc
tc qdisc add dev en0 parent 42:2 netem delay 200ms
# done.

As shown, it should be fairly obvious what this does, especially given the comments. The only nontrivial thing is the ETS setup. The handle 42 is completely arbitrary; I just need to know it in the rest of the script, so I pick it explicitly. ETS also needs to have its parameters set: in my case, we create two classes (because there are two numbers after quanta) and the output ratio will be 60:40 in case both classes of traffic want to send data. The classes are then automatically numbered :1 and :2.

The numbers 40 and 60 in the filter priorities only set the order and have nothing to do with the quanta set before; they just have the nice mnemonic of IPv4 and IPv6 traffic, and they are spaced apart in case I need to add more rules in between. The matchall filter just matches all traffic, and classid X:Y sets the target class. Filtering on a specific address family is apparently done by the filter framework itself (the protocol keyword).
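
For example (purely hypothetical, not part of my setup – the address is just a documentation prefix), if I later wanted to also delay traffic towards one particular IPv6 destination, a rule could be slotted in between the two existing priorities:

# hypothetical: send traffic to 2001:db8::1 into the slow class as well
tc filter add dev en0 prio 50 protocol ipv6 u32 match ip6 dst 2001:db8::1/128 classid 42:2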

The netem qdisc is then simply attached to that class. We do not specify a handle; in my case it got assigned 8003:, and it does not really matter as long as we don't want to attach anything under it.

Now it happily works:

# ping fe80::1615:16ff:fe11:2%en0 -c 1
PING fe80::1615:16ff:fe11:2%en0 (fe80::1615:16ff:fe11:2%en0) 56 data bytes
64 bytes from fe80::1615:16ff:fe11:2%en0: icmp_seq=1 ttl=64 time=0.379 ms
# ping 192.168.210.2 -c 1
PING 192.168.210.2 (192.168.210.2) 56(84) bytes of data.
64 bytes from 192.168.210.2: icmp_seq=1 ttl=64 time=201 ms

We can even see the configuration, and use -g to have a nice ASCII graph of classes:

# tc qdisc show dev en0
qdisc ets 42: root refcnt 2 bands 2 quanta 60 40 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
qdisc netem 8003: parent 42:2 limit 1000 delay 200ms seed 5997207230187148512
# tc filter show dev en0
filter parent 42: protocol ip pref 40 matchall chain 0
filter parent 42: protocol ip pref 40 matchall chain 0 handle 0x1 flowid 42:2
  not_in_hw
filter parent 42: protocol all pref 60 matchall chain 0
filter parent 42: protocol all pref 60 matchall chain 0 handle 0x1 flowid 42:1
  not_in_hw
# tc class show dev en0
class ets 42:1 root quantum 60
class ets 42:2 root leaf 8003: quantum 40
# tc -g class show dev en0
+---(42:2) ets quantum 40
+---(42:1) ets quantum 60

Yeah, everything is fine, unless it isn't

How and why is this broken: types of qdiscs

An obvious thing to try is running both ping commands in parallel. Keep them running and watch them while the QoS configuration is being changed. At some point, it might look like this [3]:

64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=5 ttl=64 time=0.417 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=6 ttl=64 time=0.427 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=7 ttl=64 time=0.421 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=8 ttl=64 time=186 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=9 ttl=64 time=186 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=10 ttl=64 time=187 ms

Yikes. This can be explained by how ETS works. We are only using bandwidth-sharing bands, where the selection is made by the Deficit Round Robin (DRR) algorithm, so I'll only explain that part of ETS.

DRR has a counter for each class, initialised to its quantum. It then iterates through the classes which have packets in them, and if the topmost packet is smaller than the current counter, it gets sent and the counter is decreased by the packet's size; this repeats. When all such packets have been sent, the counter for the class is increased by the quantum and the next class with packets is tried. This leads to proportional bandwidth sharing (in the case above the ratio is 60:40 in favour of IPv6).

The problem is that DRR is a work-conserving qdisc, which very roughly means that whenever there are packets in any of the classes, it tries to send them. This clashes with the netem qdisc, because there are packets in the IPv4 class (the pings), but trying to send them is slow (the added delay). DRR does not know it would be slow; it only knows that at that moment a small enough packet sits in this class, so it must be sent. And therefore it keeps waiting on this class and does not send the IPv6 packets from the other class, increasing their delay too!

(The reason why some of the IPv6 packets were fast is a coincidence – the IPv6 class got to send packets in the right moment. This time, it is a race condition as a bug.)

Netem, on the other hand, is non-work-conserving, meaning that even with queued packets, it might not want to send them right away. (There is a better discussion about these kinds of schedulers on Wikipedia.)

The fact that you should not attach non-work-conserving qdiscs to work-conserving qdiscs is actually written in both tc-ets(8) and tc-drr(8) manpages:

Attaching to them non-work-conserving qdiscs like TBF does not make sense -- other qdiscs in the active list will be skipped until the dequeue operation succeeds.

However, the tc-netem(8) manpage does not say anything about work conservation, so I did not make the connection until the kernel started complaining with lines like ets_qdisc_dequeue: netem qdisc 8004: is non-work-conserving? in dmesg.

Having used the word “scheduler”, let's mention how qdiscs differ according to function:

  • Schedulers only reorder packets. They don't delay them and are often work-conserving. ETS belongs here, as does PRIO. They help prioritise important packets, but are not concerned with transmission rates.
  • Shapers on the other hand try to ensure transmission is under control, which involves management of the transmission rates and burstiness. TBF is an example of a pure shaper that never reorders traffic. The only way they could be work-conserving is if they swiftly dropped any packet that they would deem unsendable, which would not be very nice.
  • Some qdiscs do both, like fq_codel: it sorts traffic into flows and then drops and delays packets in each flow, so that all flows have similar bandwidth.
  • Some qdiscs are very dumb, like pfifo, and do none of the above.

So, now we know what we need: a qdisc that would be non-work-conserving and classful. We aren't very interested in scheduling, because we don't have traffic to prioritise (it would be nice-to-have, but making IPv6 faster is not the main goal).

Looking at the list of qdiscs in the tc(8) manpage, we are left with basically three candidates: HFSC, HTB and TBF.

By simple skimming, I decided that HFSC is too complex to try to understand for my use case and TBF only supports one class [4], so HTB it is. Once again, skim through its manpage, then also the manpage for tc-tbf, and we have this as the final config (assuming a gigabit network):

tc qdisc add dev en0 root handle 42: htb default 6
tc class add dev en0 parent 42: classid 42:4 htb rate 400mbit ceil 1gbit burst 1200kb prio 20
tc class add dev en0 parent 42: classid 42:6 htb rate 600mbit ceil 1gbit burst 1200kb prio 10
tc filter add dev en0 protocol ipv4 matchall classid 42:4
tc qdisc add dev en0 parent 42:4 netem delay 200ms

HTB lets us name our classes ourselves and lets us specify the default class on the qdisc, so that part is a bit different, but otherwise it is very similar. Also, we don't specify quanta but rather specific rates for sending packets. We promise a bit more bandwidth to IPv6 (the rate parameter), but either protocol can use all of the available bandwidth (ceil) if the other one does not use it.

Once again, the kernel warns us with htb: netem qdisc 8005: is non-work-conserving?, but this time we use an algorithm that should not be badly affected by non-work-conserving qdiscs, so I think it is safe to ignore that. Pings work as expected even in parallel; IPv4 iperf is a bit slower than I'd expect, but that seems to be due to the delay (maybe too small a window?).
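
To double-check where the traffic actually ends up, the per-class counters are handy (just a verification step, not part of the setup itself):

# -s shows statistics: the byte/packet counters reveal which class the traffic hit
tc -s class show dev en0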

I call this a success.

The tc pitfalls and other lessons learned

As I said, coming up with this was not straightforward. Most of that stems from tc not having great error messages, so I was often left with Error: Parent Qdisc doesn't exists. We have an error talking to the kernel and just had to guess what to do. As an example, not even the -g option in the last command above is described in what tc help shows (but it is in the manpage).

I also learned that the manpage is not entirely true. For example, it seems that tbf and mqprio are actually classful qdiscs, even though they are found in the classless list. The manpage also claims that the default qdisc for unconfigured interfaces is pfifo_fast, but tc qdisc show suggests that actual hardware-based devices use fq_codel and “software” interfaces (VLANs, bridges, WireGuard tunnels, …) use something called noqueue.

The main pitfall for debugging and understanding: you have to specify the interface almost every time. No matter that tc filter help says Usage: tc filter [ add | del | change | replace | show ] [ dev STRING ] (among others). No matter that the commands finish successfully and don't complain. tc filter show gives you nothing, tc filter show dev en0 gives you the info. For qdiscs, though, it always works, even with just tc qd.
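
In other words:

tc filter show             # prints nothing at all, and no error either
tc filter show dev en0     # actually lists the filters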

Therefore, at first, I didn't even realise that the classes were there at all.

What protocol means in the filter is still a mystery to me, but somehow ip is the correct answer according to the output, and apparently ipv4 can be specified too. I did not know that from the beginning though, so the first filter I had was tc filter add dev en0 prio 40 basic match 'meta(sk_family eq 2)' classid 42:2, and that works too, with the value 2 taken from what AF_INET means in Python (which takes it from C) – which feels ugly and hacky.

The rules about which qdisc can be attached where still boggle me. I can put ETS as a root or under TBF, but if it is under TBF, it is not possible to attach a filter to the ETS: Error: Class doesn't support blocks. We have an error talking to the kernel. And in the tc class show dev … it shows class ets 42:1 root quantum 60, even though it should not be a root in any sense.

Another example of weird error messages is deleting a qdisc. The task is to delete a qdisc 15:. All commands in this paragraph start with tc qdisc del dev en0; I'll shorten that to just “…”. The naïve … handle 15: yields Error: Classid cannot be zero.. Removing the colon does not help. Using … handle 15 parent 8004 gives Error: Failed to find qdisc with specified classid., even though a qdisc with that id exists. Ha, here the colon is required, so … handle 15 parent 8004: works, as does specifying the class: parent 8004:1. The class's minor number is required if the parent qdisc has multiple classes, otherwise the above would tell you Error: Specified class not found. And it is correct, because the qdisc is attached to a class, not to a qdisc, and tc qdisc help says to use CLASSID. But at this point, if you are like me, you just randomly tweak the command and hope it does what you want, especially when it complained about a qdisc, not a class, before. (And surprisingly, the handle is sometimes not required at all, even though the manpage says: “A qdisc can be deleted by specifying its handle, which may also be 'root'.”)
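
To spare future me the guessing, the variants that ended up working in this particular situation were:

# deleting qdisc 15:, which sat under class 8004:1 – either form worked here
tc qdisc del dev en0 handle 15 parent 8004:
tc qdisc del dev en0 handle 15 parent 8004:1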

There are also chains; the manpage says pretty much nothing about those.

I think some of these pitfalls are the kind that bites you only at first, because if you spend some time with tc, it becomes normal and transparent. I am writing this mainly to show novices like me that this happens and what that might mean.

Also, it might be worth knowing that, unlike with ip, it is possible to reset the tc state to (some kind of) default by removing the root qdisc: tc qdisc del dev en0 root. You do not even need to specify which qdisc it is.

How should my setup be improved?

As I was writing this article, I needed to read what fq_codel does. At that point I realised that having HTB as the root qdisc means that there is no flow balancing, which is not so great. It would be great to incorporate a flow-aware scheduler in the mix. However, both fq_codel and netem are classless, so they don't mix. Using fq_codel as the qdisc for the fast class is an option though.
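
That option would be a one-liner on top of the HTB setup above (untested on my side, just a sketch; 42:6 is the fast class from the config earlier):

# attach fq_codel under the fast (default) HTB class for per-flow fairness
tc qdisc add dev en0 parent 42:6 fq_codel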

I am not aware of a classful flow-aware non-work-conserving scheduler. Or a classful packet delayer. Either of them would help the IPv4 class. But I don't care too much about IPv4 anyway :-)

Also, another warning popped up when adding the HTB classes: Warning: sch_htb: quantum of class 150004 is big. Consider r2q change. That should be understood and considered :-)
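
(If I understand the HTB manpage correctly – and this is an assumption I have not actually tested – the quantum defaults to rate divided by r2q, so the fix would be either a bigger r2q on the qdisc or an explicit quantum on the classes, something like:)

# untested sketch: either raise r2q when creating the qdisc ...
tc qdisc add dev en0 root handle 42: htb default 6 r2q 3000
# ... or set the quantum explicitly on the existing classes
tc class change dev en0 parent 42: classid 42:6 htb rate 600mbit ceil 1gbit burst 1200kb prio 10 quantum 60000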

Closing thoughts

It might be a bit surprising that this is not filed under awful-networks, given the amount of witchcraft involved. The reason is that my aim with this is a bigger project, but that needs a bit more preparation. Stay tuned, and maybe remind me in a month if nothing has happened. (If you know me in person, you might already know what I'm talking about.)

Since I want to use this in another project, it is actually implemented as a systemd service on my gitea: https://gitea.ledoian.cz/LEdoian/angry-eyeballs.

The info in this post might be wrong, misunderstood or right for the wrong reasons. If you find an error or can disprove something I have written above, please let me know – I'll be happy to learn!


[1]Quite unfortunately, the tldr page for tc(8) only describes the usage of this one qdisc. I understand that it's of most use for devs, but it means it's hard to find resources on actually using tc. The manpage is very technical and doesn't make it clear how the stuff works, either. (I knew a bit about QoS in Linux, in theory, and still struggled rather hard.)
[2]I keep thinking about calling this post “class struggle”, but I just didn't want to make the joke too early :-D
[3]

The IP addresses and other details are a bit different in this part of the article. I thought I had the final version of Angry Eyeballs, so I had already started integrating it into another pet project; that setup is networked differently from those Arch ISO VMs and also runs Debian.

The discovery of the issue also prompted me to scrap the part of the article where I gave (vague) arguments for the choice of ETS, because I now needed to understand things a bit more and did not want to pitch ETS to you anymore.

[4]Technically, this would not be a problem, because according to tc(8) manpage: “[…] higher level filters may also point directly to lower classes.” However, HTB is similar enough to TBF that filter hacking is not useful for me, and I think I would need another classful qdisc between the root TBF and the Netem. It could in this case be work-conserving, but again, it would just make the setup more complex without benefit.