Angry Eyeballs and a few things learned about tc
I was poking some Eyeballs today and I learned stuff. I wanted the Eyeballs to be less Happy.
Happy Eyeballs
Happy Eyeballs is a mechanism that tries to use the connection with the lowest latency when multiple ways of connecting are available – most often when both IPv6 and IPv4 connectivity is present. There are already several versions of the algorithm described in RFCs: v1 in RFC 6555, v2 in RFC 8305, and v3 is being worked on.
The basic idea is the same: try all the ways of reaching the destination (often
a website) in parallel, staggered by several tens of milliseconds, trying the newest one first, and then use the one whose TCP handshake completes first.
This way, IPv6 is given a slight headstart but the connection will be
established quickly regardless of what protocol is actually working / used / fast.
It could be viewed as “Race condition as a feature”.
Poking the Eyeballs
However, there are some use cases where I don't want my endpoint to be reached quickly, especially over the legacy IPv4. One such use case is migrating sites to be IPv6-only and returning an explanation page like this one. In this case, the domain needs to be reachable over both IPv6 (the actual thing) and IPv4 (the explanation), but it would be quite unfortunate if IPv4 won the race, for any reason.
One way to avoid this is to manually steer Happy Eyeballs to the IPv6 version.
We know it's a matter of latency; we probably cannot make IPv6 connectivity faster, but IPv4 can be slowed down.
At this point I thought: this is just a matter of basic network shaping, right.
Basic network shaping: a recap
Disclaimer first: I am using the word “shaping” quite liberally here, e.g.
tc(8) only considers managing egress rates
to be shaping. The general term is “Quality-of-Service” or QoS, but it's a bit
confusing to use that in sentences. (The disclaimer will make sense shortly.)
There are two kinds of packets as seen by a network card (or any NIC – network
interface controller): the incoming ones from the network, called ingress, and
the ones the machine sends (for whatever reason, even the ones that came from
another network but go out through this NIC), called egress.
The machine (router, server, …) cannot do anything about how/when the packets
come in, so there is little to do on the ingress side of things. It is possible
to drop packets, but there is little to gain in reordering them, because all of the ones that were not dropped still need to be processed. So most of the shaping happens on egress.
Egress shaping consists of various sorts of reordering packets, dropping them,
delaying them etc. This is generally the most flexible (and therefore
confusing) part of QoS. From now on, unless explicitly stated, we are only
talking about egress shaping, still meaning any sort of QoS imaginable.
Basic network shaping, in Linux
When Linux (the kernel) decides to send (or forward) a packet, it first chooses
the network interface to use. Associated with this interface is a queuing discipline (or qdisc). This is the algorithm that decides the order of the packets. The kernel (the “software part”) adds packets into the qdisc, the qdisc hands the kernel (the “hardware part”) the packets that should get sent, and then they go to the network card driver and onto the wire. (I am not sure whether packets can be modified by a qdisc, but that is irrelevant for the problem and it would get confusing even without that.)
In the basic case, the qdisc just manages the packets by itself. This is called
a classless qdisc and it will be clear why shortly. Some examples are
pfifo, which just gives packets in the original order, pfifo_fast that
prioritises depending on the Traffic Class of packets, and fq_codel that
just tries to do the right thing in case of congestion.
Some qdiscs are classful, meaning they manage piles of packets and pick packets from the piles according to their algorithm. The piles are called classes. Other qdiscs can be attached to classes. Some qdiscs create their classes automatically, some can have classes added dynamically by a separate command. (If there is no qdisc attached to a class, the manpage says it behaves like pfifo.)
Some classful qdiscs have just one class (e.g. tbf), but often there are multiple classes. In that case, the packets need to be sorted into the classes according to filters. Filters are therefore attached to a qdisc and are basically an ordered list of rules for sorting packets into classes. Setting up the qdiscs and filters gives the administrator a way to police and shape the traffic. One example is the prio qdisc, which has different classes for different priorities and only sends traffic of a given priority if there is no more important packet to be sent.
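For a taste of what this looks like in practice, here is a small sketch (eth0, the handles and the port are placeholders; handle naming is explained just below):
# prio creates its (by default three) classes 1:1, 1:2 and 1:3 automatically
tc qdisc add dev eth0 root handle 1: prio
# a filter then sorts packets into the classes; here: destination port 22 into the most important one
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 22 0xffff flowid 1:1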
There are several naming rules. Qdiscs have handles of the form MAJ:, and their classes then have automatically generated handles MAJ:MIN. Both MAJ and MIN are hex numbers, with 0 and ffff being special. The ordering of filter rules is given by priorities, which behave like the usual “line numbers” in Linux networking.
I don't want to go into too many details here. I don't have very deep experience with QoS, so this is just a less technical summary of what the tc(8) manpage says – go read it for the technical details and an overview of the many qdiscs and filters available.
And yes, I keep waving tc(8) around – that is the userspace utility for configuring this part of the network stack. It comes in the iproute2 package, has a syntax similar to ip (with some rough edges I will talk about below), some of the qdiscs are explained in manpages like tc-fq_codel, and there is an overview in the manpage for tc itself. You very likely need root privileges to set anything with tc.
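Just to give a flavour of the syntax before we dive in, the commands below all follow roughly this shape (a rough sketch, not a complete synopsis):
# an object (qdisc/class/filter), a command, the interface, then object-specific parameters
tc qdisc  [ add | del | change | replace | show ] dev DEV ...
tc class  [ add | del | change | replace | show ] dev DEV ...
tc filter [ add | del | change | replace | show ] dev DEV ...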
How Angry Eyeballs should work
So, the idea is simple: pick IPv4 traffic, delay it forcefully, then mix it
back using some of the classful qdiscs to send out.
Delaying is simple: there is netem (“Network Emulator”) qdisc for that.
According to the tldr-pages, it suffices to use something like … netem delay 200ms and that is it.
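In its simplest form it would delay all egress traffic on an interface (a minimal sketch, eth0 being a placeholder):
# delay every outgoing packet on eth0 by 200 ms
tc qdisc add dev eth0 root netem delay 200ms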
The tricky part is to only delay the right packets. We need a classful qdisc
with two classes (one for fast traffic, one for IPv4), attach the netem
qdisc to the slow class and set up filters for that.
It might seem more obvious to do this on ingress, but since the delay just has to occur somewhere during the handshake, egress is fine. It's better supported, possibly a bit easier to understand, and I was happy to just make it work anyhow, so using the default egress was easier.
The rest of the article is me describing my struggles with tc and the lessons learned. Also, the explanations in the rest of this article may be wrong, because they are largely guesswork.
Let's implement it
Last things first: I'll show the final version of the first attempt, because it shows how things are supposed to work (and can serve as a cheat sheet), and then I'll elaborate on the choices and pitfalls. Also, for now this is set up on an isolated network of two QEMU VMs booted from the Arch ISO. I named the interfaces en0 on both VMs for simplicity. We will configure just one of the VMs anyway.
The setup is as follows:
# Create the classful qdisc
# I picked ets, so I give the two classes quanta 60 and 40
tc qdisc add dev en0 handle 42: root ets quanta 60 40
# Now we have classes 42:1 and 42:2, with everything going into 42:2
# Add the filters
tc filter add dev en0 prio 40 protocol ipv4 matchall classid 42:2
tc filter add dev en0 prio 60 matchall classid 42:1
# Now we add the `netem` qdisc
tc qdisc add dev en0 parent 42:2 netem delay 200ms
# done.
As shown, it should be fairly obvious what this does, especially given the comments. The only nontrivial thing is the ETS setup. The handle 42 is completely arbitrary; I just need to know it in the rest of the script, so I pick it explicitly. It also needs to have its ETS parameters set. In my case, we create two classes (because there are two numbers after quanta) and the output ratio will be 60:40 in case both classes of traffic want to send data. The classes are then automatically numbered :1 and :2.
The numbers 40 and 60 in the filter priorities only set the order and have nothing to do with the quanta set before; they just carry the nice semantics of IPv4 and IPv6 traffic and are spaced apart in case I need to add more rules in between. The matchall filter simply matches all traffic; classid X:Y sets the target class. Filtering to a specific address family is apparently done by the filter system itself.
The netem qdisc is simply attached to that class. We do not specify a handle; in my case it got assigned 8003:, and it does not really matter as long as we don't want to attach anything under it.
Now it happily works:
# ping fe80::1615:16ff:fe11:2%en0 -c 1
PING fe80::1615:16ff:fe11:2%en0 (fe80::1615:16ff:fe11:2%en0) 56 data bytes
64 bytes from fe80::1615:16ff:fe11:2%en0: icmp_seq=1 ttl=64 time=0.379 ms
# ping 192.168.210.2 -c 1
PING 192.168.210.2 (192.168.210.2) 56(84) bytes of data.
64 bytes from 192.168.210.2: icmp_seq=1 ttl=64 time=201 ms
We can even see the configuration, and use -g to have a nice ASCII graph of classes:
# tc qdisc show dev en0
qdisc ets 42: root refcnt 2 bands 2 quanta 60 40 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
qdisc netem 8003: parent 42:2 limit 1000 delay 200ms seed 5997207230187148512
# tc filter show dev en0
filter parent 42: protocol ip pref 40 matchall chain 0
filter parent 42: protocol ip pref 40 matchall chain 0 handle 0x1 flowid 42:2
not_in_hw
filter parent 42: protocol all pref 60 matchall chain 0
filter parent 42: protocol all pref 60 matchall chain 0 handle 0x1 flowid 42:1
not_in_hw
# tc class show dev en0
class ets 42:1 root quantum 60
class ets 42:2 root leaf 8003: quantum 40
# tc -g class show dev en0
+---(42:2) ets quantum 40
+---(42:1) ets quantum 60
Yeah, everything is fine, unless it isn't…
How and why is this broken: types of qdiscs
An obvious thing to try is running both ping commands in parallel. Keep them running and watch them while the QoS configuration is changed. At some point, it might look like this:
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=5 ttl=64 time=0.417 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=6 ttl=64 time=0.427 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=7 ttl=64 time=0.421 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=8 ttl=64 time=186 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=9 ttl=64 time=186 ms
64 bytes from fd37:81a4:145e:801:1:1:0:6: icmp_seq=10 ttl=64 time=187 ms
Yikes. This can be explained by how ETS works. We are using only bandwidth-sharing bands, where the selection is made by the Deficit Round Robin (DRR) algorithm, so I'll only explain that part of ETS.
DRR has a counter for each class, initialised to its quantum. It then iterates through the classes which have packets in them, and if the topmost packet fits within the current counter, it gets sent and the counter is decreased by the packet's size, and this repeats. When all such packets have been sent, the counter for the class is increased by the quantum and the next class with packets is tried. This leads to proportional bandwidth sharing (in the case above the ratio is 60:40 in favour of IPv6).
The problem is, DRR is a work-conserving qdisc, which very approximately means that when there are packets in any of the classes, it tries to send them. This clashes with the netem qdisc, because there are packets in the IPv4 class (the pings), but trying to send them is slow (the added delay). DRR does not know it would be slow; it only knows that at that moment a small enough packet sits in this class, so it must be sent. It therefore keeps waiting on this class and does not send the IPv6 packets from the other class, increasing their delay, too!
(The reason why some of the IPv6 packets were fast is a coincidence – the IPv6 class got to send packets at the right moment. This time, it is a race condition as a bug.)
Netem, on the other hand, is non-work-conserving, meaning that even with
queued packets, it might not want to send them right away. (There is a better
discussion about these kinds of schedulers on Wikipedia.)
The fact that you should not attach non-work-conserving qdiscs to
work-conserving qdiscs is actually written in both tc-ets(8) and
tc-drr(8) manpages:
Attaching to them non-work-conserving qdiscs like TBF does not make
sense -- other qdiscs in the active list will be skipped until the
dequeue operation succeeds.
However, the tc-netem(8) manpage does not say anything about work conservation, so I did not realise this until the kernel started complaining in dmesg with lines like ets_qdisc_dequeue: netem qdisc 8004: is non-work-conserving?.
Having used the word “scheduler”, let's mention how qdiscs differ according to
function:
- Schedulers only reorder packets. They don't delay them and are often
work-conserving. ETS belongs here, as does PRIO. They help prioritise
important packets, but are not concerned with transmission rates.
- Shapers, on the other hand, try to keep the transmission under control, which involves managing transmission rates and burstiness. TBF is an example of a pure shaper that never reorders traffic (see the sketch after this list). The only way they could be work-conserving is if they swiftly dropped any packet they deemed unsendable, which would not be very nice.
- Some qdiscs do both, like fq_codel: it sorts traffic into flows and then
drops and delays packets in each flow, so that all flows have similar
bandwidth.
- Some qdiscs are very dumb, like pfifo, and do none of the above.
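For instance, a stand-alone TBF limiting an interface's rate could look like this (a sketch; the numbers are arbitrary and unrelated to my setup):
# pure shaping: limit egress to 100 Mbit/s, allow 100 kB bursts, queue at most ~50 ms worth of packets
tc qdisc add dev eth0 root tbf rate 100mbit burst 100kb latency 50ms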
So, now we know what we need: a qdisc that would be non-work-conserving and
classful. We aren't very interested in scheduling, because we don't have
traffic to prioritise (it would be nice-to-have, but making IPv6 faster is not
the main goal).
Looking at the list of qdiscs in the tc(8) manpage, we are left with essentially HFSC, HTB and TBF. By simple skimming, I decided that HFSC is too complex to try to understand for my use case and TBF only supports one class, so HTB it is. Once again, skim through its manpage, then also the manpage for tc-tbf, and we have this as the final config (assuming a gigabit network):
tc qdisc add dev en0 root handle 42: htb default 6
tc class add dev en0 parent 42: classid 42:4 htb rate 400mbit ceil 1gbit burst 1200kb prio 20
tc class add dev en0 parent 42: classid 42:6 htb rate 600mbit ceil 1gbit burst 1200kb prio 10
tc filter add dev en0 protocol ipv4 matchall classid 42:4
tc qdisc add dev en0 parent 42:4 netem delay 200ms
HTB lets us name our classes ourselves and lets us specify the default class, so that is a bit different, but otherwise it is very similar. Also, we don't specify quanta, but rather specific rates for sending packets. We promise a bit more bandwidth to IPv6 (the rate parameter), but both protocols can use all of the available bandwidth (ceil) if the other protocol does not use it.
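To check that the filter really puts IPv4 traffic into the slow class, the per-class statistics come in handy:
# show per-class counters (bytes/packets sent), so we can see where the traffic went
tc -s class show dev en0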
Once again, the kernel warns us with htb: netem qdisc 8005: is non-work-conserving?, but this time we use an algorithm that should not be badly influenced by non-work-conserving qdiscs, so I think it is safe to ignore that. Pings work as expected even in parallel; IPv4 iperf is a bit slower than I'd expect, but that seems to be due to the delay (maybe too small a window?).
I call this a success.
The tc pitfalls and other lessons learned
As I said, coming up with this was not straightforward. Most of that stems from tc not having great error messages, so I was often left with “Error: Parent Qdisc doesn't exists. We have an error talking to the kernel” and just had to guess what to do. As an example, not even the -g option in the last command above is described in what tc help shows (but it is in the manpage).
I also learned that the manpage is not entirely true. For example, it seems that tbf and mqprio are actually classful qdiscs, even though they are listed among the classless ones. The manpage also claims that the default qdisc for unconfigured interfaces is pfifo_fast, but tc qdisc show suggests that devices use fq_codel when they are actually hardware-based and something called noqueue for “software” interfaces (VLANs, bridges, wireguard tunnels, …).
The main pitfall for debugging and understanding is: you have to specify the interface almost every time. No matter that tc filter help says Usage: tc filter [ add | del | change | replace | show ] [ dev STRING ] (among other things). No matter that the commands finish successfully and don't complain. tc filter show gives you nothing; tc filter show dev en0 gives you the info. But for qdiscs, it always works, even with just tc qd.
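Spelled out as commands, the behaviour described above looks like this:
# prints nothing, no error:
tc filter show
# prints the actual filters:
tc filter show dev en0
# qdiscs, on the other hand, show up even without the interface:
tc qdisc show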
Therefore, at first, I didn't even realise that the classes were there at all.
What protocol means in the filter is still a mystery to me, but somehow ip is the correct answer according to the output, and apparently ipv4 can be specified too. I did not know that from the beginning, though, so the first filter I had was tc filter add dev en0 prio 40 basic match 'meta(sk_family eq 2)' classid 42:2, which works too, with the value 2 taken from what AF_INET means in Python (which takes it from C) – which feels ugly and hacky.
The rules about which qdisc can be attached where still baffle me. I can put ETS at the root or under TBF, but if it is under TBF, it is not possible to attach a filter to the ETS: “Error: Class doesn't support blocks. We have an error talking to the kernel.” And the tc class show dev … output shows class ets 42:1 root quantum 60, even though it should not be a root in any sense.
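For context, the attempt looked roughly like this (a reconstruction with illustrative handles and rates, not the exact commands I ran):
# TBF at the root, ETS attached under its single class
tc qdisc add dev en0 root handle 1: tbf rate 1gbit burst 1200kb latency 50ms
tc qdisc add dev en0 parent 1:1 handle 2: ets quanta 60 40
# this is the step that fails with “Error: Class doesn't support blocks. …”
tc filter add dev en0 parent 2: matchall classid 2:1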
Another example of weird error messages is deleting a qdisc. The task is to delete a qdisc 15:. All commands in this paragraph start with tc qdisc del dev en0; I'll shorten that to just …. The naïve … handle 15: yields “Error: Classid cannot be zero.” Removing the colon does not help. Using … handle 15 parent 8004 gives “Error: Failed to find qdisc with specified classid.”, even though the qdisc with that handle exists. Ha, here the colon is required, so … handle 15 parent 8004: works, as does specifying the class, parent 8004:1. The class's minor number is required if the parent qdisc has multiple classes; otherwise the above would tell you “Error: Specified class not found.” And that is correct, because the child qdisc is attached to a class, not to the parent qdisc, and tc qdisc help says to use CLASSID. But at this point, if you are like me, you just randomly tweak the command and hope it does what you want, especially when the previous complaint was about a qdisc, not a class. (And surprisingly, the handle is sometimes not required at all, even though the manpage says: “A qdisc can be deleted by specifying its handle, which may also be 'root'.”)
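To sum up the dance, here are the commands and errors from the paragraph above, written out:
# fails: Error: Classid cannot be zero. (dropping the colon does not help either)
tc qdisc del dev en0 handle 15:
# fails: Error: Failed to find qdisc with specified classid.
tc qdisc del dev en0 handle 15 parent 8004
# works – the colon matters; if the parent has multiple classes,
# the minor number is mandatory: parent 8004:1
tc qdisc del dev en0 handle 15 parent 8004: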
There are also chains; the manpage says pretty much nothing about those.
I think some of these pitfalls are the kind that bites you only at first,
because if you spend some time with tc, it becomes normal and transparent.
I am writing this mainly to show novices like me that this happens and what
that might mean.
Also, it might be worth knowing that unlike with ip, it is possible to
reset the tc state to (some kind of) default by removing the root qdisc:
tc qdisc del dev en0 root. You do not even need to specify which qdisc that is.
How should my setup be improved?
As I was writing this article, I needed to read up on what fq_codel does. At that point I realised that having HTB as the root qdisc means there is no flow balancing, which is not so great. It would be nice to incorporate a flow-aware scheduler into the mix. However, both fq_codel and netem are classless, so they don't mix. Using fq_codel as the qdisc for the fast class is an option, though.
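A sketch of that option, reusing the HTB setup above (42:6 being the fast class):
# give the fast (IPv6) class flow fairness via fq_codel
tc qdisc add dev en0 parent 42:6 fq_codel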
I am not aware of a classful flow-aware non-work-conserving scheduler. Or a
classful packet delayer. Either of them would help the IPv4 class. But I don't
care too much about IPv4 anyway :-)
Also, another warning popped up when adding the HTB classes: Warning:
sch_htb: quantum of class 150004 is big. Consider r2q change. That should be
understood and considered :-)
Closing thoughts
It might be a bit surprising that this is not filed under awful-networks, given the amount of witchcraft involved. The reason is that my aim with this is a bigger project, which needs a bit more preparation. Stay tuned, and maybe remind me in a month if nothing has happened by then. (If you know me in person, you might already know what I'm talking about.)
Since I want to use this in another project, it is actually implemented as a
systemd service on my gitea: https://gitea.ledoian.cz/LEdoian/angry-eyeballs.
The info in this post might be wrong, misunderstood or right for the wrong reason. If you find an error or can disprove something I have written above, please let me know; I'll be happy to learn!
[2] I keep thinking about calling this post “class struggle”, but I just didn't want to make the joke too early :-D