LEdoian's Blog

My awful networks, chapter 2: from MTU issues to ad-hoc ISP-style infrastructure

As I wrote in chapter 1, my uplink is PPPoE, which has an MTU of 1492 instead of the ubiquitous 1500. This is a story about building a “test infra” to debug MTU issues without breaking things.

The problem

There would be nothing wrong with a mixed-MTU environment if path MTU discovery (PMTUD) worked. But the culprit is TCP with its Maximum Segment Size (MSS) option in SYN packets. At least in my Linux network stacks, the value of MSS does not seem to reflect the PMTU, only the MTU of the outgoing interface. This means that it does not cope well with the MTU getting smaller along the way: the segments sent are too big for the smaller link, whose router then discards the packet and hopefully sends back the ICMP “Fragmentation Needed and Don't Fragment was Set” reply, or “Packet Too Big” in ICMPv6 (most often packets don't get fragmented en route even in IPv4). This reply might not get delivered, because especially in IPv4 it was/is quite common to drop ICMP traffic altogether.
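To make the numbers concrete, here is a small sketch of the arithmetic, assuming headers without options (20 bytes of IPv4 or 40 bytes of IPv6, plus 20 bytes of TCP). The function name mss_for_mtu is made up for this illustration.

```shell
# Largest TCP payload that fits into one packet of the given MTU,
# assuming no IP or TCP options (options would shrink it further).
mss_for_mtu() {
    mtu=$1
    family=${2:-4}            # 4 = IPv4 (20-byte header), 6 = IPv6 (40-byte header)
    if [ "$family" = 6 ]; then ip_hdr=40; else ip_hdr=20; fi
    echo $(( mtu - ip_hdr - 20 ))   # 20 = TCP header without options
}

mss_for_mtu 1500      # what a host on plain Ethernet advertises: 1460
mss_for_mtu 1492      # what actually fits through the PPPoE link: 1452
```

The 8 bytes of difference are the PPPoE header (6 bytes) and the PPP protocol field (2 bytes), so a full-size segment from a 1500-MTU host is exactly 8 bytes too large for the PPPoE link.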

The result is that internet connections semi-work: it is usually possible to establish the connection, but then big packets from some sites (those not receiving ICMP, I suspect) get blackholed (and 1460 bytes is not that much in today's web). Luckily, this seems to be stable: the same websites consistently fail to load (for me these are e.g. duckduckgo.com and archive.org [1]).

Note: it is possible to find the PMTU with tracepath. Kudos to the person on Stack Overflow [TODO link] from whom I learned this.

Common workarounds

The easiest workaround is to set (usually using DHCP) the MTU of the local network a bit lower (either to 1492 directly, or maybe even to 1400 to have some headroom for similar issues). This works correctly.

The more common workaround is MSS clamping, that is, rewriting the MSS in SYN packets to account for a lower (link) MTU en route. This is a hack which will not work with other transport protocols, but it is very common nevertheless. (In MikroTik, this is configured with /ppp/profile/set <ID> change-tcp-mss=yes and for me it was the default.)
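As a rough sketch of what a clamp computes for each SYN passing through (IPv4, no header options assumed): the router keeps the smaller of the advertised MSS and what fits into its narrow link. The function name clamp_mss is hypothetical, and the nftables rule in the trailing comment is the commonly cited form, not something used or tested in this setup.

```shell
# Value a clamping router writes into the MSS option of a SYN:
# the smaller of the advertised MSS and what fits through the narrow link.
clamp_mss() {
    advertised=$1
    link_mtu=$2
    fits=$(( link_mtu - 40 ))     # 20 bytes IPv4 + 20 bytes TCP
    if [ "$advertised" -gt "$fits" ]; then echo "$fits"; else echo "$advertised"; fi
}

clamp_mss 1460 1492   # SYN from a 1500-MTU LAN crossing PPPoE: rewritten to 1452
clamp_mss 1360 1492   # already small enough: left at 1360

# On a Linux router with nftables the usual rule looks like this
# (table and chain names are placeholders, adjust to the local ruleset):
#   nft add rule inet filter forward tcp flags syn tcp option maxseg size set rt mtu
```

Because only the TCP option is rewritten, anything that is not TCP still depends on working PMTUD, which is why this counts as a hack.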

And then there is “Packetisation-layer Path MTU Discovery”, with a lovely abbreviation of PLPMTUD, specified in RFC TODO, and a similar hack for connection-less [TODO check] communication called “Datagram Packetisation-layer Path MTU Discovery” or DPLPMTUD. I call these hacks, because their motivation is to work around ICMP blackholes [TODO cloudflare link]; in my opinion, PMTUD should just work in the first place.

The last option is to not have the problem at all by using baby-giants (Ethernet frames slightly larger than the standard size), so that the PPPoE itself has an MTU of 1500.

In my case, I knew two things: with my MikroTik router everything worked, while with my OpenWRT-based Omnia [2] it didn't. At the time I was unaware of MSS clamping and did not have the network ready for baby-giants. Then I learned about MSS clamping and turned it on, but for some inexplicable reason Discord still didn't work. Later it started working, so I think it was not an issue on my side, but idk. (I still need to check whether baby-giants would work.)

But Discord not working made me dig into the issue, which meant creating a test environment for my Omnia, while having the MikroTik be the main gateway.

The madness: creating the test environment

There are four obvious requirements for what should be built:

  1. The Omnia should have as production-like a configuration as possible,
  2. it has to be possible to open a web browser in the test environment,
  3. uplink to the internet has to have MTU 1500 and
  4. it should be possible to destroy the environment as easily as possible.

The fourth requirement hints at me using random VLANs and network namespaces on a random Linux machine, which obviously is my desktop. The second requirement is fulfilled by the first, because the Omnia is an access point anyway, so it is possible to connect a laptop to it. But VLANs work too, so another container on the desktop connected to the downstream interface of the Omnia does the job as well.

The tricky and horrible part is the third requirement, because that means that I need a PPPoE server and have to somehow work around the MikroTik, which clamps MSS for the rest of the network. I need to create the bug and not have it accidentally fixed.

It took me quite some thinking about how to build this. I knew MikroTik could be a PPPoE server (and there is a guide on building that), but I also have no experience with VRF, so I would need to learn that too. And having multiple routing tables on MikroTik did not sound particularly enjoyable. However, I did follow the guide and had the PPPoE server on MikroTik set up (takeaway: TODO passwords and usernames). Then I realised I could have my desktop run PPPoE server (“it's just software running on general purpose hardware, right?”), so I went with that – more on that later.

The other problem is working around the MSS clamp on MikroTik while still eventually sending the traffic through it. Using a WireGuard tunnel was also not an option, since it has lower MTU (at least by default, but avoiding hidden fragmentation sounds like a good idea anyway, unless I want fragmentation). At one point I thought about writing a trivial tunnel that would fragment and reassemble stuff, but that seemed like work.

But hey, PPP can deal with almost any medium, right? PPP-over-SSH would work (at one point the SSH was supposed to run over WireGuard)… Well, luckily I remembered that OpenSSH can create tun (L3) tunnels (ssh -w), and guess what: SSH runs on TCP which creates an illusion of stream transport, so this is not influenced by MSS clamping on MikroTik. And the tunnelled traffic is encrypted and authenticated, so it can't change en route. I still have no idea whether this is TCP-in-TCP or not, but this is not about performance (and the setup below is really not about performance nor sanity), so there is no need to care about that.

And naturally, the tunnel endpoint is going to be my VPS, because it is the most readily available machine that I have root access to and that has an MTU 1500 uplink.

Requirement 5: using as few privileges as possible

I want to be able to experiment and am used to unshare -rn giving me most of the rights I need. Therefore, I want to use root privileges as little as possible, so that I don't have to care too much about what I'm doing. Obviously, root privileges are still needed for stuff like delegating a VLAN to a namespace and creating a user-managed tun interface on the VPS.

The actual setup

TODO: drawing ^^

Overall: Omnia has PPPoE over VLAN 4000 [3] as uplink, does NA(P)T (because it does), downstream is WiFi “Test” and VLAN 4001 (with classic switching). Downstream addresses: 192.168.207.0/24 (Omnia's auxiliary subnet) with DHCP, fd37:81a4:145e:fffe::/60.

My desktop has two network namespaces, each connected to one VLAN. The client one is useful for configuring the Omnia and for the testing web browser; the PPPoE one runs the PPPoE server and the tunnel to my VPS. PPPoE has addresses 192.168.201.{1,2} from the desktop's auxiliary subnet and no IPv6, because the ISP has no IPv6.

The above means that the containers on the desktop and the Omnia live in a network disjoint from the rest, so they cannot interact with it. (Also this meant that when my desktop crashed, I needed to add the VLAN again to get access to the Omnia :-)

The PPPoE namespace is connected with virtual ethernet to the host, IP addresses 100.64.1.0/30. This is then NA(P)T'd in the host to my regular networks (it didn't have to be, but I didn't realise :-D). And of course there is another NA(P)T on the MikroTik, which means so far we have at least 3 NA(P)Ts in the abomination (and given that the ISP uses RFC 1918 addresses on the upstream PPPoE, there has to be another NA(P)T there too).

There are tun interfaces in the PPPoE namespace and on my VPS for the actual traffic, addresses 100.64.15.{1,2}, and because the VPS has only one public IPv4 address, it also runs a NA(P)T.

So, this is what we want to create, now let's create it. From the easiest stuff in order not to overwhelm the readers:

Omnia and the client networks

This is probably the most straightforward part: just use a web browser and click in LuCI. I don't remember having an issue with killing my access to the Omnia, so I probably first created the WiFi and VLAN 4001, then connected through it and killed the other interfaces, but I don't remember that part. PPPoE has no credentials and no IPv6.

Delegating the client VLAN to a network namespace can be done by:

  1. Creating the namespace: user@desktop $ unshare -rn

  2. Finding the namespace's PID: root-in-ns@desktop # echo $$

  3. Creating and delegating the VLAN:

    root-in-host@desktop # ip link add name vl-test link en0 type vlan id 4001
    root-in-host@desktop # ip link set vl-test netns <PID>
    
  4. Configuring the netns: something like ip link set vl-test up; dhclient; … [4]

It should now be possible to ping 192.168.207.1 from the namespace. We now leave this namespace, since it is set up completely, and will only use it for checking on Omnia and eventually for testing that Discord works.
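Since the VLAN has to be re-delegated whenever the namespace goes away, step 3 can be wrapped in a small helper. This is only a sketch: the names run and delegate_vlan and the DRY_RUN variable are made up here, and the PID in the example is a placeholder. With DRY_RUN set, the commands are printed instead of executed, so nothing here needs root until it is run for real.

```shell
# Print commands instead of running them when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# delegate_vlan PARENT VLAN_ID NAME NS_PID
# Creates a VLAN sub-interface on PARENT and moves it into the network
# namespace of process NS_PID. Without DRY_RUN this needs real root on the host.
delegate_vlan() {
    run ip link add name "$3" link "$1" type vlan id "$2"
    run ip link set "$3" netns "$4"
}

# Show what would be run for the client VLAN (12345 is a made-up PID).
DRY_RUN=1 delegate_vlan en0 4001 vl-test 12345
```

The same helper covers VLAN 4000 for the PPPoE namespace by changing the VLAN ID, the interface name and the PID.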

Connections to PPPoE netns

First, virtual ethernet. Quite simple, but needs root privileges:

root-in-host@desktop # ip link add name ve_ns type veth peer ve_internet
root-in-host@desktop # ip link set ve_internet netns 18244
root-in-host@desktop # ip link set ve_ns up
root-in-host@desktop # ip addr add 100.64.1.1/30 dev ve_ns
root-in-host@desktop # nft add rule inet nat postrouting iifname ve_ns ip saddr 100.64.1.2/32 masquerade

Then we need to configure the network in the netns:

root-in-netns@desktop # ip link set ve_internet up
root-in-netns@desktop # ip addr add 100.64.1.2/30 dev ve_internet
root-in-netns@desktop # ip route add 203.0.113.25 via 100.64.1.1 dev ve_internet # where 203.0.113.25 is the IPv4 of the VPS

The static route is needed because we will need to SSH to the VPS. All other traffic will go through the SSH tunnel.

At this point we should be able to ping the VPS and even SSH there.

Delegating VLAN 4000 is the same as before, but we don't configure IP addresses – we are only interested in having the VLAN for now. So just ip link add …; ip link set … netns … as root and ip link set … up in the namespace.

We also prepare the tun0 interface for the tunnel:

root-in-netns@desktop # ip tuntap add mode tun
root-in-netns@desktop # ip addr add 100.64.15.2 dev tun0 peer 100.64.15.1
root-in-netns@desktop # ip route add default via 100.64.15.1

Keep the interface down for now [TODO i think]

Last, we need to tell the namespace to forward packets:

root-in-netns@desktop # sysctl net.ipv4.conf.all.forwarding=1

Note that this would also be required on the host, but in my case it already is a router, so there is no need to run the command there. It would not hurt, though.

Preparation of VPS

Here we only need to prepare the tun interface and set up the routing table and NAT. Everything as root:

root@vps # ip tuntap add mode tun user ledoian
root@vps # ip addr add 100.64.15.1 dev tun0 peer 100.64.15.2
root@vps # ip route add 192.168.201.0/30 via 100.64.15.2
root@vps # nft add chain inet nat tmp-masq \{ type nat hook postrouting priority srcnat\; \}
root@vps # nft add rule inet nat tmp-masq iifname "tun0" oifname "eth0" ip saddr \{ 100.64.15.2, 192.168.201.0/30 \} masquerade
root@vps # sysctl net.ipv4.conf.tun0.forwarding=1
root@vps # sysctl net.ipv4.conf.eth0.forwarding=1
root@vps # ip link set tun0 up
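Given requirement 4, it may help to have the reverse of the above at hand. The following is a sketch under assumptions, not something taken from the original setup: the names run and vps_teardown and the DRY_RUN variable are made up, and the forwarding sysctls are deliberately left alone because their previous values are unknown.

```shell
# Print commands instead of running them when DRY_RUN is set.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# Undo the VPS preparation. Deleting tun0 also drops its address and the
# route through it. The chain is flushed first, because deleting a
# non-empty chain may be refused.
vps_teardown() {
    run nft flush chain inet nat tmp-masq
    run nft delete chain inet nat tmp-masq
    run ip tuntap del mode tun dev tun0
}

DRY_RUN=1 vps_teardown    # only prints the three commands
```

Run without DRY_RUN as root on the VPS to actually remove the chain and the interface.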

The interfaces are set up, now we only need to bring up the PPPoE and SSH tunnels.

SSH tunnel

We now reach the smaller of the pain points. The goal is simple: run ssh -w 0:0 -i ~ledoian/.ssh/id_rsa.pub -l ledoian 203.0.113.25 from the namespace and have it create a working tunnel. Note how we need to specify the path to the key and username, because the namespace thinks we are root and not ledoian. Also, -4 might be useful if using a hostname with AAAA DNS record and not IPv4 address directly.

This would not be called a pain point if it worked outright… Turns out the VPS needs to have tunnelling allowed, so put this into /etc/ssh/sshd_config:

Match User ledoian
    PermitTunnel point-to-point

And now it should work, but won't anyway. For some reason I don't understand, the SSH tunnel is very wonky and will only work under certain circumstances, including but not limited to:

  • The tunnel is unused on both ends
  • The tunnel is up on both ends
  • The user on both ends has the right to manipulate the tunnel
  • If the tunnel was running too long/too idle, it just dies even though the rest of the connection works [TODO iirc]

This means that whenever I was testing this, I needed to [TODO do what]. Naturally, the output of the ssh, maybe improved by -vv helps somewhat.
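One thing that might reduce the babysitting is letting ssh notice a dead tunnel by itself and restarting it in a loop. The sketch below uses the OpenSSH client options ServerAliveInterval, ServerAliveCountMax and ExitOnForwardFailure; whether they cure the wonkiness described above is an assumption, not something verified in this setup. The function names and the key path are placeholders.

```shell
# Build the ssh command line for the tun tunnel with keepalives enabled.
# ServerAliveInterval/ServerAliveCountMax make ssh exit when the server stops
# answering; ExitOnForwardFailure makes it exit if the tunnel cannot be set up.
tunnel_cmd() {
    echo ssh -w 0:0 \
        -o ServerAliveInterval=10 -o ServerAliveCountMax=3 \
        -o ExitOnForwardFailure=yes \
        -i "$1" -l "$2" "$3"
}

# Restart the tunnel whenever ssh exits (not called here; start it by hand).
tunnel_loop() {
    while true; do
        $(tunnel_cmd "$@")    # word splitting is intended; paths must not contain spaces
        sleep 5
    done
}

# Print the command line that the loop would run.
tunnel_cmd /path/to/key ledoian 203.0.113.25
```

Calling tunnel_loop with the same three arguments inside the PPPoE namespace would then keep retrying until interrupted.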

In the end, it should be possible to ping all of the public IPv4 internet from the namespace.

PPPoE, finally

This turned out to be a complete cat and mouse game (luckily, not an endless one). I don't know many PPPoE server implementations, so I found rp-pppoe in my repos and decided to go with that. Its idea is simple: do the PPPoE discovery and then hand the channel over to regular pppd to handle the connection.

First issue: it won't run as root. It prints an error message to the syslog and then exits (I did not find a way to run it in the foreground). And the network namespace, when created with unshare -rn, has pretty much only the root account, which is not actually privileged, but geteuid(2) still says 0.

OK, I found the source code, patched the exit away and recompiled. Why do you have to do this? I know what I am doing…

The patch:

diff --git a/src/common.c b/src/common.c
index ca4c1b2..48be974 100644
--- a/src/common.c
+++ b/src/common.c
@@ -167,11 +167,11 @@ switchToRealID (void) {
        if (saved_gid == (uid_t) -2) saved_gid = getegid();
        if (setegid(getgid()) < 0) {
            printErr("setgid failed");
-           exit(EXIT_FAILURE);
+           //exit(EXIT_FAILURE);
        }
        if (seteuid(getuid()) < 0) {
            printErr("seteuid failed");
-           exit(EXIT_FAILURE);
+           //exit(EXIT_FAILURE);
        }
     }
 }
@@ -190,11 +190,11 @@ switchToEffectiveID (void) {
     if (IsSetID) {
        if (setegid(saved_gid) < 0) {
            printErr("setgid failed");
-           exit(EXIT_FAILURE);
+           //exit(EXIT_FAILURE);
        }
        if (seteuid(saved_uid) < 0) {
            printErr("seteuid failed");
-           exit(EXIT_FAILURE);
+           //exit(EXIT_FAILURE);
        }
     }
 }
@@ -228,7 +228,7 @@ dropPrivs(void)
     }
     if (ok < 2) {
       printErr("unable to drop privileges");
-      exit(EXIT_FAILURE);
+      //exit(EXIT_FAILURE);
     }
 }

Second issue: the compilation flags work in a weird way, so I found no way of having /etc/ppp/pppoe-server-options in another (user-writable) location. Gah, I just edited that one as host-root.

Third issue: pppd on the Omnia and the rp-pppoe-managed pppd in the namespace could not agree on parameters. I think there wasn't anything actionable in the logs, but wiresharking the interface and a bit of guesswork led me to problems with compression negotiation (the desktop offered some, the Omnia rejected it).

In the end: the /etc/ppp/pppoe-server-options that worked (there are some extra options from testing):

noauth
mru 1492
noipv6
password aaaa
show-password
user test-tr-secret
nobsdcomp
nodeflate
nopcomp
novj
novjccomp

And the invocation: /tmp/rppppoe/sbin/pppoe-server -I test-pppoe -C Zirconium -L 192.168.201.1 -R 192.168.201.2 -N 1. On any change, killall pppoe-server, try again, wait until Omnia's pppd times out and the restart delay passes.

At least once this worked, it connected quickly and stayed connected, unlike the SSH tunnel.

Working PPPoE can be verified by pinging 192.168.201.1 from the end-user namespace or WiFi. And if the SSH tunnel works, even the public IPv4 internet should be reachable, as we wanted in the first place.

The result

To test that the PMTU is as expected (pick any host that does reply to UDP with Destination Unreachable – not as likely as you'd think – or wait until 30 hops time out):

root-in-ns@desktop # tracepath -n 192.0.2.55
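To pull just the final number out of the output, something like the following could be used. It is a sketch that assumes the iputils summary line of the form "Resume: pmtu N hops H back B"; the function name final_pmtu and the sample line fed to it are made up for illustration.

```shell
# Extract the final path MTU from tracepath output read on stdin.
# Assumes the summary line looks like "Resume: pmtu N hops H back B".
final_pmtu() {
    awk '$1 == "Resume:" { for (i = 2; i < NF; i++) if ($i == "pmtu") print $(i + 1) }'
}

# Against a real host (needs network access):
#   tracepath -n 192.0.2.55 | final_pmtu
# Against an illustrative, made-up summary line:
echo "     Resume: pmtu 1492 hops 12 back 12" | final_pmtu
```

In the test environment the expected value is 1492, the MTU of the test PPPoE link, provided nothing further along the path is narrower.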

Of course no issue with Discord nor anything else showed up. But now I have a guide on how to do that again, so if this saves somebody some despair, it helped :-) Also, I just love to brag about the horrors in my network :-D

Also I gained some experience with the other side of PPPoE (and with SSH tunnels), which also can't hurt.


[1]TODO
[2]TODO
[3]TODO
[4]TODO