The problem
There would be nothing wrong with a mixed-MTU environment if path MTU discovery
(PMTUD) worked. But the culprit is TCP with its Maximum Segment Size (MSS)
option in SYN packets. At least in my Linux stacks, the value of MSS does not
seem to reflect the PMTU, but only the MTU of the interface. This means that it
does not cope well with the MTU getting smaller along the way: the segments sent
would be too big for that link, which would in turn discard the packet and
hopefully send the ICMP “Fragmentation Needed and Don't Fragment was Set”
reply (IPv4) or the ICMPv6 “Packet Too Big” reply
(as most often packets don't fragment en route even in IPv4). This reply might
not get delivered, because especially in IPv4 it was/is quite common to drop
ICMP traffic altogether.
The result is that internet connections are semi-working: it is usually
possible to establish the connection, but then big packets from some sites
(those not receiving ICMP, I suspect) get blackholed (and 1460 bytes is not that
much in today's web). Luckily, this seems to be stable: the same websites
consistently fail to load (for me this is e.g. duckduckgo.com and
archive.org).
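To make the 1460 figure concrete, here is a small sketch of the arithmetic. It assumes IPv4 and TCP headers without options (20 bytes each) and the usual 8 bytes of PPPoE overhead; the function names are made up for this example.

```shell
#!/bin/sh
# Header sizes without options.
IPV4_HEADER=20
TCP_HEADER=20
PPPOE_OVERHEAD=8   # 6-byte PPPoE header + 2-byte PPP protocol field

# Largest TCP payload that fits into one unfragmented IPv4 packet.
mss_for_mtu() {
    echo $(( $1 - IPV4_HEADER - TCP_HEADER ))
}

# MTU left for IP when PPPoE rides inside a standard Ethernet frame.
pppoe_mtu() {
    echo $(( $1 - PPPOE_OVERHEAD ))
}

mss_for_mtu 1500                  # what a host on plain Ethernet advertises
mss_for_mtu "$(pppoe_mtu 1500)"   # what actually fits through the PPPoE link
```

This prints 1460 and 1452: a host that only looks at its own interface advertises 8 bytes more than the PPPoE hop can carry.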
Note: it is possible to find the PMTU with tracepath. Kudos to the person
at Stack Overflow [TODO link] from whom I learned this.
Common workarounds
The easiest workaround is to set (usually using DHCP) the MTU of the local
network a bit lower (either to 1492 directly, or maybe even to 1400 to have
some headroom for similar issues). This works correctly.
The more common workaround is MSS clamping, that is, rewriting MSS in
SYN packets to account for lower (link-)MTU en route. This is a hack, which
will not work with other transport protocols, but is very common nevertheless.
(In MikroTik, this is configured with /ppp/profile/set <ID> change-tcp-mss=yes
and for me it was the default.)
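For comparison, on a Linux router the same clamp can be written as an nftables rule. This is only a sketch, not what any of my devices actually run: it assumes a table "inet filter" with a chain "forward" already exists, and the run helper (my addition) only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; execute them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Rewrite the MSS option of forwarded SYN packets so that it matches
# the MTU of the route the packet takes.
# Assumption: table "inet filter" and chain "forward" already exist.
clamp_mss() {
    run nft add rule inet filter forward \
        tcp flags syn tcp option maxseg size set rt mtu
}

clamp_mss
```

The rule only touches TCP SYN packets, which is exactly why the trick does nothing for other transport protocols.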
And then there is “Packetisation-layer Path MTU Discovery”, with a lovely
abbreviation of PLPMTUD, specified in RFC TODO, and a similar hack for
connection-less [TODO check] communication called “Datagram Packetisation-layer
Path MTU Discovery” or DPLPMTUD. I call these hacks, because their
motivation is to work around ICMP blackholes [TODO cloudflare link] – in my
opinion, PMTUD should just work in the first place.
The last option is to not have the problem at all: use baby-giants so that the
PPPoE link itself has an MTU of 1500.
In my case, I knew two things: with my MikroTik router everything worked, while
with my OpenWRT-based Omnia it didn't. At the time I was unaware of MSS
clamping and did not have the network ready for baby-giants. Then I became
aware of MSS clamping and turned it on, but for some inexplicable reason
Discord still didn't work, and I have no idea why. Later it started
working, so I think it was not an issue on my side, but idk. (I still need to
check whether baby-giants would work.)
But Discord not working made me dig into the issue, which meant creating a test
environment for my Omnia, while having the MikroTik be the main gateway.
The madness: creating the test environment
There are four obvious requirements for what should be built:
- The Omnia should have a configuration as close to production as possible,
- it has to be possible to open a web browser in the test environment,
- uplink to the internet has to have MTU 1500 and
- it should be possible to destroy the environment as easily as possible.
The fourth requirement hints at me using random VLANs and network namespaces on
a random Linux machine, which obviously is my desktop. The second requirement
is fulfilled by the first, because the Omnia is an access point anyway, so it
is possible to connect a laptop to it. VLANs work too, so another container on
the desktop connected to the downstream interface of Omnia does the job as well.
The tricky and horrible part is the third requirement, because it means that
I need a PPPoE server and have to somehow work around the MikroTik, which clamps
MSS for the rest of the network. I need to create the bug and not have it
accidentally fixed.
It took me quite some thinking about how to build this. I knew MikroTik could
be a PPPoE server (and there is a guide on building that), but I
also have no experience with VRF, so I would need to learn that too. And having
multiple routing tables on MikroTik did not sound particularly enjoyable.
However, I did follow the guide and had the PPPoE server on MikroTik set up
(takeaway: TODO passwords and usernames). Then I realised I could have my
desktop run PPPoE server (“it's just software running on general purpose
hardware, right?”), so I went with that – more on that later.
The other problem is working around the MSS clamp on MikroTik while still
eventually sending the traffic through it. Using a WireGuard tunnel was also
not an option, since it has lower MTU (at least by default, but avoiding hidden
fragmentation sounds like a good idea anyway, unless I want fragmentation).
At one point I thought about writing a trivial tunnel that would fragment and
reassemble stuff, but that seemed like work.
But hey, PPP can deal with almost any medium, right? PPP-over-SSH would work
(at one point the SSH was supposed to run over WireGuard)… Well, luckily I
remembered that OpenSSH can create tun (L3) tunnels (ssh -w), and guess
what: SSH runs on TCP which creates an illusion of stream transport, so this is
not influenced by MSS clamping on MikroTik. And the tunnelled traffic is
encrypted and authenticated, so it can't change en route. I still have no idea
whether this is TCP-in-TCP or not, but this is not about performance (and the
setup below is really not about performance nor sanity), so there is no need
to care about that.
And naturally, the tunnel endpoint is going to be my VPS, because it is the
most readily available machine that I have root access to and that has an MTU
1500 uplink.
Requirement 5: using as few privileges as possible
I want to be able to experiment and am used to unshare -rn giving me most
rights I need. Therefore, I want to use root privileges as little as possible,
so that I don't have to care too much about what I'm doing. Obviously, root
privileges are still needed for stuff like delegating a VLAN to a namespace and
creating a user-managed tun interface on the VPS.
The actual setup
TODO: drawing ^^
Overall: Omnia has PPPoE over VLAN 4000 as uplink, does NA(P)T
(because it does), downstream is WiFi “Test” and VLAN 4001 (with classic
switching). Downstream addresses: 192.168.207.0/24 (Omnia's auxiliary subnet)
with DHCP, fd37:81a4:145e:fffe::/60.
My desktop has two network namespaces, each connected to one VLAN. The client
one is useful for configuration of Omnia and for the testing web browser, the
PPPoE one runs the PPPoE server and the tunnel to my VPS. PPPoE has addresses
192.168.201.{1,2} from desktop's auxiliary subnet and no IPv6 because ISP has
no IPv6.
The above means that the containers on desktop and Omnia live in a network
disjoint from the rest, so they cannot interact with it. (Also this meant
that when my desktop crashed, I needed to add the VLAN again to get access to
Omnia :-)
The PPPoE namespace is connected with virtual ethernet to the host, IP
addresses 100.64.1.0/30. This is then NA(P)T'd in the host to my regular
networks (it didn't have to, but I didn't realise :-D). And of course there is
another NA(P)T on the MikroTik, which means so far we have at least 3 NA(P)Ts
in the abomination (and given that the ISP uses RFC 1918 addresses on the
upstream PPPoE, there has to be another NA(P)T there too).
There are tun interfaces in the PPPoE namespace and on my VPS for the actual
traffic, addresses 100.64.15.{1,2}, and because the VPS has only one public
IPv4 address, it also runs a NA(P)T.
So, this is what we want to create, now let's create it. From the easiest stuff
in order not to overwhelm the readers:
Omnia and the client networks
This is probably the most straightforward part – just use a web browser and
click in LuCI. I don't remember having an issue with killing my access to Omnia,
so I probably first created the WiFi and VLAN 4001, then connected with it and
killed the other interfaces, but I don't remember that part. PPPoE has no
credentials and no IPv6.
Delegating the client VLAN to a network namespace can be done by:
Creating the namespace: user@desktop $ unshare -rn
Finding the namespace's PID: root-in-ns@desktop # echo $$
Creating and delegating the VLAN:
root-in-host@desktop # ip link add name vl-test link en0 type vlan id 4001
root-in-host@desktop # ip link set vl-test netns <PID>
Configure the netns: something like ip link set vl-test up; dhclient; …
It should now be possible to ping 192.168.207.1 from the namespace. We now
leave this namespace, since it is set up completely, and will only use it for
checking on Omnia and eventually for testing that Discord works.
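The two host-root steps above can be wrapped into a small helper, since the same dance is needed for each VLAN. This is a sketch: the function name and the run helper are my additions, 12345 stands in for whatever PID echo $$ printed, and nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; execute them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# delegate_vlan PARENT VLAN_ID NAME PID
# Create VLAN interface NAME on top of PARENT and move it into the
# network namespace of process PID. Needs real root in the host.
delegate_vlan() {
    run ip link add name "$3" link "$1" type vlan id "$2"
    run ip link set "$3" netns "$4"
}

# 12345 is a placeholder for the PID of the shell inside the namespace.
delegate_vlan en0 4001 vl-test 12345
```

Bringing the interface up and configuring addresses still happens inside the namespace afterwards, without host root.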
Connections to PPPoE netns
First, virtual ethernet. Quite simple, but needs root privileges:
root-in-host@desktop # ip link add name ve_ns type veth peer name ve_internet
root-in-host@desktop # ip link set ve_internet netns 18244
root-in-host@desktop # ip link set ve_ns up
root-in-host@desktop # ip addr add 100.64.1.1/30 dev ve_ns
root-in-host@desktop # nft add rule inet nat postrouting iifname ve_ns ip saddr 100.64.1.2/32 masquerade
Then the network needs to be configured in the netns:
root-in-netns@desktop # ip link set ve_internet up
root-in-netns@desktop # ip addr add 100.64.1.2/30 dev ve_internet
root-in-netns@desktop # ip route add 203.0.113.25 via 100.64.1.1 dev ve_internet # where 203.0.113.25 is the IPv4 of the VPS
The static route is needed because we will need to SSH to the VPS. All other
traffic will go through the SSH tunnel.
At this point we should be able to ping the VPS and even SSH there.
Delegating VLAN 4000 is the same as before, but we don't configure IP addresses
– we are only interested in having the VLAN for now. So just ip link add …; ip link set … netns …
as root and ip link set … up in the namespace.
We also prepare the tun0 interface for the tunnel:
root-in-netns@desktop # ip tuntap add mode tun
root-in-netns@desktop # ip addr add 100.64.15.2 dev tun0 peer 100.64.15.1
root-in-netns@desktop # ip route add default via 100.64.15.1
Keep the interface down for now [TODO i think]
Last, we need to tell the namespace to forward packets:
root-in-netns@desktop # sysctl net.ipv4.conf.all.forwarding=1
Note that this would also be required in the host, but in my case it already is
a router, so there is no need to run that command. But it would not hurt.
Preparation of VPS
Here we only need to prepare the tun interface and set up the routing table and NAT. Everything as root:
root@vps # ip tuntap add mode tun user ledoian
root@vps # ip addr add 100.64.15.1 dev tun0 peer 100.64.15.2
root@vps # ip link set tun0 up
root@vps # ip route add 192.168.201.0/30 via 100.64.15.2
root@vps # nft add chain inet nat tmp-masq \{ type nat hook postrouting priority srcnat\; \}
root@vps # nft add rule inet nat tmp-masq iifname "tun0" oifname "eth0" ip saddr \{ 100.64.15.2, 192.168.201.0/30 \} masquerade
root@vps # sysctl net.ipv4.conf.tun0.forwarding=1
root@vps # sysctl net.ipv4.conf.eth0.forwarding=1
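Since the environment should be easy to destroy, here is a sketch of the matching teardown on the VPS. It is my reconstruction from the setup commands, not something I kept from the experiment; the run helper only prints unless DRY_RUN=0 is set, and the eth0 forwarding sysctl is deliberately left alone in case something else on the VPS relies on it.

```shell
#!/bin/sh
# Print commands by default; execute them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Undo the temporary VPS setup. Run as root on the VPS.
teardown_vps() {
    # Empty the temporary NAT chain first, then remove it.
    run nft flush chain inet nat tmp-masq
    run nft delete chain inet nat tmp-masq
    # Removing the device also drops its address and the route through it.
    run ip tuntap del dev tun0 mode tun
}

teardown_vps
```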
The interfaces are set up, now we only need to bring up the PPPoE and SSH
tunnels.
SSH tunnel
We now reach the smaller of the pain points. The goal is simple: run
ssh -w 0:0 -i ~ledoian/.ssh/id_rsa.pub -l ledoian 203.0.113.25 from the
namespace and have it create a working tunnel. Note how we need to specify the
path to the key and the username, because the namespace thinks we are root and
not ledoian. Also, -4 might be useful if using a hostname with an AAAA DNS
record instead of an IPv4 address directly.
This would not be called a pain point if it worked outright… Turns out the
VPS needs to have tunnelling allowed – put this into /etc/ssh/sshd_config:
Match User ledoian
PermitTunnel point-to-point
And now it should work, but won't anyway. For some reason I don't
understand, the SSH tunnel is very wonky and will only work under certain
circumstances, including but not limited to:
- The tunnel is unused on both ends
- The tunnel is up on both ends
- The user on both ends has the right to manipulate the tunnel
- If the tunnel was running too long/too idle, it just dies even though the rest of the connection works [TODO iirc]
This means that whenever I was testing this, I needed to [TODO do what].
Naturally, the output of ssh, maybe improved by -vv, helps somewhat.
In the end, it should be possible to ping all of the public IPv4 internet from the namespace.
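One thing that might make the tunnel less wonky is to have ssh fail loudly and send keepalives. The sketch below is an assumption on my part and not something I verified against this setup: ExitOnForwardFailure, ServerAliveInterval and ServerAliveCountMax are standard OpenSSH client options, the interval values are arbitrary, and -N assumes no interactive shell is wanted. The run helper only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; execute them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# tunnel_once HOST USER KEY
# Open the tun0<->tun0 tunnel without running a remote command.
# ExitOnForwardFailure makes ssh quit if the tun device cannot be set up,
# the ServerAlive options make it quit when the server stops answering.
tunnel_once() {
    run ssh -4 -N -w 0:0 \
        -o ExitOnForwardFailure=yes \
        -o ServerAliveInterval=15 \
        -o ServerAliveCountMax=3 \
        -i "$3" -l "$2" "$1"
}

tunnel_once 203.0.113.25 ledoian ~ledoian/.ssh/id_rsa.pub
```

With these options a dead tunnel should at least end the ssh process instead of lingering, which makes it easier to notice and restart.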
PPPoE, finally
This turned out to be a complete cat and mouse game (luckily, not an endless
one). I don't know many PPPoE server implementations, so I found
rp-pppoe in my repos and decided to go with that. Its idea is
simple: do the PPPoE discovery and then hand the channel over to regular
pppd to handle the connection.
First issue: It won't run as root. It prints an error message to the syslog
and then ends (I did not find a way to run it in foreground). And the network
namespace, when run as unshare -rn, has pretty much only root account,
which is not actually privileged, but geteuid(2) still says 0.
OK, I found the source code, patched the exit away and recompiled. Why do you
have to do this? I know what I am doing…
The patch:
diff --git a/src/common.c b/src/common.c
index ca4c1b2..48be974 100644
--- a/src/common.c
+++ b/src/common.c
@@ -167,11 +167,11 @@ switchToRealID (void) {
if (saved_gid == (uid_t) -2) saved_gid = getegid();
if (setegid(getgid()) < 0) {
printErr("setgid failed");
- exit(EXIT_FAILURE);
+ //exit(EXIT_FAILURE);
}
if (seteuid(getuid()) < 0) {
printErr("seteuid failed");
- exit(EXIT_FAILURE);
+ //exit(EXIT_FAILURE);
}
}
}
@@ -190,11 +190,11 @@ switchToEffectiveID (void) {
if (IsSetID) {
if (setegid(saved_gid) < 0) {
printErr("setgid failed");
- exit(EXIT_FAILURE);
+ //exit(EXIT_FAILURE);
}
if (seteuid(saved_uid) < 0) {
printErr("seteuid failed");
- exit(EXIT_FAILURE);
+ //exit(EXIT_FAILURE);
}
}
}
@@ -228,7 +228,7 @@ dropPrivs(void)
}
if (ok < 2) {
printErr("unable to drop privileges");
- exit(EXIT_FAILURE);
+ //exit(EXIT_FAILURE);
}
}
Second issue: the compilation flags work in a weird way, so I found no way
of having /etc/ppp/pppoe-server-options in another (user-writable)
location. Gah, I just edited that one as host-root.
Third issue: pppd on Omnia and the rp-pppoe-managed pppd in the
namespace could not agree on parameters. I think there wasn't anything
actionable in the logs, but wiresharking the interface and a bit of guesswork
led me to problems with compression negotiation (desktop offered some, Omnia
rejected it).
In the end: the /etc/ppp/pppoe-server-options that worked (there are some
extra options from testing):
noauth
mru 1492
noipv6
password aaaa
show-password
user test-tr-secret
nobsdcomp
nodeflate
nopcomp
novj
novjccomp
And the invocation: /tmp/rppppoe/sbin/pppoe-server -I test-pppoe -C Zirconium
-L 192.168.201.1 -R 192.168.201.2 -N 1. On any change, killall
pppoe-server, try again, wait until Omnia's pppd times out and the restart
delay passes.
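Because the kill-and-retry cycle gets repeated a lot, it can be wrapped in a tiny script. This is a sketch with the exact values from above; the function name and the run helper are my additions, and nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; execute them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Stop any running server and start a fresh one.
#   -I  interface to listen on (the delegated VLAN 4000 interface)
#   -C  access concentrator name announced during discovery
#   -L  local (server side) IP address of the PPP link
#   -R  first IP address handed out to clients
#   -N  maximum number of concurrent sessions
restart_pppoe_server() {
    run killall pppoe-server
    run /tmp/rppppoe/sbin/pppoe-server \
        -I test-pppoe \
        -C Zirconium \
        -L 192.168.201.1 \
        -R 192.168.201.2 \
        -N 1
}

restart_pppoe_server
```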
At least once this worked, it connected quickly and stayed connected, unlike
the SSH tunnel.
Working PPPoE can be verified by pinging 192.168.201.1 from the end-user
namespace or WiFi. And if the SSH tunnel works, even the public IPv4 internet
should be reachable, as we wanted in the first place.
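To check the MTU of the new link and not just its reachability, a ping with the Don't Fragment bit set is handy. The sketch below assumes the iputils version of ping (for -M do) and computes the payload so that the whole IPv4 packet is exactly the given size; the probe function and the run helper are my additions, and commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; execute them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

ICMP_OVERHEAD=28   # 20-byte IPv4 header + 8-byte ICMP echo header

# probe PACKET_SIZE HOST
# Send one echo request of exactly PACKET_SIZE bytes with DF set.
probe() {
    run ping -4 -c 1 -M do -s "$(( $1 - ICMP_OVERHEAD ))" "$2"
}

probe 1492 192.168.201.1   # fits into the PPPoE link
probe 1500 192.168.201.1   # too big for the PPPoE link
```

From the client namespace, the first probe should get an answer, while the second should come back with a fragmentation-needed error if the ICMP signalling works on that hop.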