wkz

/* You are not expected to understand this */


Marvell LinkStreet LAG Issues

While running some kselftest-like tests on a system made up of three Marvell LinkStreet chips, I uncovered a series of issues related to offloaded link aggregates (LAGs) that are spread across multiple chips.

The system has the following layout:

                    .--0-1-2-3-4--.
              .-----a     sw1     |
        .-0-1-4-.   '--5-6-7-8-9--'   sw0: 6353 (Agate)
CPU +---6  sw0  |                     sw1: 6097 (Opal+)
        '-2-3-5-'   .--0-1-2-3-4--.   sw2: 6097 (Opal+)
              '-----a     sw2     |
                    '--5-6-7-8-9--'

Software-wise, the system is running NetBox with a 5.18 Linux kernel, using the mv88e6xxx driver to control the switch chips.

Inconsistent Hashing

After creating a LAG consisting of sw0p1 and sw1p1, I was running a test where multicast was to be flooded out through the LAG. During that test, I observed that some groups were correctly flooded, while others were not.

If a packet was forced out of the LAG port (by injecting a FROM_CPU packet on the CPU port), it was correctly received on the other side. So I knew that the problem was in the forwarding plane of the switch.

Previously, I had run the same test, but with sw1p0 and sw2p4 as the LAG ports, without any issues. Guessing that the issue might have something to do with asymmetric hashing, I generated some test packets with nemesis:

for i in $(seq 0 255); do
	nemesis ethernet -c 1 -d eth0 -M 01:00:de:ad:00:$(printf '%2.2x' $i) \
		-T 0xbbbb -P <(echo testing $i)
done

On the receiver I observed the following results:

# Groups  # Copies  Interpretation
      64         0  Both switches assume that the designated port is on the other switch
     128         1  Switches are in agreement
      64         2  Both switches determine that the local port is the designated one

Disabling the hashing (by clearing Global1:Reg7:Bit11) causes the devices to fall back to a simple XOR-based port selection. In this mode of operation, when sending the same test packets, exactly one copy of each of the 256 packets is received.
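To make the failure mode concrete, here is a small Python model of designated-port selection for a two-member cross-chip LAG. The two bucket functions are invented for illustration (they are not the real Agate or Opal+ hash functions), and so is the assumption that even buckets select the first member and odd buckets the second. The model only shows that two chips disagreeing on the bucket reproduces the 0/1/2-copy pattern, while chips that agree always deliver exactly one copy.

```python
from collections import Counter


# Hypothetical bucket functions; NOT the real Agate/Opal+ hashes.
def bucket_a(da_last: int, sa_last: int) -> int:
    """3-bit bucket from bits 0..2 of DA^SA (stand-in for the first chip)."""
    return (da_last ^ sa_last) & 0x7


def bucket_b(da_last: int, sa_last: int) -> int:
    """3-bit bucket from bits 1..3 of DA^SA (stand-in for the second chip)."""
    return ((da_last ^ sa_last) >> 1) & 0x7


def copies(da_last: int, sa_last: int, fn_first, fn_second) -> int:
    """Number of copies the receiver sees for one flooded frame.

    Assumption: even buckets select the first LAG member and odd buckets
    the second. Each chip transmits only if it believes that its own
    local member is the designated port.
    """
    first_sends = fn_first(da_last, sa_last) % 2 == 0
    second_sends = fn_second(da_last, sa_last) % 2 == 1
    return int(first_sends) + int(second_sends)


def histogram(fn_first, fn_second, sa_last: int = 0x00) -> Counter:
    """Map 'copies received' to 'number of groups' over 256 group addresses."""
    return Counter(copies(da, sa_last, fn_first, fn_second) for da in range(256))


mismatched = histogram(bucket_a, bucket_b)  # chips disagree on the bucket
matched = histogram(bucket_a, bucket_a)     # chips agree on the bucket

print(dict(sorted(mismatched.items())))  # {0: 64, 1: 128, 2: 64}
print(dict(sorted(matched.items())))     # {1: 256}
```

Any pair of bucket functions whose low bits are statistically independent gives roughly the same 25/50/25 split, which is why the observed distribution points at a hash mismatch rather than at a misconfigured trunk mask.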

Conclusion: The Agate uses a different hash function than the one on the Opal+.

Suggested Solution

Opportunistically enable hashing until a configuration is encountered which prevents it. As a first step, you could disable hashing as soon as a cross-chip LAG is detected.

To support even more cases, you could also keep hashing enabled as long as all cross-chip LAGs are set up between compatible chips. This would require some more information from Marvell though, i.e. the groups of chips using the same hash function.
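The policy described above could be sketched roughly as follows. This is a Python model rather than driver code, and every name in it (`hashing_allowed`, `chip_model`, `compatible_groups`) is hypothetical. The compatibility group used in the last call is only a placeholder based on the observation that the two 6097 chips behaved consistently; the real groups would have to come from Marvell.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Port = Tuple[int, int]  # (switch index, port number)


def hashing_allowed(
    lags: Iterable[List[Port]],
    chip_model: Dict[int, str],
    compatible_groups: Iterable[FrozenSet[str]] = (),
) -> bool:
    """Return True if hash-based member selection can stay enabled.

    A LAG is cross-chip when its members live on more than one switch.
    Such a LAG tolerates hashing only if all chips involved belong to a
    single group of models assumed to share a hash function. With no
    groups given, any cross-chip LAG disables hashing, which is the
    conservative first step.
    """
    groups = list(compatible_groups)
    for members in lags:
        switches = {sw for sw, _port in members}
        if len(switches) <= 1:
            continue  # single-chip LAG: hashing is always consistent
        models = {chip_model[sw] for sw in switches}
        if not any(models <= group for group in groups):
            return False
    return True


models = {0: "6353", 1: "6097", 2: "6097"}

# Single-chip LAG on sw0: hashing can stay on.
print(hashing_allowed([[(0, 1), (0, 2)]], models))  # True
# sw0p1 + sw1p1 with no known compatibility groups: hashing goes off.
print(hashing_allowed([[(0, 1), (1, 1)]], models))  # False
# sw1p0 + sw2p4, assuming identical models share a hash function.
print(hashing_allowed([[(1, 0), (2, 4)]], models, [frozenset({"6097"})]))  # True
```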

UPDATE: It turns out that this issue is a side-effect of a silicon bug in the Agate, for which an easy workaround exists. An undocumented field Global1/Reg10/Bit0-1 has an incorrect value when the Agate comes out of reset.

In more recent chips of the same family, this field specifies the hash mode to use for ATU bucket selection, where 1 is “Default” and 3 is “Direct” (0 and 2 are reserved).

The Agate has a reset value of 0 whereas Peridot and Amethyst both reset to 1. Setting the value to 1 resolves the issue.

Thus, it appears that the field in question, in addition to influencing the hash function used for ATU bucket selection, also affects the hash function used for LAG member selection.
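The workaround amounts to a read-modify-write of a two-bit field. A minimal sketch of the bit manipulation, assuming a 16-bit register and leaving all other bits untouched (the constant and function names are made up for this example, and the register access itself is out of scope):

```python
HASH_MODE_MASK = 0x0003   # bits 0-1 of Global1/Reg10, per the text above
HASH_MODE_DEFAULT = 0x1   # "Default" on the more recent chips


def fix_hash_mode(reg_value: int) -> int:
    """Return the 16-bit register value with the hash mode field set to 1.

    All other bits are preserved, since their meaning is outside the
    scope of this sketch.
    """
    return (reg_value & ~HASH_MODE_MASK & 0xFFFF) | HASH_MODE_DEFAULT


# Arbitrary example value whose low two bits are 0, mimicking the
# Agate's reset state for this field.
print(hex(fix_hash_mode(0x01F0)))  # 0x1f1
```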

DSA Tag Trunk Bit Override

Because of the way mv88e6xxx handles port isolation (full disclosure: this is my fault), frames assigned to VID 0 ingressing on DSA ports are trapped to the CPU using the VTU policy feature.

This avoids a whole slew of issues where intermediate switches may be confused by looking up the DA in the ATU. Unfortunately it also means that the original DSA tag, which looked something like this…

FORWARD dev:2 port:0 trunk:yes

…is rewritten to:

TO_CPU  dev:2 port:0 code:policy-trap

Since there is no trunk bit in the TO_CPU tag, the original source information is lost.
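The information loss can be illustrated with a deliberately simplified model of the two tag types. The classes below mirror only the fields shown in the examples above; they are not the real DSA tag layout, and `policy_trap` is a hypothetical stand-in for what the switch does when it traps the frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwardTag:
    dev: int
    port: int    # physical port, or LAG ID when trunk is set
    trunk: bool


@dataclass(frozen=True)
class ToCpuTag:
    dev: int
    port: int
    code: str    # note: this tag type has no trunk field


def policy_trap(tag: ForwardTag) -> ToCpuTag:
    """Model of the rewrite applied when a VID 0 frame is trapped."""
    return ToCpuTag(dev=tag.dev, port=tag.port, code="policy-trap")


from_lag = ForwardTag(dev=2, port=0, trunk=True)    # came in via LAG 0
from_port = ForwardTag(dev=2, port=0, trunk=False)  # came in via port 0

# Two different sources collapse into the same tag at the CPU.
print(policy_trap(from_lag) == policy_trap(from_port))  # True
```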

Fortunately, this is only an issue for LAGs in standalone mode - as soon as a LAG is added to a bridge, no packets will ever be assigned to VID 0. If a standalone LAG is required, you can create one using a mode that cannot be offloaded (e.g. balance-rr for bond interfaces). In that case, the DSA layer in the kernel will fall back to a software LAG and everything will work as expected.

Conclusion: The VTU policy feature can’t coexist with offloaded standalone LAGs.

Suggested Solution

When a LAG is created, accept the offload in the mv88e6xxx driver if it is supported, but do not actually push the configuration down to hardware until the LAG is attached to a bridge. Before that point, the “offloading” does very little anyway. The reason you want a LAG offloaded is so that packets can be hardware switched in and out of it, and in standalone mode all packets have to pass through the CPU regardless.
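The deferred approach boils down to tracking two independent facts per LAG and programming the hardware only when both hold. A minimal Python sketch of that bookkeeping, with hypothetical method names that loosely echo join/leave events and are not the driver's actual callbacks:

```python
class DeferredLagOffload:
    """Accept LAG offload immediately, program hardware only when bridged."""

    def __init__(self):
        self.accepted = set()     # LAGs the driver has agreed to offload
        self.bridged = set()      # LAGs currently attached to a bridge
        self.in_hardware = set()  # LAGs actually configured in the switch

    def lag_join(self, lag):
        self.accepted.add(lag)
        self._sync(lag)

    def lag_leave(self, lag):
        self.accepted.discard(lag)
        self._sync(lag)

    def bridge_join(self, lag):
        self.bridged.add(lag)
        self._sync(lag)

    def bridge_leave(self, lag):
        self.bridged.discard(lag)
        self._sync(lag)

    def _sync(self, lag):
        """Push or remove the hardware config to match the desired state."""
        if lag in self.accepted and lag in self.bridged:
            self.in_hardware.add(lag)      # stand-in for programming the LAG
        else:
            self.in_hardware.discard(lag)  # stand-in for tearing it down


state = DeferredLagOffload()
state.lag_join("lag0")
print("lag0" in state.in_hardware)  # False: standalone, nothing programmed
state.bridge_join("lag0")
print("lag0" in state.in_hardware)  # True: bridged, now offloaded
state.bridge_leave("lag0")
print("lag0" in state.in_hardware)  # False: back to standalone
```

Funnelling every event through a single `_sync` step keeps the result independent of the order in which the LAG and bridge events arrive.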

ATU Trunk Bit Inheritance

When mv88e6xxx adds a static FDB entry to the ATU, it will reuse any existing ATU entry for the DA in question (most likely because this makes the code reusable for MDB operations).

Unfortunately, the Trunk bit of the ATU entry is not cleared before the entry is written back with the new port information.

Example: We start with the following bridge setup:

  br0
  / \
 / lag0  lag1
'   /\    /\
0  1  2  3  4
   |  '--'  |
   '--------'

Then we send a packet from lag1, which is physically looped back to lag0, where it is received and learned by the ATU.

Now let’s say that we swap the roles of the LAGs, i.e. we connect lag1 to the bridge and keep lag0 as a standalone interface.

At this point, the bridge will want to add a static entry for the MAC address of lag1, pointing towards the CPU port. It will then find the existing dynamic entry from the previous configuration, override the port and state, and then write it back to the ATU. But since the trunk bit is not cleared, you now end up with a static entry pointing towards a non-existing LAG 0x400 (for the common case where the CPU port is 11).

Conclusion: The trunk bit must always be cleared when updating an existing ATU entry.

Suggested Solution

Do that.
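A minimal sketch of the bug and the fix, using a simplified Python model of an ATU entry rather than the driver's actual data structures. The field names, the string-valued state and the MAC address are assumptions made for illustration; the CPU port is taken from sw0 in the layout at the top.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AtuEntry:
    mac: str
    trunk: bool
    portvec: int  # port bitmap, or LAG ID when trunk is set
    state: str


def add_static_buggy(existing: AtuEntry, port: int) -> AtuEntry:
    """Reuse the entry: port and state are overridden, trunk is inherited."""
    return replace(existing, portvec=1 << port, state="static")


def add_static_fixed(existing: AtuEntry, port: int) -> AtuEntry:
    """Same as above, but the trunk bit is always cleared first."""
    return replace(existing, trunk=False, portvec=1 << port, state="static")


# Dynamic entry learned on LAG 0 (hypothetical MAC address).
learned = AtuEntry(mac="02:00:00:00:00:01", trunk=True, portvec=0, state="dynamic")
cpu_port = 6  # sw0's CPU port in the layout at the top

bad = add_static_buggy(learned, cpu_port)
good = add_static_fixed(learned, cpu_port)

print(bad.trunk, hex(bad.portvec))    # True 0x40  -> read as LAG ID 0x40
print(good.trunk, hex(good.portvec))  # False 0x40 -> read as port 6
```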