After the servers were connected, we were getting complaints about intermittent connectivity. I would do a continuous ping against one of the servers, and it would randomly drop one or two packets out of every 20 or so, and then every 10 minutes it would completely lose connectivity for about 30 seconds. I logged into all of the switches at the new data center, turned on the terminal monitor, looked at the logs, and saw all of the switches reporting “MAC flapping”, and then disabling MAC learning on the affected VLANs for 3 minutes.
Because everything is redundantly connected, with Spanning-Tree blocking the redundant links, MAC learning was being disabled on different switches at different times. That is why the pings would continue but lose one or two packets while RSTP reconverged. Since the switches would block MAC learning for 3 minutes at a time, I surmised that the 30-second losses of connectivity happened when these “rolling VLAN blackouts” overlapped and hit all of the switches at the same time.
My boss and I still weren’t sure what was going on, because I configured the ports facing these systems on the new switches identically to the ports on the old switches, where we had no apparent issues. As a point of troubleshooting, we decided to cut the HSRP active gateway over to one of the new core switches, which we hadn’t done yet because our primary WAN connectivity was still at the old data center.
Two things ended up working against us here. The first was that because we were under pressure to get things working correctly ASAP (despite this happening on a Saturday night), I pasted the configuration to cut over all of the HSRP active gateways simultaneously (lesson learned: unless you’re dealing with hundreds of VLANs, don’t do this! Do it one at a time, or in small groups). The second thing was me not fully appreciating what disabled MAC learning really means with regard to HSRP interaction. If it had been happening only on the leaf switches, it would have been less severe, but because it was happening across the entire STP domain at the new data center, when I cut over the HSRP gateways, one of the new cores would be active only until MAC learning was disabled for that particular VLAN.
This ended up causing an HSRP “split-brain” scenario between the core switches at the old data center and the new data center. It caused enough of a problem that the Metro-E switch connecting the two data centers completely froze, so I had to make the trip across town to the new data center and power-cycle it (the switch belongs to the service provider, not us, so we had no way to access it remotely). During that time, we put the HSRP active gateway back on the old core switch.
I had to wake up the iSeries admin and get him to show me his networking configuration for the servers. Finally, things started to make sense. Going into this, because the configuration of the old ports and new were identical, we could not have foreseen these particular issues, especially since things worked just fine on the 6500s. What his network config revealed, as I’m sure you’ve guessed by now, is that not only are all of the links bundled together on the server side, they’re in a static EtherChannel; LACP was not used.
This is what caused the Nexus switches to report MAC flapping and temporarily disable MAC learning. That means the issue we experienced was occurring from the very beginning. It’s just that the 6500s handled the issue by not reporting it to us. The 6500s basically “just dealt with it” as the MAC addresses would flap between the links. But the Nexus switches called us out on it, and made sure we did things right.
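To make the flapping concrete, here is a minimal Python sketch of what a switch's MAC table sees when a server spreads frames across bundled links while the switch treats those links as separate ports. Everything in it is made up for illustration: the MAC address, the port names, the `Po10` port-channel, and the simple round-robin distribution of frames. It is a toy model, not how any particular switch implements MAC learning or flap detection.

```python
from collections import defaultdict

def count_mac_moves(frames):
    """Count how often each source MAC moves between ports.

    frames: iterable of (src_mac, ingress_port) tuples in arrival order.
    """
    table = {}                 # MAC -> port it was last learned on
    moves = defaultdict(int)   # MAC -> number of times it changed port
    for mac, port in frames:
        previous = table.get(mac)
        if previous is not None and previous != port:
            moves[mac] += 1
        table[mac] = port
    return dict(moves)

# Hypothetical server MAC sending over four bundled NICs in turn
server_mac = "00:00:5e:00:53:01"
frames = [(server_mac, f"Eth1/{1 + i % 4}") for i in range(8)]

# Switch side with no bundle: every frame lands on a "different" port
unbundled = count_mac_moves(frames)

# Switch side with a port-channel: all four members map to one logical port
bundle = {f"Eth1/{n}": "Po10" for n in range(1, 5)}
bundled = count_mac_moves((mac, bundle[port]) for mac, port in frames)

print(unbundled)
print(bundled)
```

With the links unbundled, the same MAC appears to hop from port to port on almost every frame. Once the member ports are treated as a single logical port, the table sees no movement at all.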
After I created the static EtherChannels on the Nexus switches, the problem of MAC flapping was still present, but was occurring less frequently. It was during this troubleshooting session (which took place at around 3am on Sunday morning, with me running on only a couple hours of sleep) that I discovered why you ALWAYS want to use LACP and not hardcode static EtherChannels. With LACP, all links have an ID, and if things don’t line up, the ports do not become active. Not so with a static EtherChannel. But, since I’m not the iSeries admin, I have no control over that side of things (if the devices are even capable of using LACP).
What I found out was that the network interfaces on the iSeries were labeled (and assembled into EtherChannel bundles) differently than what you would expect, so the static EtherChannels were active on the Nexus side because the links were physically connected, but they were connected in the wrong order. I spent the next half hour moving cables around across many different links, by trial and error, until I got the right combination of EtherChannel bundles. Finally, the MAC flapping stopped, no pings were dropped across several minutes, and I went home and slept like the dead.
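For anyone facing the same situation, the check I was effectively doing by hand can be sketched in a few lines of Python. This is only an illustration under assumed data: the port names, bundle names, and cabling maps below are hypothetical, and the function is a paper exercise for a cabling plan, not something that talks to a switch. The idea is the same consistency test LACP would have performed for me: every member of a switch-side bundle has to land in the same bundle on the server side.

```python
def find_miscabled_bundles(switch_bundles, cabling, server_bundles):
    """Return switch port-channels whose members reach more than one server bundle.

    switch_bundles: port-channel name -> list of switch member ports
    cabling:        switch port -> server port it is physically patched to
    server_bundles: server bundle name -> list of server member ports
    """
    server_port_to_bundle = {
        port: name for name, ports in server_bundles.items() for port in ports
    }
    problems = {}
    for po, members in switch_bundles.items():
        reached = {server_port_to_bundle[cabling[m]] for m in members}
        if len(reached) > 1:
            problems[po] = sorted(reached)
    return problems

# Hypothetical layout: the server groups its NICs as 1+3 and 2+4
switch_bundles = {"Po10": ["Eth1/1", "Eth1/2"], "Po11": ["Eth1/3", "Eth1/4"]}
server_bundles = {"bundleA": ["nic1", "nic3"], "bundleB": ["nic2", "nic4"]}

# Cabled "in order", which looks right but splits each bundle
as_found = {"Eth1/1": "nic1", "Eth1/2": "nic2", "Eth1/3": "nic3", "Eth1/4": "nic4"}
# After moving cables so each port-channel reaches a single server bundle
as_fixed = {"Eth1/1": "nic1", "Eth1/2": "nic3", "Eth1/3": "nic2", "Eth1/4": "nic4"}

print(find_miscabled_bundles(switch_bundles, as_found, server_bundles))
print(find_miscabled_bundles(switch_bundles, as_fixed, server_bundles))
```

A static EtherChannel comes up in both cases because the links are physically connected. Only the second cabling passes the check.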