The Data Center Move

Sometimes in life, the best experience comes from being in the right place at the right time. I studied enterprise networking for years while employed in the SMB space. My short time with the Flagler County school system was my first enterprise-level job, but my role as a network engineer at my present company has really been the start of my journey into “real” enterprise-level experience. In the comparatively short time I have been employed here, I have gained immense experience that I know will serve me well for the rest of my career.

My company recently migrated our 15-year-old in-house data center to a private cage in a colocation facility. Many people would view the thought of such a migration with dread; many of the people I work with did, in fact, dread it. However, I saw it from the very beginning as a rare opportunity to broaden my exposure to different issues and learn as much as I could from the process. How often does a company move its entire data center? Indeed, many people who work for a single company for many years may never experience something like this.

Technology, as always, marches on. Fifteen years ago, it made much more sense to run your own data center, if you had enough equipment to fill it. Our private in-house data center had, at one point in time, somewhere between 50 and 100 racks full of equipment, and a separate telecom room where all of the copper connections terminated. The data center was filled with many physical servers, primarily Linux machines and proprietary IBM midrange systems, as well as very large (and very heavy!) multi-rack SANs.

The telecom room also served as the demarc for various WAN technologies over the years, including T1s, DS3s, straight fiber, and at one point (before migrating to MPLS), the central hub for our former frame-relay hub-and-spoke WAN (this was well before my time here). We still have all of the original SONET muxing equipment in the wiring room, along with many shelves for the T1 cards and hundreds of pairs of wires running between racks. That in itself is something someone fairly new to enterprise-level networking might not run into very often anymore.

Fifteen years ago, 32TB of storage would take hundreds of drives spanning multiple racks of equipment. In 2016, you can get 32TB across just four drives or fewer. Soon we’ll be laughing about how it used to take multiple drives at all to get 32TB of storage. Thanks to server virtualization, hundreds of former physical boxes can now be consolidated into a blade server chassis that takes up less space than five of the former physical servers. Those former physical servers all required multiple individual connections back to the network, which usually meant chassis-based switches containing hundreds of ports. Today, you just run a couple of 10G or faster connections into a hypervisor to be shared by many VMs simultaneously.

All of this technology consolidation means that at a certain point, it no longer makes sense to run your own data center. It becomes a giant waste of space and money. With consolidation, you might end up using less electricity, but you still have the infrastructure in place, including huge cooling systems, redundant power, backup diesel generators, etc., all of which cost money to maintain whether or not they are actually used. This is the point where relocating the data center to a colocation facility makes the most business sense. When you can reduce your data center footprint to a tenth of its former size thanks to advances in technology, why continue paying for infrastructure that you no longer need?

From the networking perspective, this entire move has been full of “firsts” for me in terms of firsthand experience. Some of the things I encountered are things I read about years ago when studying for the CCNP. I remember wondering at the time whether I would ever actually run into some of those situations and get to use the knowledge I was gaining. This move has made that knowledge come to life.

One of the nicest things about migrating the data center is we get a chance to start somewhat fresh and hopefully do things correctly (or rather, more “current” from a best practices perspective). As I’ve witnessed myself, network complexity builds up over time. Temporary “band-aid” fixes unintentionally become permanent solutions. The accumulation of all these one-off problems eventually leads to many complex interdependencies that may not reveal themselves until things start to unravel, as I experienced during the move.

The new data center network is based on a leaf/spine topology, with Top-of-Rack (ToR) switches connecting to a smaller, higher-speed core. The two core switches are connected together with a 2x40G EtherChannel, and each rack with ToRs has two switches, each of which is redundantly connected back to both cores with 10G links.

The new switches are Cisco Nexus series, which I had not worked with until this point (another of the many “firsts”). After working with them for a while and discovering which commands had changed or were implemented differently compared to Catalyst, I have actually come to enjoy NX-OS and prefer the way it handles most things in the configuration. It seems more logical to me in most cases. More firsts for me included working with 10 and 40 gigabit transceivers and optics, including 4x10G fiber optic breakout cables with MPO connectors, and fiber optic shelves.
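To give a sense of what the new gear looks like to configure, here is a rough NX-OS sketch of the 2x40G EtherChannel between the two cores described above. The port-channel and interface numbers are made up for illustration, and I’m showing it with LACP; our actual configuration may differ in the details:

    feature lacp

    ! Port-channel for the 2x40G core-to-core interconnect
    ! (port-channel and interface numbers are hypothetical)
    interface port-channel1
      description 2x40G EtherChannel to other core
      switchport
      switchport mode trunk

    interface Ethernet1/49
      description 40G member link 1
      switchport
      switchport mode trunk
      channel-group 1 mode active

    interface Ethernet1/50
      description 40G member link 2
      switchport
      switchport mode trunk
      channel-group 1 mode active

One small difference from Catalyst that stands out right away: on NX-OS, features such as LACP and HSRP have to be explicitly enabled with a feature statement before their commands are even available.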

Moving to the new data center was (and still is as I write this) a process of migration. While the bulk of the move happened on a single day, much of it has been in pieces both before and after the major portion. Planning for the entire operation has been ongoing for nearly a year. The major portion consisted of moving in the servers and SANs. However, the network had to be in place to support that connectivity within the new data center, as well as still provide connectivity to the old data center while we complete the migration.

A lot of planning went into the network portion, including having all of the switches pre-configured and the network cabling pre-labeled and, where possible, pre-wired and ready to plug into the devices for instant connectivity to the rest of the network, including the old data center. I created a spreadsheet containing all of the physical interconnections, as well as device ports, switch ports, and assigned VLANs, so that we would have a reference on move day and could trace every cable to every destination if we encountered connectivity issues. We had a support person from IBM on site with us all day during the move (for our iSeries systems), and at the end of the day he told me personally that he had been part of many data center moves and had never seen one go as smoothly as this one did, with all the pieces falling into place. I wanted to ask him if he would put that in writing so I could put it on my resume :-)

We decided the best way to have seamless connectivity between the old data center and the new one during the migration was a Layer 2 Metro-E trunk from a local ISP. This let us access the new data center as if everything were still in the same room. We made the two new core switches HSRP neighbors with the old core switches, which let us shift the active gateway for each VLAN when we were ready to do so. During different parts of the migration we did have some traffic tromboning, but it stayed well within our bandwidth limits and didn’t cause any issues, not even noticeable delay, since the new data center is only about 20 miles away.
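Here is a minimal sketch of what that looked like from a new core’s perspective, assuming a hypothetical VLAN 100 with made-up addressing and priorities. The new Nexus core simply joins the existing HSRP group with a lower priority; raising the priority above the old core’s value (with preemption enabled) is what shifts the active gateway for that VLAN to the new data center:

    feature hsrp
    feature interface-vlan

    ! SVI on the new Nexus core joins the existing HSRP group for VLAN 100
    ! (VLAN, addresses, and priority are hypothetical)
    interface Vlan100
      no shutdown
      ip address 10.1.100.3/24
      hsrp 100
        ip 10.1.100.1
        priority 90
        preempt

    ! When ready to move the gateway for this VLAN, raise the priority
    ! above whatever the old core is using, for example:
    !   priority 150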

However, after the majority of the systems had been moved and were in place, we did encounter a networking issue that ended up causing me to have to run across town in the middle of the night to the new data center and power-cycle a switch.

Our newest iSeries systems were placed into production during the past few months in preparation for our move. Just as newer hypervisors consolidate many physical servers, the new iSeries servers do the same thing: a single system that used to occupy multiple racks now fits in an eighth of a single rack.

However, instead of a few 10G links, these new servers went with the old model of many 1G links (16 per server, in fact). When these servers were placed into production, they connected directly to our Catalyst 6500 core, with eight links per server going to each switch. When I configured the network to support these new servers, I was asked to make each link an 802.1q trunk, and was not asked about bonding the links together.
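For context, each of those sixteen links was configured as a plain standalone 802.1Q trunk, roughly like this on the 6500s (the interface range and VLAN list are made up for illustration):

    ! One standalone trunk per server NIC - no channel-group, no bundling
    interface range GigabitEthernet1/1 - 8
      switchport
      switchport trunk encapsulation dot1q
      switchport mode trunk
      switchport trunk allowed vlan 100,200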

Unfortunately, I did not think to ask about link bonding, which ended up causing my boss and me to spend a few hours tracking down a newly discovered issue when we migrated the servers to the new Nexus switches. This turned out to be an extremely interesting problem, and one that I will never forget.

After the servers were connected, we were getting complaints about intermittent connectivity. I would do a continuous ping against one of the servers, and it would randomly drop one or two packets out of every 20 or so, and then every 10 minutes it would completely lose connectivity for about 30 seconds. I logged into all of the switches at the new data center, turned on the terminal monitor, looked at the logs, and saw all of the switches reporting “MAC flapping”, and then disabling MAC learning on the affected VLANs for 3 minutes.

Because everything is redundantly connected, with Spanning Tree blocking the redundant links, the disabled MAC learning occurred on different switches at different times, which is why the pings would continue but lose one or two packets while RSTP reconverged. Since the switches would block MAC learning for three minutes at a time, I surmised that the 30-second periods of total connectivity loss were the moments when these “rolling VLAN blackouts” happened to overlap across all of the switches at once.

My boss and I still weren’t sure what was going on, because I configured the ports facing these systems on the new switches identically to the ports on the old switches, where we had no apparent issues. As a point of troubleshooting, we decided to cut the HSRP active gateway over to one of the new core switches, which we hadn’t done yet because our primary WAN connectivity was still at the old data center.

Two things ended up working against us here. The first was that, because we were under pressure to get things working correctly ASAP (despite this happening on a Saturday night), I pasted the configuration to cut over all of the HSRP active gateways simultaneously (lesson learned: unless you’re dealing with hundreds of VLANs, don’t do this! Do it one at a time, or in small groups). The second was that I did not fully appreciate what disabled MAC learning really means for HSRP. If it had been happening only on the leaf switches, it would have been less severe, but because it was happening across the entire STP domain at the new data center, when I cut over the HSRP gateways, one of the new cores would stay active only until MAC learning was disabled for that particular VLAN.

This created an HSRP “split-brain” scenario between the core switches at the old data center and the new one. It caused enough of a problem that the Metro-E switch connecting the two data centers completely froze, which is what sent me across town to the new data center to power-cycle it (the switch belongs to the service provider, not us, so we had no way to access it remotely). In the meantime, we moved the HSRP active gateway back to the old core switch.

I had to wake up the iSeries admin and get him to show me his networking configuration for the servers. Finally, things started to make sense. Going into this, because the configurations of the old ports and the new ones were identical, we could not have foreseen these particular issues, especially since things had worked just fine on the 6500s. What his network config revealed, as I’m sure you’ve guessed by now, is that not only were all of the links bundled together on the server side, they were in a static EtherChannel; LACP was not used.

This is what caused the Nexus switches to report MAC flapping and temporarily disable MAC learning. That means the issue we experienced was occurring from the very beginning. It’s just that the 6500s handled the issue by not reporting it to us. The 6500s basically “just dealt with it” as the MAC addresses would flap between the links. But the Nexus switches called us out on it, and made sure we did things right.

After I created the static EtherChannels on the Nexus switches, the MAC flapping was still present, but it was occurring less frequently. It was during this troubleshooting session (which took place at around 3am on Sunday morning, with me running on only a couple hours of sleep) that I discovered why you ALWAYS want to use LACP rather than hardcode static EtherChannels. With LACP, both sides exchange identifiers for each link, and if things don’t line up, the ports do not become active in the bundle. Not so with a static EtherChannel. But since I’m not the iSeries admin, I have no control over that side of things (if the devices are even capable of using LACP).
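The difference is a single keyword on the channel-group command. A sketch on the Nexus side, with made-up interface and port-channel numbers:

    feature lacp

    ! Static EtherChannel ("mode on"): no negotiation at all, so the bundle
    ! comes up as long as the links are physically connected, even if they
    ! are miswired
    interface Ethernet1/1-8
      switchport
      switchport mode trunk
      channel-group 10 mode on

    ! With LACP, the only change is the keyword at the end; ports exchange
    ! LACPDUs and only join the bundle if the far-end partner agrees:
    !   channel-group 10 mode active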

What I found out was that the network interfaces on the iSeries were labeled (and assembled into EtherChannel bundles) differently than you would expect, so the static EtherChannels came up on the Nexus side simply because the links were physically connected, even though they were connected in the wrong order. I spent the next half hour moving cables around by trial and error across many different links until I got the right combination of EtherChannel bundles. Finally, the MAC flapping stopped, no pings were dropped over several minutes, and I went home and slept like the dead.

Over the next couple of weeks, we continued to migrate more portions of the network and less-critical systems over to the new data center. One of the issues we experienced was temporary route instability due to accidentally advertising the same routes through multiple BGP ASNs because of redistribution.

The overall WAN and Internet design of our network is hub-and-spoke. We use an MPLS L3VPN service and peer with the PE routers using eBGP, so all of the spokes can talk directly to each other (which is useful for the inter-office VoIP traffic), but both the primary business data and the Internet connectivity for the entire company flow back to the hub data center.

Over time, for various reasons, we ended up with multiple MPLS routers at the old data center facing the rest of the WAN. All the MPLS routers use the same BGP ASN and speak iBGP with each other, and each peers with our MPLS provider using eBGP. Even though all the MPLS routers have equal access to the WAN (differing bandwidths aside), different routing policies had been put into place over the years for various reasons. For instance, all of our SIP sessions went over just one of the routers. We advertised the company-wide default route from two of the routers, but not all of them. We advertised our data center’s private /16 network out of all the routers, but advertised more-specific /24s for just a couple of subnets on only a couple of the routers. Nearly all of these routing patterns were established before I got here. Some of them made absolutely no sense to me, so I had to dig into their history, which often pointed back to one-off fixes that were supposed to be temporary but, of course, became permanent.

We’re primarily a Cisco shop as far as the routing and switching infrastructure goes, so we use EIGRP on the internal network. Both the new data center and the old one were connected together using the same EIGRP ASN. We perform mutual redistribution of BGP and EIGRP on all the MPLS routers and use route tagging to prevent routing loops. However, at the new data center, we used a different private BGP ASN.
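As a simplified sketch of the tagging approach (the ASN, EIGRP process number, metric, and tag value below are all arbitrary): routes redistributed from BGP into EIGRP get tagged, and anything carrying that tag is refused on the way from EIGRP back into BGP, so a route that originated in BGP is never fed back into BGP by way of EIGRP:

    ! Tag routes as they are redistributed from BGP into EIGRP
    route-map BGP-TO-EIGRP permit 10
      set tag 100

    ! Refuse anything carrying that tag on the way back into BGP
    route-map EIGRP-TO-BGP deny 10
      match tag 100
    route-map EIGRP-TO-BGP permit 20

    router eigrp 100
      redistribute bgp 65001 metric 100000 100 255 1 1500 route-map BGP-TO-EIGRP

    router bgp 65001
      redistribute eigrp 100 route-map EIGRP-TO-BGP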

I knew that because of the two different BGP ASNs, I would have to be careful about the same outbound route advertisements appearing to the rest of the network from multiple sources. So I used AS-path prepending to make some paths more preferable than others, while still allowing for redundancy in case one of the routers went down. But since all of the MPLS routers at both data centers were joined together by the same EIGRP ASN, and they were configured to do mutual redistribution of BGP and EIGRP, this ended up causing a problem that I didn’t see at the time but can now see very clearly in hindsight.
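A sketch of the prepending piece, with placeholder ASNs and a documentation-range neighbor address: the router whose path should be less preferred advertises everything with its own ASN prepended a few extra times, so the rest of the WAN prefers the other path unless it disappears:

    ! Prepend our own ASN on the less-preferred router's advertisements
    route-map WAN-OUT permit 10
      set as-path prepend 65002 65002 65002

    router bgp 65002
      neighbor 192.0.2.1 remote-as 64900
      neighbor 192.0.2.1 route-map WAN-OUT out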

The routing table seemed stable, and everything was flowing properly. Then a couple of our remote branches were having WAN troubles, and their circuits went down. When they came back up, they could no longer reach any of our data center subnets. Or more correctly, they could reach us, but we couldn’t reach them. I didn’t think that a routing loop would have occurred, because at all points of redistribution, I set and matched tags to prevent that from happening.

The part that I can now see clearly with the benefit of hindsight is that when those branches went down, their routes were removed from the overall routing table. When they came back up and the routes were re-introduced, they were learned by BGP at our old data center, redistributed into EIGRP and passed on to the new data center, and then redistributed back into BGP with a different ASN, which created a loop. But this happened only for routes that had been removed from and then re-introduced into the overall routing table.

Luckily, we caught this very quickly and corrected it with only a couple of small branches experiencing the issue. As I write this, we are still in the middle of migrating the WAN over, and the way we chose to deal with this for now is to not redistribute EIGRP back into BGP at the new data center at all, and instead just advertise the routes we want directly through BGP. True, we could have used filtering instead, but we only need this temporarily and it was a much quicker and easier way to solve the problem.
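Sketched with placeholder prefixes and the same hypothetical ASN as above, the interim fix at the new data center is simply to drop the redistribution there and advertise the handful of prefixes we care about with explicit network statements:

    router bgp 65002
      ! Stop feeding EIGRP-learned routes back into BGP at this site
      no redistribute eigrp 100
      ! Advertise only the prefixes the WAN should learn from here; each
      ! prefix must already be present in the local routing table
      network 10.2.0.0 mask 255.255.0.0
      network 10.2.10.0 mask 255.255.255.0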

Having this job has been very exciting so far, and I feel like I came onboard at just the right time. Since this is my first “real” enterprise-level networking job, I spent the first several months getting up to speed and seeing where all the things I had read about and studied fit into place. Now I’ve reached the point of being able to put the more advanced things I’ve studied into action and gain real practical experience that will propel me forward in my career.

As I said, I am well aware that most people early in their networking careers will not have the opportunity to experience something like this data center move. That is why I made sure to make the most of it and write down some of the many things I experienced and learned. The static EtherChannel problem, in particular, was a real nightmare, and I don’t think I will ever forget it. More importantly, I now know exactly what symptoms to look for and how to fix it if something like that ever happens again.

I feel like I have gained a lot of really important experience in a short amount of time. I’m very grateful for it, and I’m always continuing to study and look toward the future, both for what this career may bring me and for how I can help solve the many problems of the business.