The Data Center Move, Part 4

Over the next couple of weeks, we continued migrating more portions of the network and less-critical systems to the new data center. One issue we experienced was temporary route instability, caused by redistribution accidentally advertising the same routes through multiple BGP ASNs.

The overall WAN and Internet design of our network is hub-and-spoke. We use an MPLS L3VPN service and peer with the PE routers via eBGP, so all of the spokes can talk directly to each other (which is useful for the inter-office VoIP traffic), but both the primary business data and Internet connectivity for the entire company flow back to the hub data center.

Over time, for various reasons, we ended up with multiple MPLS routers at the old data center facing the rest of the WAN. All of the MPLS routers use the same BGP ASN, speak iBGP with each other, and peer with our MPLS provider via eBGP. Even though all the MPLS routers have equal access to the WAN (differing bandwidths aside), different routing policies had been put into place over the years for various reasons. For instance, all of our SIP sessions went over just one of the routers. We advertised the company-wide default route from two of the routers, but not all of them. We advertised our data center’s private /16 network from all of the routers, but advertised more-specific /24s for a couple of subnets from only a couple of the routers. Nearly all of these routing patterns were established before I got here. Some of them made absolutely no sense to me, so I had to ask about their history, which often pointed back to one-off fixes that were supposed to be temporary but, of course, became permanent.

We’re primarily a Cisco shop as far as the routing and switching infrastructure goes, so we use EIGRP on the internal network. The new and old data centers were connected together within the same EIGRP AS. We perform mutual redistribution of BGP and EIGRP on all of the MPLS routers, and use tagging to prevent route loops. However, at the new data center, we used a different private BGP ASN.
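For anyone who has not set this up before, the tag-and-filter pattern looks roughly like the sketch below (the same pattern is applied in the opposite direction with a second tag). The ASN, EIGRP process number, tag value, and redistribution metric here are made up for illustration, and I have wrapped the lines in a bit of Python only so they are easy to tweak; this is not our production config.

```python
# Sketch of tag-based loop prevention for mutual BGP/EIGRP redistribution.
# The ASN, EIGRP process, tag, and metric values are made up for illustration.
def mutual_redistribution(bgp_asn, eigrp_asn, tag):
    """Return IOS lines that tag routes going from BGP into EIGRP and
    refuse to send anything carrying that tag back into BGP."""
    return [
        "route-map BGP-TO-EIGRP permit 10",
        " set tag {0}".format(tag),
        "route-map EIGRP-TO-BGP deny 10",
        " match tag {0}".format(tag),
        "route-map EIGRP-TO-BGP permit 20",
        "router eigrp {0}".format(eigrp_asn),
        " redistribute bgp {0} metric 100000 100 255 1 1500 route-map BGP-TO-EIGRP".format(bgp_asn),
        "router bgp {0}".format(bgp_asn),
        " redistribute eigrp {0} route-map EIGRP-TO-BGP".format(eigrp_asn),
    ]

for line in mutual_redistribution(65000, 100, 65000):
    print(line)
```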

I knew that because of the two different BGP ASNs, I would have to be careful about outbound route advertisements appearing to the rest of the network from multiple sources. So I used BGP prepending to make some paths more preferable than others, while still allowing for redundancy in case one of the routers went down. But since all of the MPLS routers at both data centers were joined by the same EIGRP AS and were configured for mutual redistribution of BGP and EIGRP, this set up a problem that I didn’t see at the time, but can now see very clearly in hindsight.
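The prepending itself is nothing fancy; the outbound policy on a less-preferred router looks roughly like this (the ASN, PE neighbor address, and route-map name are placeholders):

```python
# Sketch of outbound AS-path prepending on a backup path. The ASN, PE
# neighbor address, and route-map name are placeholders.
def prepend_policy(local_asn, pe_neighbor, repeat=3):
    """Return IOS lines that prepend our own ASN on advertisements to one PE,
    making paths through this router look longer and therefore less preferred."""
    prepend = " ".join([str(local_asn)] * repeat)
    return [
        "route-map WAN-OUT-BACKUP permit 10",
        " set as-path prepend {0}".format(prepend),
        "router bgp {0}".format(local_asn),
        " neighbor {0} route-map WAN-OUT-BACKUP out".format(pe_neighbor),
    ]

for line in prepend_policy(65000, "192.0.2.1"):
    print(line)
```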

The routing table seemed stable, and everything was flowing properly. Then a couple of our remote branches had WAN trouble, and their circuits went down. When they came back up, they could no longer reach any of our data center subnets. Or more correctly, they could reach us, but we couldn’t reach them. I didn’t think a routing loop could have occurred, because at every point of redistribution I set and matched tags to prevent exactly that.

What I can now see clearly in hindsight is that when those branches went down, their routes were removed from the overall routing table. When they came back up and the routes were re-introduced, they were learned by BGP at our old data center, redistributed into EIGRP and passed on to the new data center, and then redistributed back into BGP under the second ASN, which caused a loop. But this happened only for routes that had been removed from and re-introduced into the overall routing table.

Luckily, we caught this very quickly and corrected it with only a couple of small branches experiencing the issue. As I write this, we are still in the middle of migrating the WAN over, and the way we chose to deal with it was to not redistribute EIGRP back into BGP at the new data center yet, and instead just advertise the routes we want directly through BGP. It’s true that we could have just used filtering, but we only need to do this temporarily, and this seemed like a much quicker and easier way to solve the problem.
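For the curious, the temporary workaround amounts to only a few lines per prefix, roughly like the sketch below (the prefix and ASN are made up). The static route to Null0 just keeps the aggregate in the routing table so BGP will originate it.

```python
# Sketch of originating prefixes directly in BGP instead of redistributing
# EIGRP. The prefix, mask, and ASN are made up for illustration.
config = [
    "ip route 10.20.0.0 255.255.0.0 Null0 250",   # anchor the aggregate
    "router bgp 65010",
    " network 10.20.0.0 mask 255.255.0.0",
]
for line in config:
    print(line)
```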

Having this job has been very exciting so far, and I feel like I came onboard at just the right time. Since this is my first “real” enterprise-level networking job, I spent the first several months getting up to speed and seeing where all the things I read about and studied fit into place. Now I’ve reached the point of being able to put the more advanced things I’ve studied into action and gain real practical experience that will propel me forward in my career.

As I said, I am well aware that most people early in their networking careers will not have had the opportunity to experience something like this data center move. That is why I made sure to make the most of it and write down some of the many things I experienced and learned. The static EtherChannel problem, in particular, was a real nightmare, and I don’t think I will ever forget it. More importantly, I now know exactly what symptoms to look for and how to fix it if something like that ever happens again.

I feel like I have gained a lot of really important experience in a short amount of time. I’m very grateful for it, and I’m always continuing to study and look toward the future with what this career may bring me, and to help solve the many problems of business.

The Data Center Move, Part 3

After the servers were connected, we were getting complaints about intermittent connectivity. I would do a continuous ping against one of the servers, and it would randomly drop one or two packets out of every 20 or so, and then every 10 minutes it would completely lose connectivity for about 30 seconds. I logged into all of the switches at the new data center, turned on the terminal monitor, looked at the logs, and saw all of the switches reporting “MAC flapping”, and then disabling MAC learning on the affected VLANs for 3 minutes.

Because everything is redundantly connected, with Spanning Tree blocking the redundant links, the disabled MAC learning occurred on different switches at different times, which is why the pings would mostly continue but lose one or two while RSTP reconverged. Since the switches would block MAC learning for 3 minutes, I surmised that when the server lost connectivity for 30 seconds, that was when the “rolling VLAN blackouts” overlapped and affected all of the switches at the same time.

My boss and I still weren’t sure what was going on, because I configured the ports facing these systems on the new switches identically to the ports on the old switches, where we had no apparent issues. As a point of troubleshooting, we decided to cut the HSRP active gateway over to one of the new core switches, which we hadn’t done yet because our primary WAN connectivity was still at the old data center.

Two things ended up working against us here. The first was that, because we were under pressure to get things working correctly ASAP (despite this happening on a Saturday night), I pasted the configuration to cut over all of the HSRP active gateways simultaneously (lesson learned: unless you’re dealing with hundreds of VLANs, don’t do this! Do it one at a time, or in small groups). The second was that I did not fully appreciate what disabled MAC learning really means for HSRP. If it had been happening only on the leaf switches, it would have been less severe, but because it was happening across the entire STP domain at the new data center, when I cut over the HSRP gateways, one of the new cores would be active only until MAC learning was disabled for that particular VLAN.
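If I had it to do over, the cutover would look more like the sketch below: touch one SVI, check HSRP, wait, then move on. This is only an illustration of the one-at-a-time idea (expressed with netmiko rather than pasting at the CLI), and the VLAN numbers, HSRP group, priority, address, and credentials are all made up.

```python
# Sketch of a "one VLAN at a time" HSRP cutover on a Nexus core.
# VLANs, HSRP group, priority, address, and credentials are placeholders.
import time
from netmiko import ConnectHandler

new_core = {
    "device_type": "cisco_nxos",
    "ip": "10.0.0.2",
    "username": "admin",
    "password": "password",
}

vlans = [10, 20, 30]  # hypothetical VLAN IDs

conn = ConnectHandler(**new_core)
for vlan in vlans:
    conn.send_config_set([
        "interface Vlan{0}".format(vlan),
        "hsrp 1",            # hypothetical HSRP group number
        "preempt",
        "priority 150",      # higher than the old core so the new core takes over
    ])
    print(conn.send_command("show hsrp brief"))  # verify who is active
    time.sleep(60)           # leave time to watch for trouble before the next VLAN
conn.disconnect()
```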

Cutting everything over at once caused an HSRP “split-brain” scenario between the core switches at the old data center and the new data center. It caused enough of a problem that the Metro-E switch connecting the two data centers completely froze, and I had to make the trip across town to the new data center to power-cycle the Metro-E switch (it belongs to the service provider, not us, so we had no way to access it remotely). During that time, we put the HSRP active gateway back on the old core switch.

I had to wake up the iSeries admin and get him to show me his networking configuration for the servers. Finally, things started to make sense. Going into this, because the configurations of the old ports and the new ones were identical, we could not have foreseen these particular issues, especially since things had worked just fine on the 6500s. What his network config revealed, as I’m sure you’ve guessed by now, is that not only are all of the links bundled together on the server side, they’re in a static EtherChannel; LACP was not used.

This is what caused the Nexus switches to report MAC flapping and temporarily disable MAC learning. That means the issue we experienced was occurring from the very beginning. It’s just that the 6500s handled the issue by not reporting it to us. The 6500s basically “just dealt with it” as the MAC addresses would flap between the links. But the Nexus switches called us out on it, and made sure we did things right.

After I created the static EtherChannels on the Nexus switches, the MAC flapping was still present, but occurred less frequently. It was during this troubleshooting session (which took place at around 3am on Sunday morning, with me running on only a couple hours of sleep) that I discovered why you ALWAYS want to use LACP rather than hardcoding static EtherChannels. With LACP, the two ends negotiate and each link carries an ID, so if things don’t line up, the ports do not become active. Not so with a static EtherChannel. But since I’m not the iSeries admin, I have no control over that side of things (if the devices are even capable of using LACP).
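For reference, the difference on the switch side is a single keyword in the channel-group command, which makes it easy to overlook. A generic sketch (interface name and port-channel number made up):

```python
# Static ("mode on") vs. LACP ("mode active") member-port configuration.
# The interface name and channel-group number are made up for illustration.
def member_port(interface, group, use_lacp):
    """Return switch config lines for one member link of a port-channel."""
    mode = "active" if use_lacp else "on"   # "on" = static bundle, no negotiation
    return [
        "interface {0}".format(interface),
        " switchport mode trunk",
        " channel-group {0} mode {1}".format(group, mode),
    ]

for flag in (False, True):
    for line in member_port("Ethernet1/10", 10, use_lacp=flag):
        print(line)
    print("!")
```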

What I found out was that the network interfaces on the iSeries were labeled (and assembled into EtherChannel bundles) differently than you would expect, so the static EtherChannels were up on the Nexus side because the links were physically connected, but they were connected in the wrong order. I spent the next half hour of trial and error, moving cables around across many different links, until I got the right combination of EtherChannel bundles. Finally, the MAC flapping stopped, no pings were dropped across several minutes, and I went home and slept like the dead.

The Data Center Move, Part 2

One of the nicest things about migrating the data center is that we get a chance to start somewhat fresh and hopefully do things correctly (or rather, more “current” from a best-practices perspective). As I’ve witnessed myself, network complexity builds up over time. Temporary “band-aid” fixes unintentionally become permanent solutions. The accumulation of all these one-off problems eventually leads to many complex interdependencies that may not reveal themselves until things start to unravel, as I experienced during the move.

The new data center network is based on a leaf/spine topology, with Top of Rack (ToR) switches connecting to a smaller, higher-speed core. The two core switches are connected together with a 2x40G EtherChannel, and the racks where the ToRs are present have two switches, each of which is redundantly connected back to both cores with 10G links.

The new switches are Cisco Nexus series, which I had not worked with until this point (another of the many “firsts”). After working with them for a while, and discovering which commands had changed or are implemented differently compared to Catalyst IOS, I actually enjoy NX-OS and prefer the way it handles most things in the configuration. It seems more logical to me in most cases. More firsts for me included working with 10 and 40 gigabit transceivers and optics, including 4x10G fiber optic breakout cables with MPO connectors, and fiber optic shelves.

Moving to the new data center was (and still is as I write this) a process of migration. While the bulk of the move happened on a single day, much of it has been in pieces both before and after the major portion. Planning for the entire operation has been ongoing for nearly a year. The major portion consisted of moving in the servers and SANs. However, the network had to be in place to support that connectivity within the new data center, as well as still provide connectivity to the old data center while we complete the migration.

A lot of planning went into the network portion, including having all of the switches pre-configured and the network cabling pre-labeled and, where possible, pre-wired and ready to plug into the devices for instant connectivity to the rest of the network, including the old data center. I created a spreadsheet containing all of the physical interconnections, as well as device ports, switch ports, and assigned VLANs, so that we would have a reference on move day and could trace every cable to every destination if we encountered connectivity issues. We had a support person from IBM on site with us all day during the move (for our iSeries systems), and at the end of the day he told me personally that he had been a part of many data center moves and had never seen one go as smoothly as this, with all the pieces falling into place. I wanted to ask him if he would put that in writing so I could put it on my resume 🙂

We decided the best way to have seamless connectivity between the old data center and the new one during the migration was with a Layer 2 Metro-E trunk from a local ISP. This let us access the new data center as if everything were still in the same room. We made the two new core switches HSRP neighbors with the old core switches, which let us shift the active gateway for the various VLANs when we were ready to do so. During different parts of the migration, we did have some traffic tromboning, but it was well within our bandwidth limits and didn’t cause any issues (including delay, since the new data center is only about 20 miles away).

However, after the majority of the systems had been moved and were in place, we did encounter a networking issue that ended up causing me to have to run across town in the middle of the night to the new data center and power-cycle a switch.

Our newest iSeries systems were placed into production during the past few months in preparation for the move. Just as newer hypervisors replace many physical servers, the new iSeries servers do the same kind of consolidation: a single system that used to occupy multiple racks now fits in an eighth of a single rack.

However, instead of a few 10G links, these new servers went with the old model of many 1G links (16 per server, in fact). When these servers were placed into production, they connected directly to our Catalyst 6500 core, with eight links per server going to each switch. When I configured the network to support these new servers, I was asked to make each link an 802.1q trunk, and was not asked about bonding the links together.

Unfortunately, I did not think to ask about link bonding, which ended up causing my boss and me to spend a few hours tracking down a newly-discovered issue when we migrated the servers to the new Nexus switches. This turned out to be an extremely interesting problem, one that I will never forget.

The Data Center Move, Part 1

My posts are sometimes (nearly always) kind of long. I thought I’d try something different and break this one up into multiple parts.

Sometimes in life, the best experience comes from being in the right place at the right time. I studied enterprise networking for years while being employed in the SMB space. My short time with the Flagler County school system was my first enterprise-level job, but my role as network engineer at my present company has really been the start of my journey into “real” enterprise-level experience. In a comparatively short time here, I have gained immense experience that I know will serve me well for the rest of my career.

My company recently migrated our 15-year-old in-house data center to a private cage in a colocation facility. Many people would view the thought of such a migration with dread. Many people that I work with did in fact dread it. However, I saw it from the very beginning as a rare opportunity to raise my awareness of various issues and learn as much as I could from the process. How often does a company move its entire data center? Indeed, many people working for a single company for many years may never experience something like this.

Technology, as always, marches on. Fifteen years ago, it made much more sense to run your own data center, if you had enough equipment to fill it. Our private in-house data center had, at one point in time, somewhere between 50 and 100 racks full of equipment, and a separate telecom room where all of the copper connections terminated. The data center was filled with many physical servers, primarily Linux machines and proprietary IBM midrange systems, as well as very large (and very heavy!) multi-rack SANs.

The telecom room also served as the demarc for various WAN technologies over the years, including T1s, DS3s, straight fiber, and at one point (before migrating to MPLS), the central hub for our former frame-relay hub-and-spoke WAN (this was well before my time here). We still have all of the original SONET muxing equipment in the wiring room, and many shelves for the T1 cards, with hundreds of pairs of wires running between racks. That in itself is something someone fairly recent to enterprise-level networking might not run into very often anymore.

Fifteen years ago, 32TB of storage would have taken hundreds of drives spanning multiple racks of equipment. In 2016, you can get 32TB across just four drives or fewer. Soon we’ll be laughing about how it used to take multiple drives at all to reach 32TB of storage. Thanks to server virtualization, hundreds of former physical boxes can now be placed into a blade server chassis that takes up less space than five of those physical servers did. Those physical servers all required multiple individual connections back to the network, which usually meant chassis-based switches containing hundreds of ports. Today, you just run a couple of 10G or higher connections into a hypervisor, shared by many VMs simultaneously.

All of this technology consolidation means that at a certain point, it no longer makes sense to run your own data center. It becomes a giant waste of space and money. With consolidation, you might end up using less electricity, but you still have the infrastructure in place, including huge cooling systems, redundant power, backup diesel generators, etc., all of which cost money to maintain whether or not they are actually used. It is at this point that relocating the data center to a colocation facility makes the most business sense. When you can reduce your data center footprint to a tenth of its former size thanks to advances in technology, why continue paying for infrastructure you no longer need?

From the networking perspective, this entire move process has had many “firsts” for me in terms of hands-on experience. Some of what I have experienced are things I read about years ago when studying for the CCNP. I remember wondering at the time whether I would ever actually encounter some of these situations and be able to use the knowledge I was gaining. This move has made that knowledge come to life.

Attempting to Avoid Obsolescence

This post is pretty long and meandering. The crux of it is that I wrote and used my first Python script on a production network yesterday, and it made me pretty damn happy to see the results.

Now for the Director’s Cut:

I’m sure the title represents something I will be forced to revisit many times during my career, and we all know that everything old is new again (Rule 11 of RFC 1925, naturally). As I get deeper into my career and expand both my experience and base of knowledge, I see Rule 11 all the time. When you’re first starting out, you may be aware of the concept, but it definitely takes time and experience to truly appreciate it.

A couple months ago, I started looking more into scripting and network automation. I mentioned Rule 11 because I am well aware of the fact that, despite all the industry buzz during the past couple of years, this is not in any way new. Before Python took over as the scripting language of choice for network engineers, it was Perl.

However, scripting seems to remain a relatively important skill for network engineers dealing with systems of any sort of scale. Many of the end results of scripting can also be achieved by purchasing expensive network management software. I have discovered, though, that even with the expensive NMS software in place, sometimes you would like to gather very specific information, and either the NMS doesn’t support what you’re looking for, or actually extracting the desired information from the software is extremely unintuitive and/or cumbersome.

I enjoy learning the “classical” network engineering skills, such as the various routing and bridging protocols, and architecture and design. That is one of my motivations for still studying for the CCIE. But, as I’ve written about in the past, it’s taking me a long time to get the CCIE because it is not my sole focus. I don’t want to be a one-sided engineer who can’t fathom thinking outside the world of Cisco, and I don’t want to be the kind of person who designs networks a certain way because “that’s how it was on the Cisco certification exam.”

On the other side, I’m not necessarily interested in being the “full stack” engineer, either. There are many sysadmin duties and responsibilities that I am glad I do not have (like worrying about people’s files and email). Yet, I know enough about Windows and Linux to do the things that I need to do. I know the basics of wireless networking and IP telephony. I have a decent level of VMware vCenter knowledge and experience (mostly through breaking and fixing things in my home lab). I also have some knowledge of SANs and storage systems, though I will also admit that I find the networking aspect of SANs very appealing and may explore that at a deeper level later on.

Still, there’s enough industry talk about scripting and network automation that I decided it was time to investigate a little on my own. I will very readily admit that the idea of working with APIs, where you can send and receive actual, exact information as opposed to what you get with screen scraping, is very appealing. It’s going to take a while before I reach that point, especially considering that the vast majority of the equipment I currently work with is 10-to-15-year-old, classic Cisco IOS-based gear. That means interaction via screen scraping.

I started my current job a little more than a year ago. While my last job gave me the absolute tiniest taste of enterprise-level network engineering, my current job has been full-on, giving me so much of the experience I have desired for the past several years. While I may be working primarily with older technology (there’s some brand new stuff sprinkled in here and there), it is at a scale large enough that certain automated tasks begin to make sense.

When I started, we had no configuration backup process in place for the routers and switches (that I was aware of). I discovered RANCID and built my own server from scratch. I had never worked with this software before, and while we are using SolarWinds Orion for some things, we are missing the NCM piece, which I discovered was ungodly expensive.

Figuring out how to set up and use RANCID has been one of the best things I’ve done so far for myself. I have used it so many times for various things, and it has been a real time saver on occasion. Just the other day, I had to send out a replacement router, only I didn’t find out about it until 20 minutes before UPS was scheduled to do their normal pickup. Because the configuration was already backed up in RANCID, I was able to quickly grab a router, erase its config, put the backup config on it, and get it packaged and shipped out before the deadline.

I experimented with sending out certain commands to all of the routers via RANCID, and that was pretty neat, but felt kind of awkward for some reason. I knew that I would be able to do more if I learned how to script with something like Python. A couple months ago I started making my way through Learn Python the Hard Way, and I’ve also made it halfway through Kirk Byers’ excellent Python For Network Engineers course.

I found learning some of the stuff pretty difficult at first. Part of it was me doing my best to let go of my attitude regarding programming. I got my Bachelor’s degree in Information Systems, not Computer Science, because I was more interested in networking and not at all interested in programming. But as time goes on, I feel like I made the wrong choice. Not because I want to be a programmer now, but because having that background could have been more beneficial to me in the future. However, I am sure that writing scripts is not at all like being a full-on software developer, so I should probably not make that comparison.

Right now I am still at the extreme beginner stage, and I know that I will need to go over the learning material several times if I wish to reach the point of being able to “think in code”. Still, I knew that I would need to start somewhere if I wanted what I am learning to stick. Sometimes, the hardest part of learning is trying to see where and how what you’re learning can be applied. Coding is all about breaking problems down into the smallest pieces possible, and then recombining them into something meaningful.

This is something that can be very difficult to see at first. “I want to learn Python, but what do I want to do with it?” Beginning with the knowledge that you can do practically anything with it, as long as you know how, does not help at all. I needed to start smaller. I needed a single simple task that I could start with, and learn how to build upon it as needed. During this past week, I had been thinking about this more.

I wanted to start with a simple script that could take a list of hostnames or IP addresses of Cisco routers, send a command to them, and dump the results to a text file. On the surface, that sounds easy enough. Yesterday, my boss wanted me to gather information about how many of our branch offices had more than six Cisco phones, but did not have Cisco switches installed. I saw this as the perfect opportunity to finish what I had started and have an actual use case for the script.

Ultimately, the script would need to log into each specified Cisco router, issue a command (“sh cdp n | inc SEP” in this case), and dump the results to a text file. I was able to cobble together a script based on what I had learned from the first several chapters of Learn Python the Hard Way, the first half of Kirk Byers’ course, and a couple of quick web searches on how to write code for a couple of specific tasks. I created a GitHub account a few months ago, and as embarrassing as my first script might be, I decided to go ahead and post it to my account anyway. If nothing else, I figured it was something else that I could say I have done.
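For the sake of illustration, the core of the idea looks something like the sketch below. This is not the exact script I posted, just a minimal version of the concept using the netmiko library; the IP addresses, credentials, and file name are placeholders.

```python
# Minimal sketch: log into each router, run one command, and dump the output
# to a text file. Assumes netmiko; IPs, credentials, and file name are made up.
from netmiko import ConnectHandler

COMMAND = "sh cdp n | inc SEP"
ROUTERS = ["10.10.1.1", "10.10.2.1"]   # hypothetical branch router IPs

with open("cdp_phone_report.txt", "w") as report:
    for ip in ROUTERS:
        conn = ConnectHandler(
            device_type="cisco_ios",
            ip=ip,
            username="netops",        # placeholder credentials
            password="password",
        )
        output = conn.send_command(COMMAND)
        conn.disconnect()
        report.write("===== {0} =====\n{1}\n".format(ip, output))
```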

The first issue I ran into was scope creep. Should the script do this? Should it do that? It would be nice if it did this other thing. I saw how this could very quickly get out of control. I realized that, for now, I am still learning, and I just need to accomplish this simple single task at the moment. Keep it simple.

The second issue I ran into was errors that I didn’t anticipate and wasn’t sure how the script would handle. The first one I encountered was a host that was not reachable. I knew from the start that this would be an issue, but I wasn’t sure how to deal with it. Through a web search, I found out how to ping a host and return True or False, and I used this as the primary error catch.
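One common way to do the reachability check is to shell out to the operating system’s ping command and turn the exit code into True or False, along these lines:

```python
# Reachability check: shell out to the OS ping command and turn the exit
# code into True/False.
import platform
import subprocess

def is_reachable(host):
    """Return True if the host answers a single ping, False otherwise."""
    count_flag = "-n" if platform.system().lower() == "windows" else "-c"
    result = subprocess.call(
        ["ping", count_flag, "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result == 0

print(is_reachable("10.10.1.1"))   # hypothetical router IP
```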

Other errors I ran into, but have not yet resolved, include bad authentication and command timeout. Sometimes, the two are related. Upon reaching either of these failures, my current script just dies and does not handle the exception. This will probably be the first thing I try to improve upon.
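When I get around to it, the fix will probably look something like the sketch below, catching netmiko’s authentication and timeout exceptions instead of letting them kill the script (the exact exception names vary between netmiko versions):

```python
# Sketch of catching login failures and timeouts instead of dying.
# Exception class names are netmiko's and may differ between versions.
from netmiko import ConnectHandler
from netmiko.ssh_exception import (
    NetMikoAuthenticationException,
    NetMikoTimeoutException,
)

def fetch_output(device, command):
    """Return the command output, or None if the login fails or times out."""
    try:
        conn = ConnectHandler(**device)
        output = conn.send_command(command)
        conn.disconnect()
        return output
    except NetMikoAuthenticationException:
        print("Authentication failed for {0}".format(device.get("ip")))
    except NetMikoTimeoutException:
        print("Connection timed out for {0}".format(device.get("ip")))
    return None
```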

Going back to the idea of scope creep, as I was putting the script together, I was thinking of all kinds of ways the script could be extended and enhanced…if only I knew how to do it. For example, in my current use case, I only wanted to know which locations have more than six Cisco phones, but do not have a Cisco switch. If I knew Python a little better, I could write a script to scour the network and present back to me only this exact information, instead of the extraneous information provided by screen scraping. However, this is something that will develop over time, because learning this stuff is just one of many skills I wish to gain experience in, and it all takes time.
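Just as a thought experiment, the “give me only the answer” version of yesterday’s task might boil down to a little parsing on top of what the script already collects. A sketch (the sample output is made up, and the assumption that Catalyst platform strings contain “WS-C” is mine; real CDP output would need more careful parsing):

```python
# Sketch: from one router's CDP neighbor output, count the phones (lines
# containing "SEP") and guess whether a Cisco switch is present (Catalyst
# platform strings contain "WS-C"). Real output needs more careful parsing.
def site_summary(cdp_output):
    phones = sum(1 for line in cdp_output.splitlines() if "SEP" in line)
    has_cisco_switch = "WS-C" in cdp_output
    return phones, has_cisco_switch

SAMPLE = """SEP0011223344AA    Gig 0/1    120    H P    IP Phone    Port 1
SEP0011223344BB    Gig 0/2    150    H P    IP Phone    Port 1"""   # made up

phones, has_switch = site_summary(SAMPLE)
if phones > 6 and not has_switch:
    print("More than six phones and no Cisco switch at this site")
else:
    print("{0} phones, switch present: {1}".format(phones, has_switch))
```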

In the end, the script did what I needed it to do, and gathered the information that I required. I was able to get the information I needed by letting the script run over a lunch break, whereas my alternative (getting information from our IPAM system) would have taken much longer, due to the specific information I needed. I was very excited by this, and I can see this turning into something important over time.

There’s a lot out there to learn, and it is sometimes very difficult to remain focused on any single thing. Many times, what you learn is dictated by the immediate business problem you need to solve. Rarely is what you need to learn singularly focused, and learning it all takes time and practice. It can be a delicate balance not to be pulled in too many directions at once, both in real life and in “study life”. I’ll admit, that in itself is another skill I am still trying to learn.