Over the next couple of weeks, we continued to migrate more portions of the network and less-critical systems over to the new data center. One of the issues we experienced was temporary route instability due to accidentally advertising the same routes through multiple BGP ASNs because of redistribution.
The overall WAN and Internet design of our network is hub-and-spoke. We use an MPLS L3VPN service and peer with the PE routers via eBGP, so all of the spokes can talk directly to each other (which is useful for all of the inter-office VoIP traffic), but both the primary business data and the Internet connectivity for the entire company flow back to the hub data center.
Over time, for various reasons, we ended up with multiple MPLS routers at the old data center facing the rest of the WAN. All the MPLS routers use the same BGP ASN and speak iBGP with each other, and peer with our MPLS provider with eBGP. Even though all the MPLS routers have equal access to the WAN (differing bandwidths aside), different routing policies had been put into place for various reasons over the years. For instance, all of our SIP sessions went over just one of the routers. We advertised the company-wide default route from two of the routers, but not all of them. We advertised our data center private /16 network out of all the routers, but advertised the more-specific /24s of just a couple of subnets from only a couple of the routers. Nearly all of these routing patterns were established before I got here. Some of them made absolutely no sense to me, so I had to dig into their history, which often pointed back to one-off fixes that were supposed to be temporary, but of course became permanent.
We’re primarily a Cisco shop as far as the routing and switching infrastructure goes, so we use EIGRP on the internal network. Both the new data center and old were connected together using the same EIGRP ASN. We perform mutual redistribution of BGP and EIGRP on all the MPLS routers, and use tagging to prevent route loops. However, at the new data center, we used a different private BGP ASN.
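To make the tagging scheme concrete, a mutual-redistribution setup like the one described might look roughly like this in IOS (the ASNs, process IDs, tag values, and metrics below are illustrative placeholders, not our actual configuration):

```
! Sketch only: tag routes at each redistribution point and block
! routes carrying the opposite tag, so they cannot loop back.
route-map BGP-TO-EIGRP deny 10
 match tag 100                  ! drop anything that started life in EIGRP
route-map BGP-TO-EIGRP permit 20
 set tag 200                    ! mark routes injected from BGP
!
route-map EIGRP-TO-BGP deny 10
 match tag 200                  ! drop anything that started life in BGP
route-map EIGRP-TO-BGP permit 20
 set tag 100                    ! mark routes injected from EIGRP
!
router eigrp 10
 redistribute bgp 65000 metric 100000 10 255 1 1500 route-map BGP-TO-EIGRP
!
router bgp 65000
 redistribute eigrp 10 route-map EIGRP-TO-BGP
```

One caveat worth noting: EIGRP carries tags on its external routes, but as best I can tell, a tag set when redistributing into BGP stays local to that router and is not propagated across BGP peerings, so tag matching alone cannot catch a route that takes a detour through a second BGP ASN.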
I knew that because of the two different BGP ASNs, I would have to be careful about our outbound route advertisements appearing to the rest of the network from multiple sources. So I used BGP AS-path prepending to make certain paths less preferable than others, steering traffic where I wanted while still allowing for redundancy in case one of the routers went down. But since all of the MPLS routers at both data centers were joined together by the same EIGRP ASN, and they were all configured for mutual redistribution of BGP and EIGRP, this set up a problem that I didn’t see at the time, but can now see very clearly in hindsight.
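For illustration, the prepending on a less-preferred path looked something like the following (the neighbor address and ASNs here are made up). Prepending lengthens the AS path that our routes show to the provider, making that path less attractive to remote sites while leaving it available as a fallback:

```
! Sketch only: prepend our own ASN several times on the backup router,
! so remote sites prefer the shorter AS path via the primary router.
route-map PREPEND-OUT permit 10
 set as-path prepend 65001 65001 65001
!
router bgp 65001
 neighbor 192.0.2.1 remote-as 64512        ! MPLS provider PE (made-up)
 neighbor 192.0.2.1 route-map PREPEND-OUT out
```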
The routing table seemed stable, and everything was flowing properly. Then a couple of our remote branches were having WAN troubles, and their circuits went down. When they came back up, they could no longer reach any of our data center subnets. Or more correctly, they could reach us, but we couldn’t reach them. I didn’t think that a routing loop would have occurred, because at all points of redistribution, I set and matched tags to prevent that from happening.
The part that I can now see clearly with the experience of hindsight is that when those branches went down, their routes were withdrawn from the overall routing table. When the branches came back up and their routes were re-introduced, those routes were learned by BGP at the old data center, redistributed into EIGRP, passed on to the new data center, and redistributed back into BGP under the new data center’s different ASN, which created a loop. But this happened only for routes that had been withdrawn from and re-introduced into the overall routing table.
Luckily, we were able to catch this very quickly and correct it with only a couple of small branches experiencing the issue. As I write this, we are still in the middle of migrating the WAN over, and the way we chose to deal with the problem was to stop redistributing EIGRP back into BGP at the new data center, and instead just advertise the routes we want directly through BGP. It’s true, we could have used filtering instead, but we only need to do this temporarily and this seemed like a much quicker and easier way to solve the problem.
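As a sketch, the interim arrangement at the new data center amounts to something like this (the prefixes and ASN are placeholders). The `network` statements advertise only prefixes that already exist in the local routing table, so nothing learned from EIGRP leaks back into BGP:

```
! Sketch only: no EIGRP->BGP redistribution at the new data center;
! advertise the desired prefixes explicitly instead.
router bgp 65001
 no redistribute eigrp 10
 network 10.20.0.0 mask 255.255.0.0       ! data center aggregate
 network 10.20.5.0 mask 255.255.255.0     ! more-specific subnet
```

A common companion to this pattern is a static route to Null0 for the aggregate, so the `network` statement always has a matching route in the table to advertise.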
Having this job has been very exciting so far, and I feel like I came onboard at just the right time. Since this is my first “real” enterprise-level networking job, I spent the first several months getting up to speed and seeing where all the things I had read about and studied fit into place. Now I’ve reached the point of being able to put the more advanced things I’ve studied into action, and gain real practical experience that will propel me forward in my career.
As I said, I am well aware that the majority of people early in their networking careers will not have the opportunity I’ve had to experience a data center move like this. That is why I made sure to make the most of it and write down some of the many things I experienced and learned. The static EtherChannel problem, in particular, was a real nightmare and I don’t think I will ever forget it. More importantly, I now know exactly what symptoms to look for and how to fix it if something like that ever happens again.
I feel like I have gained a lot of really important experience in a short amount of time. I’m very grateful for it, and I’m always continuing to study and look toward the future with what this career may bring me, and to help solve the many problems of business.