Easy Disaster Recovery Plan

DR plans encompass everything from no plan whatsoever (failing to plan is planning to fail), to active/active workloads distributed among several geo-redundant datacenters. This spectrum, just like nearly everything else in business, goes from zero to enormous cost and complexity. In the interest of keeping things simple, I designed a relatively inexpensive and uncomplicated enterprise DR plan that can be adapted and scaled with organizational requirements.

The initial design starts with two datacenters (or campuses, or a couple boxes in a rented colo cage, whatever your situation may be). One is primary (Datacenter A), and the other functions as a cold standby with some secondary production traffic (Datacenter B). In this example, I am using a single router, and a single VRF-capable Layer 3 switch in Datacenter B.

With this design, Datacenter A serves Site X and Site Y for normal production traffic. Datacenter B also serves normal production traffic, but mostly for secondary systems that take over when individual primary systems at Datacenter A become unavailable, which is different from a full disaster recovery scenario. This demonstrates the cost savings of starting out with partial redundancy. Services deemed critical to infrastructure (such as DNS) have live systems simultaneously active in both datacenters. However, Datacenter B is not a full replacement for all systems should Datacenter A become completely unavailable.

The Layer 3 switch in Datacenter B is divided into two VRFs: one for secondary production traffic, and one for disaster recovery traffic which mirrors the prefixes available in the primary datacenter. The DR VRF is always active, and can be prepared and maintained using whatever resources are required before an actual DR event occurs.
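A minimal IOS-style sketch of this VRF split on the Datacenter B switch might look like the following. The VRF names, VLAN numbers, and addresses are all illustrative, and the exact syntax varies by platform:

```
! Two VRFs on the Datacenter B Layer 3 switch: one for secondary
! production, one mirroring the primary datacenter's prefixes.
vrf definition PROD
 address-family ipv4
!
vrf definition DR
 address-family ipv4
!
! SVI carrying secondary production traffic (10.200.x.x/16 space)
interface Vlan200
 vrf forwarding PROD
 ip address 10.200.1.1 255.255.255.0
!
! SVI mirroring a primary-datacenter prefix (10.100.x.x/16 space)
interface Vlan100
 vrf forwarding DR
 ip address 10.100.1.1 255.255.255.0
```

Because the 10.100.x.x/16 SVIs live in their own VRF, the DR copies of those prefixes can coexist with the identical production prefixes without conflict.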

During normal production, the router at Datacenter B (indicated in the diagram as BGP 64514) sees both the production 10.100.x.x/16 and DR 10.100.x.x/16 networks. When a router receives the same prefix from multiple sources, it uses administrative distance to decide which route to install. This router is configured to install the production 10.100.x.x/16 routes from Datacenter A in its FIB as long as it is receiving them from BGP. If those routes disappear, the DR 10.100.x.x/16 routes are installed instead.
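One way to express this preference (a sketch of the general technique, not necessarily how the original design was built): carry the DR copy of each prefix as a floating static route with an administrative distance higher than BGP's, so it only installs when the BGP-learned production route is withdrawn. Addresses here are illustrative:

```
! Production 10.100.1.0/24 normally arrives via eBGP (AD 20) from
! Datacenter A. The DR copy is a floating static route with AD 240
! pointing at the switch's DR VRF interface, so it stays out of the
! FIB until the BGP route disappears.
ip route 10.100.1.0 255.255.255.0 10.200.0.2 240
```

The same effect can also be achieved by adjusting protocol distances directly; the key point is only that the production route must always win while it exists.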

This means the DR 10.100.x.x/16 routes will be inaccessible from the rest of the network during normal production. With this design using a Layer 3 switch and VRFs, the SVIs are assigned to either the production or DR VRF. You can access the DR routes during normal production by configuring a jump host that contains at least two network interfaces: one in the production VRF, and one in the DR VRF.

Finally, a disaster recovery plan is incomplete if you do not test it. On the BGP router in Datacenter B, you can set up filtering to prevent the production 10.100.x.x/16 routes from being learned by BGP. This will cause the Datacenter B router to believe the DR 10.100.x.x/16 routes are the “real” routes. Likewise, for DR testing purposes, you should ensure that the Datacenter B BGP router does not advertise the 10.100.x.x/16 routes back to the rest of the network. You would only do that during an actual DR event. With the bidirectional filtering in place, you can use one or more hosts on the secondary production 10.200.x.x/16 network to test the validity of the recovery. After the results of the test are verified, you can remove the inbound BGP filtering to restore Datacenter B to normal production.
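On an IOS-style router, the test-time filtering could be sketched with a prefix list applied inbound (to hide the production routes) and outbound (to keep the DR copies from leaking back out). The neighbor address and prefix-list name are hypothetical:

```
! Match the production/DR supernet and everything within it.
ip prefix-list BLOCK-10-100 seq 5 deny 10.100.0.0/16 le 32
ip prefix-list BLOCK-10-100 seq 10 permit 0.0.0.0/0 le 32
!
router bgp 64514
 ! Outbound filter stays in place permanently; the DR routes are
 ! only advertised during an actual DR event.
 neighbor 192.0.2.1 prefix-list BLOCK-10-100 out
 ! Inbound filter is added only for the duration of the DR test.
 neighbor 192.0.2.1 prefix-list BLOCK-10-100 in
```

After applying or removing the inbound filter, a soft refresh such as `clear ip bgp 192.0.2.1 soft in` makes the change take effect without tearing down the session.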

During a DR test, hosts in other parts of the network (such as Site X and Site Y) can still access hosts in the 10.200.x.x/16 network, and they will still use the 10.100.x.x/16 routes present in Datacenter A. Hosts in Datacenter A can reach hosts on the 10.200.x.x/16 network, but the replies will be sent to the DR VRF. You can use static routes on the Datacenter B BGP router to override this behavior. For example, if during DR testing a system in the DR environment absolutely must reach another system in the production network in Datacenter A, a static /32 route can be entered into the Datacenter B router pointing toward the MPLS L3VPN.
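The override could be as simple as a host route on the Datacenter B router; the addresses below are illustrative:

```
! Let one specific DR-side system reach one specific production
! system in Datacenter A via the MPLS L3VPN during the test,
! overriding the locally installed DR route for that single host.
ip route 10.100.5.10 255.255.255.255 192.0.2.1
```

Being a /32, this route wins by longest-prefix match over the installed DR 10.100.x.x routes, so only traffic to that one host is redirected back toward production.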

This design starts small and simple, which is sometimes all you need to meet the business goals at hand. It also represents an older way of doing things by using an active/standby model, based on systems that are not prepared to have their IP addresses changed. The tradeoff is a much lower cost, since additional hardware and software licensing are not required until an actual disaster (depending on what is being recovered).

A more modern (albeit more expensive) model is active/active using different subnets and load-balancers. The client-facing IP prefixes could be advertised from both datacenters, fronted by the load-balancers. The clients would reach their closest load-balancer (anycast style), and the load-balancer would redirect to the proper server to complete the connection (depending on the application architecture, of course). This is more expensive because it requires a duplication of hardware and software (including licensing) to maintain the active/active environment.

The tradeoff is a much quicker recovery time. In fact, the users might not even be aware of an outage if proper planning and testing are performed. It all depends on the business weighing the risks versus the costs. Some businesses can tolerate the active/standby model; many cannot.

My First CCIE Lab Attempt

This is the unabridged version. The abridged version is available on LinkedIn.

From the Written…

I passed the CCIE Routing & Switching v5.1 written exam in August 2017. It was a huge moment for me, and felt like a great validation of the effort I had put in to reach that point. Around the same time, my work had some Cisco Learning Credits that were about to expire, and they let me use them to purchase the full self-paced Cisco Expert Level Training (CELT, formerly 360) package, which included 25 workbook labs and 15 graded labs. I had started my CCIE training a couple of years earlier with the INE.com AAP, which I paid for myself, but I personally found the labs included with CELT to be much better for preparation. Additionally, I confirmed later that the final few workbook and graded CELT labs are indeed very similar to the real thing.

Part of what stalled me when initially studying for the lab exam was the same thing that stalled me for the written exam: not knowing the true topic scope and depth. You really can’t know for sure until you take the real exam. As painful as the current price of $450 is for the written exam, the current $1600 lab exam fee is much worse just to be able to find out. I think Cisco (and other vendors) could do a much better job of outlining exactly what topics are going to be on the exams (and to what depth) if they’re going to charge so much to take them. The blueprint can be somewhat vague and occasionally exclusive, which makes focused studying difficult at first, though in my experience the feeling lessens as you become more familiar with the major topics.

Having access to CELT made a huge difference in my preparation because it helped narrow down both the scope and depth of topics. Both are still quite vast (as you would expect for an expert-level certification), but they no longer seemed unlimited. CELT includes introductory “lesson labs” which are mini labs that are more focused on a single technology area. While studying for the written exam, I labbed things up when I needed more clarification on a particular topic, but the reality is that I did not lab very much overall to pass the written exam. When I started studying for the lab exam, I quickly discovered that I had gained a lot of “book” knowledge, but actually implementing the technologies at the command-line is quite different.

In January 2018, I wrote about some of my thoughts and experiences studying for the lab up until that point in time. Soon after that post, I decided to release the 3500-card flashcard deck I created for myself while studying for the written exam. A couple months later I released the CCIE R&S topic mind-map I created in an attempt to help narrow down the topic scope a little bit (so that the scope no longer seemed practically unlimited).

By this time, I had completed all of the CELT topic mini-labs along with the first five workbook labs. I was still taking things comparatively slow, essentially “dabbling” in the work. I had not performed any graded labs yet. Since you can only perform the graded labs a single time, my reasoning was that I wanted to go through all of the self-paced workbook labs first so that I would hopefully be more prepared for the graded labs.

Toward the end of April 2018, I was presented with the opportunity to attempt the lab exam as part of the CCIE mobile labs for Cisco Live in Orlando. I knew at that moment I was nowhere near ready for the lab. But, living only 30 minutes away from where the mobile lab was being held, I also knew it would be much easier (and less stressful) to take the lab here just to see what it was like, rather than having to make the nine-hour trip to RTP in North Carolina.

Preparing, Part 1

It has been said before that actually scheduling the lab provides a huge boost in motivation to study. That certainly proved to be true with me. The first thing I did was generate a calendar to plan out the studying I wanted to accomplish before attempting the lab, which included finishing the workbook and taking all of the graded CELT labs. I knew this plan was going to make things tight, and I had to take a total of five days off of work to fit everything in. I performed CELT workbook labs during work days (so I could afford interruptions), and uninterrupted graded labs (with the exception of my cat howling) on weekends and days that I took off. I instantly went from relatively light and unfocused study to laser-focused for the next month and a half leading up to my scheduled lab date.

I took the GP01 graded lab, which is a four-hour introductory lab that you’re supposed to take when you first start studying under the CELT program. This lab only covers the core topics, and even though it was my first graded lab, I did very well on it, about twice as high as the average score for that lab. It also served as my introduction to the CELT grading environment and seeing how things work (such as how the grading is performed). I felt great by how well I had done, but I reminded myself not to get overconfident, and that it was meant as just an introductory lab.

The next day I took my first full 8-hour graded lab (2hrs TS, 6hrs CFG). The first thing that surprised me was that I was able to make it through the whole day without nearly as much fatigue as I was expecting. Making it through the entire lab without being mentally exhausted in the middle gave me a lot of confidence for moving forward. I once again achieved a score that was quite a bit higher than the average score for that lab. This gave me even more confidence. However, the tides were about to turn, and the graded labs quickly became much more difficult.

As I completed more graded labs, I felt myself becoming much sharper and more capable. I felt the knowledge I gained studying for the written exam start to take hold in a much more meaningful way. For example, I knew the basic differences between DMVPN Phase 1, 2 and 3, and the basics of how IPsec works, but I had difficulty memorizing the configuration. This was remedied very quickly through the forced repetition of the graded labs. Through the configuration, it became clearer what the differences are with regard to the options available for many of the covered protocols and technologies.

The other major aspect of doing so much labbing is that you get to see the interaction of the various protocols and technologies. For instance, if OSPF neighbors are configured as point-to-multipoint, what impact will that have on reachability once you leave the OSPF routing domain? In what ways does changing the STP port-priority or interface cost affect where individual ports will be blocked?

Preparing, Part 2

As I performed the graded labs, I went over the final results and made notes along the way. This was an extremely powerful tool because it not only helped demonstrate the criteria by which the labs are graded, but it also served as a reference point of mistakes and misunderstandings to learn from. For instance, both in the graded and real labs, you are not necessarily graded on actual implementation (configuration), but rather on the results. This is because for many tasks, there are multiple ways to achieve the results. Your configuration may be 100% correct, but you still lose points because you didn’t verify the results. An example that Bruno van de Werve (the current program manager for the CCIE R&S certification) gave is adding an MD5 password to an established BGP neighbor. Sure, your configuration might be correct, but if you don’t bounce the neighbor, the neighbor will not actually be authenticated, and you’ll lose those points.
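As a sketch of that example (addresses and AS numbers are made up): the password is accepted into the configuration immediately, but the already-established TCP session keeps running without MD5 until it is reset, so the verification step matters.

```
router bgp 65001
 ! Add MD5 authentication to an existing, established neighbor.
 ! The config takes, but the live session is still unauthenticated.
 neighbor 192.0.2.2 password S3cr3tKey
!
! From exec mode, bounce the session so it re-establishes with the
! TCP MD5 option actually in use:
!   clear ip bgp 192.0.2.2
```

Checking `show ip bgp neighbors` afterward to confirm the session re-established is the kind of verification habit the grading rewards.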

I had reached a pretty high level of confidence on the labs I had done to this point. That was about to change as I performed the final few labs. With CELT, the vast majority of the labs are what I call “CCIE v4 style”. That is, v5 topics are covered, but using a v4-style topology (generally 6 routers and 4 switches) and task arrangement. The final workbook and graded labs are a decent approximation of the real lab, featuring a much larger topology and a task arrangement in line with the real thing.

I did not perform nearly as well on these labs. I believe a large part of this was due to the change in strategy required. I mentioned the task arrangement because the v5 style labs seemed to include many more tasks for fewer points to gain. In other words, there is more to do, and more opportunities to get things wrong. If you have a four-point task with fifteen sub-tasks and you mess up on any of them, you lose all four points.

I did my very best to adjust my strategy for these tasks, which included trying to pay much more attention to the little details. The v4 style labs were generally more difficult from a technical point of view (many more esoteric topics tested and tighter interaction between protocols), but the v5 style labs were much more difficult from a practical standpoint. I frequently finished the v4 style labs an hour or more early, but that wasn’t the case for the final graded labs. My confidence had taken quite a nosedive.

By the end of the weekend prior to taking the real lab, I had completed all of the graded labs and nearly all of the workbook labs. I was at the point of being able to complete the vast majority of the labs within Notepad without even touching the CLI. I performed better than the average score on every single graded lab, but “average” is not “passing”. I felt like all of the core topics were very much drilled into me, and most of the miscellaneous topics also made more sense. For example, I am much more comfortable now with multicast than I was before I started the intensive labbing.

I was originally going into the lab with the attitude of it being just an “investigative attempt”, meaning I knew I was not ready when I scheduled the attempt. At the midpoint between scheduling the lab and the actual lab date, I was feeling pretty confident that I might have a real chance of passing. That confidence started to go downhill as I completed the final graded labs, but I still remained hopeful. Maybe the real lab wouldn’t be as difficult as the graded labs?

Lab Day Part 1: TS

During the couple of days prior to the lab, I monitored Google Maps to see what the required travel time would be from my house to the hotel where the lab was being held. Even though I know the area very well, I made sure to add some extra time to my route to account for extra traffic due to Cisco Live taking place. Traffic in the morning ended up not being an issue, and I arrived over an hour early. But, better early than late!

As the lab start time approached, everyone gathered together outside the doors of the room where the lab was being held. The proctor introduced himself and accounted for everyone. He told us he is the proctor for the CCIE labs in Tokyo. He is a very nice and patient person, as I’ll detail shortly. One person was running late, so he provided a couple of extra minutes. When that person did not show up, we went inside the room, the doors were shut, and the proctor provided some more introductory information about policies and expectations.

The mobile lab consisted of around 14 tables arranged in a large square, each containing a small workstation sitting behind two 24″ monitors, a keyboard and a mouse. The workstations run a minimal and locked-down install of Windows, with access to Notepad, Calculator, and the locked-down web browser which the lab is performed through, including access to the official Cisco documentation. The lab interface itself has been demonstrated publicly, such as this presentation from Bruno van de Werve. Clicking on the routers and switches in the diagram inside the web browser opens up a new PuTTY session to the particular device.

The lab experience was similar to what I had encountered previously in the graded CELT labs, but not identical, so it took a little time for me to adjust to the real lab’s quirks after spending so much time with the practice labs. Maybe it was because I was so used to it, but I preferred the CELT interface to that of the real lab, even though they are very similar.

I proceeded through the troubleshooting section. The first thing I did was open Notepad and made a section for each of the TS tickets, including the point value and what I quickly perceived to be the issue or general topic realm based on reading the description. While I was very good at ascertaining this information during my graded labs, I discovered I was incorrect on my initial assessment for about half of the tickets once I started to work on them. I worked on the topics I felt I knew the best first, and then cycled through the remaining tickets. It was my perception that the TS section on the lab was actually quite a bit harder than what I experienced on the graded labs.

Though I did not set any specific mental hard limits on time for each ticket, I did remain mindful of the clock, made notes and moved on when I felt like I was taking too long. After I had attempted all of the tickets, I circled back around to work on them individually some more. By the time I finished my final practice labs, I was in decent shape, usually finishing 30 minutes early or more. That didn’t happen for the real lab. As I was approaching the two-hour mark, I knew that I was right on the border of the minimum cut-score, if I had even crossed that line. You know the one or two technologies you feel you’re a little weak on? Those are the topics that are going to be all over your delivery of the lab exam!

I was starting to feel a little hopeless, but because I believed I was at least close to the minimum cut-score and still somehow might have a chance, that kept me going. I should say, if I knew for certain that I had failed TS and therefore the entire lab, I would have kept going just to get the experience, but I wouldn’t have taken it as seriously from that point onward. I spent a few more minutes trying to see if I could somehow solve at least one more ticket, thinking maybe there was just something small that I hadn’t seen and would have an a-ha! moment. But of course, those don’t usually come until the drive home after the lab is over. I was at the two hour and 10 minute mark, and I knew I was not going to gain anything from the remaining 20 minutes, so I progressed into the diagnostics section.

Lab Day Part 2: DIAG

Just like with TS, I felt like DIAG was a lot harder in the real lab than in the practice labs. For DIAG, I believe this feeling was due to the limited experience with the section. CELT only provides three of them in their graded labs, and none in the workbook. Of the scenarios I received in the lab, I was confident on my answers in some, and had to make educated guesses on the others as I was running out of time. Part of the difficulty of DIAG is the inclusion of unnecessary information along with the critical information. I feel like they did a good job on my delivery of the exam in making it difficult to distinguish between the two. As I was about to leave DIAG, I felt the same as I did about TS, knowing that I must be right on the line of the minimum cut-score. Unlike with TS, I did not click to go to the next section, I let my timer run out.

The configuration section did not load properly for me. I waited about 30 seconds, and I had a blank white screen with a couple of buttons to click, one that said “End Session”. You know where this is headed. For some reason, I thought it said “End Section”, thinking I was still in DIAG and had to click that button to proceed to CFG. I clicked it, and I was immediately logged out. I went to log back in, and my login did not work. Uh-oh!

I felt the wave of adrenaline go down my back, similar to when you lock yourself out of a remote device. I approached the proctor and explained what happened. He said “You clicked End Session?”. I nodded. He gritted his teeth and slowly shook his head and said “I’m sorry…that’s it.”. I performed some mental gymnastics trying to stay calm, and a few seconds later he told me to go back to my station and wait, and he’d contact someone to see if anything could be done. He then stood up next to me and said, “Everyone, make sure you do NOT click the ‘End Session’ button until you are ready to exit the lab completely. Once you click that button, that is the end.” How embarrassing! Now everyone in the room knew exactly what I did. It felt like a case of instant karma for my carelessness.

I returned to my station and imagined everyone looking at me in astonishment, though I knew in reality everyone was too busy concentrating and simply now knew for sure not to click that button. While I was waiting, I discovered that they made the Mouse control panel item available. I wish I had known that at the start of the lab, because the default mouse settings were much more sensitive than I was used to, and it kept throwing me off. I adjusted it more to my liking. A few minutes later, the proctor approached me and asked me to log in again. It didn’t work.

I waited a few seconds and tried again. This time it worked, and I was dropped into the configuration section. I now saw what I was expecting on my screen (a full diagram, etc.), which I did not see before this happened, so I still don’t know if there really was something wrong with my initial lab delivery. I didn’t see where my lab was being hosted before (I assumed San Jose, CA), but I saw now that my lab was being hosted in Toronto, Canada. I thanked the proctor most graciously and told him how much I appreciated his remedy. I will never know if they recovered my previous attempt and simply resumed it, or if this was a new second lab and they stitched the two of them together at the end. Regardless, I was now in the CFG section. The proctor also kindly granted me an extra 10 minutes to complete the section.

Lab Day Part 3: CFG

I started CFG as I had learned to do in the practice labs: I wrote down the major task topics for each section along with the point value for that subsection. I didn’t get too detailed or granular, I just wrote down things like “O HQ”, “E Main”, referring to OSPF in the HQ section and EIGRP in the Main section of the topology, respectively. For standalone items I just put the general topic such as “NTP” or “FHRP”. Like with the practice labs, this was just a way for me to check off items that I had attempted, finished, and verified. The topology is very large, and the tasks are not necessarily presented in the order they should be completed in, which makes having the checklist crucial to see what you’ve done and what you have left to do.

I then took about 10 minutes to read over and quickly analyze the entire lab. I was looking specifically for tasks that could be completed independently, tasks that needed to be done in a certain order, tasks that are standalone and do not depend on or affect any other tasks (stub tasks), along with a relative level of difficulty. I then started working on generating the configurations. As I had done in the practice labs, I performed as much configuration as possible in Notepad first, pasting the configs into the devices at the end of each sub-section and doing simple verifications as I went.

After an hour or so, we paused for lunch. I’ve heard many different stories about the CCIE lunch, so I was pleasantly surprised to find a catered lunch waiting for us in the next room. There was something available for just about every person’s taste, including vegetarian options. I am no stranger to eating more than I should, but I was really surprised at how many people were really loading up, especially on carb-heavy foods like cake and pie. I stuck primarily to the chicken and didn’t have any dessert, because I did not want to experience a 2pm carb crash.

We split into two different tables. We all spoke of how we felt we were doing so far, and what we were using for training to get where we are now. There were a couple of multi-CCIEs in the group that were attempting the lab for other tracks. There was also discussion of previous lab attempts at the regular lab locations. I tried to make myself look a little less foolish by explaining what had happened with me prematurely ending my lab. The proctor gave us a couple of minutes notice, and then we went back into the room to finish our labs.

I spent the next hour working as far as I could on the core tasks. I eventually reached a point where I knew I was going to have to put in significantly more effort to figure out how to configure the remaining core tasks correctly. There were also a couple of tasks remaining that were going to require a significant amount of configuration for very little payoff in points, but they would be required in order to complete other tasks. I took a mental break by moving onto some of the stub topics.

After doing what I could to grab some easy points, I returned to trying to finish the core tasks. I kept working as long as I could, but I eventually reached a point where I knew I was not going to be able to complete some future sections because I couldn’t figure out how to configure the prerequisite sections properly in the way they wanted. The prerequisite tasks were such that I didn’t even have any illegal options for a workaround. I had enough time, but I simply had not practiced enough with the particular technologies in order to configure things the way they needed to be, despite all of my effort over the previous month and a half. With about 20 minutes remaining (plus the extra 10 minutes I was granted), I knew that there were no additional points for me to gain in this lab, and that I had done all that I could.

I ended my session (on purpose this time), and quietly thanked the proctor before I left the lab. At this point in time, I knew I had failed the lab due to not making the overall cut-score, but I felt close to the minimum cut-score on all three sections, so I thought there was a chance I would get the dreaded PASS+PASS+PASS=FAIL score (which would still have been somewhat of a consolation prize). I walked to my car and started to head home. By this time Cisco Live was starting to ramp up, and the traffic reflected it. Even though I knew I failed, I felt like a huge weight had been lifted off of me.


The next morning, I received the official notice that I had failed the attempt. I did not meet the minimum cut-score on either TS or CFG, but I did pass DIAG, which surprised me since I had to guess at the end. I performed slightly worse in TS than I thought, and about what I expected in CFG. One thing about my CFG result that surprised me was that I didn’t do as well as I thought on the Layer 2 section. Layer 2 was the only section that I completely finished, and I thought I verified it too. All I can guess is there must have been something really small that I somehow missed, which unfortunately was a semi-frequent occurrence during my practice labs.

One of the things I found to be interesting was that overall, I was not as nervous during the lab as I originally thought I would be. I know this is very much due to taking all of the graded CELT labs just before the real lab. There are other factors involved as well, such as the fortune of being only 30 minutes away from the lab (a luxury I will not have next time), and not having my job depend on it (as was unfortunately the case for some of my friends). This has been entirely a personal goal, and it is difficult for me to imagine the stress of the exam being a professional requirement.

The day after was essentially the first day I had had off since I scheduled the lab toward the end of April (about 45 days between scheduling and attempt). My weekends and days I took off were the most intense, since I performed the full 8-hour graded labs on those days. The day after the lab, I relaxed and tried to start organizing my thoughts about the future. Something I did not anticipate was realizing just how worn out I was from the intense focus I had on studying, because I found it very difficult for a few days to think about anything!

I knew I just needed to give myself a little break, but a few days later I tried to get back into a routine of studying, and it just wasn’t happening. I wasn’t taking in the things I was trying to learn, and I felt like I couldn’t concentrate on anything at all. I was starting to wonder if somehow I broke my mind with this! I began to realize that when I was doing the graded labs, I had a defined path established, and it was easy to stick with that path, even though it was a lot of hard work and dedication. I still knew exactly where I was going by following it.

In studying for the CCIE, I put off a lot of other things that I want to learn at a deeper level. I was suddenly thinking about all of those things again, and I was trying to go in every direction simultaneously, which of course leads to nowhere. I decided I needed to make a new mind-map file to keep track of the things I want to learn. This was a good start to getting me back on track again mentally. Any time a new thought pops into my head regarding something to even consider learning, I can add it to this file in a hierarchical manner. This also lets me keep track of and prioritize what I feel is more important at the time.


In the months between passing the written exam and attempting the lab, I have gone through just about every thought imaginable regarding the CCIE exam. Passing the CCIE lab exam legitimately is no joke and it requires serious dedication. I have analyzed multiple times all of the pros and cons of trying to obtain the certification from just about every perspective. Through it all, I never outright decided it was no longer worth it and that I will no longer work to obtain it, though I’ve come pretty close a few times in sheer frustration.

As it stands now, even though I did not pass this first attempt, the work I put in leading up to the attempt is something that absolutely changed my life and made me a better network engineer. I can now spot design and implementation implications just by scanning a topology, something I was unable to do before. I more fully understand and appreciate the various options (nerd/wanker knobs) and have a much better idea of when you should and should not use them. After working through all of the labs, I have a much greater appreciation for simplicity and repeatability in network designs.

These are all things I knew about before, but the experience of actually labbing the complexity has made me extremely aware of using the simplest solution to solve the problem at hand. It is for these reasons that I am not at all regretful of the amount of effort I put into attempting the lab, even though I failed. I did feel great relief in knowing that I can now get back to my regular life for a period of time. I feel comfortable in the knowledge and experience I have gained because of this process, even if I have not yet obtained the trophy. I am grateful for how much I have learned about myself and what I am capable of, and that I discovered new tools and methods to learn things at a deep level along the way. These are things that will last me for the rest of my life, regardless of what the future holds.

For a few days after the lab, I was pondering whether what I have done so far is good enough. After all, there are many great network engineers who have no certifications whatsoever, let alone the CCIE. I’ve gained excellent experience that will carry me forward in my career. Yet I suspect there are a far greater number of senior-level network engineers who have passed at least one expert-level exam legitimately in their careers, whether or not they still maintain that certification. After weighing the decision for a few days, I purchased a voucher for my next attempt, which will occur sometime in the next 12 months.

I’m also going to move ahead with the service provider topics I want to study. In the days leading up to the lab attempt, I noticed something. Before the graded labs, I was pretty good at configuring the Layer 2 topics, and shortly into studying it became second nature. Within a couple of weeks, the same thing happened with the major IGPs. In the real lab, I had no problem rolling out configurations for Layer 2 and the IGPs. I still had to put a lot of time and consideration into configuring BGP and MPLS L3VPN. Since the service provider topics build on top of this, it occurred to me that if I study in that direction, BGP and MPLS L3VPN should eventually become second nature as well (especially BGP since it is the glue that holds everything together).

In the end, it wasn’t time or even necessarily strategy that caused me to fail the CCIE lab attempt. Having been through the entire CELT program, I also cannot say I received any tasks on the real lab for which I had no previous exposure (and there certainly wasn’t anything present that was not on the blueprint). It was my perception that the lab I received was slightly more difficult than any of the v5 style CELT labs, but not tremendously so. I need to become even better at the core topics, be better at more of the major miscellaneous topics, and I need to become even more cognizant of all the little details in the tasks. I am also very lucky to have a spouse who completely supports me in this endeavor, which really does make all the difference in the world, and I will always be grateful to her for it.

Onward to the next attempt!

Using Bridge Domain Interfaces on Cisco ASR-1K Routers

I am replacing an old Cisco 3945 router with a new ASR-1001X. The 3945, which has three gigabit Ethernet interfaces, has one connection each to two service providers, and a single tagged link back to the network core carrying the traffic of a few different IP subnets. The ASR-1001X has six gigabit Ethernet interfaces, so when replacing the 3945 I wanted to introduce some redundancy into the network by using two physical links back to the core, with each link going to a separate physical switch. This is a great use case for some kind of MLAG technology, but what if the upstream switches don’t support MLAG?

Bridge domain interfaces in IOS-XE can resolve this situation. BDIs are somewhat of a replacement for the old BVIs in classic IOS. However, BDIs are much more feature-rich and capable than BVIs, and have all kinds of extended use cases. Bridge domains are a part of the Ethernet Virtual Circuit functionality in IOS-XE, more fully described here.

For my current needs, I am going to be replacing BVI functionality with BDI. This allows for an IP address to be terminated on the router, while having both links available for failover in case one link goes down. Only one link at a time is usable due to spanning-tree, but a single link can fail with a minimum amount of downtime (on the order of a few seconds when using RSTP).

Enable STP processing and loopguard protection on the router with the following commands:

spanning-tree mode rapid-pvst
spanning-tree loopguard default

Loopguard isn’t strictly necessary, but can offer an additional layer of protection for your network.

[UPDATE: When I first wrote this post, I labbed it up in VIRL. When I went to deploy it on an actual ASR-1001X, the spanning-tree commands did not work. As I found out, this is because this functionality is not included in the ipbase license. You need advipservices or higher to follow these steps because you will need spanning-tree support to make this work. Without spanning-tree, the same MAC address is presented to both uplinks, and your upstream switch will experience MAC flapping because it sees the same MAC address on multiple ports simultaneously.]

The ports on the upstream switches are configured as standard 802.1Q tagged ports. The router is configured with service instances and bridge domains to classify and handle the incoming traffic. Here is an example configuration under a physical interface on the ASR-1001X:

interface g1
 no ip address
 service instance 100 ethernet
  encapsulation dot1q 100
  rewrite ingress tag pop 1 symmetric
  bridge-domain 100
 service instance 200 ethernet
  encapsulation dot1q 200
  rewrite ingress tag pop 1 symmetric
  bridge-domain 200

The service instance command associates the physical interface with an Ethernet Virtual Circuit (EVC) construct in IOS-XE. The encapsulation dot1q command says that any frames received on the physical interface that carry that particular tag will be processed according to this service instance.

The rewrite ingress tag command (as configured in this example) will remove the VLAN tag before processing the frame further, since it is not necessary for this particular application of BDI. The ‘pop 1 symmetric’ portion of the command causes the router to remove the outer VLAN tag before it sends the frame to the BDI, and to re-introduce the VLAN tag as the frame moves from the BDI back to the physical interface. If you were performing QinQ, you could set the value to 2, for example.

Finally, the bridge-domain configuration specifies the BDI to use. In my example, I matched all of the numbers in each configuration stanza as a best practice for configuration readability, but this is not a requirement. Each of the three values (service instance, dot1q tag, bridge-domain) is completely independent. This allows for more interesting bridging options within the realm of Ethernet Virtual Circuits.

You can use the exact same configuration on multiple interfaces, or you can specify that certain VLANs will only be processed on certain links. For example, you could configure a service instance for VLAN 300, and place it only on interface g2, and not on g1. You can additionally use per-VLAN spanning-tree values as a form of traffic engineering. For instance, you could either modify the per-VLAN spanning-tree cost on the router, or the port-priority on the upstream switch, to specify that under normal conditions, some VLANs use one link, and other VLANs use another link. Just be careful to not oversubscribe your links so that if there is a failure, all traffic can still be carried across the surviving link(s).

Finally, configure the BDIs:

interface BDI100
ip address
no shutdown

interface BDI200
ip address
no shutdown

You can use the command show spanning-tree vlan X to verify the redundant links from an STP point of view. Try pinging a few addresses in the same subnets. You can troubleshoot connectivity with show bridge-domain X and show arp. The first command will reveal whether the destination MAC was learned on a particular interface (similar to show mac address-table on a switch), and show arp will reveal whether the ARP process was successful for a particular IP address. I had some interesting issues during configuration on virtual equipment for a lab proof-of-concept, and these commands helped isolate where the issue was. In the virtual case, simply rebooting the virtual router solved the issue.

Someone reading this might be critical of relying on STP for redundancy instead of using a modern solution like MLAG. This particular solution offers a level of redundancy that does not require MLAG. The tradeoff is a few seconds of dropped traffic if STP has to reconverge around a failed link. As with all things, the tradeoff primarily involves money, and using the resources you have available to solve business needs as best as you can. This solution still beats having a single physical link with no redundancy. Previously, if the single link failed, it would mean an immediate trip to the datacenter. With the new redundancy, a failed link still probably means a trip to the datacenter, but maybe not in the middle of the night.  😛

Automating Labs…Now With YAML and Multi-Threading!

The automation described in my last post had a couple of glaring flaws. I quickly discovered the inflexibility of using a CSV file for the data source as I started to add more variables to each device. The second flaw was that for approximately 30 devices, it took about 20 minutes to generate and push the device configurations, because each device was processed serially.

I solved the first issue by using a YAML file for the data source. I initially went with a CSV file because I had not yet developed an IP addressing scheme, and I found it easier to do that in a row-and-column format. However, as I was developing the Jinja2 template, it became apparent that the CSV file wasn’t going to cut it since each device has (or will have) customizations that won’t apply to all (or even a good portion) of the devices.

For example, I am configuring basic IS-IS routing for the service provider IGP, but the CE devices will not be running that protocol. The CE devices represent nearly half of my lab, so having IS-IS options within the CSV file seemed like a waste. This led me to think a little deeper about the information I wanted to represent for each device, and YAML’s immense flexibility seemed like the perfect fit. I would also consider using a SQLite database if I were dealing with hundreds or more devices.

The most time-consuming part of learning to work with YAML files in Python is discovering how to access your data. It’s very easy to write a YAML file, but it takes some thought and testing to get the data back out (which, like most things in life, I’m sure gets easier with more experience and exposure).

Here is an example device from my YAML file:

  port: 32795
  interfaces:
    # Interface, IP, Mask, MPLS Enabled?
    - ['lo1', '', '', True]
    - ['g0/0', '', '', True]
    - ['g0/1', '', '', True]
    - ['g0/2', '', '', True]
    - ['g0/3', '', '', True]
    - ['g0/4', '', '', True]
    - ['g0/5', '', '']
  isis:
    net: 49.0001.0000.0000.0010.00
    interfaces: ['lo1', 'g0/0', 'g0/4', 'g0/5']
  bgpasn: 65000
  bgp_peers:
    # Peer IP, Peer ASN, Update Source, Next-Hop-Self
    - ['', '65000', 'lo1', True]
    - ['', '65000', 'lo1', True]
    - ['', '65000', 'lo1', True]
    - ['', '65000', 'lo1', True]
    - ['', '65001']

This device is described by its management IP and port, IP interfaces, IS-IS and BGP options. If I were to configure another device and it was not going to run IS-IS, I would merely leave the isis: section out.
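Getting this data back out in Python then looks something like the following sketch; the hostname 'R10' and the trimmed-down keys are placeholders patterned after my file:

```python
import yaml

# Trimmed-down device entry; the hostname and values are placeholders.
doc = """
R10:
  port: 32795
  isis:
    net: 49.0001.0000.0000.0010.00
    interfaces: ['lo1', 'g0/0']
  bgpasn: 65000
"""

devices = yaml.safe_load(doc)  # top level becomes a dict keyed by hostname

for host, vars in devices.items():
    print(host, vars['port'])              # scalars come back as plain ints/strings
    if 'isis' in vars:                     # optional sections: check before access
        print(vars['isis']['net'])         # nested mappings become nested dicts
        print(vars['isis']['interfaces'])  # inline lists become Python lists
```

Checking for the optional sections with `in` is what lets a device simply omit the isis: block.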

After the YAML file is imported, it is processed by my Jinja2 template. Here is an example:

hostname {{ host }}
no ip domain-lookup

{%- if isis %}
router isis
 net {{ isis['net'] }}
{%- endif %}

{%- for iface in interfaces %}
interface {{ iface[0] }}
 ip address {{ iface[1] }} {{ iface[2] }}
{%- if isis %}
{%- if iface[0] in isis['interfaces'] %}
 ip router isis
{%- endif %}
{%- endif %}
{%- if iface[3] %}
 mpls ip
{%- endif %}
 no shutdown
{%- endfor %}

{%- if bgpasn %}
router bgp {{ bgpasn }}
 {%- for peer in bgp_peers %}
 neighbor {{ peer[0] }} remote-as {{ peer[1] }}
 {%- if peer[2] %}
 neighbor {{ peer[0] }} update-source {{ peer[2] }}
 {%- endif %}
 {%- if peer[3] %}
 neighbor {{ peer[0] }} next-hop-self
 {%- endif %}
 {%- endfor %}
{%- endif %}


All devices will be configured with a hostname and the no ip domain-lookup option. If the device is going to run IS-IS, that is configured, and if not, that section is skipped. Each specified interface is then configured with its IP address and mask. If the interface will participate in IS-IS or MPLS, that is configured. If the router will participate in BGP, that is configured as well. This Jinja2 template shows a generic device, but as I displayed in my last post, this can easily be modified for individual devices as well (if device == ‘Whatever’). This template also demonstrates examples of nested looping, which takes a little bit of time to test and work out the logic. Once it clicks, though, it is a thing of beauty!
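The nested looping is easy to test in isolation by rendering just the BGP portion of the template against a small hand-built dictionary (the peer addresses below are documentation-range placeholders):

```python
import jinja2

template = jinja2.Template("""
router bgp {{ bgpasn }}
{%- for peer in bgp_peers %}
 neighbor {{ peer[0] }} remote-as {{ peer[1] }}
{%- if peer[2] %}
 neighbor {{ peer[0] }} update-source {{ peer[2] }}
{%- endif %}
{%- endfor %}
""")

# Placeholder peer data in the same [ip, asn, update-source] shape as the YAML file;
# the second peer has no update-source, so that line is skipped for it.
config = template.render(bgpasn=65000,
                         bgp_peers=[['192.0.2.1', '65000', 'lo1'],
                                    ['192.0.2.9', '65001', None]])
print(config)
```

Rendering small fragments like this with test data is a quick way to work out the loop logic before feeding in the real YAML.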

I solved the timing issue with the discovery of the multi-threading library for Python. In my lab configuration script, the YAML file is read into a Python dictionary. Then, for each device represented in the YAML file, I pass its variables into the multithreading function, which then calls my function to generate and push the configuration. Each device is effectively processed simultaneously, which cut the lab configuration generation and deployment from 20 minutes to less than one.
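The pattern itself is small: create a Thread per device, start it, and join them all at the end. Here is a generic sketch with a stand-in worker function in place of my config function:

```python
import threading
import time

results = []
lock = threading.Lock()

def worker(device):
    # Stand-in for the per-device config generation/push; the sleep
    # simulates the slow network I/O that makes threading worthwhile.
    time.sleep(0.1)
    with lock:
        results.append(device)

threads = []
for device in ['R1', 'R2', 'R3', 'R4']:
    t = threading.Thread(target=worker, args=(device,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()   # wait for every device thread to complete

print(sorted(results))   # all four devices finish in ~0.1s total, not 0.4s
```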

Here is my Python script to glue the YAML and Jinja2 files together:

#!/usr/bin/env python3
import yaml
import jinja2
import time
from netmiko import Netmiko
import threading

yaml_file = 'hosts.yml'
jinja_template = 'jtemp.j2'

# Generate the configurations and send it to the devices
def confgen(vars):
    # Generate configuration lines with Jinja2
    with open(jinja_template) as f:
        tfile = f.read()
    template = jinja2.Template(tfile)
    cfg_list = template.render(vars).split('\n')  # list of config lines for send_config_set()

    # Connect directly to host via telnet on the specified port
    conn = Netmiko(host=vars['hostip'], device_type='cisco_ios_telnet', port=vars['port'])

    # Check if host is in initial config state
    output = conn.read_channel()
    if 'initial configuration dialog' in output:
        conn.write_channel('no\n')  # decline the dialog before configuring
        time.sleep(1)

    # Send generated commands to host
    output = conn.enable()
    output = conn.send_config_set(cfg_list)

    # Display results
    print('-' * 80)
    print('\nConfiguration applied on ' + vars['host'] + ': \n\n' + output)
    print('-' * 80)

    # Probably a good idea
    conn.disconnect()

# Parse the YAML file
with open(yaml_file) as f:
    read_yaml = yaml.safe_load(f)  # Converts YAML file to dictionary

# Take imported YAML dictionary and start multi-threaded configuration generation
for hosts, vars in read_yaml.items():
    # Add host to vars dictionary
    vars['host'] = hosts

    # Send vars dictionary to confgen function using multi-threading, one thread per-host
    threads = threading.Thread(target=confgen, args=(vars,))
    threads.start()

threads = threading.Thread. I love it!

Automating Labs with Python, Jinja2, and Netmiko

Following up on my last post, I have set out to start automating certain aspects of my labs. I spent a few days going over the material from Kirk Byers’ highly-recommended Python for Network Engineers course. I studied the previous version of his course a couple of years ago (covering Python 2), but this new version, which covers Python 3, is even better.

I came up with a generic topology that was purposely overengineered so that I can enable and disable links on-demand to create different logical topologies without having to interact with the physical lab topology. The lab represents a single service provider core network, multiple customer sites, and two SP-attached Internet connections. Most links will remain disabled for most lab scenarios, but are there for various cross-site, DIA and backdoor options available with this design.

To automate the baseline configuration, I created a Python script that imports the inventory from a CSV file, uses a Jinja2 template to generate the configuration for each device, and Netmiko to push the configuration to the devices. It’s kind of funny to succinctly place into a blog post something that took many hours to test and troubleshoot before coming up with the final version. The best part of gaining this kind of experience is that I can use what I have already done as a template moving forward, whether for the lab or for actual production.

The CSV file is straightforward. The header row contains the variables for each device, such as the name, management IP, port, and interface IP addresses. Each subsequent row defines an individual device.
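As a sketch of that import step (assuming csv.DictReader; the column names here are hypothetical stand-ins for mine):

```python
import csv
import io

# Inline stand-in for the CSV file; column names and values are hypothetical.
csv_data = """device,hostip,port,lo1ip
P1,10.0.0.1,32769,1.1.1.1
P2,10.0.0.2,32770,2.2.2.2
"""

devices = []
for row in csv.DictReader(io.StringIO(csv_data)):
    devices.append(row)   # each row becomes a dict keyed by the header fields

print(devices[0]['device'], devices[0]['lo1ip'])
```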

The Jinja2 template defines the configuration for all devices, gets populated with each device’s individual variables, and covers device-specific configurations:


hostname {{ device }}

interface lo1
 ip address {{ lo1ip }}

{%- if ifg00ip %}
interface g0/0
 ip address {{ ifg00ip }} {{ ifg00mask }}
 no shutdown
{%- endif %}

{%- if device == 'P1' %}
int lo2
 ip address
{%- endif %}

With this example, every device is configured with the device-specific hostname. Every device is configured with a lo1 loopback address. If the device has an IP address configured for interface g0/0, the IP and mask are configured, along with making sure the interface is not shutdown. If the g0/0 IP address is not specified in the CSV file for this particular device, that configuration section is skipped. Likewise, the final section of the template will only be used if the device is ‘P1’. All other devices will skip this particular configuration section.

The Python script is the glue between the CSV file, configuration generation, and actual configuration deployment. The script imports the csv, jinja2, time and netmiko libraries. The script then defines variables for the CSV and Jinja2 files. Next, the CSV file is imported. The details of individual devices are placed into a dictionary, and each dictionary is placed into a list representing all devices. The script then generates the configuration for each device by feeding the details into the Jinja2 template. Netmiko is then used to send the output of the Jinja2 processing to the actual devices.
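Condensed to its essentials, the flow looks something like this sketch, with the actual Netmiko push replaced by a print so the shape of the pipeline is visible; the file contents and template are illustrative, not my production versions:

```python
import csv
import io
import jinja2

# Stand-ins for the real CSV file and Jinja2 template
csv_data = """device,lo1ip
P1,1.1.1.1
PE1,3.3.3.3
"""

template = jinja2.Template("""hostname {{ device }}
interface lo1
 ip address {{ lo1ip }} 255.255.255.255""")

# 1. Import the CSV: one dict per device, collected into a list
devices = list(csv.DictReader(io.StringIO(csv_data)))

# 2. Render the template per device; this print is where Netmiko's
#    send_config_set() would push the result to the actual device
for dev in devices:
    cfg = template.render(dev)
    print(cfg)
    print('-' * 20)
```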

This kind of automation is perfect for the lab, because the CSV file represents certain baseline aspects that are not going to change, such as the IP addressing of the links between all of the service provider ‘P’ routers. The Jinja2 template can then be modified for different lab scenarios, depending on how much configuration you want to build into the baseline, per-scenario. The script could even be expanded so that it selects a different Jinja2 template based on a menu of possible scenarios. This same type of scripting setup could be used on a production network to set up new sites or push certain standardized configurations (such as enabling NetFlow on all devices). There are all kinds of possibilities.


Why Network Automation?

I have been wanting to get a little deeper into some various technologies surrounding MPLS and BGP-based VPNs (beyond basic L3VPN, such as L2VPN, QoS, multicast, EVPN, etc.), so I assembled a virtual lab with approximately 30 routers which represent a service provider core and several “customer” sites, along with two sources of fake Internet connectivity (or more accurately, a simulated Default-Free Zone (DFZ)). After I earn a deeper understanding of topics within a single service provider core, I will expand this to inter-provider topics. Yes, I meant “earn”, since more work will be involved beyond just reading.

I was getting ready to develop an IP addressing scheme for the core network, and I realized I have a good opportunity here to get deeper into network automation. While studying for the CCIE R&S lab, I spent quite a lot of time in a text editor building configurations to review before pasting them into the device consoles. For tasks that involve repetitive configurations, copy-and-paste is my friend. You don’t (yet) have access to anything like Python or Ansible in the CCIE R&S lab to try to automate things (though I suppose you could use TCL if you really wanted to).

A good portion of setting up a large lab environment of any kind is developing and applying the baseline configuration. I’ve done this countless times over the years, and I was getting ready to do it yet again when it occurred to me that if I invest some time now to develop and use some network automation processes, the buildup and teardown of future labs will be so much quicker. I’ve dabbled with this in the past; I learned the basics of Python and have developed a few network-oriented scripts. I found I enjoy working through the logic and seeing the working results. I also developed and deployed a simple Ansible script to push out some configurations on my current production network.

I read the mostly-complete “rough cuts” version of Network Programmability and Automation by Jason Edelman, Matt Oswalt, and Scott Lowe. This is a really fantastic book, and along with Russ White and Ethan Banks’ new book, I consider it an absolute must-read for anyone wishing to progress their career in computer networking and establish a very strong set of foundational knowledge (I swear I’m not trying to name-drop, but I’ve read a LOT of networking books, and these really are toward the top of the list). When I read Network Programmability and Automation the first time, I used the knowledge as an overview of some of the things that are possible within the realm of network automation. Now I’m going through it again to develop the skills necessary to automate the deployment of my lab configurations.

One thing I believe hinders many people wanting to dig deeper into automation (myself included), is having a use case. It’s easy enough to say that if you have to do a single task more than once, you should automate it. Automate all the things, right? There are two issues I see here: the underlying knowledge of what it is you’re trying to automate, and the ability to evaluate the “bigger picture”.

For example, within the context of networking, you could learn how to automate the deployment of complex configurations for a particular routing protocol, but what good is that going to do if you don’t fully understand what those configurations mean? Automation presents you with an opportunity to make mistakes much faster, and on a much more grand scale. If you automate configurations among several devices and things end up not working as anticipated, can you troubleshoot what went wrong?

Likewise, evaluating the bigger picture helps to understand where particular forms of automation are helpful, and where you will run into diminishing returns. For example, you could automate just about every process involved in network configuration, but nearly every business is going to have exceptions that need to be handled somehow, and automation may not be the answer in those instances.

Tying both of these concepts together, I realized the opportunity to automate the things I know extremely well due to my previous knowledge and experience, such as baseline configurations involving hostnames, IP addresses, basic routing protocol configuration, etc. Because I know how all of these things work very well, I can easily automate this as well as troubleshoot potential issues if things don’t go as expected. In the bigger picture aspect, the purpose of the lab is for me to understand other topics that use the baseline configuration as a prerequisite, and therefore I am not yet ready to automate those technologies because I do not yet have a full understanding of their nuances.

In other words, the more you learn, the more you can automate. You need to develop skills on how to automate things, but if you automate things you do not understand, you are setting yourself up for future frustration. Don’t let this discourage you from learning and getting deeper into both automation and classical network engineering skills. Increasingly, the two go hand-in-hand, but you can certainly end up in a chicken-or-the-egg scenario. My advice is to “earn” the networking skills first, and automate them second.

Mind Map for CCIE & CCNP Routing & Switching

I created a mindmap of topics that are covered on the current Cisco CCIE RSv5 lab exam to help myself study, and I thought my work might be useful to the general network community as well. I included CCNP R&S in the title, because there’s a lot of overlapping information that I think most people pursuing the CCNP might find useful as well. I have covered a lot of topics by way of configuration examples that I remember struggling with a little bit when I was studying for the CCNP (wow, it’s been five years already!).

This document contains a hierarchy of topics with their associated Cisco IOS configuration syntax. These commands should work on most versions of classic IOS and IOS-XE, versions 15 and later. I tried to be as comprehensive as possible with regard to the covered topics referenced against the current CCIE R&S blueprint; however, it is next to impossible to truly cover every configuration aspect within a single document, mostly because any given topic set (and more) may be covered in any specific delivery of the lab exam. In other words, you won’t really know until you get there.

This document provides the configuration syntax for nearly all topics covered. In many cases, examples and explanations are also provided, but not for every single topic (you still need to do your homework first!). Likewise, verification commands are generally not included here, because that would have easily doubled or tripled the size. For any topic you wish to know more about, I highly recommend looking at the official Cisco documentation. Most topics have both explanations as well as configuration examples there. Much of what is contained in this document was sourced from the official Cisco documentation.

This document is NOT:

  • A hand-holding guide through all CCIE topics
  • Any sort of answer key for any sort of specific lab scenarios
  • A comprehensive guide to every possible topic you might encounter on the lab exam
  • A replacement for pretty much any other form of studying

That being said, this document can serve as a good supplemental quick-reference to the vast majority of topics on both the CCNP and CCIE Routing & Switching exams. CCIE topics are limited to the lab blueprint only (I didn’t cover IS-IS, for example).

The original version of this document will always be available here. It was created with the excellent and highly-recommended MindNode software for the Mac. I have included the original MindNode file here, and several other formats as well. The original MindNode file lets you expand and collapse branches of the tree as desired.

Some browser plugins may have trouble viewing some of these files. If a file does not display properly in your browser, try downloading the file and opening it with a different application.

Don’t forget about my 3500 CCIE flashcard deck, and my blueprint documentation reference guide. If you found this to be useful, please let me know on Twitter or LinkedIn. Thanks!

CCIE FlashCard Deck for R&S v5.1

Today I have decided to finally release the flashcard deck that I created for myself in order to pass the written portion of the CCIE Routing & Switching v5.1 exam. This deck represents many months and hundreds (if not thousands) of hours of study effort.

After passing the exam, I had considered putting together some kind of study package, but I recently started thinking about how version six of the CCIE R&S is going to be here before too long, and it serves no purpose for me to sit on this accumulated knowledge and not share it.

That being said, I still believe you will receive a greater benefit by creating flashcards yourself, as I’ve written about previously. The process of comprehending what you are studying and converting that knowledge into a useful format for yourself is very empowering. Sometimes, though, it helps to get an outside viewpoint, which is what this deck of nearly 3,500 flash cards represents.

I passed the written exam using these cards I created. However, there are almost certainly errors and mistakes present in the cards. This could be due to my sources having errors, or to me misinterpreting something. Or it could be as simple as a typo.

Nearly all cards are tagged with at least one topic. My flash card deck is available in the following:

I hope this proves to be useful to you! Thank you.

Don’t forget about my extensive CCIE topics mind map, and my blueprint documentation reference guide. If you found this to be useful, please let me know on Twitter or LinkedIn. Thanks!

Retrospection and the Future: Still Studying for the CCIE Lab

Last October marked the five-year anniversary of my blog. I had started an introspective post looking back over the past five years, but the truth is, I feel like I spend more time looking forward than backward at this point.

That’s not to say I haven’t come quite a long way in the past five years. I sometimes need reminding of that whenever I feel like I’m progressing too slowly. Occasionally, I get down on myself for being in my late thirties and still toward the beginning of my career in networking, when I see many shining examples of people much younger than me who are much further along in their careers. However, I cannot change the past and my various circumstances, and I cannot go back and make different decisions at different points in my life.

I can recognize, though, that just five years ago, I was considered a high-end technician or perhaps low-end systems administrator (even though I had some light networking experience sprinkled in there as well). Since then, I progressed into enterprise and networking support where I received a senior-level promotion in less than a year, then to a full-on network engineer, and now to senior network engineer with another recent promotion. I occasionally need to step back and realize that, regardless of age, there are people who never make it as far as I have, let alone in just five years. My wonderful wife is always great at reminding me of these things whenever I need a pickup.

So, to looking forward: my last post detailed passing the CCIE R&S written exam, which was a very big milestone for me since I’m doing this out of self-interest. However, the lab exam is still a decent amount of time and work away from where I am now. I still spend a large amount of time each week studying for it. Having access to the Cisco Expert Level Training has been huge for me. It’s no magic bullet, as achieving the CCIE legitimately requires a LOT of personal work and dedication, but I feel like it helped greatly to organize and narrow down the topic scope into a more manageable form of study.

I stated before that there is a huge difference between the written exam and the lab exam, and that it probably is in fact a good idea to study for the lab first, then pass the written exam shortly before you plan to attempt the lab. I also stated, though, that for me, passing the written so early was an important method of self-validation. That being said, I found out very quickly that there is a huge difference between just knowing the technologies, and being able to configure them.

Going through the first few CELT workbook labs, there were several instances where I knew exactly what they were talking about, but I either could not remember how to configure them, or I couldn’t solve the task in the exact way for which they were looking. I also quickly learned that I needed to improve my reading comprehension as well as attention to detail.

On top of that, there are things that simply require outright memorization for the sake of speed during the lab. To that end, I’ve been digesting the material a little differently than I did for the written exam. With the written exam, I made and studied detailed flash cards, eventually creating a deck of nearly 3,500 cards before I took and passed the exam on the first attempt. With the lab exam, I’ve been going over various materials yet again, going even deeper where necessary, and creating new study materials for myself.

The first thing I did was create a mind-map in several phases. In the first phase, I created high-level topic domains following the hierarchy of the CELT program (Layer 2, IGPs, BGP, MPLS, etc.). For the second phase, I went through every item on the official lab blueprint and molded the topics into the appropriate places within the appropriate hierarchies in the mind map. I then used INE’s expanded blueprint to fill in the remaining topics (pruning the list where necessary – not everything on their expanded blueprint is actually on the lab exam). Finally, I covered the specific configuration of every topic in the hierarchy by going through the configuration guides and command references, and cross-checking it with the version of IOS used on the lab.

I then worked on expanding my Python skills a bit by creating a set of scripts that take each topic item and randomize certain elements, such as names, interfaces, and ASNs. The idea was to create something similar to flash cards, but less repetitive, since the elements are randomized wherever possible. As I write this, the script is still very simple: it presents you with an isolated task, you enter the answer, and then it shows you the correct answer. I have not yet progressed to combining topics together for questions.
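
To give a sense of the approach (the task templates, device names, and commands below are hypothetical placeholders for illustration, not the actual script’s contents), a stripped-down version might look like:

```python
import random

# Hypothetical task templates. Placeholders like {rtr}, {intf}, and {asn}
# are filled with random values so repeated drills don't look identical.
TEMPLATES = [
    ("On {rtr}, configure an eBGP session to AS {asn} on {intf}.",
     "{rtr}(config)# router bgp {local_asn}\n"
     "{rtr}(config-router)# neighbor <peer-ip> remote-as {asn}"),
    ("Set the OSPF router-id on {rtr} to {rid}.",
     "{rtr}(config)# router ospf 1\n"
     "{rtr}(config-router)# router-id {rid}"),
]

def make_task():
    """Pick a template and randomize its elements, flash-card style."""
    prompt, answer = random.choice(TEMPLATES)
    values = {
        "rtr": random.choice(["R1", "R2", "R3"]),
        "intf": random.choice(["Gi0/0", "Gi0/1", "Gi0/2"]),
        "asn": random.randint(64512, 65534),
        "local_asn": random.randint(64512, 65534),
        "rid": ".".join(str(random.randint(1, 9)) for _ in range(4)),
    }
    return prompt.format(**values), answer.format(**values)

def drill():
    """Present an isolated task, wait for an attempt, then reveal the answer."""
    prompt, answer = make_task()
    print(prompt)
    input("Your answer (press Enter to reveal): ")
    print(answer)
```

Because the values are regenerated on every call, the same template never quite repeats, which is the whole point of randomizing the elements.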

You might wonder what value this has for me personally since I am the one who created all of the questions. The CCIE is absolutely massive, and it covers a lot of technologies that I may never actually work with in production, and therefore it is difficult to memorize everything. There are many topics where I know how they work and what they are supposed to do (and what problem they are supposed to solve), but remembering the exact syntax offhand can be difficult.

One of the features I intend to code into the script is the ability to judge how difficult the question was (similar to Anki), so that in future sessions, I can simply skip over the easy ones and drill in the harder questions. I may also work the script into a GUI or web-based version in order to further expand my Python skills, but at the same time I’m doing my best to not take my eye off the prize, so to speak, and make sure that I’m not being distracted too much from actually studying for the CCIE.
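
One possible shape for that difficulty feature (the 1–4 grade scale and the weighted selection here are just a sketch of what I have in mind, not a finished design):

```python
import random

def pick_question(questions, grades):
    """Bias selection toward questions judged harder in past sessions."""
    # Unseen questions get the maximum weight so they surface first.
    weights = [grades.get(q, 4) for q in questions]
    return random.choices(questions, weights=weights, k=1)[0]

def record_grade(grades, question, grade):
    """Store the self-judged grade (1 = easy ... 4 = hard) for later sessions."""
    grades[question] = grade
    return grades
```

Persisting the `grades` dictionary between sessions (as JSON, for instance) would let the "skip the easy ones, drill the hard ones" behavior carry over from day to day, similar to how Anki remembers card history.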

Distractions can be very difficult. Learning scripting and automation is important for network engineers. However, my current work environment does not require it, so I’ve only dabbled in it here and there just to make sure I understand the concepts. I imagine I will immerse myself deeper into it (and other topics as well) after completing the lab exam.

Studying for the lab has not been my sole focus, which has made it take longer to achieve. I do not intend to stay within enterprise networking forever, and I have been introducing myself to various service provider topics. After passing the CCIE, I intend to find work within the realm of service provider networking.

However, I keep reminding myself that passing the lab is a personal goal, not a professional requirement. This can be both an advantage and a disadvantage, depending on how I am feeling at any particular moment. The advantage is that I can occasionally remind myself not to put too much pressure on myself. The disadvantage is that there are so many things I want to explore, yet the CCIE requires sustained dedication.

When the studying gets tough, I question the value of the CCIE, as I’m sure everyone who has ever studied for it has. This has of course been covered in every imaginable outlet, with every imaginable argument for and against, since the inception of the CCIE in 1993. I can see both sides of most arguments. I still take the approach that simply studying for it, whether or not I ultimately obtain the trophy, is making me a better network engineer. I can much more comfortably solve problems now, using methods I would not have been able to use just a couple of years ago. And I can do it in a vendor-agnostic way when necessary. It’s certainly true that you don’t need any certifications to be able to do that. But once again, the certification has provided a meaningful learning path for me to follow.

It’s difficult when you’re studying topics that you know you’ll never actually work with in real life, or when working with normal technologies and having to push them into types of configurations that would probably never occur on a production network. But I keep telling myself that working with these various technologies in these various manners will help me to recognize patterns in the future, and ultimately make me a more seasoned engineer.

I still have at least a year’s worth of work before I plan on taking the lab exam. I am doing my best right now to drill in the basics so that if I need to refer to the documentation, it will only be for those rare miscellaneous topics, if at all. I strongly believe this initial effort, while slow, will pay off in the end.

Today I Passed the CCIE R&S v5.1 Written Exam

This is a big step for me, and has been a long time coming. I know I haven’t “won” anything yet (I’m not going to be one of those people who put “CCIE written” on my resumé), but at the same time, passing this exam is a major milestone for me. The topic scope for the CCIE written exam is quite vast. There are 100 questions on the exam, which means the single exam cannot cover all of the topics. Until you take the exam, you have no real idea of how deep the topic coverage is, which very much clouded my initial study preparations as I tried to shoot for the moon. I learned over time that that was the wrong approach.

The CCIE R&S written exam has been criticized as being merely a trivia exam. Additionally, the 5.0 version was highly criticized for various reasons, including spelling and grammar/clarity issues. Tom Hollingsworth later followed up and mentioned opportunities for improvement. Those articles were written a little over a year ago, and I am happy to report that I feel like things have definitely changed.

This is the first CCIE-level exam I’ve ever attempted, so I can’t directly say how this version compares to previous versions, but last year I did take the SWITCH exam to renew my CCNP, and I felt a lot of the same frustrations expressed about previous versions of the CCIE R&S exam. I was happy after reading Tom’s article to know that Cisco really does look at the comments left on exam questions. I left lots of comments when I took the SWITCH exam about many of the issues, such as poorly-worded questions and answers.

By comparison, of the 100 questions on the current CCIE R&S v5.1 exam, I did not feel the need to leave a comment on a single one (and I was ready!). I honestly felt that all of the questions were fair and decently worded, and I saw no spelling or grammatical errors. Of those 100 questions, perhaps five had wording tricky enough that I would have missed them had I not read more carefully. That is much better than I was expecting. There were perhaps two questions covering things I had never even heard of, and I wasn’t sure what category they fit into.

I also felt like the topic distribution did not align with what is stated in the blueprint, though that could simply have been my specific delivery of the exam rather than the overall question pool. There’s no way for me to know. It could also very well be that those questions simply stuck out to me more than the others in hindsight. The most generic, non-NDA-breaking way I can put it is that some topics were tested at a deeper level than I expected, and others more shallowly than I thought they would be. I’m sure everyone who has ever attempted any version of the exam has felt the same way, though.

In one sense, I have been studying for the CCIE for over four years, since passing the CCNP in 2013. On the other hand, it has only been this year that I finally began to take the goal of passing the exam much more seriously and assign some sort of timeline to it. I’ve written about 500 times now how creating flash cards has been one of the keys to reaching this success. I’ve read many of the classic books you’re supposed to read when studying for the CCIE over the past several years. That’s all good background knowledge to have. This year, my primary sources of learning were the official cert guides (this year was my second pass of them, I read them once before when they were first published), Narbik’s new v5.1 book, the official Cisco documentation, and the Evolving Technologies guide and selected topics from the CCIE Service Provider Version 4 Written and Lab Comprehensive Guide, both by Nicholas Russo. I bought Nick’s book because I am also very interested in service provider topics, but it is incredibly valuable for the R&S certification, too.

I can also say without a doubt that another key element to me passing today was purchasing the Boson CCIE practice exams. I’ve used Boson’s exams in the past and felt they were decent, but not especially great. That was not the case here. I can honestly say there were a few questions on the real exam that I was able to answer because the Boson practice questions introduced me to the specific topic. That is, despite all of my other sources of studying, there were some Boson questions that either covered something I didn’t think would actually be on the real exam, or they made me think differently about a topic I had already studied, which led to a deeper understanding overall.

However, the version of the product as I write this is not without flaws. A few questions cover topics that have been removed from the current version of the CCIE, and several questions reference older Cisco documentation that is either no longer relevant, or describes things that are configured slightly differently in IOS 15.x as opposed to 12.x. Overall, though, I really believe that had I not purchased the Boson practice exams, I would not have passed the real exam today.

The CCIE R&S written exam really is very far above and beyond the CCNP level. This is what made studying for it difficult at first (the “shoot for the moon” aspect I felt at first). The CCNP breaks the topics up into three exams (it used to be four), whereas the CCIE not only covers all of the CCNP topics in a single exam, it covers a very wide range of things not even mentioned at the current CCNP level, such as QoS and multicast. Additionally, topics introduced at the CCNP level are covered at a much greater depth at the CCIE level (like BGP, for example).

I think one of the early mistakes I made when I first started studying right after my CCNP four years ago was to think that now I was basically starting all over again. That was definitely the incorrect approach, as real knowledge builds upon itself, and that’s how the Cisco certification program is designed as well. The CCIE is deeper than the CCNP which is deeper than the CCNA. But there are topics that are universal between all of them. You are asked subnetting questions at the CCNA level, but you still need to know how to do it at the CCIE level. The difference is that hopefully at the CCIE level, you can just do it quickly in your head.

The other mistake I made early on was straying too far from the blueprint. I was under the assumption that the expert level must mean that a person knows absolutely everything inside and out about every facet of every protocol mentioned on the blueprint, so I started out by covering some of the topics at an insane depth. This is most certainly the wrong approach. Nobody knows everything, and you do not become an expert simply by passing an exam with that word in its title.

As it stands now, I believe the CCIE R&S written exam is about half good practical knowledge, and half pure trivia. I enjoy trivia questions, and I feel good when I get them right, but a lot of that knowledge is not useful in the day-to-day design and operation of a real-world network. Additionally, it takes a process such as constant review of flash cards to maintain that trivia knowledge.

Some people recommend to not take the written exam until you are close to being ready to take the lab exam. I can understand and appreciate that approach. However, taking and passing the written exam is also an important “feel-good” milestone and provides some self-validation about my current level of networking knowledge. But, that is the difference between the written exam and the lab exam. The written exam represents the knowledge itself. The lab exam represents the knowledge plus experience.

As they say, now the real work begins.

Cisco CCNP SWITCH Topics

Last year, I was not yet ready to take the CCIE R&S written exam, but my CCNP was about to expire. I decided to renew by taking the SWITCH exam, which had been revised since my original CCNP certification. I continued to study for the CCIE, but I focused my attention on topics that were on the SWITCH exam. As I was doing this, I was still in the middle of transitioning my learning process to using Anki, and therefore I was still taking traditional notes.

I had made this information available previously in individual Google Docs files, but have now consolidated it into this single post. If you’re looking at this information for CCNP preparation, some of it goes above and beyond what you need to learn for the exam. However, reading it after the textbooks may give you an alternate take on some of the technologies covered here.


On Learning: Knowledge Management

For most knowledge-intensive fields of work, including network engineering, you must learn how to manage vast amounts of information if you wish to progress to more advanced levels. The first two articles in this series discuss creating and reviewing flash cards, which through spaced repetitions lead to dramatically increased knowledge retention. But what about static knowledge at rest?

Knowledge Management:

Just twenty years ago, resting knowledge would typically consist of multiple shelves of books, and several binders full of notes. Many people still operate this way, and there’s nothing wrong with that if it works, since the goal is to know (retain) and understand (apply) the knowledge, regardless of the methods used. For many people, the sense of touch is conducive to learning, hence the physical books and the “muscle memory” of writing out notes by hand.

As for myself, I have always had a preference for knowledge in digital form. A full bookshelf is great for visitors to say “Wow, you’ve read a lot of books!” But beyond mere trophies, I personally don’t experience the appeal of physical books. Digital books don’t take up space in your residence, you don’t have the hassle of moving them around, and of course the largest gain is searchability.

One of my greatest “career blessings” has been having a subscription to Safari Books Online. I say without hesitation that anyone serious about their career in information technology, whether it be software development, infrastructure, design, or business analytics, should consider getting a subscription. I place this service as the single most important money spent on my personal career development each year. New books are added several times each week, and you are frequently offered early access to books that are still being written. Additionally included are many thousands of books that have been published during the past several years, including the majority of the Cisco Press library. One of the best features is the ability to search across all books in the entire library.

For managing this stockpile of knowledge, Safari’s built-in queuing system lets you collect books you intend to read or wish to access frequently. I quickly found this unwieldy given the number of books I both wish to read and want to reference. I like hierarchies, and the current Safari queue is a flat list. Instead, I create a hierarchy of browser bookmarks for all the books I’ve come across that I have read, wish to read, or otherwise want to be able to reference quickly. I have about 15 folders containing approximately 250 links to books. The folders represent general topics, and the link names begin with the year of publication. When I wish to reference a particular book, it is much quicker to click the saved link in my browser.

A second major source of knowledge for me is other people’s online articles and blog entries. I use Feedly as my aggregator, and I have enough feed subscriptions that I see approximately 300-400 new titles each day, of which I actually read an average of ten. The raw number may sound like a lot, but it doesn’t take long to go through them. I select articles to read first by title, and then by source, which is why it is important to have a good title when you post something (a skill I am still developing). My logic is that if a title is very interesting, I’ll open it regardless of the source, and if the title is vaguely interesting, I’ll open it if it’s from a source I know produces great content. The rest get filtered through very quickly. If I encounter a blog entry I’ve really enjoyed from someone (usually referenced from Twitter or from someone else’s blog), I add them to my feed.

I then clip the good blog entries into OneNote for potential future reference. I have about 5,000 articles and blog entries saved in my collection. A large portion of them were collected not because I read them, but because I assumed I would reference them someday. For example, a few years ago, if someone posted a great blog entry, I would take the time to clip every blog entry from their site. I eventually realized this is not helpful, and only leads to “digital hoarding”. I think it was due to a Fear Of Missing Out, which I am still learning to move beyond. Access to Safari helped me dramatically in that regard, but the realization and correction of those tendencies is what led to the creation of this third part of my series on learning.

Although having an unread article for reference can potentially be valuable in the future, it’s more difficult to reference something if you’ve never read it. That is why I eventually learned to clip only those articles that I’ve actually read and wish to save, or those that I’ve at least skimmed over and may reference again (instead of just blindly clipping everything). When I need to search for something to reference, I found I always start at Google anyway, and if I don’t find what I’m looking for on the first page (and rarely the second), then I’ll do a search across my OneNote library. Based on this realization, I’ve thought about purging all of my saved articles and starting over so that I’ll have a more “curated” collection. On the other hand, I figure I’ve already done the work of collecting the articles, and their contents are only a search away.

Also within my OneNote library are all of the digital notes I’ve taken over the years before my transition to direct notes in Anki. In a way, these notes are sort of like the many unread clipped articles in that there’s a decent chance I may never actually reference them, but I’ve already put the work in to capture them, so they stick around. Additionally, while Anki is great for most learning, sometimes traditional notes are more appropriate. On the rare occasion that I decide to learn via video, I use OneNote to capture both notes and relevant screenshots.

Another tool that has helped me dramatically with knowledge management is something I encountered more recently: Xmind. At its core, Xmind is mind-mapping software, and while it is certainly useful for that in the traditional sense of the term, instead of using it purely as an exploratory mental exercise, I use it mostly to hierarchically organize existing sets of information. This has been extremely powerful with regard to certification studying.

In the image above, I took the CCIE R&S v5.1 blueprint and organized it hierarchically. This allows me to expand or collapse branches of the overall blueprint as desired. Additionally, I colored green those topics which appear on the written exam only, and not on the lab exam. Not shown, I also highlighted sections that I felt were not covered in great depth (or at all) in the OCG books, so that I can quickly see which topics need alternative sources. Using the same blueprint hierarchy, on another page I broke down each subtopic and wrote a single one-line explanation of the particular technology. For example, “Bidirectional PIM uses only shared trees and is useful when many receivers are also senders.”

Finally, I have been using Xmind as a tool for project management. The paid version of Xmind has a more complete set of traditional project management functions (such as actual time scheduling and tracking), but I use the free version to list tasks I wish to accomplish, and their associated sub-tasks. I then drag and drop the order of the tasks as things change.

I’ve written before how I feel like studying for the CCIE is sort of like a research project. Some people approach it linearly, based on the blueprint. This is sometimes associated with keeping a “tracker”: a spreadsheet of topics to study, along with a perceived level of understanding. I’ve tried that approach a couple of times, but it never felt very useful to me personally. Using Xmind, I’ve created a branching tree structure containing things I would like (or need) to do; I re-order the items as necessary and cross them off (via strikethrough text) when I have accomplished them. This lets me quickly see where I’ve been and where I intend to go.

Knowledge management is a very subjective topic. To me, learning represents an evolving continuum of progress, and I have tried several different methods of acquiring, maintaining, and managing a base of knowledge over the years. This series represents a set of tools I wish I had known many years ago. Learning is personal, and it takes time to develop the processes that work best for you. As technology continues to evolve, so do the tools and methods of learning, knowledge retention and knowledge management. It can be more difficult to progress if you’re unaware of the available tools and methods. I hope that by sharing my experiences, I may have helped you in some way. Thank you for reading.

Part One: Creating Meaningful Flash Cards
Part Two: Flash Card Review
Part Three: Knowledge Management

On Learning: Flash Card Review

This part covers what I have learned about reviewing the flash cards.

Flash Card Review:

The premise of Anki (and related types of software) is spaced repetitions. To get the most out of the software, you need to make yourself get in the habit of reviewing your cards every single day. When you stop reviewing, you very quickly start to lose retention. However, by maintaining the habit of reviewing cards every day, you will spend less time each day reviewing because the retained cards will be spaced further apart.

When you’re very deep into studying a particular topic, you may create a decent number of cards in a short period of time, especially if the topic is new to you. This will temporarily increase the time you spend reviewing cards each day. The default Anki setting is to introduce 20 new cards each day, in addition to the cards already scheduled for review that day. There is a tradeoff to be realized here: if you introduce a lot of new cards, your daily review times increase; if you introduce fewer new cards each day, it takes longer to learn the material. I have tried it both ways and found the default setting to be just about right. If after your daily review session you feel like you could take in some more cards, you can easily add new cards to the day’s session; I recommend doing this 10 cards at a time. Just remember that in doing so, you will have more cards to review over the next few days.

When you are creating lots of cards in a short period of time, I highly recommend setting the review order to chronological, instead of random. By changing this setting, new cards are introduced in the order of creation. When you create multiple new cards, chances are very good that the cards will be related to each other in some way. During the initial review of these cards, you are more likely to have better retention by seeing related cards close together. As time passes and the reviews become more randomized, the initial bond of related cards will have a greater impact on retention. After you have added (and reviewed) the bulk of your cards, it may be beneficial to return the setting to random as you add fewer miscellaneous cards over time.

When reviewing your cards, resist the temptation to click the “easy” button, except perhaps on the first viewing of the card, which defaults to showing the card again in four days. I discovered that for a large number of cards, if I had clicked the “easy” button three or four times, by the time the card was scheduled to be shown again, the spaced repetition was too great, and I had trouble remembering it. When I stopped doing this and reverted to being honest with myself (clicking “again” when I really couldn’t fully remember the answer to a card), I found my retention to be greater. The default spacing algorithm works very well.
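
To see why repeated “easy” grades inflate the spacing so quickly, here is a toy model in the spirit of the SM-2 family of algorithms that Anki’s scheduler descends from. This is an illustration only, not Anki’s actual implementation:

```python
def next_interval(interval, ease, grade):
    """Return (new_interval_days, new_ease) for one review, SM-2 style.

    interval: current gap in days; ease: growth factor (Anki starts at 2.5);
    grade: "again", "good", or "easy". Constants here are illustrative.
    """
    if grade == "again":
        # Forgotten: reset the interval and penalize the ease factor.
        return 1, max(1.3, ease - 0.2)
    if grade == "easy":
        # "Easy" grows the gap by the ease factor plus an extra bonus.
        return round(interval * ease * 1.3), ease + 0.15
    return round(interval * ease), ease  # "good"
```

Starting from a 4-day interval at a 2.5 ease, three consecutive “easy” grades in this model push the next review out past 160 days: exactly the “spacing too great to remember” effect described above.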

Speaking of honesty, the greatest retention really does come from judging yourself as honestly as possible when reviewing a card. Even if it’s something that you know you used to know inside and out, if you can’t fully (and relatively quickly) answer a card, you should click “again” so that the card is reset and you view it more frequently.

However, if you click “again” on a card eight times, it is considered a “leech” and becomes suspended from the deck. This is the software trying to protect you from wasting your time on that particular poor card. If any of your cards reaches leech status, you need to re-evaluate it. You most likely need to figure out a way to break it up into smaller pieces of information, or somehow increase your understanding of the topic. The solution may be to create some kind of visualization to represent the topic, as I demonstrated in the previous post.

Somewhat related is the ability to “bury” a card during review. As you create cards, they will naturally have different levels of difficulty. Burying a card during a review session allows you to move on to the next card in the queue. This can be useful if you’d like to get through the perceived easy cards first, and then do the more difficult cards later by unburying them. If you don’t unbury a card, it will be automatically scheduled for review the next day.

This is a snapshot of my review patterns for the past 30 days:

As I write this, I have about 2500 cards in my CCIE R&S deck, and I have reviewed 3624 cards during the past 30 days with an average of 121 reviews per day (which, not shown, takes about 30-45 minutes). The blue indicates brand new cards learned during a particular day. “Young” cards are those which are scheduled to be reviewed within the next 21 days, and “mature” cards are those to be reviewed more than 21 days from now. The color red indicates cards that I had forgotten and clicked the “again” button. I joke that the red indicates my level of self-honesty.

The color yellow in the image deserves special mention. If you’re creating lots of cards in a short period of time, your next-day reviews can be substantial at first. Sometimes it helps to cut down the next-day reviews a bit by studying ahead the previous day. For example, I prefer to study as many cards as possible first thing in the morning, and I usually finish all the reviews in a single session. But if I know I’m going to have a larger than normal amount of cards to review the next day, sometimes I’ll study ahead later in the day.

To do this, click the deck title > custom study > review ahead > 1 day. This creates a custom deck containing the cards that you would normally review the next day. I like to keep reviews to about 75 or fewer per day, but sometimes the count climbs over 100. With the custom deck, I can review however many cards I wish, then delete the deck. The reviewed cards are treated as if it were tomorrow, and the remaining cards from the deleted custom deck are scheduled as normal. For example, if you have 100 cards due tomorrow but create the custom deck and review 20 of them, you will have only 80 cards due tomorrow.

By reviewing the cards scheduled each day and building up the retention of knowledge, I’ve found that when I move on to another learning source for the same topic, it is easier to pick out the details that I don’t already know and make new cards out of that to add to the deck. This demonstrates knowledge building upon itself. For example, as I mentioned, my CCIE R&S deck contains about 2500 cards as I write this. I created the bulk of these cards by going through the OCG books with a fine-toothed comb. Now that I have moved on to other sources of learning, I am usually aware of when I have already created a card for a particular topic by virtue of having seen it several times already. If there is something that I’ve already created a card for, but don’t remember, chances are that the new card will be somehow phrased differently anyway, and having both cards in the deck will also lead to greater understanding and retention.

This end result of moving between different sources of information and recognizing the details of what I have already learned is what has made using Anki so revolutionary for me, and why I felt it was important to devote yet another post to the wonders of spaced repetitions. I remember going through books in the past, then moving on to other books within the same topic scope and feeling like I was essentially starting all over again. Back then, knowing how much information I needed to retain, I felt a slight sense of hopelessness.

Now, being able to demonstrate to myself the progress I am achieving is extremely empowering. I’ve encountered several things in my day job which, previously, I would probably have had a vague understanding of, but now I can immediately recall the acquired knowledge and place it into action. That is a very powerful feeling.

Part One: Creating Meaningful Flash Cards
Part Two: Flash Card Review
Part Three: Knowledge Management

On Learning: Creating Meaningful Flash Cards

This is part one of a three-part series.

I still study for the CCIE R&S. I study for it in some form (and often multiple forms) every single day. My attitude, thought processes, and learning habits have changed quite significantly over the last year and a half. My experience is growing, and timelines are starting to become more concrete. I’ve written about this before, but this past year really has been life-changing with regard to my study habits, “learning how to learn”, and discovering what works best for me to take in, manage, and retain information.

Just over a year ago, I wrote about using Anki. This software has been the catalyst for my learning transformation. I wrote a blog entry after having used Anki for only a short period of time. With this three-part series, I am now expanding on my experiences of using Anki since then, as well as knowledge management across different platforms.

Flash Card Creation:

I have learned a lot about the process of creating good, quality cards, as well as consuming them. For many years, I took direct notes while reading books or watching videos. Despite all that note-taking, the notes made little impact on my studies due to a lack of quality in both the notes and the review process. Using Anki has helped me refine both activities, and I now formulate my notes directly as flash cards. I also realized that I absorb information much better in written form than from video. Following are some tips that help me create more effective cards.

When creating flash cards that go beyond simple raw facts, take the time to succinctly explain the concepts and meanings or reasons behind the topic. It has been said that you don’t really understand something complicated until you can explain it in simplified language. If it’s not making sense to you while you’re trying to make the card (such as if you’re tired of studying for the moment), don’t try to create the card until you can fully grasp the concepts, otherwise you’ll just end up creating poor-quality cards. You can’t force meaningful learning, and you must break down what you don’t fully understand.

When creating cards, understand the tradeoffs of shorter versus longer. Shorter cards are easier to remember, but sometimes a card must be longer due to the amount of information to convey. Always try to limit cards to a single fact or idea. If a card must contain multiple pieces of information (for example, if everything goes together and there’s no real logical way of breaking it up) and you’re having trouble remembering everything after several reviews, try to figure out a way to create a visualization.

I kept forgetting the five requirements of using the EIGRP Add-Path feature with DMVPN until I created this visualization:

This image contains the exact same information that was present in my original text-only card, but by adding colors, shapes, and positions, I gave myself a mental image of the information to recall.

If you’re taking in something of unknown importance, create the card anyway and then suspend it from review. For example, in studying for a certification, you may or may not need to know all of the fields within a particular protocol header. Something like this can be relatively complicated as well as time-intensive to memorize, and it might be unnecessary. By having the suspended card in your collection, you can easily revisit the information or reinstate the card if you discover it really is necessary.

Mnemonics are also extremely useful. What are the thirteen fields in the IPv4 header? I can tell you that “Very Heavy Dudes Prefer Fat Fried Food To Pretty Healthy Salad Dish Options” is easier to remember than: Version, Header Length, DS Field, Packet Length, Fragment ID, Fragment Flag, Fragment Offset, TTL, Protocol, Header Checksum, Source IP Address, Destination IP Address, and Options. When I created this mnemonic, I purposely tried to come up with something silly so that it would be easier to remember.
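As a quick sanity check, the first letters really do line up. A throwaway script (purely illustrative, using the same field list as above) can verify it:

```python
# Verify that each word in the mnemonic starts with the same letter
# as the corresponding IPv4 header field (the 13 fields listed above).
fields = [
    "Version", "Header Length", "DS Field", "Packet Length",
    "Fragment ID", "Fragment Flag", "Fragment Offset", "TTL",
    "Protocol", "Header Checksum", "Source IP Address",
    "Destination IP Address", "Options",
]
mnemonic = ("Very Heavy Dudes Prefer Fat Fried Food To "
            "Pretty Healthy Salad Dish Options").split()

assert len(fields) == len(mnemonic) == 13
for field, word in zip(fields, mnemonic):
    # Each mnemonic word shares its first letter with its field.
    assert field[0] == word[0], (field, word)
print("mnemonic matches all 13 fields")
```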

After creating several cards (such as for a section or a chapter), go back and break up cards that convey multiple ideas (that can be easily broken up). For example, a card that asks “How and why…” can probably be broken up into separate “how” and “why” cards, which will increase your retention. A card-creation habit I had to break myself out of was phrasing a card as “What is X and how do you configure it?”. I discovered better retention by creating separate cards for “What is X?” and “How do you configure X?”.

Likewise, create cards that allow you to learn the same thing in reverse when possible, known as two-way learning. For example, one card could ask “What is the IP Protocol number used by L2TPv3?”, and a second card could ask “What technology uses IP Protocol 115?”
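For raw facts, the reverse card is mechanical enough that it could even be generated automatically. Here is a minimal sketch in Python; the function and the reverse-question phrasing are my own invention, not anything Anki itself provides:

```python
def two_way_cards(fact: str, answer: str) -> list[tuple[str, str]]:
    """Build a forward and a reverse flash card from one raw fact.

    'fact' names the thing, 'answer' is its value, e.g.
    ("the IP Protocol number used by L2TPv3", "IP Protocol 115").
    """
    forward = (f"What is {fact}?", answer)
    reverse = (f"What technology or concept matches '{answer}'?", fact)
    return [forward, reverse]

cards = two_way_cards("the IP Protocol number used by L2TPv3",
                      "IP Protocol 115")
for question, answer in cards:
    print(question, "->", answer)
```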

Don’t worry about this too much at first when you’re creating the cards, as it may inhibit the initial creation. However, do take the time to go back and break them up. What I discovered for myself was that when cards contained multiple pieces of information, I would remember one part but not the other, so the card became less valuable in helping me retain the information. Two-way opportunities might be a little more difficult to spot, especially when examining many cards at once, and you may bump into diminishing returns if you attempt to create two-way cards for everything. Raw facts make the easiest two-way cards. However, for cards that explain a particular concept, creating a two-way card can demonstrate that you really understand the topic.

Assign tags to create meaningful groups to use later for custom review decks. Try to use somewhat broad categories, and avoid creating lots of super-detailed tags unless they are combined with more general tags. For example, use STP as a broad category, with RSTP, MST, and STP Enhancements as subcategories. Don’t create tags that will represent only a few cards; use tags that will create meaningful groups to study from, and otherwise just use the search function in the browse window. On a related note, in the card browser search box, you can exclude a term by prefixing it with a dash. For example, -EIGRP or -tag:IS-IS excludes cards containing the word “EIGRP” or cards tagged with “IS-IS”.

When creating cards, try to keep in mind that you may be reviewing them again months or even years later. This incentivizes you to create good-quality cards. As mentioned, don’t create cards if you’re not understanding the material at the time. Be honest with yourself, and if you’re just not getting it, take a break and come back to it. For example, I’ve noticed that if I’m feeling tired, I have trouble creating good, meaningful cards, but if I come back to the same material the next day, it all begins to make sense and I can create better cards as a result.

Remember, you can’t force learning. It has to mean something to you. When attempting to create quality cards, try to be as accurate as possible with your understanding of a topic (aside from raw facts which are either correct or not), but remember you can always edit a card later if your understanding of the particular topic improves. I have found myself more than once reviewing a card created months ago, with an improved understanding that allows me to re-phrase the card to have more meaning or clarity.

Make your cards as useful as possible within the realm of diminishing returns. Don’t worry about making the cards and tags perfect. The goal is to retain the information from the individual cards, not to have them perfectly phrased or organized. If a topic seems too complex to explain simply, look for ways to break it up into smaller pieces. A concept may seem singular at first, but if you can break it down into even smaller components, you will have both better retention and a better understanding of the topic. Reviewing the individual components will make the assembled whole make more sense in your mind.

Part One: Creating Meaningful Flash Cards
Part Two: Flash Card Review
Part Three: Knowledge Management

How to Install Oxidized for Network Configuration Backup

Oxidized is an open-source project started by Saku Ytti and Samer Abdel-Hafez as an alternative to the very popular RANCID software. A little over a year ago, I created a RANCID server to back up the configurations of my network devices. It has been a good, stable piece of software that has been doing the job very well across hundreds of devices.

When I set up the RANCID server, I had heard of Oxidized, but the project wasn’t yet as far along as it is now. A few days ago, I decided to take another look at it. One of the things that made Oxidized more appealing to me right away is its companion web interface. While RANCID can be “web-enabled” with the viewvc interface, it is pretty limited in functionality. I found the Oxidized web interface to be exactly what I was looking for. It also supports a very wide range of network devices and network operating systems.

Like viewvc with RANCID, Oxidized lets you view current configurations and diffs between versions. However, Oxidized also lets you search for terms across all of the configurations. If only some of your devices have a very specific configuration or inventory item, you can search, and only the matching devices will be displayed. For example, in my environment, I can search for “PVDM” and quickly see which of my Cisco routers contain DSPs.

The web interface is also very fast! I have approximately 500 devices being backed up, and the web interface is always extremely responsive. Another feature of the web interface is the status of the last device configuration poll. You can see how long it takes on average to pull a configuration from the device, the number of times the configuration backup failed, the failure rate, and the time of the last failure. This helped me identify a broadband link that was consistently slow, because its average run time was much higher than that of the other devices.

As wonderful as Oxidized is, one of its current drawbacks is a lack of good, complete documentation. When I set up an Oxidized server for my environment, I documented all of the steps I took, including caveats I encountered, to have a successful install. The following is a guide to setting up an Oxidized server on CentOS 7 with basic web authentication. Like many Linux-related installation instructions, there are multiple ways to reach the ultimate goal, and what I have done may not be the best, most secure, or optimized way, but I reached the end goal of a working installation.

Continue reading “How to Install Oxidized for Network Configuration Backup”

Today I Passed the CWNA Exam

I have been involved with both wired and wireless networking for many years. My original wireless setups were from the early 2000s, shortly after 802.11b became popular. I remember at one point I had a PCMCIA card with a pigtail and external antenna attached to it.

As my career started taking a focus more toward networking, I became intimately familiar with just about every aspect of wired networking. Having worked with wireless for so long, I knew a decent amount about how the technology works, but not nearly to the level of familiarity I have with Ethernet.

Occasionally, I look at various job listings just to see what employers generally expect within different levels of networking careers. I kept seeing wireless networking as a general skill, and in many listings, I saw the CWNA as either a requirement, or a “nice to have”. I decided it was time to finally bridge the divide in my networking knowledge and learn some wireless topics at a deeper level.

I feel like the CWNA exam is absolutely perfect for this. This exam is not so introductory as to have no value whatsoever, but it is not so deep that you have to devote a significant amount of time toward it to pass. I am not yet looking to devote myself to wireless networking, but the CWNP program does offer more advanced certifications for those that are. If I ever decided to pursue an even deeper level of wireless networking knowledge, I would definitely come back to the CWNP program and work on those additional certifications.

I started studying for this certification, and took and passed the exam on the first attempt, within the course of just a little over a month. I will admit, with me already having CCNP-level knowledge, there were a lot of topics on the CWNA that I was already familiar with (and even a few topics that I disagreed with!). This made studying for the exam go by a little faster.

My process was to first read the Official CWNA Study Guide all the way through. This took a couple of weeks, reading one or two chapters each day. In the past, when studying for a certification, I would have taken tons of notes, which ended up being somewhat useless to me. It took me a long time to break this habit; it may work well for some people, but I found out over time that the process doesn’t work for me. I still have all of the notes I’ve ever taken for every certification I’ve studied for, but simply reading my notes doesn’t really do much for me. This time, I took no notes while reading.

For this certification, after reading the entire certification guide, I took all of the chapter questions from the book, and all of the entries in the glossary, and made flash cards out of them in Anki. Using Anki, I was able to very quickly separate what I already knew from what I still needed to retain. After two weeks of spending an hour or so each day reviewing flash cards, I took the first of three online practice tests. I made new flash cards out of the questions that I missed, and continued to study. A week later, I took the second of three practice tests and did much better. Once again, I made cards out of the questions I missed.

Since I did so well on the second practice exam, I decided to schedule the real exam for the following week. I continued to review cards, and a few days before the test, I took the third of three practice exams. I didn’t do as well as on the second, which shook my confidence a little bit, but it was still a passing score, so I continued reviewing the cards and kept the exam as scheduled. In the end, my flash card deck contained about 1100 cards.

The online practice exams are included as part of having access to the textbook. I have a subscription to Safari Books Online (best money I’ve ever spent in my life!), and I was able to register for access to the practice exams on the Sybex website. These official practice tests, along with using Anki, absolutely transformed my method of studying and more importantly, information retention. I actually found the practice tests to be a little more difficult to pass than the actual exam, which was a nice bonus.

There are a lot of little details that you need to memorize to pass the CWNA exam. These are details that will definitely be forgotten after the test is over, unless you keep reviewing the material. But, the CWNA also teaches many different concepts and methodologies that revolve around the world of wireless networking, and this is the most important information that I believe will stick with you if you study for and pass the exam.

For example, suppose you are setting up a brand new 802.11ac wireless network where previously there was none (a greenfield installation). You might not need to remember the exact Modulation and Coding Schemes that 802.11ac uses, but knowing essentials, such as the fact that 802.11ac operates only in the 5 GHz bands and that the 5 GHz frequency bands operate a little differently than the 2.4 GHz bands, will be excellent knowledge to have when you need to troubleshoot the wireless network post-installation.

The pricing of the CWNA-106 exam isn’t too bad ($175 as I write this), at least not compared to Cisco’s recent price hikes, and the process of studying for and gaining the credentials has been well worth it to me. I will now absolutely be able to more intelligently discuss wireless networking, troubleshoot, and plan and make appropriate proposals when needed.

General Network Challenges, and IP/TCP/UDP Operations

Having fundamental knowledge of what affects TCP, UDP, and IP itself helps you to better troubleshoot the network when things go wrong. I feel like most of the lower-level network-oriented certifications barely touch on these topics, if at all. However, the current Cisco CCNP and CCIE Routing & Switching exams do expect you to know this. This post is geared toward Cisco’s implementation and defaults regarding the various topics. However, whether you are studying for a certification or not, this is all good information to have.

This mega-post covers the following topics:

Continue reading “General Network Challenges, and IP/TCP/UDP Operations”

QoS in Action

Quality of Service is an added-value network infrastructure service that is still very important within the scope of private networks. Some might argue that QoS is not as important as it once was as we start to see more SD-WAN deployments that utilize the general Internet for transport, because the Internet has no inherent QoS. Additionally, many private networks do not utilize QoS whatsoever, and their operators essentially just “hope for the best” as all the different types of traffic traverse the various links. This may be due to a lack of awareness or training on the part of the operators, or it may simply be that the business has not placed enough value on it.

One of the ideas behind an SD-WAN deployment is that since the Internet does not offer QoS, you can attempt to circumvent this when using the Internet for transport by having multiple connections, ideally from different service providers, and monitor the end-to-end quality of the links through metrics such as bandwidth utilization, delay and jitter. A good SD-WAN solution will monitor the links, and could be configured perhaps to send voice and other delay-sensitive traffic over the link that is the least congested and/or has the lowest delay and jitter, while sending bulk data over a different link.

Even if you are using the general Internet for your transport, QoS may still be important if you consistently use all or the majority of your available bandwidth. You can’t control how your data will flow across the Internet after it leaves your private network, but you can control all aspects of your data until it reaches your private edge. One of the major benefits of using QoS is queuing/scheduling your traffic through classification and marking.

At a high level, you implement QoS by first classifying your traffic. This can be as simple as two classes, such as delay-sensitive traffic and everything else. The most common model uses four classes, and there is also a standardized eight-class model. Most networking equipment that supports QoS allows you to get even more granular, if you wish. You determine classes based on different characteristics, such as the type of treatment or the relative importance of the traffic. You can also simply classify the traffic based on the source or the destination (such as all traffic to or from a particular server).

After classifying traffic, actions can be taken on the different traffic classes, such as marking or specialized treatment. Classified traffic is often marked using CoS at Layer 2 (such as Ethernet) and DSCP at Layer 3 (IP). CoS is considered a local, link-level marking, whereas DSCP can be carried across the entire IP network. For example, traffic coming from an IP phone may be marked as CoS 5 by the switch the phone is connected to. When the traffic crosses the first-hop router (which could very well be the same switch), the Layer 2 CoS marking may be mapped to DSCP EF at Layer 3. The DSCP marking may be ignored at various points in the network, but it will remain inside the packet header unless a network device purposely changes it. With QoS marked in the IP header, any device along the path that processes IP packets can examine the header and possibly take action, such as offering that particular packet different treatment.
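On Cisco IOS, this classify-and-mark step is typically expressed with MQC class maps and policy maps. The following is a simplified sketch, not a drop-in config; the class and policy names, the interface, and the assumption that voice arrives already marked CoS 5 are all illustrative:

```
! Classify on the Layer 2 marking set by the phone or access switch
class-map match-any VOICE
 match cos 5
!
! Map the CoS marking to DSCP EF; everything else gets default DSCP
policy-map MARK-INGRESS
 class VOICE
  set dscp ef
 class class-default
  set dscp default
!
interface GigabitEthernet0/1
 service-policy input MARK-INGRESS
```

From that point on, any IP-aware device along the path can match on DSCP EF without having to re-classify the traffic.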

The ultimate purpose of classifying and marking traffic is for queuing/scheduling, which is the process of determining which traffic is sent first. Network interfaces will normally use FIFO (first-in, first-out) scheduling when the link is not congested. However, when the link is congested, traffic that has been classified and marked as more important can be scheduled to be sent first.
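In Cisco MQC terms, this congestion-time scheduling is usually configured as LLQ/CBWFQ. A hedged sketch of a simple queuing policy follows; the class names and percentages are arbitrary examples, and each named class would need a matching class-map:

```
policy-map WAN-QUEUING
 ! Low-latency queue: serviced first whenever the link is congested
 class VOICE
  priority percent 20
 ! Guaranteed minimum share during congestion
 class CRITICAL-DATA
  bandwidth percent 40
 ! Everything else fair-queues over the remaining bandwidth
 class class-default
  fair-queue
```

When the link is not congested, these guarantees have no effect and traffic is simply forwarded as it arrives.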

When using the Internet for transport, you can’t control the treatment of your most important data once it leaves your network, but you can make sure at your Internet edge that the most important traffic gets sent out before any other traffic does. This is one of the main reasons why QoS is as important as it ever was, even with SD-WAN solutions that use the Internet for transport.

QoS scheduling is also important when data transmits from a higher-speed link to a lower-speed link. For example, a company’s data center will almost always have much higher WAN-facing bandwidth than a branch-office WAN link. QoS scheduling once again ensures that higher-priority traffic makes it to the branch WAN link first. For example, in MPLS L3VPN environments, the service provider can offer QoS capabilities as a service (usually for an extra fee). If your data center has a 1 Gbps pipe toward your MPLS WAN, but your branch office is on a 1.5 Mbps T1, subscribing to the service provider’s QoS service can ensure that when a large file is blasted out to the branch office, the VoIP traffic will still receive preferential treatment because it will be scheduled first as it leaves the service provider’s router on the other end of the T1.

Another aspect of QoS is policing and shaping. A service provider will often use policing to create “sub-rate” links. For example, the SP may provide for you a physical gigabit Ethernet link, but you may be only paying for 200 Mbps of service. The SP uses policing to turn the gigabit link into an effective 200 Mbps link by dropping any traffic that goes over the 200 Mbps mark. Policing is typically used on the ingress to a network. Conversely, shaping is typically used on the egress of a network. Shaping works by temporarily buffering excess traffic, and then transmitting it when possible, which helps to avoid dropping the traffic.
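The distinction shows up directly in configuration: the provider polices inbound so excess is dropped at the sub-rate, while the customer typically shapes outbound to the same rate so excess is buffered instead. A sketch for the 200 Mbps example above (rates and policy names are illustrative):

```
! Provider side: drop anything above the purchased 200 Mbps
policy-map POLICE-200M
 class class-default
  police 200000000 conform-action transmit exceed-action drop
!
! Customer side: buffer bursts and smooth them out to 200 Mbps,
! so drops are decided by our own queuing, not the provider policer
policy-map SHAPE-200M
 class class-default
  shape average 200000000
```

In practice, a child queuing policy is often nested under the shaped class so that priority traffic is scheduled first within the shaped rate.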

Policing can also be very useful within your private network to prevent a source of traffic from overwhelming a particular destination. For example, if you have a server in your data center that provides some kind of updates to the computers in your network (such as a WSUS server), you could use granular policing to keep it from overwhelming the slower branch-office links during regular business hours, while still offering the full available capacity after hours.

As important as QoS is, I find it pretty amazing that it is not covered at all in the current Cisco CCNP R&S curriculum. It is covered in Cisco’s Collaboration, Wireless, and Service Provider tracks, but the general R&S track does not mention QoS at all until the CCIE level (as of this writing).

Getting into QoS can seem very daunting at first. Like most technologies (or sub-technologies), there’s a new lexicon to learn, and not everything may seem obvious at first. When I first started exploring QoS as part of reading the CCIE OCG a couple years ago, it did seem a bit overwhelming, and I felt that even though I could follow along and understand what I was reading while I read it, when I was done I wasn’t really able to retain what I had just read because at that point in time, I’d never experienced it for myself. Working at my current job has changed that, fortunately.

Like so many things, witnessing it in action (especially in production) and repeated exposure to books and documentation has helped to solidify the major concepts of QoS for me. Experience is great, and it really solidifies the things you learn when you study. But I am still a firm believer that you need to obtain the knowledge first (at least in the general sense), and then build the experience afterward. If I had not taken it upon myself to move past the CCNP R&S curriculum and explore the content within the scope of the CCIE, there are several things I would not even know about, such as QoS, the service provider side of MPLS, and working with VRFs. These things represent tools in a toolbox, and knowing what tools you have to work with is the key to solving business problems and making you a success.

Update: I found out that QoS is indeed introduced now on the current CCNA exam. This is excellent news, and I would expect it to be covered somewhere on the next revision of the CCNP R&S.

The Data Center Move, Part 4

Part 1  |  Part 2  |  Part 3  | Part 4

Over the next couple of weeks, we continued to migrate more portions of the network and less-critical systems over to the new data center. One of the issues we experienced was temporary route instability due to accidentally advertising the same routes through multiple BGP ASNs because of redistribution.

The overall WAN and Internet design of our network is hub-and-spoke. We use an MPLS L3VPN service and peer with the PE routers with eBGP, so all of the spokes can talk directly to each other (which is useful for all of the inter-office VoIP traffic), but both the primary business data as well as Internet connectivity for the entire company all flow back to the hub data center.

Over time, for various reasons, we ended up with multiple MPLS routers at the old data center that face the rest of the WAN. All the MPLS routers use the same BGP ASN and speak iBGP with each other, and peer with our MPLS provider with eBGP. Even though all the MPLS routers have equal access to the WAN (differing bandwidths aside), different routing policies had been put into place for various reasons over the years. For instance, all of our SIP sessions went over just one of the routers. We advertised the company-wide default route from two of the routers, but not all of them. We advertised our data center private /16 network out of all the routers, but advertised more specific /24s of just a couple subnets on only a couple of the routers. Nearly all of these routing patterns were established before I got here. Some of them made absolutely no sense to me, so I had to question their history, which often pointed back to one-off fixes that were supposed to be temporary, but of course became permanent.

We’re primarily a Cisco shop as far as the routing and switching infrastructure goes, so we use EIGRP on the internal network. Both the new data center and old were connected together using the same EIGRP ASN. We perform mutual redistribution of BGP and EIGRP on all the MPLS routers, and use tagging to prevent route loops. However, at the new data center, we used a different private BGP ASN.
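The tagging scheme mentioned here generally looks something like the following on each redistribution point; the tag value, ASNs, and EIGRP metric are made-up examples for illustration, not our actual configuration:

```
! Tag everything that enters EIGRP from BGP
route-map BGP-TO-EIGRP permit 10
 set tag 100
!
! Refuse to send anything carrying that tag back into BGP
route-map EIGRP-TO-BGP deny 10
 match tag 100
route-map EIGRP-TO-BGP permit 20
!
router eigrp 1
 redistribute bgp 64512 metric 10000 100 255 1 1500 route-map BGP-TO-EIGRP
!
router bgp 64512
 redistribute eigrp 1 route-map EIGRP-TO-BGP
```

Note that this only prevents loops if every redistribution point sets and matches a consistent tag scheme; a tag that is set at one site but not matched at the other leaves a loop open.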

I knew that because of the two different BGP ASNs, I would have to be careful about outbound route advertisements appearing to the rest of the network from multiple sources. So I used BGP AS-path prepending to make some paths preferable to others, while still allowing for redundancy in case one of the routers went down. But, since all of the MPLS routers at both data centers were joined together by the same EIGRP ASN and they were configured to do mutual redistribution of BGP and EIGRP, it ended up causing a problem that I didn’t see at the time, but can now see very clearly in hindsight.
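AS-path prepending to make one exit less attractive typically looks like this; the ASN, neighbor address, and prefix list are illustrative assumptions:

```
ip prefix-list DC-ROUTES seq 5 permit 10.100.0.0/16 le 24
!
! Lengthen the AS path so this router's advertisement loses
! the BGP best-path comparison unless the preferred path is gone
route-map PREPEND-OUT permit 10
 match ip address prefix-list DC-ROUTES
 set as-path prepend 64513 64513 64513
route-map PREPEND-OUT permit 20
!
router bgp 64513
 neighbor 192.0.2.1 route-map PREPEND-OUT out
```

If the router advertising the unprepended path goes down, the prepended path is still present and simply becomes the best remaining one.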

The routing table seemed stable, and everything was flowing properly. Then a couple of our remote branches were having WAN troubles, and their circuits went down. When they came back up, they could no longer reach any of our data center subnets. Or more correctly, they could reach us, but we couldn’t reach them. I didn’t think that a routing loop would have occurred, because at all points of redistribution, I set and matched tags to prevent that from happening.

The part that I can now see clearly with the experience of hindsight is that when those branches went down, their routes were removed from the overall routing table. However, when they came back up and the routes were re-introduced, they were learned by BGP from our old data center, redistributed into EIGRP and passed onto the new data center, and redistributed back into BGP with a different ASN, which caused a loop. But only for routes that were removed from and re-introduced into the overall routing table.

Luckily, we were able to catch this very quickly and correct it with only a couple of small branches experiencing the issue. As I write this, we are still in the middle of migrating the WAN over, and the way we chose to deal with this is to not yet redistribute EIGRP back into BGP at the new data center, and instead just advertise the routes we desire directly through BGP. It’s true, we could have just used filtering, but we only need to do this temporarily and I thought this would be a much quicker and easier way to solve the problem.

Having this job has been very exciting so far, and I feel like I came onboard at just the right time. Since this is my first “real” enterprise-level networking job, I spent the first several months getting up to speed and seeing where all the things I read about and studied for fit into place. Now I’ve reached a point of being able to put the more advanced things I’ve studied into action, and gain real practical experience from it which will propel me forward in my career.

As I said, I am well aware of the fact that the majority of people who are early on in their network careers will not have had the opportunities that I’ve had in experiencing this data center move. That is why I made sure to make the most of it and write down some of the many things I experienced and learned. The static EtherChannel problem, in particular, was a real nightmare and I don’t think I will ever forget it. More importantly, I now know exactly what symptoms to look for and how to fix it if something like that ever happens again.

I feel like I have gained a lot of really important experience in a short amount of time. I’m very grateful for it, and I’m always continuing to study and look toward the future with what this career may bring me, and to help solve the many problems of business.

Part 1  |  Part 2  |  Part 3  | Part 4

The Data Center Move, Part 3

Part 1  |  Part 2  |  Part 3  | Part 4

After the servers were connected, we were getting complaints about intermittent connectivity. I would do a continuous ping against one of the servers, and it would randomly drop one or two packets out of every 20 or so, and then every 10 minutes it would completely lose connectivity for about 30 seconds. I logged into all of the switches at the new data center, turned on the terminal monitor, looked at the logs, and saw all of the switches reporting “MAC flapping”, and then disabling MAC learning on the affected VLANs for 3 minutes.

Because everything is redundantly connected with Spanning-Tree blocking the redundant links, the disabled MAC learning occurred on different switches at different times, which is why the pings would continue, but lose one or two while RSTP reconverged. Since the switches would block MAC learning for 3 minutes, I surmised that when the server would lose connectivity for 30 seconds, that was when the “rolling VLAN blackouts” had affected all of the switches at the same time due to the issue overlapping among all of them.

My boss and I still weren’t sure what was going on, because I configured the ports facing these systems on the new switches identically to the ports on the old switches, where we had no apparent issues. As a point of troubleshooting, we decided to cut the HSRP active gateway over to one of the new core switches, which we hadn’t done yet because our primary WAN connectivity was still at the old data center.

Two things ended up working against us here. The first was that because we were under pressure to get things working correctly ASAP (despite this happening on a Saturday night), I pasted the configuration to cut over all of the HSRP active gateways simultaneously (lesson learned: unless you’re dealing with hundreds of VLANs, don’t do this! Do it one at a time, or in small groups). The second was me not fully appreciating what disabled MAC learning really means with regard to HSRP interaction. If it had been happening only on the leaf switches, it would have been less severe, but because it was happening across the entire STP domain at the new data center, when I cut over the HSRP gateways, one of the new cores would be active until MAC learning was disabled for the particular VLAN.

This ended up causing an HSRP “split-brain” scenario between the core switches at the old data center and the new data center. It ended up causing enough of a problem that the Metro-E switch connecting the two data centers completely froze, which caused me to have to make the trip across town to the new data center and power-cycle the Metro-E switch (since the switch belongs to the service provider, not us, so we had no way to access it remotely). During that time, we put the HSRP active gateway back to the old core switch.

I had to wake up the iSeries admin and get him to show me his networking configuration for the servers. Finally, things started to make sense. Going into this, because the configuration of the old and new ports was identical, we could not have foreseen these particular issues, especially since things worked just fine on the 6500s. What his network config revealed, as I’m sure you’ve guessed by now, is that not only are all of the links bundled together on the server side, they’re in a static EtherChannel; LACP was not used.

This is what caused the Nexus switches to report MAC flapping and temporarily disable MAC learning. That means the issue we experienced was occurring from the very beginning. It’s just that the 6500s handled the issue by not reporting it to us. The 6500s basically “just dealt with it” as the MAC addresses would flap between the links. But the Nexus switches called us out on it, and made sure we did things right.

After I created the static EtherChannels on the Nexus switches, the problem of MAC flapping was still present, but was occurring less frequently. It was during this troubleshooting session (which took place at around 3am on Sunday morning, with me running on only a couple hours of sleep) that I discovered why you ALWAYS want to use LACP and not hardcode static EtherChannels. With LACP, all links have an ID, and if things don’t line up, the ports do not become active. Not so with a static EtherChannel. But, since I’m not the iSeries admin, I have no control over that side of things (if the devices are even capable of using LACP).
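On the switch side, the difference comes down to a single keyword in the channel-group command. A sketch with hypothetical interface numbers:

```
! Static EtherChannel: no negotiation at all. The bundle comes up as
! soon as the links are physically connected, even if the far end is
! miswired - exactly the failure mode described above.
interface range GigabitEthernet1/0/1 - 8
 channel-group 1 mode on

! LACP: ports exchange LACPDUs, and a link only joins the bundle when
! the partner's system and port IDs agree, so a miscabled link is
! kept out instead of silently flapping MACs.
interface range GigabitEthernet1/0/1 - 8
 channel-group 1 mode active
```

With LACP, `show lacp neighbor` would have revealed the mismatched cabling immediately instead of leaving it to trial and error.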

What I found out was that the network interfaces on the iSeries were labeled (and assembled into EtherChannel bundles) differently than you would expect, so the static EtherChannels were active on the Nexus side because the links were physically connected, but they were connected in the wrong order. I spent the next half hour moving cables around across the many different links, by trial and error, until I got the right combination of EtherChannel bundles. Finally, the MAC flapping stopped, no pings were dropped across several minutes, and I went home and slept like the dead.

Part 1  |  Part 2  |  Part 3  | Part 4

The Data Center Move, Part 2

Part 1  |  Part 2  |  Part 3  | Part 4

One of the nicest things about migrating the data center is we get a chance to start somewhat fresh and hopefully do things correctly (or rather, more “current” from a best practices perspective). As I’ve witnessed myself, network complexity builds up over time. Temporary “band-aid” fixes unintentionally become permanent solutions. The accumulation of all these one-off problems eventually leads to many complex interdependencies that may not reveal themselves until things start to unravel, as I experienced during the move.

The new data center network is based on a leaf/spine topology with Top of Rack switches connecting to a smaller, higher-speed core. The two core switches are connected together with a 2x40G EtherChannel, and the racks where the ToRs are present have two switches, each of which are redundantly connected back to both cores with 10G links.

The new switches are Cisco Nexus series, which I had not worked with until this point (another of the many "firsts"). After working with them for a while, and discovering which commands had changed or were implemented differently compared to Catalyst, I actually enjoy NX-OS and prefer the way it handles most things in the configuration. It seems more logical to me in most cases. Other firsts for me included working with 10 and 40 gigabit transceivers and optics, including 4x10G fiber optic breakout cables with MPO connectors, and fiber optic shelves.

Moving to the new data center was (and still is as I write this) a process of migration. While the bulk of the move happened on a single day, much of it has been in pieces both before and after the major portion. Planning for the entire operation has been ongoing for nearly a year. The major portion consisted of moving in the servers and SANs. However, the network had to be in place to support that connectivity within the new data center, as well as still provide connectivity to the old data center while we complete the migration.

A lot of planning went into the network portion, including having all of the switches pre-configured and the network cabling pre-labeled and, where possible, pre-wired and ready to plug into the devices for instant connectivity to the rest of the network, including the old data center. I created a spreadsheet containing all of the physical interconnections, as well as device ports, switch ports, and assigned VLANs, so that we would have a reference on move day and could trace every cable to every destination if we encountered connectivity issues. We had a support person from IBM on site with us all day during the move (for our iSeries systems), and at the end of the day he told me personally that he had been a part of many data center moves and had never seen one go as smoothly as this one did, with all the pieces falling into place. I wanted to ask him if he would put that in writing so I could put it on my resume 🙂

We decided the best way to have seamless connectivity between the old data center and the new one during the migration was with a Layer 2 Metro-E trunk from a local ISP. This let us access the new data center as if everything were still in the same room. We made the two new core switches HSRP neighbors with the old core switches, which let us shift the active gateway for the various VLANs when we were ready to do so. During different parts of the migration, we did have some traffic tromboning, but it was well within our bandwidth limits and didn’t cause any issues (including delay, since the new data center is only about 20 miles away).
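Conceptually, each side's handoff to the provider was just an 802.1Q trunk carrying the data center VLANs. A sketch; the port and VLAN numbers are hypothetical, not our actual configuration:

```
interface TenGigabitEthernet1/1
 description L2 Metro-E trunk to other data center
 switchport mode trunk
 switchport trunk allowed vlan 10,20,30
 ! HSRP hellos and STP BPDUs ride this trunk, which is what lets the
 ! old and new core switches be HSRP neighbors for each VLAN
```

Pruning the allowed VLAN list to only what actually needs to stretch between sites also limits how far any Layer 2 misbehavior can propagate.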

However, after the majority of the systems had been moved and were in place, we did encounter a networking issue that ended up causing me to have to run across town in the middle of the night to the new data center and power-cycle a switch.

Our newest iSeries systems were placed into production during the past few months in preparation for our move. Just as newer hypervisors consolidate many physical servers, the new iSeries servers do the same thing: a single system that used to occupy multiple racks now fits in an eighth of a single rack.

However, instead of a few 10G links, these new servers went with the old model of many 1G links (16 per server, in fact). When these servers were placed into production, they connected directly to our Catalyst 6500 core, with eight links per server going to each switch. When I configured the network to support these new servers, I was asked to make each link an 802.1q trunk, and was not asked about bonding the links together.

Unfortunately, I did not think to ask about link bonding, which ended up causing my boss and me to spend a few hours tracking down the newly-discovered issue when we migrated the servers to the new Nexus switches. This ended up being an extremely interesting problem, one that I will never forget.

Part 1  |  Part 2  |  Part 3  | Part 4

The Data Center Move, Part 1

My posts are sometimes (nearly always) kind of long. I thought I’d try something different and break this one up into multiple parts.

Part 1  |  Part 2  |  Part 3  | Part 4

Sometimes in life, the best experience comes from being in the right place at the right time. I studied enterprise networking for years while being employed in the SMB space. My short time with the Flagler County school system was my first enterprise-level job, but my role at my present company as network engineer has really been the start of my journey of “real” enterprise-level experience. In such a comparatively short period of time of being employed here, I have gained immense experience that I know will serve me well for the rest of my career.

My company recently migrated our 15-year-old in-house data center to a private cage in a colocation facility. Many people would view the thought of migration with dread. Many people that I work with did in fact dread it. However, I saw it from the very beginning as a somewhat unique opportunity to raise my awareness of various issues and learn as much as I could from the process. How often does a company move its entire data center? Indeed, many people working for a single company for many years may never experience something like this.

Technology, as always, marches on. Fifteen years ago, it made much more sense to run your own data center, if you had enough equipment to fill it. Our private in-house data center had, at one point in time, somewhere between 50 and 100 racks full of equipment, and a separate telecom room where all of the copper connections terminated. The data center was filled with many physical servers, primarily Linux machines and proprietary IBM midrange systems, as well as very large (and very heavy!) multi-rack SANs.

The telecom room also served as the demarc for various WAN technologies over the years, including T1s, DS3s, straight fiber, and at one point (before migrating to MPLS), the central hub for our former frame-relay hub-and-spoke WAN (this was well before my time here). We still have all of the original SONET muxing equipment in the wiring room, and many shelves for the T1 cards, with hundreds of pairs of wires running between racks. That in itself is something someone fairly recent to enterprise-level networking might not run into very often anymore.

Fifteen years ago, 32TB of storage would take hundreds of drives spanning multiple racks of equipment. In 2016, you can get 32TB across just four drives or fewer. Soon we'll be laughing about how it used to take multiple drives at all to get 32TB of storage. Thanks to server virtualization, hundreds of former physical boxes can now be placed into a blade server chassis that takes the space of less than five of the former physical servers. Those former physical servers all required multiple individual connections back to the network, which usually required chassis-based switches containing hundreds of ports. Today, you just run a couple of 10G or higher connections into a hypervisor to be accessed by many VMs simultaneously.

All of this technology consolidation means that at a certain point, it no longer makes sense to run your own data center. It becomes a giant lot of wasted space and money. With consolidation, you might end up using less electricity, but you still have the infrastructure in place, including huge cooling systems, redundant power, backup diesel generators, etc., all of which cost money to maintain, whether or not they are actually used. It is at this point where relocating the data center to a colocation facility makes the most business sense. When you can reduce your data center footprint down to a tenth of its former size due to advances in technology, why continue paying for the infrastructure that you no longer need?

From the networking perspective, this entire move process has had many “firsts” for me with regard to getting firsthand experience. Some of what I have experienced are things that I read about years ago when studying for the CCNP. I remember at the time I was studying wondering if I would ever actually encounter some of these situations and be able to actually use the knowledge I was gaining. This move has made that knowledge come to life.

Part 1  |  Part 2  |  Part 3  | Part 4

Anki, My New Love

This post was also featured on PacketPushers.net

Until now, I was never one to use flashcards. I could not see their value, and I was too lazy to actually write things down on a paper flashcard (and my handwriting is horrible).

I recently discovered a program called Anki. On the surface, it is just a flash card program, but underneath, it can be as simple or as complex as you desire. The first couple of days that I used Anki, I was still in this mindset that flashcards are not for me, and they hold no value with how I am used to learning.


What makes Anki so great (in addition to being free for every platform except Apple iOS) is the way it works. Active recall and spaced repetition are what make it such a powerful program. As mentioned in the link, active recall is the process of actively producing the answer to a question, as opposed to just passively studying (such as reading or watching training videos). Spaced repetition is the practice of spreading out reviews of material in gradually longer increments, the idea being that you'll remember things for a longer period of time this way.

Anki is based on another program called SuperMemo. I first heard of SuperMemo a few years ago after reading this blog post by Petr Lapukhov (one of many people I consider a Rock Star in the world of computer networking). A lot of research went into the development of SuperMemo (and consequently, Anki), and Anki attempts to solve some of the perceived shortcomings of SuperMemo.

After using it for nearly two weeks, I am already experiencing the benefit of learning using this method. I am retaining details from the flashcards I created that I know I would have forgotten (because I’ve learned and forgotten these things in the past, probably more than once!).

The flashcards are arranged into decks, and decks can contain other decks. The cards themselves can contain pretty much any content you can think of, including audio and video. Cards can also contain tags, which I’ve found to be extremely useful.

For example, even though I am studying the overall topic of “CCIE Routing & Switching”, I have multiple sub-decks, with each deck representing a source of information (such as a particular book, or a particular video series). Yet I can relate the different decks together with the use of tags. For example, I could study on the EIGRP tag across all the sub-decks.

One of the most useful things I have learned about creating flashcards is to not put too much on a single card. I found it better to break things up as much as possible. This helps with faster recall, and since you’re not actually using paper, it doesn’t matter how many cards you create.

For the first couple of days, I had a few cards that contained too much information, and I kept getting the answers wrong. After I broke each complicated card into multiple simpler cards, I was able to retain the information better with each successive pass.

What led me to create more complicated cards at first was knowing that, for example, if I'm studying for the CCIE, it's an advanced test with expert-level questions. I thought I would be doing a disservice to myself by making the flashcards too easy. Luckily, I quickly realized that this is the wrong approach. The reason for using the flashcards is to retain little pieces of information, whose aggregate can then be applied to something more complex.

When making easier cards, I try to contain only a single piece of information in the answer whenever possible. When it’s not possible, I try to formulate the question so that it indicates the number of components in the answer. I also modified the default flashcard format to display the associated tags I have given the flashcard, which can act as a hint if the question seems too ambiguous.

The style of flashcard will depend on what you’re trying to learn. For example, if you’re learning a foreign language, you may place the foreign word on the front, and the native word on the back (or vice versa). For me, I found taking simple facts and re-phrasing them as simple questions to be the most effective. I find the question “What IP protocol does EIGRP use?” more engaging than simply “EIGRP IP Protocol” or something similar. IP Protocol 88 is the answer, by the way.

At first, I was worried about the questions being too easy. This is a simple question, and duh, the answer is obvious! But, the answer is always obvious as you are writing the question. A few days or a week later, the answer may not be so obvious. This was what I discovered after using the program for about two weeks. I remember writing the question, and I remember the answer being something very easy…but I couldn’t remember what the answer was.

Enter spaced repetitions.

After you have created the flashcards, reviewing is just like a real flashcard; you look at the front, and recall what is on the back. What makes Anki work so well is that upon revealing the back of the card, you have to decide how difficult or easy recalling the answer was. This is where you need to be truly honest with yourself to get the most out of the software.

Depending on what you click (Again, Hard, Good, Easy), the card will be shown to you again at the appropriate time in the future. For example, if the answer came to you instantly, you would click Easy. If the answer comes to you instantly again when you see the card the next time, clicking Easy again will increase the time Anki waits before showing you the card again. The Again, Hard, Good, Easy values are not static, and depend on multiple factors that change with each repetition of the flashcard.

Getting into the routine of reviewing the flashcards once every day is important to retaining the knowledge. By default, Anki will introduce 20 new flashcards to you every day per deck. This value (like just about everything else in Anki) can be adjusted. The cards can be sequential (default) or randomized (which is what I set it to). If you make your flashcards simple enough, 20 may be a very good value for you. If you have a deck of 200 cards, it will take 10 days for all of the cards to be revealed to you.

However, in addition to the 20 new cards, each day will contain previous cards depending on how you rated them. If you rated a card as “Hard” yesterday, you’ll probably see it repeated today. This is what I have found to be so useful over the past two weeks.

I may have marked several cards as “Again”, which will show you the card again during the day’s study session. After a couple days of marking a card as “Again”, I might have marked the same card as “Hard”, and after a few more repetitions, the card becomes “Good”, and hopefully eventually “Easy”. I haven’t been using the program long enough yet for some of my cards to make that complete progression, but I can see it getting there, which is exciting! Yet another great thing about Anki is that it keeps statistics with regards to your learning, and you can view your progress on nice pretty graphs.

Because cards that are marked “Easy” get displayed less, you waste less time studying those cards because you’ve retained that information, so you can study other cards that are more important. Before using Anki, that was a very bad habit I found myself falling into frequently: studying things I already knew, because it’s easier.

Anki supports sharing the flashcard decks you create. This may be useful if you want to import somebody else’s work, but personally, I found much more value in creating my own flashcards with my own questions and answers because it forces me to examine the individual piece of information and then figure out how to formulate it into an answerable question (which is not always as easy as you might think it is).

When you’re studying for something complicated, such as a certification, it may contain many details that are important to know, but difficult to retain because you don’t frequently need that information. Going back to the EIGRP example, you need to know what the default K-values are for some Cisco certification exams, but in a production network, it is rare to actually need to know that exact detail, and it is even more rare for those values get changed. However, through the power of spaced repetitions, it is a piece of information that you can hold on to.

And who knows? Outside of a certification exam, maybe one day you’ll run into a situation where that particular bit of information really is helpful, and that is when knowledge and experience will combine to give you the solution you need.

On a personal note, it may sound silly considering I am 36 years old as I write this, but during this past year I really feel like I am finally learning how to learn. I feel like I am discovering things that I should have been taught in high school or college. I would certainly have had an easier time with the more difficult subjects if I knew then what I know now.

Bringing an Old Mac Pro Back to Life with ESXi 6.0

It’s been quite a while since I’ve done a purely technical post.

The original Mac Pro is a 64-bit workstation-class computer that was designed with the unfortunate limitation of a 32-bit EFI. The two models this post discusses are the original 2006 Mac Pro 1,1 and the 2007 Mac Pro 2,1 revision. Both systems are architecturally similar, but the 2006 model features two dual-core CPUs, while the 2007 model has two quad-core CPUs, both based on the server versions of Intel Core 2 chips. I have the 2007 version, which has two Intel Xeon X5365 CPUs for a total of eight cores.

Apple stopped releasing OS X updates for this computer in 2011, with 10.7 Lion being the final supported version. There are workarounds to get newer versions of OS X to run, and a similar concept is used to get newer versions of ESXi to run. On a side note, running newer versions of OS X on these old Mac Pros works pretty well, as long as you have the necessary hardware upgrades, including a newer video card and potentially newer wi-fi/bluetooth cards.

Like older versions of OS X, older versions of ESXi booted and installed without issue on the old Mac Pros. But at some point, ESXi stopped being supported on these models, due to newer systems using EFI64 while these older systems are stuck at EFI32. However, even though it is nearly 10 years old, the 2007 Mac Pro has eight Xeon CPU cores (the two quad-core CPUs combined have roughly the computational power of a single Sandy Bridge-era Core i7), can house 32 GB of RAM, holds four hard drives (six if you don’t care about the drives being seated properly), and has four full-length PCI-e slots and two built-in Gigabit Ethernet ports.

This computer is more than worthy for lab use, and could definitely serve other functions (such as a home media server or NAS). Additionally, when running ESXi, you do not need to have a video card installed, which frees up an extra PCI-e slot.

To get ESXi 6.0 (I used Update 2) to run on the old Mac Pro, you need the 32-bit booter files from an older version of ESXi. The process involves creating an installation of ESXi 6.0 and then replacing the files included in the link on the new installation.

To do this, I installed ESXi 6.0 Update 2 into a VM on a newer Mac running VMware Fusion, and using the physical disk from the Mac Pro. The physical disk may be attached to the newer Mac using any attachment method (USB, etc). I have a Thunderbolt SATA dock that I used. VMware Fusion does not let you attach a physical disk to a VM from within the GUI, but it can be done.

After creating the VM, attaching the physical disk, and booting from the ESXi ISO image, I installed ESXi, choosing to completely erase and use the entire physical disk. After installation, you may wish to do as I did and boot up the VM before replacing the EFI files, so that you can set up the management network. By setting this up in advance, you can run your Mac Pro headless and just manage it from the network.

After you have installed ESXi in the VM onto the physical disk (and optionally set up the management network options), shut down the VM, but leave the physical disk attached. Go to the Terminal, type “diskutil list” without quotes, and look for the partition that says “EFI ESXi”. Make a note of the identifier (it was disk4s1 in my case). Enter “diskutil mount /dev/disk4s1” or whatever yours may be.

Use the files included in the ZIP to replace:


Then unmount the physical disk with “diskutil unmountdisk /dev/disk4” (changing 4 to your actual disk; don’t specify the individual partition). Then connect the disk to your Mac Pro, power it on, and have fun.

By having ESXi installed on a Mac Pro, you are able to install OS X virtual machines without requiring the VMware Unlocker workaround. Additionally, with four PCI-e slots, you could add things like Fibre Channel HBAs, multi-port NICs, USB 3.0 cards, etc.

The downside to using a Mac Pro 1,1 or 2,1 today, though, is its power usage and heat output. This is due to two primary factors: the CPUs and the RAM. Both are considered horribly inefficient and power hungry by today’s standards (but what do you expect with 10-year old technology?). The two CPUs each have a TDP of 150W. Nearly all of the Intel Xeon CPUs produced today (even the most expensive ones) run much cooler than this. The other culprit is the DDR2 FB-DIMM RAM.

To provide some perspective, I plugged in my handy Kill-a-Watt to see what kind of power was being used. I thought the bulky X1900 XT video card that came with the system would be a large part of the equation, but that turned out not to be true. With the video card, 32 GB of RAM (8x4GB), and a single SSD, the system consumes about 270W idle! Take out the video card, and it idles at 250W. Take out 24 GB of memory (leaving two 4GB sticks installed), and the power drops to 170W. That means the FB-DIMMs alone consume roughly 100W altogether. I calculated that where I live, it would cost about $1 a day in electricity to keep it running 24/7.

For perspective, my main server, which houses two quad-core Nehalem Xeons (which are about 7 years old as I write this), 48 GB of RAM (6x8GB DDR3 DIMMs), and 12 hard drives, uses a total idle power of 250W. A typical modern desktop PC probably uses less than 100W.

Another potential disadvantage is that the Mac Pro 1,1 and 2,1 have PCI-e version 1.1 slots, which are limited to 2.5 GT/s per lane (roughly 250 MB/s of usable bandwidth per lane). This may or may not be an issue, depending on the application, but don’t expect to be running any new 32Gb FC cards with it.

Possibly the most serious disadvantage, especially with regard to lab usage, is that the CPUs in these Macs, while they do support Intel VT-x, do not support EPT, which was introduced in Intel’s next microarchitecture, Nehalem. EPT (Extended Page Tables), also known as SLAT (Second Level Address Translation), is what allows for nested hypervisors. This means you can’t run Cisco VIRL on these model Mac Pros.

So for me, reviving the old Mac Pro is good for lab purposes, and I turn it off when I’m not using it to save electricity. It seems more fitting to me to use the technology in this way, rather than for it to simply become a boat anchor, though it would certainly work well in that application, as the steel case is quite heavy!

Experiences with Cisco VIRL Part 2: INE’s CCIE RSv5 Topology on VIRL

This blog entry was also featured on PacketPushers.net


VIRL topology + INE RSv5 ATC configs

After getting VIRL set up and tweaked to my particular environment, my next step is to set up INE’s CCIE RSv5 topology, as this is what I will be using VIRL for the most, initially.

I was satisfied with using IOL, but I decided to give VIRL a try because it not only has the latest versions of IOS included, it has many other features that IOL in itself isn’t going to give you. For example, VIRL includes visualization and automatic configuration options, as well as other features like NX-OSv. I was particularly interested in NX-OSv since I have also been branching out into datacenter technologies lately, and my company will be migrating a portion of our network to the Nexus platform next year. At this point in time, NX-OSv is still quite limited, and doesn’t include many of the fancier features of the Nexus platform such as vPC, but it is still a good starting point to familiarize yourself with the NX-OS environment and how its basic operation compares to traditional Cisco IOS. Likewise, I intend to study service provider technologies, and it is nice to have XRv.

I configured the INE ATC topology of 10 IOSv routers connected to a single unmanaged switch node. I then added four IOSv-L2 nodes, with SW1 connecting to the unmanaged switch node, and the remaining three L2 nodes interconnected to each other according to the INE diagram. The interface numbering scheme had to change, though: Fa0/23 – 24 became Gi1/0 – 1, Fa0/19 – 20 became Gi2/0 – 1, and Fa0/21 – 22 became Gi3/0 – 1.

I built this topology and used it as the baseline as I was testing and tweaking the VIRL VM, as described in Part 1. I was familiar with how the topology behaved in IOL, as well as with using CSR1000Vs and actual Catalyst 3560s, and that was my initial comparison. After getting things to an acceptable performance level (e.g. ready to be used for studying), I realized I needed a way to get the INE initial configurations into the routers, and I would prefer to not have to copy and paste the configs for each device for each lab every time I wanted to reload or change labs.

One of the issues I experienced with VIRL is that nothing is saved when nodes are powered off. If you stop the simulation completely, the next time you start it, everything is rebuilt from scratch. If you stop the node itself, and then restart it, all configurations and files are lost. There is a snapshot system built in to the web interface, but it is not very intuitive at this point in time. Likewise, you have the option of extracting the current running configurations when the nodes are stopped, but this does not include anything saved on the virtual flash disks. Some people prefer having a separate VIRL topology file for each separate configuration, but I find it to be more practical (and faster) to use the configure replace option within the existing topology to load the configurations.

Luckily, the filesystem on the VIRL host VM is not going to change between simulations, and all of the nodes have a built-in method of communicating with the host. This makes it an ideal place to store the configuration files. I went through and modified the initial configurations to match the connections in my VIRL topology. You can download the VIRL topology and matching INE configurations I assembled here. For the routers, this meant replacing every instance of GigabitEthernet1 with GigabitEthernet0/1. The switch configs were a little more involved and required manual editing, but there are not nearly as many switch configurations as there are router configurations. After getting the configuration files in order, I used SCP to copy the tar files to the VIRL VM using its external-facing (LAN) IP address. I placed the files into /home/virl/.

Originally, I added an L2-External-Flat node to match every router and switch in the topology so that each node could communicate with the VIRL host VM. However, someone pointed out to me that there was a much easier way to do this: click the background of the topology (in design mode), select the “Properties” pane, then change the “Management Network” setting to “Shared flat network” under the “Topology” leaf. This sets the GigabitEthernet0/0 interfaces to receive an IP address via DHCP from the management range by default. This setting only applied to the router nodes when I tried it, so I still had to manually edit the configurations of the IOSv-L2 switch nodes.

For the four IOSv-L2 switch node instances, I used this configuration:

interface GigabitEthernet0/0
 no switchport
 ip address dhcp
 negotiation auto

It is very important to note that when you convert the port to a routed port on IOSv-L2, you need to remove the media-type rj45 command. This does not need to be done on the IOSv routers, though. The IOSv nodes were configured as:

interface GigabitEthernet0/0
 ip address dhcp
 duplex auto
 speed auto
 media-type rj45

My original intention was to modify the initial startup configurations of the nodes to automatically copy the files to their virtual flash drives via TFTP upon each boot, but the issue I ran into was that the interfaces remain shut down until the configuration is completely applied. So even though I was able to put the necessary commands into the config (prefixed with “do”), they wouldn’t work from there because the interfaces were still down. However, placing other EXEC-mode commands at the end of the configurations (before the end line), such as do term len 0, may save you some extra steps when you start labbing.
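For example, the tail end of a node’s bootstrap configuration might look like this (a sketch only; which EXEC commands you pre-seed is up to you):

```
! ... rest of the node's bootstrap configuration above ...
! EXEC-mode commands prefixed with "do" still run during the initial
! config load, as long as they don't depend on a live interface.
do terminal length 0
do terminal width 512
end
```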

Originally, I was planning to just use SCP to copy the files from the VIRL VM host to the nodes, but there is no way to specify the password within the command – the password prompt is always separate, so it is unusable as part of the configuration.

This led me to configure the VIRL VM as a TFTP server to access the configuration files. I modified some of the information as detailed on this site and performed these steps on the VIRL VM:

sudo su
apt-get update && apt-get install -y tftpd-hpa
vi /etc/default/tftpd-hpa

Modify the file as follows:

TFTP_OPTIONS="--secure --create"
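For context, TFTP_OPTIONS is the only line that needs to change; the rest of the file is the package defaults. The complete file ends up looking something like this sketch (the directory is wherever you want the server to read and write files):

```shell
# /etc/default/tftpd-hpa -- a typical complete file after the edit
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/var/lib/tftpboot"   # or wherever your tar files live
TFTP_ADDRESS="0.0.0.0:69"
TFTP_OPTIONS="--secure --create"     # --create permits uploading new files
```

The --secure flag chroots the server into TFTP_DIRECTORY, which is why paths in the copy commands are relative to that directory.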

And finally, restart the TFTP service to make the changes take effect:

service tftpd-hpa restart

However, I set up the TFTP server before discovering that placing the copy commands in the startup config was useless. Setting up the VIRL VM as a TFTP server is therefore an optional step; I decided to stick with it because I’m used to using TFTP.

At this point, the configuration files are in place on the VIRL host VM, and after booting the nodes in the topology, there is connectivity between the nodes and the VIRL host to copy the files.

If you perform a dir flash0: from any of the nodes, you will see that there is quite a bit of free space on that filesystem. However, when I attempted to copy files to it, I found that it does not actually let you use that space; it would not let me copy all of the configuration files. Thankfully, the flash3: filesystem does.

Assuming you placed the tar files directly in /home/virl on the VM, the following commands will copy the configuration files to your node:

Using SCP:

archive tar /xtract scp://virl@ flash3:

The default password is all uppercase: VIRL

Using TFTP:

archive tar /xtract tftp:// flash3:

Replace “R1” with the actual device you’re copying the files to. With the default settings, the VIRL host VM will always be
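Those tar archives have to exist before you can extract them, of course. Here is one way to build them on the VIRL VM; the node names and the assumption that each device’s .cfg files sit in a directory of their own (R1/, R2/, and so on) are mine, so adjust to match your layout:

```shell
# Run from the directory holding the per-device config directories
# (e.g., /home/virl). Produces R1.tar, R2.tar, ... alongside them.
for node in R1 R2 R3 R4; do
  [ -d "$node" ] || continue     # skip any node without a config directory
  (cd "$node" && tar -cf "../${node}.tar" ./*.cfg)
done
```

Each archive then extracts on the node with a single archive tar /xtract command, which keeps the per-device copy step down to one line.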

After all the files are copied and extracted to your devices, you can use the Terminal Multiplexer inside the VM Maestro interface to issue a command such as this to all of the devices simultaneously:

configure replace flash3:basic.eigrp.routing.cfg force

So far, I have not had great luck with the IOSv-L2 instances. They were released as part of VIRL fairly recently and have been improving with time. However, for studying for the CCIE R&S at this point in time, I will probably stick with the four 3560s in my home lab and bridge them to the IOSv router nodes.

VIRL is a pretty complex software package with lots of individual components. I’ve only had the software for a few days, so I haven’t had time yet to do a really deep dive, and there are probably even better ways to do some of the things I’ve described here. I wish it could be as fast as IOL, but by comparison, that really is the only major disadvantage of using VIRL instead of IOL. There are so many other features that do make the software worth it, though, in my opinion.

In years past, CCIE R&S candidates were known to spend thousands of dollars on equipment for a home lab. We are lucky enough today that computing power has caught up to the point where that is no longer the case. If you’re in the same boat that I am currently in, where your employer is not paying for training, then VIRL is a pretty good investment in your career. But of course, like anything else, you’ll only get out of it what you put into it. It’s not some kind of magic pill that will instantly make you a Cisco god, but it definitely has awesome potential for the realm of studying and quick proof-of-concept testing.

Experiences with Cisco VIRL Part 1: Comparing and Tweaking VIRL

This blog entry was also featured on PacketPushers.net


Since it has been out for more than a year, and has been developed and improved tremendously during that time, I decided to finally take the plunge and buy a year’s subscription to the Cisco VIRL software. Until now, I have been using any combination of real hardware, CSR1000Vs, and IOL instances for studying and proof of concept testing.

My first impression of VIRL is that it is a BEAST of a VM with regards to CPU and RAM consumption. I installed it on my 16GB MacBook Pro first, and allocated 8GB to it. However, its use was very limited as I was unable to load more than a few nodes. I then moved it to my ESXi server, which is definitely more appropriate for this software in its current state.

I knew that the CSR1000Vs were fairly RAM hungry, but at the same time they are meant to be production routers, so that’s definitely a fair tradeoff for good performance. The IOSv nodes, while they do take up substantially less RAM, are still surprisingly resource intensive, especially with regards to CPU usage. I thought the IOSv nodes were going to be very similar to IOL nodes with regards to resource usage, but unfortunately, that is not yet the case.

I can run several tens of instances of IOL nodes on my MacBook Pro, and have all of them up and running in less than a minute, all in a VM with only 4GB of RAM. That is certainly not the case with IOSv. Even after getting the VIRL VM on ESXi tweaked, it still takes about two minutes for the IOSv instances to come up. Reloading (or doing a configure replace) on IOL takes seconds, whereas IOSv still takes about a minute or more. I know that in the grand scheme of things, a couple of minutes isn’t a big deal, especially if you compare it to reloading an actual physical router or switch, but it was still very surprising to me to see just how much of a performance and resource usage gap there is between IOL and IOSv.

Using all default settings, my experience of running VIRL on ESXi (after going through the lengthy install process) was better than on the MBP, but still not as good as I thought it should have been. The ESXi server I installed VIRL on has two Xeon E5520 CPUs, which are Nehalem chips that are each quad core with eight threads. The system also has 48GB of RAM. I have a few other VMs running that collectively use very little CPU during normal usage, and about 24 GB of RAM, leaving 24 GB for VIRL. I allocated 20GB to VIRL, and placed the VM on an SSD.

The largest share of CPU usage comes from booting the IOSv instances (and perhaps the other node types as well). Upon every boot, a crypto process runs and the IOS image is verified, pegging the CPU at 100% until it completes. I believe this contributes the most to the amount of time an IOSv node takes to finish booting. Newer-generation CPUs may improve this quite a bit.

When I first started, I assigned four cores to the VIRL VM. The IOSv instances would take 5-10 minutes to boot. Performing a configure replace took a minimum of five minutes. That was definitely unacceptable, especially when compared to the mere seconds of time it takes for IOL to do the same thing. I performed a few web searches and found some different things to try.

The first thing I did was increase the core count to eight. Since my server only has eight actual cores, I was a little hesitant to do this because of the other VMs I am running, but here is a case where I think HyperThreading may make a difference, since ESXi sees 16 logical cores. After setting the VM to eight cores, I noticed quite a big difference, and my other VMs did not appear to suffer from it. I then read another tweak about assigning proper affinity to the VM. Originally, the VM was presented with eight single-core CPUs. I then tried allocating it as a single eight-core CPU. The performance increased a little bit. I then allocated it properly as two quad-core CPUs (matching reality), and this was where I saw the biggest performance increase with regards to both boot time and overall responsiveness.

My ESXi server has eight cores running at 2.27 GHz each, and VMware sees an aggregate of 18.13 GHz. So, another tweak I performed was to set the VM CPU limit to 16 GHz, so that it could no longer take over the entire server. I also configured the memory so that it could not overcommit. It will not use more than the 20GB I have allocated to it. In the near future, I intend to upgrade my server from 48GB to 96GB, so that I can allocate 64GB to VIRL (it is going to be necessary when I start studying service provider topologies using XRv).

I should clarify and say that it still doesn’t run as well as I think it should, but it is definitely better after tweaking these settings. The Intel Xeon E5520 CPUs that are running in my server were released in the first quarter of 2009. That was seven years ago, as of this writing. A LOT of improvements have been baked into Xeon CPUs since that time, so I have no doubt that much of the slowness I experienced would be alleviated with newer-generation CPUs.

I read a comment that said passing the CCIE lab was easier than getting VIRL set up on ESXi. I assure you, that is not the case. The VIRL team has great documentation on the initial ESXi setup, and with regards to that, it worked as it should have without anything extra from their instructions. However, as this post demonstrates, extra tweaks are needed to tune VIRL to your system. It is not a point-and-click install, but you don’t need to study for hundreds of hours to pass the installation, either.

VIRL is quite complex and has a lot of different components. Complex software is expected to need tuning for your environment, as there is no way for the developers to plan a turnkey solution for every environment in advance. Judging by past comments from others, VIRL has improved quite dramatically in the past year, and I expect it will continue to do so, most likely with both increased performance and easier deployment.

Part 2 covers setting up the INE CCIE RSv5 topology.

Hey, Wait…I Thought You Started Blogging in 2012?

It’s true, I did start this blog in October 2012. In June 2018, I made the decision to prune all of my entries before December 2015. I spent a couple of hours reading over the majority of these entries and realized they are no longer relevant to my life and current career trajectory.

When I started this blog, I was just entering into the vast world of network engineering. I was not yet working in an environment that could take advantage of the new skills I was developing. That wouldn’t come until about three years later. Originally, my intention was to have a personal record of my career development and progression. My blog still serves this purpose for me in some ways.

A minor intention in blogging was to attempt to give myself a much-needed boost in self-esteem, especially since I had no peers to communicate these things with for so many years. I felt that by writing content for a community that I was just getting to know, it would serve as a form of self-validation, since I was unable to obtain it from my work at the time.

This is the part that has dramatically changed, especially since I passed the CCIE R&S written exam and made an attempt at the lab exam. Whereas before I would project personal confidence without necessarily feeling it within myself, I now feel true confidence in the things I say and do as a professional, based both on the skills I’ve developed and the professional experience I have established.

I’ve come to this realization within the past month or so, and it occurred to me that this is the point I had hoped to reach someday when I started blogging nearly six years ago. My oldest posts are filled with artificial confidence. My recent posts have a much different feel, and this is what I wish to project publicly moving forward. Out with the old, in with the new, such is the world of technology. 🙂