Managed SD-WAN: An Enterprise Perspective

In this post I detail my experiences as a network engineer deploying SD-WAN from the point of view of an enterprise customer of a Managed Services Provider (MSP). My company is dispersed across approximately 400 sites with multiple datacenters. I am fortunate enough to be in a position where my technical expertise and operational experience play a large role in helping shape the direction of network connectivity across the entire company. I also consider myself extremely lucky to have a great relationship with my direct management which extends all the way up to the CIO.

This post was written to support a recent episode of The Network Collective podcast I participated in along with Jason Gintert.

I am a big fan of SD-WAN. The concepts and building blocks of SD-WAN have been around for more than a decade, but it has only been within the past few years that SD-WAN as both a concept and a platform has reached a stage of maturity suitable for the mass market. Some vendors’ platforms are further along in development and maturity than others, but the general consensus is that SD-WAN is not just a fad: it is here to stay. SD-WAN takes a lot of traditional routing and WAN concepts and makes them easier to consume, which ultimately improves ongoing operations and can lead to business cost reductions.

Will SD-WAN meet your needs?

When evaluating if SD-WAN is right for your organization, there are several different factors to consider. Some of the questions you need to evaluate are:

  • Will I need to evaluate new kinds of connectivity?
  • Will my existing equipment support an upgrade to SD-WAN?
  • What kind of cost model do I need to look at?
  • What kinds of operational considerations are there?
  • Should I do everything myself or have a third party do everything? Is there a middle ground?

Originally, I would have pondered a question such as “Will my current and future applications support SD-WAN?” but ultimately, the answer to this question comes down to application requirements and underlying connectivity. SD-WAN isn’t necessarily a magic bullet in itself if you have very strict application performance requirements. When most people think of SD-WAN, they think of a traditional setup with one or two broadband connections, and possibly 4G/LTE for a backup link. However, that doesn’t mean you can’t pair your SD-WAN deployment with high-speed low-latency links when it is necessary. This is what makes SD-WAN marvelous: it is transport-agnostic, and automatically chooses the current best-performing link based on the defined application performance policy.
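To make the idea of policy-based link selection a little more concrete, here is a minimal sketch in Python. It is not any vendor's algorithm: the class names, the thresholds, the link measurements, and the fallback rule are all hypothetical, and real platforms measure and weigh these values in their own ways.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkStats:
    name: str
    latency_ms: float
    jitter_ms: float
    loss_pct: float


@dataclass
class AppPolicy:
    # Hypothetical thresholds an application class must stay under.
    max_latency_ms: float
    max_jitter_ms: float
    max_loss_pct: float


def meets_policy(link: LinkStats, policy: AppPolicy) -> bool:
    """True when every measured value is within the policy thresholds."""
    return (
        link.latency_ms <= policy.max_latency_ms
        and link.jitter_ms <= policy.max_jitter_ms
        and link.loss_pct <= policy.max_loss_pct
    )


def choose_link(links: List[LinkStats], policy: AppPolicy) -> Optional[LinkStats]:
    """Return the policy-compliant link with the lowest latency.

    If no link is compliant, fall back to the link with the lowest loss
    instead of dropping traffic (an assumption made for this sketch).
    """
    if not links:
        return None
    compliant = [link for link in links if meets_policy(link, policy)]
    if compliant:
        return min(compliant, key=lambda link: link.latency_ms)
    return min(links, key=lambda link: link.loss_pct)


# Made-up measurements for three transports at one site.
voice_policy = AppPolicy(max_latency_ms=150, max_jitter_ms=30, max_loss_pct=1.0)
links = [
    LinkStats("cable", latency_ms=25, jitter_ms=45, loss_pct=0.1),
    LinkStats("dsl", latency_ms=40, jitter_ms=8, loss_pct=0.2),
    LinkStats("lte", latency_ms=90, jitter_ms=20, loss_pct=0.5),
]
best = choose_link(links, voice_policy)
print(best.name)  # prints "dsl": cable has the lowest latency but exceeds the jitter limit
```

The point of the example is that the "best" link depends on the application: the cable circuit would win for bulk transfers, but it fails the voice policy because of its jitter.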

On the other side of high application performance requirements is datacenter interconnect (DCI). It is often the case that inter-datacenter traffic requires both low latency and high throughput. SD-WAN platforms can be used for DCI, but doing so requires much more expensive and powerful hardware. As of this writing, it is still more cost-effective to use traditional DCI solutions, though SD-WAN could be used for an emergency backup connection between datacenters. We kept our private DCI solutions in place when we migrated to SD-WAN, though SD-WAN becomes an acceptable backup path if all other connectivity is lost.

For most businesses, VoIP is the application with the strictest application performance requirements. VoIP isn’t necessarily very bandwidth-intensive but is very sensitive to latency and jitter. SD-WAN aims to alleviate the symptoms of both by automatically choosing the best path per-packet when multiple paths are available, and depending on the platform, de-jitter buffers may also be present to smooth things out. WebEx and other types of video conferencing add to the performance requirements by necessitating more bandwidth as compared to just VoIP.
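For readers who want to see what "jitter" means numerically, the sketch below computes the running interarrival jitter estimate defined for RTP in RFC 3550 (each new sample moves the estimate by one sixteenth of the difference). The function name and the sample timestamps are made up for illustration, and the values are in milliseconds rather than RTP timestamp units. An SD-WAN platform may well measure jitter differently.

```python
from typing import Sequence


def interarrival_jitter(send_ms: Sequence[float], recv_ms: Sequence[float]) -> float:
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    For each consecutive pair of packets, D is the change in transit time:
    (difference in arrival times) minus (difference in send times).
    The estimate is then smoothed: J = J + (|D| - J) / 16.
    """
    if len(send_ms) != len(recv_ms):
        raise ValueError("send and receive lists must be the same length")
    jitter = 0.0
    for i in range(1, len(send_ms)):
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Packets sent every 20 ms; the third one arrives 5 ms late.
sent = [0.0, 20.0, 40.0]
received = [50.0, 70.0, 95.0]
print(interarrival_jitter(sent, received))  # prints 0.3125
```

Note that a constant delay (50 ms in this example) contributes nothing to jitter; only the variation between packets does, which is why a link can have high latency but still carry voice acceptably, or the reverse.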

Are SLAs and QoS more important than bandwidth?

My company has had SLA-backed QoS-enabled private MPLS L3VPN connectivity as our WAN for many years. Management was initially very hesitant to move from guaranteed private connections (nearly universally 1.5 Mbps T1s) to public broadband. The concerns were not so much from a security perspective (IPsec solves that issue), but more from a performance standpoint due to a lack of QoS. For several sites, we did a trial of switching them from a T1 to a single broadband connection using a firewall for site-to-site VPN access into our corporate network. This was before SD-WAN was on the roadmap and was more for evaluating the feasibility of using faster but cheaper broadband connections, even though they lack QoS.

Over time, the broadband circuits proved themselves to be faster than T1s (of course), but more importantly, they proved to be no less reliable or performant with regard to VoIP (our biggest concern, due to lack of QoS). In fact, as time went on, whenever we had an issue with any broadband circuits, we frequently saw them rectified much more quickly than our SLA-backed T1s. We are always left with the impression that the restoration of downed T1s will happen whenever the LEC happens to get to it, with no urgency whatsoever (escalations are meaningless). It is also a running joke with us that LECs do not seem to work on Fridays, at least from a customer perspective. Maybe one day I will be on the other side of the demarc and see what is really going on. The increase in general performance and relative reliability of the broadband circuits compared to a T1 is what ultimately set the stage for us to begin considering SD-WAN.

Upgrading to SD-WAN

Your existing equipment may support a software upgrade to enable SD-WAN capabilities. This could be your path into SD-WAN while potentially saving some money in the process. All of our equipment was far beyond end-of-life, which gave us the opportunity to evaluate multiple platforms to decide what was best for us according to our current and projected needs. Some SD-WAN solutions are delivered purely in software and can run on a local hypervisor. Most are hardware-based (though in reality they are in fact just pre-packaged software running on a pre-packaged x86 hypervisor).

Most companies do not like the thought of having to buy things up front, especially technology. The prospect of having to buy new hardware simultaneously to support 400 locations is not enticing (hence the reason all of our existing equipment was so old). This is part of what steered us in the direction of using an MSP for our SD-WAN, since the hardware is provided as part of the service. While there are vendors and partners from which you can lease equipment or otherwise arrange for an OPEX model of spending so you can do everything yourself, we went with an MSP for other reasons as well.

Choosing an MSP

When we put out our Request for Proposal (RFP), we already had in mind what SD-WAN platform we were interested in and what features we were looking for based on an initial multi-month Proof of Concept deployment with 15 of our sites. We also knew what we were looking for operationally, based on negative experiences with our previous service provider. After the RFP deadline, we narrowed our decision to two different providers, both of whom we had a relationship established with already.

We had a great working relationship on the networking side with one of the MSPs. They have really great engineers working for them and are always very responsive to our issues, in terms of both response time and quickly reaching someone with the appropriate skill level to handle the particular issue (some of which involved the very complex MPLS routing we required). After dealing with our previous WAN provider (where it’s next to impossible to get someone qualified to work the issue, let alone get a timely response), working with them was a breath of fresh air and they would definitely have been my first choice from a technical perspective.

However, we ended up choosing the other MSP, with whom we also had an established relationship, for a few different reasons. This MSP already handled our company-wide 4G/LTE connectivity needs and was able to provide LTE backup for nearly all of our 400 locations as part of the monthly fee for the SD-WAN service, including a pool of wireless data to be shared across all sites (definitely a huge positive!). However, the operational aspects were what ultimately made us choose this MSP.

Consider more than just the bottom line

With more experience you begin to consider the business itself with every aspect of your thinking. Being able to reach a knowledgeable support engineer in a timely manner is very important to me as a network engineer who supports the company’s WAN. However, deployment, ongoing operations, and ultimately money are the things that must take precedence. Money can always be measured in more ways than simply the raw dollar amount for a product or service, particularly when you take operations (and the cost of potential downtime) into consideration.

One of the biggest factors that swayed us toward the MSP we chose was that they would go to each location to install the SD-WAN equipment and connect it to our existing network. If you have only a few locations, this is not a big deal. But when you have 400 sites and your company’s entire technical support exists only at the HQ office, this is absolutely paramount. A handful of technical people are not going to have a good time trying to walk non-technical people through replacing old routers with new ones over the phone and making sure everything is connected properly, especially when site connectivity is not 100% uniform. Like most enterprise networks, we have exceptions here and there from a standard template. Having someone technical go and install the equipment and then just call us when there are exceptions they can’t handle (why did that particular site have a different VLAN numbering scheme and how did we miss that?!) saved us from a huge potential burden.

Site connectivity

Managing connectivity for 400 locations takes a significant investment in time and human resources. My company has a very small team to manage all aspects of connectivity (both data and voice) across the entire enterprise. Like many companies, we are positioned with the requirement to do more with less. Managing all of the circuits (including LTE) for this many locations requires a dedicated full-time position, which we do not have. For us, it is more cost-effective to have an MSP handle all of those details. They acquire and provision the circuits for each of our sites, provide the SD-WAN hardware, manage the SD-WAN service, and install it for us. If we have a problem with any of the circuits, we put in a ticket through their portal, and then they deal with the underlying carrier directly.

By phrasing it like that, I make it sound like we just push a button and then we get instant SD-WAN at our locations. As you would expect, it’s not quite that simple. We still have to evaluate each location individually for connectivity. You can get just about any kind of connectivity you want – if you’re willing to pay for it. My company has many rural locations where it is not cost-effective to get the kind of connectivity we would like to have. We also have to take the number of users at each location (as well as projected growth) into consideration when choosing services. This is a benefit of using broadband and Ethernet DIA circuits in that it is generally easier to upgrade the bandwidth if your location’s requirements change. With a T1, you’re out of luck unless you bond multiple links together, which can get expensive very quickly. One of the advantages of SD-WAN is that you can usually even get better reliability out of just a single link due to the previously-mentioned de-jitter buffers as well as Forward Error Correction (FEC).
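As a rough illustration of how FEC can hide loss on a single link, the sketch below uses the simplest possible scheme: one XOR parity packet per group of equal-length data packets, which lets the receiver rebuild any one lost packet in the group. This is only a teaching example. The function names and packet contents are made up, and I am not claiming that any particular SD-WAN platform implements FEC this way.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    if len(a) != len(b):
        raise ValueError("packets must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """Build one parity packet: the XOR of every data packet in the group."""
    return reduce(xor_bytes, packets)


def recover_missing(received: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single lost packet (marked as None) from the parity packet.

    XOR-ing the parity with every packet that did arrive cancels them out
    and leaves only the missing one. This scheme cannot repair two or more
    losses in the same group.
    """
    missing = [i for i, pkt in enumerate(received) if pkt is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can recover exactly one lost packet")
    present = [pkt for pkt in received if pkt is not None]
    return reduce(xor_bytes, present, parity)


group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
parity = make_parity(group)

# Simulate losing the third packet in transit.
arrived = [b"pkt0", b"pkt1", None, b"pkt3"]
print(recover_missing(arrived, parity))  # prints b'pkt2'
```

The trade-off is extra bandwidth: with one parity packet per four data packets, this example adds 25% overhead in exchange for not having to retransmit.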

We’ve also been surprised by the lack of connectivity options for some of our sites located in more populous areas. In one city, we have a large site and a smaller site only half a mile away from each other on the same main (busy) road. We wanted to get cable broadband services. It was no problem for the large site, but the cable company wanted us to pay over $20,000 in construction costs to bring service to the nearby smaller site. We couldn’t justify that cost for the small site, so we sought alternative connectivity.

Likewise, we’ve been surprised by the extreme variations in 4G/LTE connectivity. We have sites in the middle of nowhere that receive great connectivity, and sites where you would expect good signal but where we have to install special external antennas for even a hope of a connection. In those situations, the LTE is considered a “better than nothing” connection if the main circuit goes down. Typically, VoIP will not work when the LTE is that bad due to extreme latency. Hopefully 5G will alleviate some of these situations in a few years with its more complex signal coding schemes and ability to use antenna arrays.

Design and deployment

We purchased a completely managed service, and in theory the MSP is supposed to handle all operational aspects. In reality, you still need someone on the customer side who is knowledgeable in all aspects of the service, including operations and design. We have some very specific network design requirements, and even though doing everything ourselves would give us the most control, we coordinated with the MSP for every aspect of the overall WAN design (including resisting various “upsells” that we felt were unnecessary), and we retained the ability to make configuration changes as we see fit.

A third party may not understand why your network is designed the way it is and may not appreciate some of the decisions made in the past that affect your operations today. Having this level of control was important to us for these reasons. Not all MSPs will offer you this level of control over the managed service – you need to be certain you will have this access (if needed) before you sign the final contract. This level of control represents the middle ground between doing it all yourself versus the MSP doing everything.

Likewise, the employees of an MSP are human and naturally make mistakes. While the MSP seeded the original configurations based on our designs and information that we provided, we found little mistakes that had to be corrected during deployment. On their side, we frequently found incorrect IP addresses assigned to the sites, and certain configuration options we had asked for missing (which I was able to correct myself). On our side, we’ve run into occasional sites that had different equipment than what we thought, or VLANs configured differently than all our other sites for some reason (luckily this was extremely rare).

Another operational aspect that steered us toward this particular MSP is their ticketing and service monitoring platform. When a location’s primary circuit is down for more than five minutes, a ticket is placed automatically, and the MSP investigates and deals with the underlying carrier if it is necessary. But even this won’t work 100% of the time due to more complicated underlying issues. This is another reason why it is still important for your organization to have knowledgeable staff even as your platforms become more automated and easier to use.
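The five-minute rule is simple enough to sketch. The snippet below is my own illustration of the logic, not the MSP's actual monitoring code; the function name, its arguments, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold taken from the text: a ticket is opened once the primary
# circuit has been down for more than five minutes.
OUTAGE_THRESHOLD = timedelta(minutes=5)


def should_open_ticket(
    down_since: Optional[datetime],
    now: datetime,
    ticket_already_open: bool,
) -> bool:
    """Decide whether an outage has lasted long enough to warrant a ticket.

    down_since is None while the circuit is up. Brief flaps that recover
    within the threshold never generate a ticket, and an outage that
    already has a ticket does not generate a second one.
    """
    if down_since is None or ticket_already_open:
        return False
    return now - down_since > OUTAGE_THRESHOLD


went_down = datetime(2018, 6, 1, 9, 0, 0)
print(should_open_ticket(went_down, datetime(2018, 6, 1, 9, 3, 0), False))  # prints False
print(should_open_ticket(went_down, datetime(2018, 6, 1, 9, 6, 0), False))  # prints True
```

A check like this only tells you that a circuit is hard down. A link that stays up but performs badly sails straight past it, which is exactly the kind of case where knowledgeable staff still matter.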

Ongoing operations

Automation and ease of use are great because they make the technology more accessible to a larger number of people. For many companies, SD-WAN may even be their first foray into automation and orchestration since it is generally built into the SD-WAN platform itself. Making profile-based changes that affect multiple sites simultaneously definitely beats the old model of individual device-by-device configuration. Sure, you can write your own scripts to handle certain things, but when those operations are built right into the platform, they become more accessible.
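If you have never worked with profile-based configuration, the general idea looks something like the sketch below: one shared profile, plus a small set of per-site overrides for the exceptions. The setting names, site IDs, and values are invented for illustration and do not reflect any particular platform's data model.

```python
import copy
from typing import Any, Dict

# Hypothetical shared profile applied to every branch site.
BRANCH_PROFILE: Dict[str, Any] = {
    "data_vlan": 10,
    "voice_vlan": 20,
    "dns_servers": ["10.0.0.53"],
}

# Hypothetical exceptions: sites that deviate from the standard template.
SITE_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "site-0042": {"voice_vlan": 120},
}


def render_site_config(
    site_id: str,
    profile: Dict[str, Any],
    overrides: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Start from a copy of the shared profile, then apply site exceptions."""
    config = copy.deepcopy(profile)
    config.update(overrides.get(site_id, {}))
    return config


print(render_site_config("site-0001", BRANCH_PROFILE, SITE_OVERRIDES)["voice_vlan"])  # prints 20
print(render_site_config("site-0042", BRANCH_PROFILE, SITE_OVERRIDES)["voice_vlan"])  # prints 120

# One change to the profile reaches every site the next time configs are rendered.
BRANCH_PROFILE["dns_servers"] = ["10.0.0.53", "10.0.1.53"]
print(render_site_config("site-0042", BRANCH_PROFILE, SITE_OVERRIDES)["dns_servers"])
```

Keeping the exceptions in one explicit place is also what makes the oddball sites visible, instead of discovering them during an install.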

Similarly, SD-WAN makes network monitoring more accessible. For example, in the past, my team would have been exclusively responsible for diagnosing WAN issues. If a T1 was being flaky, it would be up to me to log into the router and check for something like path code violations. With SD-WAN, now our help desk can log into the platform and easily see if a site’s link is not performing how it should. By doing this, they can quickly determine if a site is having an obvious WAN connectivity issue and let the location know.

But what happens when there’s a more complicated networking issue?

Here is one such issue that I tracked down myself. A link was performing very poorly, and I put in a ticket with our MSP. This was for an Ethernet DIA circuit. They reached out to the carrier, and as is typical, the carrier simply said the issue was not on their side without passing along any information on how they reached that conclusion (I wish I could get away with that!). I did some further investigation myself and discovered that our SD-WAN appliance was set for the standard auto-negotiation, but the carrier had hardcoded their side for 100/full. This is something that should have been taken care of by the MSP during the install but was missed somehow. If I had not caught this, the resolution would have taken a lot longer, even though the MSP is technically supposed to handle everything, because the MSP’s solution was to send a technician out the next day.
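For anyone who has not run into this before, the reason that combination performs so badly is that a port left on auto-negotiation can detect the speed of a hardcoded peer but not its duplex, so it falls back to half duplex while the other side runs full. The small model below captures that behavior. It is a simplification I wrote for illustration (it ignores speed entirely and assumes both ports support full duplex), and the function names are my own.

```python
def resulting_duplex(local: str, peer: str) -> str:
    """Return the duplex the local port ends up using.

    Each side is configured as "auto", "full", or "half".
    """
    valid = {"auto", "full", "half"}
    if local not in valid or peer not in valid:
        raise ValueError("settings must be 'auto', 'full', or 'half'")
    if local != "auto":
        return local      # hardcoded ports use exactly what they were given
    if peer == "auto":
        return "full"     # both sides negotiate (assumes full duplex support)
    return "half"         # peer is hardcoded and not negotiating: fall back to half


def has_duplex_mismatch(side_a: str, side_b: str) -> bool:
    """True when the two ends of the link settle on different duplex modes."""
    return resulting_duplex(side_a, side_b) != resulting_duplex(side_b, side_a)


# Our appliance on auto, carrier hardcoded to full duplex.
print(has_duplex_mismatch("auto", "full"))  # prints True
print(has_duplex_mismatch("auto", "auto"))  # prints False
print(has_duplex_mismatch("full", "full"))  # prints False
```

The fix is to make both ends agree: either both on auto-negotiation or both hardcoded to the same setting.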

Other considerations

Like just about any set of technologies, SD-WAN can be as simple or as complex as you need it to be. There are new security implications to consider, especially when evaluating options to improve performance for Internet-based Software as a Service (SaaS) applications by using local Internet breakout instead of backhauling all of your traffic to a central datacenter. You also need to change your mindset when it comes to policy application and traffic control. As Jason Gintert mentioned in The Network Collective podcast, you have to realize that with SD-WAN, your traffic policies now live in the overlay, not the underlay. Your SD-WAN platform may be able to take advantage of features in the underlay such as QoS marking, but ultimately those policies are defined in your overlay configurations and it is up to the SD-WAN platform to carry out those policies using whatever methods are available.

SD-WAN is going to benefit the vast majority of organizations that currently have a traditional WAN. Before diving in, make sure all of your requirements are fully defined and prioritized since different platforms support different features (and at different price points). Likewise, determine if a DIY approach, fully-managed, or some middle ground is appropriate for you. There are other considerations beyond just cost, such as the feasibility of ongoing operations (including circuit management). Can you even get the kind of connectivity you desire at your locations? Will your staff have a difficult time adjusting to new concepts and configuration paradigms? By understanding your current environment, the issues you currently face, and having an outlook into future needs, you will be in a better position to evaluate how SD-WAN can positively impact your organization. For my company, the improvement in overall WAN reliability was our primary driver along with updating old equipment. Improved bandwidth and built-in orchestration capabilities were included as a bonus.