The Self-Healing Powers of ML and AI


The idea of a self-healing network in a distributed enterprise – where problems are resolved without the need for human intervention – has been around for a long time. It typically involves a self-healing network core, cloud or datacenter that identifies choke points automatically and routes traffic around them proactively. While such efforts improve core resiliency, they do little to benefit the branch, store, home or mobile location where end users access services.

For a variety of reasons, including commercial constraints, branches and similar locations rely on edge devices that are single points of failure, and those devices can fail regardless of how resilient the network core is. There have been attempts to address this constraint at the branch: resetting circuits automatically, prompting a failover switch to backup circuits, or applying an active-active design to SD-WAN so that it uses multiple circuits simultaneously.

Yet a single point of failure always remained at the branch: the WAN Edge customer premises equipment, or CPE (the system’s routers, SD-WAN devices and firewalls). Failure at the WAN Edge brings all branch services down and impacts revenue-generating activity. In addition, the time to restore or replace equipment can range from 30 minutes to 2 hours if the problem can be corrected remotely, and from 4 to 24 hours if it requires on-site service.

When Engineers Ask “What if?”

Everyone at Hughes loves a challenge. After all, we are largely an enterprise of engineers. (Even people in our marketing department have engineering degrees!) So, we have been puzzling over this problem for a while and exploring how we can apply Machine Learning (ML) and Artificial Intelligence (AI) to identify when the WAN Edge is exhibiting symptoms correlated with a high likelihood of service failure.

If risks could be identified through deep learning models and AI could take action to pre-emptively mitigate potential failures, then we would arrive at a truly powerful self-healing solution, one that could benefit not only our customers’ enterprise networks but also their operations. We also saw the need to keep tracking metrics for the offending WAN Edge after a corrective action to avoid potential relapse. Averting branch-level ‘disaster’ scenarios with a high degree of success through self-healing actions would be an industry first!

Over the past 7 months, we have been pioneering this self-healing WAN Edge service capability, part of our broader efforts in AI Operations (AIOps), for our customers across 32,000 managed sites in North America.

As Dan Rasmussen, senior vice president for the Enterprise Division, said, “We estimate we’ve seen a 70% success rate for autonomous correction across the sites under our management. Those successes have saved approximately 1,750 hours of network downtime in just these first 7 months. In the other 30% of cases, the system provided early diagnoses of potential hardware failure or chronic site issues so they could be addressed by our team members.”

Hughes applied these self-healing efforts first at WAN Edge systems because a failure in those systems can be catastrophic for a site and cost hours of operational downtime. The service’s “immune system” capabilities are driven by our ML models, which continuously absorb and contextualize proprietary network data (from nearly 250,000 sites under Hughes management). This type of unsupervised learning builds a baseline for network performance, both for each individual network and for peer networks with similar attributes. Deviations from the baseline can then be detected, and the risk-reward of potential corrective actions assessed. Once appropriate measures are taken, performance is tracked to ensure a return to steady state. Our big data domain expertise enables us to conduct extensive analysis and experimentation to ensure that the risk-reward equations maximize benefits while managing business constraints and trade-offs.
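To make the baseline-and-deviation idea concrete, here is a minimal Python sketch. It is an illustration only, not the Hughes models: the metric values, the median/MAD baseline, the threshold of 5, and every function name are assumptions chosen for simplicity. It flags a reading only when it deviates from both the site's own history and its peer group, then applies a toy risk-reward check that compares expected downtime avoided with the disruption caused by the corrective action.

```python
from statistics import median


def robust_baseline(samples):
    """Return (median, MAD) as a simple baseline for one metric.

    Median and median absolute deviation are stand-ins (an assumption)
    for whatever a production model would learn.
    """
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    return med, mad


def deviation_score(value, baseline):
    """How many MADs the value sits away from the baseline median."""
    med, mad = baseline
    spread = mad if mad > 0 else 1e-9  # guard against zero spread
    return abs(value - med) / spread


def is_anomalous(value, site_baseline, peer_baseline, threshold=5.0):
    """Flag only readings that deviate from BOTH the site and its peers."""
    return (deviation_score(value, site_baseline) > threshold
            and deviation_score(value, peer_baseline) > threshold)


def should_remediate(p_failure, outage_minutes, action_minutes):
    """Toy risk-reward rule: act when expected downtime avoided
    exceeds the disruption caused by the corrective action itself."""
    return p_failure * outage_minutes > action_minutes


# Hypothetical memory-utilisation readings (percent) for one WAN Edge device
site_history = [20, 21, 19, 22, 20, 21, 20]
# Hypothetical readings from peer sites with similar attributes
peer_history = [18, 25, 22, 30, 20, 24, 21]

site_bl = robust_baseline(site_history)
peer_bl = robust_baseline(peer_history)

reading = 35
if is_anomalous(reading, site_bl, peer_bl) and should_remediate(
        p_failure=0.6, outage_minutes=240, action_minutes=5):
    print("schedule corrective action and keep tracking the metric")
```

Requiring agreement between the site baseline and the peer baseline is one possible design choice; it trades some sensitivity for fewer false alarms at sites that are simply unusual.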

Consistent with our vendor-agnostic approach, we’ve engineered this self-healing capability to be extensible and API-driven to allow new WAN Edge platforms to be added rapidly to the service – in many cases, in as little as one week. With its extensible design, Hughes plans to address Local Area Network (LAN) services next, with the autonomous remediation of such devices as switches, wireless controllers, access points, and even digital signage screens.
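One common way to achieve this kind of extensibility is an adapter interface with a registry, so that supporting a new platform means writing one new class. The Python sketch below is purely hypothetical: the interface, the method names, the action strings, and the "example-router" platform are all invented for illustration and do not describe the actual Hughes API.

```python
from abc import ABC, abstractmethod


class WanEdgeAdapter(ABC):
    """Hypothetical contract each supported platform would implement."""

    @abstractmethod
    def read_metrics(self, site_id: str) -> dict:
        """Return current health metrics for the device at a site."""

    @abstractmethod
    def remediate(self, site_id: str, action: str) -> bool:
        """Attempt a corrective action; return True if it was accepted."""


ADAPTERS = {}  # platform name -> adapter class


def register(platform):
    """Class decorator that adds an adapter to the registry."""
    def wrap(cls):
        ADAPTERS[platform] = cls
        return cls
    return wrap


@register("example-router")
class ExampleRouterAdapter(WanEdgeAdapter):
    """Stub adapter; a real one would call the vendor's management API."""

    def read_metrics(self, site_id):
        return {"cpu_pct": 42.0}  # placeholder value

    def remediate(self, site_id, action):
        return action in {"restart_service", "reboot"}


def adapter_for(platform):
    """Look up and instantiate the adapter for a platform."""
    return ADAPTERS[platform]()


adapter = adapter_for("example-router")
print(adapter.read_metrics("site-1"), adapter.remediate("site-1", "reboot"))
```

With this shape, the detection and decision logic never needs to know which vendor's device it is talking to.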

Hughes is excited to be the first managed services provider to usher in this new era in self-healing networks made possible by our autonomous AIOps capabilities and to offer customers a true path to network resilience.


About the Authors


Seejo Sebastine leads the technology strategy and execution team that drives Hughes’ leadership position as a Global Managed Service Provider. He was instrumental in the growth of the SD-WAN portfolio at Hughes.