Building Resilient, Highly Available Disaggregated Routing Networks

The evolution of telecom and ISP infrastructure is being driven by disaggregation—separating software from hardware to unlock flexibility, scalability, and cost efficiency. But with this shift comes a critical question: how do operators ensure carrier-grade resiliency, high availability, and redundancy in a distributed, software-defined environment?

Modern disaggregated routing platforms, such as those developed by RtBrick, demonstrate that these attributes are not only achievable in a disaggregated network, but that such a network can actually surpass the resilience of traditional monolithic systems.

Smaller ‘Blast Radius’ by Scaling Down as well as Up

Disaggregated switches can scale down more cost-effectively than traditional chassis-based systems. This allows for a more distributed architecture, spreading customer connections across a greater number of lower-cost open switches, so that the failure of any one switch affects fewer subscribers. The switches can even be temperature hardened to allow deployments outside of traditional central offices, if required.

Robust Hardened Routing Protocols

Not everything about disaggregation is new. Disaggregated networks can still use time-hardened protocols such as BGP, MPLS and IS-IS to deliver resiliency and recovery. These protocols have evolved over decades, and their behaviour is well understood and well proven.

Broadband Subscriber Stateful High-Availability

Broadband services are usually delivered using either PPPoE or IPoE protocols.

PPPoE is inherently resilient to outages. It creates a logical session between the CPE and the BNG. If a session drops, this is detected using PPP keepalives and subscribers are automatically reconnected via any available path, potentially to a different BNG.

DHCP/IPoE has no inherent session/keepalive mechanism at Layer 3 and the client may not realise connectivity has been lost. Failure detection depends on ARP timeouts or DHCP lease expiry. This leads to slower inherent recovery times.

RtBrick has solved this problem by developing Stateful Redundancy for IPoE subscribers. A ‘primary’ BNG prepopulates the forwarding state of a redundant BNG. The redundant BNG can be located anywhere, as long as it has MPLS connectivity to the primary BNG. The redundant BNG can also support active subscribers, as long as it has sufficient spare capacity to handle both the active and the failover subscribers.

The two BNGs are connected using LAG (Link Aggregation Group). If the LAG fails, the primary BNG becomes inactive for the subscriber group. The redundant (standby) BNG detects the failure and starts performing subscriber services, providing fast recovery from the disruption. Recovery typically takes 2-3 seconds, which is fast enough that the subscriber doesn’t notice a break in service. This Stateful Redundancy for IPoE protects against failure of the BNG, the access network and the DSLAM/OLT connecting the subscribers.
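The failover behaviour described above can be sketched as a small state model. This is an illustrative simulation only, not RtBrick's implementation: the `Bng` class, its fields, and the `on_lag_failure` function are hypothetical names, and the subscriber IDs and addresses are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Bng:
    """Hypothetical BNG model holding per-subscriber forwarding state."""
    name: str
    active: bool = False
    sessions: dict = field(default_factory=dict)  # subscriber id -> IP address

    def add_subscriber(self, sub_id: str, ip: str, standby: "Bng") -> None:
        # The active BNG records the session and mirrors it to the standby
        # straight away, so the standby's state is populated in advance.
        self.sessions[sub_id] = ip
        standby.sessions[sub_id] = ip


def on_lag_failure(primary: Bng, standby: Bng) -> None:
    # The primary goes inactive for the subscriber group and the standby
    # takes over using the state it already holds; nothing is re-learned.
    primary.active = False
    standby.active = True


primary = Bng("bng-1", active=True)
standby = Bng("bng-2")
primary.add_subscriber("sub-001", "10.0.0.11", standby)
primary.add_subscriber("sub-002", "10.0.0.12", standby)

on_lag_failure(primary, standby)
print(standby.active, sorted(standby.sessions))
```

The point of the sketch is that the standby already holds every session before the failure occurs, which is why takeover does not have to wait for ARP timeouts or DHCP lease expiry.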

The Stateful Redundancy feature can be seen in action in this technical demonstration video.

Routing FIB Recovery Times

In the event of a complete router failure, a router may need to re-learn and repopulate its FIB (Forwarding Information Base) on the silicon inside the switch. This could be as large as the entire Internet routing table, of around one million routes. Using open switches, RtBrick can re-learn and repopulate the FIBs faster than any other router available. Proprietary routing systems can usually achieve this in around one minute. RtBrick’s software can learn the Internet routing table and program it into off-the-shelf open hardware in 46 seconds, with further performance increases expected over time.
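To put these figures in perspective, the implied programming rates can be worked out directly. The short calculation below uses only the numbers quoted above (about one million routes, 46 seconds, and "around one minute" taken as 60 seconds); the rates are averages derived from those figures, not measured values.

```python
ROUTES = 1_000_000      # approximate Internet routing table size, from the text
RTBRICK_SECONDS = 46    # recovery time quoted for RtBrick on open hardware
TYPICAL_SECONDS = 60    # "around one minute", taken here as 60 seconds


def routes_per_second(routes: int, seconds: float) -> float:
    """Average number of routes learned and programmed per second."""
    return routes / seconds


rtbrick_rate = routes_per_second(ROUTES, RTBRICK_SECONDS)
typical_rate = routes_per_second(ROUTES, TYPICAL_SECONDS)
saving = TYPICAL_SECONDS - RTBRICK_SECONDS

print(f"{rtbrick_rate:,.0f} routes/s vs {typical_rate:,.0f} routes/s")
print(f"{saving} seconds faster ({saving / TYPICAL_SECONDS:.0%} less time)")
```

On these assumptions the difference is roughly 21,700 routes per second against roughly 16,700, or 14 seconds saved on a full-table reload.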

Granular Monitoring and Telemetry for Alarms

In addition to robust recovery mechanisms, it is of course preferable to detect network issues before they become critical. This requires the operator to have visibility and insight into as much real-time network data as possible. Unlike traditional routing operating systems, RtBrick’s routing software exposes every metric that can be measured by the NOS (Network Operating System), through a single API and from a single database.

It uses a built-in Prometheus time-series database, which can be exposed through a Grafana interface to flag whichever alarms and thresholds the operator chooses.

This can include any parameter, from the number of packets sent across a specific port to the speed of the fans running in the switch, or the temperature of the CPU.
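As a rough illustration of threshold-based alarming, the sketch below checks a set of metric samples against operator-chosen limits. It is a generic example rather than RtBrick's API: the metric names, sample values, and limits are all hypothetical, and in practice this logic would normally live in Prometheus alerting rules or Grafana alerts rather than in custom code.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    metric: str    # hypothetical metric name
    limit: float   # an alarm is raised when the sample exceeds this value


def evaluate(samples: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alarm message for every sample above its threshold."""
    alarms = []
    for t in thresholds:
        value = samples.get(t.metric)
        # Metrics with no sample are skipped rather than treated as alarms.
        if value is not None and value > t.limit:
            alarms.append(f"{t.metric}={value} exceeds {t.limit}")
    return alarms


samples = {
    "cpu_temperature_celsius": 82.0,
    "fan_speed_rpm": 9000.0,
    "port_tx_packets": 1.2e9,
}
thresholds = [
    Threshold("cpu_temperature_celsius", 75.0),
    Threshold("fan_speed_rpm", 12000.0),
]

alarms = evaluate(samples, thresholds)
for alarm in alarms:
    print(alarm)
```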

Internet Peering Security

ISPs face sophisticated attacks on their infrastructure, including at Internet peering points, where they need robust security tools. RtBrick offers several key technologies to protect Internet peering routers, including BGP RPKI, TCP-AO for BGP and LDP, BGP Flowspec, sFlow and GTSM.

BGP Flowspec: Protects networks from DDoS (Distributed Denial of Service) attacks.
Resource Public Key Infrastructure (RPKI): Allows network owners to validate and secure critical route updates, that is, Border Gateway Protocol (BGP) announcements, and so prevent route hijacking or misconfiguration.
TCP Authentication Option (TCP-AO): Enhances the security and authenticity of TCP segments exchanged during BGP and LDP sessions. It adds support for the latest security mechanisms and is stronger than legacy mechanisms such as TCP MD5.
sFlow, or "sampled flow": Samples packets from routers and sends them to a central collector for analysis, to identify abnormal traffic patterns and potential attacks.
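Generalized TTL Security Mechanism (GTSM): Prevents a remote intruder from hijacking a route by accepting protocol packets only when their TTL shows they came from a directly connected neighbour. Discarding all other packets early also protects the router from CPU-utilization based attacks.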
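To make the RPKI entry more concrete, the sketch below shows the general route origin validation logic defined in RFC 6811, which classifies an announcement as valid, invalid, or not-found. It is a simplified illustration and not RtBrick's implementation: the `Roa` type and `validate` function are hypothetical names, and the prefixes and AS numbers come from the ranges reserved for documentation.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    prefix: str       # prefix the holder has authorised
    max_length: int   # longest prefix length the origin may announce
    asn: int          # authorised origin AS


def validate(prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Compare versions first: subnet_of() raises on mixed IPv4/IPv6.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: treat as invalid.
    return "invalid" if covered else "not-found"


roas = [Roa("192.0.2.0/24", 24, 64500)]

print(validate("192.0.2.0/24", 64500, roas))     # authorised origin
print(validate("192.0.2.0/24", 64501, roas))     # wrong origin AS
print(validate("192.0.2.0/25", 64500, roas))     # more specific than allowed
print(validate("198.51.100.0/24", 64500, roas))  # no covering ROA
```

A router would typically reject or de-prefer announcements that come back invalid, which is how origin validation limits hijacks and accidental leaks.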

Microservices-Based Fault Isolation

Unlike traditional routers, RtBrick’s software uses a microservices architecture, where:

Each protocol or service runs independently
Failures are isolated to specific components
Individual services can restart without impacting the entire system

This makes it far simpler, and faster, for RtBrick to test, isolate and fix any bugs in the routing software, as the microservices can be treated independently of one another, rather than the whole routing stack having to be treated as a single ‘black box’, with millions of lines of interdependent code to investigate and test.
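The isolation principle can be illustrated with a toy supervisor that restarts only the service that has failed. This is a conceptual sketch, not RtBrick's architecture: the `Service` and `Supervisor` classes and the service names are hypothetical, and a real system would supervise separate processes rather than Python objects.

```python
class Service:
    """Hypothetical stand-in for one independently running routing service."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True
        self.restarts = 0

    def crash(self) -> None:
        self.running = False

    def restart(self) -> None:
        self.running = True
        self.restarts += 1


class Supervisor:
    def __init__(self, names: list[str]) -> None:
        self.services = {name: Service(name) for name in names}

    def heal(self) -> list[str]:
        """Restart only the services that are down; leave the rest untouched."""
        restarted = []
        for svc in self.services.values():
            if not svc.running:
                svc.restart()
                restarted.append(svc.name)
        return restarted


sup = Supervisor(["bgp", "isis", "ldp"])
sup.services["bgp"].crash()   # simulate a fault in a single service
restarted = sup.heal()
print(restarted)
```

Only the failed service is restarted, while the others keep running with their state untouched.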

Summary

Resiliency, high availability, and redundancy are no longer dependent on proprietary hardware systems. Through disaggregation and cloud-native design, modern routing platforms redefine what’s possible for network reliability. By combining…

More distributed architectures
Stateful redundancy
Faster recovery times
Microservices-based fault isolation
Granular network monitoring

…solutions like those from RtBrick enable operators to build self-healing, highly available networks that are better designed to meet the demands of today’s always-on digital world.