Date created: 06/26/19 19:17:44. Last modified: 07/01/19 10:57:23

Few Larger Routers vs. Many Smaller Routers

Over the years I have been bitten multiple times by having fewer big routers with either far too many services/customers connected to them or too much traffic going through them.

One thing I have learned is that when there is an outage on a large PE, even one that still has spare capacity, the business impact can be too much to handle: the support desk is overwhelmed, customers become irate if you can’t quickly tell them which services are impacted and when service will be restored, and the NMS raises so many alarms that it’s not clear what the problem is or where it’s coming from.

I’ve seen networks place a change freeze on devices, with the exception of changes that migrate customers or services off the PE, because any outage would have too great an impact on the business or would risk customers terminating their contracts. I’ve also seen change freezes placed on large PEs because the complexity was too great: it is hard to work out the impact of a change on one of the original PEs from when the network was first built, which is somehow linked to virtually every service on the network in some obscure and unforeseeable way.

This doesn’t mean there isn’t a place for large routers. For example, in a typical network, by the time we get to the P node layer in the core we tend to have high levels of redundancy, i.e. any PE is dual-homed to two or more P nodes and has 100% redundant capacity. Down at the access layer, customers may be connected to a single access device, or the access device might have a single backhaul link. So technically we have lots of customers, services and traffic passing through the larger P nodes, but these devices have a low rate of change (low touch), perform a small number of functions, are operationally simple, and are highly redundant. Conversely, at the service edge, using more, smaller devices, each dedicated to a single service, has proven time and time again to be a model that is simpler to manage, simpler to scale and simpler to decommission.
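As a rough illustration of what "100% redundant capacity" means for a dual-homed PE, the check below asks whether the peak load still fits after losing any single uplink. This is a minimal sketch only: the function name, the uplink sizes and the load figures are all hypothetical, and it ignores real-world factors such as ECMP hashing imbalance or QoS headroom.

```python
def survives_single_failure(uplink_gbps, peak_load_gbps):
    """Return True if the peak load still fits after losing any ONE uplink.

    uplink_gbps: list of uplink capacities (hypothetical figures, in Gbps).
    peak_load_gbps: peak traffic the device must carry (hypothetical).
    """
    if len(uplink_gbps) < 2:
        # Single-homed: losing the only uplink is an outage by definition.
        return False
    # The worst case is losing the largest uplink.
    remaining = sum(uplink_gbps) - max(uplink_gbps)
    return remaining >= peak_load_gbps


# Dual-homed PE with 2 x 100G uplinks and an 80G peak: fully redundant.
print(survives_single_failure([100, 100], 80))
# The same PE at a 120G peak has spare capacity, but is no longer redundant.
print(survives_single_failure([100, 100], 120))
# Access device with a single backhaul link.
print(survives_single_failure([10], 2))
```

The second case is the one that tends to creep up unnoticed: both links are well below line rate, yet a single failure would cause congestion.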

Some reasons for and against fewer larger routers or more smaller routers are listed in the table below (based on personal experience only). The tl;dr version though is that there’s rarely a technical restriction on having fewer large routers; it’s an operational/business impact problem:

| Fewer Large Routers | Many Smaller Routers |
|---|---|
| Fewer devices to manage. | More devices to manage. |
| Reduced NMS requirements, polling scale and device/interface licenses. | Increased NMS requirements. |
| Fewer devices to touch (simpler changes). | More devices to touch. |
| Bigger service impact during outages (potentially multiple services or a large number of customers of a single service affected by a single large device outage). | Reduced service impact during outages (potentially just one service or a smaller number of customers affected by a single device outage). |
| Reduced overall service availability. With fewer devices, multiple services are impacted by a single device failure (e.g. Internet and VoIP at the same time for the same customer). | Increased overall service availability. With more devices, only one service is impacted by a single device failure (e.g. Internet is down but VoIP still works). |
| More difficult to predict outage impact on devices with many services. | Simpler to predict outage impact on devices which are dedicated to a single service. |
| Complex capacity planning when multiple service types with different bandwidth profiles traverse the same device. | Simpler capacity planning when all services on a device have the same traffic profile. |
| More difficult to plan and obtain approval for maintenance events. | Less difficult to plan and obtain approval for maintenance events. |
| Increased risk of inter-feature interference (e.g. order-of-operations bugs). | Reduced likelihood of feature inter-op issues. |
| More complex software/hardware/regression/soak testing. | Simpler software/hardware/regression/soak testing. |
| Potentially cheaper to deploy: if the same model of device is used in both scenarios, this scenario requires fewer of them. If different devices are used, this could be more expensive: chassis are typically cheap but line cards are usually expensive, and this model potentially requires higher-density or higher-speed line cards. | Potentially more expensive to deploy: if the same model of device is used in both scenarios, this scenario requires more of them. If different devices are used, this could be less expensive: chassis are typically cheap but line cards are usually expensive, and this model potentially requires lower-density or lower-speed line cards. |
| Centralisation of services can reduce inter-PoP traffic requirements and improve service performance, e.g. there are only a few BNGs and each BNG PoP also hosts CDN nodes and P&T nodes. | Decentralisation of services can increase inter-PoP traffic and decrease performance, e.g. BNGs are widely distributed and not in the same PoPs as CDN nodes or P&T nodes. |
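The outage-impact comparison above can be made concrete with a small sketch. It models which customer services terminate on which device and reports what a single device failure takes down. All device names, customer names and the placement layout are hypothetical, and real networks would pull this data from an inventory or service database rather than a hard-coded list.

```python
from collections import defaultdict


def outage_impact(placements, failed_device):
    """Return {customer: set of services lost} when one device fails.

    placements: iterable of (customer, service, device) tuples.
    """
    impact = defaultdict(set)
    for customer, service, device in placements:
        if device == failed_device:
            impact[customer].add(service)
    return dict(impact)


def worst_case(placements):
    """Largest number of (customer, service) pairs lost to any one device."""
    per_device = defaultdict(int)
    for _customer, _service, device in placements:
        per_device[device] += 1
    return max(per_device.values())


customers = [f"cust{i}" for i in range(1, 5)]

# Fewer large routers: one PE terminates both services for every customer.
consolidated = [(c, s, "big-pe1") for c in customers for s in ("internet", "voip")]

# Many smaller routers: one dedicated device per service.
dedicated = ([(c, "internet", "inet-pe1") for c in customers]
             + [(c, "voip", "voip-pe1") for c in customers])

print(outage_impact(consolidated, "big-pe1")["cust1"])  # both services lost
print(outage_impact(dedicated, "inet-pe1")["cust1"])    # only Internet lost
print(worst_case(consolidated), worst_case(dedicated))
```

Being able to answer "what is impacted if this device fails?" from data like this is also what lets the support desk tell customers quickly which services are affected.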