L3 Fabric DC - The Underlay Network - Part 1
In the previous posts we have discussed the classic DC designs and the M-LAG solution.
In this post we will cover the basic L3 fabric DC. You might never have heard of it, or you might think it is a solution only for massive-scale DCs. Yes, the massive DCs in the world are running L3 fabrics, but nowadays more and more customers of all sizes are moving to it.
What about the vendor-specific solutions: Cisco FabricPath, Cisco FEX, Juniper QFabric, Juniper VCF? All are decent solutions, but the problem with all of them is that they are what we call "closed Ethernet fabrics" or "turn-key Ethernet fabrics". That means they are closed: once you buy one of these solutions you are stuck with that vendor (sometimes even with a specific platform from that vendor), because the vendors' fabrics do not interoperate with each other.
All of these solutions are also limited in scale. Yes, they scale much better than the classic design, they are easier to manage, and they perform better, but each has a hard limit it cannot scale beyond (some are limited to 28, 64, or 128 ToRs, some count it per port, but at the end of the day there is a fixed limit). If you exceed the limit of a solution, you have to buy multiple instances of it (the vendor is happy to sell more), and then you have to manage the control- and data-plane interaction between those instances (EVPN might play a role there; we will cover that later on).
All of these vendor fabrics are STP-free, either by building a proprietary protocol or by implementing an extended open standard, but those protocols are self-contained inside the solution, so you don't have to configure them and you have little control over them; as we said, it is a turn-key fabric. These protocols have the following roles:
1- Encapsulate the Ethernet frames.
2- Discover the physical connectivity and build loop-free paths.
3- Exchange reachability information as well as control and management traffic.
Personally, I won't go for any of these solutions despite their competitive features; being locked in with one vendor is what keeps me from implementing them. Think also of merchant-silicon switches and the white-box concept, and how cheap and competitive they have become for the large-scale DC, plus the ability to build your own ToR OS that supports only the features you need, instead of a bigger OS full of features that give you nothing you need but bugs.
Do you think the massive DCs rely on a single vendor's solution? Of course not; maybe from the hardware perspective, but never for the technology.
Massive DCs use a Clos/spine-and-leaf IP fabric, and it is also the trend for medium-scale DCs; even some small ones are considering it.
Let's explain a few more concepts: spine and leaf, and overlay and underlay networks.
Currently, spine-and-leaf physical connectivity has become the recommended way to build your data center, where the leaf is your ToR (Top of Rack switch, where you connect your servers) and the spine is the aggregation layer for those leafs, with a few conditions:
- No connectivity between spines.
- No connectivity between leafs.
- Each leaf is connected to all spines.
- Equal-speed interfaces from the spines to the leafs (highly recommended).
Traffic always goes leaf-spine-leaf; this is what we call "3-stage spine and leaf", which guarantees an equal delay, latency, and number of hops from any server to any server in your data center, and gives you the ability to scale and build your DC to whatever size you want.
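To make the connectivity rules concrete, here is a minimal Python sketch (the names and port counts are illustrative assumptions, not from any specific design) that builds a 3-stage leaf-spine topology and checks the rules above: no spine-spine or leaf-leaf links, every leaf connected to every spine, and any leaf-to-leaf path crossing exactly one spine.

```python
from itertools import product

# Illustrative sizes only: 4 spines, 8 leafs (ToRs).
SPINES = [f"spine{i}" for i in range(1, 5)]
LEAFS = [f"leaf{i}" for i in range(1, 9)]

# Rule: each leaf connects to every spine; no leaf-leaf or spine-spine links.
links = {frozenset((leaf, spine)) for leaf, spine in product(LEAFS, SPINES)}

def connected(a, b):
    return frozenset((a, b)) in links

# No spine-spine and no leaf-leaf connectivity.
assert not any(connected(a, b) for a, b in product(SPINES, SPINES) if a != b)
assert not any(connected(a, b) for a, b in product(LEAFS, LEAFS) if a != b)

# Every leaf sees every spine, so any leaf-to-leaf path is exactly
# leaf -> spine -> leaf (two hops), via any of the spines (ECMP).
for leaf in LEAFS:
    assert all(connected(leaf, spine) for spine in SPINES)

print(f"{len(LEAFS)} leafs x {len(SPINES)} spines = {len(links)} fabric links")
```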
Since each leaf has to be connected to all spines, if you have 48 ToRs then you need at least 48 ports on each spine, and so on. This is not much of a limitation when it comes to scaling; for example, Juniper announced the QFX10016, which gives you up to 2,304 10GbE, 576 40GbE, or 480 100GbE ports in one box, a perfect fit as a spine for a massive-scale DC with hundreds of ToRs.
The other factor that affects your scaling is the speed of the spine-to-leaf connections, since it drives the oversubscription ratio you are willing to accept at the ToR. There are other factors too, such as the cost of the optics if you really want to scale, especially with the distance between the racks (single-mode fiber, 40G optics, etc.), but that is more a financial issue than a technical one.
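As a quick back-of-the-envelope check (the port counts and speeds below are assumptions for illustration, not a recommendation), you can compute the maximum fabric size and the ToR oversubscription ratio from the spine port count and the leaf downlink/uplink bandwidth:

```python
# Illustrative numbers: a 48x10GbE leaf with 6x40GbE uplinks,
# and a spine with 576 40GbE ports.
leaf_server_ports = 48          # 10GbE server-facing ports per ToR
leaf_server_speed_gbps = 10
leaf_uplinks = 6                # 40GbE uplinks per ToR (one toward each spine)
leaf_uplink_speed_gbps = 40
spine_ports = 576               # 40GbE ports per spine box

# Each leaf consumes one port on each spine it connects to, so with one
# uplink per spine the fabric supports up to `spine_ports` leafs.
max_leafs = spine_ports
max_servers = max_leafs * leaf_server_ports

downlink_bw = leaf_server_ports * leaf_server_speed_gbps   # 480 Gbps
uplink_bw = leaf_uplinks * leaf_uplink_speed_gbps           # 240 Gbps
oversubscription = downlink_bw / uplink_bw                  # 2:1

print(f"max leafs: {max_leafs}, max 10GbE server ports: {max_servers}")
print(f"ToR oversubscription ratio: {oversubscription:.0f}:1")
```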
If you want to grow further or segment/zone your DC, you would move to a 5-stage fabric, but that is another story, so let's keep it for another discussion.
Also, many DCs connect the servers to an access-layer switch, and from this access layer they connect to the leaf, so we get a kind of access-leaf-spine topology.
Others build a VSpine, which is based on the same concept, but in this post we are laying the foundation, so let's discuss only the 3 stages without an access layer.
This spine-and-leaf L3 fabric has become the de facto connectivity model in the massive DCs.
Many networkers get confused when you talk about Clos and think of it as just an L3 fabric, but in fact Clos is not an acronym; it is named after Charles Clos, a researcher at Bell Laboratories, who laid the foundation of non-blocking connectivity in his 1953 paper "A Study of Non-blocking Switching Networks", which was mainly about switching telephone calls.
So Clos is essentially the spine-and-leaf physical connectivity with ECMP traffic forwarding; it can be a spine-and-leaf Ethernet fabric or a spine-and-leaf IP fabric.
Many of the closed Ethernet fabrics that we discussed above are physically connected as spine and leaf and use ECMP to forward traffic, but every vendor builds it on its own extension of some standard (IS-IS, TRILL).
Now let's see what makes the L3 fabric "L3", even though it is physically connected the same way as the vendor Ethernet fabrics.
Normally an Ethernet fabric provides you with access or trunk ports to connect your servers to, and traffic is forwarded based on L2 information (VLAN and MAC address); even if the vendor's solution encapsulates the frame in a different header, the lookup and the encapsulation are still based on L2. When it comes to ECMP, each vendor implements its own flavor to achieve non-blocking paths without the need to run STP.
The L3 fabric is a different story. What about changing the ToR's main job from being a switch to acting as a router? What about changing the ToR from forwarding based on MAC-address lookup to forwarding based on IP address? What about changing the ToR's server-facing ports from L2 interfaces (access or trunk ports) to L3 interfaces?
The ToR then acts like a router that forwards based on IP, utilizing all the uplinks toward the spines with ECMP, since the spine-leaf connections are all L3 interfaces.
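To illustrate how a ToR spreads flows across all of its uplinks, here is a minimal Python sketch of hash-based ECMP next-hop selection. The 5-tuple hashing shown is one common approach, not any specific vendor's implementation, and the interface names are made up for the example.

```python
import hashlib

# Illustrative uplinks: one L3 interface toward each of 4 spines.
uplinks = ["to-spine1", "to-spine2", "to-spine3", "to-spine4"]

def ecmp_uplink(src_ip, dst_ip, proto, src_port, dst_port):
    """Pick an uplink by hashing the flow's 5-tuple.

    All packets of one flow hash to the same uplink (no reordering),
    while different flows are spread across all equal-cost paths."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return uplinks[digest % len(uplinks)]

# Different flows may land on different spines; the same flow always
# takes the same path.
print(ecmp_uplink("10.0.1.10", "10.0.2.20", "tcp", 33512, 443))
print(ecmp_uplink("10.0.1.11", "10.0.2.20", "tcp", 40001, 443))
```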
Building such a network is building what we call the underlay network, which in this case provides IP connectivity between all the servers. This underlay network doesn't maintain any of the servers' forwarding information (VLAN, MAC, etc.); your underlay network is not aware of which VM is talking to which VM in which VLAN.
The term underlay network is not new; some might consider the internet as just a big underlay network (the internet as routers and their infrastructure), where the infrastructure provides reachability between IPs and exchanges prefixes with BGP, but any given router on the internet is not aware of which packet belongs to which user or which application. For that router, it is just a packet that needs to be delivered from point A to point B; that's why the internet can be considered an underlay network. So in the IP Clos, the underlay network provides IP reachability while utilizing all available links with ECMP.
But how will a server, or a VM, communicate with another server or VM? The quick answer: for the data plane, the L2 frames are encapsulated and carried over IP; for the control plane, we need a way for these servers and VMs to exchange reachability information, and that is what we call the overlay network. How does this work? How can we make a server or VM communicate while it is not aware of the underlay network?
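As a purely conceptual sketch of the data-plane idea (this is not a real VXLAN implementation; the classes, fields, and the VLAN-to-VNI mapping are simplified assumptions for illustration), carrying an L2 frame inside an IP packet between two ToRs looks roughly like this:

```python
from dataclasses import dataclass

@dataclass
class EthernetFrame:          # the overlay: what the VMs think they are sending
    src_mac: str
    dst_mac: str
    vlan: int
    payload: bytes

@dataclass
class UnderlayPacket:         # the underlay: what the fabric actually routes
    src_ip: str               # source ToR loopback
    dst_ip: str               # destination ToR loopback
    vni: int                  # virtual network ID standing in for the VLAN
    inner_frame: EthernetFrame

def encapsulate(frame: EthernetFrame, local_tor_ip: str, remote_tor_ip: str) -> UnderlayPacket:
    """Wrap the VM's L2 frame in an IP packet between the two ToRs.

    The spines only ever see src_ip/dst_ip; the VM's MACs and VLAN stay
    invisible to the underlay, which is exactly the point."""
    return UnderlayPacket(
        src_ip=local_tor_ip,
        dst_ip=remote_tor_ip,
        vni=10000 + frame.vlan,     # illustrative VLAN-to-VNI mapping
        inner_frame=frame,
    )

frame = EthernetFrame("00:aa:bb:cc:dd:01", "00:aa:bb:cc:dd:02", vlan=10, payload=b"hello")
print(encapsulate(frame, "10.255.0.1", "10.255.0.2"))
```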
We will come back to the overlay network in the upcoming post.
However, what are the benefits of an L3 Clos?
Here are some of the main ones:
1- We gain all the benefits of the Clos physical connectivity (scaling, ECMP that provides equal delay and number of hops between any endpoints, etc.).
2- We are not bound to any vendor, because the ToR only runs basic IP forwarding, ECMP, and a routing protocol to exchange prefixes; every switching company supports these features. Not only that, you can even build on merchant-silicon switches and white boxes.
3- The underlay network does not maintain any overlay state (MAC addresses, VLANs).
4- The effect of a failure is contained to only the part that failed.
5- You can scale to unbelievable numbers compared to any vendor fabric solution.
6- By building such a network you easily open the door to the SDN world.