EtherChannel is a technology that binds multiple physical links into a single logical link on a switch. Spanning tree then sees one logical port instead of several physical ports, so all member ports stay in the forwarding state and pass traffic without creating a loop in the network. We can configure an EtherChannel either statically or dynamically. A static EtherChannel is built by manually binding the physical ports into one logical port. This is not recommended, because the switch has no way of knowing the state of the ports on the other end. The dynamic option uses one of two protocols: LACP (Link Aggregation Control Protocol) or PAgP (Port Aggregation Protocol). LACP is the IEEE standard and PAgP is Cisco proprietary. LACP is the more commonly used of the two. It negotiates with the port on the other end and forms the port channel only once a set of parameters matches on both sides. With LACP we can bind a maximum of 16 ports to a single logical link, but only 8 are in the active forwarding state and the remaining ports stay in hot-standby mode. If an active port goes down, a standby port moves into the active state.
This article mainly focuses on explaining how vPC works. I will cover LACP in more detail in an upcoming article.
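The active and hot-standby behaviour of an LACP bundle can be illustrated with a small Python sketch. This is a simplified model, not how any switch implements it: the `Port` class, the `select_ports` function and the interface names are made up for illustration, and the ordering rule (lower port priority first, then lower port number) is a simplification of the real LACP selection logic.

```python
from dataclasses import dataclass

MAX_BUNDLED = 16  # ports that may be configured in one bundle
MAX_ACTIVE = 8    # ports that may forward at the same time


@dataclass(frozen=True)
class Port:
    """Hypothetical model of one member port of the bundle."""
    name: str
    number: int = 0          # tie-breaker; lower is preferred
    priority: int = 32768    # LACP port priority; lower is preferred
    up: bool = True


def select_ports(ports):
    """Split a bundle into (active, hot_standby) lists of port names."""
    if len(ports) > MAX_BUNDLED:
        raise ValueError(f"a bundle can hold at most {MAX_BUNDLED} ports")
    # Only ports that are up can take part; the best ones become active.
    usable = sorted((p for p in ports if p.up),
                    key=lambda p: (p.priority, p.number))
    active = [p.name for p in usable[:MAX_ACTIVE]]
    standby = [p.name for p in usable[MAX_ACTIVE:]]
    return active, standby


# Ten ports in the bundle: eight become active, two wait in hot standby.
bundle = [Port(f"Eth1/{i}", number=i) for i in range(1, 11)]
active, standby = select_ports(bundle)

# An active port fails: the first standby port is promoted.
bundle[0] = Port("Eth1/1", number=1, up=False)
active_after, standby_after = select_ports(bundle)
```

In this model, when Eth1/1 fails, Eth1/9 moves from hot standby into the active set and Eth1/10 remains the only standby port.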
The main drawback of an EtherChannel, whether static or dynamic, is that it can only be formed between two devices: from a single switch to a connected server, or from one switch to another switch. If that switch goes down, bundling even 8 links in the port channel does not help, because all of them terminate on the same switch, and the result is downtime. This is where vPC (virtual Port Channel) comes into play. Put simply, it is an EtherChannel whose member links are spread across two switches.
vPC is a Cisco proprietary technology that runs on Cisco Nexus switches. Other vendors use the same technique under a different name; for example, Arista calls it MLAG (Multi-Chassis Link Aggregation Group). Before Nexus, Cisco offered stacking and VSS on Catalyst switches, which look similar to vPC but are not the same. The main difference between vPC and VSS is that in vPC each switch keeps its own control plane and management plane, whereas in VSS the two switches share a single control plane and are managed as one device.
As shown in the figure below, vPC runs between exactly two switches. VSS is also formed between only two switches, whereas a stack can hold up to 9 switches. The following requirements and recommendations apply to a successful vPC setup.
· Both switches should be the same model.
· It is recommended to configure the keepalive link as an L3 link, with the IP address directly on the interface. Alternatively we can use an SVI, but its VLAN must not be allowed on the peer link. A dedicated port is not mandatory: on a fixed switch we can use the management port. On a chassis switch a dedicated port is recommended, because the management port sits on the supervisor card, and a supervisor failover can move it to the other supervisor and leave the management IP unreachable. The keepalive does not need a high-speed interface.
· For the peer link, it is recommended to use high-bandwidth links and to aggregate two or more ports so there is no single point of failure, because the peer link is the core of a vPC configuration. On a chassis model, take the ports from different line cards so the bundle survives a line card failure.
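The checklist above can be turned into a quick pre-deployment check. The Python sketch below is only an illustration: the `VpcDesign` class, its field names and the warning texts are hypothetical and do not correspond to any Cisco tool or API, and the line card is simply taken to be the part of the interface name before the "/".

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class VpcDesign:
    """Hypothetical description of a planned vPC pair."""
    model_a: str
    model_b: str
    peer_link_ports: List[str]            # e.g. ["Eth1/1", "Eth2/1"]
    peer_link_vlans: Set[int] = field(default_factory=set)
    keepalive_vlan: Optional[int] = None  # None: routed port or mgmt port
    chassis: bool = False


def check_design(design):
    """Return a list of warnings; an empty list means no issue was found."""
    warnings = []
    if design.model_a != design.model_b:
        warnings.append("peers are different models")
    if len(design.peer_link_ports) < 2:
        warnings.append("peer link has a single member port")
    # Assumption: the text before "/" identifies the line card (slot).
    slots = {port.split("/")[0] for port in design.peer_link_ports}
    if design.chassis and len(slots) < 2:
        warnings.append("peer-link ports share one line card")
    if (design.keepalive_vlan is not None
            and design.keepalive_vlan in design.peer_link_vlans):
        warnings.append("keepalive VLAN is carried on the peer link")
    return warnings
```

For example, a chassis pair whose peer link uses Eth1/1 and Eth1/2 would be flagged, because both ports sit on the same line card.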
Peer link and keepalive link: These two links play an important role in a successful vPC setup. The peer link should be an L2 trunk. CFSoE (Cisco Fabric Services over Ethernet) runs on the peer link and synchronizes control plane information between the two switches, for example STP state, MAC addresses, ARP entries, HSRP hellos and multicast traffic. The VLANs allowed on the member links should be the same on both switches and must also be allowed on the peer link. If you create a VLAN on one switch and allow it on the member port but forget to allow it on the peer link, that VLAN goes down on both switches; this is a type 2 consistency issue. We will discuss the consistency types in the advanced vPC article.
The keepalive link is used to exchange UDP heartbeat packets between the two switches, so each switch knows the status of the other. In a vPC, one switch is primary and the other is secondary. This does not mean the secondary sits in standby and forwards nothing: both switches forward data plane traffic, but the primary switch responds to the control plane packets. It is recommended to bundle at least two 10 Gig links for the peer link and to use the management port, in a separate VRF, as the keepalive link. The keepalive does not need a high-bandwidth interface because it only exchanges heartbeat information.
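The VLAN rule for the peer link described above can be expressed as a simple set comparison. The sketch below is a simplified model of that rule, not the switch's actual consistency checker, and the function name `suspended_vlans` is made up for illustration.

```python
def suspended_vlans(member_vlans_a, member_vlans_b, peer_link_vlans):
    """Return the VLANs that would not come up on the vPC.

    A VLAN is treated as healthy only if it is allowed on the member
    port of both switches and on the peer link. Anything configured on
    a member port but missing elsewhere is reported as suspended.
    """
    wanted = set(member_vlans_a) | set(member_vlans_b)
    healthy = (set(member_vlans_a)
               & set(member_vlans_b)
               & set(peer_link_vlans))
    return sorted(wanted - healthy)


# VLAN 30 is allowed on both member ports but was forgotten on the peer link.
missing = suspended_vlans({10, 20, 30}, {10, 20, 30}, {10, 20})
```

Here `missing` contains only VLAN 30, which matches the case of a VLAN allowed on the member ports but forgotten on the peer link.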
Failover scenarios in vPC: As discussed above, the two switches take the primary and secondary roles in the vPC, and together they appear as a single logical switch to the downstream switch or server.
Primary switch is down: If the primary switch goes down, for example because of a power issue, the secondary can no longer reach it through either the keepalive link or the peer link, so the secondary takes over the primary role and keeps forwarding traffic.
Secondary switch is down: If the secondary switch goes down because of a power issue, its member ports drop out of the port channel, and the primary switch stays in the primary role and forwards the traffic. The devices affected are those on orphan ports, which are connected only to the secondary switch.
Peer link is down: If the peer link goes down, the roles do not change, because the keepalive link is still alive. The vPC member ports on the secondary switch are suspended. Traffic is not affected in this case because the primary is still actively forwarding.
Keepalive link is down: If the keepalive link goes down while the peer link is up, the roles do not change, and the primary is still reachable through the peer link, so production is not affected. It is still recommended to restore the keepalive link as soon as possible to avoid the dual-failure scenario discussed below.
Keepalive link goes down first, then the peer link: Imagine the keepalive link goes down and a few minutes later the peer link goes down too. When the keepalive fails there is no impact on production, because the peers still reach each other through the peer link, and the primary keeps its role and forwards traffic. When the peer link then fails, the two switches cannot reach each other at all and a split-brain scenario arises: both switches become primary and forward traffic, but control plane data is no longer synchronized, which definitely causes issues in a production environment.
Peer link goes down first, then the primary switch: Now suppose the peer link goes down first. The secondary suspends its vPC member ports and the primary forwards the traffic. If the primary switch then goes down, the secondary can no longer reach it through the keepalive link, and it waits for some time before taking over the primary role, which means production downtime for a few minutes. To avoid this, Cisco introduced advanced features such as auto-recovery.
If both switches go down due to a power failure and only the secondary comes back up, production is again down for a few minutes while the vPC election completes. To address all of the issues discussed here, I will write one more article on the best practices to follow in vPC design.
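The link-failure scenarios above depend on the order in which the links fail, which is easier to see as a small state machine. The Python sketch below replays a sequence of events from the secondary switch's point of view. It is a deliberately simplified model: the function `secondary_after` and the event names are made up for illustration, and timers and features such as auto-recovery are left out.

```python
def secondary_after(events):
    """Replay failure events on the secondary; return (role, forwarding).

    Known events (hypothetical names):
      "keepalive_down"  the keepalive path to the peer is lost
      "peer_link_down"  the peer link is lost
      "peer_down"       the peer loses power: both paths go at once
    """
    keepalive_up = True
    role, forwarding = "secondary", True
    for event in events:
        if event == "keepalive_down":
            # Nothing changes on its own. If the member ports were
            # already suspended, they stay suspended (outage).
            keepalive_up = False
        elif event == "peer_link_down":
            if keepalive_up:
                # Heartbeats still arrive, so the primary is alive:
                # suspend the vPC member ports and stay secondary.
                forwarding = False
            else:
                # Peer unreachable on both paths: take over. If the
                # peer is in fact alive, this is split brain.
                role = "primary"
        elif event == "peer_down":
            keepalive_up = False
            role = "primary"
        else:
            raise ValueError(f"unknown event: {event}")
    return role, forwarding


split_brain = secondary_after(["keepalive_down", "peer_link_down"])
outage = secondary_after(["peer_link_down", "keepalive_down"])
```

In this model the same two failures give opposite results depending on order: keepalive first leaves the secondary forwarding as a second primary (split brain), while peer link first leaves its member ports suspended until a mechanism such as auto-recovery steps in.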
Ganapareddy Sudhakar
July 18th 2022