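Symptom
During a network migration, devices that had served as core-layer devices in the original network were redeployed as access-layer devices, that is, changed from Layer 3 to Layer 2. As shown in the figure, after the migration was complete, pings from the core-layer device to the management IP addresses of the access-layer devices succeeded only intermittently, and the core-layer device reported alarms indicating frequent VRRP master/backup state switchovers.
Networking diagram of the fault
The following alarms are generated on Switch_1: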
Sep 17 2015 21:46:11+08:00 DS_02 VRRP/3/VRRPMASTERDOWN:OID 1.3.6.1.4.1.2011.5.25.127.2.30.1 The state of VRRP changed from master to other state.(VrrpIfIndex=143, VrId=48, IfIndex=143, IPAddress=11.91.127.239, NodeName=DS_02, IfName=Vlanif948, CurrentState=2, ChangeReason=priority calculation)
Sep 17 2015 21:46:11+08:00 DS_02 %%01VRRP/4/STATEWARNINGMEV1R3(l):Virtual Router state BACKUP changed to MASTER, because of protocol timer expired. (Interface=Vlanif948, VrId=48).
Sep 17 2015 21:46:11+08:00 DS_02 %%01VRRP/4/STATEWARNINGMEV1R3(l):Virtual Router state MASTER changed to BACKUP, because of priority calculation. (Interface=Vlanif948, VrId=48).
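To quantify how often a VRRP group flaps, the state-change log lines above can be counted per interface and VRID. The following is a minimal Python sketch, not a vendor tool: the regular expression is written only against the two `STATEWARNINGMEV1R3` lines shown above, and the function name `count_vrrp_flaps` is hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the STATEWARNINGMEV1R3 log lines in this case;
# other software versions may format the message differently (assumption).
STATE_CHANGE = re.compile(
    r"Virtual Router state (?P<old>\w+) changed to (?P<new>\w+), "
    r"because of (?P<reason>[^.]+)\. "
    r"\(Interface=(?P<ifname>[^,]+), VrId=(?P<vrid>\d+)\)"
)

# The two state-change lines copied from the alarm output above.
SAMPLE_LOG = [
    "Sep 17 2015 21:46:11+08:00 DS_02 %%01VRRP/4/STATEWARNINGMEV1R3(l):"
    "Virtual Router state BACKUP changed to MASTER, because of protocol "
    "timer expired. (Interface=Vlanif948, VrId=48).",
    "Sep 17 2015 21:46:11+08:00 DS_02 %%01VRRP/4/STATEWARNINGMEV1R3(l):"
    "Virtual Router state MASTER changed to BACKUP, because of priority "
    "calculation. (Interface=Vlanif948, VrId=48).",
]


def count_vrrp_flaps(log_lines):
    """Count VRRP state transitions per (interface, VRID) pair."""
    flaps = Counter()
    for line in log_lines:
        match = STATE_CHANGE.search(line)
        if match:
            flaps[(match["ifname"], int(match["vrid"]))] += 1
    return flaps


if __name__ == "__main__":
    for (ifname, vrid), count in count_vrrp_flaps(SAMPLE_LOG).items():
        print(f"{ifname} VRID {vrid}: {count} state changes")
```

Two transitions within the same second on the same group, as in this sample, are the kind of pattern that points to lost or delayed VRRP advertisements rather than a real device failure.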
Cause Analysis
A loop exists in the network.
Troubleshooting Procedure
1. Check the status of the VRRP groups.
<Switch_1> display vrrp brief
VRID State Interface Type Virtual IP
--------------------------------------------------------
3 Backup Vlanif903 Normal 10.93.4.30
5 Backup Vlanif599 Normal 11.91.127.94
14 Backup Vlanif914 Normal 10.93.41.126
24 Backup Vlanif924 Normal 10.93.32.126
25 Backup Vlanif925 Normal 10.93.32.254
…………
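Switch_1 is the Backup in each of the listed VRRP groups, and the status of every group is normal.
- Check the statistics on VRRP protocol packets sent to the CPU.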
<Switch_1> display cpu-defend vrrp statistics all
Statistics on mainboard:
-------------------------------------------------------------------------------
Packet Type Pass(Bytes) Drop(Bytes) Pass(Packets) Drop(Packets)
-------------------------------------------------------------------------------
vrrp 0 0 0 0
-------------------------------------------------------------------------------
Statistics on slot 1:
-------------------------------------------------------------------------------
Packet Type Pass(Bytes) Drop(Bytes) Pass(Packets) Drop(Packets)
-------------------------------------------------------------------------------
vrrp 0 0 0 0
-------------------------------------------------------------------------------
Statistics on slot 4:
-------------------------------------------------------------------------------
Packet Type Pass(Bytes) Drop(Bytes) Pass(Packets) Drop(Packets)
-------------------------------------------------------------------------------
vrrp 79880066214 2581617736 1174644777 37950869
-------------------------------------------------------------------------------
The interface board in slot 4 of Switch_1 has dropped a large number of VRRP packets.
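The per-slot comparison above can be automated when the output is collected from many devices. The sketch below parses the `display cpu-defend vrrp statistics all` text as it appears in this case and computes a drop ratio per location. It assumes that packets handled equal Pass plus Drop, and the function names are hypothetical.

```python
import re

# Output abridged from the command shown above (separator lines removed).
SAMPLE_OUTPUT = """\
Statistics on mainboard:
Packet Type Pass(Bytes) Drop(Bytes) Pass(Packets) Drop(Packets)
vrrp 0 0 0 0
Statistics on slot 1:
Packet Type Pass(Bytes) Drop(Bytes) Pass(Packets) Drop(Packets)
vrrp 0 0 0 0
Statistics on slot 4:
Packet Type Pass(Bytes) Drop(Bytes) Pass(Packets) Drop(Packets)
vrrp 79880066214 2581617736 1174644777 37950869
"""


def parse_cpu_defend(output):
    """Return {location: (pass_packets, drop_packets)} for the vrrp rows."""
    stats = {}
    location = None
    for line in output.splitlines():
        header = re.match(r"\s*Statistics on (.+?):\s*$", line)
        if header:
            location = header.group(1)
            continue
        fields = line.split()
        if location is not None and len(fields) == 5 and fields[0] == "vrrp":
            _pass_bytes, _drop_bytes, pass_pkts, drop_pkts = map(int, fields[1:])
            stats[location] = (pass_pkts, drop_pkts)
    return stats


def drop_ratio(pass_pkts, drop_pkts):
    """Share of dropped packets, assuming total = pass + drop."""
    total = pass_pkts + drop_pkts
    return drop_pkts / total if total else 0.0


if __name__ == "__main__":
    for location, (passed, dropped) in parse_cpu_defend(SAMPLE_OUTPUT).items():
        print(f"{location}: drop ratio {drop_ratio(passed, dropped):.2%}")
```

The absolute packet count matters as much as the ratio here: more than a billion VRRP packets reaching one board is itself abnormal for a protocol that normally sends one advertisement per group per interval.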
2. Check the interface status and traffic statistics.
<Switch_1> display interface brief
…………
Interface PHY Protocol InUti OutUti inErrors outErrors
Eth-Trunk1 up up 31% 31% 0 0
GigabitEthernet4/0/22 up up 0.72% 81% 0 0
GigabitEthernet4/0/23 up up 81% 0.73% 2 0
Ethernet0/0/0 down down 0% 0% 0 0
…………
GigabitEthernet4/0/0 up up 0% 81% 0 0
GigabitEthernet4/0/1 up up 0% 81% 0 0
GigabitEthernet4/0/2 up up 0% 81% 2 0
GigabitEthernet4/0/3 up up 0% 81% 0 0
GigabitEthernet4/0/4 up up 0% 81% 0 0
GigabitEthernet4/0/5 up up 0% 81% 0 0
GigabitEthernet4/0/6 up up 0% 81% 0 0
GigabitEthernet4/0/7 up up 0% 81% 0 0
GigabitEthernet4/0/8 up up 0% 82% 0 0
GigabitEthernet4/0/9 up up 0% 82% 0 0
GigabitEthernet4/0/10 up up 0% 82% 0 0
GigabitEthernet4/0/11 down down 0% 0% 0 0
GigabitEthernet4/0/12 up up 0% 82% 0 0
GigabitEthernet4/0/13 up up 0% 82% 0 0
GigabitEthernet4/0/14 up up 0% 82% 0 0
GigabitEthernet4/0/15 up up 0% 82% 0 0
GigabitEthernet4/0/16 up up 0% 82% 0 0
GigabitEthernet4/0/17 up up 0.01% 82% 0 0
GigabitEthernet4/0/18 up up 82% 0% 0 0
GigabitEthernet4/0/19 up up 87% 82% 0 0
GigabitEthernet4/0/20 down down 0% 0% 0 0
GigabitEthernet4/0/21 up up 0.01% 0.01% 0 0
LoopBack500 up up(s) 0% 0% 0 0
NULL0 up up(s) 0% 0% 0 0
Vlanif599 up up -- -- 0 0
…………
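The average outbound bandwidth utilization of GigabitEthernet4/0/19 exceeds 80%, which suggests a loop. The inbound utilization of GigabitEthernet4/0/18 and GigabitEthernet4/0/19 also exceeds 80% (82% and 87%), while most other Up ports in slot 4 show 81% to 82% outbound and almost no inbound traffic, which is consistent with traffic entering on a few ports and being flooded out of the rest. The loop is therefore tentatively attributed to the devices attached to GigabitEthernet4/0/18 and GigabitEthernet4/0/19.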
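The same screening can be sketched in Python: parse the `display interface brief` rows and list the ports whose inbound utilization is above a threshold, since those are the likely entry points of looped traffic. The 80% threshold and the function names are assumptions for illustration, and the sample contains only a few rows copied from the output above.

```python
# A few rows copied from the "display interface brief" output above.
SAMPLE_OUTPUT = """\
Interface PHY Protocol InUti OutUti inErrors outErrors
Eth-Trunk1 up up 31% 31% 0 0
GigabitEthernet4/0/22 up up 0.72% 81% 0 0
GigabitEthernet4/0/23 up up 81% 0.73% 2 0
GigabitEthernet4/0/17 up up 0.01% 82% 0 0
GigabitEthernet4/0/18 up up 82% 0% 0 0
GigabitEthernet4/0/19 up up 87% 82% 0 0
GigabitEthernet4/0/21 up up 0.01% 0.01% 0 0
Vlanif599 up up -- -- 0 0
"""


def parse_utilization(output):
    """Return {interface: (in_percent, out_percent)} for rows with figures."""
    ports = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        name, _phy, _protocol, in_uti, out_uti = fields[:5]
        # Skips the header row and rows such as Vlanif that show "--".
        if not (in_uti.endswith("%") and out_uti.endswith("%")):
            continue
        ports[name] = (float(in_uti[:-1]), float(out_uti[:-1]))
    return ports


def high_inbound(ports, threshold=80.0):
    """List interfaces whose inbound utilization meets the threshold."""
    return sorted(
        name for name, (in_uti, _out_uti) in ports.items() if in_uti >= threshold
    )


if __name__ == "__main__":
    for name in high_inbound(parse_utilization(SAMPLE_OUTPUT)):
        print(name)
```

On this sample the rule also lists GigabitEthernet4/0/23 (81% inbound), which the case does not treat as a loop source, so the result is a list of candidates to inspect rather than a verdict.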
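3. Manually shut down GigabitEthernet4/0/18 and GigabitEthernet4/0/19, the ports whose utilization exceeds the limit, and check the statistics on VRRP protocol packets sent to the CPU again. The Drop count for VRRP packets no longer increases, and the management addresses of the other access-layer devices can now be pinged.
4. Check the access-layer devices attached to GigabitEthernet4/0/18 and GigabitEthernet4/0/19 of Switch_1. Both switches are from other vendors. The investigation shows that both had been Layer 3 devices with STP disabled, and the commands to enable STP were not added when they were redeployed as Layer 2 devices, which caused the loop.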
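The check in step 3, confirming that the Drop counter has stopped increasing, amounts to comparing two snapshots of the counter. A minimal sketch follows. The "before" value is the slot 4 Drop(Packets) figure from this case, the other snapshot values are hypothetical placeholders because the source does not list them, and the function name is also hypothetical.

```python
def drops_still_increasing(before, after):
    """Return the locations whose drop counter grew between two snapshots."""
    return sorted(
        location
        for location, drops in after.items()
        if drops > before.get(location, 0)
    )


# Drop(Packets) for slot 4 as reported in this case.
BEFORE_SHUTDOWN = {"mainboard": 0, "slot 1": 0, "slot 4": 37950869}

# Hypothetical snapshot taken while the loop is still active.
DURING_LOOP = {"mainboard": 0, "slot 1": 0, "slot 4": 37960000}

# Hypothetical snapshot after the ports are shut down: counter unchanged.
AFTER_SHUTDOWN = {"mainboard": 0, "slot 1": 0, "slot 4": 37950869}

if __name__ == "__main__":
    print("during loop:", drops_still_increasing(BEFORE_SHUTDOWN, DURING_LOOP))
    print("after shutdown:", drops_still_increasing(BEFORE_SHUTDOWN, AFTER_SHUTDOWN))
```

Taking the two snapshots a fixed interval apart makes the comparison meaningful, because the counters are cumulative and a single reading cannot show whether drops are still occurring.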
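Solution
- Enable STP on the access-layer devices attached to GigabitEthernet4/0/18 and GigabitEthernet4/0/19 of Switch_1.
- On Switch_1, run undo shutdown on GigabitEthernet4/0/18 and GigabitEthernet4/0/19, then check the STP status and interface traffic. Services return to normal.
Suggestions and Summary
When network traffic is unstable, check the interface traffic statistics for signs of a loop and use the inbound and outbound pattern to infer its likely source. Shut down the suspect ports promptly as a temporary fix, then apply the permanent solution once the root cause has been identified.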