Global consequences of a specific bug in the Quagga routing engine
Two weeks ago the Qrator Radar team encountered an intricate network incident. Clarifying it led to an internal investigation, a search for victims and perpetrators, and attempts to remedy the situation. On September 30, 2017, our team noticed an unusually large number of flapping BGP sessions.
A quick analysis showed that all the problematic sessions shared similar symptoms: they failed because of a corrupted BGP announcement. It also appeared that sessions were breaking not only with us, and that the failures were causing a significant rebuilding of routing around the world.
After spending some time looking for the cause, we found that the broken announcement was for the prefix 220.127.116.11/23, originated by AS262197 in Costa Rica. Over other BGP sessions the same advertisement arrived without any errors, but with a prepend value of… 563. On the one hand, such a policy is useless, since the effect of prepending already vanishes at values around 5. On the other hand, such an advertisement remains legitimate.
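To make "prepend value" concrete, here is a minimal Python sketch that measures the longest run of a repeated AS number in a path. The function name `longest_prepend` and the neighbouring ASNs 64500 and 64501 (from the documentation range) are illustrative assumptions, not taken from the incident data. Whether the origin's own occurrence counts toward the prepend value is a matter of convention; the sketch simply counts the whole run.

```python
from itertools import groupby


def longest_prepend(as_path):
    """Return (asn, run_length) for the longest run of one repeated ASN.

    Returns (None, 0) for an empty path.
    """
    # groupby collapses consecutive equal ASNs into one group each.
    runs = [(asn, sum(1 for _ in group)) for asn, group in groupby(as_path)]
    return max(runs, key=lambda run: run[1], default=(None, 0))


# Hypothetical path: two transit ASNs followed by a heavily prepended origin.
example_path = [64500, 64501] + [262197] * 563
print(longest_prepend(example_path))
```

A monitoring script could flag any path where the run length exceeds a locally chosen threshold. The threshold itself is a policy decision and is not prescribed by the protocol.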
We asked users which routing software they run, and in most cases the BGP session on the user's side was set up using Quagga, VyOS or Brocade (which are also based on Quagga). We modeled a scenario with a large prepend in our lab and became convinced that the issue is indeed localized in Quagga: after processing a route with a large prepend value, it generates an announcement with several anomalies, namely an incorrect length for the AS_PATH attribute and a malformed AS_PATH attribute value.
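The text above does not detail the exact defect in Quagga, so the following is only a sketch of why a path this long is a corner case for any encoder. Under RFC 4271, one AS_PATH segment carries at most 255 AS numbers, and an attribute value longer than 255 bytes needs the Extended Length flag with a two-byte length field. The function name `encode_as_path` and the assumption of 4-byte AS numbers are ours.

```python
import struct

AS_SEQUENCE = 2              # AS_PATH segment type (RFC 4271)
ATTR_TYPE_AS_PATH = 2        # path attribute type code for AS_PATH
FLAG_TRANSITIVE = 0x40       # set for well-known attributes
FLAG_EXTENDED_LENGTH = 0x10  # attribute length field is two bytes
MAX_ASNS_PER_SEGMENT = 255   # segment length is a single byte


def encode_as_path(asns, asn_size=4):
    """Encode a flat AS_SEQUENCE as one complete AS_PATH path attribute."""
    fmt = {2: "!H", 4: "!I"}[asn_size]
    value = b""
    # Split the path into segments of at most 255 ASNs each.
    for start in range(0, len(asns), MAX_ASNS_PER_SEGMENT):
        chunk = asns[start:start + MAX_ASNS_PER_SEGMENT]
        value += struct.pack("!BB", AS_SEQUENCE, len(chunk))
        value += b"".join(struct.pack(fmt, asn) for asn in chunk)
    flags = FLAG_TRANSITIVE
    if len(value) > 255:
        # The value no longer fits a one-byte length field.
        flags |= FLAG_EXTENDED_LENGTH
        header = struct.pack("!BBH", flags, ATTR_TYPE_AS_PATH, len(value))
    else:
        header = struct.pack("!BBB", flags, ATTR_TYPE_AS_PATH, len(value))
    return header + value


attr = encode_as_path([262197] * 563)
print(len(attr), hex(attr[0]))
```

With 563 four-byte AS numbers the path splits into three segments (255, 255 and 53 entries) and the value is 2258 bytes long, so both the segment split and the extended length must be handled correctly at the same time.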
As a result, following RFC 4271, the receiver of that advertisement sent a NOTIFICATION message, which led to BGP session flaps on a global scale. For operators whose upstreams use Quagga, this anomaly caused persistent BGP session breaks and could lead to partial or even complete network inaccessibility. We detected several hundred networks that lost global visibility because of this incident.
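The receiver-side behaviour can be sketched as a small validator. This is a simplified illustration, not the code of any real BGP implementation. It assumes `data` holds exactly one AS_PATH attribute with 4-byte AS numbers and accepts only the two segment types defined in RFC 4271. The error code and subcodes are the ones RFC 4271 assigns to UPDATE message errors. The function name `check_as_path_attribute` is hypothetical.

```python
import struct

UPDATE_MESSAGE_ERROR = 3     # NOTIFICATION error code (RFC 4271)
ATTRIBUTE_LENGTH_ERROR = 5   # subcode: Attribute Length Error
MALFORMED_AS_PATH = 11       # subcode: Malformed AS_PATH
FLAG_EXTENDED_LENGTH = 0x10
AS_SET, AS_SEQUENCE = 1, 2


def check_as_path_attribute(data, asn_size=4):
    """Return None if the attribute is well formed, else (code, subcode)."""
    length_error = (UPDATE_MESSAGE_ERROR, ATTRIBUTE_LENGTH_ERROR)
    malformed = (UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
    if len(data) < 3:
        return length_error
    if data[0] & FLAG_EXTENDED_LENGTH:
        if len(data) < 4:
            return length_error
        (length,) = struct.unpack("!H", data[2:4])
        value = data[4:]
    else:
        length = data[2]
        value = data[3:]
    # Simplification: the buffer must contain exactly the declared value.
    if length != len(value):
        return length_error
    pos = 0
    while pos < len(value):
        if len(value) - pos < 2:
            return malformed          # truncated segment header
        seg_type, count = value[pos], value[pos + 1]
        if seg_type not in (AS_SET, AS_SEQUENCE):
            return malformed          # unknown segment type
        pos += 2 + count * asn_size
        if pos > len(value):
            return malformed          # segment runs past the attribute
    return None
```

Under the original RFC 4271 rules, either error leads to a NOTIFICATION and the session is torn down. This is why a single bad announcement, re-advertised by many routers, can make sessions flap in many places at once.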
This anomaly repeated last weekend, but for a different prefix: 18.104.22.168/23. As of October 16, 2017, it was still active and had already lasted several days.
Last week Quagga v1.2.2 was released, fixing the bug with the AS_PATH attribute length. Our team also submitted an additional patch that fixes the second problem, the malformed content of the AS_PATH attribute itself. This patch was accepted on Sunday, but it may take time before the next release.
We strongly recommend that all Quagga users update to version 1.2.2 and watch for subsequent updates. In the appendix to this article we also attach a patch that fixes both of the bugs found in the processing of announcements.