National Internet Segments' Reliability
Qrator Labs is excited to present the 2018 National Internet Reliability Survey. In this report we study how the outage of a single AS may affect the global connectivity of the region.
Internet connectivity at the interdomain level is based on connectivity between autonomous systems (AS’s). As the number of alternate routes between AS’s increases, so do the fault tolerance and stability of the internet in a given country. However, some paths prove to be more important than others.
The global connectivity of any AS, regardless of whether it is a minor provider or an international giant, depends on the quantity of its paths to Tier-1 ISP’s. Usually, Tier-1 implies an international company that offers IP transit at a global scale and whose network is interconnected with other Tier-1 providers. But there is no obligation to maintain these interconnections! Only the market can motivate them to peer with each other and maintain global Internet connectivity. Is that motivation enough? We explore this question in the IPv6 section below. In any case, if an AS loses its connection to even one of the Tier-1 providers, it would likely become unreachable in some parts of the world.
Measuring Internet Reliability
Imagine that an AS is experiencing significant network degradation. We want to answer the following question: “What percentage of AS’s in this region would lose their global availability due to a single failure?”
Why model such a situation? Strictly speaking, when BGP and the world of interdomain routing were in the design stage, the creators assumed that every non-transit AS would have at least two upstream providers to guarantee fault tolerance in case one goes down. However, the reality is different: more than 45% of ISP’s have only one connection to an upstream provider. Some unusual topologies among transit ISPs reduce reliability further. So, do transit ISPs ever fail? The answer is yes, and it happens rather often. The proper question is: when will a particular ISP experience service degradation? If such problems seem remote, it may be worth considering Murphy’s Law: “Anything that can go wrong will go wrong.”
To model such a scenario, we applied the same model for the third year in a row. We did not merely repeat the previous calculations, though; this year we expanded the research significantly. The following steps were taken to calculate AS reliability:
- For every AS in the world, we retrieved all alternate paths to Tier-1 operators with the help of an AS relation model, the core of Qrator.Radar;
- Using the IPIP geodatabase, we matched countries to the originated address space of every AS;
- For every AS we counted the share of its address space that corresponds to the selected region. It helped to filter out situations where an ISP may be present at an internet exchange point in the given country but not have any significant presence in the region. A good example is Hong Kong, where hundreds of members of the biggest Asian Internet Exchange HKIX exchange traffic but have zero presence in the Hong Kong internet segment itself;
- After that, we evaluated the effect of a possible failure of a given AS on the other AS’s and, thus, on the specific countries;
- In the end, for each country, we found the specific AS that affects the largest portion of other AS’s in the given region.
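The last two steps can be illustrated with a minimal sketch. It is not the Qrator.Radar model: it assumes the AS relation data has already been reduced to a plain mapping from each AS to its upstream providers, that the geolocation steps have already produced the set of AS’s belonging to a country, and that an AS keeps global IPv4 connectivity as long as it can climb provider links to any Tier-1. The function names and the toy topology (built from AS numbers reserved for documentation) are hypothetical.

```python
from collections import deque

def has_tier1_path(asn, providers, tier1, failed=None):
    """True if `asn` can reach any Tier-1 by climbing provider links, avoiding `failed`."""
    if asn == failed:
        return False
    seen, queue = {asn}, deque([asn])
    while queue:
        current = queue.popleft()
        if current in tier1:
            return True
        for upstream in providers.get(current, ()):
            if upstream != failed and upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return False

def outage_share(failed, country_ases, providers, tier1):
    """Share of the country's AS's (other than `failed`) cut off by the failure."""
    lost = [
        asn for asn in country_ases
        if asn != failed
        and has_tier1_path(asn, providers, tier1)
        and not has_tier1_path(asn, providers, tier1, failed)
    ]
    return len(lost) / len(country_ases)

def worst_single_failure(country_ases, providers, tier1):
    """Return (asn, share) for the AS whose failure cuts off the most AS's."""
    candidates = set(providers) | set(tier1)
    for upstreams in providers.values():
        candidates |= set(upstreams)
    # Ties are broken in favour of the lowest AS number (arbitrary choice).
    worst = max(
        candidates,
        key=lambda f: (outage_share(f, country_ases, providers, tier1), -f),
    )
    return worst, outage_share(worst, country_ases, providers, tier1)

# Hypothetical topology: two Tier-1s, two transit ISPs, four customers.
tier1 = {64496, 64497}
providers = {
    64510: {64496, 64497},   # transit ISP, dual-homed to both Tier-1s
    64511: {64496, 64497},   # transit ISP, dual-homed to both Tier-1s
    64501: {64510},          # single-homed customer
    64502: {64510},          # single-homed customer
    64503: {64510, 64511},   # multihomed customer
    64504: {64511},          # single-homed customer
}
country_ases = {64501, 64502, 64503, 64504, 64510, 64511}

worst_asn, worst_share = worst_single_failure(country_ases, providers, tier1)
print(worst_asn, f"{worst_share:.0%}")  # prints: 64510 33%
```

In this toy country the failure of AS64510 cuts off its two single-homed customers, while the multihomed customer survives through AS64511. Whether the failed AS itself should count toward the outage is a modelling choice; the sketch counts only the other AS’s.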
Here you can find the top-20 countries in terms of reliability in the event of a single failure. In practice, this means that these countries have the most diverse IP-transit markets. The percentage shows the portion of AS’s that would lose global connectivity if a specific AS failed.
- Romania and Luxembourg fell out of the top-20 from 11th and 20th place, respectively, in 2017;
- Singapore jumped 18 places to 5th;
- Hong Kong fell 13 places to 15th;
- The Netherlands entered the top-20 for the first time in 17th place;
- 18 of 20 countries remained in the top-20 compared to last year.
While individual countries may have moved up or down the list, the overall reliability rate did not change significantly from 2017. Last year the average outage from a single failure was 41%, and in 2018 it decreased by 3 percentage points to approximately 38%. The number of countries with an outage rate below 10% (indicating fault resistance) increased by one to reach 30.
The primary trend for the year was found to be the significant reliability improvements in small countries of South Asia and Africa. These regions are still developing, but strong improvements in IP-transit market diversity are a sign of accelerating progress.
Game of IPv6
It is commonly believed that if a technology works well in IPv4, it can be easily ported to IPv6. This mistaken assumption may be the most significant structural problem of the whole IPv6 development process. Measuring internet reliability for IPv6 is no exception.
To maintain global connectivity in IPv4, it's enough to maintain a single customer-to-provider path to one of the Tier-1 providers. But in IPv6 this may not be enough. Due to ongoing peering wars between several Tier-1 providers in IPv6, they are not all connected to each other. At least two pairs of providers decided to “de-peer” in IPv6: Cogent (AS174) and Hurricane Electric (AS6939); Deutsche Telekom (AS3320) and Verizon US (AS701). These telecoms may have different reasons for their conflicts, but if a network is connected to only one party in the conflict, it will not have full IPv6 connectivity. This also affects the reliability of ISP’s with multiple upstream providers: the outage of just one of them may already lead to connectivity problems.
To address these issues, we adjusted our measurement process to check whether full IPv6 connectivity is maintained during an outage. In other words, paths to all Tier-1 providers must be present to maintain full connectivity. We also calculated the percentage of AS’s in the country that have only partial connectivity due to these peering wars. Here are the results:
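The adjusted check can be sketched as follows. This is a simplified illustration, not the actual measurement code: it assumes that an AS "sees" a Tier-1 either by climbing provider links to it or by climbing to another Tier-1 that still peers with it, and that a failed Tier-1 is simply dropped from the required set. The function names, the peering table and the topology (AS numbers reserved for documentation) are hypothetical.

```python
def upstream_cone(asn, providers, failed=None):
    """All AS's reachable from `asn` by climbing provider links, avoiding `failed`."""
    if asn == failed:
        return set()
    seen, stack = {asn}, [asn]
    while stack:
        current = stack.pop()
        for upstream in providers.get(current, ()):
            if upstream != failed and upstream not in seen:
                seen.add(upstream)
                stack.append(upstream)
    return seen

def classify(asn, providers, tier1, peering, failed=None):
    """Return 'full', 'partial' or 'none' IPv6 connectivity for `asn`."""
    required = tier1 - {failed}
    reached = upstream_cone(asn, providers, failed) & required
    visible = set(reached)
    for t in reached:
        # A reached Tier-1 also gives access to every Tier-1 it peers with.
        visible |= {u for u in required if frozenset((t, u)) in peering}
    if not visible:
        return "none"
    return "full" if visible == required else "partial"

# Hypothetical Tier-1 set in which 64496 and 64497 have de-peered.
tier1 = {64496, 64497, 64498}
peering = {frozenset((64496, 64498)), frozenset((64497, 64498))}
providers = {
    64500: {64496},          # transit ISP behind Tier-1 64496
    64501: {64497},          # transit ISP behind Tier-1 64497
    64502: {64498},          # transit ISP behind Tier-1 64498
    64505: {64500},          # single-homed on one side of the conflict
    64506: {64500, 64501},   # multihomed across both sides of the conflict
    64507: {64502},          # single-homed behind the fully peered Tier-1
}

print(classify(64505, providers, tier1, peering))                # partial
print(classify(64506, providers, tier1, peering))                # full
print(classify(64506, providers, tier1, peering, failed=64501))  # partial
print(classify(64507, providers, tier1, peering))                # full
```

The third call shows the effect described above: a multihomed AS stays reachable after losing one upstream, yet it drops from full to partial connectivity because the surviving path leads only to one side of the de-peering conflict.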
The overall comparison of IPv4 and IPv6 in case of a single failure shows that for 86% of countries, IPv4 connectivity is more reliable. An important discovery in the world of IPv6 is that many ISP’s do not have proper connectivity even under normal operating conditions, without any outages. For example, in the US this applies to approximately 10% of all AS’s that have IPv6 support, and in China the situation is even worse, with China Telecom (AS4134) getting global IPv6 connectivity from only one provider: Hurricane Electric.
As stated above, nobody can force Tier-1 providers to peer with each other except their customers. Data clearly shows that user demand is not incentive enough for them to connect to each other and achieve 100% network visibility. Explicit market calls for proper IPv6 service seem to be the only way to improve the situation. The Qrator.Radar team is considering different options to make this information transparent for every ISP in the world, thus improving community awareness of the problem.
Broadband Internet and PTR records
We believe that studying the diversity and reliability of the IP-transit market in different countries can be quite useful in certain business scenarios. But there are other angles to consider: there are mobile users and broadband connections, and we can’t assume that they are uniformly distributed among all ISPs in the region.
There are different ways to gather information related to internet market shares. We tried to keep our study technical, and so we looked for easy-to-check metrics that would correspond to some features of the internet markets. We tried different metrics: size of IP address space, DNS records, PTR records, ping-able IPs and all their combinations. We found the data gathered by Rapid7 Open Data quite useful at this stage of the research. After many experiments, we determined that in the majority of regions there is a strong correspondence between the number of PTR records and broadband market share.
So, using distribution of PTR records we decided to recalculate the reliability rating and compare results with the original one:
The results proved to be significantly worse than the reliability of the IPv4 transit market: the average outage increased from 38% to 63%. In some regions, the possible outage percentage increased more than tenfold. For instance, service degradation at Deutsche Telekom would affect 42% of broadband connections in Germany, compared to 2.2% of AS’s. It is no surprise that in some regions the ISP whose failure leads to the biggest outage has also changed: for example, in France, Great Britain and Spain, Cogent was replaced by Orange (AS5511), British Telecom (AS5400) and Telefonica (AS12956), respectively.
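The gap between the two ratings comes from what is being counted. A minimal sketch, assuming the set of AS’s cut off by a failure is already known and that PTR record counts per AS serve as a proxy for broadband market share; the function names and all numbers below are invented for illustration and are not taken from the survey data.

```python
def as_outage_share(lost, country_ases):
    """Share of the country's AS's that lose connectivity."""
    return len(lost & country_ases) / len(country_ases)

def ptr_weighted_outage_share(lost, ptr_counts):
    """Share of the country's PTR records that belong to the lost AS's."""
    total = sum(ptr_counts.values())
    return sum(count for asn, count in ptr_counts.items() if asn in lost) / total

# Hypothetical country: one incumbent holds most PTR records (illustrative numbers).
ptr_counts = {64500: 900, 64501: 50, 64502: 30, 64503: 20}
country_ases = set(ptr_counts)
lost = {64500}  # AS's cut off by the failure under study

print(as_outage_share(lost, country_ases))          # 0.25
print(ptr_weighted_outage_share(lost, ptr_counts))  # 0.9
```

The same failure removes one AS in four but nine PTR records in ten, which is the kind of disproportion the recalculated rating exposes.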
This study highlights that even in countries with excellent IP-transit market diversity, other internet segments may be dominated by a few players or even a single one. This creates additional operational risks that should be considered by any service that requires constant availability in the selected region.
During this survey, we expanded our reliability research beyond the IPv4 transit market. We found a way to calculate the outage in case of a single failure for the IPv6 market, which proved to be still under development and can't be called reliable in the majority of regions. The main reason is the ongoing peering wars, and we hope customer demand will finally get these issues fixed.
We also tried to study other internet markets, starting with the broadband internet segment. We found a correlation between PTR records and broadband internet market share in many regions, but there were also exceptions. We'll keep studying this curious relation to obtain an even more robust metric. Even so, its use has already highlighted a certain disproportion in the broadband internet market in selected regions.