How Meta detects and mitigates ‘silent errors’


Silent errors, as they are called, are hardware errors that leave no trace in system logs. Their occurrence can be further exacerbated by factors such as temperature and age. They are an industry-wide problem and a major challenge for data center infrastructure, because they can corrupt applications for extended periods while going undetected.

In a newly published paper, Meta details how it detects and mitigates these faults in its infrastructure. Meta uses a combined approach: testing machines while they are offline for maintenance, and running smaller tests during production. Meta has found that while the former achieves greater overall coverage, in-production testing can achieve robust coverage in a much shorter time frame.

Silent errors

Silent errors, also known as silent data corruption (SDC), are the result of an internal hardware failure. To be more specific, these errors occur in places where there is no control logic, so the defect goes undetected. They can be further influenced by factors such as temperature variance, data path variations, and age.

The defect causes incorrect operation of the circuit. This can then manifest itself at the application level as a flipped bit in a data value, or it can even cause the hardware to execute the wrong instructions. Their effects can even spread to other services and systems.

For example, in one case study, a simple calculation in a database yielded the wrong answer 0, resulting in missing rows and then data loss. At Meta's scale, the company reports having observed hundreds of such SDCs. Meta has found an SDC occurrence rate of roughly one in a thousand silicon devices, which it claims reflects fundamental silicon challenges rather than particle effects or cosmic rays.
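To illustrate how a hardware defect of this kind surfaces at the application level, here is a minimal Python sketch of a single flipped bit silently corrupting a computed value. This is an illustration only; the `flip_bit` helper is hypothetical and simply simulates what a defective circuit might do.

```python
def flip_bit(value: int, bit: int) -> int:
    """Flip a single bit in an integer, simulating a silent hardware fault."""
    return value ^ (1 << bit)

# A correct computation...
correct = 3 * 5                    # 15, binary 0b1111

# ...silently corrupted by a single flipped bit (bit 3 here).
corrupted = flip_bit(correct, 3)   # 0b0111 = 7

# No exception is raised and nothing is logged:
# the wrong value simply propagates to downstream consumers.
print(correct, corrupted)          # 15 7
```

The key point is that nothing fails loudly: the corrupted value is indistinguishable from a valid one unless something downstream checks it against a reference.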

Meta has been running detection and testing frameworks since 2019. These strategies fall into two categories: fleetscanner for out-of-production testing and ripple for in-production testing.

Silicon Test Funnel

Before entering Meta's fleet, a silicon device passes through a silicon test funnel. During development, before launch, a chip goes through design verification (simulation and emulation) and then silicon validation on actual samples; both stages can take several months. During manufacturing, the device undergoes further automated testing at the device and system level. Silicon vendors often use this level of testing for binning, since performance varies between parts. Non-functional chips lower the production yield.

Finally, when the device arrives at Meta, it undergoes infrastructure intake (burn-in) testing across many rack-level software configurations. Traditionally, that completed the testing: the device was expected to operate for the rest of its life cycle, relying on built-in reliability-availability-serviceability (RAS) features to monitor system health.

However, SDCs cannot be detected by these methods. Detecting them requires special test patterns that are run periodically during production, which in turn requires orchestration and planning; in the most extreme case, these tests run while the machine is serving live workloads.

Notably, the closer a device is to running production workloads, the shorter the test duration, but also the lower the ability to diagnose silicon defects. In addition, the cost and complexity of testing, as well as the potential impact of a defect, increase. For example, at the system level multiple types of devices must work in tandem, while the infrastructure level adds complex applications and operating systems on top.

Test observations for the entire fleet

Silent errors are tricky because they produce erroneous results that go unnoticed and can affect many applications. These errors continue to propagate until they produce a noticeable difference at the application level.

In addition, multiple factors influence their occurrence. Meta has found that these fall into four main categories:

Data randomization. Corruptions usually depend on the input data, for example on certain bit patterns, which creates a large state space for testing. For example, 3 times 5 may correctly evaluate to 15, while 3 times 4 erroneously evaluates to 10 instead of 12.

Electrical variations. Changes in voltage, frequency, and current can lead to more data corruption. Under one set of these parameters the result may be accurate, while under another it may not, which further complicates the test state space.

Environmental variations. Variations such as temperature and humidity can also affect silent errors, as they directly affect the physics of the device. Even in a controlled environment like a data center there can still be hotspots; in particular, this can lead to differing results between data centers.

Life-cycle variations. As with regular device failures, the occurrence of SDCs can vary over the life of the silicon.
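To make the data-randomization category concrete, here is a minimal Python sketch of a randomized known-answer scan: a hypothetical `device_multiply` stands in for the operation under test and is compared against a trusted software reference across random operands. This illustrates the technique only; it is not Meta's actual test code, and on healthy hardware (as simulated here) no mismatches are found.

```python
import random

def reference_multiply(a: int, b: int) -> int:
    """Trusted software reference implementation."""
    return a * b

def device_multiply(a: int, b: int) -> int:
    """Hypothetical stand-in for the operation under test; a real
    harness would dispatch this to the silicon being screened."""
    return a * b  # simulating healthy hardware here

def scan_random_inputs(trials: int, seed: int = 0) -> list:
    """Run randomized input patterns and collect any (a, b) mismatches."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        if device_multiply(a, b) != reference_multiply(a, b):
            mismatches.append((a, b))
    return mismatches

print(len(scan_random_inputs(10_000)))  # 0 on healthy hardware
```

Because corruption can depend on specific bit patterns, a real scan would combine random operands with directed patterns, and would also sweep the electrical and environmental variables described above.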

Test Infrastructure

Meta has implemented two categories of fleet-wide testing on millions of machines. These are out-of-production and in-production testing.

Workflow chart for out-of-production testing.

In out-of-production testing, machines are taken offline and subjected to known input patterns, and the output is compared with reference results. These tests take into account all the variables discussed above and explore the resulting state space using search policies.

Typically, machines are not taken offline specifically to test for silent errors, but are tested opportunistically while the machine is offline for various other reasons, such as firmware and kernel upgrades, provisioning, or traditional server repair.

During such server maintenance, Meta performs silent error detection with a testing tool called fleetscanner. This way of working minimizes overhead and therefore costs. When a silent data corruption is detected, the machine is quarantined and subjected to further testing.
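The opportunistic flow described above can be sketched as a simple state machine. Everything here is hypothetical scaffolding (the host names, the `run_sdc_scan` stub, the state names); the real fleetscanner suite is far more involved.

```python
from enum import Enum, auto

class MachineState(Enum):
    IN_PRODUCTION = auto()
    QUARANTINED = auto()

def run_sdc_scan(machine_id: str) -> bool:
    """Placeholder for a fleetscanner-style test suite.
    Returns True if the machine is clean; here we pretend
    one host ('host-042') harbors a silent defect."""
    return machine_id != "host-042"

def on_maintenance(machine_id: str) -> MachineState:
    """Machine is already offline for a firmware upgrade, kernel
    upgrade, provisioning, or repair -- test it opportunistically,
    and quarantine it if silent data corruption is detected."""
    if run_sdc_scan(machine_id):
        return MachineState.IN_PRODUCTION  # healthy: return to the fleet
    return MachineState.QUARANTINED        # defective: hold for deeper testing

print(on_maintenance("host-001").name)  # IN_PRODUCTION
print(on_maintenance("host-042").name)  # QUARANTINED
```

Piggybacking on maintenance that happens anyway is what keeps the overhead, and therefore the cost, low.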

Workflow chart for in-production testing.

Because out-of-production testing is slow and responds slowly to newly identified defect signatures, Meta also performs in-production testing with a tool called ripple. It runs alongside the workload, executing test instructions at millisecond intervals. Meta reports that ripple enables shadow testing, A/B testing across the various variables, and an always-on mode. In particular, Meta identifies ripple testing as an important evolution of its silent data corruption tooling.
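A toy sketch of that in-production pattern, assuming a background thread that issues small known-answer test instructions every millisecond beside the workload. The `test_instruction` check and the loop structure are illustrative assumptions; ripple itself targets specific datapaths with carefully chosen operand patterns.

```python
import threading
import time

def test_instruction() -> bool:
    """One tiny known-answer check; a real ripple test would exercise
    specific silicon datapaths with chosen operand patterns."""
    return (3 * 5 == 15) and (1 << 20 == 1048576)

def ripple_loop(stop: threading.Event, interval_s: float, failures: list) -> None:
    """Run test instructions at millisecond intervals alongside the workload,
    recording the timestamp of any mismatch."""
    while not stop.is_set():
        if not test_instruction():
            failures.append(time.monotonic())
        stop.wait(interval_s)  # sleep, but wake immediately on shutdown

stop = threading.Event()
failures: list = []
t = threading.Thread(target=ripple_loop, args=(stop, 0.001, failures))
t.start()
time.sleep(0.05)   # the production workload would be running here
stop.set()
t.join()
print(len(failures))  # 0 on healthy hardware
```

Because the checks are tiny and interleaved with real work, coverage accumulates continuously without taking the machine out of production, which is exactly the trade-off the article describes.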

Findings and considerations

Based on three years of observations, fleetscanner achieved 93% coverage for a particular defect family, including 23% unique coverage that was not reachable through ripple. The cost, however, is a non-trivial amount of machine time (and therefore money) spent on testing. Ripple, in turn, offered 7% unique coverage; Meta states this coverage would be impossible to achieve with fleetscanner, because ripple tests across frequent workload transitions.

When comparing the time to achieve equivalent SDC coverage of 70%, fleetscanner would take 6 months compared to just 15 days for ripple.

If left undetected, applications can be exposed to silent data corruption for months. This, in turn, can lead to significant consequences, such as data loss, which can take months to debug. This poses a critical problem for the data center infrastructure.

Meta has implemented a comprehensive testing methodology combining opportunistic fleetscanner tests, run while machines are offline for other maintenance, with faster (millisecond-level) in-production ripple testing.


This post, "How Meta detects and mitigates 'silent errors'", was originally published at https://venturebeat.com/2022/03/29/how-meta-detects-and-mitigates-silent-errors/.