Description:
A slipstream processor runs two copies of a program, one slightly ahead of the other, to achieve both higher single-program performance and transient fault tolerance. The leading copy of the program, or the Advanced Stream (A-stream), is accelerated by executing only a key subset of all instructions. The partial A-stream is speculative. Therefore, a second, complete copy of the program, called the Redundant Stream (R-stream), receives and checks all A-stream outcomes. The R-stream is also accelerated in this process. Together, the A-stream and R-stream finish faster than a single program copy would.
The partial redundancy between the A-stream and R-stream enables detection of, and recovery from, transient faults. A transient fault that affects a redundantly executed instruction is easily detected, because its two instances will differ. However, a transient fault that affects a singly executed instruction (an instruction removed from the A-stream) is difficult to detect directly, because there is no redundant counterpart for comparison.
A fault in a singly executed instruction is nevertheless indirectly detectable via a redundantly executed consumer. However, such a fault is unrecoverable, because it is attributed to the consumer: recovery is initiated too late, from the consumer instead of from the faulty producer.
We propose a mechanism that conservatively attributes a detected fault, not to the redundantly executed instruction that detected it, but to its singly executed producer. Accordingly, recovery is initiated safely from the singly executed producer. Our approach works by forming a forward slice for each singly executed instruction, terminating in its direct or indirect redundantly executed consumers. With these slices in place, a consumer can mark its singly executed producer as faulty when its comparison mismatches.
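The attribution idea can be illustrated with a minimal software sketch. It is not the hardware mechanism of this work; it assumes a simplified dynamic trace with register dependences only (no memory dependences), and the names Instr, blame_sets, and recovery_point are hypothetical. Each register carries the set of singly executed producers whose forward slices reach it. A redundantly executed consumer inherits that set as the instructions to blame on a mismatch, and, as a further simplifying assumption, a slice terminates at that consumer's result.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    idx: int                # position in the dynamic instruction stream
    dest: Optional[str]     # destination register, or None
    srcs: Tuple[str, ...]   # source registers
    redundant: bool         # True if executed in both A-stream and R-stream


def blame_sets(trace: List[Instr]) -> Dict[int, Set[int]]:
    """Map each redundantly executed instruction to the singly executed
    producers whose forward slices terminate in it (illustrative model)."""
    taint: Dict[str, Set[int]] = {}   # register -> unchecked producers
    blame: Dict[int, Set[int]] = {}
    for ins in trace:
        incoming: Set[int] = set()
        for reg in ins.srcs:
            incoming |= taint.get(reg, set())
        if ins.redundant:
            # A checked consumer: on a mismatch, these producers are suspect.
            blame[ins.idx] = incoming
            if ins.dest is not None:
                taint[ins.dest] = set()   # assumption: slice ends here
        elif ins.dest is not None:
            # An unchecked instruction extends the slices flowing through it.
            taint[ins.dest] = incoming | {ins.idx}
    return blame


def recovery_point(blame: Dict[int, Set[int]], consumer_idx: int) -> int:
    """Conservatively recover from the oldest suspect producer, or from the
    consumer itself if no singly executed producer feeds it."""
    return min(blame[consumer_idx] | {consumer_idx})


example_trace = [
    Instr(0, "r1", (), True),
    Instr(1, "r2", ("r1",), False),       # singly executed
    Instr(2, "r3", ("r2",), False),       # singly executed
    Instr(3, "r4", ("r3", "r1"), True),   # checked consumer of 1 and 2
    Instr(4, "r5", ("r4",), True),
]
example_blame = blame_sets(example_trace)
```

In this example, a mismatch at instruction 3 is attributed to the singly executed instructions 1 and 2, so recovery_point(example_blame, 3) selects instruction 1 rather than the consumer. A mismatch at instruction 4, which has only checked producers, is attributed to instruction 4 itself.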
A singly executed branch does not have a forward slice and thus is not checkable by consumers. However, the branch was removed from the A-stream precisely because its branch prediction is highly confident and hence very likely correct. This likely-correct branch prediction is treated as a second execution of the corresponding singly executed branch: different from true execution, but nearly as effective for detecting faults.
In fact, the observation about confident branches extends to all redundantly executed instructions, since the A-stream is predictive as a whole. All A-stream instructions are speculative, yet most likely correct in the fault-free case. This reveals an intriguing predictive checking paradigm.
Experiments using the SPEC95 and SPEC2K benchmarks show that coverage improves from 81% for baseline slipstream to 99%, with only a small decrease in speedup. To obtain the same performance as baseline slipstream, we propose a relaxed checking model, which still achieves a much higher coverage of 95%.