Automatic Derivation of Instruction-Local Fault-Propagation Behavior

This is an explorative topic and probably a quite challenging thesis.

Testing fault tolerance mechanisms is commonly done by performing extensive fault-injection experiments on a system that try to mimic physical causes of radiation effects like soft errors/bit flips and then observing the system's behaviour. There are different strategies to perform such injections: In the most basic variant, one would inject every bit in every cycle, which spans a so-called fault space. However, this strategy is not an viable option for longer running programs that operate on large data sets.

Therefore, fault-space pruning tecniques where developed that subsume multiple faults of that basic injection model into an equivalence set, where an injection into any member of that set will yield the exact same result. Therefore, we only have to inject one fault of each set, the fault-injection pilot, and backproject the result onto all members of the set. One basic method to form such equivalence sets is the so-called Def/Use Pruning method, which subsumes possible faults of the same memory location between an read/write events. In a nutshell, as long as a faulty bit is stored passively in memory, it cannot have any influence on the system.

In previous work, we developed an extension to the classic def/use-pruning technique: data-flow sensitive fault pruning. This technique does not only use the read/write semantic of executed assembler instructions but also their semantic. For example, if you flip a bit in one argument of an XOR-instruction this is exactly equivalent to flip the same bit position in the output of the instruction. Other instruction-specific fault propagations that are easy to understand include MOV, AND, OR instructions.

Besides the general pruning algorithm, this technique requires instruction-local fault-space propagation rules, which describe how a one-bit fault at on of its inputs propagates to the output bits.These rules are specific for each architecture, which calls for an automatic derivation! With this thesis, it is your task to develop a method for deriving such propagation rules based on a formal description of the instruction-set architecture.

Tasks

  • Review literature on fault-equivalence collapsing in logical circuits
  • Extract model (e.g., SAT or SMT) of individual instructions from the formal ISA model
  • Adapt and apply fault-collapsing technique from the hardware domain

Further Reading

  • LCTES Conference A
    Data-Flow–Sensitive Fault-Space Pruning for the Injection of Transient Hardware Faults
    Oskar Pusz, Christian Dietrich, Daniel LohmannProceedings of the 2021 ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES '21)ACM Press2021.
    PDF Slides 10.1145/3461648.3463851 [BibTex]