Declarative Resilience

Hamza Omar; Qingchuan Shi; Masab Ahmad; Halit Dogan; Omer Khan

doi:10.1145/3210559

What is it about?

It has been shown that applications and algorithms can be divided into two regions, and that the two regions suffer different consequences when subjected to soft errors (transient faults caused by radiation). The first region is called crucial because it determines the correct functionality and execution of the application. If a soft error strikes this region, there is a high probability that the application will crash or malfunction. The second region determines how accurate the application's output is. A soft error in this region affects only the output accuracy, so we call it non-crucial. This work proposes to apply resilience protection selectively: the crucial part of the application runs with protection, and the non-crucial regions run in a non-resilient mode. Within the non-resilient mode, accuracy-bounding mechanisms are employed to provide acceptable or strong guarantees on output accuracy. Since this selective application of resilience is declared at the start of execution, we name this novel method "declarative resilience". Declarative resilience achieves better performance than state-of-the-art resilience schemes while still offering acceptable output accuracy guarantees.
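As a rough software illustration of the crucial/non-crucial split described above, the following Python sketch protects a crucial step by executing it twice and comparing the results, and runs a non-crucial step once with its value clamped to a declared range. The function names (`run_crucial`, `run_non_crucial`, `mean_brightness`), the duplicate-and-compare protection, and the clamping bounds are all assumptions made for illustration; the paper targets a multicore architecture, and its actual protection and accuracy-bounding mechanisms may differ.

```python
import math
from typing import Callable, Sequence


def run_crucial(step: Callable[[], int], max_retries: int = 3) -> int:
    """Resilient mode (assumed scheme): run the step twice, accept only a match."""
    for _ in range(max_retries):
        first, second = step(), step()
        if first == second:
            return first
    raise RuntimeError("crucial step produced inconsistent results")


def run_non_crucial(step: Callable[[], float], low: float, high: float) -> float:
    """Non-resilient mode: run once, then bound the result to [low, high]."""
    value = step()
    if math.isnan(value):
        # A corrupted value that is not a number falls back to the lower bound.
        return low
    return min(max(value, low), high)


def mean_brightness(pixels: Sequence[float]) -> float:
    """Hypothetical workload: control flow is crucial, pixel values are not."""
    # Crucial: the loop bound decides whether the program runs correctly at all.
    count = run_crucial(lambda: len(pixels))
    if count == 0:
        return 0.0
    total = 0.0
    for i in range(count):
        # Non-crucial: a corrupted pixel only perturbs the final average.
        total += run_non_crucial(lambda: pixels[i], 0.0, 255.0)
    return total / count
```

In this sketch a corrupted pixel value such as `1e9` is bounded to 255.0 before it reaches the average, so the error in the output stays limited, while a disagreement in the loop bound triggers re-execution instead of being tolerated.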

Why is it important?

For safety-critical systems, both resilience and performance requirements vary with the conditions and constraints surrounding the system. Providing strong resilience guarantees at all times may therefore cause the system to miss timing or performance deadlines. Conversely, focusing on performance alone may lead to an unsafe execution environment, since the probability of a soft-error strike is "once per day" according to various studies. Hence, there is a need for a tradeoff that ensures safe and resilient execution while still delivering efficient execution for real-time safety-critical systems.

Perspectives

I hope this article makes what people might think is a boring, slightly abstract area kind of interesting and maybe even exciting. Writing this article was a great pleasure, as its co-authors are people with whom I have had long-standing collaborations. This article also led to various interesting ideas and greater involvement in the domains of resiliency, approximate computing, and security.
Hamza Omar
University of Connecticut

This page is a summary of: Declarative Resilience, ACM Transactions on Embedded Computing Systems, August 2018, ACM (Association for Computing Machinery), DOI: 10.1145/3210559.

Resources

Declarative Resilience TECS'18 -- Figures
The figures added to the paper are available.

Contributors

The following have contributed to this page

Hamza Omar
University of Connecticut

A multicore resilient architecture that introduces a performance-resilience tradeoff
