Adaptive Fault-Tolerant Distributed Systems for Real-Time Critical Workloads

Authors

  • Bipinkumar Reddy Algubelli, Independent Researcher, USA
  • Sai Kiran Reddy Malikireddy, Independent Researcher, USA

DOI:

https://doi.org/10.32628/IJSRST2295224

Keywords:

Fault-Tolerant Computing, Distributed Systems, Real-Time Applications, Critical Systems, Redundancy and Replication, Consensus Algorithms, Error Detection and Recovery, Real-Time Scheduling, Scalability, System Reliability

Abstract

Fault tolerance is indispensable for maintaining the dependability and availability of real-time applications in sectors such as healthcare, aerospace, transportation, and industrial control. These systems must run continuously despite equipment breakdowns, network failures, and software glitches. This paper discusses the major concepts, techniques, and open issues in fault-tolerant distributed computing for real-time applications in safety-critical systems. It emphasizes that measures such as redundancy, replication, consensus algorithms, error detection, and recovery strategies maintain system integrity during failures while still meeting real-time constraints. Case studies illustrate how these approaches are applied to build fault-tolerant infrastructures in several sectors where the need for fault-tolerance mechanisms is acute. The paper also presents current challenges, including scalability, performance under failure, and cost-effectiveness. Finally, it considers future work on self-organizing and self-healing frameworks that use machine learning, quantum computing, and related technologies to minimize the effects of faults in real-time distributed systems. The work highlights the role of designing highly reliable, high-availability redundancy models in assuring the safety, speed, and uninterrupted operation of such systems.

Published

2018-07-30

Issue

Section

Research Articles

How to Cite

[1]
Bipinkumar Reddy Algubelli, Sai Kiran Reddy Malikireddy, "Adaptive Fault-Tolerant Distributed Systems for Real-Time Critical Workloads", International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 4, Issue 9, pp. 482-506, July-August 2018. Available at DOI: https://doi.org/10.32628/IJSRST2295224