Journal of Systems Engineering and Electronics ›› 2011, Vol. 22 ›› Issue (2): 238-246. doi: 10.3969/j.issn.1004-4132.2011.02.009

• SYSTEMS ENGINEERING •

Bayesian serial revision method for RLLC cluster systems failure prediction

Qiang Liu1,2, Guang Jin1,*, Jinglun Zhou1, Quan Sun1,3, and Min Xi1,4   

  1. College of Information System and Management, National University of Defense Technology, Changsha 410073, P. R. China;
  2. School of Computer Science, McGill University, Montreal H3A2A7, Canada;
  3. School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta 30332-0205, USA;
  4. Department of Computer Science, Xi’an Jiaotong University, Xi’an 710049, P. R. China
  • Online: 2011-04-19 Published: 2010-01-03

Abstract:

Failure prediction plays an important role in many tasks,
such as optimal resource management in large-scale systems.
However, accurately predicting the number of failures in repairable
large-scale long-running computing (RLLC) systems is a challenge
because of their repairability and large scale. To address this
challenge, a general Bayesian serial revision prediction method
based on the Bootstrap approach and the moving average approach is
put forward, which can accurately predict the number of failures.
To demonstrate the performance gains of our method, extensive
experiments are carried out on data from the Los Alamos National
Laboratory (LANL) cluster, which is a typical RLLC system. The
experimental results show that the prediction accuracy of our
method is 80.2%, an improvement of 4% over some typical methods.
Finally, the managerial implications of the models are discussed.
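
The abstract names two building blocks, the moving average approach and the Bootstrap approach, without giving details. The Python sketch below shows only how those two generic techniques might be combined to forecast a failure count and attach a resampling-based interval to it. It is not the paper's Bayesian serial revision method, which the abstract does not specify. The function names, the window size, and the monthly failure counts are all hypothetical and are not taken from the LANL data.

```python
import random
from statistics import mean


def moving_average_forecast(counts, window):
    """Forecast the next failure count as the mean of the last `window` counts."""
    if window <= 0 or window > len(counts):
        raise ValueError("window must be between 1 and len(counts)")
    return mean(counts[-window:])


def bootstrap_interval(counts, window, n_resamples=2000, alpha=0.1, seed=0):
    """Percentile bootstrap interval for the moving-average forecast.

    Resamples the last `window` counts with replacement and takes
    empirical quantiles of the resampled means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    recent = counts[-window:]
    estimates = sorted(
        mean(rng.choices(recent, k=len(recent))) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical monthly failure counts (illustrative only, not LANL data).
monthly_failures = [12, 15, 9, 14, 11, 13, 10, 16]

forecast = moving_average_forecast(monthly_failures, window=4)
low, high = bootstrap_interval(monthly_failures, window=4)
print(f"forecast={forecast}, 90% bootstrap interval=({low}, {high})")
```

A short window reacts quickly to recent changes in the failure rate but gives a noisier forecast and a wider bootstrap interval; the window size and the interval level here are arbitrary choices for the illustration.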