AI Training Gets Smarter: Researchers Cut Fault Tolerance Overhead to 1%

Researchers have unveiled a fault tolerance mechanism that harnesses idle system resources during distributed training, reducing the computation lost to failures and cutting checkpointing overhead to just 1%.

Research: BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism. Image Credit: lilik ferri yanto / Shutterstock

In deep learning model training, checkpoint-based error recovery is a simple and effective form of fault tolerance. By regularly saving the model's state during training, a job can resume from the most recent checkpoint after a failure, limiting the amount of computation that is lost. However, frequent checkpointing adds runtime overhead that slows training, while infrequent checkpointing risks losing far more training time when a failure does occur.
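To make that trade-off concrete, the sketch below shows interval-based checkpointing in a generic PyTorch-style training loop. The model, the synthetic batches, and the `checkpoint_every` interval are illustrative assumptions rather than details from the paper; the point is simply that a smaller interval pays the serialization cost more often, while a larger one risks losing more work after a failure.

```python
import torch
import torch.nn as nn

# Minimal sketch: interval-based checkpointing in a single-process training
# loop. `checkpoint_every` trades checkpoint overhead against the amount of
# work that would be lost if a failure occurs between checkpoints.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

checkpoint_every = 100  # steps between checkpoints (illustrative value)

for step in range(1, 1001):
    x = torch.randn(32, 128)                 # stand-in for a real batch
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    # A smaller interval shrinks the recomputation window after a failure,
    # but pays the serialization/I/O cost more often.
    if step % checkpoint_every == 0:
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            f"ckpt_step_{step}.pt",
        )
```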

To resolve this dilemma, a research team led by Minyi Guo published their new research on 15 January 2025 in Frontiers of Computer Science, a journal co-published by Higher Education Press and Springer Nature.

The team has proposed a fault tolerance solution capable of perceiving idle system resources during training. By characterizing how idle resources are distributed under different parallel training modes, they designed a scheduling algorithm that coordinates existing computing tasks with the additional fault tolerance work. Building on this foundation, they re-engineered the task scheduler in distributed training to manage both training tasks and fault tolerance tasks, easing the conflict between checkpointing frequency and training efficiency in distributed training scenarios.
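The paper's scheduler is not reproduced here, but the simplified sketch below illustrates the general idea of deferring checkpoint work to idle windows: the training loop signals when a bubble begins, and a background worker snapshots and persists state during that window. The phase durations, the toy `state` dictionary, and the threading-based handoff are all assumptions made for illustration, not the authors' implementation.

```python
import copy
import threading
import time

# Simplified sketch of bubble-aware checkpointing: the training loop signals
# when a pipeline bubble (idle window) begins, and a background worker uses
# that window to snapshot and persist state.
state = {"step": 0, "weights": [0.0] * 4}   # stand-in for model/optimizer state
bubble_started = threading.Event()
stop = threading.Event()

def checkpoint_worker():
    while not stop.is_set():
        if bubble_started.wait(timeout=0.1):
            bubble_started.clear()
            snapshot = copy.deepcopy(state)   # copy while devices sit idle
            # In a real system this write would overlap the bubble and/or
            # later compute; here it is just a placeholder file write.
            with open(f"ckpt_step_{snapshot['step']}.txt", "w") as f:
                f.write(repr(snapshot))

worker = threading.Thread(target=checkpoint_worker, daemon=True)
worker.start()

for step in range(1, 6):
    state["step"] = step
    time.sleep(0.05)          # forward/backward compute for this step
    bubble_started.set()      # pipeline bubble begins: idle window
    time.sleep(0.02)          # bubble duration; checkpoint work overlaps it
    # optimizer update, next step, etc.

stop.set()
worker.join()
```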

The researchers first analyzed the distribution of idle time (bubble time) across computational devices during distributed training, along with the resource usage of checkpoint recording at various steps. They then proposed a fault tolerance mechanism that perceives and exploits this bubble time for checkpoint recording. Finally, they integrated the checkpointing mechanism with elastic training, achieving automated fault tolerance in distributed training.
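As a rough illustration of the recovery side, the sketch below locates the most recent checkpoint on (re)launch and restores model and optimizer state so training can resume without manual intervention. The `ckpt_step_<N>.pt` naming follows the earlier sketch and is an assumption for illustration, not the framework's actual file layout.

```python
import glob
import os
import re

import torch
import torch.nn as nn

# Minimal sketch of automated restart: on (re)launch, find the newest
# checkpoint and resume from it, so a failed or restarted worker can rejoin
# training automatically.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def latest_checkpoint(pattern="ckpt_step_*.pt"):
    """Return the path of the highest-step checkpoint, or None if absent."""
    paths = glob.glob(pattern)
    if not paths:
        return None
    return max(paths,
               key=lambda p: int(re.search(r"(\d+)", os.path.basename(p)).group(1)))

start_step = 0
ckpt_path = latest_checkpoint()
if ckpt_path is not None:
    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"]

print(f"resuming from step {start_step}")
# ...training loop continues from start_step...
```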

To validate the framework's performance in real training scenarios, they ran training tasks with different models and configurations on a training cluster. The experiments indicate that the framework applies effectively to distributed training, and that checkpointing adds only about 1% overhead compared with training without any fault tolerance mechanism, surpassing similar fault tolerance frameworks in efficiency.
