Hello! I am a 4th-year Ph.D. student at Rochester Institute of Technology (RIT) and a member of the High Performance Distributed Systems Lab. I am advised by M. Mustafa Rafique and Bogdan Nicolae, and my research aims to develop system software that optimizes the scheduling and I/O profile of deep learning applications. I enjoy working on simple and complex software projects that span diverse scientific and engineering fields, including computational molecular dynamics, cosmology, and transportation. I advocate for reproducible and open-source software. I have research experience in the following:
Numerical and performance reproducibility of high-performance computing applications
Performance modeling for large-scale distributed systems
Our research paper titled Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering has been accepted at the IEEE HiPC conference in Goa, India, and nominated as a Best Paper Finalist.
Despite significant advances, training deep learning models remains a time-consuming and resource-intensive task. One of the key challenges in this context is the ingestion of the training data, which involves non-trivial overheads: read the training data from a remote repository, apply augmentations and transformations, shuffle the training samples, and assemble them into mini-batches. Despite the introduction of abstractions such as data pipelines that aim to hide such overheads asynchronously, it is often the case that the data ingestion is slower than the training, causing a delay at each training iteration. This problem is further augmented when training multiple deep learning models simultaneously on powerful compute nodes that feature multiple GPUs. In this case, the training data is often reused across different training instances (e.g., in the case of multi-model or ensemble training) or even within the same training instance (e.g., data-parallel training). However, transparent caching solutions (e.g., OS-level POSIX caching) are not suitable to directly mitigate the competition between training instances that reuse the same training data. In this paper, we study the problem of how to minimize the makespan of running two training instances that reuse the same training data. The makespan is subject to a trade-off: if the training instances start at the same time, competition for I/O bandwidth slows down the data pipelines and increases the makespan. If one training instance is staggered, competition is reduced but the makespan increases. We aim to optimize this trade-off by proposing a performance model capable of predicting the makespan based on the staggering between the training instances, which can be used to find the optimal staggering that triggers just enough competition to make optimal use of transparent caching in order to minimize the makespan. Experiments with different combinations of learning models using the same training data demonstrate that (1) staggering is important to minimize the makespan; (2) our performance model is accurate and can predict the optimal staggering in advance based on calibration overhead.
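To make the idea concrete, here is a minimal, self-contained sketch of the approach described above: a toy makespan model evaluated over candidate staggering delays to pick the delay that minimizes the predicted makespan. The model form, parameter values, and function names are illustrative assumptions, not the calibrated performance model from the paper.

```python
# Illustrative sketch only: predict the makespan of two co-located training
# instances as a function of their staggering, then pick the staggering that
# minimizes it. The model and all parameters are toy assumptions.

def toy_makespan(stagger, io=10.0, compute=6.0, epochs=50,
                 contention=2.0, cache_speedup=5.0):
    """Predicted makespan (seconds) of two identical training instances that
    reuse the same dataset, with instance B delayed by `stagger` seconds.

    Toy assumptions: the dataset fits in the OS page cache after one full
    pass; concurrent cold reads slow I/O by `contention` for both instances;
    cached reads are `cache_speedup` times faster than cold reads.
    """
    cold_epoch = max(io, compute)                   # uncontended cold pass
    contended_epoch = max(io * contention, compute)
    warm_epoch = max(io / cache_speedup, compute)

    # Instance A: its first (cold) epoch is contended if B starts during it
    # (simplification: use the uncontended duration as the threshold).
    first_epoch_a = contended_epoch if stagger < cold_epoch else cold_epoch
    cache_warm_at = first_epoch_a                   # dataset cached after A's pass
    finish_a = first_epoch_a + (epochs - 1) * warm_epoch

    # Instance B: epochs started before the cache is warm are contended.
    t = stagger
    for _ in range(epochs):
        t += contended_epoch if t < cache_warm_at else warm_epoch
    finish_b = t

    return max(finish_a, finish_b)


def optimal_stagger(max_delay=60.0, step=1.0, **model_params):
    """Grid-search the staggering delay that minimizes the predicted makespan."""
    candidates = [i * step for i in range(int(max_delay / step) + 1)]
    return min(candidates, key=lambda s: toy_makespan(s, **model_params))


if __name__ == "__main__":
    s = optimal_stagger()
    print(f"predicted optimal staggering: {s:.0f}s "
          f"(makespan {toy_makespan(s):.0f}s vs {toy_makespan(0):.0f}s without staggering)")
```

In this toy setting, the best delay is the one that lets the first instance complete its cold pass over the dataset so that the second instance reads entirely from the warmed cache, which mirrors the trade-off the paper's performance model is designed to optimize.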
@inproceedings{assogba:hal-04343672,
  title      = {{Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering}},
  author     = {Assogba, Kevin and Rafique, M. Mustafa and Nicolae, Bogdan},
  booktitle  = {{HiPC'23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics}},
  pages      = {246--255},
  address    = {Goa, India},
  publisher  = {{IEEE}},
  year       = {2023},
  month      = dec,
  keywords   = {Deep Learning; Caching and Reuse of Training Data; Co-Located Training; Performance Modeling},
  pdf        = {https://ieeexplore.ieee.org/abstract/document/10487070},
  doi        = {10.1109/HiPC58850.2023.00042},
  dimensions = {true},
}
Cluster ’23
PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning
Accurately predicting the training time of deep learning (DL) workloads is critical for optimizing the utilization of data centers and allocating the required cluster resources for completing critical model training tasks before a deadline. The state-of-the-art prediction models, e.g., Ernest and Cherrypick, treat DL workloads as black boxes, and require running the given DL job on a fraction of the dataset. Moreover, they require retraining their prediction models every time a change occurs in the given DL workload. This significantly limits the reusability of prediction models across DL workloads with different deep neural network (DNN) architectures. In this paper, we address this challenge and propose a novel approach where the prediction model is trained only once for a particular dataset type, e.g., ImageNet, thus completely avoiding tedious and costly retraining tasks for predicting the training time of new DL workloads. Our proposed approach, called PredictDDL, provides an end-to-end system for predicting the training time of DL models in distributed settings. PredictDDL leverages Graph HyperNetworks, a class of neural networks that takes computational graphs as input and produces vector representations of their DNNs. PredictDDL is the first prediction system that eliminates the need of retraining a performance prediction model for each new DL workload and maximizes the reuse of the prediction model by requiring running a DL workload only once for training the prediction model. Our extensive evaluation using representative workloads shows that PredictDDL achieves up to 9.8× lower average prediction error and 10.3× lower inference time compared to the state-of-the-art system, i.e., Ernest, on multiple DNN architectures.
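As a rough illustration of the workflow only, the sketch below stands in a hand-rolled graph encoder and linear regressor for PredictDDL's Graph HyperNetwork and prediction head: a computational graph is mapped once to a fixed-size embedding, and the same trained predictor is reused across DNN architectures. The operator names, features, and weights are hypothetical.

```python
# Conceptual sketch: encode a DNN computational graph into a fixed-size vector,
# then predict training time with a regressor trained once per dataset type.
# This hand-rolled encoder is a stand-in for the Graph HyperNetwork, not its API.

from dataclasses import dataclass

OP_TYPES = ["conv", "batchnorm", "relu", "pool", "linear", "add"]

@dataclass
class Node:
    op: str          # operator type, e.g. "conv" (assumed to be in OP_TYPES)
    params: float    # parameter count of this operator (millions)

def encode_graph(nodes):
    """Map a computational graph to a fixed-size embedding:
    per-operator-type counts plus total parameter count."""
    counts = {op: 0 for op in OP_TYPES}
    total_params = 0.0
    for n in nodes:
        counts[n.op] = counts.get(n.op, 0) + 1
        total_params += n.params
    return [counts[op] for op in OP_TYPES] + [total_params]

def predict_epoch_time(embedding, weights, bias):
    """Linear regressor on top of the embedding; the trained predictor is
    reused for new DNN architectures without retraining."""
    return bias + sum(w * x for w, x in zip(weights, embedding))

if __name__ == "__main__":
    # A tiny ResNet-like block as an example computational graph.
    graph = [Node("conv", 0.6), Node("batchnorm", 0.01), Node("relu", 0.0),
             Node("conv", 0.6), Node("batchnorm", 0.01), Node("add", 0.0)]
    # Hypothetical weights obtained from one-time training of the predictor.
    weights = [2.0, 0.1, 0.05, 0.2, 1.5, 0.1, 3.0]
    embedding = encode_graph(graph)
    print(f"predicted epoch time: {predict_epoch_time(embedding, weights, 0.5):.2f}s")
```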
@inproceedings{assogba2023predictddl,
  title      = {PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning},
  author     = {Assogba*, Kevin and Lima*, Eduardo and Rafique, M. Mustafa and Kwon, Minseok},
  booktitle  = {2023 IEEE International Conference on Cluster Computing (CLUSTER)},
  pages      = {13--24},
  year       = {2023},
  month      = oct,
  issn       = {2168-9253},
  doi        = {10.1109/CLUSTER52292.2023.00009},
  pdf        = {https://ieeexplore.ieee.org/abstract/document/10319964},
  dimensions = {true},
}
SC-W ’23
Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics
Kevin Assogba, Bogdan Nicolae, Hubertus Van Dam, and M. Mustafa Rafique
In Proceedings of the SC’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Oct 2023
High-performance computing applications are increasingly integrating checkpointing libraries for reproducibility analytics. However, capturing an entire checkpoint history for reproducibility study faces the challenges of high-frequency checkpointing across thousands of processes. As a result, the runtime overhead affects application performance and intermediate results when interleaving is introduced during floating-point calculations. In this paper, we extend asynchronous multi-level checkpoint/restart to study the intermediate results generated from scientific workflows. We present an initial prototype of a framework that captures, caches, and compares checkpoint histories from different runs of a scientific application executed using identical input files. We also study the impact of our proposed approach by evaluating the reproducibility of classical molecular dynamics simulations executed using the NWChem software. Experiment results show that our proposed solution improves the checkpoint write bandwidth when capturing checkpoints for reproducibility analysis by a minimum of 30x and up to 211x compared to the default checkpointing approach in NWChem.
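The sketch below illustrates the comparison step under assumed data structures rather than the framework's actual API: checkpoint histories from two runs launched with identical inputs are aligned step by step, and intermediate values that diverge beyond a floating-point tolerance (e.g., due to non-deterministic interleaving of reductions) are flagged.

```python
# Illustrative sketch (not the framework's API): compare the checkpoint
# histories of two runs of the same simulation launched with identical inputs.

import math

def compare_histories(run_a, run_b, rel_tol=1e-9, abs_tol=0.0):
    """run_a / run_b: lists of checkpoints, each a dict of named scalar fields
    (e.g., energies) captured at the same simulation steps."""
    divergences = []
    for step, (ckpt_a, ckpt_b) in enumerate(zip(run_a, run_b)):
        for field in ckpt_a:
            a, b = ckpt_a[field], ckpt_b[field]
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                divergences.append((step, field, a, b))
    return divergences

if __name__ == "__main__":
    # Hypothetical intermediate energies captured at two checkpoint steps.
    run_a = [{"total_energy": -76.026780}, {"total_energy": -76.026792}]
    run_b = [{"total_energy": -76.026780}, {"total_energy": -76.026795}]
    for step, field, a, b in compare_histories(run_a, run_b, rel_tol=1e-8):
        print(f"step {step}: {field} differs ({a!r} vs {b!r})")
```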
@inproceedings{assogba2023asynchronous,
  title      = {Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics},
  author     = {Assogba, Kevin and Nicolae, Bogdan and Van Dam, Hubertus and Rafique, M. Mustafa},
  booktitle  = {Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis},
  pages      = {1748--1756},
  year       = {2023},
  doi        = {10.1145/3624062.3624256},
  pdf        = {https://hal.science/hal-04343694/document},
  dimensions = {true},
}
e-Science ’23
Building the I (Interoperability) of FAIR for performance reproducibility of large-scale composable workflows in RECUP
Bogdan Nicolae, Tanzima Z Islam, Robert Ross, and 8 more authors
In 2023 IEEE 19th International Conference on e-Science (e-Science), Oct 2023
Scientific computing communities increasingly run their experiments using complex data- and compute-intensive workflows that utilize distributed and heterogeneous architectures targeting numerical simulations and machine learning, often executed on the Department of Energy Leadership Computing Facilities (LCFs). We argue that a principled, systematic approach to implementing FAIR principles at scale, including fine-grained metadata extraction and organization, can help with the numerous challenges to performance reproducibility posed by such workflows. We extract workflow patterns, propose a set of tools to manage the entire life cycle of performance metadata, and aggregate them in an HPC-ready framework for reproducibility (RECUP). We describe the challenges in making these tools interoperable, preliminary work, and lessons learned from this experiment.
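As a loose illustration of what FAIR-friendly performance metadata can look like in practice, the sketch below captures a workflow task's performance record with a persistent identifier, a declared schema, and input provenance, serialized in an open format. The field names and structure are assumptions for illustration, not the RECUP schema.

```python
# Hedged sketch: one way to record a workflow task's performance metadata so it
# is findable, accessible, interoperable, and reusable. Field names are assumed.

import json, time, uuid, platform

def capture_performance_record(workflow, task, metrics, inputs):
    return {
        "id": str(uuid.uuid4()),                 # Findable: persistent identifier
        "schema": "perf-record/0.1",             # Interoperable: declared schema
        "workflow": workflow,
        "task": task,
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "platform": {"hostname": platform.node(),
                     "python": platform.python_version()},
        "inputs": inputs,                        # Reusable: provenance of inputs
        "metrics": metrics,                      # e.g., runtime, I/O volume
    }

if __name__ == "__main__":
    record = capture_performance_record(
        workflow="molecular-dynamics-ensemble",
        task="equilibration",
        metrics={"runtime_s": 412.7, "bytes_written": 3.2e9},
        inputs={"config": "inputs/equil.nw", "nodes": 8})
    print(json.dumps(record, indent=2))          # Accessible: open, readable format
```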
@inproceedings{nicolae2023building,
  title        = {Building the I (Interoperability) of FAIR for performance reproducibility of large-scale composable workflows in RECUP},
  author       = {Nicolae, Bogdan and Islam, Tanzima Z and Ross, Robert and Van Dam, Huub and Assogba, Kevin and Shpilker, Polina and Titov, Mikhail and Turilli, Matteo and Wang, Tianle and Kilic, Ozgur O and others},
  booktitle    = {2023 IEEE 19th International Conference on e-Science (e-Science)},
  pages        = {1--7},
  year         = {2023},
  organization = {IEEE},
  doi          = {10.1109/e-Science58273.2023.10254808},
  pdf          = {https://ieeexplore.ieee.org/abstract/document/10254808},
  dimensions   = {true},
}