Publications
Below is a list of workshop, conference, and journal papers published in the past 5 years. A complete list of my publications is available on Google Scholar.
* denotes equal contribution
2024
- [Middleware ’24] Towards Affordable Reproducibility Using Scalable Capture and Comparison of Intermediate Multi-Run Results. Nigel Tan, Kevin Assogba, Jay Asworth, and 5 more authors. In 25th ACM/IFIP International Middleware Conference, 2024.
@inproceedings{tan2024statediff,
  title        = {Towards Affordable Reproducibility Using Scalable Capture and Comparison of Intermediate Multi-Run Results},
  author       = {Tan, Nigel and Assogba, Kevin and Asworth, Jay and Bogale, Befikir and Rafique, M. Mustafa and Cappello, Franck and Taufer, Michela and Nicolae, Bogdan},
  booktitle    = {25th ACM/IFIP International Middleware Conference},
  year         = {2024},
  organization = {ACM/IFIP},
  dimensions   = {true}
}
2023
- [HiPC ’23] Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering. Kevin Assogba, M. Mustafa Rafique, and Bogdan Nicolae. In HiPC ’23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2023.
Despite significant advances, training deep learning models remains a time-consuming and resource-intensive task. One of the key challenges in this context is the ingestion of the training data, which involves non-trivial overheads: read the training data from a remote repository, apply augmentations and transformations, shuffle the training samples, and assemble them into mini-batches. Despite the introduction of abstractions such as data pipelines that aim to hide such overheads asynchronously, it is often the case that the data ingestion is slower than the training, causing a delay at each training iteration. This problem is further augmented when training multiple deep learning models simultaneously on powerful compute nodes that feature multiple GPUs. In this case, the training data is often reused across different training instances (e.g., in the case of multi-model or ensemble training) or even within the same training instance (e.g., data-parallel training). However, transparent caching solutions (e.g., OS-level POSIX caching) are not suitable to directly mitigate the competition between training instances that reuse the same training data. In this paper, we study the problem of how to minimize the makespan of running two training instances that reuse the same training data. The makespan is subject to a trade-off: if the training instances start at the same time, competition for I/O bandwidth slows down the data pipelines and increases the makespan. If one training instance is staggered, competition is reduced but the makespan increases. We aim to optimize this trade-off by proposing a performance model capable of predicting the makespan based on the staggering between the training instances, which can be used to find the optimal staggering that triggers just enough competition to make optimal use of transparent caching in order to minimize the makespan. Experiments with different combinations of learning models using the same training data demonstrate that (1) staggering is important to minimize the makespan; (2) our performance model is accurate and can predict the optimal staggering in advance based on calibration overhead.
@inproceedings{assogba:hal-04343672,
  title      = {{Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering}},
  author     = {Assogba, Kevin and Rafique, M. Mustafa and Nicolae, Bogdan},
  booktitle  = {{HIPC'23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics}},
  pages      = {246-255},
  address    = {Goa, India},
  publisher  = {{IEEE}},
  year       = {2023},
  month      = dec,
  keywords   = {Deep Learning; Caching and Reuse of Training Data; Co-Located Training; Performance Modeling},
  pdf        = {https://ieeexplore.ieee.org/abstract/document/10487070},
  doi        = {10.1109/HiPC58850.2023.00042},
  dimensions = {true}
}
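For readers who want a feel for the staggering trade-off described in the abstract above, here is a deliberately simplified toy model in Python. It is not the performance model from the paper; the epoch counts, I/O times, cache behavior, and contention penalty are made-up placeholders that only illustrate why the makespan first shrinks and then grows as the second training instance is delayed.

```python
# Toy illustration (not the paper's model) of the staggering trade-off:
# two training instances share the same dataset; the second instance can
# reuse whatever the first one has already pulled into a shared cache.
# All constants below are invented placeholders.

def makespan(stagger_s,
             epochs=10,
             compute_s_per_epoch=60.0,      # GPU compute time per epoch
             io_s_per_epoch_cold=80.0,      # data-pipeline time on a cache miss
             io_s_per_epoch_warm=20.0,      # data-pipeline time on a cache hit
             contention_penalty=1.5):       # slowdown when both read cold data
    """Estimate the makespan of two co-located training instances.

    If both start together (stagger_s == 0) they compete for I/O bandwidth
    and both pay the contention penalty. If the second instance is staggered,
    it increasingly hits data the first instance already cached, but its own
    finish time shifts later. The optimum lies somewhere in between.
    """
    # Fraction of the second instance's reads served from cache grows with the stagger.
    warm_fraction = min(1.0, stagger_s / (epochs * io_s_per_epoch_cold))

    # First instance: contention only while the second one is also reading cold data.
    cold_overlap = 1.0 - warm_fraction
    io_first = epochs * io_s_per_epoch_cold * (1 + (contention_penalty - 1) * cold_overlap)
    t_first = max(io_first, epochs * compute_s_per_epoch)

    # Second instance: mix of warm and cold reads, shifted by the stagger.
    io_second = epochs * (warm_fraction * io_s_per_epoch_warm
                          + (1 - warm_fraction) * io_s_per_epoch_cold * contention_penalty)
    t_second = stagger_s + max(io_second, epochs * compute_s_per_epoch)

    return max(t_first, t_second)


if __name__ == "__main__":
    best = min(range(0, 601, 30), key=makespan)
    print(f"best stagger ~{best}s, makespan ~{makespan(best):.0f}s")
```

In the paper, the role of these constants is played by quantities obtained from a short calibration phase, and the resulting model is used to pick the stagger before launching the co-located training instances.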
- [Cluster ’23] PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning. Kevin Assogba*, Eduardo Lima*, M. Mustafa Rafique, and 1 more author. In 2023 IEEE International Conference on Cluster Computing (CLUSTER), Oct 2023.
Accurately predicting the training time of deep learning (DL) workloads is critical for optimizing the utilization of data centers and allocating the required cluster resources for completing critical model training tasks before a deadline. The state-of-the-art prediction models, e.g., Ernest and Cherrypick, treat DL workloads as black boxes, and require running the given DL job on a fraction of the dataset. Moreover, they require retraining their prediction models every time a change occurs in the given DL workload. This significantly limits the reusability of prediction models across DL workloads with different deep neural network (DNN) architectures. In this paper, we address this challenge and propose a novel approach where the prediction model is trained only once for a particular dataset type, e.g., ImageNet, thus completely avoiding tedious and costly retraining tasks for predicting the training time of new DL workloads. Our proposed approach, called PredictDDL, provides an end-to-end system for predicting the training time of DL models in distributed settings. PredictDDL leverages Graph HyperNetworks, a class of neural networks that takes computational graphs as input and produces vector representations of their DNNs. PredictDDL is the first prediction system that eliminates the need of retraining a performance prediction model for each new DL workload and maximizes the reuse of the prediction model by requiring running a DL workload only once for training the prediction model. Our extensive evaluation using representative workloads shows that PredictDDL achieves up to 9.8× lower average prediction error and 10.3× lower inference time compared to the state-of-the-art system, i.e., Ernest, on multiple DNN architectures.
@inproceedings{assogba2023predictddl,
  author     = {Assogba*, Kevin and Lima*, Eduardo and Rafique, M. Mustafa and Kwon, Minseok},
  booktitle  = {2023 IEEE International Conference on Cluster Computing (CLUSTER)},
  title      = {PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning},
  year       = {2023},
  pages      = {13-24},
  doi        = {10.1109/CLUSTER52292.2023.00009},
  issn       = {2168-9253},
  month      = oct,
  pdf        = {https://ieeexplore.ieee.org/abstract/document/10319964},
  dimensions = {true}
}
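As a rough illustration of the recipe PredictDDL builds on, turning a DNN's computational graph into a fixed-size representation and regressing training time from it, here is a small Python sketch. The hand-crafted feature vector and random-forest regressor below are stand-ins I chose for brevity, not the Graph HyperNetwork used in the paper, and the profiled numbers are invented.

```python
# Illustrative-only sketch: embed a computational graph into a fixed-size
# vector, then train a regressor that maps (embedding, cluster config) to
# training time. NOT the Graph HyperNetwork approach from the paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def embed_graph(ops):
    """ops: list of (op_type, param_count) pairs describing a computational graph."""
    op_types = ["conv", "dense", "attention", "norm", "pool"]
    counts = np.array([sum(1 for t, _ in ops if t == k) for k in op_types], dtype=float)
    params = np.array([sum(p for t, p in ops if t == k) for k in op_types], dtype=float)
    return np.concatenate([counts, np.log1p(params)])

# Hypothetical profiled data: (graph, #GPUs, batch size) -> measured epoch time (s).
profiles = [
    ([("conv", 2e6), ("dense", 5e5), ("pool", 0)], 4, 256, 38.0),
    ([("conv", 6e6), ("dense", 1e6), ("norm", 1e4)], 8, 512, 61.0),
    ([("attention", 1e7), ("dense", 4e6), ("norm", 2e4)], 8, 256, 95.0),
]
X = np.array([np.concatenate([embed_graph(g), [gpus, batch]]) for g, gpus, batch, _ in profiles])
y = np.array([t for *_, t in profiles])

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predict for an unseen architecture without running it on the cluster.
new_graph = [("conv", 4e6), ("dense", 8e5), ("pool", 0)]
query = np.concatenate([embed_graph(new_graph), [8, 256]]).reshape(1, -1)
print(f"predicted epoch time: {model.predict(query)[0]:.1f}s")
```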
- [SC-W ’23] Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics. Kevin Assogba, Bogdan Nicolae, Hubertus Van Dam, and 1 more author. In Proceedings of the SC’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Oct 2023.
High-performance computing applications are increasingly integrating checkpointing libraries for reproducibility analytics. However, capturing an entire checkpoint history for reproducibility study faces the challenges of high-frequency checkpointing across thousands of processes. As a result, the runtime overhead affects application performance and intermediate results when interleaving is introduced during floating-point calculations. In this paper, we extend asynchronous multi-level checkpoint/restart to study the intermediate results generated from scientific workflows. We present an initial prototype of a framework that captures, caches, and compares checkpoint histories from different runs of a scientific application executed using identical input files. We also study the impact of our proposed approach by evaluating the reproducibility of classical molecular dynamics simulations executed using the NWChem software. Experiment results show that our proposed solution improves the checkpoint write bandwidth when capturing checkpoints for reproducibility analysis by a minimum of 30x and up to 211x compared to the default checkpointing approach in NWChem.
@inproceedings{assogba2023asynchronous,
  title      = {Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics},
  author     = {Assogba, Kevin and Nicolae, Bogdan and Van Dam, Hubertus and Rafique, M. Mustafa},
  booktitle  = {Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis},
  pages      = {1748--1756},
  year       = {2023},
  doi        = {10.1145/3624062.3624256},
  pdf        = {https://hal.science/hal-04343694/document},
  dimensions = {true}
}
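The core comparison step, lining up intermediate checkpoints from two runs of the same input files and flagging where floating-point results diverge, can be sketched in a few lines. This is only a minimal illustration under an assumed one-.npy-file-per-checkpoint layout, not the framework described in the paper, which captures and caches checkpoint histories asynchronously at scale.

```python
# Minimal sketch (not the paper's framework) of comparing checkpoint histories
# from two runs of the same simulation: load matching checkpoints and report
# where intermediate floating-point results start to diverge.
# The ckpt_*.npy layout and tolerance values are assumptions.
from pathlib import Path
import numpy as np

def compare_runs(run_a: Path, run_b: Path, rtol=1e-12, atol=0.0):
    divergent = []
    for ckpt_a in sorted(run_a.glob("ckpt_*.npy")):
        ckpt_b = run_b / ckpt_a.name
        if not ckpt_b.exists():
            continue                      # checkpoint missing from the other run
        a, b = np.load(ckpt_a), np.load(ckpt_b)
        if not np.allclose(a, b, rtol=rtol, atol=atol):
            divergent.append((ckpt_a.name, float(np.max(np.abs(a - b)))))
    return divergent

if __name__ == "__main__":
    for name, err in compare_runs(Path("run1"), Path("run2")):
        print(f"{name}: max abs difference {err:.3e}")
```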
- [e-Science ’23] Building the I (Interoperability) of FAIR for performance reproducibility of large-scale composable workflows in RECUP. Bogdan Nicolae, Tanzima Z Islam, Robert Ross, and 8 more authors. In 2023 IEEE 19th International Conference on e-Science (e-Science), Oct 2023.
Scientific computing communities increasingly run their experiments using complex data- and compute-intensive workflows that utilize distributed and heterogeneous architectures targeting numerical simulations and machine learning, often executed on the Department of Energy Leadership Computing Facilities (LCFs). We argue that a principled, systematic approach to implementing FAIR principles at scale, including fine-grained metadata extraction and organization, can help with the numerous challenges to performance reproducibility posed by such workflows. We extract workflow patterns, propose a set of tools to manage the entire life cycle of performance metadata, and aggregate them in an HPC-ready framework for reproducibility (RECUP). We describe the challenges in making these tools interoperable, preliminary work, and lessons learned from this experiment.
@inproceedings{nicolae2023building,
  title        = {Building the I (Interoperability) of FAIR for performance reproducibility of large-scale composable workflows in RECUP},
  author       = {Nicolae, Bogdan and Islam, Tanzima Z and Ross, Robert and Van Dam, Huub and Assogba, Kevin and Shpilker, Polina and Titov, Mikhail and Turilli, Matteo and Wang, Tianle and Kilic, Ozgur O and others},
  booktitle    = {2023 IEEE 19th International Conference on e-Science (e-Science)},
  pages        = {1--7},
  year         = {2023},
  organization = {IEEE},
  doi          = {10.1109/e-Science58273.2023.10254808},
  pdf          = {https://ieeexplore.ieee.org/abstract/document/10254808},
  dimensions   = {true}
}
2022
- [SC ’22] Canary: Fault-Tolerant FaaS for Stateful Time-Sensitive Applications. Moiz Arif, Kevin Assogba, and M. Mustafa Rafique. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2022.
Function-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful applications have been migrated to FaaS platforms due to their ease of deployment, scalability, and minimal management overhead. However, failures in FaaS have not been thoroughly investigated, thus making these desirable platforms unreliable for guaranteeing function execution and ensuring performance requirements. In this paper, we propose Canary, a highly resilient and fault-tolerant framework for FaaS that mitigates the impact of failures and reduces the overhead of function restart. Canary utilizes replicated container runtimes and application-level checkpoints to reduce application recovery time over FaaS platforms. Our evaluations using representative stateful FaaS applications show that Canary reduces the application recovery time and dollar cost by up to 83% and 12%, respectively over the default retry-based strategy. Moreover, it improves application availability with an additional average execution time and cost overhead of 14% and 8%, respectively, as compared to the ideal failure-free execution.
@inproceedings{arif2022canary,
  author     = {Arif, Moiz and Assogba, Kevin and Rafique, M. Mustafa},
  booktitle  = {SC22: International Conference for High Performance Computing, Networking, Storage and Analysis},
  title      = {Canary: Fault-Tolerant FaaS for Stateful Time-Sensitive Applications},
  year       = {2022},
  pages      = {1-16},
  doi        = {10.1109/SC41404.2022.00046},
  issn       = {2167-4337},
  month      = nov,
  pdf        = {https://ieeexplore.ieee.org/abstract/document/10046074},
  dimensions = {true}
}
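To make the idea of application-level checkpoints in FaaS concrete, here is a hedged sketch of a stateful handler that persists its progress so a retried invocation resumes instead of restarting. It is not Canary's implementation: the local JSON file stands in for whatever external object store a real platform would provide, and the checkpoint interval is arbitrary.

```python
# Hedged sketch of application-level checkpointing for a stateful serverless
# function: persist progress externally so a retried invocation resumes from
# the last checkpoint instead of restarting. NOT Canary's implementation;
# the JSON file is a placeholder for an object-store key.
import json, os

CKPT_PATH = "/tmp/checkpoint_demo.json"    # placeholder for external storage

def load_state():
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH) as f:
            return json.load(f)
    return {"next_item": 0, "partial_sum": 0.0}

def save_state(state):
    with open(CKPT_PATH, "w") as f:
        json.dump(state, f)

def handler(event, context=None):
    """Process a batch of items, checkpointing every few items so a retry
    after a mid-run failure continues where the previous attempt stopped."""
    items = event["items"]
    state = load_state()
    for i in range(state["next_item"], len(items)):
        state["partial_sum"] += items[i] ** 2     # stand-in for real work
        if (i + 1) % 10 == 0:                     # arbitrary checkpoint interval
            state["next_item"] = i + 1
            save_state(state)
    state["next_item"] = len(items)
    save_state(state)
    return {"result": state["partial_sum"]}

if __name__ == "__main__":
    print(handler({"items": list(range(100))}))
```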
- [ICPP ’22] Exploiting CXL-Based Memory for Distributed Deep Learning. Moiz Arif, Kevin Assogba, M. Mustafa Rafique, and 1 more author. In Proceedings of the 51st International Conference on Parallel Processing, 2022.
Deep learning (DL) is being widely used to solve complex problems in scientific applications from diverse domains, such as weather forecasting, medical diagnostics, and fluid dynamics simulation. DL applications consume a large amount of data using large-scale high-performance computing (HPC) systems to train a given model. These workloads have large memory and storage requirements that typically go beyond the limited amount of main memory available on an HPC server. This significantly increases the overall training time as the input training data and model parameters are frequently swapped to slower storage tiers during the training process. In this paper, we use the latest advancements in the memory subsystem, specifically Compute Express Link (CXL), to provide additional memory and fast scratch space for DL workloads to reduce the overall training time while enabling DL jobs to efficiently train models using data that is much larger than the installed system memory. We propose a framework, called DeepMemoryDL, that manages the allocation of additional CXL-based memory, introduces a fast intermediate storage tier, and provides intelligent prefetching and caching mechanisms for DL workloads. We implement and integrate DeepMemoryDL with a popular DL platform, TensorFlow, to show that our approach reduces read and write latencies, improves the overall I/O throughput, and reduces the training time. Our evaluation shows a performance improvement of up to 34% and 27% compared to the default TensorFlow platform and CXL-based memory expansion approaches, respectively.
@inproceedings{arif2022cxl,
  author     = {Arif, Moiz and Assogba, Kevin and Rafique, M. Mustafa and Vazhkudai, Sudharshan},
  title      = {Exploiting CXL-Based Memory for Distributed Deep Learning},
  year       = {2022},
  isbn       = {9781450397339},
  publisher  = {Association for Computing Machinery},
  address    = {New York, NY, USA},
  doi        = {10.1145/3545008.3545054},
  booktitle  = {Proceedings of the 51st International Conference on Parallel Processing},
  articleno  = {19},
  numpages   = {11},
  keywords   = {Caching, Deep Learning, TensorFlow, Data Pipeline, Prefetching},
  location   = {Bordeaux, France},
  series     = {ICPP '22},
  pdf        = {https://doi.org/10.1145/3545008.3545054},
  dimensions = {true}
}
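A toy two-tier read path helps illustrate the caching and prefetching idea: a small "local DRAM" cache in front of a larger, slower expansion tier, with a few samples prefetched ahead of each demand access. This is not the DeepMemoryDL API; the tier sizes, LRU policy, and prefetch depth are assumptions for the example.

```python
# Toy illustration (not the DeepMemoryDL API) of a two-tier read path:
# a small "local DRAM" cache backed by a larger, slower expansion tier,
# with sequential prefetch of the next few samples.
from collections import OrderedDict

class TieredLoader:
    def __init__(self, backing_store, dram_capacity=64, prefetch_depth=4):
        self.backing = backing_store            # e.g. dict: index -> sample
        self.dram = OrderedDict()               # LRU cache, the "fast" tier
        self.capacity = dram_capacity
        self.prefetch_depth = prefetch_depth

    def _admit(self, idx):
        if idx in self.dram:
            self.dram.move_to_end(idx)          # refresh LRU position on a hit
            return
        if len(self.dram) >= self.capacity:
            self.dram.popitem(last=False)       # evict least-recently-used entry
        self.dram[idx] = self.backing[idx]      # "slow" fetch from expansion tier

    def get(self, idx):
        self._admit(idx)                        # demand fetch (hit or miss)
        sample = self.dram[idx]
        for ahead in range(1, self.prefetch_depth + 1):   # sequential prefetch
            if idx + ahead in self.backing:
                self._admit(idx + ahead)
        return sample

# Usage: iterate part of an epoch; after warm-up most accesses hit the fast tier.
store = {i: f"sample-{i}" for i in range(1000)}
loader = TieredLoader(store)
batch = [loader.get(i) for i in range(32)]
print(batch[:3])
```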
- [CCGrid ’22] On Realizing Efficient Deep Learning Using Serverless Computing. Kevin Assogba, Moiz Arif, M. Mustafa Rafique, and 1 more author. In 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), May 2022.
Serverless computing is gaining rapid popularity as it enables quick application deployment and seamless application scaling without managing complex computing resources. Recently, it has been explored for running data-intensive, e.g., deep learning (DL), workloads for improving application performance and reducing execution cost. However, serverless computing imposes resource-level constraints, specifically fixed memory allocation and short task timeouts, that lead to job failures. In this paper, we address these constraints and develop an effective runtime framework, DiSDeL, that improves the performance of DL jobs by leveraging data splitting techniques, and ensuring that an appropriate amount of memory is allocated to containers for storing application data and a suitable timeout is selected for each job based on its complexity in serverless deployments. We implement our approach using Apache OpenWhisk and TensorFlow platforms and evaluate it using representative DL workloads to show that it eliminates DL job failures and reduces action memory consumption and total training time by up to 44% and 46%, respectively, as compared to a default serverless computing framework. Our evaluation also shows that DiSDeL achieves a performance improvement of up to 29% as compared to a bare-metal TensorFlow environment in a multi-tenant setting.
@inproceedings{assogba2022disdel,
  author     = {Assogba, Kevin and Arif, Moiz and Rafique, M. Mustafa and Nikolopoulos, Dimitrios S.},
  booktitle  = {2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)},
  title      = {On Realizing Efficient Deep Learning Using Serverless Computing},
  year       = {2022},
  pages      = {220-229},
  doi        = {10.1109/CCGrid54584.2022.00031},
  month      = may,
  pdf        = {https://ieeexplore.ieee.org/abstract/document/9826021},
  dimensions = {true}
}
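The kind of planning DiSDeL automates, sizing data shards to a function's memory limit and choosing a timeout from an estimated processing rate, can be sketched as a small helper. The memory headroom, throughput estimate, and platform limits below are assumptions for illustration, not DiSDeL's actual policy.

```python
# Small sketch of serverless DL job planning: split a dataset into shards that
# fit a function's memory limit and pick a per-shard timeout from an estimated
# processing rate. All numbers are assumptions, not DiSDeL's policy.
import math

def plan_invocations(num_samples, bytes_per_sample,
                     memory_limit_mb=2048,
                     memory_headroom=0.5,         # leave room for model + runtime
                     est_samples_per_sec=120.0,   # assumed processing throughput
                     timeout_safety=2.0,          # safety factor on the estimate
                     platform_max_timeout_s=900):
    usable_bytes = memory_limit_mb * 1024 * 1024 * memory_headroom
    samples_per_shard = max(1, int(usable_bytes // bytes_per_sample))
    num_shards = math.ceil(num_samples / samples_per_shard)
    timeout_s = min(platform_max_timeout_s,
                    math.ceil(samples_per_shard / est_samples_per_sec * timeout_safety))
    shards = [(i * samples_per_shard, min((i + 1) * samples_per_shard, num_samples))
              for i in range(num_shards)]
    return shards, timeout_s

shards, timeout = plan_invocations(num_samples=50_000, bytes_per_sample=600_000)
print(f"{len(shards)} shards, {timeout}s timeout each; first shard: {shards[0]}")
```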
2019
- [JCLEPRO] Multi-depot green vehicle routing problem with shared transportation resource: Integration of time-dependent speed and piecewise penalty cost. Yong Wang, Kevin Assogba, Jianxin Fan, and 3 more authors. Journal of Cleaner Production, May 2019.
@article{wang2019multi,
  title      = {Multi-depot green vehicle routing problem with shared transportation resource: Integration of time-dependent speed and piecewise penalty cost},
  author     = {Wang, Yong and Assogba, Kevin and Fan, Jianxin and Xu, Maozeng and Liu, Yong and Wang, Haizhong},
  journal    = {Journal of Cleaner Production},
  volume     = {232},
  pages      = {12--29},
  year       = {2019},
  publisher  = {Elsevier},
  doi        = {10.1016/j.jclepro.2019.05.344},
  pdf        = {https://www.sciencedirect.com/science/article/pii/S0959652619318864},
  dimensions = {true}
}