Meeting Notes 09-27-2021

Minutes of the SBI-FAIR September 27, 2021 Meeting

Present: Kamil Iskra, Vikram Jadhao, Geoffrey Fox, Deborah Penchoff, Xiaodong Yu, Jack Dongarra, Shantenu Jha, Piotr Luszczek, Pete Beckman, Baixi Sun, Gregor von Laszewski

Updates

Indiana/Virginia

Vikram has a new surrogate and is finalizing a paper on it. He will talk to Shantenu soon.

Rutgers

Shantenu was affected by the hurricane.

  1. Developing three-layer simulations with a surrogate at each level
  2. ML-driven HPC motifs/patterns identified in research, to be reported at the November meeting
    1. The DeepDriveMD ensemble is one example
    2. Climate science simulations give surrogates that select the best simulation
    3. A link with observations is seen in climate, materials, and biomolecular science

University of Tennessee

  1. Workshop April 4-7, 2022 at UTK
  2. Performance surrogate paper submitted to IPDPS; excellent speedup, but not 2 billion
  3. Work on FAIR ontologies will resume after this paper

Argonne

  1. Xiaodong Yu introduced their GPU scheduling work and an investigation of the scalability of surrogate model training
  2. Baixi Sun gave a detailed presentation on distributed training of PtychoNN
    1. Utilized the Horovod framework on the PtychoNN model (see the Horovod sketch after this list)
    2. Tested Horovod performance for different numbers of GPUs on single and multiple nodes using ring all-reduce
    3. Tried the MirroredStrategy framework on the PtychoNN model (see the MirroredStrategy sketch after this list)
    4. Tested its performance for different numbers of GPUs on a single node
    5. Debugging the MirroredStrategy setup for distributed training
    6. Presented performance numbers with MNIST and PtychoNN
    7. Updated the code on our GitLab repository and the wiki documentation
  3. Links for more details:
    1. Official Horovod documentation: Horovod with Keras — Horovod documentation
    2. ThetaGPU Horovod tutorial: Distributed training on ThetaGPU using data parallelism | Argonne Leadership Computing Facility
    3. Official MirroredStrategy documentation: Multi-GPU and distributed training (section "Single-host, multi-device synchronous training")
    4. The code run on ThetaGPU is in our private GitLab repository: https://gitlab.com/SBI-HPC/benchmark_suite/-/tree/main/ptychography (the MirroredStrategy code is still being debugged, so the latest stable version has not been committed yet)
    5. Guidance for running this code on ThetaGPU is in the GitLab wiki: https://gitlab.com/SBI-HPC/benchmark_suite/-/wikis/PtychoNN-Distributed-Training-on-ThetaGPU
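The Horovod item above (2.1) refers to the data-parallel setup described in the linked documentation. The following is a minimal sketch of that pattern, assuming a small stand-in CNN trained on MNIST (as in item 2.6) rather than the actual PtychoNN architecture, which lives in the GitLab repository linked above:

```python
# Minimal Horovod + Keras data-parallel training sketch.
# The model here is a hypothetical stand-in, not PtychoNN.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each worker process to one GPU.
hvd.init()
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder model (MNIST-sized inputs).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across GPUs with ring all-reduce.
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Each worker trains on its own shard of the data, selected by rank.
model.fit(x_train[hvd.rank()::hvd.size()], y_train[hvd.rank()::hvd.size()],
          batch_size=64, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Such a script would be launched with, for example, `horovodrun -np 8 python train.py` to run across eight GPUs; the ThetaGPU tutorial linked above covers the site-specific launch details.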
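The MirroredStrategy item (2.3) follows the single-host, multi-device synchronous pattern from the Keras documentation linked above. A minimal sketch, again with a hypothetical stand-in model rather than PtychoNN:

```python
# Minimal single-node multi-GPU training sketch with
# tf.distribute.MirroredStrategy. The model is a stand-in, not PtychoNN.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f'Number of replicas: {strategy.num_replicas_in_sync}')

# Model and optimizer must be created inside the strategy scope so that
# their variables are mirrored across all visible GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu',
                               input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0

# The global batch is split evenly across replicas; gradients are
# all-reduced synchronously after each step.
model.fit(x_train, y_train,
          batch_size=64 * strategy.num_replicas_in_sync,
          epochs=2)
```

Unlike Horovod, this runs as a single process, so no special launcher is needed; it only scales within one node, which matches the single-node tests in item 2.4.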