Meeting Notes 04-25-2022
Meeting Notes from 04-25-2022
Minutes of SBI-FAIR April 25, 2022, Meeting
Present: Kamil Iskra, Deborah Penchoff, Vikram Jadhao, Shantenu Jha, Geoffrey Fox, Xiaodong Yu, Piotr Luszczek, Cade Brown, Baixi Sun, Jack Dongarra
Updates
Virginia
- Discussed continued work on diffusion surrogate with Glazier and Javier Toledo (Edmonton)
- Discussed Fusion surrogate benchmark from Lawrence Livermore
Tennessee
- Cade Brown presented an update
- Discussed Sentinel 3 benchmark based on UK Cloudmask from MLCommons
- Then discussed FAIR Benchmark platform SLIP which is has been extended to become SABATH
- Described report structure
- Model format - how universal is this
- Has done UK cloudmask and looked at TEvol (2 MLCommons benchmarks)
- Deal with Jupyter notebooks with nbconvert
- Add callbacks to model.fit
- How to do FAIR
- Use Json
- Relation to SciML-Bench GitHub - stfc-sciml/sciml-bench: SciML Benchmarking Suite for AI for Science and MLCube from MLCommons
Rutgers
- The IPDPS paper was accepted. This isn’t the final version, but the only publicly available version currently is [2104.04797] Coupling streaming AI and HPC ensembles to achieve 100-1000x faster biomolecular simulations
- Discussed Adversarial autoencoders and use of Alphafold which is expected to do better
- Summit difficult due to IBM containers
- Noted continued study of “2 billion” paper (renamed “Building high accuracy emulators for scientific simulations with deep neural architecture search” https://arxiv.org/pdf/2001.08055.pdf)
- Survey paper
- Noted Proxima by Ian Foster Proxima | Proceedings of the ACM International Conference on Supercomputing
**Indiana **
- Working on scaling recurrent neural net surrogate https://doi.org/10.1088/2632-2153/ac5f60 to more particles
- Ph.D. student JCS Jcs Kadupitiya will defend thesis.
Argonne
- Baixi presentation
- Described distributed training shuffling problem as a graph
- Cost of training has large data loading time
- Studied increasing standard deviation/mean by redistribution over nodes
- Address Imbalance data loading by moving computetasks to other nodes
- Note large compute variance over GPUs even if batch size fixed, which seems surprising – why are some GPUs slow?
Last modified January 26, 2024: add notes (fa4a2ea)