Meeting Notes 02-14-2022
Minutes of SBI-FAIR February 14 2022 Meeting
- Present: Kamil Iskra, Vikram Jadhao, Geoffrey Fox, Deborah Penchoff, Xiaodong Yu, Piotr Luszczek, Cade Brown, Baixi Sun, Gregor von Laszewski
Updates
Tennessee
A new team member Cade Brown gave a fascinating talk CadeBrown-notes-SBI_Schema. Cade Brown is a new ICL student tasked with designing a schema and tooling for installing, running, and benchmarking ML models. He showed examples using MLCommons Science benchmarks CloudMask and STEMDL. There will be a public website from which you can search models, datasets, and results and run examples. He discussed use of JSON rather than XML and the use of Google’s Firebase JSON database tool. There was a discussion of the sustainability of Firebase (as you need to pay) and the use of containers.
Geoffrey noted synergy with MLCommons Science Data working group Science Working Group | MLCommons, the Research Data Alliance and Christine KIrkpatrick
Argonne National Laboratory
Argonne described the continued work on understanding the performance of distributed training already discussed in the last four meetings. Today’s discussion focussed on I/O and included a talk by Baixi which as always was very informative. I/O is a major bottleneck alleviated by caching in either SSD and/or CPU memory. There is a plan for a Parallel I/O and hdf5 paper at SC22. The Hoefler paper at SC21 Clairvoyant prefetching for distributed machine learning I/O | Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis has a simulator that ANL used in this analysis. Shuffling is major difficulty as requires access to all the data. There is a fast local version but it is not as good an algorithm as the usual global shuffle. Currently, dataset is 22 GB but it can increase. \
Indiana
Vikram reported that his surrogate was ready to deploy and that he has received a Summit allocation to support its training. He had met with Shantenu. He sent Cade Brown a couple of links to a repository that hosts their ML surrogate model and the simulation code used to generate datasets to train and test this model. Hopefully, this surrogate can serve as a test model for the system he is building.
https://github.com/softmaterialslab/nanoconfinement-md/tree/master/python
https://github.com/softmaterialslab/nanoconfinement-md/
You can see the surrogate in action, by launching the tool:
https://nanohub.org/tools/nanoconfinement/
Virginia
Progress continues with surrogate for discussion solver. We are writing a second paper on this. Gregor discussed progress with compression.