Meeting Notes 05-23-2022

Minutes of SBI-FAIR May 23, 2022, Meeting

Present: Kamil Iskra, Deborah Penchoff, Vikram Jadhao, Shantenu Jha, Geoffrey Fox, Xiaodong Yu, Piotr Luszczek, Baixi Sun, Gregor von Laszewski

Updates

Virginia

  • Geoffrey described substantial progress with the Science working group of MLCommons, which should reach first base at an ISC BoF on June 1
  • The diffusion equation surrogate work with Javier Toledo and James Glazier is being written up.
  • He also commented on Argonne's shuffling performance and the use of Big Data collective shuffle primitives that work on disk and in memory.

Tennessee

  • Cade Brown is on an internship with NVIDIA
  • Piotr gave a presentation describing the nice progress with the SABATH system introduced by Cade last month.
  • SABATH is now available with two applications
    • Keras MNIST
    • Cloudmask-0, extended from the work of Tony Hey's UK group
  • SABATH now caches data locally
  • TensorBoard visualization support was described; a minimal example follows this list
  • Planned: add PyTorch support to the current TensorFlow support, plus new applications
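
As a rough illustration of the kind of application SABATH wraps (this is not SABATH's actual interface), a minimal Keras MNIST run with TensorBoard logging might look like the sketch below; the log directory name is an arbitrary choice.

```python
# Minimal Keras MNIST classifier with TensorBoard logging.
# Illustrative sketch only, not SABATH's actual API; the log
# directory "logs/mnist" is an arbitrary choice.
import tensorflow as tf

# Load and normalize the MNIST digits (Keras caches the download locally).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The TensorBoard callback writes training curves under logs/mnist;
# inspect them with `tensorboard --logdir logs/mnist`.
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/mnist")
model.fit(x_train, y_train, epochs=5,
          validation_data=(x_test, y_test), callbacks=[tb])
```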

Rutgers

  • Meeting with the Indiana group (Vikram) on adaptive training

Indiana

  • Working with Rutgers, in agreement with the last bullet!
  • Devising a strategy to minimize the needed training set size
  • JCS Kadupitiya in Vikram's group received his Ph.D. and the Luddy outstanding research award. He is off to work for Microsoft.

Argonne

  • Baixi gave the Argonne presentation after an introduction by Xiaodong
  • They are debating between HDF5 and binary storage
  • Changing the I/O middleware to be based on parallel HDF5
  • Tests were done on 16 GPUs, corresponding to 2 nodes
  • Execution time does not depend much on batch size; Geoffrey suggested this indicates the GPUs are not fully utilized, so smaller computations do not exploit all of the internal GPU parallelism
  • Baixi reviewed the problem that a shuffle is needed every epoch, and the challenge when the data is too large to fit in memory and must spill to disk (small datasets fit into memory)
  • The Lustre file system used is bad for small randomly accessed files; typically each image is one file
  • The load is mainly reads, with some writes
  • The shufflings are all precalculated, and the needed redistribution (MPI scatter/gather collectives) can be represented as a graph, which is imbalanced
  • Computation and data movement are traded off with a heuristic solution close to the true minimum
  • Parallel HDF5 (using MPI-IO) supports concurrent access from multiple MPI processes; a minimal sketch follows this list
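
The following sketch illustrates the per-epoch precomputed shuffle and parallel HDF5 read pattern described above, assuming h5py built with parallel (MPI-IO) support and mpi4py. The file name "train.h5", the dataset name "images", and the per-epoch seeding scheme are illustrative assumptions, not the Argonne code.

```python
# Sketch: per-epoch precomputed shuffling with parallel HDF5 reads.
# Assumes h5py built with parallel (MPI-IO) support and mpi4py.
# "train.h5" and the dataset name "images" are hypothetical.
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# All ranks open the same file through the MPI-IO driver.
with h5py.File("train.h5", "r", driver="mpio", comm=comm) as f:
    images = f["images"]                      # shape: (n_samples, ...)
    n = images.shape[0]

    for epoch in range(3):
        # Derive the epoch's global permutation from a shared seed, so
        # every rank precomputes the same shuffle with no communication.
        perm = np.random.default_rng(seed=epoch).permutation(n)

        # Each rank takes its slice of the permuted order; which rank
        # ends up reading which samples defines the (imbalanced)
        # redistribution graph mentioned in the notes. Remainder
        # samples are dropped here for brevity.
        share = n // nprocs
        my_ids = np.sort(perm[rank * share:(rank + 1) * share])
        batch = images[my_ids]                # h5py needs increasing indices
        # ... training step on `batch` would go here ...
```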