Meeting Notes 05-23-2022

Minutes of SBI-FAIR May 23, 2022, Meeting

Present: Kamil Iskra, Deborah Penchoff, Vikram Jadhao, Shantenu Jha, Geoffrey Fox, Xiaodong Yu, Piotr Luszczek, Baixi Sun, Gregor von Laszewski

Updates

Virginia

  • Geoffrey described substantial progress with the Science working group of MLCommons, which should reach first base at an ISC BoF on June 1
  • The diffusion equation surrogate work with Javier Toledo and James Glazier is being written up.
  • He also commented on Argonne's shuffling performance and the use of Big Data collective shuffle primitives that work on disk and in memory.

Tennessee

  • Cade Brown is on an internship with NVIDIA
  • Piotr gave a presentation describing the nice progress with the SABATH system introduced by Cade last month.
  • SABATH is now available with two applications
    • Keras MNIST
    • Cloudmask-0, extended from the work of Tony Hey's UK group
  • SABATH now caches data locally
  • TensorBoard visualization support was described; a minimal example follows this list
  • Planned: add PyTorch support to the current TensorFlow support, plus new applications
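
As a rough illustration of the kind of application SABATH wraps (this is not SABATH's actual interface), a minimal Keras MNIST run with TensorBoard logging might look like the sketch below; the log directory name is an arbitrary choice.

```python
# Minimal Keras MNIST classifier with TensorBoard logging.
# Illustrative sketch only, not SABATH's actual API; the log
# directory "logs/mnist" is an arbitrary choice.
import tensorflow as tf

# Load and normalize the MNIST digits (Keras caches the download locally).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The TensorBoard callback writes training curves under logs/mnist;
# inspect them with `tensorboard --logdir logs/mnist`.
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/mnist")
model.fit(x_train, y_train, epochs=5,
          validation_data=(x_test, y_test), callbacks=[tb])
```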

Rutgers

  • Meeting with the Indiana group (Vikram) on adaptive training

Indiana

  • Working with Rutgers, in agreement with the last bullet!
  • Devising a strategy to minimize the needed training set size
  • JCS Kadupitiya in Vikram's group received his Ph.D. and the Luddy outstanding research award. He is off to work for Microsoft.

Argonne

  • Baixi gave the Argonne presentation after an introduction by Xiaodong
  • They are debating between HDF5 and binary storage
  • Changing the I/O middleware to be based on parallel HDF5
  • Tests were done on 16 GPUs, corresponding to 2 nodes
  • Execution time does not depend much on batch size; Geoffrey suggested this indicates the GPUs are not fully utilized, so smaller computations do not exploit all of the internal GPU parallelism
  • Baixi reviewed the problem that a shuffle is needed every epoch, and the challenge when the data is too large to fit in memory and must spill to disk (small datasets fit into memory)
  • The Lustre file system used is bad for small randomly accessed files; typically each image is one file
  • The load is mainly reads, with some writes
  • The shufflings are all precalculated, and the needed redistribution (MPI scatter/gather collectives) can be represented as a graph, which is imbalanced
  • Computation and data movement are traded off with a heuristic solution close to the true minimum
  • Parallel HDF5 (using MPI-IO) supports concurrent access from multiple MPI processes; a minimal sketch follows this list
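
The following sketch illustrates the per-epoch precomputed shuffle and parallel HDF5 read pattern described above, assuming h5py built with parallel (MPI-IO) support and mpi4py. The file name "train.h5", the dataset name "images", and the per-epoch seeding scheme are illustrative assumptions, not the Argonne code.

```python
# Sketch: per-epoch precomputed shuffling with parallel HDF5 reads.
# Assumes h5py built with parallel (MPI-IO) support and mpi4py.
# "train.h5" and the dataset name "images" are hypothetical.
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# All ranks open the same file through the MPI-IO driver.
with h5py.File("train.h5", "r", driver="mpio", comm=comm) as f:
    images = f["images"]                      # shape: (n_samples, ...)
    n = images.shape[0]

    for epoch in range(3):
        # Derive the epoch's global permutation from a shared seed, so
        # every rank precomputes the same shuffle with no communication.
        perm = np.random.default_rng(seed=epoch).permutation(n)

        # Each rank takes its slice of the permuted order; which rank
        # ends up reading which samples defines the (imbalanced)
        # redistribution graph mentioned in the notes. Remainder
        # samples are dropped here for brevity.
        share = n // nprocs
        my_ids = np.sort(perm[rank * share:(rank + 1) * share])
        batch = images[my_ids]                # h5py needs increasing indices
        # ... training step on `batch` would go here ...
```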