Meeting Notes 02-05-2024

Notes

Virginia

Rutgers

ASCR-PI-Meeting-Feb-2024-Rutgers

Indiana

  • Indiana has 2 surrogates.
  • Ions in nano confinement. This code allows users to simulate ions confined between material surfaces that are nanometers apart and to extract the associated ionic structure.
  • Time evolution. GitHub: code for our paper “Simulating Molecular Dynamics with Large Timesteps using Recurrent Neural Networks”.

See powerpoint sbi_Jadhao_2024.pptx

ANL

UTK

SABATH Harness

Other

Last joint presentation: SBI DOE Presentation November 28 2022.pptx

The poster is FoxG_FAIR Surrogate Benchmarks .pptx; the 250-word abstract follows.

Replacing traditional HPC computations with deep learning surrogates can dramatically improve the performance of simulations. We need to build repositories for AI models, datasets, and results that are easily used with FAIR metadata. These must cover a broad spectrum of use cases and system issues. The need for heterogeneous architectures brings new software and performance issues. Further, surrogate performance models are needed. The SBI (Surrogate Benchmark Initiative) collaboration between Argonne National Lab, Indiana University, Rutgers, University of Tennessee, and Virginia (lead) with MLCommons addresses these issues. The collaboration accumulates existing surrogates, generates new ones, and hosts them (around 20 in total) in repositories. Selected surrogates become MLCommons benchmarks. The surrogates are managed by a FAIR metadata system, SABATH, developed by Tennessee and implemented for our repositories by Virginia. The surrogate domains are Bragg coherent diffraction imaging, ptychographic imaging, fully ionized plasma fluid model closures, molecular dynamics (2), turbulence in computational fluid dynamics, cosmology, Kaggle calorimeter challenge (4), virtual tissue simulations (2), and performance tuning. Rutgers built a taxonomy using previous work and protein-ligand docking, which will be quantified using six mini-apps representing the system structure for different surrogate uses. Argonne has studied the data-loading and I/O structure for deep learning using inter-epoch and intra-batch reordering to improve data reuse. Their system addresses communication by aggregating small messages. They also study second-order optimizers using compression that balances accuracy against compression level. Virginia has used I/O parallelization to further improve performance. Indiana looked at ways of reducing the training set size needed for a given surrogate accuracy.

[1] Web Page for Surrogate Benchmark Initiative SBI: FAIR Surrogate Benchmarks Supporting AI and Simulation Research. Web page, January 2024. URL: https://sbi-fair.github.io/.

[2] E. A. Huerta, Ben Blaiszik, L. Catherine Brinson, Kristofer E. Bouchard, Daniel Diaz, Caterina Doglioni, Javier M. Duarte, Murali Emani, Ian Foster, Geoffrey Fox, Philip Harris, Lukas Heinrich, Shantenu Jha, Daniel S. Katz, Volodymyr Kindratenko, Christine R. Kirkpatrick, Kati Lassila-Perini, Ravi K. Madduri, Mark S. Neubauer, Fotis E. Psomopoulos, Avik Roy, Oliver Rübel, Zhizhen Zhao, and Ruike Zhu. FAIR for AI: An interdisciplinary and international community building perspective. Scientific Data, 10(1):487, 2023. URL: https://doi.org/10.1038/s41597-023-02298-6.

Note: More references can be found on the web site.

LaTeX version: https://www.overleaf.com/project/65b7e7262188975739dae845

PDF: FoxG_FAIR Surrogate Benchmarks _abstract.pdf https://drive.google.com/file/d/1ytrrii09tKKS2AAVuUTKGw8tmM2Xf8-N/view?usp=drive_link

Topics

  • Fitting of hardware and software to surrogates
  • Uncertainty quantification of the surrogate estimates
  • Minimize the training data size needed to get reliable surrogates for a given accuracy choice
  • Develop and test surrogate performance models
  • Findable, Accessible, Interoperable, and Reusable (FAIR) data ecosystem for HPC surrogates
  • SBI collaborates with industry and a leading machine learning benchmarking activity: MLPerf/MLCommons

Rutgers (2 slides)

  • Detailed example: AI-accelerated protein-ligand docking
  • Taxonomy and 6 mini-apps

Tennessee (6 slides)

  • SABATH structure and UTK use
  • Cosmoflow in detail

Argonne (7 slides)

5 slides: High-Performance Data Loading Framework for Distributed DNN Training

  • Maximize data reuse: Inter-Epoch Reordering (InterER) has minimal impact on accuracy; Intra-Batch Reordering (IntraBR) has no impact on accuracy.
  • I/O balancing: a strategy that aggregates small reads into a chunk read.

2 slides: Scalable Communication Framework for Second-Order Optimizers, using compression that balances accuracy against compression amount
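The read-aggregation idea in the Argonne notes (combining small reads into a chunk read) can be illustrated with a minimal sketch. This is not Argonne's framework: the `(offset, length)` request format, the function name `coalesce_reads`, and the `max_gap` parameter are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Read = Tuple[int, int]  # (byte offset, length); hypothetical request format


def coalesce_reads(requests: Iterable[Read], max_gap: int = 0) -> List[Read]:
    """Merge small reads into chunk reads when the gap between them is <= max_gap bytes."""
    chunks: List[List[int]] = []  # each entry is [start, end), end exclusive
    for offset, length in sorted(requests):
        end = offset + length
        if chunks and offset - chunks[-1][1] <= max_gap:
            # Close enough to the previous chunk: extend it instead of issuing a new read.
            chunks[-1][1] = max(chunks[-1][1], end)
        else:
            chunks.append([offset, end])
    return [(start, end - start) for start, end in chunks]


if __name__ == "__main__":
    small_reads = [(0, 4), (4, 4), (100, 8), (10, 2)]
    print(coalesce_reads(small_reads, max_gap=2))  # prints [(0, 12), (100, 8)]
```

A larger `max_gap` trades some wasted bytes (the gaps are read too) for fewer I/O requests, which is one plausible way to think about the balancing the notes mention.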

Indiana

Goal 1: Develop surrogates for nanoscale molecular dynamics (MD) simulations

  • Surrogate for MD simulations of confined electrolyte ions
  • Surrogate for time evolution operators in MD simulations

Goal 2: Investigate surrogate accuracy dependence on training dataset size
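A study of accuracy versus training set size, as in Goal 2, is usually organized as a learning curve. The sketch below is a toy illustration only, not Indiana's method: the `simulate` function is a hypothetical stand-in for an expensive simulation, the "surrogate" is a plain least-squares line, and the names and parameters are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def simulate(x: float) -> float:
    """Hypothetical stand-in for an expensive simulation."""
    return 3.0 * x + 1.0


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def learning_curve(sizes: Sequence[int], noise: float = 0.5,
                   seed: int = 0) -> List[Tuple[int, float]]:
    """Return (training size, test RMSE) pairs for a toy surrogate."""
    rng = random.Random(seed)
    test_xs = [i / 100 for i in range(101)]  # fixed evaluation grid on [0, 1]
    results = []
    for n in sizes:
        # Draw n noisy training samples of the simulated quantity.
        xs = [rng.random() for _ in range(n)]
        ys = [simulate(x) + rng.gauss(0.0, noise) for x in xs]
        slope, intercept = fit_line(xs, ys)
        # Compare surrogate predictions with the noise-free simulation.
        sq_err = sum((slope * x + intercept - simulate(x)) ** 2 for x in test_xs)
        results.append((n, math.sqrt(sq_err / len(test_xs))))
    return results


if __name__ == "__main__":
    for n, rmse in learning_curve([4, 16, 64, 256]):
        print(f"n={n:4d}  test RMSE={rmse:.4f}")
```

Reading off the smallest size whose error meets a chosen accuracy target gives the required training set size for that target; with a real surrogate the fit and the error metric would be replaced accordingly.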

Virginia

  • Work on I/O and communication optimization
  • Done: two Argonne, one IU, and one MLCommons

To do

  • One Argonne: fully ionized plasma fluid model closures
  • Calorimeter Challenge: 3 (NF:CaloFlow, Diffusion:CaloDiffusion, CaloScore v2, VAEQVAE)
  • Last IU
  • UTK Cosmoflow
  • Performance
  • Virtual Tissue (2)
  • 6 Rutgers

Last modified February 2, 2024: ppt (1fc608c)