1 - Links

Links

Overall Project Links

2 - Meeting Notes 02-05-2024

Meeting Notes from 02-05-2024

Notes

Virginia

Rutgers

ASCR-PI-Meeting-Feb-2024-Rutgers

Indiana

  • Indiana has 2 surrogates.
  • Ions in nano confinement. This code allows users to simulate ions confined between material surfaces that are nanometers apart and to extract the associated ionic structure.

Time evolution: GitHub code for the paper “Simulating Molecular Dynamics with Large Timesteps using Recurrent Neural Networks”

See the PowerPoint sbi_Jadhao_2024.pptx

ANL

UTK

SABATH Harness

Other

Last Joint Presentation SBI DOE Presentation November 28 2022.pptx

The poster is FoxG_FAIR Surrogate Benchmarks.pptx; the 250-word abstract follows.

Replacing traditional HPC computations with deep learning surrogates can dramatically improve the performance of simulations. We need to build repositories for AI models, datasets, and results that are easily used with FAIR metadata. These must cover a broad spectrum of use cases and system issues. The need for heterogeneous architectures raises new software and performance issues, and surrogate performance models are also needed. The SBI (Surrogate Benchmark Initiative) collaboration between Argonne National Lab, Indiana University, Rutgers, the University of Tennessee, and Virginia (lead), with MLCommons, addresses these issues. The collaboration accumulates existing and generates new surrogates and hosts them (a total of around 20) in repositories. Selected surrogates become MLCommons benchmarks. The surrogates are managed by a FAIR metadata system, SABATH, developed by Tennessee and implemented for our repositories by Virginia. The surrogate domains are Bragg coherent diffraction imaging, ptychographic imaging, fully ionized plasma fluid model closures, molecular dynamics (2), turbulence in computational fluid dynamics, cosmology, the Kaggle calorimeter challenge (4), virtual tissue simulations (2), and performance tuning. Rutgers built a taxonomy using previous work and protein-ligand docking, which will be quantified using six mini-apps representing the system structure for different surrogate uses. Argonne has studied the data-loading and I/O structure for deep learning, using inter-epoch and intra-batch reordering to improve data reuse; their system addresses communication by aggregating small messages. They also study second-order optimizers using compression, balancing accuracy against compression level. Virginia has used I/O parallelization to further improve performance. Indiana looked at ways of reducing the training set size needed for a given surrogate accuracy.

[1] Web page for Surrogate Benchmark Initiative SBI: FAIR Surrogate Benchmarks Supporting AI and Simulation Research, January 2024. URL: https://sbi-fair.github.io/.

[2] E. A. Huerta, Ben Blaiszik, L. Catherine Brinson, Kristofer E. Bouchard, Daniel Diaz, Caterina Doglioni, Javier M. Duarte, Murali Emani, Ian Foster, Geoffrey Fox, Philip Harris, Lukas Heinrich, Shantenu Jha, Daniel S. Katz, Volodymyr Kindratenko, Christine R. Kirkpatrick, Kati Lassila-Perini, Ravi K. Madduri, Mark S. Neubauer, Fotis E. Psomopoulos, Avik Roy, Oliver Rübel, Zhizhen Zhao, and Ruike Zhu. FAIR for AI: An interdisciplinary and international community building perspective. Scientific Data, 10(1):487, 2023. URL: https://doi.org/10.1038/s41597-023-02298-6.

Note: More references can be found on the Web site.

LaTeX version https://www.overleaf.com/project/65b7e7262188975739dae845 with PDF FoxG_FAIR Surrogate Benchmarks_abstract.pdf https://drive.google.com/file/d/1ytrrii09tKKS2AAVuUTKGw8tmM2Xf8-N/view?usp=drive_link

Topics

  • Fitting of hardware and software to surrogates
  • Uncertainty quantification of the surrogate estimates
  • Minimize the training data size needed to get reliable surrogates for a given accuracy choice
  • Develop and test surrogate performance models
  • A Findable, Accessible, Interoperable, and Reusable (FAIR) data ecosystem for HPC surrogates
  • SBI collaborates with industry and a leading machine learning benchmarking activity, MLPerf/MLCommons

Rutgers (2 slides): Detailed example of AI-accelerated protein-ligand docking; taxonomy and 6 mini-apps

Tennessee (6 slides): SABATH structure and UTK use; CosmoFlow in detail

Argonne (7 slides). 5 slides: High-Performance Data Loading Framework for Distributed DNN Training.

  • Maximize data reuse: Inter-Epoch Reordering (InterER), which has minimal impact on accuracy, and Intra-Batch Reordering (IntraBR), which has no impact on accuracy.
  • I/O balancing: a strategy that aggregates small reads into a single chunk read.
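The data-reuse idea can be illustrated with a small sketch (hypothetical; the function and the simple cache model are my assumptions, not the framework's actual implementation): samples still resident in a node's cache from the end of the previous epoch are consumed first in the next epoch, before they are evicted.

```python
def inter_epoch_reorder(prev_order, next_order, cache_size):
    """Reorder the next epoch's shuffled sample indices so that samples
    still cached from the end of the previous epoch are read first,
    turning would-be disk reads into cache hits."""
    cached = set(prev_order[-cache_size:])          # what the cache still holds
    hits = [i for i in next_order if i in cached]   # reuse these first
    misses = [i for i in next_order if i not in cached]
    return hits + misses
```

Because only the visit order within the epoch changes (every sample is still seen exactly once), the effect on accuracy is small, consistent with the note above.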

2 slides: Scalable Communication Framework for Second-Order Optimizers, using compression to balance accuracy against the amount of compression.

Indiana. Goal 1: Develop surrogates for nanoscale molecular dynamics (MD) simulations: a surrogate for MD simulations of confined electrolyte ions, and a surrogate for time evolution operators in MD simulations.

Goal 2: Investigate surrogate accuracy dependence on training dataset size

Virginia: Work on I/O and communication optimization. Done: two Argonne surrogates, one IU, and one MLCommons.

To do: one Argonne (fully ionized plasma fluid model closures); Calorimeter Challenge: 3 (NF: CaloFlow, Diffusion: CaloDiffusion, CaloScore v2; VAE: QVAE); last IU; UTK CosmoFlow; Performance; Virtual Tissue (2); 6 Rutgers.

3 - Meeting Notes 01-08-2024

Meeting Notes from 01-08-2024

Present: Geoffrey Fox, Gregor von Laszewski, Przemek Porebski, Piotr Luszczek, Shantenu Jha

Apologies Vikram Jadhao

  • Shantenu described the background to the PI meeting for ASCR in February, which was modeled on successful SciDAC-wide meetings. It is not clear if sessions will be plenary or organized around program manager portfolios.
  • Virginia started a list of surrogates that would help prepare any poster necessary
  • https://docs.google.com/presentation/d/1LonfbydMlQyLBv5vh8tjATv9BxdN7GmjuU8RFyuK5aw/edit#slide=id.g2acfd0f37ff_1_151
  • Argonne would add work on I/O, compression, and second-order methods.
  • Rutgers has surrogates to list, plus work on effective performance and their taxonomy of surrogate types.
  • Indiana was not available due to travel, but has work on data dependence and surrogates for sustainability (a new paper).
  • Tennessee has two surrogates, MiniWeatherML and Performance. Also has SABATH
  • We did not set a next meeting until the PI meeting was clearer.
  • Later email from DOE set the poster deadline as January 29.

4 - Meeting Notes 10-30-2023

Meeting Notes from 10-30-2023

Minutes of SBI-FAIR October 30 2023, Meeting

Present: Geoffrey Fox, Gregor von Laszewski, Przemek Porebski, Piotr Luszczek, Vikram Jadhao, Shantenu Jha, Margaret Lentz

  • AI for Science report AI for Science, Energy, and Security Report | Argonne National Laboratory
  • ASCAC Advanced Scientific Comput… | U.S. DOE Office of Science(SC)
  • Hal Finkel’s (Director of Research, ASCR Advanced Scientific Computing) talk on ASCR Research Priorities is important
  • Anticipated Solicitations in FY 2024
    • Compared to FY 2023, expect a smaller number of larger, more-broadly-scoped solicitations driving innovation across ASCR’s research community.
    • In appropriate areas, ASCR will expand its strategy of soliciting longer-term projects and, in most areas, encouraging partnerships between DOE National Laboratories, academic institutions, and industry.
    • ASCR will continue to seek opportunities to expand the set of institutions represented in our portfolio and encourages our entire community to assist in this process by actively exploring potential collaborations with a diverse set of potential partners.
  • Areas of interest include, but are not limited to:
    • Applied mathematics and computer science targeting quantum computing across the full software stack.
    • Applied mathematics and computer science focused on key topics in AI for Science, including scientific foundation models, decision support for complex systems, privacy-preserving federated AI systems, AI for digital twins, and AI for scientific programming.
    • Microelectronics co-design combining innovation in materials, devices, systems, architectures, algorithms, and software (including through Microelectronics Research Centers).
    • Correctness for scientific computing, data reduction, new visualization and collaboration paradigms, parallel discrete-event simulation, neuromorphic computing, and advanced wireless for science.
    • Continued evolution of the scientific software ecosystem enabling community participation in exascale innovation, adoption of AI techniques, and accelerated research productivity.
  • She noted the Executive order today, Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence | The White House, and this message (trustworthiness) will be reflected in DOE programs.
  • Microelectronics will be a thrust
  • NAIRR $140M is important

Rutgers

Shantenu Jha gave a thorough presentation  


There were four items, with status given in parentheses:
  1. Develop and characterize surrogates in the context of the NVBL pipeline
    1. Published in Scientific Reports: performance of surrogate models without loss of accuracy (Stage 1 of the NVBL drug discovery pipeline) (Done)
  2. Performance and taxonomy of surrogates coupled to HPC (paper in a month)
    2. Survey of surrogates coupled to HPC simulations (almost complete, 2023-Q3)
    3. Generalized framework of surrogate performance (ongoing, 2023-Q4)
    1. Optimal decision making in the DD pipeline (published)
  3. Tools (ongoing)
    4. Preliminary work on mini-apps under review; extend to FAIR mini-apps for the surrogates taxonomy
    5. Deployed on DOE leadership-class machines
  4. Interact with MLCommons (anticipated start in 2023-Q4)
    6. Benchmarks for surrogates coupled to HPC workflows

Indiana

  • Vikram Jadhao presented
  • Accuracy/speedup tradeoff for molecular dynamics surrogates
  • Looking for datasets with errors
  • Followed up with later discussions with Rutgers so this can feed into software

Tennessee

  • Piotr Luszczek gave a presentation
  • He reported on progress with SABATH and MiniWeatherML
  • He is giving several presentations

Virginia

  • Presentation
  • We discussed progress with surrogates and enhancements to SABATH
  • We discussed the repository and noted that different models need specific environments
    • Requirements.txt will specify this
    • Different target hardware needs to be supported
  • OSMIBench will be released before end of year
  • Support separate repositories in the future
  • We discussed papers and, in particular, a poster at the Oak Ridge OLCF users meeting.

Argonne

  • Finished the contract but will, of course, complete their papers.

5 - Meeting Notes 09-25-2023

Meeting Notes from 09-25-2023

Minutes of SBI-FAIR September 25 2023, Meeting

Present: Geoffrey Fox, Gregor von Laszewski, Przemek Porebski, Piotr Luszczek, Vikram Jadhao

Apologies: Shantenu Jha, Kamil Iskra, Margaret Lentz

Virginia

  • Presentation
  • Repository
  • Specific environments are needed for different models
  • Requirements.txt
  • Different hardware support
  • Copy MLCommons approach
  • MLCube as a target
  • Tools to generate targets
  • Release before supercomputing
  • Add MLCommons benchmarks
  • Separate repositories in version 2

Argonne

  • Finished the contract but will, of course, complete their papers.

Tennessee

  • Piotr presented
  • SABATH updates
  • IBM-NASA Foundation model has multi-part datasets
  • Cloudmesh uses SABATH
  • Smoky Mountains presentation tomorrow

Rutgers

  • See end of
  • The first mini-app is ready

Indiana

  • Will update the nanoconfinement app; the nanoHUB version is still used
  • Second surrogate being worked on
  • Soft label work continuing
  • Interested in AI for Instruments
  • Surrogates help sustainability as they save energy

6 - Meeting Notes 08-25-2023

Meeting Notes from 08-25-2023

Minutes of SBI-FAIR August 28 2023, Meeting

Present: Geoffrey Fox, Gregor von Laszewski, Przemek Porebski, Kamil Iskra, Baixi Sun, Piotr Luszczek

Apologies: Shantenu Jha, Vikram Jadhao, Margaret Lentz (Rutgers and Indiana did not present)

Virginia

  • SABATH extensions
  • OSMIBench improved
  • Experiment Executor added in Cloudmesh
  • Argonne surrogates supported

Argonne

  • Baixi presented their new work
  • SOLAR paper with artifacts submitted
  • The communication bottleneck in the second-order method K-FAC was addressed with compression and sparsification methods in the SSO framework

Tennessee

  • Piotr described Virginia’s enhancements
  • IBM-NASA multi-part datasets in Foundation model
  • Smoky Mountains Conference
  • Integration with MLCommons Croissant using Schema.org

7 - Meeting Notes 07-31-2023

Meeting Notes from 07-31-2023

Minutes of SBI-FAIR July 31 2023, Meeting

Present: Geoffrey Fox, Gregor von Laszewski, Przemek Porebski, Kamil Iskra, Xiaodong Yu, Baixi Sun, Piotr Luszczek, Shantenu Jha

Apologies: Vikram Jadhao

Virginia

  • Geoffrey presented the Virginia Update https://docs.google.com/presentation/d/132erkV49Lgd0ZFx-AtNWJPRwTrxc480m-rU6jmvMmYA/edit?usp=sharing, which also included Indiana (see below)
  • Good progress with Argonne Surrogates
    • We have added PtychoNN to SABATH, and we have run AutoPhaseNN on Rivanna
  • We reviewed other surrogates from Virginia including OSMIBench and a new Calorimeter simulation
  • We are working well with Tennessee on SABATH
  • Gregor finished with a short study on the use of Rivanna, the Virginia supercomputer

Indiana

Argonne

  • Argonne’s funds have essentially finished
  • Xiaodong Yu is moving to Stevens
  • New compression study comparing methods that are error-bounded or not; their performance differs by a factor of 4-6
  • Baixi gave an update presentation: SSO: A Highly Scalable Second-order Optimization Framework for Deep Neural Networks via Communication Reduction
  • Quantized Stochastic Gradient Descent (QSGD) is not error-bounded
  • Model accuracy versus compression tradeoff
  • Unable to utilize error-feedback due to GPU memory being filled by large models and large batch size.
  • Looked at different rounding methods
    • Stochastic rounding preserves direction better as not so many zeros
  • Revised the I/O paper, i.e., SOLAR, based on the reviews; submitting to PPoPP'24 with new experiments and a better writeup
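The stochastic-rounding point can be sketched with a toy example (an illustration of the general technique, not Argonne's code): rounding up with probability equal to the fractional remainder is unbiased in expectation, so small gradient components are not all flushed to zero as with round-to-nearest, and the gradient direction is better preserved.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round each entry of x to a multiple of `step`; round up with
    probability equal to the fractional remainder, so E[result] == x."""
    scaled = np.asarray(x, dtype=float) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                     # fractional remainder in [0, 1)
    return (lower + (rng.random(scaled.shape) < prob_up)) * step

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)                        # many small gradient entries
nearest = np.round(x / 1.0) * 1.0                # round-to-nearest: all zeros
stochastic = stochastic_round(x, 1.0, rng)       # ~30% of entries become 1.0
```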

Rutgers

  • The surrogate survey paper is making good progress with DeepDriveMD and other motifs.
  • Andre Merzky is working on associated mini-apps (surrogates)
  • Will work with MLCommons in October

Tennessee

  • Piotr presented his group’s work https://drive.google.com/file/d/1ep9zxdv25I3MJmPt5YcJi32SHu5BAF4J/view?usp=sharing
  • MiniWeatherML running with MPI and with or without CUDA.
    • No external dataset is required
  • SABATH making good progress in collaboration with Virginia
  • They are working on Cosmoflow
  • Piotr noted that those sites that are continuing with the project will need to submit a project report very soon. Geoffrey shared his project report to allow a common story

8 - Meeting Notes 06-26-2023

Meeting Notes from 06-26-2023

Minutes of SBI-FAIR June 26, 2023, Meeting

Present: Geoffrey Fox, Gregor von Laszewski, Przemek Porebski, Kamil Iskra, Xiaodong Yu, Baixi Sun, Vikram Jadhao, Piotr Luszczek, Shantenu Jha, Margaret Lentz

Virginia

  • This was presented by Geoffrey
  • He described work on new surrogates, including LHC Calorimeter, Epidemiology, Extended virtual tissue, and Earthquake
  • He described work on the repository and SABATH
  • This involved two existing AI models CloudMask and OSMIBench
  • Shantenu Jha asked about the number of inferences per second.
    • From MLCommons Science Working minutes, we find for OSMIBench
    • On Summit, with 6 GPUs per node, one runs 6 instances of TensorFlow server per node, using batch sizes around 250K with a goal of a billion inferences per second

Argonne

  • Continue to work on Second-order Optimization Framework for Deep Neural Networks with Communication Reduction
  • Baixi Sun presented the details
  • He introduced quantization to lower precision (QSGD), which gives encouraging results, although in one case the quantization method failed in the eigenvalue stage
  • We removed Rick Stevens from the Google Group
  • Geoffrey mentioned his ongoing work on improving shuffling using Arrow vector format; he will share the paper when ready

Indiana

Rutgers

  • Shantenu presented
  • Nice paper on surrogate classes with Wes Brewer, who works with Geoffrey on OSMIBench
  • Mini-apps for each of the 6 motifs that need FAIR metadata
    • 5 motifs use surrogates; one generates them
  • He described an interesting workshop on molecular simulations
  • He noted that Aurora is training a trillion-parameter foundation model for science
  • LLMs need 10^8 exaflops: need to optimize!
  • Vikram noted “Simulation Intelligence: Towards a New Generation of Scientific Methods”

Tennessee

  • Piotr presented slides
  • CosmoFlow on 8 GPUs is running well
  • He introduced the MiniWeatherML mini-app
    • CUDA-aware pointers must be explicitly specified in the FAIR schema
    • Test in PETSc leaves threaded MPI in an invalid state
    • Alternative MPIX query interface varies between MPI implementations
    • GPU Direct copy support is optional
  • SABATH system is moving ahead with a focus on adding MPI support
  • Piotr is now the PI of this project at UTK. We removed Cade Brown, Jack Dongarra, and Deborah Penchoff from the Google Group

9 - Meeting Notes 05-29-2023

Meeting Notes from 05-29-2023

Minutes of SBI-FAIR May 29, 2023, Meeting

Present: Geoffrey Fox, Xiaodong Yu, Baixi Sun, Piotr Luszczek

Virginia

  • Comment on surrogates produced by generative methods versus those that map particular inputs to particular outputs. In examples like simulations of experimental physics apparatus, you only need output and not input. Methods need to sample the output data space correctly.
  • Geoffrey also described earlier experiences using second-order methods and least squares/maximum likelihood optimizations for physics data analysis. One can use eigenvalue/vector decomposition or the Levenberg-Marquardt method.

Tennessee

Argonne

  • Xiaodong summarized the situation, and Baixi gave a detailed presentation
  • Working on reducing data size, but the compression technology seems difficult
  • The error-bounded approach doesn’t seem to work very well, so Argonne is investigating other methods. There is currently no method that preserves good accuracy and gives significant reduction.
  • Looking at the performance of first and second-order gradients
  • What can one drop in a second-order method? Much of the data is irrelevant, but not the data that current lossy compression seems to be dropping
  • Model parallelism for calculating eigensystems, then data parallelism

10 - Meeting Notes 04-03-2023

Meeting Notes from 04-03-2023

Minutes of SBI-FAIR April 3 2023, Meeting

Present: Geoffrey Fox, Gregor von Laszewski, Przemek Porebski, Piotr Luszczek, Kamil Iskra, Xiaodong Yu, Baixi Sun, Vikram Jadhao, Margaret Lentz (DOE)

Regrets: Shantenu Jha

DOE had no major announcements but reminded us of links

Virginia Geoffrey summarized activities (Slides 1-5), with a new Virtual Tissue surrogate using UNet and periodic boundary conditions. We are investigating new ideas that can describe functions with a wide dynamic range. Virginia is responsible for the final deployed surrogates and is building a team of undergraduates, researchers, and Ph.D. students. Students find the experience educational, as we discovered in a collaboration with New York University. Przemek Porebski is joining the Virginia team with experience in computational epidemiology and software engineering. Przemek introduced himself. Virginia also covered the status of MLCommons benchmarks, including the new OSMIBench and FastML.

Rutgers Shantenu was unable to attend but prepared slides and briefed Geoffrey, who presented them on his behalf (Slides 6-10). These summarize the current status with a list of the six classes of surrogate problems identified as important. Shantenu compared the training samples for surrogates with those found for LLMs. He proposes to develop mini-apps (benchmarks) covering the range of key features exhibited by surrogates.

Vikram gave Indiana University’s Presentation with a careful analysis of accuracy as a function of

  • Dataset size, showing that the error plateaus at acceptable values at a sample size of around 2000.
  • The boundary versus internal points
  • Sensitivity to removing selected features and how many removed points were needed for acceptable answers. Here the result depended on the particular feature and measured the generalizability of the network.
  • There is a publication under review.

Argonne’s new results were described by Baixi; the team was busy preparing a paper for SC23.

  • They continued the study of second-order methods showing a broadcast was time-consuming, taking 48% of the time on 64 GPUs.
  • The message sizes were not large and in a region where latency was important.
  • They used lossy compression and studied the outliers in this.
  • Note the last meeting’s presentation introducing the K-FAC method.

Piotr described Tennessee’s work with

  • Focus on SABATH, tested on three applications. It is nearly ready to be used by Virginia
  • They have identified a new graduate student and need to modify the contract, on which Margaret gave key advice.

11 - Meeting Notes 02-27-2023

Meeting Notes from 02-27-2023

Minutes of SBI-FAIR February 27 2023, Meeting

Present: Geoffrey Fox, Piotr Luszczek, Gregor von Laszewski, Kamil Iskra, Xiaodong Yu, Baixi Sun, Vikram Jadhao

We discussed modifying our simple summary describing the status and plans for the project to add a discussion of the timeline. Virginia did theirs as an example on slide 2.

Indiana

Vikram discussed recent activity, responding to referee comments on their recent paper.

Virginia

Geoffrey noted two new surrogates: a diffusion surrogate (https://arxiv.org/abs/2302.03786) with James Glazier and J. Quetzalcoatl Toledo-Marin, and a computational fluid dynamics surrogate (https://code.ornl.gov/whb/osmi-bench) from Oak Ridge

Geoffrey described issues arising from the diffusion surrogate above. We are trying to understand how deep learning can work for problems with a large range of input or output values. Examples include covid and flu counts, images with a wide range of illumination, and surrogate solutions where function values range over several orders of magnitude and one is interested in both large and small values. This range of values is seen over spatial values (images) or time values (time series).

However, this doesn’t seem to work properly in deep learning, where the activation scale is of order 1. The weights cannot adjust to different sizes of input values, so one cannot see the nonlinearity of the activation over the full range of values. Naively, the DL will choose weights so that the activation nonlinearity only really impacts a portion of the value range. One can think of many approaches:

a) Replace value by value^n for n < 1, including log(value)

b) Scale the activation value by an average value (found from a coarser scale if labeled by space, as in an image)

c) Mixture of experts with a different activation scale for each expert, such as 0.001, 0.01, 0.1, 1
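Approach (a) can be sketched as a simple target transform (illustrative only; the names and epsilon guard are my assumptions): compress the target into a range where a unit-scale nonlinearity is meaningful, train on the transformed values, and invert the transform on the predictions.

```python
import numpy as np

EPS = 1e-6  # guards log(0); an assumption, not from the notes

def forward(y):
    """Map a non-negative target spanning many orders of magnitude
    into a compact range the network can fit."""
    return np.log10(y + EPS)

def inverse(z):
    """Undo the transform on the network's predictions."""
    return np.power(10.0, z) - EPS

y = np.array([1e-4, 1e-1, 1.0, 1e3])  # values over 7 orders of magnitude
z = forward(y)                        # compressed to roughly [-4, 3]
```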

Tennessee

Piotr reported that the SABATH project had a new student and was ramping up.

Argonne

Baixi discussed second-order optimization using Kronecker-factored Approximate Curvature (K-FAC), which significantly outperforms standard stochastic gradient descent. This is coupled with compression to reduce communication costs.
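The compression side can be illustrated with a generic top-k sparsifier (a common communication-reduction technique; this sketch is my illustration, not the method actually coupled with K-FAC here): only the k largest-magnitude entries are transmitted, trading accuracy of the update against communication volume.

```python
import numpy as np

def topk_compress(tensor, ratio):
    """Keep the `ratio` fraction of largest-magnitude entries;
    the pair (indices, values) is what would be communicated."""
    flat = tensor.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def topk_decompress(idx, values, shape):
    """Rebuild a dense tensor with zeros everywhere else."""
    out = np.zeros(np.prod(shape))
    out[idx] = values
    return out.reshape(shape)
```

Raising `ratio` keeps more of the update (better accuracy) at the cost of more communication, which is the balance the notes describe.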

12 - Meeting Notes 01-30-2023

Meeting Notes from 01-30-2023

Minutes of SBI-FAIR January 2, 9, and 30 2023, Meetings

January 2 2023:

Present: Deborah Penchoff, Shantenu Jha, Geoffrey Fox, Piotr Luszczek, Gregor von Laszewski

We discussed producing a simple summary (roughly one slide per institution) describing the status and plans for the project. Virginia, UTK, and Rutgers made a draft, which will be expanded before our January 30 meeting with Margaret. These should mention inter-institution collaborations. We continued on January 9.

January 9 2023:

Present: Geoffrey Fox, Kamil Iskra, Xiaodong Yu, Baixi Sun, Vikram Jadhao, Gregor von Laszewski

Based on the earlier meeting, Argonne and Indiana produced summary pages which we iterated to include collaborations to deposit surrogates in the repository.

January 30, 2023:

Present: not recorded, but all institutions represented

We gave our presentation, followed by a discussion with Margaret. She noted recent DOE calls with useful links:

https://public.govdelivery.com/accounts/USDOEOS/subscriber/new

https://science.osti.gov/ascr/Funding-Opportunities

She stressed the importance of establishing a timeline. We should discuss at the next meeting.

We didn’t decide on a cadence for her presence at our meetings.

13 - Meeting Notes 01-05-2023

Meeting Notes from 01-05-2023

Minutes of SBI-FAIR May 1, 2023, Meeting

Present: Geoffrey Fox, Gregor von Laszewski, Przemek Porebski, Kamil Iskra, Xiaodong Yu, Baixi Sun, Vikram Jadhao, Piotr Luszczek

Regrets: Shantenu Jha

Virginia Geoffrey noted continued progress with the new Virtual Tissue surrogate using UNet and periodic boundary conditions. It is interesting that UNet mimics multigrid PDE methods. Przemyslaw is still disentangling from other work but will start very soon. There were several (50 in 2 weeks) undergraduate and incoming graduate student research requests. The OSMIBench surrogate is progressing and will integrate with SABATH. Geoffrey asked what surrogates are available to work on now.

Rutgers

Not presented

Indiana University

Vikram discussed progress. The ions-in-confinement code will be sent to UVA. He discussed sensitivity to training data, showing the need for some but not all samples in a region.

https://pubs.acs.org/doi/10.1021/acs.jctc.2c01282 and PDF is

Studied interpolation; extend to extrapolation

Speedup study: a factor of 2 if one drops every other point and replaces the dropped points by interpolations
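The speedup idea can be sketched as follows (a toy illustration under my own assumptions, not the paper's procedure): run the expensive simulation on every other input point only, and fill in the dropped points by interpolation, roughly halving the simulation cost.

```python
import numpy as np

def thin_and_interpolate(x, y_simulated):
    """Pretend only every other point was simulated; reconstruct the
    full set of outputs by linear interpolation between kept points."""
    kept_x, kept_y = x[::2], y_simulated[::2]
    return np.interp(x, kept_x, kept_y)

x = np.linspace(0.0, 1.0, 11)
y = 2.0 * x + 1.0                 # a cheap stand-in for simulation output
y_est = thin_and_interpolate(x, y)
```

On smooth data the interpolated values are close to the simulated ones, which is why the accuracy loss can be acceptable.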

Argonne

The SOLAR paper was rejected.

Baixi presented their new results with a focus on data compression (for second-order optimization)

Aggregate the broadcast, as it was previously latency dominated

Float32 versus Float64 inversion error (eigensolution versus inversion)

Some tasks are sensitive to precision.

Submitted to SC23; will share with people

Communicated the Light Source surrogates PtychoNN and AutoPhaseNN to the FAIR main repository. Baixi asked Dr. Cherukara (from ANL) and got permission about which can be made available to the public.

Specifically, they implemented PtychoNN using PyTorch Distributed Data-Parallel (DDP)

See OneDrive FAIR, or use this Google Drive link:

https://drive.google.com/drive/folders/1c2HGFBiymJUu9yaUTW5K-dIOoemxOfjN?usp=sharing These have the same readme and Python files

Tennessee

Piotr presented CUDA 10 versus CUDA 11

SABATH with the CosmoFlow small dataset is working. Move to:

  • Earthquake
  • OSMIBench

Gregor described progress with the Friday May 14, 1 pm meeting with Wes Brewer

Gregor recommends exchanging Docker or Singularity definition files

SABATH could create the container image

14 - Meeting Notes 11-28-2022

Meeting Notes from 11-28-2022

Minutes of SBI-FAIR November 28, 2022, Meeting

Present: Kamil Iskra, Xiaodong Yu, Deborah Penchoff, Shantenu Jha, Geoffrey Fox, Piotr Luszczek, Baixi Sun, Vikram Jadhao, Gregor von Laszewski, and Margaret Lentz from DOE

Preparations/drafts: Nov 28 2022 DOE Project Review Preparations

The presentations as actually delivered have, on the first slide, links to the individual presentations in the order:

  • Virginia
  • Tennessee
  • Argonne
  • Rutgers
  • Indiana

Margaret emphasized the need for continued interaction and we scheduled the next meeting with Margaret on January 30, 2023.

15 - Meeting Notes 10-31-2022

Meeting Notes from 10-31-2022

Minutes of SBI-FAIR October 31, 2022, Meeting

Present: Kamil Iskra, Xiaodong Yu, Peter Beckman, Deborah Penchoff, Shantenu Jha, Geoffrey Fox, Piotr Luszczek, Baixi Sun, Vikram Jadhao, Gregor von Laszewski

Updates

Virginia

Geoffrey discussed

  • The transfer of the DOE grant is completed
  • The Tsunami surrogate (see last meeting) is finished while the diffusion-based surrogate is still being finalized
    • Rough draft of the diffusion model for cell simulations, “Generalization and Transfer Learning in a Deep Diffusion Surrogate for Mechanistic Real-World Simulations.” Interesting is the study of dataset sizes of 5,000-400,000 and the importance of dealing with the large numeric range in computed values
  • We discussed Margaret Lentz’s request for a project presentation
    • Draft after SC22 with final presentation November 28 1-2 pm finalized with Margaret
    • Some integrating slides and then 4-6 from each team covering past work, remaining work in the grant, and what to do after the grant
    • Pete reminded us not to forget FAIR!
    • Geoffrey will make a plan

Argonne

  • Their VLDB 2023 paper, “SOLAR: A Highly Optimized Data Loading Framework for Training CNN-based Scientific Surrogates,” was discussed
  • This paper looks at the training of 3 surrogates and addresses the overhead of disk I/O access, which dominates the performance
  • They compare with the PyTorch DataLoader and the NoPFS paper ([2101.08734] Clairvoyant Prefetching for Distributed Machine Learning I/O) from Torsten Hoefler at the last SC meeting, which does optimized prefetching
  • The shuffle is optimized to minimize redistribution, and this leads to an improvement factor of 3.5 over NoPFS and 24 over the default PyTorch loader

Tennessee

Piotr reported that Cade Brown has left and they are hiring a replacement.

Rutgers

Shantenu reported

  • Their team had identified 6 categories of AI enhancing HPC and was studying performance
  • He returned to the topic of large language models (LLMs), which can be effective in chemistry

Indiana University

Vikram reported that

  • They were continuing the study of accuracy and robustness, as last time, as well as:
  • Dataset size
  • Ensemble issues
  • Definition of speedup

16 - Meeting Notes 09-26-2022

Meeting Notes from 09-26-2022

Minutes of SBI-FAIR September 26, 2022, Meeting

Present: Kamil Iskra, Xiaodong Yu, Deborah Penchoff, Shantenu Jha, Geoffrey Fox, Piotr Luszczek, Baixi Sun, Vikram Jadhao, Gregor von Laszewski

Updates

Virginia

Geoffrey discussed

  • The transfer of the DOE grant is still making progress
  • He noted two nearly completed new surrogates
    • paper on Tsunami simulation surrogates entitled “Forecasting tsunami inundation with convolutional neural networks for a potential Cascadia Subduction Zone rupture”
    • Rough draft of the diffusion model for cell simulations, “Generalization and Transfer Learning in a Deep Diffusion Surrogate for Mechanistic Real-World Simulations.” Interesting is the study of dataset sizes of 5,000-400,000 and the importance of dealing with the large numeric range in computed values
  • He summarized the MLCommons status with the move to continuous (rolling) submissions rather than fixed date submissions

Indiana University

  • Vikram presented some of his recent work
  • He studied sensitivity to the input training set, showing some dramatic effects from seemingly small changes, such as removing one value of the electrolyte concentration c

Tennessee

Piotr reported

  • There was a Data Challenge at the Smoky Mountains meeting with a smaller version of the CloudMask dataset from MLCommons: 2022 Challenge 6: SMCEFR: Sentinel-3 Satellite Dataset « SMC Data Challenge 2021
  • Two submitted papers: one on a performance surrogate and the other a SABATH paper at the IEEE HPEC conference (26th Annual IEEE High Performance Extreme Computing Virtual Conference, 19-23 September 2022)
    • paper and presentation Deep Gaussian process with multitask and transfer learning for performance optimization
  • Questions included reproducibility and overheads from using FAIR metadata
  • It was asked if SABATH recorded training time; it does record loss versus epoch number.
  • Tennessee will give a detailed presentation on SABATH next time.

Rutgers

Shantenu reported

  • Drug and Quantum surrogates
  • He noted a new DOE $25M award for climate surrogates revisiting the startling Oxford paper https://iopscience.iop.org/article/10.1088/2632-2153/ac3ffa/meta and https://arxiv.org/pdf/2001.08055v1
  • Work with Indiana University was continuing with efforts to get system running on Summit
  • There was a discussion of large language models (LLMs) and DOE interest in using them on scientific literature. A challenge is the current $10-100 million computing cost of training, which could possibly reach a billion dollars.

Argonne

  • Xiaodong Yu discussed the ASPLOS paper which was unfortunately rejected
  • Baixi presented their results commenting on referee remarks
  • One question prompted observation that surrogate MODEL sizes are comparatively small
  • Another question was answered by noting that scheduling was a one-time cost
  • In some cases their custom training order outperformed the baseline training

17 - Meeting Notes 08-15-2022

Meeting Notes from 08-15-2022

Minutes of SBI-FAIR August 15, 2022, Meeting

Present: Kamil Iskra, Xiaodong Yu, Deborah Penchoff, Shantenu Jha, Geoffrey Fox, Piotr Luszczek, Baixi Sun.

Apologies: Vikram Jadhao

Updates

Virginia

Geoffrey discussed

  • The transfer of the DOE grant is making progress
  • He is continuing his study of Foundation models by collecting common applications using similar deep learning systems
  • He summarized the MLCommons status answering some questions noting that MLCommons collects surrogates and non-surrogate benchmarks
    • Geoffrey will send Shantenu notice about MLCommons meetings

Gregor

  • Contacted Rutgers for help, but due to staff changes that effort was shifted to the Summit support team. Activity in progress.

Rutgers

Shantenu reported

  • Work with Indiana University was delayed as JCS Kadupitiya has graduated from IU and was hired by Microsoft
  • Improving the AI for Science chapter with AI-linked workflows and performance results for a new publication

Argonne

  • Xiaodong Yu discussed the ASPLOS paper and will send an improved version in 2 weeks
  • There are performance issues addressed with microbenchmarks
  • Baixi presented their results optimized over epoch and batch
  • This does not change results much even though the update order is different
  • Scheduling by access performance or by load balance
  • Speedups of 4.2 to 5.8 on up to 64 processes
  • Looking at scalability
  • Other surrogates are AutoPhaseNN and BraggNN

Indiana University

Reported by email

  • Starting Fall 2022, a new PhD student Fanbo Sun and a new postdoc Wenhui Li will work 50% on this project. Postdoc starts Sep 1.
  • Soft labels: Continuing to explore the soft labels idea and how it reduces training set sizes. Planning a submission sometime this year. One paper submitted last year on this topic is still under review.
  • Time series surrogate: With the postdoc, we will be working to extend the RNN operator to tackle NVT ensemble and larger number of particles.

Tennessee

Piotr reported

  • Cade will come back plus a new Ph.D. student
  • Two Submitted papers: one on Performance Surrogate and the other a SABATH paper
  • Third paper to Data Challenge

18 - Meeting Notes 06-27-2022

Meeting Notes from 06-27-2022

Minutes of SBI-FAIR June 27, 2022, Meeting

Present: Kamil Iskra, Deborah Penchoff, Vikram Jadhao, Shantenu Jha, Geoffrey Fox, Piotr Luszczek, Baixi Sun, Gregor von Laszewski

Updates

Virginia

Tennessee

  • SABATH software
  • MLCommons paper at ISC; Piotr Luszczek attended and did not get Covid. BOF presentation from Piotr, and an H3 workshop report from Jeyan Thiyagalingam.

Rutgers

  • Vincent Pascuzzi has a prototype software system running with JCS Kadupitiya
  • Davis DOE AI meeting is July 26-28
  • Train Foundation models
  • Performance of workflow
  • Omniverse

**Indiana**

  • Hiring a postdoc now that JCS Kadupitiya has graduated and been hired by Microsoft
  • Soft label paper progressing
  • Using Tensorflow for simulation

Argonne

  • Kamil Iskra described publication plan of a paper to ASPLOS and poster to SC
  • Baixi noted June 30 abstract deadline and gave the presentation
  • 1.3 TB dataset
  • I/O takes ~81% when run on 8 nodes and 64 GPUs on ThetaGPU
  • Clumping data and load balancing to decrease load time gives a 2.16x speedup
  • Use memory, not SSD, for storage
  • Gregor suggested compressing data in shared memory
  • Global arrays and RDMA

19 - Meeting Notes 05-23-2022

Meeting Notes from 05-23-2022

Minutes of SBI-FAIR May 23, 2022, Meeting

Present: Kamil Iskra, Deborah Penchoff, Vikram Jadhao, Shantenu Jha, Geoffrey Fox, Xiaodong Yu, Piotr Luszczek, Baixi Sun, Gregor von Laszewski

Updates

Virginia

  • Geoffrey described substantial progress with the Science working group of MLCommons, which should reach first base on June 1 at an ISC BOF
  • The diffusion equation surrogate work with Javier Toledo and James Glazier is being written up.
  • He also commented on Argonne shuffling performance and use of Big Data collective shuffle primitives that work on disk and memory.

Tennessee

  • Cade Brown is on internship with NVIDIA
  • Piotr gave a presentation describing the good progress with the SABATH system introduced by Cade last month.
  • SABATH is now available with two applications
    • Keras MNIST
    • Cloudmask-0 extended from work of UK group of Tony Hey
  • SABATH would cache data locally
  • Tensorboard visualization support was described
  • Plans: add PyTorch support to the current TensorFlow support, plus new applications.

Rutgers

  • Meeting with the Indiana group (Vikram) on adaptive training

**Indiana**

  • Working with Rutgers to agree with last bullet!
  • Devising strategy to minimize needed training size
  • JCS Kadupitiya in Vikram’s group got his Ph.D. and the Luddy outstanding research award. He is off to work for Microsoft.

Argonne

  • Baixi gave the Argonne presentation after introduction by Xiaodong
  • They are debating between HDF5 or Binary storage
  • Changing the I/O middleware to be based on parallel HDF5
  • Test done on 16 GPUs corresponding to 2 nodes
  • Execution time does not depend much on batch size. Geoffrey suggested this indicates the GPUs are not fully utilized, so smaller computations do not exploit all the internal GPU parallelism
  • Baixi reviewed the problems with a shuffle being needed every epoch, and the challenge when the data are too large to fit in memory and must reside on disk (small datasets fit in memory)
  • The Lustre file system used is bad for small randomly accessed files; typically each image is one file
  • The load is mainly reads with some writes
  • The shufflings are all precalculated, and the redistribution needed (MPI scatter/gather collectives) can be represented as a graph, which is imbalanced
  • Computation and Data movement are traded off with heuristic solution near to the true minimum
  • Parallel HDF5 (using MPI-IO) supports multiple MPI processes
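
The precalculated shuffles and the redistribution graph discussed above can be sketched as a communication matrix: for one epoch’s permutation, count how many samples each rank must send to each other rank. The block partitioning and names here are assumptions for illustration, not the actual Argonne code:

```python
import random
from collections import Counter

def comm_matrix(n_items, n_ranks, seed):
    """For one precalculated epoch shuffle, count samples each rank must
    send to each other rank, assuming a block partition across ranks."""
    rng = random.Random(seed)
    perm = list(range(n_items))
    rng.shuffle(perm)                      # the precalculated shuffle
    block = n_items // n_ranks

    def owner(idx):                        # rank holding global index idx
        return min(idx // block, n_ranks - 1)

    sends = Counter()
    for new_pos, item in enumerate(perm):
        src, dst = owner(item), owner(new_pos)
        if src != dst:                     # off-rank move requires communication
            sends[(src, dst)] += 1
    return sends                           # edge weights of the redistribution graph
```

The resulting edge weights are typically imbalanced, which is what motivates trading computation against data movement with a heuristic close to the true minimum.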

20 - Meeting Notes 04-25-2022

Meeting Notes from 04-25-2022

Minutes of SBI-FAIR April 25, 2022, Meeting

Present: Kamil Iskra, Deborah Penchoff, Vikram Jadhao, Shantenu Jha, Geoffrey Fox, Xiaodong Yu, Piotr Luszczek, Cade Brown, Baixi Sun, Jack Dongarra

Updates

Virginia

  • Discussed continued work on diffusion surrogate with Glazier and Javier Toledo (Edmonton)
  • Discussed Fusion surrogate benchmark from Lawrence Livermore

Tennessee

  • Cade Brown presented an update
  • Discussed Sentinel 3 benchmark based on UK Cloudmask from MLCommons
  • Then discussed the FAIR benchmark platform SLIP, which has been extended to become SABATH
  • Described report structure
    • Model format - how universal is this
  • Has done UK cloudmask and looked at TEvol (2 MLCommons benchmarks)
  • Deal with Jupyter notebooks with nbconvert
  • Add callbacks to model.fit
  • How to do FAIR
  • Use Json
  • Relation to SciML-Bench GitHub - stfc-sciml/sciml-bench: SciML Benchmarking Suite for AI for Science and MLCube from MLCommons

Rutgers

**Indiana **

Argonne

  • Baixi presentation
  • Described distributed training shuffling problem as a graph
  • Cost of training has large data loading time
  • Studied increasing standard deviation/mean by redistribution over nodes
  • Address imbalanced data loading by moving compute tasks to other nodes
  • Note large compute variance over GPUs even with batch size fixed, which seems surprising; why are some GPUs slow?

21 - Meeting Notes 03-19-2022

Meeting Notes from 03-19-2022

Minutes of SBI-FAIR March 19, 2022, Meeting

  • Present: Kamil Iskra, Vikram Jadhao, Shantenu Jha, Geoffrey Fox, Xiaodong Yu, Piotr Luszczek, Cade Brown, Baixi Sun, Gregor von Laszewski

Updates

Rutgers

A postdoc left unexpectedly and so the surrogate classification work was delayed. The integration of Rutgers software into Vikram’s work is proceeding and will be tested with a Summit allocation.

Indiana

Vikram discussed a surrogate paper accepted by the Machine Learning: Science and Technology journal https://doi.org/10.1088/2632-2153/ac5f60. This evolves a modest collection of particles in, for example, the Lennard-Jones potential, obtaining good results with time steps 4,000 times those of classic solvers. He also presented at multiple APS sessions. He noted other work using Tensorflow to perform simulations, a collaboration with another Indiana Engineering faculty member.

Virginia

Gregor presented on the status of the MLCommons benchmark, stressing the difficulties in reconciling GitHub and Jupyter notebooks. Geoffrey noted that these were not quite what you wanted as a scientific electronic notebook, as they didn’t support sharing of modified versions or the management of multiple Jupyter notebooks. For example, this project produced 450 notebooks, and it is not even easy to search them, as traditional Google search fails on notebooks.

Gregor also discussed timing tools

Tennessee

Piotr described progress in integrating MLCommons ontologies into the FAIR metadata system. He also noted problems in defining how to run SciML benchmarks with Horovod. Tennessee also submitted a challenge to the Smoky Mountain conference based on Satellite images generalizing the SciML CloudMask benchmark

Argonne National Laboratory

Xiaodong introduced the Argonne study of shared I/O. The need for global shuffling at each epoch is potentially an I/O problem, but their approach gave almost a factor of 10 improvement (11.4 seconds reduced to roughly 1 second).

Baixi gave a detailed discussion with his usual excellent presentation.

Geoffrey and Gregor noted the practical challenge of I/O in University shared file systems with both the Earthquake code and an examination of a regular MLPerf benchmark where cloud I/O was much faster than the academic shared file system. The latter problem can be addressed by copying to local disks. Execution from those is a little faster than the cloud numbers.

22 - Meeting Notes 02-14-2022

Meeting Notes from 02-14-2022

Minutes of SBI-FAIR February 14 2022 Meeting

  • Present: Kamil Iskra, Vikram Jadhao, Geoffrey Fox, Deborah Penchoff, Xiaodong Yu, Piotr Luszczek, Cade Brown, Baixi Sun, Gregor von Laszewski

Updates

Tennessee

A new team member Cade Brown gave a fascinating talk CadeBrown-notes-SBI_Schema. Cade Brown is a new ICL student tasked with designing a schema and tooling for installing, running, and benchmarking ML models. He showed examples using MLCommons Science benchmarks CloudMask and STEMDL. There will be a public website from which you can search models, datasets, and results and run examples. He discussed use of JSON rather than XML and the use of Google’s Firebase JSON database tool. There was a discussion of the sustainability of Firebase (as you need to pay) and the use of containers.

Geoffrey noted synergy with the MLCommons Science Data working group Science Working Group | MLCommons, the Research Data Alliance, and Christine Kirkpatrick

Argonne National Laboratory

Argonne described the continued work on understanding the performance of distributed training already discussed in the last four meetings. Today’s discussion focussed on I/O and included a talk by Baixi, which as always was very informative. I/O is a major bottleneck, alleviated by caching in either SSD and/or CPU memory. There is a plan for a parallel I/O and HDF5 paper at SC22. The Hoefler paper at SC21, Clairvoyant prefetching for distributed machine learning I/O | Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, has a simulator that ANL used in this analysis. Shuffling is a major difficulty as it requires access to all the data. There is a fast local version, but it is not as good an algorithm as the usual global shuffle. Currently the dataset is 22 GB, but it can increase.
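
The trade-off noted above between the fast local shuffle and the statistically stronger global shuffle can be made concrete with a few lines of Python (shard size and names are illustrative assumptions, not the ANL implementation):

```python
import random

def global_shuffle(items, seed):
    """Shuffle the whole dataset: statistically ideal, but needs access to all data."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out

def local_shuffle(items, shard_size, seed):
    """Shuffle only within fixed-size shards: cheap, since each worker only
    touches its cached shard, but weaker because samples never cross shards."""
    rng = random.Random(seed)
    out = []
    for i in range(0, len(items), shard_size):
        shard = list(items[i:i + shard_size])
        rng.shuffle(shard)
        out.extend(shard)
    return out
```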

Indiana

Vikram reported that his surrogate was ready to deploy and that he has received a Summit allocation to support its training. He had met with Shantenu. He sent Cade Brown a couple of links to a repository that hosts their ML surrogate model and the simulation code used to generate datasets to train and test this model. Hopefully, this surrogate can serve as a test model for the system he is building.

https://github.com/softmaterialslab/nanoconfinement-md/tree/master/python

https://github.com/softmaterialslab/nanoconfinement-md/

You can see the surrogate in action, by launching the tool:

https://nanohub.org/tools/nanoconfinement/

Virginia

Progress continues with the surrogate for the diffusion solver. We are writing a second paper on this. Gregor discussed progress with compression.

23 - Meeting Notes 01-10-2022

Meeting Notes from 01-10-2022

Minutes of SBI-FAIR January 10 2022 Meeting

Present: Kamil Iskra, Vikram Jadhao, Geoffrey Fox, Deborah Penchoff, Xiaodong Yu, Jack Dongarra, Shantenu Jha, Piotr Luszczek, Baixi Sun, Gregor von Laszewski

Updates

Tennessee

Piotr reported UTK’s continued progress with the FAIR technology in his presentation, with a discussion of the ontology needed for SciML and extensions to MLCommons. The choice of YAML versus XML and TOML was discussed. A discussion between Piotr and Gregor indicated that the YAML format is not sufficient to encode the surrogate and the hardware used for it. An alternative was discussed where one encodes endpoints in the YAML, and these endpoints have the detailed metadata/schema. This is natural in examples that use PyTorch or Tensorflow, which could have customized sub-ontologies. Gregor suggested circulating an example to identify whether YAML would nevertheless be good enough. The performance surrogate is running on Summit.

Argonne

Argonne described the continued work on understanding the performance of distributed training already discussed in the last three meetings, with the two distributed training approaches, Horovod and the Mirrored Strategy, for the PtychoNN surrogate. Baixi presented new slides. They are using the latest model from the PtychoNN team and testing the two approaches on the large diffraction and real-space data. Horovod did better on 4 and 8 GPUs; Mirrored on 1 and 2 GPUs. They implemented PyTorch DDP to profile and analyze the performance.

Rutgers

  • Continued discussion from last time on work with Vikram on software
  • Progress on the quantum computing surrogate with Ian Foster
  • Shantenu also updated work on categorizing surrogates.

Indiana

Vikram reported an update on the time series molecular dynamics surrogate, although not yet using the soft-label (adding in simulation errors) optimization.

Virginia

Geoffrey was distracted by the poor performance of his home internet (now corrected) and did not report solid progress on his diffusion equation solver

24 - Meeting Notes 10-21-2021

Meeting Notes from 10-21-2021

Minutes of SBI-FAIR October 25 2021 Meeting

Present: Kamil Iskra, Vikram Jadhao, Geoffrey Fox, Deborah Penchoff, Xiaodong Yu, Jack Dongarra, Shantenu Jha, Piotr Luszczek, Baixi Sun, Gregor von Laszewski

Updates

Tennessee

Piotr reported that paper submitted to IPDPS; and metadata (FAIR) work is continuing

Virginia

Geoffrey has summarized 4 possible MLCommons Science Datasets that could be useful for FAIR studies. See recent Argonne preprint

Indiana

Vikram Jadhao described his new surrogate paper [2110.14714] Designing Machine Learning Surrogates using Outputs of Molecular Dynamics Simulations as Soft Labels and quoting from abstract “Here, we show that statistical uncertainties associated with the outputs of molecular dynamics simulations can be utilized to train artificial neural networks and design machine learning surrogates with higher accuracy and generalizability. We design soft labels for the simulation outputs by incorporating the uncertainties in the estimated average output quantities and introduce a modified loss function that leverages these soft labels during training to significantly reduce the surrogate prediction error for input systems in the unseen test data. The approach is illustrated with the design of a surrogate for molecular dynamics simulations of confined electrolytes to predict the complex relationship between the input electrolyte attributes and the output ionic structure. The surrogate predictions for the ionic density profiles show excellent agreement with the ground truth results produced using molecular dynamics simulations.”
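
The soft-label idea in the abstract above, using simulation uncertainties during training, can be illustrated with an uncertainty-weighted squared error. The paper’s actual loss construction differs in detail, so this is only a sketch:

```python
def soft_label_loss(preds, means, stds, eps=1e-8):
    """Squared error where each target is weighted by the inverse of its
    simulation uncertainty, so noisier outputs penalize the model less.
    Illustrative only; not the loss from the paper."""
    total = 0.0
    for p, m, s in zip(preds, means, stds):
        weight = 1.0 / (s * s + eps)       # low uncertainty => high weight
        total += weight * (p - m) ** 2
    return total / len(preds)
```

An error on a precisely known target should then cost more than the same error on an uncertain one.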

Rutgers

  • Collaboration with Vikram has started
  • Classification of surrogates introduced 6 classes and analyzed many new papers
  • Gordon Bell submission involved Caltech + DOE Labs + San Diego and used surrogates at multiple levels – it studied how to balance effort between them. The application concerned Delta Covid.

Argonne

Kamil and Xiaodong described the continued work on understanding the performance of distributed training already introduced last month. Baixi gave the presentation. Next month will see a new dataset and new results.

Hyperparameters were tuned for ptychoNN surrogate on Horovod and the Mirrored Strategy.

The current approach is synchronous but will look at asynchronous methods.

We agreed on the next meeting date November 29.

25 - Meeting Notes 09-27-2021

Meeting Notes from 09-27-2021

Minutes of SBI-FAIR September 27 2021 Meeting

Present: Kamil Iskra, Vikram Jadhao, Geoffrey Fox, Deborah Penchoff, Xiaodong Yu, Jack Dongarra, Shantenu Jha, Piotr Luszczek, Pete Beckman, Baixi Sun, Gregor von Laszewski

Updates

Indiana/Virginia

Vikram has a new surrogate and is finalizing a paper on it. He will talk to Shantenu soon.

Rutgers

Shantenu was affected by a hurricane

  1. Developing 3-layer simulations with a surrogate at each level
  2. ML-driven HPC motifs/patterns identified in research, to be reported at the November meeting
    1. The DeepDriveMD ensemble is one example
    2. Climate science simulations give surrogates that select the best simulation
    3. Links with observational data are seen in climate, materials, and biomolecular science

University of Tennessee

  1. Workshop in April 4-7 2022 at UTK
  2. Performance surrogate paper to IPDPS; excellent speedup but not 2 billion
  3. FAIR ontologies will resume after this paper

Argonne

  1. Yu introduced their GPU scheduling work and an investigation of the scalability of surrogate model training
  2. Baixi Sun gave a detailed presentation on Distributed Training On PtychoNN
    1. Utilized the Horovod framework on ptychoNN model.
    2. Tested the Horovod performance for different number of GPUs on single node and multiple nodes using Ring All-Reduce
    3. Tried Mirrored Strategy framework on ptychoNN model.
    4. Tested the performance for different number of GPUs on single node.
    5. Debugging of the Mirrored Strategy framework for distributed training.
    6. Presented performance numbers with MNIST and ptychoNN
    7. Updated our versions of code on our gitlab repository and wiki documentation.
  3. Links for more details:
    • Official Horovod documentation: Horovod with Keras — Horovod documentation
    • ThetaGPU Horovod tutorial: Distributed training on ThetaGPU using data parallelism | Argonne Leadership Computing Facility
    • Official Mirrored Strategy documentation: Multi-GPU and distributed training (section “Single-host, multi-device synchronous training”)
    • The code run on ThetaGPU is currently in our private GitLab repository: https://gitlab.com/SBI-HPC/benchmark_suite/-/tree/main/ptychography (for Mirrored Strategy the latest stable version is still being debugged and has not been committed yet)
    • Guidance for using this code on ThetaGPU is written in the GitLab wiki: https://gitlab.com/SBI-HPC/benchmark_suite/-/wikis/PtychoNN-Distributed-Training-on-ThetaGPU
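
Logically, the synchronous data-parallel training in the steps above comes down to every worker receiving the average of all workers’ gradients each step; Horovod’s ring all-reduce computes exactly this, only with far less communication. A framework-free sketch of the result (not the ring algorithm itself):

```python
def allreduce_average(grads_per_worker):
    """Element-wise average of every worker's gradient vector, delivered
    back to every worker -- the logical result of a ring all-reduce."""
    n_workers = len(grads_per_worker)
    length = len(grads_per_worker[0])
    avg = [sum(g[i] for g in grads_per_worker) / n_workers
           for i in range(length)]
    return [list(avg) for _ in range(n_workers)]   # one identical copy per worker
```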

26 - Meeting Notes 08-30-2021

Meeting Notes from 08-30-2021

Minutes of Meeting August 30, 2021

Present: Kamil Iskra, Vikram Jadhao, Geoffrey Fox, Deborah Penchoff, Xiaodong Yu, Jack Dongarra, Shantenu Jha, Piotr Luszczek, Pete Beckman, Baixi Sun

Updates

  • Rutgers: Progress despite recruiting problems. Highlighted a new paper https://doi.org/10.1021/acs.jcim.8b00839 on molecular benchmarks from Benevolent AI, a company in London (Peter Coveney): GuacaMol: Benchmarking Models for De Novo Molecular Design.
  • Tennessee continues work on the performance surrogate model: tune hyperparameters, build from small runs, report in October. Works on simulations or data analytics. Unlike ATLAS, it is aimed at problems whose runs take a long time
  • **Argonne.** Pete noted by email a new paper Why AI is Harder Than We Think with a cautionary tale.
    • Baixi Sun from Washington State University was introduced as a new student on project
    • Xiaodong discussed their 3 use cases. Convert notebooks to python scripts and run in multinode fashion
    • Using ALCF the first usage mode is based on Jupyter notebooks and second usage mode is batch
    • ALCF likes Jupyter notebooks. Also note Jupyter notebooks at ORNL
  • Indiana/Virginia. Vikram Jadhao presented on surrogates for soft materials
    • This reviewed highlights from the field and then focussed on his work
    • Word surrogate not often used in field
    • The review covered SorbNet from Minnesota, ab initio simulation from Toronto, and the pair correlation function of liquids from the UIUC group of Aluru
    • Vikram’s application was confined electrolytes where surrogate relates structure to attributes
    • Good use in education using nanoHUB deployment
    • Nice performance slide
    • Extended predictions were not as good as the original ones
    • Need to quantify and improve accuracy; how? Accuracy is averaged over all quantities but is worse near the wall. Those points could be weighted more in the loss
      • It is common in surrogates that error is dominated by “special” regions (boundaries, singularities, etc.), as in the work of Geoffrey with James Glazier on the diffusion equation for cell modelling.
    • Look at reducing needed training size
    • Will evaluate using Rutgers software infrastructure
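
The suggestion above of weighting near-wall points more heavily in the loss could look like the following; the weighting scheme is hypothetical, not something presented in the meeting:

```python
def wall_weighted_mse(preds, targets, positions, wall=0.0, scale=1.0):
    """Weighted MSE that up-weights points near a boundary (the 'wall'),
    where surrogate error tends to concentrate.  Hypothetical weighting."""
    total, weight_sum = 0.0, 0.0
    for p, t, x in zip(preds, targets, positions):
        w = 1.0 + scale / (abs(x - wall) + 1.0)   # larger weight near the wall
        total += w * (p - t) ** 2
        weight_sum += w
    return total / weight_sum
```

With this weighting, an error of a given size near the wall raises the loss more than the same error far from it.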

27 - Meeting Notes 07-26-2021

Meeting Notes from 07-26-2021

Minutes of Meeting July 26, 2021

Shantenu led a discussion of surrogates, noting his work was delayed by the loss of a postdoc. Shantenu divided surrogates into 3 areas

Shantenu presented PY2 and PY3 plans

In PY2 primary goals are:

  • (mini-)Review of surrogates in HPC – Volunteers? See later
  • Formalizing Performance measures (MLinHPC)
    • Three scenarios discussed above: Climate, Docking, Potentials
  • Experimenting with Performance (MLoutHPC)
    • Use DeepDriveMD to support different surrogates (Table 1) for common physical model (system)

In PY3

  • tackle (more) complex problem of MLoutHPC

AlphaFold2 (Google DeepMind) and RoseTTAFold (the Baker lab at Washington) have BOTH been released; see the news item “DeepMind’s AI for protein structure is coming to the masses”.

CASP said protein folding was solved by AlphaFold2, but RoseTTAFold is cheaper and as good as AlphaFold2. This could be an opportunity

Beckman noted we see a science transformation using FAIR Methodology.

Rick Stevens has challenged “How much did Go AI cost”

Dataset size is a serious issue.

  • deepmind/alphafold (open source code for AlphaFold) notes: “The total download size for the full databases is around 415 GB and the total size when unzipped is 2.2 TB. Please make sure you have a large enough hard drive space, bandwidth and time to download. We recommend using an SSD for better genetic search performance.”
  • Hurricane simulation will become inference
  • DOE strategy: train while leaving data where it is, similar to medical federated learning
  • Vikram noted that material science led to smaller datasets as just output final results and not the full trajectory

We discussed having a session at The Argonne Training Program on Extreme-Scale Computing (ATPESC) in 2022

Next month we will consider Implications for the project. Vikram and Shantenu volunteered

28 - Meeting Notes 06-29-2021

Meeting Notes from 06-29-2021

Minutes of Meeting June 29, 2021

Annual Report

This meeting focussed on getting the final version of the DOE annual report which was submitted the following day by each institution.

Next Meeting

Our meetings are 1 pm Eastern on the 4th Monday of each month

This implies Monday, July 26, 1 pm at zoom https://iu.zoom.us/j/2301429329

In the July meeting, Shantenu Jha will lead a discussion of surrogates, postponed from June

29 - Meeting Notes 05-24-2021

Meeting Notes from 05-24-2021

Minutes of Meeting May 24, 2021

Links for Today’s Meeting

Powerpoint of Argonne Talk 2021-05-SBI-ANL.pptx

PDF of Argonne Talk 2021-05-SBI-ANL.pdf

Present

Argonne: Min Si, Xiaodong Yu

**Indiana:** Geoffrey Fox, Vikram Jadhao, Gregor von Laszewski

Rutgers: Shantenu Jha

UTK: Jack Dongarra, Piotr Luszczek

Argonne Presentation

Xiaodong Yu described 3 surrogates being developed at Argonne

Application 1: **PtychoNN: Ptychographic Imaging Phase Reconstruction**

Here the challenge is to determine phases from X-ray scattering data (see the paper). The surrogate is being extended to run using Horovod on the multi-GPU ThetaGPU system.

Application 2: Geophysical Forecasting

This involves LSTM forecast models combined with a neural architecture search (NAS) using DeepHyper, described in the original paper, which ran on Theta without GPUs.

Application 3: Molecular dynamics (MD) simulation

This is multiscale modeling of SARS-CoV-2 in the CANDLE project which received the 2020 ACM Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research.

Shantenu Jha was a co-author on their paper “AI-Driven Multiscale Simulations Illuminate Mechanisms of SARS-CoV-2 Spike Dynamics”.

Other Business We discussed adding material to the website.

Annual Report

We just received the request from DOE for an annual report, abstracted below. We could discuss (unfortunately it is due before our next meeting) a common text that we could use as part of each report.

The Office of Advanced Scientific Computing Research (ASCR) within the Department of Energy Office of Science requests that you submit a Progress Report for the award listed below. To create and submit the Progress Report, please use the DOE Office of Science Portfolio Analysis and Management System (PAMS).

Task: Submit Progress Report (Link)

Due Date: 06/24/2021 5:00 PM ET

Reporting Period: 09/23/2020 - 09/22/2021

Next Meeting

Our meetings are 1 pm Eastern on the 4th Monday of each month

This implies Monday, June 28, 1 pm at zoom https://iu.zoom.us/j/2301429329

In the June meeting, Shantenu Jha will lead a discussion of surrogates.

30 - Meeting Notes 04-19-2021

Meeting Notes from 04-19-2021

Minutes of Meeting April 19, 2021

Links for Today’s Meeting

Updates

  • Argonne postponed their update to the next meeting and the other 3 sites gave updates.
  • Indiana discussed SciMLBench from the UK with its first release and the related MLCommons Science benchmarking. With surrogates, Jadhao will work on the nanoengineering one in the Fall and Fox completed an initial study of a virtual tissue surrogate [2102.05527] Deep learning approaches to surrogates for solving the diffusion equation for mechanistic real-world simulations.
  • Tennessee gave a comprehensive report covering their Surrogate Performance Model for Autotuning; their FK6D / ASGarD · GitLab project, aimed at a later release of SciMLBench; and an insightful analysis of issues and needed ontologies for a FAIR approach to benchmark data. The discussion pointed out that FAIR does not address areas like validation, verification, and reproducibility. Piotr introduced broad categories: hardware, firmware, dataset, software, measurements. We know from MLPerf that I/O specification and measurement are nontrivial. The mode of execution, capability or capacity (high-throughput), needs to be specified. Gregor noted complications from the use of containers, which can hide software versioning. Christine Kirkpatrick’s presentation Advancing AI through MLCommons to the MLCommons Benchmark-Infra WG on April 6 highlighted the tension between the flexibility of free text and FAIR machine readability
  • **Rutgers** Shantenu Jha discussed recent work by his group on computational performance. He pointed out a recent paper by Alexandru Iosup on GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows

Discussion and Action Items

  • We agreed to start two working groups on FAIR (coordinated by Piotr) and Surrogates (coordinated by Shantenu). The scope of both groups was unclear as yet and should be discussed in meetings
  • There was a discussion of access to computers across the collaboration
  • We discussed Surrogate Software and Benchmark software with work of Deep500 (Torsten Hoefler of ETH Zurich), GradeML, MLCube, SciMLBench mentioned. We need to relate it to FAIR
  • We still need to implement SBI repository
  • We agreed in the March meeting to enhance the website with updated (post-proposal) information. Please send your GitHub IDs to Gregor laszewski@gmail.com so he can enable you to directly edit the website
  • Deborah Penchoff of UTK identified a template for DOE annual report. We should accumulate the needed contributions
  • We agreed to set the next meeting for 1-2 pm Eastern May 24 2021 at the usual zoom https://iu.zoom.us/j/2301429329

31 - Meeting Notes 03-23-2021

Meeting Notes from 03-23-2021

Minutes of Meeting March 23 2021

Links for Today’s Meeting

The 4 sites all gave updates with presentations listed above.

Indiana largely discussed work with MLCommons Science research working group

  • Benchmark collection which will eventually include surrogates
  • Benchmark Technology and FAIR metadata

Argonne presented substantial progress with

  • The hiring of a new postdoc Xiaodong Yu with substantial experience
  • Identification of several surrogates including those that don’t work e.g. give insufficient accuracy
  • Use of ThetaGPU

**Tennessee** reported substantial progress with

  • Examination of MLFlow and its metadata which support many storage formats but are missing FAIR features
  • ONNX Open Neural Network Exchange which currently has no science or surrogate examples
  • The N-to-N issues of matching many inputs to many outputs
  • Performance surrogate model for Autotuning work in progress

Rutgers (no presentation) discussed two activities

  • Effective performance where a new student will join.
  • Surrogates corresponding to two Gordon Bell prize winners at SC20 extending from Rutgers work with Argonne (autoencoders for collective coordinates to move through phase space quickly) to the other winner from Princeton where AI learned the complex potential.

Action Items

  • We agreed to set the next meeting for 1-2 pm Eastern on April 19, 2021 at the usual Zoom link https://iu.zoom.us/j/2301429329
  • We agreed to enhance the website with updated (post-proposal) information. Please send your GitHub IDs to Gregor at laszewski@gmail.com so he can enable you to edit the website directly.
  • Shantenu agreed to coordinate a surrogate working group after 4 weeks.
  • Piotr agreed to coordinate cross-institution FAIR activities, including issues of MLCommons metadata and Christine Kirkpatrick's work.
  • Argonne will investigate Yu giving a short presentation.

32 - Meeting Notes 02-20-2021

Meeting Notes from 02-20-2021

University of Tennessee Knoxville

  • Deborah Penchoff joining the team
  • UTK schema
  • MLflow – reproducibility
  • Is training repeatable?
  • Need to have a group on this
  • UTK has its own surrogates for science and performance
  • Storage
  • UQ (uncertainty quantification)
  • Hardware

Rutgers University

  • Performance of surrogates – what does it mean?
  • Gordon Bell prizes
  • DeepDriveMD greatly advanced
  • Working with the Princeton Gordon Bell winners
  • 2 billion paper

Argonne National Laboratory

  • Clear plans

  • CANDLE

  • Paper creates a surrogate how-to – GCF forgets this

  • DOE_FAIR2020-Surrogates

GitHub site infrastructure

  • Website built on GitHub – possibly a Hugo website

  • Form Google group

  • Form working groups:

    • Infrastructure & Benchmarking Tech

    • Metadata/FAIR

    • Surrogates

All meet once a month

33 - Meeting Notes 01-20-2021

Meeting Notes from 01-20-2021

**Indiana University**

Report SBI-Meeting-IU-Jan20-2021

University of Tennessee Knoxville

Report SBI @ UTK 2k21

  • Deborah Penchoff joining the team
  • UTK schema
  • MLflow – reproducibility
  • Is training repeatable?
  • Need to have a group on this
  • UTK has its own surrogates for science and performance
  • Storage
  • UQ (uncertainty quantification)
  • Hardware

Rutgers

**Report** SBI-Rutgers Jan 20-2021

  • Performance of surrogates – what does it mean?
  • Gordon Bell prizes
  • DeepDriveMD greatly advanced
  • Working with the Princeton Gordon Bell winners
  • 2 billion paper

Argonne

Report SBI-Meeting-IU-Jan20-2021

  • Clear plans
  • CANDLE
  • Paper creates a surrogate how-to – GCF forgets this

GitHub site infrastructure

  • Website built on GitHub – possibly a Hugo website
  • Form Google group
  • Form working groups:

    • Infrastructure & Benchmarking Tech

    • Metadata/FAIR

    • Surrogates

All meet once a month