EngiSphere

Digital Twin for High-Performance Computing ⚡ Smarter, Greener, Faster


Discover how a new Digital Twin framework reshapes supercomputer scheduling, energy efficiency, and sustainability through smart simulation.

Published September 3, 2025 by EngiSphere Research Editors
Data Center Digital Twin (High-Performance Computing) © AI Illustration

TL;DR

Recent research introduces S-RAPS, the first supercomputer digital twin framework with integrated scheduling. It enables “what-if” simulations of scheduling policies, ML guidance, and incentive structures to improve energy efficiency, cooling, and job performance before real-world deployment.

The R&D

🌍 Why This Research Matters

Supercomputers are the backbone of modern science — they model climate change, design new materials, and even accelerate AI. But running them efficiently is tricky. Imagine trying to schedule thousands of jobs on a system consuming megawatts of power, all while balancing performance, fairness, and sustainability.

Traditionally, schedulers (the software deciding which job runs when) are tested only after deployment. That means mistakes can waste energy, delay jobs, or strain cooling systems. Enter Digital Twins (DTs) — virtual replicas of physical systems that let researchers experiment in a safe, simulated environment 🖥️.

This research introduces the first Data Center Digital Twin (DCDT) with scheduling built in. It lets scientists run “what-if” experiments on supercomputer job scheduling without touching the actual machine.

🔧 What Did the Researchers Build?

The research team developed Scheduled-RAPS (S-RAPS), an extension of the open-source ExaDigiT digital twin framework.

Here’s what makes it special:

1. Integrated Scheduling in Digital Twins 🗓️
  • Previous DTs simulated cooling and power but ignored schedulers.
  • S-RAPS adds realistic scheduling, letting researchers test how different policies affect performance and energy use.
2. Use of Open Datasets 📊
  • The system was tested on real workload traces from Frontier, Marconi100, Fugaku, Lassen, and Adastra supercomputers.
  • This ensures transparency and reproducibility.
3. External Scheduler Support 🔌

S-RAPS can plug into popular schedulers like ScheduleFlow and FastSim, enabling wide community adoption.

4. Machine Learning for Scheduling 🤖

ML-guided scheduling was tested and outperformed traditional methods under high load, reducing energy spikes and job turnaround times (a minimal sketch of the idea follows this list).

5. Incentive Structures 💰

The system can model reward systems (like “Fugaku points”) to encourage energy-efficient job submissions.
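
The paper tests ML-guided scheduling inside S-RAPS; the sketch below is only a minimal illustration of the idea, assuming a hypothetical predictor of per-job runtime and power (the predict_runtime_and_power stub stands in for a trained model) and ranking queued jobs by predicted energy-delay cost instead of arrival order:

```python
# Minimal sketch of ML-guided scheduling (illustrative; not the authors'
# S-RAPS code). A stub predictor stands in for a trained ML model.
from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    nodes: int           # nodes requested
    walltime_req: float  # requested walltime in seconds

def predict_runtime_and_power(job: Job) -> tuple[float, float]:
    """Stand-in for a trained regressor predicting runtime and power."""
    runtime = 0.7 * job.walltime_req  # jobs often finish under request
    power = 500.0 * job.nodes         # assume ~500 W per node
    return runtime, power

def ml_schedule(queue: list[Job]) -> list[Job]:
    """Order the queue by predicted energy-delay cost instead of FCFS."""
    def cost(job: Job) -> float:
        runtime, power = predict_runtime_and_power(job)
        energy = power * runtime      # predicted joules
        return energy * runtime       # energy-delay product proxy
    return sorted(queue, key=cost)

queue = [Job("a", nodes=64, walltime_req=7200),
         Job("b", nodes=8, walltime_req=1800),
         Job("c", nodes=256, walltime_req=5400)]
print([j.job_id for j in ml_schedule(queue)])  # ['b', 'a', 'c']
```

Under low load such reordering changes little, since any policy quickly drains the queue; this matches the finding below that policies only diverge under high load.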

📊 Key Findings

1. Scheduling Affects Energy Use 🔋

Different scheduling strategies can smooth out or worsen power spikes. For example:

  • Under low load, all policies behave similarly.
  • Under high load, ML-based scheduling reduces power swings and improves efficiency.

2. ML-Guided Scheduling Shines ✨

Compared to first-come-first-serve (FCFS) or priority-based approaches, ML scheduling achieved:

  • Lower average wait time ⏳
  • Better turnaround time 🔄
  • Lower energy-delay product (EDP) ⚡

In short, more science per watt (a toy calculation of these metrics follows finding 4 below).

3. Incentives Shape User Behavior 🎯

By simulating “point systems” (like Fugaku’s), S-RAPS showed how rewarding users for low-power jobs or efficient runs can shift workloads toward greener computing.

4. External Scheduler Integration Works ✅

Even though ScheduleFlow wasn’t optimized for this setup, it successfully interacted with the DT. FastSim ran 688× faster than real-time, showing the potential of ultra-fast forecasting.
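
To make the metrics from finding 2 concrete, here is a toy calculation over made-up job records (illustrative numbers, not results from the paper):

```python
# Toy wait / turnaround / EDP calculation over invented job records.
jobs = [
    # (submit_s, start_s, end_s, avg_power_w)
    (0,    60,   3660, 40_000),
    (30, 3660,   5460, 12_000),
    (45, 5460,  12660, 90_000),
]

waits, turnarounds, edps = [], [], []
for submit, start, end, power in jobs:
    runtime = end - start                  # execution time (s)
    waits.append(start - submit)           # time spent queued (s)
    turnarounds.append(end - submit)       # submit-to-finish (s)
    energy = power * runtime               # joules consumed
    edps.append(energy * runtime)          # energy-delay product (J*s)

print(f"avg wait:       {sum(waits) / len(waits):.0f} s")
print(f"avg turnaround: {sum(turnarounds) / len(turnarounds):.0f} s")
print(f"avg EDP:        {sum(edps) / len(edps):.3e} J*s")
```

A policy that starts jobs sooner and wastes fewer idle cycles lowers all three numbers at once, which is why they are reported together.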

🔭 Future Prospects

The research opens exciting possibilities for next-generation HPC management:

  1. Smarter Predictions with Incomplete Data 📉
    • Right now, accurate predictions require detailed job traces and power profiles.
    • Future work will focus on job fingerprinting and estimation, making DTs useful even without perfect logs.
  2. AI-Driven Energy-Aware Scheduling 🤝
    • With ML models learning from telemetry, schedulers could dynamically balance job performance and sustainability goals.
  3. Designing Next-Gen Data Centers 🏗️
    • DTs let operators prototype cooling strategies, job mixes, and incentive systems before new supercomputers are built.
  4. Sustainability at Scale 🌱
    • By reducing wasted energy and cooling costs, DT-driven scheduling could make HPC more eco-friendly, aligning with global climate goals.

🧭 Why This Is a Breakthrough

Think of this as a flight simulator for supercomputers ✈️. Instead of risking real energy waste or delays, operators can now test scenarios safely:

  • “What if we reward users for submitting smaller, energy-efficient jobs?”
  • “How would switching to ML scheduling affect power costs during peak hours?”
  • “Can we prevent sudden cooling spikes by spreading out workloads?”

All these questions can now be answered virtually, before making real-world changes.

📌 Final Thoughts

This study is the first of its kind to merge scheduling into Data Center Digital Twins. By combining AI, open datasets, and incentive modeling, it empowers HPC centers to be:

  • More efficient (better turnaround, less wasted power)
  • More sustainable (reduced cooling and energy demand)
  • More innovative (safe space for testing radical ideas)

As supercomputers continue to grow in scale and complexity, digital twins will be the key to running them smarter, not just faster.

The future of HPC is not only about more FLOPS (floating point operations per second) — it’s about more FLOPS per watt ⚡🌍.


Terms to Know

Digital Twin (DT) 🖥️ A virtual copy of a physical system (like a supercomputer) that behaves just like the real thing. It lets engineers test “what-if” scenarios safely without touching the actual machine. - More about this concept in the article "Digital Twin for Smart Intersections 🚦 The Future of Traffic Management".

High-Performance Computing (HPC) ⚡ Supercomputing on steroids! These are massively powerful systems designed to crunch huge amounts of data for science, AI, and engineering.

Scheduler 🗓️ Software that decides which job runs where and when on a supercomputer. Think of it like air traffic control but for computing jobs. - More about this concept in the article "Powering AIoT with Purpose 🌱 Meet GreenPod, the Eco-Friendly Kubernetes Scheduler!".

Job Trace / Workload Trace 📊 A record of jobs (programs) that ran on a system, including details like start time, duration, and resource use. Researchers replay these to test new scheduling ideas.
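
For instance, a single trace record might look like the following (field names are illustrative; each of the traces used in the study has its own schema):

```python
# One hypothetical workload-trace record (illustrative field names only).
record = {
    "job_id": "1842673",
    "user": "u1021",
    "nodes": 128,
    "submit_time": "2024-05-01T08:12:44Z",
    "start_time":  "2024-05-01T09:03:10Z",
    "end_time":    "2024-05-01T11:47:52Z",
    "avg_power_w": 61_500,  # not every public trace includes power data
}
```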

Resource Allocator 💻 A manager inside the system that assigns CPUs, GPUs, and memory to jobs so everything runs smoothly.

Backfill Scheduling 🔄 A trick schedulers use to fill small gaps in the system’s timeline with shorter jobs, improving efficiency.
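
A minimal sketch of the idea, loosely inspired by EASY-style backfilling and ignoring many real-world constraints:

```python
# Simplified backfill sketch (illustrative, not a production scheduler):
# while the head-of-queue job waits for its reserved nodes, shorter jobs
# may jump ahead if they finish before the head job's reserved start.
def backfill(queue, free_nodes, head_start_time, now=0):
    """queue: (job_id, nodes, est_runtime_s) tuples behind the head job."""
    started = []
    for job_id, nodes, est_runtime in queue:
        fits = nodes <= free_nodes
        done_in_time = now + est_runtime <= head_start_time
        if fits and done_in_time:   # fills the gap without delaying head
            free_nodes -= nodes
            started.append(job_id)
    return started

# Head job reserved to start at t=3600 s; 100 nodes sit idle until then.
print(backfill([("a", 80, 4000), ("b", 16, 1800), ("c", 32, 3000)],
               free_nodes=100, head_start_time=3600))  # ['b', 'c']
```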

Energy-Delay Product (EDP) ⚡⏳ A metric combining how much energy a job uses and how long it takes. Lower EDP = more efficient computing.
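
As a formula (the standard definition, with toy numbers):

```latex
% Energy-Delay Product: E = energy consumed (J), T = execution time (s)
\mathrm{EDP} = E \cdot T
% Toy example: a job drawing 50 kW for 1000 s consumes
% E = 5\times10^{7}\,\mathrm{J}, so
% \mathrm{EDP} = 5\times10^{7} \cdot 10^{3} = 5\times10^{10}\,\mathrm{J\,s}.
```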

Power Usage Effectiveness (PUE) 🌡️ A measure of data center efficiency. A PUE of 1.0 means all energy goes to computing; higher values mean more energy wasted on cooling or overhead. - More about this concept in the article "🤖💡 AI's Appetite for Energy: Is Your Power Grid Ready?".
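
As a formula (the standard definition, with toy numbers):

```latex
% Power Usage Effectiveness
\mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}}
% Toy example: a facility drawing 1.2 MW to feed 1.0 MW of IT gear has
% \mathrm{PUE} = 1.2 / 1.0 = 1.2, i.e. 20% overhead for cooling and
% power distribution.
```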

Incentive Structures 🎯 Reward systems (like giving “points” for energy-efficient jobs) that encourage users to submit greener workloads. - More about this concept in the article "Transforming Power Grids for EV Charging 🚗 🔋 A Sustainable Revolution".
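
As a purely hypothetical sketch (not Fugaku's actual point formula), such a scheme might reward jobs that run below a per-node power budget:

```python
# Hypothetical incentive scoring (illustrative only): award points in
# proportion to how far a job's measured power stays under its budget.
def incentive_points(nodes: int, runtime_s: float, avg_power_w: float,
                     budget_w_per_node: float = 400.0) -> float:
    budget = nodes * budget_w_per_node            # job's power budget (W)
    saving_fraction = (budget - avg_power_w) / budget
    node_hours = nodes * runtime_s / 3600.0
    return saving_fraction * node_hours           # points scale with usage

# A 100-node, 2-hour job running 25% under budget earns 50 points;
# an over-budget job would earn negative points (a penalty).
print(incentive_points(nodes=100, runtime_s=7200, avg_power_w=30_000))
```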

Machine Learning (ML) for Scheduling 🤖 Using AI models to predict job behavior (like runtime or energy use) and optimize scheduling for both speed and sustainability.


Source: Matthias Maiterth, Wesley H. Brewer, Jaya S. Kuruvella, Arunavo Dey, Tanzima Z. Islam, Kevin Menear, Dmitry Duplyakin, Rashadul Kabir, Tapasya Patki, Terry Jones, Feiyi Wang. HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling. https://doi.org/10.48550/arXiv.2508.20016

From: Oak Ridge National Laboratory; Texas State University; National Renewable Energy Laboratory; Colorado State University; Lawrence Livermore National Laboratory.

© 2025 EngiSphere.com