Recent research introduces S-RAPS, the first digital twin framework with integrated scheduling for supercomputers, enabling “what-if” simulations with AI and incentives to improve energy efficiency, cooling, and job performance before real-world deployment.
Supercomputers are the backbone of modern science — they model climate change, design new materials, and even accelerate AI. But running them efficiently is tricky. Imagine trying to schedule thousands of jobs on a system consuming megawatts of power, all while balancing performance, fairness, and sustainability.
Traditionally, schedulers (the software deciding which job runs when) are tested only after deployment. That means mistakes can waste energy, delay jobs, or strain cooling systems. Enter Digital Twins (DTs) — virtual replicas of physical systems that let researchers experiment in a safe, simulated environment 🖥️.
This research introduces the first Data Center Digital Twin (DCDT) with scheduling built in. It lets scientists run “what-if” experiments on supercomputer job scheduling, without touching the actual machine.
The research team developed Scheduled-RAPS (S-RAPS), an extension of the open-source ExaDigiT digital twin framework.
Here’s what makes it special:
S-RAPS can plug into external scheduling tools like ScheduleFlow and FastSim, enabling wide community adoption.
ML-guided scheduling was tested, and it outperformed traditional methods under high load, reducing energy spikes and job turnaround times.
The system can model reward systems (like “Fugaku points”) to encourage energy-efficient job submissions.
Different scheduling strategies can smooth out or worsen power spikes. Compared to first-come-first-served (FCFS) or priority-based approaches, the ML-guided scheduler delivered smaller energy spikes and shorter job turnaround times under high load. In short: more science per watt.
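The paper does not publish its ML scheduler as a code listing, so the following is only a toy Python sketch of the general idea: use per-job power predictions to keep concurrent power under a cap and flatten spikes. The job fields, the cap value, and the greedy low-power-first rule are all illustrative assumptions, not the paper's method.

```python
# Toy sketch (NOT the paper's actual scheduler): order the queue by
# predicted power draw and admit jobs while staying under a power cap.

def smooth_schedule(jobs, power_cap):
    """Greedily admit jobs so total predicted power never exceeds
    power_cap; jobs that don't fit are deferred to the next round."""
    running, deferred, load = [], [], 0.0
    for job in sorted(jobs, key=lambda j: j["predicted_power"]):
        if load + job["predicted_power"] <= power_cap:
            running.append(job["name"])
            load += job["predicted_power"]
        else:
            deferred.append(job["name"])
    return running, deferred, load

# Illustrative jobs with made-up predicted power draws (kW).
jobs = [
    {"name": "climate",  "predicted_power": 900.0},
    {"name": "md-sim",   "predicted_power": 400.0},
    {"name": "ml-train", "predicted_power": 650.0},
]
running, deferred, load = smooth_schedule(jobs, power_cap=1200.0)
print(running, deferred, load)  # low-power jobs run first, big one waits
```

A real ML-guided scheduler would learn the power predictions from job traces; here they are simply given.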
By simulating “point systems” (like Fugaku’s), S-RAPS showed how rewarding users for low-power jobs or efficient runs can shift workloads toward greener computing.
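As a back-of-the-envelope illustration of such a point system, here is a hypothetical reward function: jobs whose average power stays under a budget earn points in proportion to the headroom. The budget, rate, and formula are invented for illustration and are not Fugaku's actual rules.

```python
# Hypothetical incentive model inspired by Fugaku-style "points".
# All thresholds and rates below are made-up assumptions.

def points_earned(energy_kwh, runtime_h, budget_kw=500.0, rate=10.0):
    """Award points when a job's average power stays under the budget,
    scaled by how far below the budget it lands."""
    avg_power_kw = energy_kwh / runtime_h
    if avg_power_kw <= budget_kw:
        return rate * (budget_kw - avg_power_kw) / budget_kw
    return 0.0

print(points_earned(energy_kwh=800.0, runtime_h=2.0))   # 400 kW avg: 2.0 points
print(points_earned(energy_kwh=1400.0, runtime_h=2.0))  # 700 kW avg: 0.0 points
```

In a digital twin, such a function lets researchers vary the budget and rate and watch how simulated users' workloads shift.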
Even though ScheduleFlow wasn’t optimized for this setup, it successfully interacted with the DT. FastSim ran 688× faster than real-time, showing the potential of ultra-fast forecasting.
The research opens exciting possibilities for next-generation HPC management. Think of it as a flight simulator for supercomputers ✈️. Instead of risking real energy waste or delays, operators can now test scenarios safely: What happens if we swap the scheduling policy? Add an incentive structure? Change the power or cooling setup? All these questions can now be answered virtually, before making real-world changes.
This study is the first of its kind to merge scheduling into Data Center Digital Twins. By combining AI, open datasets, and incentive modeling, it empowers HPC centers to improve energy efficiency, cooling, and job performance before anything touches the production machine.
As supercomputers continue to grow in scale and complexity, digital twins will be the key to running them smarter, not just faster.
The future of HPC is not only about more FLOPS (floating point operations per second) — it’s about more FLOPS per watt ⚡🌍.
Digital Twin (DT) 🖥️ A virtual copy of a physical system (like a supercomputer) that behaves just like the real thing. It lets engineers test “what-if” scenarios safely without touching the actual machine. - More about this concept in the article "Digital Twin for Smart Intersections 🚦 The Future of Traffic Management".
High-Performance Computing (HPC) ⚡ Supercomputing on steroids! These are massively powerful systems designed to crunch huge amounts of data for science, AI, and engineering.
Scheduler 🗓️ Software that decides which job runs where and when on a supercomputer. Think of it like air traffic control but for computing jobs. - More about this concept in the article "Powering AIoT with Purpose 🌱 Meet GreenPod, the Eco-Friendly Kubernetes Scheduler!".
Job Trace / Workload Trace 📊 A record of jobs (programs) that ran on a system, including details like start time, duration, and resource use. Researchers replay these to test new scheduling ideas.
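To make this glossary entry concrete, here is a minimal sketch of replaying a tiny trace through a plain FCFS scheduler (no backfill). The record fields and node counts are assumptions for illustration, not the schema S-RAPS uses.

```python
# Replay a toy workload trace under first-come-first-served scheduling.
import heapq

trace = [
    {"job": "A", "submit": 0, "nodes": 4, "runtime": 10},
    {"job": "B", "submit": 2, "nodes": 2, "runtime": 5},
    {"job": "C", "submit": 3, "nodes": 6, "runtime": 8},
]

def replay_fcfs(trace, total_nodes=8):
    """Start each job in submit order as soon as enough nodes are
    free, reclaiming nodes from finished jobs when we must wait."""
    free = total_nodes
    running = []  # min-heap of (end_time, nodes_held)
    clock = 0
    finish = {}
    for job in sorted(trace, key=lambda j: j["submit"]):
        clock = max(clock, job["submit"])
        while free < job["nodes"]:
            end, nodes = heapq.heappop(running)  # wait for earliest finisher
            clock = max(clock, end)
            free += nodes
        heapq.heappush(running, (clock + job["runtime"], job["nodes"]))
        free -= job["nodes"]
        finish[job["job"]] = clock + job["runtime"]
    return finish

print(replay_fcfs(trace))  # job C must wait for A and B to free nodes
```

Swapping in a different ordering rule and re-running the same trace is exactly the kind of "what-if" experiment the digital twin enables at full scale.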
Resource Allocator 💻 A manager inside the system that assigns CPUs, GPUs, and memory to jobs so everything runs smoothly.
Backfill Scheduling 🔄 A trick schedulers use to fill small gaps in the system’s timeline with shorter jobs, improving efficiency.
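A rough sketch of the backfill test, in the spirit of EASY backfill: a short waiting job may jump ahead only if it fits in the currently free nodes and does not delay the reserved start of the blocked head-of-queue job. The field names and the check itself are simplified assumptions.

```python
# Simplified backfill eligibility check (illustrative, not S-RAPS code).

def can_backfill(short_job, free_nodes, head_job, head_start_time, now):
    """True if the short job fits in the free nodes AND either finishes
    before the head job's reserved start, or leaves it enough nodes."""
    fits = short_job["nodes"] <= free_nodes
    no_delay = (now + short_job["runtime"] <= head_start_time
                or free_nodes - short_job["nodes"] >= head_job["nodes"])
    return fits and no_delay

head = {"nodes": 6, "runtime": 20}    # blocked job, reserved to start at t=5
short = {"nodes": 2, "runtime": 3}    # small job hoping to sneak in

print(can_backfill(short, free_nodes=2, head_job=head,
                   head_start_time=5, now=0))  # True: done at t=3, before t=5
```

At t=4 the same short job would be rejected, since it would run past the head job's reserved start without leaving it enough nodes.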
Energy-Delay Product (EDP) ⚡⏳ A metric combining how much energy a job uses and how long it takes. Lower EDP = more efficient computing.
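The metric itself is a one-liner, following the definition above:

```python
# Energy-Delay Product: energy consumed times wall-clock time.
def edp(energy_joules, runtime_seconds):
    return energy_joules * runtime_seconds

# Two runs of the same job: the faster but hungrier run still wins on EDP.
print(edp(1000.0, 50.0))  # 50000.0
print(edp(1100.0, 40.0))  # 44000.0
```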
Power Usage Effectiveness (PUE) 🌡️ A measure of data center efficiency. A PUE of 1.0 means all energy goes to computing; higher values mean more energy wasted on cooling or overhead. - More about this concept in the article "🤖💡 AI's Appetite for Energy: Is Your Power Grid Ready?".
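Likewise, PUE follows directly from the definition above:

```python
# Power Usage Effectiveness: total facility energy over IT equipment energy.
def pue(total_facility_kwh, it_equipment_kwh):
    return total_facility_kwh / it_equipment_kwh

print(pue(1200.0, 1000.0))  # 1.2: 20% of the energy goes to cooling/overhead
```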
Incentive Structures 🎯 Reward systems (like giving “points” for energy-efficient jobs) that encourage users to submit greener workloads. - More about this concept in the article "Transforming Power Grids for EV Charging 🚗 🔋 A Sustainable Revolution".
Machine Learning (ML) for Scheduling 🤖 Using AI models to predict job behavior (like runtime or energy use) and optimize scheduling for both speed and sustainability.
Source: Matthias Maiterth, Wesley H. Brewer, Jaya S. Kuruvella, Arunavo Dey, Tanzima Z. Islam, Kevin Menear, Dmitry Duplyakin, Rashadul Kabir, Tapasya Patki, Terry Jones, Feiyi Wang. HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling. https://doi.org/10.48550/arXiv.2508.20016
From: Oak Ridge National Laboratory; Texas State University; National Renewable Energy Laboratory; Colorado State University; Lawrence Livermore National Laboratory.