This research introduces X-MAS, a framework that significantly boosts multi-agent system performance by leveraging diverse, specialized large language models (LLMs) instead of relying on a single model for all agent roles.
Imagine this: What if instead of using just one brainy AI model to solve all your problems, you built a team of diverse AI agents, each with their own specialty? That’s exactly what the researchers behind X-MAS propose — and the results are spectacular!
Let’s dive into how mixing different kinds of Large Language Models (LLMs) into a single multi-agent system (MAS) can dramatically improve performance — and why this could be the next big leap in artificial intelligence.
A Multi-Agent System (MAS) is like a team of AIs. Each "agent" has a specific job — one might answer questions, another might double-check answers, a third might combine everyone's ideas, and so on. Together, they tackle complex problems no single AI could handle alone.
Until now, most MAS setups used one type of LLM to power all the agents. It’s like assembling your Avengers team but hiring only clones of Iron Man.
Cool, but limiting.
The research team introduces X-MAS (short for eXpert Multi-Agent Systems). It’s a fresh twist on MAS — instead of using one LLM for everything, X-MAS combines heterogeneous LLMs, meaning each agent gets the model that fits its job best.
It’s like building a tech startup where every hire is a specialist: one engineer who writes the code, one reviewer who catches the bugs, and one manager who merges everyone's work into a shipping product, instead of five copies of the same generalist.
Now we’re talking teamwork!
To see if this mix-and-match strategy really works, the researchers built X-MAS-Bench, a huge testbed to evaluate how 27 LLMs perform across 5 domains: mathematics, coding, science, medicine, and finance.
They looked at 5 crucial agent skills: answering questions (QA), revising answers, aggregating multiple answers into one, planning, and evaluating.
Over 1.7 million tests were run — that’s more than many AIs can count!
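To make the scale concrete, here is a minimal sketch of an X-MAS-Bench-style evaluation grid. All names and the toy scorer are illustrative assumptions, not the paper's actual code: the idea is simply that every model gets a score for every (domain, function) cell.

```python
# Score every model on every (domain, function) pair and store the
# results in a nested dict: grid[model][domain][function] -> accuracy.

DOMAINS = ["math", "coding", "science", "medicine", "finance"]
FUNCTIONS = ["question-answering", "revision", "aggregation",
             "planning", "evaluation"]

def evaluate(model, domain, function):
    """Placeholder scorer: a real benchmark would run the model on a
    test set for this domain/function and return its accuracy.
    Here we return a deterministic toy score so the sketch runs."""
    return sum(map(ord, model + domain + function)) % 100 / 100

def build_score_grid(models):
    return {
        m: {d: {f: evaluate(m, d, f) for f in FUNCTIONS} for d in DOMAINS}
        for m in models
    }

grid = build_score_grid(["model-a", "model-b", "model-c"])
# Each model now has a score in all 25 (domain, function) cells.
print(len(grid["model-a"]["math"]))  # 5 functions per domain
```

With 27 models, 5 domains, 5 functions, and many test items per cell, it is easy to see how the total climbs past a million evaluations.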
Here’s what they found:
Even flawed models can be useful when teamed up strategically — a weak QA agent might still be a great evaluator.
The more diverse the models, the better the results. Adding more types of LLMs improved outcomes across the board.
The team didn’t just test ideas in theory — they created a blueprint called X-MAS-Design, showing how to upgrade existing MAS frameworks into smarter, diverse teams. The recipe is simple: keep the MAS structure as-is, look up each agent function's scores in X-MAS-Bench, and swap each agent's LLM for the model that scores best at that function.
They applied this to 4 MAS systems (including AgentVerse and DyLAN), and every single one performed better with a mix of LLMs.
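The selection step above can be sketched in a few lines. The model names and scores below are made up for illustration; the real system would pull them from X-MAS-Bench:

```python
# X-MAS-Design idea in miniature: given per-function benchmark scores,
# assign each agent function to the model that scores highest on it.

def best_model_per_function(scores):
    """scores: {model_name: {function: accuracy}} for one domain.
    Returns {function: best_model_name}."""
    functions = next(iter(scores.values()))
    return {f: max(scores, key=lambda m: scores[m][f]) for f in functions}

# Toy scores for the math domain (illustrative numbers).
math_scores = {
    "chatbot-7b":   {"question-answering": 0.62, "evaluation": 0.81},
    "reasoner-32b": {"question-answering": 0.78, "evaluation": 0.70},
}
assignment = best_model_per_function(math_scores)
print(assignment)
# {'question-answering': 'reasoner-32b', 'evaluation': 'chatbot-7b'}
```

Note how this captures the "weak QA agent, great evaluator" finding: the toy chatbot loses on question-answering but still wins the evaluator seat.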
The researchers also built X-MAS-Proto, their own prototype MAS covering all 5 agent functions. By selecting the best LLM for each function, they saw a huge jump in performance, including gains of 33-34% on tough recent math benchmarks like AIME-2025 and MATH-MAS.
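A prototype like this can be pictured as a pipeline where each function is routed to its own model. Everything below is a hedged sketch: `call_llm` is a stand-in for a real model API, and the role-to-model mapping is invented for illustration:

```python
# An X-MAS-Proto-style pipeline in miniature: five agent functions,
# each handled by the model chosen for that function.

def call_llm(model, prompt):
    # Placeholder: a real system would query the model here.
    return f"[{model}] response to: {prompt[:30]}"

ROLES = {  # function -> assigned model (illustrative names)
    "planning":           "reasoner-32b",
    "question-answering": "reasoner-32b",
    "evaluation":         "chatbot-7b",
    "revision":           "chatbot-7b",
    "aggregation":        "chatbot-13b",
}

def solve(question):
    plan = call_llm(ROLES["planning"], f"Plan steps for: {question}")
    # Sample several candidate answers, critique and revise each,
    # then merge the revised answers into one final response.
    answers = [call_llm(ROLES["question-answering"],
                        f"{plan}\nAnswer: {question}") for _ in range(3)]
    reviews = [call_llm(ROLES["evaluation"], f"Critique: {a}") for a in answers]
    revised = [call_llm(ROLES["revision"], f"Fix {a} given {r}")
               for a, r in zip(answers, reviews)]
    return call_llm(ROLES["aggregation"], "Merge: " + " | ".join(revised))

print(solve("What is 2 + 2?"))
```

The design choice to separate roles means each seat can be re-cast independently as better specialist models appear.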
In short: if your MAS is powered by a "variety pack" of AIs instead of just one, you get smarter results. It’s the AI version of "don’t put all your eggs in one basket."
This research matters because it shows us how to get more intelligence without training new models — just by being smart about which ones we use.
Think of the possibilities: a medical assistant that pairs a medicine-savvy QA model with a rigorous evaluator, a coding team where a reasoner plans and a chatbot explains, or a finance analyst assembled from the best model for each step.
Future directions inspired by X-MAS include automatically matching models to agent roles and mixing conversational chatbots with dedicated reasoners inside the same team.
The era of one-brain-fits-all AI is ending. It’s time to build diverse teams of minds — just like in the real world!
Whether you’re an AI researcher, developer, or curious engineer, X-MAS reminds us that diversity in intelligence is a feature, not a bug.
Stay curious, and keep building smart.
Multi-Agent System (MAS) - A setup where multiple AI "agents" work together, each doing a specific job (like planning, answering, or reviewing) to solve a complex task as a team. - More about this concept in the article "Smart Swarms at Sea: How Unmanned Boats Patrol the Oceans More Efficiently".
Large Language Model (LLM) - A type of AI trained on tons of text data that can understand and generate human-like language — like ChatGPT, Gemini, or Claude. - More about this concept in the article "Phishing, Be Gone! | How Small AI Models Are Learning to Outsmart Big Email Scammers".
Heterogeneous LLMs - A mix of different LLMs — instead of using the same AI everywhere, you choose the best one for each job (e.g., one for math, another for medicine).
Chatbot - An LLM fine-tuned to have conversations — good at talking, answering questions, and following instructions in natural language.
Reasoner - An LLM focused on logic and problem-solving — great for tasks like planning steps or solving math puzzles.
Benchmark - A test or dataset used to measure how well an AI model performs on a specific task or in a specific domain.
X-MAS-Bench - A giant testing platform built by the researchers to compare how 27 LLMs perform across 5 domains and 5 agent tasks.
X-MAS-Design - A method for upgrading existing MAS setups by swapping in different LLMs for each agent — no need to rebuild everything from scratch!
Agent Functions - The different roles AI agents can play in a MAS — like answering questions (QA), revising answers, aggregating multiple answers into one, planning, and evaluating.
Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, Siheng Chen. X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs. https://doi.org/10.48550/arXiv.2505.16997
From: Shanghai Jiao Tong University; University of Oxford; The University of Sydney; Shanghai AI Laboratory.