
X-MAS in AI 🎄 Boosting Multi-Agent Systems with a Sleigh Full of LLMs


Unwrapping the Future of Collaborative AI Using Heterogeneous Large Language Models (LLMs) 🎁

Published May 24, 2025 By EngiSphere Research Editors
Robots Working Together Symbolising Collaboration © AI Illustration

The Main Idea

This research introduces X-MAS, a framework that significantly boosts multi-agent system performance by leveraging diverse, specialized large language models (LLMs) instead of relying on a single model for all agent roles.


The R&D

Imagine this: What if instead of using just one brainy AI model to solve all your problems, you built a team of diverse AI agents, each with their own specialty? That's exactly what the researchers behind X-MAS propose — and the results are spectacular! 🌟

Let's dive into how mixing different kinds of Large Language Models (LLMs) into a single multi-agent system (MAS) can dramatically improve performance — and why this could be the next big leap in artificial intelligence. 🤖✨

🤔 What Is a Multi-Agent System (MAS)?

A Multi-Agent System (MAS) is like a team of AIs. Each "agent" has a specific job — one might answer questions, another might double-check answers, a third might combine everyone's ideas, and so on. Together, they tackle complex problems no single AI could handle alone.
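To make the "team of agents" idea concrete, here's a minimal Python sketch of the pattern. Everything in it is illustrative: `call_llm` is a hypothetical stand-in for whatever model API you use, and the prompts are invented for the example.

```python
# Minimal sketch of a multi-agent "team": each agent is just an LLM call
# with a role-specific prompt. `call_llm` is a hypothetical placeholder.

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real model API; returns a canned string here."""
    return f"[{model}] response to: {prompt!r}"

def answer(question: str, model: str = "some-llm") -> str:
    draft = call_llm(model, f"Answer this question: {question}")             # QA agent
    critique = call_llm(model, f"Find mistakes in: {draft}")                 # checker agent
    final = call_llm(model, f"Merge into one answer: {draft} | {critique}")  # aggregator agent
    return final

print(answer("What is the capital of France?"))
```

Notice that all three agents here call the same model — exactly the one-model setup the next paragraph pokes fun at.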

Until now, most MAS setups used one type of LLM to power all the agents. It's like assembling your Avengers team but hiring only clones of Iron Man. 🧐🧐🧐

Cool, but limiting. 😕

🎁 Enter X-MAS: A Gift of Diversity 🎉

The research team introduces X-MAS (short for eXpert Multi-Agent Systems). It's a fresh twist on MAS — instead of using one LLM for everything, X-MAS combines heterogeneous LLMs, meaning each agent gets the model that fits its job best.

It's like building a tech startup where:

💻 The coder is powered by a coding-specialized LLM,
🧠 The planner is run by a logic-heavy reasoning LLM,
💬 The communicator is backed by a chatbot-trained LLM.

Now we're talking teamwork! 🔧⚙️🧩
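In code, that assignment boils down to a per-role lookup table instead of a single shared model. A hedged sketch — the model names are made up for illustration, not the paper's actual picks:

```python
# Sketch of heterogeneous model assignment: each role dispatches to its own
# best-fit model. All model names here are hypothetical placeholders.

ROLE_TO_MODEL = {
    "coder": "code-specialized-llm",
    "planner": "reasoning-llm",
    "communicator": "chat-tuned-llm",
}

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real model API call."""
    return f"[{model}] response to: {prompt!r}"

def run_agent(role: str, prompt: str) -> str:
    # The only change from a homogeneous MAS: look up the model per role.
    return call_llm(ROLE_TO_MODEL[role], prompt)

print(run_agent("planner", "Break this feature request into steps."))
```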

🧪 X-MAS-Bench: Putting AI to the Test

To see if this mix-and-match strategy really works, the researchers built X-MAS-Bench, a huge testbed to evaluate how 27 LLMs perform across:

🧮 Math
💻 Coding
🔬 Science
🧬 Medicine
💰 Finance

They looked at 5 crucial agent skills:

  1. Question-Answering (QA) 🤔
  2. Revising 🛠️
  3. Aggregating Answers 🧵
  4. Planning Tasks 🗺️
  5. Evaluating Responses 📊

🧠 Over 1.7 million tests were run — that's more than many AIs can count! 😄
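Conceptually, the benchmark scores every (model, domain, function) combination and reads off the winners. Here's a rough sketch of that grid search, with a placeholder scorer standing in for the real evaluation harness:

```python
# Rough sketch of the X-MAS-Bench idea: score every (model, domain, function)
# cell, then pick the best model per cell. The scorer is a placeholder.
from itertools import product

MODELS = ["model-a", "model-b", "model-c"]   # the paper evaluates 27 LLMs
DOMAINS = ["math", "coding", "science", "medicine", "finance"]
FUNCTIONS = ["qa", "revise", "aggregate", "plan", "evaluate"]

def score(model: str, domain: str, function: str) -> float:
    """Placeholder metric; the real benchmark grades model outputs on datasets."""
    return (hash((model, domain, function)) % 1000) / 1000

# Best model for each (domain, function) pair:
best = {
    (d, f): max(MODELS, key=lambda m: score(m, d, f))
    for d, f in product(DOMAINS, FUNCTIONS)
}
print(best[("math", "plan")])
```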

📈 Big Results from Diverse Models

Here's what they found:

✅ No single LLM is best at everything — some excel at coding, others shine in medicine.
✅ Specialized, smaller models often outperform big general ones on certain tasks. Size isn't everything!
✅ Mixing different LLMs in a MAS leads to better overall performance. Heterogeneous systems were up to:
💥 8.4% better on math problems!
🧠 47% better on a reasoning challenge (AIME)!

✅ Even flawed models can be useful when teamed up strategically — a weak QA agent might still be a great evaluator.
✅ The more diverse the models, the better the results. Adding more types of LLMs improved outcomes across the board. 📈

๐Ÿ› ๏ธ X-MAS-Design: How to Build It

The team didnโ€™t just test ideas in theory โ€” they created a blueprint called X-MAS-Design, showing how to upgrade existing MAS frameworks into smarter, diverse teams. Here's how it works:

  1. Keep the same MAS structure — no need to rebuild from scratch.
  2. Replace the one-size-fits-all LLM with the best-fit LLM for each role.
  3. Instant performance boost!

They applied this to 4 MAS systems (including AgentVerse and DyLAN), and every single one performed better with a mix of LLMs.
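From the code side, that upgrade is deliberately small: the agent wiring stays the same, and only the table binding roles to models changes. A hypothetical before/after:

```python
# Hypothetical before/after for the X-MAS-Design recipe: the MAS structure
# (which agents exist and how they connect) is untouched; only the
# role-to-model table changes. All model names are invented for illustration.

ROLES = ["qa", "revise", "aggregate", "plan", "evaluate"]

# Before: homogeneous — every role runs the same general-purpose model.
homogeneous = {role: "general-llm" for role in ROLES}

# After: heterogeneous — each role gets its benchmark winner.
heterogeneous = {
    "qa": "domain-expert-llm",
    "revise": "careful-editor-llm",
    "aggregate": "synthesis-llm",
    "plan": "reasoning-llm",
    "evaluate": "strict-judge-llm",
}

def build_mas(model_table):
    """Same wiring either way; only the lookup table differs."""
    return {role: (lambda prompt, m=model_table[role]: f"[{m}] {prompt}")
            for role in ROLES}

mas = build_mas(heterogeneous)   # swap in `homogeneous` to get the old behavior
print(mas["plan"]("Outline the solution steps."))
```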

🔄 Real-World Example: X-MAS-Proto

The researchers also built X-MAS-Proto, their own prototype MAS with all 5 agent functions. By selecting the best LLM for each task, they saw a huge jump in performance — including gains of 33-34% on newer math benchmarks like AIME-2025 and MATH-MAS. 🧠📊

In short: if your MAS is powered by a "variety pack" of AIs instead of just one, you get smarter results. It's the AI version of "don't put all your eggs in one basket." 🧺🥚
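As a rough picture of how the five functions might chain together in a prototype like this (our sketch of the general pattern, not the authors' exact pipeline):

```python
# Sketch of a five-function pipeline in the spirit of X-MAS-Proto:
# plan -> diverse QA candidates -> revise -> aggregate -> evaluate.
# `call_llm` and every model name are hypothetical placeholders.

def call_llm(model: str, prompt: str) -> str:
    return f"[{model}] response to: {prompt!r}"

def solve(question: str) -> str:
    plan = call_llm("reasoning-llm", f"Plan the steps for: {question}")        # plan
    candidates = [call_llm(m, f"Answer {question!r} using plan: {plan}")       # QA
                  for m in ("domain-expert-llm", "chat-tuned-llm")]
    revised = [call_llm("careful-editor-llm", f"Fix errors in: {c}")           # revise
               for c in candidates]
    merged = call_llm("synthesis-llm",
                      "Merge these answers: " + " | ".join(revised))           # aggregate
    verdict = call_llm("strict-judge-llm", f"Is this answer sound? {merged}")  # evaluate
    print("evaluator verdict:", verdict)
    return merged

print(solve("What is 17 * 23?"))
```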

💡 Why This Matters

This research matters because it shows us how to get more intelligence without training new models — just by being smart about which ones we use. 💥

Think of the possibilities:

📚 Education tools that use different LLMs to tutor, test, and evaluate students.
⚕️ Medical agents that consult specialized LLMs for diagnostics, treatment plans, and risk evaluation.
🏛️ Legal AI assistants that plan cases, draft arguments, and critique decisions — all powered by different legal-trained models.

🔮 What's Next?

Future directions inspired by X-MAS:

🔧 Auto-selecting the best LLM for each agent based on the task.
🤝 Training LLMs specifically for MAS roles (e.g., expert planner or sharp evaluator).
🌍 Scaling this to real-world systems in healthcare, research, finance, and more.

The era of one-brain-fits-all AI is ending. It's time to build diverse teams of minds — just like in the real world! 🌍🧠👥

📌 TL;DR (Too Long, Debugged & Reviewed)

🧠 The X-MAS project shows that multi-agent systems work way better when powered by diverse LLMs, not just one.
🔬 They tested 27 LLMs across 5 domains and 5 functions with 1.7 million experiments.
⚙️ Mixing models improves performance dramatically, especially on math, planning, and reasoning tasks.
📈 Just swapping in different LLMs — without changing the MAS structure — gives a performance boost.
🌟 It's time to think beyond single AI models and build teams of specialists.

🎉 Whether you're an AI researcher, developer, or curious engineer, X-MAS reminds us that diversity in intelligence is a feature, not a bug. 🤖❤️💡

Stay curious, and keep building smart.


Concepts to Know

🧩 Multi-Agent System (MAS) - A setup where multiple AI "agents" work together, each doing a specific job (like planning, answering, or reviewing) to solve a complex task as a team. - More about this concept in the article "Smart Swarms at Sea: How Unmanned Boats Patrol the Oceans More Efficiently 🌊 🚤".

💬 Large Language Model (LLM) - A type of AI trained on tons of text data that can understand and generate human-like language — like ChatGPT, Gemini, or Claude. - More about this concept in the article "Phishing, Be Gone! 🎣🚫 How Small AI Models Are Learning to Outsmart Big Email Scammers".

🌈 Heterogeneous LLMs - A mix of different LLMs — instead of using the same AI everywhere, you choose the best one for each job (e.g., one for math, another for medicine).

🤖 Chatbot - An LLM fine-tuned to have conversations — good at talking, answering questions, and following instructions in natural language.

🧠 Reasoner - An LLM focused on logic and problem-solving — great for tasks like planning steps or solving math puzzles.

🧪 Benchmark - A test or dataset used to measure how well an AI model performs on a specific task or in a specific domain.

🔄 X-MAS-Bench - A giant testing platform built by the researchers to compare how 27 LLMs perform across 5 domains and 5 agent tasks.

🛠️ X-MAS-Design - A method for upgrading existing MAS setups by swapping in different LLMs for each agent — no need to rebuild everything from scratch!

📊 Agent Functions - The different roles AI agents can play in a MAS — like:

  • QA (Question-Answering): answering queries 🗣️
  • Revise: fixing or improving answers 🛠️
  • Aggregate: combining multiple answers into one 🧵
  • Plan: creating steps to solve problems 📝
  • Evaluate: checking if an answer is good ✅

Source: Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, Siheng Chen. X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs. https://doi.org/10.48550/arXiv.2505.16997

From: Shanghai Jiao Tong University; University of Oxford; The University of Sydney; Shanghai AI Laboratory.

© 2025 EngiSphere.com