A new AI system using Vision Transformers + a lightweight LLM classifies citrus fruit quality with 98.29% accuracy, explains its decisions with heatmaps and reports, and enables real-time, transparent, on-site quality control.
If you’ve ever picked up an orange at the market and thought, “Is this one fresh enough?”, you’ve faced the same challenge citrus farmers and suppliers battle every day — fruit quality assessment. Traditionally, this work relied on human eyes 👀, subjective judgment, and hours of sorting. But what if Artificial Intelligence (AI) could step in, classifying fruit with superhuman accuracy while explaining its reasoning clearly?
That’s exactly what a team of researchers from Morocco set out to achieve. Their study introduces an AI pipeline that combines Vision Transformers (ViTs) with a lightweight Large Language Model (LLM). The goal? To automatically classify citrus fruits into good, medium, or bad quality, while also giving human-readable explanations for each decision.
The results were stunning: 98.29% classification accuracy, real-time performance on edge devices, and AI-generated reports explaining why a fruit was labeled as fresh, damaged, or rotten.
This work is a big leap for precision agriculture, and it shows how modern computer vision + language models can boost transparency in automated decision-making.
Before diving into the new system, let’s understand the evolution of fruit classification methods:
- Manual inspection: human graders sorting by eye — slow, tiring, and subjective.
- Classical image processing: color thresholding (~65% accuracy) and clustering (~70%), based on simple pixel rules.
- Convolutional Neural Networks (CNNs, ~90–95%): learned features, but focused on local pixel neighborhoods.
- Vision Transformers (ViTs): attention over image patches, capturing both the whole fruit and its tiny surface details.
The Moroccan team harnessed this latest shift, building their citrus quality pipeline around ViT-Base (patch size 16×16, 224×224 input) with ImageNet pre-training.
The researchers collected a diverse dataset of citrus fruit images spanning all three quality grades.
Each fruit image was carefully labeled by experts according to international standards (USDA & UNECE). The dataset was also made publicly available on Kaggle, ensuring transparency and reproducibility.
Fruits were categorized into three classes:
- Good: fresh fruit suitable for direct sale.
- Medium: minor surface imperfections, better suited for processing than fresh markets.
- Bad: damaged or rotten fruit.
This structured dataset laid the foundation for training a robust AI model.
Instead of scanning pixels locally like CNNs, the ViT breaks an image into patches, embeds them, and applies multi-head self-attention. This allows it to learn:
- Global structure: the fruit’s overall shape, color, and ripeness.
- Local detail: small blemishes, bruises, and texture changes on the surface.
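The patch-embedding step can be sketched in plain NumPy. The dimensions follow the ViT-Base configuration mentioned above (224×224 input, 16×16 patches); the image and projection matrix here are random stand-ins for a real photo and the learned embedding weights:

```python
import numpy as np

# A 224x224 RGB image, as used by ViT-Base in the study (random stand-in).
image = np.random.rand(224, 224, 3)

patch = 16            # patch size (16x16)
n = 224 // patch      # 14 patches per side -> 196 patches total

# Split the image into non-overlapping 16x16 patches and flatten each one.
patches = (image.reshape(n, patch, n, patch, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(n * n, patch * patch * 3))   # (196, 768)

# Linear projection into the transformer's embedding space.
# ViT-Base uses a learned 768-dim projection; a random matrix stands in here.
W = np.random.rand(768, 768)
tokens = patches @ W  # (196, 768): one token per patch

print(tokens.shape)
```

Self-attention then lets every one of these 196 tokens attend to every other, which is how the model relates a small blemish to the fruit as a whole.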
In training, the team used transfer learning: the ImageNet pre-trained ViT-Base was fine-tuned on the labeled citrus dataset.
Performance quickly stabilized at high accuracy, showing the ViT’s ability to generalize across citrus varieties.
On the test dataset, the ViT achieved 98.29% classification accuracy.
This is a major leap compared to older approaches (thresholding ~65%, clustering ~70%, CNNs ~90–95%).
Even in real-time tests with new, unseen fruits, the model performed flawlessly — all predictions were correct.
One challenge with AI in agriculture is trust. Farmers and distributors won’t rely on a “black box” that just says “Bad fruit” without explanation.
To solve this, the researchers added two interpretability layers:
- Grad-CAM heatmaps that highlight the image regions (e.g., a bruise or mold spot) driving each prediction.
- A lightweight LLM that turns the prediction and its confidence score into a short, human-readable report, like:
“This fruit is of medium quality (confidence: 99%). Minor imperfections detected on the surface, recommended for processing rather than fresh markets.”
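The paper pairs each prediction with an LLM-generated report like the one above. Purely as an illustration (the actual model and prompt are not detailed here), a simple template mapping the classifier's output to such a report might look like:

```python
def quality_report(label: str, confidence: float) -> str:
    """Turn a classifier decision into a human-readable note.

    Illustrative stand-in for the paper's lightweight LLM: the real
    system generates free-form text; this sketch uses fixed templates.
    """
    advice = {
        "good":   "Suitable for fresh markets.",
        "medium": "Minor imperfections detected on the surface, "
                  "recommended for processing rather than fresh markets.",
        "bad":    "Significant damage or rot detected, should be discarded.",
    }
    return (f"This fruit is of {label} quality "
            f"(confidence: {confidence:.0%}). {advice[label]}")

print(quality_report("medium", 0.99))
```

A real LLM adds flexibility the template lacks (describing *which* defect it sees, based on the heatmap regions), which is exactly why the authors use one.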
And it does this fast: 0.3 seconds per report, with low power consumption (3.2 W) — perfect for edge devices in farms or warehouses.
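Back-of-the-envelope, those two reported figures translate into roughly 3.3 reports per second and under one joule of energy per report:

```python
# Figures reported in the study.
seconds_per_report = 0.3   # latency per generated report
power_watts = 3.2          # power draw on the edge device

reports_per_second = 1 / seconds_per_report               # ~3.33
reports_per_hour = 3600 / seconds_per_report              # ~12,000
energy_per_report_joules = power_watts * seconds_per_report  # ~0.96 J

print(round(reports_per_hour), round(energy_per_report_joules, 2))
```

At roughly 12,000 fruits per hour per device, a handful of such units could keep up with a packing line.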
This research is more than a technical achievement. It has real agricultural impact:
In short: better profits, less waste, and more trust in AI systems.
The study proves the pipeline works, but there’s more ahead:
This could transform not just citrus, but the entire global fresh produce supply chain.
The combination of Vision Transformers and lightweight LLMs represents a new era in agricultural AI. This research shows how accurate classification and transparent, human-readable explanation can coexist in a single pipeline, even on low-power edge hardware.
In the near future, picking out a perfect orange might not rely on your eyes alone — but on an AI-powered assistant making sure only the best fruits reach your basket.
Vision Transformer (ViT) - An AI model that breaks an image into patches and uses self-attention to see both the big picture and tiny details at once. - More about this concept in the article "Building a Smarter Wireless Future: How Transformers Revolutionize 6G Radio Technology".
Large Language Model (LLM) - A smart text-based AI trained on huge amounts of data — it explains, summarizes, and writes like a human assistant. - More about this concept in the article "Dive Smarter | How AI Is Making Underwater Robots Super Adaptive!".
Grad-CAM - A heatmap tool that shows where the AI is “looking” in an image when making decisions (like highlighting a bruise on an orange).
ImageNet Pre-training - Training an AI first on a giant image dataset (ImageNet) so it learns general vision skills, then fine-tuning it for citrus fruit quality.
Edge Devices - Small, portable computers (like phones, tablets, or IoT gadgets) that can run AI locally — no need for constant internet. - More about this concept in the article "Smarter Forest Fire Detection in Real Time | F3-YOLO".
Classification Accuracy - How often the AI gets it right. Example: 98.29% accuracy = almost 99 correct out of 100 tries.
Precision & Recall - Precision: of the fruits the AI flags as a given class (say, “bad”), the share that truly belong to it. Recall: of all fruits truly in that class, the share the AI catches. High precision means few false alarms; high recall means few misses.
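With made-up confusion counts for the “bad” class (illustrative numbers, not from the paper), the two metrics are computed as:

```python
# Illustrative counts for the "bad" class (not from the study).
true_positives = 45   # bad fruits correctly flagged as bad
false_positives = 5   # good/medium fruits wrongly flagged as bad
false_negatives = 3   # bad fruits the model missed

precision = true_positives / (true_positives + false_positives)  # 45/50 = 0.90
recall = true_positives / (true_positives + false_negatives)     # 45/48 = 0.9375

print(f"precision={precision:.2f}, recall={recall:.2f}")
```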
Confidence Score - How sure the AI is about its decision, shown as a number between 0 and 1. (e.g., “Bad fruit, confidence 0.99”).
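A confidence score typically comes from a softmax over the model's raw class scores (logits). A minimal sketch, assuming three classes in the order good/medium/bad and hypothetical logit values:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes [good, medium, bad].
probs = softmax([0.2, 1.1, 5.6])
confidence = max(probs)
label = ["good", "medium", "bad"][probs.index(confidence)]
print(label, round(confidence, 2))
```

The winning class's probability is what gets reported as the confidence (e.g., “Bad fruit, confidence 0.99”).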
Transfer Learning - Reusing what a model learned on one task (like cats/dogs) and applying it to another (like citrus quality).
Interpretability - Making AI decisions understandable to humans — so users can trust what the model says.
Jrondi, Z.; Moussaid, A.; Hadi, M.Y. Interpretable Citrus Fruit Quality Assessment Using Vision Transformers and Lightweight Large Language Models. AgriEngineering 2025, 7, 286. https://doi.org/10.3390/agriengineering7090286
From: Ibn Tofail University; University Mohammed VI Polytechnic (UM6P).