Vision Transformers Meet Citrus 🍊 Smarter Fruit Quality Control


Discover how Vision Transformer models and lightweight Large Language Models bring accuracy, transparency, and real-time insights to citrus fruit quality assessment.

Published September 7, 2025 By EngiSphere Research Editors
Vision Transformers Meet Citrus © AI Illustration

TL;DR

A new AI system using Vision Transformers + a lightweight LLM classifies citrus fruit quality with 98.29% accuracy, explains its decisions with heatmaps and reports, and enables real-time, transparent, on-site quality control.

The R&D

If you’ve ever picked up an orange at the market and thought, “Is this one fresh enough?”, you’ve faced the same challenge citrus farmers and suppliers battle every day: fruit quality assessment. Traditionally, this work relied on human eyes 👀, subjective judgment, and hours of sorting. But what if Artificial Intelligence (AI) could step in, classifying fruit with superhuman accuracy while explaining its reasoning clearly?

That’s exactly what a team of researchers from Morocco set out to achieve. Their study introduces an AI pipeline that combines Vision Transformers (ViTs) with a lightweight Large Language Model (LLM). The goal? To automatically classify citrus fruits into good, medium, or bad quality, while also giving human-readable explanations for each decision.

The results were stunning: 98.29% classification accuracy ✅, real-time performance on edge devices, and AI-generated reports explaining why a fruit was labeled as fresh, damaged, or rotten.

This work is a big leap for precision agriculture 🍊🤖, and it shows how modern computer vision + language models can boost transparency in automated decision-making.

From Old Methods to Vision Transformers 🔄➡️

Before diving into the new system, let’s understand the evolution of fruit classification methods:

1. Thresholding & Basic Image Processing 🖼️
  • Early attempts used color histograms and simple thresholding to separate fresh fruits from damaged ones.
  • Problem: They were too sensitive to lighting 🌞☁️ and background noise.
2. Unsupervised Learning (Clustering) 🌀
  • Techniques like K-means grouped pixels based on similarity.
  • Accuracy improved slightly (~70%) but still failed with overlapping or partially hidden fruits.
3. Classical Machine Learning 📊
  • Algorithms like SVMs and Random Forests used hand-crafted features (color, texture, shape).
  • Accuracy rose to 75–85%, but models required heavy feature engineering.
4. Deep Learning (CNNs) 🧠
  • Convolutional Neural Networks (ResNet, VGG, EfficientNet) pushed performance above 90%.
  • But CNNs have a bias: they mostly focus on local features and can struggle with global context.
5. Transformers Enter the Field 🍋
  • Inspired by their success in Natural Language Processing, Vision Transformers (ViTs) were adapted for images.
  • Unlike CNNs, ViTs split an image into patches and analyze global relationships using self-attention mechanisms.
  • This makes them powerful for subtle visual cues – perfect for fruit quality, where defects can be small but critical.

The Moroccan team harnessed this latest shift, building their citrus quality pipeline around ViT-Base (patch size 16×16, 224×224 input) with ImageNet pre-training.
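
As a concrete (but unofficial) illustration of this setup, the snippet below loads an ImageNet-pre-trained ViT-Base with 16×16 patches and attaches a fresh three-class head. The Hugging Face library and the google/vit-base-patch16-224-in21k checkpoint are assumptions made for the sketch; the authors do not publish their exact code.

```python
# Minimal sketch: ImageNet-pretrained ViT-Base (16x16 patches, 224x224 input)
# with a new 3-class head for citrus quality. Library and checkpoint name are
# assumptions for illustration, not the authors' published setup.
from transformers import ViTImageProcessor, ViTForImageClassification

labels = ["good", "medium", "bad"]
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",            # assumed pre-trained checkpoint
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
```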

The Dataset 🍊📸

The researchers collected a diverse dataset of citrus fruits:

  • Varieties: oranges, lemons, limes 🍊🍋
  • Stages: green, yellow, orange 🌈
  • Conditions: fresh, slightly damaged, rotten 🤢

Each fruit image was carefully labeled by experts according to international standards (USDA & UNECE). The dataset was also made publicly available on Kaggle, ensuring transparency and reproducibility.

Fruits were categorized into three classes:

  • Good quality ✅: smooth skin, uniform color, no defects.
  • Medium quality ⚠️: minor bruises, slight discoloration.
  • Bad quality ❌: major defects, rot, unsuitable for fresh markets.

This structured dataset laid the foundation for training a robust AI model.
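
To make the three-class setup concrete, here is a small loading sketch assuming the images are organized into one folder per class; the folder names and paths are hypothetical, not the actual layout of the published Kaggle dataset.

```python
# Hypothetical layout: citrus_dataset/train/{good,medium,bad}/*.jpg (same for test/).
import torch
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),              # ViT-Base expects 224x224 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # generic normalization; the right
                         std=[0.5, 0.5, 0.5]),  # stats depend on the checkpoint used
])

train_set = datasets.ImageFolder("citrus_dataset/train", transform=preprocess)
test_set = datasets.ImageFolder("citrus_dataset/test", transform=preprocess)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
print(train_set.classes)                        # e.g. ['bad', 'good', 'medium']
```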

How the Vision Transformer Works 🧩

Instead of scanning pixels locally like CNNs, the ViT breaks an image into patches, embeds them, and applies multi-head self-attention (a toy sketch of this patching step follows the list below). This allows it to learn:

  • Global patterns (overall color balance, shape)
  • Local anomalies (small bruises, spots)
  • Contextual relationships (is discoloration a shadow or actual damage?)
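
To make the patching step concrete, here is a toy sketch (not the authors' code) of the shapes involved when a 224×224 image is cut into 16×16 patches, as in ViT-Base:

```python
# Toy walk-through of ViT patching: (224/16)^2 = 196 patches, each projected to
# a 768-dimensional token in ViT-Base. Illustrative only.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                          # dummy preprocessed citrus image

patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # "cut + flatten + project" in one op
tokens = patch_embed(image)                                  # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)                   # (1, 196, 768): one token per patch

# Self-attention lets every patch attend to every other patch, mixing global
# context (shape, color balance) with local cues such as a small bruise.
# The real ViT also prepends a [CLS] token and adds position embeddings.
attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
out, attn_weights = attention(tokens, tokens, tokens)
print(out.shape, attn_weights.shape)                         # (1, 196, 768) and (1, 196, 196)
```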

In training, the team used the following settings (a minimal fine-tuning sketch follows the list):

  • Optimizer: AdamW 🛠️ – A smart algorithm that helps the model learn faster and more efficiently by adjusting weights during training.
  • Learning Rate: 2 × 10⁻⁵ 📉 – Like the speed of learning. A small number means the model learns carefully, step by step, without rushing into mistakes.
  • Loss: Cross-entropy 📊 – A metric that quantifies the error in a model’s predictions. Training reduces this “loss” so the model gets closer to the right answers.
  • Epochs: 30 🔄 – One epoch = the model looks at the entire dataset once. So 30 epochs = the AI reviewed the fruit images 30 times to learn patterns.
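
Putting these settings together, a bare-bones fine-tuning loop could look like the sketch below. It assumes PyTorch and the Hugging Face ViT checkpoint mentioned earlier; the dummy data stands in for a real DataLoader like the one sketched in the dataset section. None of this is the authors' published code.

```python
# Minimal fine-tuning loop with the reported settings:
# AdamW optimizer, learning rate 2e-5, cross-entropy loss, 30 epochs.
import torch
from torch import nn
from transformers import ViTForImageClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=3   # assumed checkpoint
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# Stand-in data: replace with a real DataLoader of preprocessed citrus images.
dummy = torch.utils.data.TensorDataset(
    torch.randn(8, 3, 224, 224), torch.randint(0, 3, (8,)))
train_loader = torch.utils.data.DataLoader(dummy, batch_size=4)

for epoch in range(30):                        # 30 passes over the whole dataset
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        logits = model(pixel_values=images).logits
        loss = criterion(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: last batch loss {loss.item():.4f}")
```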

Performance quickly stabilized at high accuracy, showing the ViT’s ability to generalize across citrus varieties.

Results: 98.29% Accuracy 🎯🍊

On the test dataset, the ViT achieved:

  • Precision: 0.96–0.99 across classes
  • Recall: 0.96–0.99
  • F1-Score: ~0.97–0.99
  • Overall Accuracy: 98.29% 🔥

This is a major leap compared to older approaches (thresholding ~65%, clustering ~70%, CNNs ~90–95%).
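
For readers who want to build this kind of score sheet themselves, per-class precision, recall, and F1 plus overall accuracy are typically computed along these lines (a sketch assuming scikit-learn; the toy labels below are placeholders, not the paper’s data):

```python
# Per-class precision/recall/F1 and overall accuracy from predictions.
from sklearn.metrics import accuracy_score, classification_report

# Stand-in lists; in practice these come from running the ViT on the test set.
y_true = ["good", "good", "medium", "bad", "bad", "good"]
y_pred = ["good", "good", "medium", "bad", "medium", "good"]

print(classification_report(y_true, y_pred, digits=4))   # per-class precision/recall/F1
print("accuracy:", accuracy_score(y_true, y_pred))
```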

Even in real-time tests with new, unseen fruits, the model performed flawlessly – all predictions were correct ✅.

Explainability: Grad-CAM + LLM 📊🗣️

One challenge with AI in agriculture is trust. Farmers and distributors won’t rely on a “black box” that just says “Bad fruit ❌” without explanation.

To solve this, the researchers added two interpretability layers (minimal code sketches of both appear below):

1. Grad-CAM Heatmaps 🔥
  • Highlights regions of the fruit image that influenced the decision.
  • Example: if a lemon was classified as “bad,” Grad-CAM might highlight a dark spot on its peel.
  • This builds transparency and helps users see what the model sees.
2. Lightweight LLM Reports 📝
  • Enter Microsoft’s Phi-3-mini (3.8B parameters) – a compact, efficient language model.
  • The LLM takes ViT’s outputs (class, confidence, defect percentage) and generates human-readable reports.
  • Example:

“This fruit is of medium quality (confidence: 99%). Minor imperfections detected on the surface, recommended for processing rather than fresh markets.”

And it does this fast: 0.3 seconds per report, with low power consumption (3.2 W) – perfect for edge devices in farms or warehouses.
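
Below is a minimal, from-scratch Grad-CAM-style sketch for a ViT classifier. It is illustrative only (the authors’ exact implementation is not published): it hooks the last encoder block, weights its patch-token activations by the pooled gradient of the predicted class, and reshapes the result into a 14×14 heatmap that can be overlaid on the fruit image.

```python
# Grad-CAM-style heatmap for a ViT: activations of the last encoder block,
# weighted by gradients of the top class, reshaped onto the 14x14 patch grid.
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=3    # assumed checkpoint
)
model.eval()

store = {}
last_block = model.vit.encoder.layer[-1]
last_block.register_forward_hook(
    lambda mod, inp, out: store.update(acts=out[0] if isinstance(out, tuple) else out))
last_block.register_full_backward_hook(
    lambda mod, grad_in, grad_out: store.update(grads=grad_out[0]))

pixel_values = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed fruit image
logits = model(pixel_values=pixel_values).logits
logits[0, logits[0].argmax()].backward()          # gradient of the top-scoring class

acts = store["acts"][:, 1:, :]                    # drop the [CLS] token -> (1, 196, 768)
grads = store["grads"][:, 1:, :]
weights = grads.mean(dim=1, keepdim=True)         # pooled gradient per channel
cam = torch.relu((weights * acts).sum(dim=-1))    # importance score per patch (1, 196)
cam = cam.detach().reshape(14, 14)                # 224 / 16 = 14 patches per side
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
# Upsample `cam` to 224x224 and overlay it on the image to obtain the heatmap.
```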
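
And here is a minimal sketch of the report step: the classifier’s structured output is folded into a prompt for a small instruction-tuned LLM. The checkpoint name, prompt wording, and field names are assumptions for illustration; the authors’ actual prompt is not published.

```python
# Turn the ViT's structured output into a short, human-readable quality report.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",     # assumed Phi-3-mini checkpoint
)

vit_output = {"quality": "medium", "confidence": 0.99, "defect_area_pct": 4.2}  # example values

prompt = (
    "You are a fruit quality inspector. Write a two-sentence report for this citrus fruit.\n"
    f"Predicted class: {vit_output['quality']}\n"
    f"Confidence: {vit_output['confidence']:.2f}\n"
    f"Estimated defective surface: {vit_output['defect_area_pct']}%\n"
    "Recommend whether it suits fresh markets or processing."
)

report = generator(prompt, max_new_tokens=120, do_sample=False)[0]["generated_text"]
print(report)
```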

Why This Matters 🚜🍊

This research is more than a technical achievement. It has real agricultural impact:

  • Farmers 👩‍🌾: Quickly identify damaged fruits before selling.
  • Distributors 🚛: Automate sorting lines to separate premium vs. processing fruits.
  • Consumers 🛒: Get fresher, higher-quality fruit with fewer surprises.
  • Sustainability 🌱: Reduce waste by diverting medium-quality fruits to juice production instead of discarding them.

In short: better profits + less waste + more trust in AI systems.

Future Prospects 🔭

The study proves the pipeline works, but there’s more ahead:

  1. Multi-fruit Expansion 🍎🍇 – Extending ViT + LLM models to apples, grapes, bananas, and beyond.
  2. On-Device AI for Farmers 📱 – Imagine a smartphone app: snap a picture, get instant quality feedback + report.
  3. Integration with Robotics 🤖 – Combine with robotic arms for automated harvesting and sorting.
  4. Hyperspectral + ViT Fusion 🌈 – Using hyperspectral imaging for even more precise detection of hidden defects.
  5. User Trust Studies 👥 – Researching how farmers perceive AI-generated reports vs. traditional inspection.

This could transform not just citrus, but the entire global fresh produce supply chain.

Closing Thoughts 📌

The combination of Vision Transformers and lightweight LLMs represents a new era in agricultural AI. This research shows how:

  • ViTs deliver top-tier accuracy (98.29%) in fruit quality classification.
  • Grad-CAM makes decisions transparent.
  • LLMs provide actionable, human-readable insights.
  • The system runs efficiently enough for real-world deployment.

In the near future, picking out a perfect orange might not rely on your eyes alone – but on an AI-powered assistant making sure only the best fruits reach your basket 🧺🍊.


Terms to Know

Vision Transformer (ViT) 🤖🖼️ An AI model that breaks an image into patches and uses self-attention to see both the big picture and tiny details at once. - More about this concept in the article "Building a Smarter Wireless Future: How Transformers Revolutionize 6G Radio Technology 🌐📡".

Large Language Model (LLM) 🧠💬 A smart text-based AI trained on huge amounts of data – it explains, summarizes, and writes like a human assistant. - More about this concept in the article "Dive Smarter 🐠 How AI Is Making Underwater Robots Super Adaptive!".

Grad-CAM 🔥👀 A heatmap tool that shows where the AI is “looking” in an image when making decisions (like highlighting a bruise on an orange).

ImageNet Pre-training 🏋️‍♂️📸 Training an AI first on a giant image dataset (ImageNet) so it learns general vision skills, then fine-tuning it for citrus fruit quality.

Edge Devices 📱💻 Small, portable computers (like phones, tablets, or IoT gadgets) that can run AI locally – no need for constant internet. - More about this concept in the article "Smarter Forest Fire Detection in Real Time 🔥 F3-YOLO".

Classification Accuracy 🎯✅ How often the AI gets it right. Example: 98.29% accuracy = roughly 98 correct out of 100 tries.

Precision & Recall 📊🔍

  • Precision: Of all fruits labeled “good,” how many were truly good?
  • Recall: Of all good fruits, how many did the AI actually catch?

Confidence Score 📈👌 How sure the AI is about its decision, shown as a number between 0 and 1 (e.g., “Bad fruit, confidence 0.99”).

Transfer Learning 🔄🧩 Reusing what a model learned on one task (like cats/dogs) and applying it to another (like citrus quality).

Interpretability 📝🔦 Making AI decisions understandable to humans – so users can trust what the model says.


Source: Jrondi, Z.; Moussaid, A.; Hadi, M.Y. Interpretable Citrus Fruit Quality Assessment Using Vision Transformers and Lightweight Large Language Models. AgriEngineering 2025, 7, 286. https://doi.org/10.3390/agriengineering7090286

From: Ibn Tofail University; Mohammed VI Polytechnic University (UM6P).

© 2025 EngiSphere.com