Vision Transformers Meet Citrus | Smarter Fruit Quality Control

Discover how Vision Transformer models and lightweight Large Language Models bring accuracy, transparency, and real-time insights to citrus fruit quality assessment.

Published September 7, 2025 by EngiSphere Research Editors

In Brief

A new AI system using Vision Transformers + a lightweight LLM classifies citrus fruit quality with 98.29% accuracy, explains its decisions with heatmaps and reports, and enables real-time, transparent, on-site quality control.

In Depth

If you’ve ever picked up an orange at the market and thought, “Is this one fresh enough?”, you’ve faced the same challenge citrus farmers and suppliers battle every day — fruit quality assessment. Traditionally, this work relied on human eyes 👀, subjective judgment, and hours of sorting. But what if Artificial Intelligence (AI) could step in, classifying fruit with superhuman accuracy while explaining its reasoning clearly?

That’s exactly what a team of researchers from Morocco set out to achieve. Their study introduces an AI pipeline that combines Vision Transformers (ViTs) with a lightweight Large Language Model (LLM). The goal? To automatically classify citrus fruits into good, medium, or bad quality, while also giving human-readable explanations for each decision.

The results were stunning: 98.29% classification accuracy, real-time performance on edge devices, and AI-generated reports explaining why a fruit was labeled as fresh, damaged, or rotten.

This work is a big leap for precision agriculture, and it shows how modern computer vision + language models can boost transparency in automated decision-making.

From Old Methods to Vision Transformers

Before diving into the new system, let’s understand the evolution of fruit classification methods:

1. Thresholding & Basic Image Processing
  • Early attempts used color histograms and simple thresholding to separate fresh fruits from damaged ones.
  • Problem: They were too sensitive to lighting and background noise.
2. Unsupervised Learning (Clustering)
  • Techniques like K-means grouped pixels based on similarity.
  • Accuracy improved slightly (~70%) but still failed with overlapping or partially hidden fruits.
3. Classical Machine Learning
  • Algorithms like SVMs and Random Forests used hand-crafted features (color, texture, shape).
  • Accuracy rose to 75–85%, but models required heavy feature engineering.
4. Deep Learning (CNNs)
  • Convolutional Neural Networks (ResNet, VGG, EfficientNet) pushed performance above 90%.
  • But CNNs have a bias: they mostly focus on local features and can struggle with global context.
5. Transformers Enter the Field
  • Inspired by their success in Natural Language Processing, Vision Transformers (ViTs) were adapted for images.
  • Unlike CNNs, ViTs split an image into patches and analyze global relationships using self-attention mechanisms.
  • This makes them powerful for subtle visual cues — perfect for fruit quality, where defects can be small but critical.

The Moroccan team harnessed this latest shift, building their citrus quality pipeline around ViT-Base (patch size 16×16, 224×224 input) with ImageNet pre-training.

The Dataset

The researchers collected a diverse dataset of citrus fruits:

  • Varieties: oranges, lemons, limes
  • Stages: green, yellow, orange
  • Conditions: fresh, slightly damaged, rotten

Each fruit image was carefully labeled by experts according to international standards (USDA & UNECE). The dataset was also made publicly available on Kaggle, ensuring transparency and reproducibility.

Fruits were categorized into three classes:

  • Good quality - smooth skin, uniform color, no defects.
  • Medium quality - minor bruises, slight discoloration.
  • Bad quality - major defects, rot, unsuitable for fresh markets.

This structured dataset laid the foundation for training a robust AI model.

How the Vision Transformer Works

Instead of scanning pixels locally like CNNs, the ViT breaks an image into patches, embeds them, and applies multi-head self-attention. This allows it to learn:

  • Global patterns (overall color balance, shape)
  • Local anomalies (small bruises, spots)
  • Contextual relationships (is discoloration a shadow or actual damage?)
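
The patch-and-attention idea can be sketched in a few lines of NumPy. This is a toy illustration of the mechanism, not the paper's model: ViT-Base learns projection matrices and runs many attention heads, whereas this sketch runs a single head directly over raw flattened patches.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into flattened non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)            # group the pixels of each patch
            .reshape(rows * cols, patch * patch * c))

def self_attention(x):
    """Single-head self-attention: every patch attends to every other patch."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # pairwise patch similarities
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all patches
    return weights @ x                            # each patch mixes in global context

image = np.random.rand(224, 224, 3)   # ViT-Base input size used in the study
tokens = patchify(image)              # 14 x 14 = 196 patches, each 16*16*3 = 768 values
out = self_attention(tokens)
print(tokens.shape, out.shape)        # (196, 768) (196, 768)
```

Because every patch's output is a weighted mixture of all 196 patches, a small bruise in one corner can influence how the whole image is interpreted, which is exactly the global-context advantage over CNNs described above.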

In training, the team used:

  • Optimizer: AdamW - Adam with decoupled weight decay; it adapts each weight’s update step during training while keeping regularization well behaved.
  • Learning Rate: 2 × 10⁻⁵ - Like the speed of learning. A small number means the model learns carefully, step by step, without rushing into mistakes.
  • Loss: Cross-entropy - A metric that quantifies the error in a model's predictions. Training reduces this “loss” so the model gets closer to the right answers.
  • Epochs: 30 - One epoch = the model looks at the entire dataset once. So 30 epochs = the AI reviewed the fruit images 30 times to learn patterns.
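
To see what the cross-entropy loss is actually measuring, here is a minimal NumPy sketch for one sample. The logits are made-up numbers for illustration, not the model's outputs:

```python
import numpy as np

def cross_entropy(logits, true_class):
    """Cross-entropy for one sample: -log of the softmax probability of the true class."""
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[true_class])

# Three classes from the study: 0 = good, 1 = medium, 2 = bad
logits = np.array([2.0, 0.5, -1.0])           # hypothetical model outputs
loss_correct = cross_entropy(logits, 0)       # true class matches the top logit
loss_wrong = cross_entropy(logits, 2)         # true class is the least likely one
print(loss_correct < loss_wrong)              # True: confident mistakes are punished hard
```

Training nudges the weights (via AdamW, at that small learning rate) so that, over 30 passes through the dataset, the average of this loss keeps shrinking.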

Performance quickly stabilized at high accuracy, showing the ViT’s ability to generalize across citrus varieties.

Results: 98.29% Accuracy

On the test dataset, the ViT achieved:

  • Precision: 0.96–0.99 across classes
  • Recall: 0.96–0.99
  • F1-Score: ~0.97–0.99
  • Overall Accuracy: 98.29%
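
These per-class metrics all fall out of a confusion matrix. The counts below are illustrative, not the paper's actual confusion matrix, but the sketch shows how precision, recall, F1, and accuracy are derived:

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cm = np.array([[98,  2,  0],    # good
               [ 1, 97,  2],    # medium
               [ 0,  1, 99]])   # bad

for i, name in enumerate(["good", "medium", "bad"]):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()   # of everything predicted as this class, how much was right
    recall = tp / cm[i, :].sum()      # of everything truly in this class, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

accuracy = np.trace(cm) / cm.sum()    # correct predictions / all predictions
print(f"accuracy={accuracy:.4f}")
```

With these illustrative counts, accuracy works out to 0.98, in the same ballpark as the reported 98.29%.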

This is a major leap compared to older approaches (thresholding ~65%, clustering ~70%, CNNs ~90–95%).

Even in real-time tests with new, unseen fruits, the model performed flawlessly — all predictions were correct.

Explainability: Grad-CAM + LLM

One challenge with AI in agriculture is trust. Farmers and distributors won’t rely on a “black box” that just says “Bad fruit” without explanation.

To solve this, the researchers added two interpretability layers:

1. Grad-CAM Heatmaps
  • Highlights regions of the fruit image that influenced the decision.
  • Example: if a lemon was classified as “bad,” Grad-CAM might highlight a dark spot on its peel.
  • This builds transparency and helps users see what the model sees.
2. Lightweight LLM Reports
  • Enter Microsoft’s Phi-3-mini (3.8B parameters) — a compact, efficient language model.
  • The LLM takes ViT’s outputs (class, confidence, defect percentage) and generates human-readable reports.
  • Example:

“This fruit is of medium quality (confidence: 99%). Minor imperfections detected on the surface, recommended for processing rather than fresh markets.”

And it does this fast: 0.3 seconds per report, with low power consumption (3.2 W) — perfect for edge devices in farms or warehouses.
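
The hand-off from ViT to LLM can be pictured as prompt construction: the classifier's structured outputs are formatted into text that the language model turns into a report. The field names and wording below are assumptions for illustration; the paper's exact prompt format is not reproduced here.

```python
def build_report_prompt(label, confidence, defect_pct):
    """Format ViT outputs as a prompt for the report-writing LLM.

    The schema here (label, confidence, defect percentage) follows the
    outputs described in the article; the prompt wording is hypothetical.
    """
    return (
        "You are a fruit-quality inspector. Write a short report.\n"
        f"Predicted class: {label}\n"
        f"Confidence: {confidence:.0%}\n"
        f"Estimated defective surface: {defect_pct:.1f}%\n"
        "Recommend fresh market, processing, or discard."
    )

prompt = build_report_prompt("medium", 0.99, 4.2)
print(prompt)
```

A compact model like Phi-3-mini can turn such a prompt into a sentence-long report quickly on modest hardware, which is what makes the sub-second, low-power figures above plausible for edge deployment.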

Why This Matters

This research is more than a technical achievement. It has real agricultural impact:

  • Farmers: Quickly identify damaged fruits before selling.
  • Distributors: Automate sorting lines to separate premium vs. processing fruits.
  • Consumers: Get fresher, higher-quality fruit with fewer surprises.
  • Sustainability: Reduce waste by diverting medium-quality fruits to juice production instead of discarding them.

In short: better profits + less waste + more trust in AI systems.

Future Prospects

The study proves the pipeline works, but there’s more ahead:

  1. Multi-fruit Expansion - Extending ViT + LLM models to apples, grapes, bananas, and beyond.
  2. On-Device AI for Farmers - Imagine a smartphone app: snap a picture, get instant quality feedback + report.
  3. Integration with Robotics - Combine with robotic arms for automated harvesting and sorting.
  4. Hyperspectral + ViT Fusion - Using hyperspectral imaging for even more precise detection of hidden defects.
  5. User Trust Studies - Researching how farmers perceive AI-generated reports vs. traditional inspection.

This could transform not just citrus, but the entire global fresh produce supply chain.

Closing Thoughts

The combination of Vision Transformers and Lightweight LLMs represents a new era in agricultural AI. This research shows how:

  • ViTs deliver top-tier accuracy (98.29%) in fruit quality classification.
  • Grad-CAM makes decisions transparent.
  • LLMs provide actionable, human-readable insights.
  • The system runs efficiently enough for real-world deployment.

In the near future, picking out a perfect orange might not rely on your eyes alone — but on an AI-powered assistant making sure only the best fruits reach your basket.


In Terms

Vision Transformer (ViT) - An AI model that breaks an image into patches and uses self-attention to see both the big picture and tiny details at once. - More about this concept in the article "Building a Smarter Wireless Future: How Transformers Revolutionize 6G Radio Technology".

Large Language Model (LLM) - A smart text-based AI trained on huge amounts of data — it explains, summarizes, and writes like a human assistant. - More about this concept in the article "Dive Smarter | How AI Is Making Underwater Robots Super Adaptive!".

Grad-CAM - A heatmap tool that shows where the AI is “looking” in an image when making decisions (like highlighting a bruise on an orange).

ImageNet Pre-training - Training an AI first on a giant image dataset (ImageNet) so it learns general vision skills, then fine-tuning it for citrus fruit quality.

Edge Devices - Small, portable computers (like phones, tablets, or IoT gadgets) that can run AI locally — no need for constant internet. - More about this concept in the article "Smarter Forest Fire Detection in Real Time | F3-YOLO".

Classification Accuracy - How often the AI gets it right. Example: 98.29% accuracy = about 98 correct out of every 100 tries.

Precision & Recall

  • Precision: Of all fruits labeled “good,” how many were truly good?
  • Recall: Of all good fruits, how many did the AI actually catch?

Confidence Score - How sure the AI is about its decision, shown as a number between 0 and 1 (e.g., “Bad fruit, confidence 0.99”).

Transfer Learning - Reusing what a model learned on one task (like cats/dogs) and applying it to another (like citrus quality).

Interpretability - Making AI decisions understandable to humans — so users can trust what the model says.


Source

Jrondi, Z.; Moussaid, A.; Hadi, M.Y. Interpretable Citrus Fruit Quality Assessment Using Vision Transformers and Lightweight Large Language Models. AgriEngineering 2025, 7, 286. https://doi.org/10.3390/agriengineering7090286

From: Ibn Tofail University; Mohammed VI Polytechnic University (UM6P).

© 2026 EngiSphere.com