Technology & AI June 2026 5 min read

Vision Language Models for iRAP Assessment: Automating Road Safety Audits at Scale

How large multimodal models are transforming road safety infrastructure evaluation in low- and middle-income countries.

Road safety remains a critical problem in low- and middle-income countries. The vast majority of roads in India and comparable countries have never had a safety assessment, and a conventional iRAP audit costs ₹1,000–₹1,600 per km, money most governments struggle to allocate at network scale.

Vision Language Models (VLMs) such as Gemini and GPT-4o offer a way forward: they can classify road safety attributes from street-level photos without task-specific training data, at a fraction of the cost.

What Are Vision Language Models?

VLMs are AI systems trained on images and text that can understand visual content and respond to instructions without fine-tuning. Unlike traditional computer vision models that need thousands of labeled examples, VLMs work "zero-shot"—they can perform new tasks based purely on prompt instructions.

That matters for road safety. A VLM can analyze street-level imagery and classify iRAP attributes (guardrails, lane markings, lighting, medians) from a descriptive prompt alone, without an expensive labeled dataset for each new region.

V-RoAst: The Proof of Concept

The V-RoAst framework [1] tests whether this works in practice. Researchers evaluated Gemini-1.5-Flash and GPT-4o-mini on over 2,000 Thai street-level images annotated with iRAP attributes. The result: VLMs achieved 70–80% accuracy on visible attributes, with no training data required.

While not as accurate as fine-tuned CNNs, VLMs cut assessment costs several-fold (roughly 4–7x at the per-km rates cited in this article) and work across geographies without retraining.

The key is prompt engineering. Instead of training on thousands of examples, researchers design prompts like: "Identify barriers on the roadside. Describe type, material, and condition." The VLM processes the image and responds—drawing on its broad visual understanding.
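As a rough illustration of what prompt design looks like in code, the sketch below builds a zero-shot instruction for one attribute and validates the model's reply. The attribute name, the category list, and the JSON reply format are assumptions made for this example; they are not the official iRAP coding options or the prompts used in V-RoAst, and no particular VLM API is called.

```python
import json

# Hypothetical attribute definition. The categories are illustrative only,
# not the official iRAP coding options.
ROADSIDE_BARRIER = {
    "name": "roadside_barrier",
    "question": "Identify barriers on the roadside. "
                "Describe type, material, and condition.",
    "categories": ["none", "metal", "concrete", "wire_rope", "unknown"],
}


def build_prompt(attribute: dict) -> str:
    """Turn an attribute definition into a zero-shot instruction."""
    name = attribute["name"]
    options = ", ".join(attribute["categories"])
    return (
        f"{attribute['question']}\n"
        f'Reply with JSON only, e.g. {{"{name}": "<one of: {options}>"}}'
    )


def parse_response(attribute: dict, raw: str) -> str:
    """Return the predicted category, or 'unknown' if the reply is unusable."""
    try:
        value = json.loads(raw)[attribute["name"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "unknown"
    return value if value in attribute["categories"] else "unknown"
```

Constraining the reply to a fixed category list and falling back to "unknown" keeps free-text or malformed answers out of the downstream attribute table.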

The Economics: Cost and Speed

Traditional iRAP audit: ₹1,000–₹1,600 per km
VLM-based assessment: ₹150–₹400 per km

A 100 km corridor (1,000 images) takes under an hour with VLMs versus weeks with human auditors.

For governments in India, that means each kilometre assessed costs ₹150–₹400 instead of ₹1,000–₹1,600, a 4–7x reduction. At that price, comprehensive network-wide safety assessments become feasible.
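The arithmetic behind these figures can be checked with a short sketch. The per-km rates are the ones quoted above; the function names are made up for this example. Note that the 4–7x range comes from pairing the high estimates with each other and the low estimates with each other.

```python
# Per-km rates in rupees, taken from the figures quoted in this article.
TRADITIONAL_PER_KM = (1_000, 1_600)
VLM_PER_KM = (150, 400)


def corridor_cost(km, per_km):
    """Return the (low, high) cost in rupees for a corridor of `km` kilometres."""
    low, high = per_km
    return km * low, km * high


def savings_factor(traditional, vlm):
    """Compare like with like: high vs. high estimate, then low vs. low."""
    return traditional[1] / vlm[1], traditional[0] / vlm[0]


def min_images_per_minute(images, minutes):
    """Throughput needed to finish `images` within `minutes`."""
    return images / minutes


print(corridor_cost(100, TRADITIONAL_PER_KM))  # (100000, 160000)
print(corridor_cost(100, VLM_PER_KM))          # (15000, 40000)
low, high = savings_factor(TRADITIONAL_PER_KM, VLM_PER_KM)
print(round(low, 2), round(high, 2))           # 4.0 6.67
print(round(min_images_per_minute(1_000, 60), 1))  # 16.7
```

The last line shows what "under an hour" for 1,000 images implies: a sustained rate of roughly 17 images per minute, or about 3.6 seconds per image.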

How It Works: The Pipeline

  1. Image collection: Street-level imagery from Mapillary, Google Street View, or mobile cameras
  2. VLM processing: Batch images through Gemini/GPT-4o with optimized prompts
  3. Attribute extraction: Classify 59+ iRAP attributes per image
  4. iRAP ViDA integration: Convert attributes to star ratings
  5. Results: Safety assessments and risk corridors identified
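Steps 1 to 3 of the pipeline above can be sketched as a simple loop. This is a minimal outline under stated assumptions: the attribute list is a small hypothetical subset, `classify` stands in for whatever VLM call a deployment uses, and the conversion to star ratings in ViDA is not modelled because its input format is not described here.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical subset of attributes; a real run would cover the full iRAP list.
ATTRIBUTES = ["roadside_barrier", "lane_markings", "street_lighting", "median_type"]


@dataclass
class SegmentRecord:
    """Predicted attributes for one street-level image."""
    image_id: str
    attributes: dict = field(default_factory=dict)


def assess_images(
    image_ids: list[str],
    classify: Callable[[str, str], str],
) -> list[SegmentRecord]:
    """Run `classify(image_id, attribute)` for every image and attribute."""
    records = []
    for image_id in image_ids:
        record = SegmentRecord(image_id)
        for attribute in ATTRIBUTES:
            record.attributes[attribute] = classify(image_id, attribute)
        records.append(record)
    return records


def stub_classifier(image_id: str, attribute: str) -> str:
    """Placeholder for a real VLM call; always answers 'unknown'."""
    return "unknown"


records = assess_images(["img_0001", "img_0002"], stub_classifier)
```

Passing the classifier in as a function keeps the loop independent of any one model provider, so the same code can be run against different VLMs for comparison.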

VLM vs. Traditional Deep Learning

Fine-tuned CNNs: Higher accuracy (85–95%), but they require 5,000+ labeled images per region and 2–3 months of development

VLMs: Lower accuracy (70–80%), but they need no training data, transfer across regions without retraining, and can be deployed in weeks

The choice isn't between perfect systems. It's between assessments at 70–80% accuracy versus no assessments at all. For unrated roads in India claiming lives every day, 70% is transformative.

Real Limitations

Weather: Poor visibility in fog, heavy rain, or glare degrades performance

Ambiguity: Some attributes (barrier adequacy, pavement condition) require judgment calls—VLM agreement with auditors may be lower

Multi-view integration: Combining front/left/right views requires careful prompt design

Validation: Systems need ground-truth validation before deployment at scale
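One simple form of ground-truth validation is per-attribute agreement between VLM output and auditor labels on a held-out sample. The sketch below assumes both are available as aligned lists of attribute-to-label dictionaries; the data layout and label values are illustrative, not a prescribed iRAP format.

```python
from collections import defaultdict


def per_attribute_accuracy(predictions, ground_truth):
    """Share of images where the prediction matches the auditor, per attribute.

    Both arguments are lists of {attribute: label} dicts, aligned by image.
    A missing prediction counts as a miss.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, truth in zip(predictions, ground_truth):
        for attribute, true_label in truth.items():
            total[attribute] += 1
            if predicted.get(attribute) == true_label:
                correct[attribute] += 1
    return {attribute: correct[attribute] / total[attribute] for attribute in total}


# Toy example with two images and two attributes.
auditor = [
    {"street_lighting": "present", "median_type": "none"},
    {"street_lighting": "present", "median_type": "raised"},
]
vlm = [
    {"street_lighting": "present", "median_type": "none"},
    {"street_lighting": "absent", "median_type": "raised"},
]
print(per_attribute_accuracy(vlm, auditor))
# {'street_lighting': 0.5, 'median_type': 1.0}
```

Reporting accuracy per attribute rather than as a single average makes it easier to spot the judgment-heavy attributes where agreement with auditors drops.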

Why This Matters

In India, road deaths exceed 172,000 annually. Most crashes occur on unassessed roads. Engineers lack data to prioritize safety investments. Advocacy groups can't prove which roads are most dangerous.

VLMs break this cycle. At the per-km rates above, a state highway authority with street-level imagery and an internet connection can now assess 1,000 km of road for ₹1.5–₹4 lakh instead of ₹10–₹16 lakh. These assessments inform targeted interventions, identify safe school routes, and provide evidence for funding.

* * *

The Next Steps

VLM-based road assessment is no longer theoretical. Gemini 2.0 and 2.5 evaluations show the approach is maturing. The path forward:

  • Hybrid systems combining VLMs with fine-tuned models for high-ambiguity attributes
  • Validation datasets from Indian road networks to improve prompt engineering
  • Multi-temporal tracking to monitor how road safety changes over time
  • Integration with state highway management systems for real-world deployment
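The hybrid idea in the first item could be as simple as routing by attribute. The sketch below is one possible shape, not an established design: the set of high-ambiguity attributes reuses the two examples named under the limitations above, and both model arguments are hypothetical callables.

```python
# Attributes this article flags as requiring judgment calls.
HIGH_AMBIGUITY = {"barrier_adequacy", "pavement_condition"}


def classify_hybrid(image_id, attribute, vlm, fine_tuned):
    """Send judgment-heavy attributes to a fine-tuned model, the rest to a VLM.

    `vlm` and `fine_tuned` are callables taking (image_id, attribute)
    and returning a label.
    """
    model = fine_tuned if attribute in HIGH_AMBIGUITY else vlm
    return model(image_id, attribute)


# Usage with stand-in models that only report which path was taken.
label = classify_hybrid(
    "img_0001",
    "pavement_condition",
    vlm=lambda image_id, attribute: "from_vlm",
    fine_tuned=lambda image_id, attribute: "from_fine_tuned",
)
print(label)  # from_fine_tuned
```

This keeps labeled-data costs limited to the few attributes where zero-shot accuracy is weakest.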

For the first time, comprehensive road safety assessments are within reach for governments with limited budgets. That changes everything.

References

[1] Jongwiriyanurak, N., Zeng, Z., Goo, J. M., Wang, X., Ilyankou, I., Sriroongvikrai, K., Christie, N., Wang, M., Chen, H., & Haworth, J. (2024) "V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?" arXiv preprint arXiv:2408.10872.
[2] Malik, S., Hasan, S., & Meng, X. (2025) "Vision-Language Models for Highway Roadside Safety Management: A Comparative Study." Journal of Management in Engineering, 42(1).
[3] Ashqar, H. I., Jaber, A., Alhadidi, T. I., & Elhenawy, M. (2024) "Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing." arXiv preprint arXiv:2409.18286.
[4] Sainju, A. M. & Jiang, Z. (2019) "Mapping road safety features from streetview imagery: A deep learning approach." arXiv preprint arXiv:1907.12647.
[5] Arya, D., Maeda, H., Ghosh, S. K., Toshniwal, D., & Sekimoto, Y. "Improving Road Safety through Deep Learning-based Approaches for Road Damage Detection and Classification." International Journal of Computer Applications.
[6] Eslami, E. & Yun, H. B. (2021) "Attention-Based Multi-Scale Convolutional Neural Network for Multi-Class Classification in Road Images." Sensors, 21(15), 5137.
[7] Jan, Z., Verma, B., Affum, J., Atabak, S., & Moir, L. (2018) "A Convolutional Neural Network Based Deep Learning Technique for Identifying Road Attributes." In 2018 International Conference on Image and Vision Computing New Zealand (IVCNZ) (pp. 1–6). IEEE.
[8] Kacan, M., Orsic, M., Segvic, S., & Sevrovic, M. (2020) "Multi-Task Learning for iRAP Attribute Classification and Road Safety Assessment." IEEE Transactions on Intelligent Transportation Systems.
[9] World Health Organization (2023) "Global status report on road safety." WHO Press.

