Human Intent Prediction in Autonomous Driving: How Contextual AI Reads Non-Verbal Cues
In 60 seconds, I’ll explain how self-driving cars predict human intent, covering the Moovita trial that cut sudden safety-driver interventions by over 30% and the VLA models that make it work.
TL;DR — Quick Insights
- The Interaction Paradox: Tracking a pedestrian’s physical location is straightforward; predicting whether they intend to step onto the road based on non-verbal cues — a head turn, a shifted weight, a hand gesture — remains the ultimate challenge in urban autonomous driving.
- The Freezing Robot Problem: Ambiguous human actions — a pedestrian waving a vehicle past while hesitating at the curb — frequently lock traditional kinematic systems into endless computational safety loops, causing abrupt stops that frustrate passengers and trailing drivers.
- The 2026 Paradigm Shift: Modern autonomous platforms use Transformer-Based Contextual AI and Vision-Language-Action (VLA) models to process posture, gaze, gesture, and spatial context simultaneously — enabling vehicles to interpret human intent rather than just predict trajectory.
- Moovita Field Result: Integrating VLA-based intent prediction into Moovita’s Ngee Ann Polytechnic trial vehicles reduced sudden safety-driver interventions by over 30% in dense pedestrian areas — validating contextual AI as a production-ready technology.

Introduction: Beyond Simple Object Detection
The Paradigm Shift: Geometric vs. Intent-Driven AI – Autonomous driving is moving beyond trajectory prediction into intent interpretation. Traditional kinematic models could track where a pedestrian was, but not what they planned to do. The 2026 breakthrough comes from Transformer-based Contextual AI and Vision-Language-Action (VLA) models, which process posture, gaze, gestures, and spatial context simultaneously. This shift enables vehicles to read human intent, not just motion, reducing safety-driver interventions and proving contextual AI is ready for real-world deployment.
Modern deep learning networks are exceptionally adept at drawing a precise bounding box around a pedestrian, cyclist, or construction worker. But simply knowing where a human is standing is only half the battle. To navigate dense, unpredictable urban environments safely, an autonomous vehicle must resolve a far more complex cognitive question: What does that human plan to do next?
Humans navigate shared spaces through a continuous, unspoken language of subtle micro-behaviors. A slight tilt of the head, a brief moment of eye contact with an oncoming vehicle, a tentative step toward the curb, or a casual wave of the hand communicates right-of-way intentions to other road users instantly and reliably. For traditional, mathematically rigid AI models, this fluid social dance is extremely difficult to parse.
A comprehensive January 2026 survey in Computers & Electrical Engineering (Ham et al.) confirms that pedestrian crossing intention prediction remains a frontline research challenge, with performance on complex real-world scenarios still significantly below human-level intuition. The inability to accurately decode human intent remains the final major hurdle on the path to true, unsupervised urban autonomy.
The Interaction Paradox and the “Freezing Robot” Problem
Autonomous vehicles often struggle when human behavior is ambiguous. Faced with uncertain gestures or unclear intent, the system defaults to caution — halting abruptly or refusing to proceed. While technically safe, this “freezing robot” response creates a paradox: the vehicle avoids risk but simultaneously disrupts traffic flow, frustrates passengers, and undermines trust in autonomy.
Traditional autonomous path planning relies heavily on deterministic kinematic modeling. The vehicle observes a pedestrian’s velocity vector, calculates their forward trajectory, and adjusts its own path to maintain a safe braking buffer. This approach works well for predictable, linear human motion, such as someone walking steadily down a sidewalk.
However, human behavior is inherently non-linear, adaptive, and deeply context-dependent. When a pedestrian approaches an unmarked crosswalk, stops, hesitates, steps forward, and then suddenly steps back while waving their hand, standard kinematic trajectory models break down. The model’s distribution over likely futures spreads out into high uncertainty, a state engineers call the “freezing robot” problem.
To guarantee safety under this uncertainty, the autonomous vehicle locks its brakes and refuses to move — even when the pedestrian is explicitly signaling for it to proceed. The autonomous vehicle behaves correctly by its own internal safety logic, but completely fails as a functional urban mobility agent.
This paradox was systematically documented in Moovita’s Ngee Ann Polytechnic autonomous shuttle trials, where the initial runs revealed a recurring challenge. As the vehicles navigated steep inclines, roundabouts, and crowded mid-block crosswalks, they often came to abrupt, jerky stops when faced with ambiguous pedestrian gestures. While technically cautious, these sudden halts frustrated passengers and triggered cascading traffic disruptions for trailing human drivers.
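The freeze described above can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian model of how far the pedestrian moves toward the lane; the function names, the 5% braking threshold, and all numbers are hypothetical and are not taken from Moovita’s planner.

```python
import math


def prob_reaches_lane(distance_m: float, speed_mps: float,
                      speed_std: float, horizon_s: float) -> float:
    """Probability that the pedestrian covers distance_m within horizon_s.

    Assumes displacement toward the lane ~ Normal(speed * t, (speed_std * t)^2).
    """
    mean = speed_mps * horizon_s
    std = max(speed_std * horizon_s, 1e-9)
    z = (distance_m - mean) / std
    # P(displacement >= distance) = 1 - Phi(z), with Phi the normal CDF
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


def should_brake(p_conflict: float, threshold: float = 0.05) -> bool:
    """Conservative rule: brake whenever conflict probability exceeds threshold."""
    return p_conflict > threshold


# Pedestrian standing 2 m from the lane with a confident velocity estimate.
steady = prob_reaches_lane(distance_m=2.0, speed_mps=0.0,
                           speed_std=0.2, horizon_s=3.0)

# Same pedestrian, but hesitation makes the velocity estimate very uncertain.
hesitant = prob_reaches_lane(distance_m=2.0, speed_mps=0.0,
                             speed_std=1.0, horizon_s=3.0)

print(f"steady:   p={steady:.4f} brake={should_brake(steady)}")
print(f"hesitant: p={hesitant:.4f} brake={should_brake(hesitant)}")
```

The mean prediction is identical in both cases. Only the spread differs, and the spread alone is enough to trip the safety rule, which is why a purely kinematic planner stops even when the pedestrian is waving the vehicle through.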
Why Traditional Kinematic Models Fail
Kinematic trajectory prediction is a method used to forecast the future path of a moving object based purely on the mathematics of motion, such as its current position, velocity, and acceleration.
The defining feature of a kinematic approach is that it ignores the forces causing the movement (such as engine torque, gravity, or friction). Instead, it relies on physics equations to project where an object will be a few seconds down the line.
How It Works – Imagine an autonomous vehicle tracking another car or a pedestrian along its planned path. By capturing a rapid snapshot of their movement within a fraction of a second, the system can feed those data points into standard motion models. These models then extrapolate likely trajectories, allowing the vehicle to anticipate where the object will be in the immediate future and adjust its own path accordingly.
Traditional kinematic models — such as Constant Velocity (CV), Constant Acceleration (CA), and Constant Turn Rate & Velocity (CTRV) — are widely used in autonomous vehicle motion prediction. While they provide a simple mathematical framework, they often break down in complex, real‑world environments.
| Model | Assumption | Strength | Limitation |
|---|---|---|---|
| CV (Constant Velocity) | Assumes the object will continue moving at the same speed and in the same direction without change. | Simple, efficient | Fails with sudden stops or turns |
| CA (Constant Acceleration) | Assumes the object’s speed is changing, accounting for whether it is accelerating or decelerating along its path. | Captures acceleration | Struggles with erratic changes |
| CTRV (Constant Turn Rate & Velocity) | Assumes the object maintains a steady speed while turning at a consistent rate, making it especially effective for predicting curved paths — ideal for vehicles navigating bends and intersections. | Predicts curves | Unrealistic in complex traffic |
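The three models in the table reduce to a few lines of arithmetic. The sketch below is illustrative only: the `State` fields and function names are assumptions made for this example, and the equations are the standard textbook forms of CV, CA, and CTRV in a flat 2-D plane.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    x: float               # position along the x axis (m)
    y: float               # position along the y axis (m)
    speed: float           # speed along the heading (m/s)
    heading: float         # heading angle (rad), 0 = +x axis
    accel: float = 0.0     # acceleration along the heading (m/s^2)
    yaw_rate: float = 0.0  # turn rate (rad/s)


def predict_cv(s: State, t: float) -> tuple[float, float]:
    """Constant Velocity: same speed, same direction."""
    return (s.x + s.speed * math.cos(s.heading) * t,
            s.y + s.speed * math.sin(s.heading) * t)


def predict_ca(s: State, t: float) -> tuple[float, float]:
    """Constant Acceleration along a fixed heading (no clamping at zero speed)."""
    d = s.speed * t + 0.5 * s.accel * t * t
    return (s.x + d * math.cos(s.heading),
            s.y + d * math.sin(s.heading))


def predict_ctrv(s: State, t: float) -> tuple[float, float]:
    """Constant Turn Rate & Velocity: motion along a circular arc."""
    if abs(s.yaw_rate) < 1e-6:
        return predict_cv(s, t)  # straight-line limit avoids division by zero
    radius = s.speed / s.yaw_rate
    end_heading = s.heading + s.yaw_rate * t
    return (s.x + radius * (math.sin(end_heading) - math.sin(s.heading)),
            s.y + radius * (math.cos(s.heading) - math.cos(end_heading)))


# A pedestrian walking at 1.4 m/s along +x, projected 2 seconds ahead.
walker = State(x=0.0, y=0.0, speed=1.4, heading=0.0)
print(predict_cv(walker, 2.0))
```

None of these functions takes anything but the motion state as input, which is the root of the limitations listed in the table: gaze, gestures, and scene context have nowhere to enter.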
Pros and Cons – Traditional kinematic models are fast and simple, but they lack the nuance needed for real‑world autonomy. Modern AV systems increasingly rely on probabilistic models, machine learning, and contextual AI to overcome these limitations.
| Strengths | Weaknesses |
|---|---|
| Simplicity: Easy to implement and computationally efficient. | Rigid Assumptions: Assume linear or uniform motion, which rarely holds true in dynamic traffic. |
| Predictability: Work well in controlled or structured environments (e.g., highways). | Poor Human Interaction Modeling: Fail to capture ambiguous pedestrian gestures or unpredictable driver behavior. |
| Baseline Models: Provide a foundation for more advanced motion prediction algorithms. | Limited Adaptability: Struggle with sudden stops, lane changes, or erratic movements. |
| Low Resource Demand: Require minimal data and processing power compared to deep learning models. | Safety Risks: Over‑simplification can lead to abrupt braking or “freezing robot” scenarios in crowded urban settings. |
| Lightweight & Fast: Requires very little computational power; mathematically straightforward. | No Context Awareness: Cannot incorporate environmental cues like traffic signals, crosswalk density, or social norms. |
| No Prior Data Needed: Doesn’t require deep learning training or map data to function. | Short-Sighted: Only accurate for short time horizons (usually less than 1 to 2 seconds). |
| | Blind to Context: Doesn’t know a car is approaching a red light or that a pedestrian will stop at a sidewalk. |
The constant-motion assumption behind these models holds in simple, structured environments. It breaks down in real urban settings for several reasons:
- Intention discontinuities: Pedestrians change their minds mid-motion (stopping, reversing, zigzagging), often with no precursor in the tracked position and velocity.
- Social forces: Human path choices are governed by invisible social rules (personal space, right-of-way conventions, turn-taking) that trajectory models cannot represent.
- Multi-agent coupling: A pedestrian’s decision to cross is often conditional on simultaneous observations of multiple vehicles — their speed, direction, and apparent attentiveness.
- Non-verbal communication: Gestures, gaze direction, and body posture convey intention information that has no representation in kinematic state vectors.
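The first failure mode in the list can be made concrete with a small worked example. The track below is hypothetical: a pedestrian walks toward the road for one second, pauses, then steps back. A constant-velocity forecast made at the one-second mark is compared with what actually happens two seconds later.

```python
def cv_forecast(position_m: float, velocity_mps: float, horizon_s: float) -> float:
    """Constant-velocity extrapolation in one dimension."""
    return position_m + velocity_mps * horizon_s


# Hypothetical 1-D track: (time in s, distance walked toward the road in m).
track = [
    (0.0, 0.0), (0.5, 0.7), (1.0, 1.4),   # steady approach at about 1.4 m/s
    (1.5, 1.4), (2.0, 1.4),               # hesitation at the curb
    (2.5, 1.0), (3.0, 0.6),               # steps back
]

# Estimate velocity from the last two samples available at t = 1.0 s.
(t_prev, p_prev), (t_now, p_now) = track[1], track[2]
velocity = (p_now - p_prev) / (t_now - t_prev)

predicted = cv_forecast(p_now, velocity, horizon_s=2.0)  # forecast for t = 3.0 s
actual = track[-1][1]
error_m = abs(predicted - actual)

print(f"predicted {predicted:.1f} m, actual {actual:.1f} m, error {error_m:.1f} m")
```

The forecast places the pedestrian well into the road while the real track ends back near the starting point. Nothing in the position and velocity history at the one-second mark distinguishes this pedestrian from one who keeps walking.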
Where It’s Used – Because kinematic prediction is so fast, it serves as the foundational “first layer” of safety in engineering. It is heavily used in Autonomous Vehicles (AVs) for immediate collision avoidance, in robotics for catching or dodging objects, and in Advanced Driver Assistance Systems (ADAS) to trigger emergency braking when a car ahead suddenly stops.
For longer-term predictions, engineers usually combine these kinematic formulas with AI models that can understand human intent and environmental context.
The 2026 Breakthrough: Transformer-Based Contextual AI and VLA Models
To break through the freezing robot problem, the robotics research community has undergone a fundamental paradigm shift. In 2026, cutting-edge autonomous platforms have transitioned from simple trajectory estimation to Transformer-Based Contextual AI — leveraging the same self-attention mechanisms that power large language models to process the driving environment as a unified, highly interconnected sequence.
Instead of tracking a pedestrian as an isolated object defined by position and velocity, transformer-based intent models analyze the broader environmental context simultaneously: pedestrian posture, gaze direction, proximity to curb, crowd density, vehicle approach speed, time of day, and interaction history. This holistic contextual vector allows the model to infer intent rather than merely predict trajectory.
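The self-attention step at the heart of such models can be sketched in plain Python. This is a simplified illustration, not any production architecture: the token embeddings are made-up numbers, and the learned query, key, and value projections of a real transformer are replaced by the identity so the mechanism stays visible.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: list[list[float]]):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Returns (outputs, weights); weights[i][j] is how much token i attends
    to token j.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * tok[i] for w, tok in zip(row, tokens))
                        for i in range(dim)])
    return outputs, weights


# Hypothetical 4-dimensional embeddings for three context cues.
tokens = [
    [1.0, 0.0, 1.0, 0.0],  # pedestrian posture
    [1.0, 0.0, 0.9, 0.1],  # gaze direction
    [0.0, 1.0, 0.0, 1.0],  # vehicle approach speed
]
outputs, weights = self_attention(tokens)
for name, row in zip(["posture", "gaze", "vehicle"], weights):
    print(name, [round(w, 3) for w in row])
```

Each output row is a weighted blend of every cue in the scene, so the representation of posture already carries information about gaze and vehicle speed. That mixing is what lets the model treat the scene as one interconnected sequence instead of isolated tracks.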
Key Micro-Behavior Signals
Research from the TrajFusionNet project (Landry & Akhloufi, 2025) — a transformer-based model combining future trajectory and vehicle speed predictions as priors for crossing intention — demonstrates that incorporating sequential context substantially outperforms single-frame detection approaches. The model’s Sequence Attention Module (SAM) and Visual Attention Module (VAM) work in parallel, capturing both kinematic history and visual cues simultaneously.
The key micro-behavior signals that contextual AI models learn to interpret include:
- Gaze Tracking: Is the pedestrian actively looking at the oncoming vehicle (crossing intent signal) or distracted by a smartphone (high collision risk)? Direct eye contact with a vehicle strongly correlates with yielding behavior.
- Posture and Weight Shift: Is the pedestrian’s body weight leaning forward onto the ball of the foot (crossing intent), or is the torso angled away from the street (stopping intent)? Subtle weight shifts precede movement initiation by 300–500 ms.
- Gesture Parsing: Distinguishing between an ambiguous arm swing during walking, an explicit wave-past gesture (yielding), and a “halt” palm gesture (stopping the vehicle). Context (direction, speed, repetition) is critical for disambiguation.
- Head Rotation Angle: A pedestrian checking traffic before crossing exhibits a characteristic left-right head scan pattern that precedes crossing by 1–2 seconds — a reliable early intent indicator.
- Social Context: Group dynamics — whether the pedestrian is following a companion who has already stepped off the curb — provide strong predictive signals absent from single-agent models.
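As one illustration of how a cue from this list might be turned into a feature, the sketch below counts left-right head scans in a sequence of head-yaw angles. It is a hand-written heuristic under stated assumptions: the yaw values are presumed to come from an upstream head-pose estimator, the 30-degree threshold is hypothetical, and the learned models discussed here would extract such patterns from data instead of fixed rules.

```python
def count_head_scans(yaw_deg: list[float], min_turn_deg: float = 30.0) -> int:
    """Count left/right alternations in a per-frame head-yaw sequence.

    yaw_deg: 0 = facing straight across the road, negative = looking left,
             positive = looking right.
    A scan is counted each time the head moves from a clear look on one
    side to a clear look on the other.
    """
    scans = 0
    last_side = 0  # -1 = left, +1 = right, 0 = no clear look yet
    for yaw in yaw_deg:
        if yaw <= -min_turn_deg:
            side = -1
        elif yaw >= min_turn_deg:
            side = 1
        else:
            continue  # small head movements are ignored
        if last_side != 0 and side != last_side:
            scans += 1
        last_side = side
    return scans


# Hypothetical sequences: one pedestrian checks traffic, one just walks.
checking_traffic = [0, -20, -45, -50, -10, 20, 40, 55, 10, -35]
walking_ahead = [0, 5, -5, 3, 0]

print(count_head_scans(checking_traffic), count_head_scans(walking_ahead))
```

A scan count like this would be only one input among many. On its own it cannot tell a pedestrian about to cross from one who looked both ways and decided to wait.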
Vision-Language-Action (VLA) Models
The most advanced 2026 systems go further, integrating Vision-Language-Action (VLA) models that fuse visual perception with natural language reasoning about human behavior. These models are trained on diverse human activity datasets — including the PIE (Pedestrian Intention Estimation) and JAAD (Joint Attention in Autonomous Driving) benchmarks — and can generate structured reasoning about observed human behavior.
A VLA model processing a crosswalk scene might internally reason: “Pedestrian at curb — gaze directed at vehicle — left foot forward — group stopped — vehicle speed 15 km/h decreasing — high probability: pedestrian will yield and wait.” This explicit reasoning structure can be logged for regulatory audit, partially addressing the explainability gap while delivering superior behavioral prediction accuracy.
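A reasoning trace like the one above could be stored as a structured record for later audit. The sketch below shows one possible shape for such a record; the class name, field names, intent label, and confidence value are all hypothetical, since the source does not describe the actual log format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class IntentRecord:
    """One logged intent decision (illustrative schema, not a real format)."""
    timestamp_s: float
    observations: dict[str, str]
    vehicle_speed_kmh: float
    vehicle_speed_trend: str
    predicted_intent: str
    confidence: float

    def to_json(self) -> str:
        # sort_keys keeps the log lines stable and easy to diff
        return json.dumps(asdict(self), sort_keys=True)


record = IntentRecord(
    timestamp_s=12.4,
    observations={
        "position": "at curb",
        "gaze": "directed at vehicle",
        "stance": "left foot forward",
        "group": "stopped",
    },
    vehicle_speed_kmh=15.0,
    vehicle_speed_trend="decreasing",
    predicted_intent="yield_and_wait",
    confidence=0.9,  # placeholder for "high probability"
)

log_line = record.to_json()
print(log_line)
```

Writing one JSON object per decision keeps the trace machine-readable, so an auditor can filter for every case where the vehicle proceeded on a predicted yield and check the observations behind it.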
Intent Prediction Framework Comparison: Kinematic vs LSTM vs Transformer vs VLA (2026)
Table: Intent Prediction Framework Comparison
| Technique | Prediction Method | Strengths | Weaknesses | Example Application |
|---|---|---|---|---|
| Kinematic Trajectory Models | Velocity vector extrapolation | Efficient; excellent for linear motion | Fails completely on hesitation, stops, direction changes | Early AV prototypes; basic ADAS |
| Pose Keypoint Estimation | Skeletal joint tracking | Captures body language explicitly | Requires close range; occlusion-sensitive | MIT Pedestrian datasets; academic benchmarks |
| LSTM/RNN Sequence Models | Temporal trajectory history | Captures motion patterns over time | Cannot process multi-modal context simultaneously | ETH/UCY trajectory benchmarks |
| Transformer Contextual AI | Multi-modal self-attention | Holistic: posture + gaze + scene + history combined | Computationally intensive; substantial onboard silicon required | TrajFusionNet; ACIT models; Moovita production trials |
| VLA Models | Vision + language reasoning | Generates explicit intent reasoning; partially explainable | Very high compute; requires multimodal training data | Advanced AV R&D; 2026 production pilots |
ACIT: Attention-Guided Cross-Modal Interaction Transformer
A strong 2025 research example demonstrating the state of the art is ACIT — the Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction. ACIT addresses the pedestrian-vehicle interaction challenge by fusing pedestrian behavioral features with contextual scene features through a cross-modal attention mechanism.
The model processes four data streams simultaneously:
- Pedestrian appearance features (bounding box image crops)
- Pose keypoint sequences (skeletal joint positions over time)
- Environmental context features (road layout, traffic density, signal state)
- Vehicle ego-motion data (approach speed, deceleration profile)
By cross-attending between these streams, ACIT learns interaction patterns between the pedestrian’s behavioral state and the vehicle’s approach parameters — a critical capability for real crosswalk scenarios where the pedestrian’s decision is explicitly conditioned on the vehicle’s observed behavior. Results on the PIE and JAAD benchmarks show significant improvements over single-modality baselines.
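The cross-attention idea can be sketched generically. This is not ACIT’s implementation: the function below is plain scaled dot-product cross-attention with no learned projections, and the pose and ego-motion numbers are invented for illustration. Pedestrian pose tokens act as queries, and vehicle ego-motion tokens supply the keys and values.

```python
import math


def cross_attend(queries: list[list[float]],
                 keys: list[list[float]],
                 values: list[list[float]]) -> list[list[float]]:
    """Each query row attends over all key rows and blends the value rows."""
    dim = len(keys[0])
    fused = []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        fused.append([sum(w * val[i] for w, val in zip(attn, values))
                      for i in range(len(values[0]))])
    return fused


# Hypothetical 2-D embeddings of pedestrian pose at two time steps (queries).
pose_tokens = [[0.9, 0.1], [0.2, 0.8]]

# Hypothetical ego-motion stream at three time steps.
ego_keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ego_values = [[15.0, -0.5], [12.0, -1.0], [8.0, -1.5]]  # speed km/h, accel m/s^2

fused = cross_attend(pose_tokens, ego_keys, ego_values)
for row in fused:
    print([round(v, 2) for v in row])
```

Each fused row summarizes the vehicle’s approach as seen from one pedestrian pose state, which is the kind of conditioning the paragraph describes: the pedestrian’s behavioral state selects which part of the vehicle’s motion history matters.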
The Social Dimension: Why Traffic Is Fundamentally a Human Conversation
Beyond the technical architecture, human intent prediction reveals a deeper truth about autonomous driving: traffic is not a physics problem, it is a social one. Human drivers navigate by continuously reading and broadcasting behavioral signals — making eye contact, nodding acknowledgment, adjusting speed to signal deference, flashing headlights to grant right-of-way.
A 2024 Stanford HRI study found that pedestrians modulate crossing behavior by treating vehicle speed as a signal of intent, slowing or stopping entirely when a vehicle fails to decelerate 3+ seconds before the crosswalk, regardless of traffic signals.
An autonomous vehicle that cannot participate in this implicit social conversation is not truly autonomous — it is merely a sophisticated obstacle navigation system that will inevitably fail at the seams of human-machine interaction. As researcher H.R. Pelikan noted in human-robot interaction literature, the social dimension of traffic represents one of the most profound and underappreciated challenges in autonomous driving research.
The road to true urban autonomy requires building vehicles that are not faster calculators but genuine social participants — machines that can read a human’s intentions, adapt their behavior accordingly, and communicate their own intentions clearly back. VLA models and contextual transformers are the first generation of technology genuinely approaching this goal.
Practical Insight from the Field
“In Singapore’s Ngee Ann Polytechnic trials, we routinely saw pedestrians pause mid-crosswalk or wave ambiguously, instantly locking our early path planners into the freezing robot loop. To resolve this behavioral gridlock, we integrated multimodal Vision-Language-Action (VLA) models directly into our decision core. Instead of calculating tracking paths based purely on physical distance and speed, our VLA models process visual micro-behaviors — posture changes, hand gestures — alongside spatial trajectory tracks. The model acts as a real-time contextual translator, letting the vehicle understand when a pedestrian is actively yielding right-of-way. Implementing this contextual awareness reduced sudden safety-driver interventions by over 30% in dense areas. It proved to us that mastering the final frontier of autonomous driving is not just about building faster algorithms — it is about engineering a machine that truly understands human behavior.”— Dr. Dilip Kumar Limbu, Co-Founder of Moovita
Further Reading on UDHY
- Physical AI & VLA Models: Powering Tomorrow’s Robots
- Multi-Agent Robot Systems and Fleet Coordination
- Autonomous Navigation and SLAM
- Deep Learning for Robotics & Autonomous Systems
- Level 3 vs Level 4 Autonomous Driving: Key Differences
- Why Self-Driving Cars Still Fail
- Is AI Speeding Up or Slowing Down AV Development?
References & External Sources
- Ham, J-S., Huang, J., Jiang, P., & Kim, C. (2026). Pedestrian Intention Prediction for Autonomous Vehicles: A Comprehensive Survey. Computers & Electrical Engineering.
- Landry, F.G. & Akhloufi, M.A. (2025). TrajFusionNet: Pedestrian Crossing Intention Prediction via Fusion of Sequential and Visual Trajectory Representations. arXiv:2508.19866.
- ACIT (2025). Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction. arXiv:2511.20020.
- Pedestrian Crossing Intent Prediction via Psychological Behavioral Streams. arXiv:2603.19533 (2026).
- Saleh, K., Hossny, M., & Nahavandi, S. (2020). Contextual Recurrent Predictive Model for Long-Term Intent Prediction of Vulnerable Road Users. IEEE Trans. Intelligent Transportation Systems, 21, 3398–3408.
- Pelikan, H.R. (2021). Why Autonomous Driving Is So Hard: The Social Dimension of Traffic. ACM/IEEE International Conference on Human-Robot Interaction.
- Yu, P. et al. (2025). Pedestrian Trajectory Intention Prediction via Spatio-Temporal Attention Mechanism. Preprints.org:202503.1382.
- PIE Dataset: Pedestrian Intention Estimation in urban traffic. University of Toronto.
- JAAD: Joint Attention in Autonomous Driving benchmark. York University / Concordia University.
- Moovita AV Trials: Real-world Field Deployment Logs for VLA System Architecture Validation. Singapore Ngee Ann Polytechnic.
About the Author
Dr. Dilip Kumar Limbu | Co-Founder, Moovita | Former Principal Scientist, A*STAR | PhD, Auckland University of Technology
Disclaimer
The views expressed here are personal and based on 30+ years in the industry, including my work at Moovita. They do not necessarily reflect the views of any organization.