ADVersa: Abductive Driving Accident Video Understanding

ORCID ID

Hongkai Yu, https://orcid.org/0000-0001-5383-8913

Document Type

Article

Publication Date

2-11-2026

Publication Title

IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract

Understanding traffic accident scenes is a long-standing research problem in vision-based safe driving. It seeks to answer why accidents occur, how near-crash scenes develop, and what the key elements of an accident are. This research is challenging due to the scarcity and fragmentation of accident data, as well as the complexity of accident environments. To study this, we present ADVersa, a framework for Abductive Driving accident Video understanding, which infers plausible visual and textual explanations for absent near-crash scenes. ADVersa encompasses three groups of tasks: 1) visual past recovery of near-crash scenes, 2) visual prediction of near-crash scenes, and 3) accident-cause-involved video synthesis. To support the study, we first contribute MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild driving accident videos with temporally aligned text descriptions, 2.23 million well-annotated object boxes, and 58,650 pairs of video-based accident cause texts. We then propose an Abductive CLIP model and a Contrastive Graph Video Pre-training (CGVP) model, which exploit relation-aware cross-modal semantic learning to drive spatially and temporally abductive accident video diffusion. Extensive experiments verify the superiority of ADVersa over state-of-the-art approaches on different tasks, i.e., historical near-crash video frame recovery, crash video frame prediction, textual accident cause and category reasoning, normal-to-accident video synthesis, and accident video editing. With these efforts, we hope this research will advance progress on multimodal accident video understanding.

DOI

10.1109/TPAMI.2026.3663545
