Open Access
ETIA: Enhancing Text2Image surround view scene generation with semantic annotation via diffusion for autonomous driving
Author(s) -
Ramyashree,
S Raghavendra,
S K Abhilash,
Venu Madhav Nookala,
Arun Kumar,
P Malashree
Publication year - 2025
Publication title -
ieee access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3591146
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Generating high-fidelity surround view images from text prompts is a complex task that requires balancing contextual coherence with computational efficiency. The proposed work introduces a novel methodology that combines recurrent attention-based encoder-decoder architectures with text-to-image diffusion models to produce coherent and continuous surround view images. The approach uses a custom text encoder to convert input text prompts into contextual embeddings, which are then processed by the proposed ViewNet Unet2d architecture within the decoder. This architecture employs dual cross-attention mechanisms: one aligns text embeddings with the corresponding noise image latents, while the other integrates previously generated image latents to ensure continuity across the sequence. This design ensures that each generated image adheres to its specific prompt while maintaining coherence with preceding images. In addition, an annotation decoder is introduced that generates semantic segmentation maps, instance segmentation masks, and object detection annotations. The annotation decoder processes latent image maps using a shared feature extraction backbone and dedicated heads for each annotation task. Experimental results on the nuScenes validation set demonstrate the effectiveness of the proposed model in producing high-quality, contextually aligned surround view images. The proposed model achieves an FVD of 99 and an FID of 12.6, outperforming existing methods such as Panacea+ and DriveDreamer-2. Furthermore, our approach improves segmentation and detection accuracy, achieving a PQ of 67.4, an mIoU of 80.1, and an mAP of 65.4, surpassing methods like OpenSeeD and D2Det. An ablation study highlights the contributions of key components of the architecture: integrating positional encoding, self-attention, and concurrent attention significantly enhances generation quality, reducing FVD to 99 and FID to 12.6. Overall, the results demonstrate the effectiveness of the proposed work in producing high-quality, contextually aligned surround view images with comprehensive annotations, pushing the boundaries of text-to-image synthesis and scene understanding.
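
To make the dual cross-attention idea described in the abstract concrete, the following is a minimal, hypothetical PyTorch-style sketch (not the paper's implementation; the module names, dimensions, and residual layout are assumptions): one attention path conditions the noisy latents of the current view on the prompt embeddings, while a second path attends to the latents of the previously generated view to keep the surround sequence continuous.

```python
# Illustrative sketch of a UNet block with two cross-attention paths:
# one over text embeddings, one over the previously generated view's latents.
# All names and sizes are placeholders, not taken from the paper.
import torch
import torch.nn as nn


class DualCrossAttentionBlock(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        # Cross-attention aligning noisy image latents with the prompt embeddings.
        self.text_attn = nn.MultiheadAttention(latent_dim, num_heads,
                                               kdim=text_dim, vdim=text_dim,
                                               batch_first=True)
        # Cross-attention over the preceding view's latents, enforcing
        # continuity across the surround-view sequence.
        self.prev_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(latent_dim)
        self.norm2 = nn.LayerNorm(latent_dim)
        self.norm3 = nn.LayerNorm(latent_dim)

    def forward(self, latents, text_emb, prev_latents):
        # latents:      (B, N, latent_dim) noisy latents of the current view
        # text_emb:     (B, T, text_dim)   contextual embeddings of the prompt
        # prev_latents: (B, N, latent_dim) latents of the previously generated view
        x = latents
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.text_attn(self.norm2(x), text_emb, text_emb)[0]
        x = x + self.prev_attn(self.norm3(x), prev_latents, prev_latents)[0]
        return x
```

In this sketch, the first generated view of a scene could simply reuse its own latents (or a learned placeholder) as `prev_latents`, so the same block serves the whole sequence.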
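Likewise, a rough sketch of an annotation decoder with a shared feature-extraction backbone and dedicated per-task heads, as the abstract describes; the channel counts, class counts, and head designs below are placeholders rather than the paper's configuration.

```python
# Illustrative annotation decoder: a shared convolutional backbone over the
# latent image maps, followed by separate heads for semantic segmentation,
# instance masks, and object detection. Sizes and names are assumptions.
import torch
import torch.nn as nn


class AnnotationDecoder(nn.Module):
    def __init__(self, latent_channels=4, feat_channels=256,
                 num_classes=19, num_instance_queries=100):
        super().__init__()
        # Shared feature-extraction backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(latent_channels, feat_channels, 3, padding=1),
            nn.GroupNorm(32, feat_channels),
            nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.GroupNorm(32, feat_channels),
            nn.SiLU(),
        )
        # Dedicated heads for each annotation task.
        self.semantic_head = nn.Conv2d(feat_channels, num_classes, 1)           # per-pixel class logits
        self.instance_head = nn.Conv2d(feat_channels, num_instance_queries, 1)  # per-query mask logits
        self.detection_head = nn.Conv2d(feat_channels, num_classes + 4, 1)      # class logits + box offsets per location

    def forward(self, latent_maps):
        feats = self.backbone(latent_maps)
        return {
            "semantic": self.semantic_head(feats),
            "instance": self.instance_head(feats),
            "detection": self.detection_head(feats),
        }
```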
