Open Access
ETIA: Enhancing Text2Image surround view scene generation with semantic annotation via diffusion for autonomous driving
Author(s) -
Ramyashree,
S Raghavendra,
S K Abhilash,
Venu Madhav Nookala,
Arun Kumar,
P Malashree
Publication year - 2025
Publication title -
ieee access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3591146
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Generating high-fidelity surround view images from text prompts is a complex task that requires balancing contextual coherence with computational efficiency. The proposed work introduces a novel methodology that combines recurrent attention-based encoder-decoder architectures with text-to-image diffusion models to produce coherent and continuous surround view images. The approach uses a custom text encoder to convert input text prompts into contextual embeddings, which are then processed by the proposed ViewNet Unet2d architecture within the decoder. This architecture employs dual cross-attention mechanisms: one aligns text embeddings with the corresponding noise image latents, while the other integrates previously generated image latents to ensure continuity across the sequence. This design ensures that each generated image adheres to its specific prompt while maintaining coherence with preceding images. In addition, an annotation decoder is introduced that generates semantic segmentation maps, instance segmentation masks, and object detection annotations. The annotation decoder processes latent image maps using a shared feature extraction backbone and dedicated heads for each annotation task. Experimental results on the nuScenes validation set demonstrate the effectiveness of the proposed model in producing high-quality, contextually aligned surround view images. The proposed model achieves an FVD of 99 and an FID of 12.6, outperforming existing methods such as Panacea+ and DriveDreamer-2. Furthermore, our approach improves segmentation and detection accuracy, achieving a PQ of 67.4, an mIoU of 80.1, and an mAP of 65.4, surpassing methods like OpenSeeD and D2Det. An ablation study highlights the contributions of key components of the architecture: integrating positional encoding, self-attention, and concurrent attention significantly enhances generation quality, reducing FVD to 99 and FID to 12.6. Overall, the results demonstrate the effectiveness of the proposed work in producing high-quality, contextually aligned surround view images with comprehensive annotations, pushing the boundaries of text-to-image synthesis and scene understanding.
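
To make the dual cross-attention idea described in the abstract concrete, the following is a minimal, hypothetical PyTorch-style sketch (not the paper's implementation; the module names, dimensions, and residual layout are assumptions): one attention path conditions the noisy latents of the current view on the prompt embeddings, while a second path attends to the latents of the previously generated view to keep the surround sequence continuous.

```python
# Illustrative sketch of a UNet block with two cross-attention paths:
# one over text embeddings, one over the previously generated view's latents.
# All names and sizes are placeholders, not taken from the paper.
import torch
import torch.nn as nn


class DualCrossAttentionBlock(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        # Cross-attention aligning noisy image latents with the prompt embeddings.
        self.text_attn = nn.MultiheadAttention(latent_dim, num_heads,
                                               kdim=text_dim, vdim=text_dim,
                                               batch_first=True)
        # Cross-attention over the preceding view's latents, enforcing
        # continuity across the surround-view sequence.
        self.prev_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(latent_dim)
        self.norm2 = nn.LayerNorm(latent_dim)
        self.norm3 = nn.LayerNorm(latent_dim)

    def forward(self, latents, text_emb, prev_latents):
        # latents:      (B, N, latent_dim) noisy latents of the current view
        # text_emb:     (B, T, text_dim)   contextual embeddings of the prompt
        # prev_latents: (B, N, latent_dim) latents of the previously generated view
        x = latents
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.text_attn(self.norm2(x), text_emb, text_emb)[0]
        x = x + self.prev_attn(self.norm3(x), prev_latents, prev_latents)[0]
        return x
```

In this sketch, the first generated view of a scene could simply reuse its own latents (or a learned placeholder) as `prev_latents`, so the same block serves the whole sequence.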
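Likewise, a rough sketch of an annotation decoder with a shared feature-extraction backbone and dedicated per-task heads, as the abstract describes; the channel counts, class counts, and head designs below are placeholders rather than the paper's configuration.

```python
# Illustrative annotation decoder: a shared convolutional backbone over the
# latent image maps, followed by separate heads for semantic segmentation,
# instance masks, and object detection. Sizes and names are assumptions.
import torch
import torch.nn as nn


class AnnotationDecoder(nn.Module):
    def __init__(self, latent_channels=4, feat_channels=256,
                 num_classes=19, num_instance_queries=100):
        super().__init__()
        # Shared feature-extraction backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(latent_channels, feat_channels, 3, padding=1),
            nn.GroupNorm(32, feat_channels),
            nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.GroupNorm(32, feat_channels),
            nn.SiLU(),
        )
        # Dedicated heads for each annotation task.
        self.semantic_head = nn.Conv2d(feat_channels, num_classes, 1)           # per-pixel class logits
        self.instance_head = nn.Conv2d(feat_channels, num_instance_queries, 1)  # per-query mask logits
        self.detection_head = nn.Conv2d(feat_channels, num_classes + 4, 1)      # class logits + box offsets per location

    def forward(self, latent_maps):
        feats = self.backbone(latent_maps)
        return {
            "semantic": self.semantic_head(feats),
            "instance": self.instance_head(feats),
            "detection": self.detection_head(feats),
        }
```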
