Open Access
Enhancing Temporal Coherence in Image-to-Video Facial Expression Synthesis: A Dual-Loss Framework for Smoother Generation
Author(s) -
Rafael Luiz Testa,
Ariane Machado-Lima,
Fatima L. S. Nunes
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3612820
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Facial expression synthesis for video sequences presents a significant challenge: maintaining temporal coherence while preserving identity and expression accuracy. Most current approaches focus primarily on individual frame quality, which results in videos with unnatural transitions, flickering artifacts, and inconsistent expression dynamics. This paper presents a novel method for improving temporal coherence in facial expression image-to-video synthesis. The proposed approach uses a dual-loss framework that explicitly models both pixel-level transitions and geometric consistency between consecutive frames through two specialized loss functions: a frame difference loss and a landmark consistency loss. We also implement a two-stage training strategy and a frame-blending post-processing technique. Extensive testing on the MUG Facial Expression Database demonstrates that the enhanced-coherence approach outperforms baseline methods and is competitive with other approaches from the literature. The proposed method achieves the lowest Average Content Distance score of all tested approaches, indicating smoother frame transitions while maintaining high visual quality. A subjective evaluation with 140 participants confirms that the synthesized expressions are more realistic and expressive; for certain emotions, particularly disgust and sadness, our approach even exceeded real videos in recognition rates. These findings demonstrate the effectiveness of explicitly modeling temporal coherence when generating realistic facial expression videos, with promising applications in therapy, entertainment, and human-computer interaction.
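
The abstract describes the two specialized losses and the post-processing step only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of how a frame difference loss (pixel-level transitions), a landmark consistency loss (geometric smoothness), and a frame-blending step could be implemented. The tensor shapes, the L1 distance, the acceleration-based smoothness term, and the blending weight alpha are all assumptions made for illustration, not the formulations used in the paper.

import torch
import torch.nn.functional as F

def frame_difference_loss(fake_frames, real_frames):
    # Match pixel-level transitions between consecutive frames.
    # Both tensors are assumed to have shape (batch, time, channels, H, W).
    fake_diff = fake_frames[:, 1:] - fake_frames[:, :-1]
    real_diff = real_frames[:, 1:] - real_frames[:, :-1]
    return F.l1_loss(fake_diff, real_diff)

def landmark_consistency_loss(landmarks):
    # Encourage geometrically smooth landmark trajectories across frames.
    # `landmarks` is assumed to come from a differentiable landmark detector
    # applied to the generated frames, shape (batch, time, num_landmarks, 2).
    velocity = landmarks[:, 1:] - landmarks[:, :-1]
    # Penalize abrupt changes in landmark velocity between consecutive frames.
    acceleration = velocity[:, 1:] - velocity[:, :-1]
    return acceleration.abs().mean()

def blend_frames(frames, alpha=0.8):
    # Simple post-processing: blend each frame with its predecessor to
    # suppress flicker. The weight alpha on the current frame is an
    # assumed value, not one reported in the paper.
    blended = frames.clone()
    blended[:, 1:] = alpha * frames[:, 1:] + (1 - alpha) * frames[:, :-1]
    return blended

For the two-stage training strategy, one plausible schedule is to first train the generator with the usual per-frame objectives and then fine-tune with these temporal terms added to the total loss; the actual staging and loss weights in the paper may differ.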
