Retrieval of TV Talk-Show Speakers by Associating Audio Transcript to Visual Clusters

Yina Han; Shanghuan Song; Weikang Zhao

Open Access

Publication year - 2017

Publication title - IEEE Access

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.587

H-Index - 127

ISSN - 2169-3536

DOI - 10.1109/access.2017.2756451

Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation

Retrieval of TV talk-show speakers based solely on visual face recognition is hard because the visual variations caused by illumination, pose, size, and expression can exceed those due to identity. Fortunately, TV talk shows often exhibit specific visual production styles and are accompanied by other modalities, such as an audio transcript. Hence, this paper presents a speaker retrieval framework that associates the "who" and "when" information provided by the audio transcript with a set of visual clusters. First, to obtain the visual clusters, an unsupervised speaker identity clustering strategy is proposed, by which appearances of the same speaker are grouped together without knowing who that speaker is. Then, to identify the specific speaker for each group, we propose an association strategy in which the search is initially limited to the clusters corresponding to the times when the queried speaker is speaking, followed by a graph-based densest sub-graph refinement. Comprehensive experiments on 3 h of the French TV talk show “Le Grand Echiquier”, provided by the K-space project, show satisfactory results. Moreover, when evaluated on the more challenging MediaEval 2015 task using only the provided speaker diarization and face tracking modules, the proposed association strategy achieves state-of-the-art performance, demonstrating its effectiveness.
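
The two-stage association described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the graph construction or the densest sub-graph solver, so this example assumes face tracks given as time intervals, a hypothetical pairwise similarity table `sim`, and a standard greedy peeling heuristic for the densest sub-graph. All names (`candidates_by_time`, `densest_subgraph`, `tracks`, `speech`) are made up for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open time intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def candidates_by_time(tracks: Dict[str, Interval],
                       speech: List[Interval]) -> List[str]:
    """Stage 1 (assumed form): keep only the face tracks that overlap
    a segment where the transcript says the queried speaker talks."""
    return [name for name, span in tracks.items()
            if any(overlaps(span, seg) for seg in speech)]


def densest_subgraph(nodes: List[str],
                     sim: Dict[FrozenSet[str], float]) -> Set[str]:
    """Stage 2 (assumed form): greedy peeling. Repeatedly remove the
    node with the smallest weighted degree and remember the node set
    with the highest density (total edge weight / number of nodes)."""
    alive = set(nodes)
    if not alive:
        return set()

    def weight(u: str, v: str) -> float:
        return sim.get(frozenset((u, v)), 0.0)

    degree = {u: sum(weight(u, v) for v in alive if v != u) for u in alive}
    total = sum(degree.values()) / 2.0  # each edge was counted twice
    best, best_density = set(alive), total / len(alive)

    while len(alive) > 1:
        # Ties are broken by name so the result is deterministic.
        u = min(alive, key=lambda n: (degree[n], n))
        alive.remove(u)
        total -= degree[u]
        for v in alive:
            degree[v] -= weight(u, v)
        density = total / len(alive)
        if density > best_density:
            best, best_density = set(alive), density
    return best


# Toy data: four face tracks, one speech segment for the queried speaker.
tracks = {"t1": (0.0, 10.0), "t2": (5.0, 15.0),
          "t3": (12.0, 20.0), "t4": (40.0, 50.0)}
speech = [(0.0, 14.0)]
sim = {frozenset(("t1", "t2")): 0.9,
       frozenset(("t1", "t3")): 0.1,
       frozenset(("t2", "t3")): 0.2}

candidates = candidates_by_time(tracks, speech)
selected = densest_subgraph(candidates, sim)
```

In this toy case the temporal filter drops `t4`, which never coincides with the speaker's speech, and the refinement then discards `t3`, which overlaps in time but is only weakly similar to the other candidates. That mirrors the intuition in the abstract: co-occurrence with speech narrows the search, and mutual visual similarity removes faces that merely happen to be on screen.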
