
dc.contributor.author: Adeel, Ahsan
dc.contributor.author: Gogate, Mandar
dc.contributor.author: Hussain, Amir
dc.date.accessioned: 2020-01-10T13:05:53Z
dc.date.available: 2020-01-10T13:05:53Z
dc.date.issued: 2019-08-19
dc.identifier.citation: Adeel, A., Gogate, M. and Hussain, A. (2020) Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments, Information Fusion, 59, pp. 163-170. https://doi.org/10.1016/j.inffus.2019.08.008 [en]
dc.identifier.issn: 1566-2535 [en]
dc.identifier.doi: 10.1016/j.inffus.2019.08.008 [en]
dc.identifier.uri: http://hdl.handle.net/2436/622981
dc.description.abstract: Human speech processing is inherently multi-modal: visual cues (lip movements) help listeners understand speech in noise. Our recent work [1] has shown that lip-reading driven, audio-visual (AV) speech enhancement can significantly outperform benchmark audio-only approaches at low signal-to-noise ratios (SNRs). However, consistent with our cognitive hypothesis, visual cues were found to be relatively less effective for speech enhancement at high SNRs (low levels of background noise), where audio-only cues worked well enough. Therefore, a more cognitively-inspired, context-aware AV approach is required, one that contextually utilises both visual and noisy audio features and thus accounts more effectively for different noisy conditions. In this paper, we introduce a novel context-aware AV framework that contextually exploits AV cues with respect to different operating conditions to estimate clean audio, without requiring any prior SNR estimation. The switching module integrates a convolutional neural network (CNN) and a long short-term memory (LSTM) network, and learns to switch between visual-only (V-only), audio-only (A-only), and combined audio-visual cues at low, high and moderate SNR levels, respectively. For testing, the estimated clean audio features are used by an enhanced visually-derived Wiener filter (EVWF) to filter the noisy speech. The context-aware AV speech enhancement framework is evaluated under dynamic real-world scenarios (including cafe, street, bus, and pedestrian) at different SNR levels (ranging from low to high SNRs), using the benchmark Grid and ChiME3 corpora. For objective testing, perceptual evaluation of speech quality (PESQ) is used to evaluate the quality of the restored speech. For subjective testing, the standard mean-opinion-score (MOS) method is used.
Comparative experimental results show that our context-aware AV approach outperforms A-only, V-only, spectral subtraction (SS), and log-minimum mean square error (LMMSE) based speech enhancement methods at both low and high SNRs. These preliminary findings demonstrate the capability of our proposed approach to deal with spectro-temporal variations in any real-world noisy environment by contextually exploiting the complementary strengths of audio and visual cues. In conclusion, our deep learning-driven AV framework is posited as a benchmark resource for the multi-modal speech processing and machine learning communities. [en]
dc.description.sponsorship: This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1 and deepCI.org. [en]
dc.format: application/pdf [en]
dc.language: en
dc.language.iso: en [en]
dc.publisher: Elsevier BV [en]
dc.relation.url: https://www.sciencedirect.com/science/article/pii/S1566253518306018?via%3Dihub [en]
dc.subject: context-aware learning [en]
dc.subject: multi-modal speech enhancement [en]
dc.subject: Wiener filtering [en]
dc.subject: audio-visual [en]
dc.subject: deep learning [en]
dc.title: Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments [en]
dc.type: Journal article [en]
dc.identifier.journal: Information Fusion [en]
dc.date.updated: 2020-01-08T18:16:14Z
dc.date.accepted: 2019-08-19
rioxxterms.funder: UK Engineering and Physical Sciences Research Council (EPSRC)
rioxxterms.identifier.project: EP/M026981/1 [en]
rioxxterms.version: AM [en]
rioxxterms.licenseref.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/ [en]
rioxxterms.licenseref.startdate: 2021-02-19 [en]
dc.description.version: Published version
refterms.dateFCD: 2020-01-10T13:05:11Z
refterms.versionFCD: AM


Files in this item

Name: Publisher version

Name: IF2019.pdf
Embargo: 2021-02-19
Size: 1.496Mb
Format: PDF



Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/