Show simple item record

dc.contributor.author: Adeel, Ahsan
dc.contributor.author: Gogate, Mandar
dc.contributor.author: Hussain, Amir
dc.contributor.author: Whitmer, William M
dc.date.accessioned: 2019-10-22T11:16:30Z
dc.date.available: 2019-10-22T11:16:30Z
dc.date.issued: 2019-09-05
dc.identifier.citation: Adeel, A., Gogate, M., Hussain, A. and Whitmer, W. M. (2019) Lip-reading driven deep learning approach for speech enhancement, IEEE Transactions on Emerging Topics in Computational Intelligence (forthcoming)
dc.identifier.issn: 2471-285X
dc.identifier.doi: 10.1109/tetci.2019.2917039
dc.identifier.uri: http://hdl.handle.net/2436/622874
dc.description.abstract: This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The proposed approach leverages the complementary strengths of both deep learning and analytical acoustic modelling (a filtering-based approach), in contrast to recently published, comparatively simpler benchmark approaches that rely only on deep learning. The proposed audio-visual (AV) speech enhancement framework operates at two levels. At the first level, a novel deep learning-based lip-reading regression model is employed. At the second level, the lip-reading approximated clean-audio features are exploited, using an enhanced, visually-derived Wiener filter (EVWF), for clean audio power spectrum estimation. Specifically, a stacked long-short-term memory (LSTM) based lip-reading regression model is designed to estimate clean audio features from temporal visual features alone, considering different numbers of prior visual frames. For clean speech spectrum estimation, a new filterbank-domain EVWF is formulated, which exploits the estimated speech features. The proposed EVWF is compared with conventional Spectral Subtraction and Log-Minimum Mean-Square Error methods using both ideal AV mapping and LSTM-driven AV mapping. The potential of the proposed speech enhancement framework is evaluated under different dynamic real-world commercially-motivated scenarios (e.g. cafe, public transport, pedestrian area) at different SNR levels (ranging from low to high SNRs) using the benchmark Grid and ChiME3 corpora. For objective testing, perceptual evaluation of speech quality is used to evaluate the quality of the restored speech. For subjective testing, the standard mean-opinion-score method is used with inferential statistics. Comparative simulation results demonstrate significant lip-reading and speech enhancement improvement in terms of both speech quality and speech intelligibility.
dc.description.sponsorship: UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1.
dc.format: application/PDF
dc.language.iso: en
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.url: https://ieeexplore.ieee.org/document/8825842
dc.subject: lip reading
dc.subject: stacked long-short-term memory
dc.subject: enhanced visually-derived Wiener filtering
dc.subject: context-aware audio-visual speech enhancement
dc.subject: audio-visual ChiME3 corpus
dc.title: Lip-reading driven deep learning approach for speech enhancement
dc.type: Journal article
dc.identifier.journal: IEEE Transactions on Emerging Topics in Computational Intelligence
dc.date.updated: 2019-09-29T16:19:34Z
dc.date.accepted: 2019-04-28
rioxxterms.funder: University of Wolverhampton
rioxxterms.identifier.project: EP/M026981/1
rioxxterms.version: AM
rioxxterms.licenseref.uri: https://creativecommons.org/licenses/by/4.0/
rioxxterms.licenseref.startdate: 2019-10-22
dc.source.volume: abs/1808.00046
dc.source.beginpage: 1
dc.source.endpage: 10
dc.description.version: Published version
refterms.dateFCD: 2019-10-22T11:16:15Z
refterms.versionFCD: AM
refterms.dateFOA: 2019-10-22T11:16:30Z


Files in this item

Name: 08825842.pdf
Size: 3.924Mb
Format: PDF


Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by/4.0/