
Speech enhancement using multisensory cooperative computing

Abstract
This thesis investigates novel approaches to audio-visual speech enhancement (AVSE) through biologically inspired architectures, lightweight multimodal learning, and hybrid classical–deep frameworks, advancing three complementary themes of the speech enhancement problem. First, Multisensory Cooperative Computing (MCC) is introduced: a deep neural architecture inspired by the context-sensitive processing of two-point layer-five pyramidal neurons (L5PCs). Unlike conventional point-neuron models that process all inputs indiscriminately, MCC adaptively filters and amplifies only contextually salient signals via a dendritic gating mechanism. Implemented on Xilinx UltraScale+ MPSoC hardware, the system suppresses redundant synaptic activity to achieve substantial energy savings: up to 62% in semi-supervised settings and a 1250× reduction in energy per feedforward pass in supervised mode. MCC thus establishes a paradigm for energy-efficient, high-capacity neuromorphic computing suited to real-time audio-visual learning. Second, to address AVSE on resource-constrained edge devices and the challenges of real-world noise, a novel target mask, the Ideal Smoothed Mask (ISM), is proposed. ISM combines morphological and spectral filtering for robust speech separation, while a transfer-learning fusion framework maps visual lip movements to speech representations with enhanced temporal modelling. Nonlinear transfer functions and a multi-objective loss incorporating mutual information strengthen cross-modal attention. The resulting system improves generalisation, reduces mask complexity, and supports real-time enhancement on constrained hardware. Third, a lightweight AVSE framework is presented that merges classical spectral subtraction with visual speech detection: a CNN–LSTM module classifies short lip sequences into speech/no-speech labels to guide noise estimation and subtraction, overcoming the unreliability of audio-only voice activity detection (VAD) at low SNR.
By isolating noise-only segments using robust lip-based cues, the approach preserves the interpretability and efficiency of classical methods while achieving substantial perceptual gains. Collectively, these contributions provide a unified vision for context-aware, resource-efficient, and explainable speech enhancement, bridging deep learning with neuro-inspired design and practical deployment.
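The third theme, visually guided spectral subtraction, can be illustrated with a minimal sketch. This is not the thesis's implementation: the `vad_labels` input stands in for the CNN–LSTM lip classifier's speech/no-speech output, and the over-subtraction factor and spectral floor are assumed illustrative parameters of classical spectral subtraction.

```python
import numpy as np

def spectral_subtract(noisy_mag, vad_labels, over_sub=1.0, floor=0.01):
    """Classic magnitude spectral subtraction guided by a visual VAD.

    noisy_mag  : (frames, bins) STFT magnitudes of the noisy signal
    vad_labels : (frames,) bool; True where the (hypothetical) CNN-LSTM
                 lip classifier detects speech
    """
    # Estimate the noise spectrum from frames the visual VAD marks as
    # noise-only; this is the step that lip-based cues keep reliable
    # at low SNR, where audio-only VAD fails.
    noise_est = noisy_mag[~vad_labels].mean(axis=0, keepdims=True)
    # Subtract the (optionally over-weighted) noise estimate, then clamp
    # to a spectral floor to limit musical-noise artefacts.
    enhanced = noisy_mag - over_sub * noise_est
    return np.maximum(enhanced, floor * noisy_mag)
```

The enhanced magnitudes would then be recombined with the noisy phase and inverted back to the time domain, preserving the interpretability and low cost of the classical pipeline.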
Citation
Ahmed, K. (2025) Speech enhancement using multisensory cooperative computing. University of Wolverhampton. https://wlv.openrepository.com/handle/2436/626132
Type
Thesis or dissertation
Language
en
Description
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.
Sponsors
EPSRC