A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English
Abstract: In this study, we propose a Gated Recurrent Unit (GRU) model to restore the following features in unformatted English text: word and sentence boundaries, periods, commas, and capitalisation. We approach feature restoration as a binary classification task in which the model learns to predict whether or not a feature should be restored. A pipeline approach is proposed, in which each component of the pipeline model restores only one feature (word boundary, sentence boundary, punctuation, capitalisation). To optimise the model, we conducted a grid search on the parameters. The effect of changing the order of the pipeline is also investigated experimentally; PERIODS > COMMAS > SPACES > CASING yielded the best result. Our findings highlight several specific action points with optimisation potential to be targeted in follow-up research.
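As a rough illustration of the pipeline idea described in the abstract, the sketch below chains four stages in the order PERIODS > COMMAS > SPACES > CASING. Each stage makes one binary decision per character and restores a single feature. It is a minimal sketch under stated assumptions, not the authors' implementation: the per-character labelling scheme, the helper names (`toy_predictor`, `toy_casing`, `make_stage`, `run_pipeline`), and the rule-based predictors are all hypothetical stand-ins for trained GRU classifiers.

```python
from typing import Callable, List

# A predictor returns one binary label per input character
# (1 = restore the feature at this position, 0 = leave as is).
Predictor = Callable[[str], List[int]]


def toy_predictor(targets: List[str]) -> Predictor:
    """Hypothetical stand-in for a trained GRU classifier.

    Labels the last character of every occurrence of each target string.
    """
    def predict(text: str) -> List[int]:
        labels = [0] * len(text)
        for target in targets:
            start = text.find(target)
            while start != -1:
                labels[start + len(target) - 1] = 1
                start = text.find(target, start + 1)
        return labels
    return predict


def toy_casing(text: str) -> List[int]:
    """Hypothetical casing predictor: label sentence-initial letters."""
    labels = [0] * len(text)
    for i, ch in enumerate(text):
        if ch.isalpha() and (i == 0 or text[max(0, i - 2):i] == ". "):
            labels[i] = 1
    return labels


def insert_after(symbol: str) -> Callable[[str, List[int]], str]:
    """Build a function that inserts `symbol` after each character labelled 1."""
    def apply(text: str, labels: List[int]) -> str:
        return "".join(ch + symbol if lab == 1 else ch
                       for ch, lab in zip(text, labels))
    return apply


def capitalise(text: str, labels: List[int]) -> str:
    """Upper-case each character labelled 1."""
    return "".join(ch.upper() if lab == 1 else ch
                   for ch, lab in zip(text, labels))


def make_stage(predict: Predictor,
               apply: Callable[[str, List[int]], str]) -> Callable[[str], str]:
    """Combine a binary predictor with the edit that restores its feature."""
    def stage(text: str) -> str:
        labels = predict(text)
        if len(labels) != len(text):
            raise ValueError("predictor must return one label per character")
        return apply(text, labels)
    return stage


def run_pipeline(text: str, stages: List[Callable[[str], str]]) -> str:
    """Run the stages in order; each stage sees the previous stage's output."""
    for stage in stages:
        text = stage(text)
    return text


# Stage order taken from the abstract: PERIODS > COMMAS > SPACES > CASING.
stages = [
    make_stage(toy_predictor(["works"]), insert_after(".")),       # periods
    make_stage(toy_predictor(["well"]), insert_after(",")),        # commas
    make_stage(toy_predictor(["well,", "it"]), insert_after(" ")),  # spaces
    make_stage(toy_casing, capitalise),                            # casing
]

restored = run_pipeline("wellitworks", stages)
print(restored)  # Well, it works.
```

Because every stage consumes the output of the stage before it, reordering the `stages` list changes what each predictor sees as input, which is one way to picture why the order of the pipeline could affect results.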
Citation: Sivakumar, J., Muga, J., Spadavecchia, F., White, D. and Can, B. (2022) A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English. 2021 International Conference on Asian Language Processing (IALP), pp. 268-273.
Description: This is an accepted manuscript of an article published by IEEE in Proceedings of 2021 International Conference on Asian Language Processing (IALP) on 20 Jan 2022. Available online at https://doi.org/10.1109/IALP54817.2021.9675269. The accepted version of the publication may differ from the final published version.
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/