A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English
Abstract
In this study, we propose a Gated Recurrent Unit (GRU) model to restore the following features: word and sentence boundaries, periods, commas, and capitalisation for unformatted English text. We approach feature restoration as a binary classification task where the model learns to predict whether a feature should be restored or not. A pipeline approach is proposed, in which only one feature (word boundary, sentence boundary, punctuation, capitalisation) is restored in each component of the pipeline model. To optimise the model, we conducted a grid search on the parameters. The effect of changing the order of the pipeline is also investigated experimentally; PERIODS > COMMAS > SPACES > CASING yielded the best result. Our findings highlight several specific action points with optimisation potential to be targeted in follow-up research.
Citation
Sivakumar, J., Muga, J., Spadavecchia, F., White, D. and Can, B. (2022) A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English. 2021 International Conference on Asian Language Processing (IALP), pp. 268-273.
Publisher
IEEE
Additional Links
https://ieeexplore.ieee.org/abstract/document/9675269
Type
Conference contribution
Language
en
Description
This is an accepted manuscript of an article published by IEEE in Proceedings of 2021 International Conference on Asian Language Processing (IALP) on 20 Jan 2022. Available online at https://doi.org/10.1109/IALP54817.2021.9675269. The accepted version of the publication may differ from the final published version.
ISBN
9781665483117
DOI
10.1109/IALP54817.2021.9675269
Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/