A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English

Sivakumar, Jasivan
Muga, Jake
Spadavecchia, Flavio
White, Daniel
Can Buglalilar, Burcu
Abstract
In this study, we propose a Gated Recurrent Unit (GRU) model to restore the following features in unformatted English text: word and sentence boundaries, periods, commas, and capitalisation. We approach feature restoration as a binary classification task in which the model learns to predict whether a feature should be restored or not. We propose a pipeline approach in which each component of the pipeline model restores only one feature (word boundary, sentence boundary, punctuation, or capitalisation). To optimise the model, we conducted a grid search on the parameters. We also investigated experimentally the effect of changing the order of the pipeline; the order PERIODS > COMMAS > SPACES > CASING yielded the best result. Our findings highlight several specification points with optimisation potential to be targeted in follow-up research.
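To illustrate the binary-classification framing described in the abstract, the following minimal sketch derives one 0/1 label per character for a single feature (here, word boundaries) and then re-inserts the feature from those labels. It is not the authors' implementation: the function names `make_labels` and `restore` are hypothetical, and the gold labels stand in for what a trained GRU would predict.

```python
def make_labels(text, feature=" "):
    """Strip `feature` from `text` and return (stripped_text, labels).

    labels[i] is 1 if the feature followed the i-th remaining character
    in the original text, else 0. (Hypothetical helper, for illustration.)
    """
    chars, labels = [], []
    for ch in text:
        if ch == feature:
            if labels:              # ignore a feature with no preceding character
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def restore(stripped, labels, feature=" "):
    """Re-insert `feature` after every character whose label is 1."""
    out = []
    for ch, label in zip(stripped, labels):
        out.append(ch)
        if label == 1:
            out.append(feature)
    return "".join(out)


# Gold labels are used here in place of model predictions (assumption).
stripped, labels = make_labels("the cat sat")
restored = restore(stripped, labels)
```

In a pipeline of the kind the abstract describes, one such label-and-restore step would be run per feature, with each component consuming the output of the previous one.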
Citation
Sivakumar, J., Muga, J., Spadavecchia, F., White, D. and Can, B. (2022) A GRU-based pipeline approach for word-sentence segmentation and punctuation restoration in English. 2021 International Conference on Asian Language Processing (IALP), pp.268-273.
Type
Conference contribution
Language
en
Description
This is an accepted manuscript of an article published by IEEE in the Proceedings of the 2021 International Conference on Asian Language Processing (IALP) on 20 Jan 2022. Available online at https://doi.org/10.1109/IALP54817.2021.9675269. The accepted version of the publication may differ from the final published version.
ISBN
9781665483117