Multimodal Quality Estimation for Machine Translation

We propose approaches to Quality Estimation (QE) for Machine Translation that explore both text and visual modalities for Multimodal QE. We compare various multimodality integration and fusion strategies. For both sentence-level and document-level predictions, we show that state-of-the-art neural and feature-based QE frameworks obtain better results when using the additional modality.


Introduction
Quality Estimation (QE) for Machine Translation (MT) (Blatz et al., 2004; Specia et al., 2009) aims to predict the quality of a machine-translated text without using reference translations. It estimates a label (a category, such as 'good' or 'bad', or a numerical score) for a translation, given text in a source language and its machine translation in a target language (Specia et al., 2018b). QE can operate at different linguistic levels, including sentence and document levels. Sentence-level QE estimates the translation quality of a whole sentence, while document-level QE predicts the translation quality of an entire document, although in the literature the documents have in practice been limited to short texts of 3-5 sentences (Specia et al., 2018b).
Existing work has only explored textual context. We posit that to judge (or estimate) the quality of a translated text, additional context is paramount. Sentences or short documents taken out of context may lack information on the correct translation of certain (especially ambiguous) constructions. Inspired by recent work on multimodal machine learning (Baltrusaitis et al., 2019; Barrault et al., 2018), we propose to explore the visual modality in addition to the text modality for this task.
Multimodality through vision offers interesting opportunities for real-life data since texts are increasingly accompanied by visual elements such as images or videos, especially in social media but also in domains such as e-commerce. Multimodality has not yet been applied to QE. Table 1 shows an example from our e-commerce dataset in which multimodality could help to improve QE. Here, the English noun shorts is translated as the French adjective court (short), which is a possible translation out of context. However, as the corresponding product image shows, this product is an item of clothing, and thus the machine translation is incorrect. External information can hence help identify mismatches between translations which are difficult to find within the text.

Source (EN): Danskin Women's Bermuda Shorts
MT (FR): Bermuda Danskin féminines court
Table 1: Example of incorrectly machine-translated text: the word shorts is used to indicate short trousers, but gets translated in French as court, the adjective short. Here multimodality could help to detect the error (extracted from the Amazon Reviews Dataset of McAuley et al., 2015).

* Two authors contributed equally.

Progress in QE is mostly benchmarked as part of the Conference on Machine Translation (WMT) Shared Task on QE. This paper is based on data from the WMT'18 edition's Task 4, document-level QE. Task 4 aims to predict a translation quality score for short documents based on the number and the severity of translation errors at the word level (Specia et al., 2018a). This dataset was chosen as it is the only one for which meta information (images in this case) is available. We extend this dataset by computing scores for each sentence for a sentence-level prediction task. We consider both feature-based and neural state-of-the-art models for QE. Having these as our starting points, we propose different ways to integrate the visual modality.
The main contributions of this paper are as follows: (i) we introduce the task of Multimodal QE (MQE) for MT as an attempt to improve QE by using external sources of information, namely images; (ii) we propose several ways of incorporating visual information in neural-based and feature-based QE architectures; and (iii) we achieve state-of-the-art performance for such architectures in document-level and sentence-level QE.

QE Frameworks and Models
We explore feature-based and neural-based models from two open-source frameworks:
QuEst++: QuEst++ (Specia et al., 2015) is a feature-based QE framework composed of two modules: a feature extractor module, which extracts the relevant QE features from both the source sentences and their translations, and a machine learning module. We only use this framework for our experiments on document-level QE, since it does not perform well enough for sentence-level prediction. We use the same model (Support Vector Regression), hyperparameters and feature settings as the baseline model for the document-level QE task at WMT'18.
deepQuest: deepQuest (Ive et al., 2018) is a neural-based framework that provides state-of-the-art models for multi-level QE. We use the BiRNN model, a light-weight architecture which can be trained at either sentence or document level.
The BiRNN model uses an encoder-decoder architecture: it takes as input both the source sentence and its translation, which are encoded separately by two independent bi-directional Recurrent Neural Networks (RNNs). The two resulting sentence representations are then concatenated as a weighted sum of their word vectors, generated by an attention mechanism. For sentence-level predictions, the weighted representation of the two input sentences is passed through a dense layer with sigmoid activation to generate the quality estimates. For document-level predictions, the final representation of a document is generated by a second attention mechanism, as the weighted sum of the weighted sentence-level representations of all the sentences within the document. The resulting document-level representation is then passed through a dense layer with sigmoid activation to generate the quality estimates.
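The two levels of attention-weighted pooling described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the deepQuest implementation: the scoring function (a dot product with a context vector) and all function and parameter names are hypothetical, since the text only states that an attention mechanism produces the weights.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weighted sum of `vectors`; weights come from dot products with `context`.

    The dot-product scorer is an assumption made for this sketch.
    """
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def sentence_representation(src_states, tgt_states, src_ctx, tgt_ctx):
    """Pool source and target RNN states separately, then concatenate them."""
    return attention_pool(src_states, src_ctx) + attention_pool(tgt_states, tgt_ctx)


def document_representation(sentence_reps, doc_ctx):
    """Second attention level: weighted sum over sentence representations."""
    return attention_pool(sentence_reps, doc_ctx)
```

In the model, the pooled vector would then be passed through the dense layer with sigmoid activation; that step is omitted here.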
Additionally, we propose and experiment with BERT-BiRNN, a variant of the BiRNN model. Rather than training the token embeddings with the task at hand, we use large-scale pre-trained token-level representations from the multilingual cased base BERT model (Devlin et al., 2019). During training, the BERT model is fine-tuned by unfreezing the weights of the last four hidden layers along with the token embedding layer. This performs comparably to the state-of-the-art predictor-estimator neural model in Kepler et al. (2019).

Data
WMT'18 QE Task 4 data: This dataset was created for the document-level track. It contains a sample of products from the Amazon Reviews Dataset (McAuley et al., 2015) taken from the Sports & Outdoors category. 'Documents' consist of the English product title and its description, its French machine translation and a numerical score to predict, namely the MQM score (Multidimensional Quality Metrics) (Lommel et al., 2014). This score is computed by annotating and weighting each word-level translation error according to its severity (minor, major and critical):

\[ \mathrm{MQM} = 1 - \frac{n_{\mathrm{min}} + 5\, n_{\mathrm{maj}} + 10\, n_{\mathrm{cri}}}{n} \]

where $n$ is the total number of words, and $n_i$ is the number of errors annotated with the corresponding error severity $i$. Additionally, the dataset provides one picture per product, as well as pre-extracted visual features, as we discuss below. For the sentence-level QE task, each document of the dataset was split into sentences (lines), where every sentence has its corresponding MQM score computed in the same way as for the document. We note that this variant is different from the official sentence-level track at WMT since for that task visual information is not available.
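As a worked illustration of the severity-weighted score, the sketch below computes an MQM value from error counts. The severity weights (1, 5 and 10 for minor, major and critical errors) are an assumption taken from the WMT'18 Task 4 definition as we understand it; the function and parameter names are hypothetical.

```python
def mqm_score(n_words, n_minor, n_major, n_critical, weights=(1, 5, 10)):
    """Severity-weighted MQM score: 1 minus the weighted error count per word.

    `weights` holds the (minor, major, critical) penalties; the default
    values are an assumption, not something stated in the surrounding text.
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    w_min, w_maj, w_cri = weights
    penalty = w_min * n_minor + w_maj * n_major + w_cri * n_critical
    return 1 - penalty / n_words


# A hypothetical 6-word sentence with a single major error.
print(round(mqm_score(6, 0, 1, 0), 3))
```

Under these assumed weights the score can become negative for short segments with severe errors, for instance a 4-word segment with one critical error.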
Text features: For the feature-based approach, we extract the same 15 features as those for the baseline of WMT'18 at document level. For the neural-based approaches, text features are either the learned word embeddings (BiRNN) or pre-trained word embeddings (BERT-BiRNN).
Visual features: The visual features are pre-extracted vectors with 4,096 dimensions, also provided in the Amazon Reviews Dataset (McAuley et al., 2015). The method to obtain the features uses a deep convolutional neural network which has been pre-trained on the ImageNet dataset for image classification (Deng et al., 2009). The visual features extracted represent a vectorial summary of the image taken from the last pooled layer of the network. He and McAuley (2016) have shown that this representation contains useful visual features for a number of tasks.

Multimodal QE
We propose different ways to integrate visual features in our two monomodal QE approaches (Sections 3.1 and 3.2). We compare each proposed model with its monomodal QE counterpart as baseline, both using the same hyperparameters.

Multimodal feature-based QE
The feature-based textual features contain 15 numerical scores, while the visual feature vector contains 4,096 dimensions. To avoid over-weighting the visual features, we reduce their dimensionality using Principal Component Analysis (PCA). We consider up to 15 principal components in order to keep a balance between the visual features and the 15 text features from QuEst++. We choose the final number of principal components to keep according to the explained variance with the PCA, so this number is treated as a hyperparameter. After analysing the explained variance for up to 15 kept principal components (see Figure 4 in Appendix), we selected six numbers of principal components to train QE models with (1, 2, 3, 5, 10, and 15). As fusion strategy, we concatenate the two feature vectors.
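The reduce-then-concatenate step can be sketched with NumPy as follows. This is a minimal illustration on random stand-in data, not the pipeline used in the experiments: PCA is computed through an SVD of the centred matrix, the function names are hypothetical, and fitting the projection on the same matrix it transforms is a simplification (in practice it would be fitted on the training set and reused for the development and test sets).

```python
import numpy as np


def pca_reduce(visual, n_components):
    """Project rows of `visual` onto their top principal components.

    Returns the reduced matrix and the explained-variance ratio of the
    components that were kept.
    """
    centered = visual - visual.mean(axis=0)
    # SVD of the centred data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    ratio = variance / variance.sum()
    reduced = centered @ vt[:n_components].T
    return reduced, ratio[:n_components]


def fuse_features(text_feats, visual_feats, n_components):
    """Concatenate text features with PCA-reduced visual features."""
    reduced, ratio = pca_reduce(visual_feats, n_components)
    return np.hstack([text_feats, reduced]), ratio


rng = np.random.default_rng(0)
text_feats = rng.normal(size=(50, 15))      # stand-in for the 15 text features
visual_feats = rng.normal(size=(50, 4096))  # stand-in for the 4,096-dim vectors
fused, ratio = fuse_features(text_feats, visual_feats, n_components=2)
print(fused.shape)  # (50, 17)
```

The fused matrix would then be passed to the regressor. Inspecting `ratio` for different values of `n_components` is one way to treat the number of kept components as a hyperparameter.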

Multimodal neural-based QE
Multimodality is achieved with two changes in our monomodal models: multimodality integration (where to integrate the visual features in the architecture), and fusion strategy (how to fuse the visual and textual features). We propose the following places to integrate the visual feature vector into the BiRNN architecture:
• embed - the visual feature vector is used after the word embedding layer;
• annot - the visual feature vector is used after the encoding of the two input sentences by the two bi-directional RNNs;
• last - the visual feature vector is used just before the last layer.
To fuse the visual and text features, we reduce the size of the visual features using a dense layer with a ReLU activation and reshape it to match the shape of the text-feature vector. As fusion strategies between visual and textual feature vectors, we propose the following:
• conc - concatenation with both source and target word representations for the 'embed' strategy; concatenation with the text features for the 'last' strategy;
• mult - element-wise multiplication for the target word representations and concatenation for the source word representations for the 'embed' strategy; element-wise multiplication with the text features for the 'annot' and 'last' strategies;
• mult2 - element-wise multiplication for both source and target word representations (exclusive to the 'embed' model).
Figure 1 presents the high-level architecture of the document-level BiRNN model, with the various multimodality integration and fusion approaches.
For example, in the 'embed' setting, the visual features are fused with each word representation from the embedding layers. Since this strategy modifies the embedding for each word, it can be expected to have a bigger impact on the result.
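The three fusion strategies at the 'embed' integration point can be illustrated with a small plain-Python sketch. It uses toy dimensions and hand-picked weights rather than learned ones, and all function names are hypothetical; it only shows how the projected visual vector is combined with each word vector, not the deepQuest implementation.

```python
def relu_dense(x, weights, bias):
    """Dense layer with ReLU activation: one output per row of `weights`."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def fuse_conc(text_vec, vis_vec):
    """Concatenate the visual vector to the text vector."""
    return text_vec + vis_vec


def fuse_mult(text_vec, vis_vec):
    """Element-wise product; both vectors must have the same size."""
    if len(text_vec) != len(vis_vec):
        raise ValueError("vectors must have the same length")
    return [t * v for t, v in zip(text_vec, vis_vec)]


def fuse_embed(src_words, tgt_words, vis_vec, strategy):
    """Fuse the visual vector with every source and target word vector."""
    if strategy == "conc":
        src_op, tgt_op = fuse_conc, fuse_conc
    elif strategy == "mult":
        # mult: concatenate on the source side, multiply on the target side
        src_op, tgt_op = fuse_conc, fuse_mult
    elif strategy == "mult2":
        src_op, tgt_op = fuse_mult, fuse_mult
    else:
        raise ValueError(f"unknown fusion strategy: {strategy}")
    return ([src_op(w, vis_vec) for w in src_words],
            [tgt_op(w, vis_vec) for w in tgt_words])


# Toy example: a 4-dim "image" vector projected down to the 2-dim word size.
vis = relu_dense([2.0, 3.0, 0.0, 0.0], [[1, 0, 0, 0], [0, -1, 0, 0]], [0.0, 0.0])
src, tgt = fuse_embed([[1.0, 1.0]], [[3.0, 4.0]], vis, "mult")
print(vis, src, tgt)
```

Concatenation changes the width of the word vectors, whereas multiplication keeps it, which is why the projection must match the word-vector size for 'mult' and 'mult2'.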

Results
We use the standard training, development and test datasets from the WMT'18 Task 4 track. For feature-based systems, we follow the built-in cross-validation in QuEst++, and train a single model with the hyperparameters found by cross-validation. For neural-based models, we use early stopping with a patience of 10 to avoid over-fitting, and all reported figures are averaged over 5 runs corresponding to different seeds.
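The early-stopping and seed-averaging protocol can be sketched as follows. This is a schematic under stated assumptions: the monitored quantity is taken to be a development-set score where higher is better (the text does not specify which one), and `run_fn` is a hypothetical callable that trains one model for a given seed and returns its test score.

```python
def early_stopping(dev_scores, patience=10):
    """Return (best_epoch, best_score) for a sequence of per-epoch dev scores.

    Training stops once `patience` consecutive epochs fail to improve
    on the best score seen so far.
    """
    best_epoch, best_score, waited = None, float("-inf"), 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_score


def average_over_seeds(run_fn, seeds=(0, 1, 2, 3, 4)):
    """Average the result of `run_fn(seed)` over several seeds."""
    results = [run_fn(seed) for seed in seeds]
    return sum(results) / len(results)
```

For example, with `patience=2` the score sequence 0.1, 0.3, 0.2, 0.2, 0.5 stops after the fourth epoch and keeps the model from the second one, which shows how a short patience can miss a later improvement.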
We follow the evaluation method of the WMT QE tasks: Pearson's r correlation as the main metric (Graham, 2015), with Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) as secondary metrics. For statistical significance on Pearson's r, we compute the Williams test (Williams, 1959) as suggested by Graham and Baldwin (2014).
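For reference, the three evaluation metrics can be computed as in the plain-Python sketch below. The function names are hypothetical, and this is not the official WMT evaluation script.

```python
import math


def pearson_r(pred, gold):
    """Pearson correlation between predicted and gold scores."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    return cov / math.sqrt(var_p * var_g)


def mae(pred, gold):
    """Mean absolute error."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


def rmse(pred, gold):
    """Root mean squared error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred))
```

Pearson's r measures how well the predictions track the gold scores up to a linear rescaling, so predictions can correlate perfectly while still having large MAE and RMSE; this is one reason the error metrics are reported alongside it.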
For all neural-based models, we experiment with all three integration strategies ('embed', 'annot' and 'last') and all three fusion strategies ('conc', 'mult' and 'mult2') presented in Section 3.2. Since not every fusion strategy applies to every integration point, this leads to 6 multimodal models for each of BiRNN and BERT-BiRNN. In Tables 2 and 4, as well as in Figures 2 and 3, we report the top three performing models. We refer the reader to the Appendix for the full set of results.
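The count of 6 multimodal variants follows from which fusion strategies are defined for which integration points in Section 3.2, as the short enumeration below shows. The applicability table is our reading of that section, and the helper name is hypothetical.

```python
# Integration points at which each fusion strategy is defined (Section 3.2).
APPLICABLE = {
    "conc": ("embed", "last"),
    "mult": ("embed", "annot", "last"),
    "mult2": ("embed",),
}


def multimodal_variants(base="BiRNN"):
    """List the valid integration/fusion combinations for a base model."""
    order = ("embed", "annot", "last")
    return [f"{base}+Vis-{place}-{fusion}"
            for place in order
            for fusion, places in APPLICABLE.items()
            if place in places]


for name in multimodal_variants():
    print(name)
```

Running the same enumeration with `base="BERT-BiRNN"` gives the corresponding six BERT-BiRNN variants.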

Sentence-level MQE
The first part of Table 2 presents the results for sentence-level multimodal QE with BiRNN. The best model is BiRNN+Vis-embed-mult2, achieving a Pearson's r of 0.535, significantly outperforming the baseline (p-value<0.01). Visual features can, therefore, help to improve the performance of sentence-level neural-based QE systems significantly. Figure 2 presents the results of the Williams significance test for the BiRNN model variants. It is a matrix that can be read as follows: the value in cell (i, j) is the p-value of the Williams test for the change in performance of the model at row i compared to the model at column j (Graham, 2015).
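The statistic behind each cell of such a matrix can be sketched as follows, using the form of the Williams test given by Graham and Baldwin (2014) for two correlations that share one variable. The function name and the numbers in the example are hypothetical and are not results from this paper.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams t statistic (n - 3 degrees of freedom) for H0: r12 == r13.

    r12: correlation between gold scores and model A predictions
    r13: correlation between gold scores and model B predictions
    r23: correlation between the predictions of models A and B
    n:   number of scored items
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    denom = math.sqrt(2 * k * (n - 1) / (n - 3)
                      + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3)
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23)) / denom


# Hypothetical values: model A correlates slightly better with gold than B.
t_stat = williams_t(r12=0.60, r13=0.55, r23=0.90, n=1000)
print(t_stat > 0)  # True
```

A p-value can then be read from a Student t distribution with n - 3 degrees of freedom, for example with `scipy.stats.t.sf(t_stat, n - 3)` for a one-sided test. The more strongly the two models' predictions correlate with each other, the smaller the difference in r that can reach significance.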
With the pre-trained token-level representations from BERT (second half of Table 2), the best model is BERT-BiRNN+Vis-annot-mult, achieving a Pearson's r of 0.602. This shows that even when using better word representations, the visual features help to get further (albeit modest) improvements.

Table 2: Pearson correlation at sentence-level on the WMT'18 dataset. We report the monomodal models (BiRNN, BERT-BiRNN) and their respective top-3 best performing multimodal variants (+Vis). We refer the reader to the Appendix for the full set of results.

Here, BERT, ann-mul and emb-mul2 correspond to the BERT-BiRNN, the BERT-BiRNN+Vis-annot-mult and the BiRNN+Vis-embed-mult2 models of Table 2.

Table 3 shows an example of predicted scores at the sentence-level for the baseline model (BiRNN) and for the best multimodal BiRNN model (BiRNN+Vis-embed-mult2). The multimodal model has predicted a score (-0.002) closer to the gold MQM score (0.167) than the baseline model (-0.248). The French translation is poor (cumulative-split is, for instance, not translated) as the low gold MQM score shows. However, the (main) word stopwatch is correctly translated as chronomètre in French. Since the associated picture indeed represents a stopwatch, one explanation for this improvement could be that the multimodal model may have rewarded this correct and important part of the translation.

Source (EN): The A601X stopwatch features cumulative-split timing.
MT (FR): Le chronomètre A601X dispose calendrier cumulative-split.
gold MQM score: 0.167
BiRNN: -0.248
BiRNN+Vis-embed-mult2: -0.002
Table 3: Example of performance of sentence-level multimodal QE. Compared to the baseline prediction (BiRNN), the prediction from the best multimodal model (BiRNN+Vis-embed-mult2) is closer to the gold MQM score. This could be because the word stopwatch is correctly translated as chronomètre in French, and the additional visual feature confirms it. This could lead to an increase in the predicted score to reward the correct part, despite the poor translation (extracted from the Amazon Reviews Dataset of McAuley et al., 2015).

Document-level MQE
Table 4 presents the results for the document-level feature-based and BiRNN neural QE models.¹ The first section shows the official models from the WMT'18 QE Task 4 report (Specia et al., 2018a). The neural-based approach SHEF-PT is the winning submission, outperforming another neural-based approach (SHEF-mtl-bRNN). For our BiRNN models (second section), BiRNN+Vis-embed-conc performs only slightly better than the monomodal baseline. For the feature-based models (third section), on the other hand, the baseline monomodal QuEst++ is outperformed by various multimodal variants by a large margin, with the one with two principal components (QuEst+Vis-2) performing the best. The more PCA components kept, the worse the results (see Appendix for full set of results).

¹ The BERT-BiRNN models performed very poorly at this level; investigating why is left for future work.

Table 4: Pearson correlation at document-level on the WMT'18 dataset: state-of-the-art models as reported by task organisers, our BiRNN model and its multimodal versions, and feature-based QuEst++ and its multimodal versions.

Figure 3 shows the Williams significance test for document-level QuEst++ on the WMT'18 dataset. As we can see, the QuEst+Vis-2 model outperforms the baseline with p-value = 0.002. Thus, visual features significantly improve the performance of feature-based QE systems compared to their monomodal QE counterparts.

Conclusions
We introduced Multimodal Quality Estimation for Machine Translation, where an external modality (visual information) is incorporated into feature-based and neural-based QE approaches, at sentence and document levels. The use of visual features extracted from images has led to significant improvements in the results of state-of-the-art QE approaches, especially at sentence level.
The version of deepQuest for multimodal QE and scripts to convert document-level into sentence-level data are available at https://github.com/sheffieldnlp/deepQuest.

PCA analysis
Figure 4 shows an almost linear relationship between the number of principal components and the explained variance of the PCA (see Section 3.1), i.e. the higher the number of principal components, the larger the explained variance. Therefore, we experimented with various numbers of components up to 15 (1, 2, 3, 5, 10, and 15) on the development set to find the best settings for quality prediction.

Complete results
Tables 5 and 6 present the full set of results of our experiments on document-level and sentence-level multimodal QE on our main test set, the WMT'18 test set. These are a super-set of the results presented in the main paper but include all combinations of multimodality integration and fusion strategies for sentence-level prediction, as well as different numbers of principal components kept for document-level QuEst prediction models.

Additional test set
Tables 7 and 8 present the full set of results of our experiments on the WMT'19 Task 2 test set on document-level and sentence-level multimodal QE, respectively. This was the follow-up edition of the WMT'18 Task 4, where the same training set is used, but a new test set is released.
For document-level, we observe nuanced results with more modest benefits in using visual features, regardless of the integration method or fusion strategy.
For sentence-level, we observe on the one hand quite significant improvements, with a gain of almost 8 points in Pearson's r over BiRNN, our monomodal baseline without pre-trained word embeddings. It is interesting to note that almost all multimodal variants achieve better performance compared to the monomodal BiRNN baseline, with a peak when the visual features are fused with the word embedding representations by element-wise multiplication. On the other hand, we do not observe any gain in using visual features on the WMT'19 test set compared to our monomodal baseline with pre-trained word embeddings (BERT-BiRNN). Note here that the BERT-BiRNN baseline model already performs very well. According to the task organisers, the mean MQM value on the WMT'19 test set is higher than on the WMT'18 test set, but actually closer to the training data (Fonseca et al., 2019). We therefore hypothesise here that the highly dimensional and contextualised word-level representations from BERT are already enough and do not benefit from the extra information provided by the visual features.