Exact Sampling and Optimisation in Statistical Machine Translation
dc.contributor.advisor | Mitkov, Ruslan; Specia, Lucia; Dymetman, Marc | |
dc.contributor.author | Aziz, Wilker | |
dc.date.accessioned | 2014-03-25T12:14:56Z | |
dc.date.available | 2014-03-25T12:14:56Z | |
dc.date.issued | 2014 | |
dc.identifier.uri | http://hdl.handle.net/2436/314591 | |
dc.description | A thesis submitted in partial ful lment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy | |
dc.description.abstract | In Statistical Machine Translation (SMT), inference needs to be performed over a high-complexity discrete distribution de ned by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly and one typically resorts to approximation techniques either to perform optimisation { the task of searching for the optimum translation { or sampling { the task of nding a subset of translations that is statistically representative of the goal distribution. Beam-search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. For inference tasks other than optimisation, rather than nding a single optimum, one is really interested in obtaining a set of probabilistic samples from the distribution. This is the case in training where one wishes to obtain unbiased estimates of expectations in order to t the parameters of a model. Samples are also necessary in consensus decoding where one chooses from a sample of likely translations the one that minimises a loss function. Due to the additional computational challenges posed by sampling, n-best lists, a by-product of optimisation, are typically used as a biased approximation to true probabilistic samples. A more direct procedure is to attempt to directly draw samples from the underlying distribution rather than rely on n-best list approximations. Markov Chain Monte Carlo (MCMC) methods, such as Gibbs sampling, o er a way to overcome the tractability issues in sampling, however their convergence properties are hard to assess. That is, it is di cult to know when, if ever, an MCMC sampler is producing samples that are compatible iii with the goal distribution. Rejection sampling, a Monte Carlo (MC) method, is more fundamental and natural, it o ers strong guarantees, such as unbiased samples, but is typically hard to design for distributions of the kind addressed in SMT, rendering an intractable method. A recent technique that stresses a uni ed view between the two types of inference tasks discussed here | optimisation and sampling | is the OS approach. OS can be seen as a cross between Adaptive Rejection Sampling (an MC method) and A optimisation. In this view the intractable goal distribution is upperbounded by a simpler (thus tractable) proxy distribution, which is then incrementally re ned to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level. This thesis introduces an approach to exact optimisation and exact sampling in SMT by addressing the tractability issues associated with the intersection between the translation hypergraph and the language model. The two forms of inference are handled in a uni ed framework based on the OS approach. In short, an intractable goal distribution, over which one wishes to perform inference, is upperbounded by tractable proposal distributions. A proposal represents a relaxed version of the complete space of weighted translation derivations, where relaxation happens with respect to the incorporation of the language model. These proposals give an optimistic view on the true model and allow for easier and faster search using standard dynamic programming techniques. In the OS approach, such proposals are used to perform a form of adaptive rejection sampling. In rejection sampling, samples are drawn from a proposal distribution and accepted or rejected as a function of the mismatch between the proposal and the goal. The technique is adaptive in that rejected samples are used to motivate a re nement of the upperbound proposal that brings it closer to the goal, improving the rate of acceptance. Optimisation can be connected to an extreme form of sampling, thus the framework introduced here suits both exact optimisation and exact iv sampling. Exact optimisation means that the global maximum is found with a certi cate of optimality. Exact sampling means that unbiased samples are independently drawn from the goal distribution. We show that by using this approach exact inference is feasible using only a fraction of the time and space that would be required by a full intersection, without recourse to pruning techniques that only provide approximate solutions. We also show that the vast majority of the entries (n-grams) in a language model can be summarised by shorter and optimistic entries. This means that the computational complexity of our approach is less sensitive to the order of the language model distribution than a full intersection would be. Particularly in the case of sampling, we show that it is possible to draw exact samples compatible with distributions which incorporate a high-order language model component from proxy distributions that are much simpler. In this thesis, exact inference is performed in the context of both hierarchical and phrase-based models of translation, the latter characterising a problem that is NP-complete in nature. | |
dc.language.iso | en | |
dc.publisher | University of Wolverhampton | |
dc.subject | statistical machine translation | |
dc.subject | exact inference | |
dc.subject | optimisation | |
dc.subject | sampling | |
dc.subject | coarse-to-fine search | |
dc.subject | decoding | |
dc.title | Exact Sampling and Optimisation in Statistical Machine Translation | |
dc.type | Thesis or dissertation | |
dc.type.qualificationname | PhD | |
dc.type.qualificationlevel | Doctoral | |
refterms.dateFOA | 2018-08-21T11:11:42Z | |
html.description.abstract | In Statistical Machine Translation (SMT), inference needs to be performed over a high-complexity discrete distribution de ned by the intersection between a translation hypergraph and a target language model. This distribution is too complex to be represented exactly and one typically resorts to approximation techniques either to perform optimisation { the task of searching for the optimum translation { or sampling { the task of nding a subset of translations that is statistically representative of the goal distribution. Beam-search is an example of an approximate optimisation technique, where maximisation is performed over a heuristically pruned representation of the goal distribution. For inference tasks other than optimisation, rather than nding a single optimum, one is really interested in obtaining a set of probabilistic samples from the distribution. This is the case in training where one wishes to obtain unbiased estimates of expectations in order to t the parameters of a model. Samples are also necessary in consensus decoding where one chooses from a sample of likely translations the one that minimises a loss function. Due to the additional computational challenges posed by sampling, n-best lists, a by-product of optimisation, are typically used as a biased approximation to true probabilistic samples. A more direct procedure is to attempt to directly draw samples from the underlying distribution rather than rely on n-best list approximations. Markov Chain Monte Carlo (MCMC) methods, such as Gibbs sampling, o er a way to overcome the tractability issues in sampling, however their convergence properties are hard to assess. That is, it is di cult to know when, if ever, an MCMC sampler is producing samples that are compatible iii with the goal distribution. Rejection sampling, a Monte Carlo (MC) method, is more fundamental and natural, it o ers strong guarantees, such as unbiased samples, but is typically hard to design for distributions of the kind addressed in SMT, rendering an intractable method. A recent technique that stresses a uni ed view between the two types of inference tasks discussed here | optimisation and sampling | is the OS approach. OS can be seen as a cross between Adaptive Rejection Sampling (an MC method) and A optimisation. In this view the intractable goal distribution is upperbounded by a simpler (thus tractable) proxy distribution, which is then incrementally re ned to be closer to the goal until the maximum is found, or until the sampling performance exceeds a certain level. This thesis introduces an approach to exact optimisation and exact sampling in SMT by addressing the tractability issues associated with the intersection between the translation hypergraph and the language model. The two forms of inference are handled in a uni ed framework based on the OS approach. In short, an intractable goal distribution, over which one wishes to perform inference, is upperbounded by tractable proposal distributions. A proposal represents a relaxed version of the complete space of weighted translation derivations, where relaxation happens with respect to the incorporation of the language model. These proposals give an optimistic view on the true model and allow for easier and faster search using standard dynamic programming techniques. In the OS approach, such proposals are used to perform a form of adaptive rejection sampling. In rejection sampling, samples are drawn from a proposal distribution and accepted or rejected as a function of the mismatch between the proposal and the goal. The technique is adaptive in that rejected samples are used to motivate a re nement of the upperbound proposal that brings it closer to the goal, improving the rate of acceptance. Optimisation can be connected to an extreme form of sampling, thus the framework introduced here suits both exact optimisation and exact iv sampling. Exact optimisation means that the global maximum is found with a certi cate of optimality. Exact sampling means that unbiased samples are independently drawn from the goal distribution. We show that by using this approach exact inference is feasible using only a fraction of the time and space that would be required by a full intersection, without recourse to pruning techniques that only provide approximate solutions. We also show that the vast majority of the entries (n-grams) in a language model can be summarised by shorter and optimistic entries. This means that the computational complexity of our approach is less sensitive to the order of the language model distribution than a full intersection would be. Particularly in the case of sampling, we show that it is possible to draw exact samples compatible with distributions which incorporate a high-order language model component from proxy distributions that are much simpler. In this thesis, exact inference is performed in the context of both hierarchical and phrase-based models of translation, the latter characterising a problem that is NP-complete in nature. |