Dissertations / Theses on the topic 'Machine translations'

Consult the top 50 dissertations / theses for your research on the topic 'Machine translations.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Ilisei, Iustina-Narcisa. "A machine learning approach to the identification of translational language : an inquiry into translationese learning models." Thesis, University of Wolverhampton, 2012. http://hdl.handle.net/2436/299371.

Abstract:
In the world of Descriptive Translation Studies, translationese refers to the specific traits that characterise the language used in translations. While translationese has often been investigated to illustrate that translational language is different from non-translational language, scholars have also proposed a set of hypotheses which may characterise such differences. In the quest to validate these hypotheses, the adoption of corpus-based techniques has had a well-known impact on the domain, leading to several advances in the past twenty years. Despite extensive research, however, there are no universally recognised characteristics of translational language, nor universally recognised patterns likely to occur within translational language. This thesis addresses these issues with an approach less used in the field of Descriptive Translation Studies, by investigating the nature of translational language from a machine learning perspective. While the main focus is on analysing translationese, this thesis investigates two related sub-hypotheses: simplification and explicitation. To this end, a multilingual learning framework is designed and implemented for the identification of translational language. The framework is modelled as a categorisation task, in which the major goal of the learning techniques is to automatically learn to distinguish between translated and non-translated texts. The second and third major goals of this research are the retrieval of the recurring patterns that are revealed in the process of solving the categorisation task, as well as the ranking of the most influential characteristics used to accomplish the learning task. These aims are fulfilled by implementing a system that adopts the machine learning methodology proposed in this research. The learning framework proves to be an adaptable multilingual framework for the investigation of the nature of translational language, its adaptability being illustrated in this thesis by applying it to two languages: Spanish and Romanian. In this thesis, different research scenarios and learning models are experimented with in order to assess to what extent translated texts can be differentiated from non-translated texts in certain contexts. The findings show that machine learning algorithms, aggregating a large set of potentially discriminative characteristics for translational language, are able to differentiate translated texts from non-translated ones with high scores. The evaluation experiments report performance values such as accuracy, precision, recall, and F-measure on two datasets. The present research is situated at the confluence of three areas: Descriptive Translation Studies, Machine Learning and Natural Language Processing, justifying the need to combine these fields for the investigation of translationese and translational hypotheses.
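As an illustration of the categorisation setup described above, here is a minimal sketch of translationese identification as supervised text classification, assuming scikit-learn is available; the four-text toy corpus and character n-gram features are stand-ins for the much richer feature set and data explored in the thesis:

```python
# A minimal sketch of translationese identification as text categorisation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical corpus: label 1 if the text is a translation, 0 if it was
# originally authored in the language under study.
texts = ["in this regard it can be said that the measures were taken",
         "the committee shall adopt its decision without delay",
         "she grabbed her coat and left",
         "rain hammered on the tin roof all night"]
labels = [1, 1, 0, 0]

# Character n-grams approximate lexical and morphosyntactic preferences
# without requiring language-specific tools.
clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
                    LinearSVC())
print(cross_val_score(clf, texts, labels, cv=2))  # accuracy per fold
```

Inspecting the weights of such a linear classifier is one way to recover the "most influential characteristics" the abstract mentions.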
2

Tirnauca, Catalin Ionut. "Syntax-directed translations, tree transformations and bimorphisms." Doctoral thesis, Universitat Rovira i Virgili, 2016. http://hdl.handle.net/10803/381246.

Abstract:
Syntax-based machine translation arose from the practical demands of systems that translate between natural languages. Such systems should, among other things, model tree transformations, re-order parts of sentences, be symmetric, and possess composability or forward and backward application. There are several formal ways to define tree transformations: synchronous grammars, tree transducers and tree bimorphisms. Synchronous grammars can perform all kinds of rotations, but their mathematical properties are harder to prove. Tree transducers are operational and easy to implement, but the main types are not closed under composition. Tree bimorphisms are difficult to implement, but they provide a natural tool for proving composability or symmetry. To improve the translation process, synchronous grammars have been related to tree bimorphisms and tree transducers. Following this lead, we give a comprehensive study of the theory and properties of syntax-directed translation systems seen from three very different perspectives that perfectly complement each other: as generating devices (synchronous grammars), as acceptors (transducer machines) and as algebraic structures (bimorphisms). They are investigated and compared both as tree transformation and as translation defining devices. The focus is on bimorphisms, as they have only recently returned to the spotlight, especially given their applications to natural language processing. Moreover, we propose a complete and up-to-date overview of the classes of tree transformations defined by bimorphisms, linking them with well-known types of synchronous grammars and tree transducers. We prove or recall all the interesting properties such classes possess, thus improving the mathematical knowledge of synchronous grammars and tree transducers. Also, inclusion relations between the main classes of bimorphisms, both as translation devices and as tree transformation mechanisms, are given for the first time through a Hasse diagram. Directions for future work are suggested by exhibiting how to extend previous results to more general classes of bimorphisms and synchronous grammars.
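To make the notion of a tree transformation concrete, here is a minimal, hypothetical sketch of a single syntax-directed rule with trees encoded as nested Python tuples; real synchronous grammars, transducers and bimorphisms are far more general than this one hard-coded rotation:

```python
# A toy syntax-directed translation step: rotate an SVO clause into VSO
# order. Trees are tuples of the form (label, children...).

def svo_to_vso(tree):
    # Rule: S(NP_subj, VP(V, NP_obj)) -> S(V, NP_subj, NP_obj)
    if tree[0] == "S" and len(tree) == 3 and tree[2][0] == "VP":
        subj, (_, verb, obj) = tree[1], tree[2]
        return ("S", verb, subj, obj)
    return tree  # rule does not apply; leave the tree unchanged

source = ("S", ("NP", "mary"), ("VP", ("V", "reads"), ("NP", "books")))
print(svo_to_vso(source))
# ('S', ('V', 'reads'), ('NP', 'mary'), ('NP', 'books'))
```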
3

Al, Batineh Mohammed S. "Latent Semantic Analysis, Corpus stylistics and Machine Learning Stylometry for Translational and Authorial Style Analysis: The Case of Denys Johnson-Davies’ Translations into English." Kent State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=kent1429300641.

4

Tebbifakhr, Amirhossein. "Machine Translation For Machines." Doctoral thesis, Università degli studi di Trento, 2021. http://hdl.handle.net/11572/320504.

Abstract:
Traditionally, Machine Translation (MT) systems are developed by targeting fluency (i.e. output grammaticality) and adequacy (i.e. semantic equivalence with the source text) criteria that reflect the needs of human end-users. However, recent advancements in Natural Language Processing (NLP) and the introduction of NLP tools in commercial services have opened new opportunities for MT. A particularly relevant one is related to the application of NLP technologies in low-resource language settings, for which the paucity of training data reduces the possibility to train reliable services. In this specific condition, MT can come into play by enabling the so-called “translation-based” workarounds. The idea is simple: first, input texts in the low-resource language are translated into a resource-rich target language; then, the machine-translated text is processed by well-trained NLP tools in the target language; finally, the output of these downstream components is projected back to the source language. This results in a new scenario, in which the end-user of MT technology is no longer a human but another machine. We hypothesize that current MT training approaches are not the optimal ones for this setting, in which the objective is to maximize the performance of a downstream tool fed with machine-translated text rather than human comprehension. Under this hypothesis, this thesis introduces a new research paradigm, which we named “MT for machines”, addressing a number of questions that arise from this novel view of the MT problem. Are there different quality criteria for humans and machines? What makes a good translation from the machine standpoint? What are the trade-offs between the two notions of quality? How to pursue machine-oriented objectives? How to serve different downstream components with a single MT system? How to exploit knowledge transfer to operate in different language settings with a single MT system? Elaborating on these questions, this thesis: i) introduces a novel and challenging MT paradigm, ii) proposes an effective method based on Reinforcement Learning, analysing its possible variants, iii) extends the proposed method to multitask and multilingual settings so as to serve different downstream applications and languages with a single MT system, iv) studies the trade-off between machine-oriented and human-oriented criteria, and v) discusses the successful application of the approach in two real-world scenarios.
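To illustrate how a machine-oriented objective can be pursued, here is a minimal sketch of REINFORCE over a toy softmax policy; the fixed candidate list and the keyword-based "downstream classifier" are hypothetical stand-ins for the sequence models and NLP components used in the thesis:

```python
# A toy REINFORCE loop where the reward is a downstream tool's confidence.
import math, random

random.seed(0)
# Hypothetical candidate outputs for one source sentence; a real system
# samples from a sequence model rather than a fixed candidate list.
scores = {"it is great": 0.0, "it is nice": 0.0, "is great it": 0.0}

def probs():
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

def downstream_reward(y):
    # Stand-in for a target-language sentiment classifier's confidence.
    # Note it ignores fluency: the ungrammatical "is great it" is rewarded
    # too, which is the human/machine quality trade-off in a nutshell.
    return 1.0 if "great" in y.split() else 0.2

lr = 0.5
for _ in range(300):
    p = probs()
    y = random.choices(list(p), weights=list(p.values()))[0]  # sample output
    advantage = downstream_reward(y) - 0.5                    # reward - baseline
    for c in scores:                                          # REINFORCE update
        scores[c] += lr * advantage * ((1.0 if c == y else 0.0) - p[c])

print(sorted(scores, key=scores.get, reverse=True))  # "great" outputs rise
```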
5

Tiedemann, Jörg. "Recycling Translations : Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing." Doctoral thesis, Uppsala University, Department of Linguistics, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-3791.

Abstract:

The focus of this thesis is on re-using translations in natural language processing. It involves the collection of documents and their translations in an appropriate format, the automatic extraction of translation data, and the application of the extracted data to different tasks in natural language processing.

Five parallel corpora containing more than 35 million words in 60 languages have been collected within co-operative projects. All corpora are sentence aligned and parts of them have been analyzed automatically and annotated with linguistic markup.

Lexical data are extracted from the corpora by means of word alignment. Two automatic word alignment systems have been developed, the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an iterative "knowledge-poor" word alignment approach using association measures and alignment heuristics. The Clue Aligner provides an innovative framework for the combination of statistical and linguistic resources in aligning single words and multi-word units. Both aligners have been applied to several corpora. Detailed evaluations of the alignment results have been carried out for three of them using fine-grained evaluation techniques.

A corpus processing toolbox, Uplug, has been developed. It includes the implementation of UWA and is freely available for research purposes. A new version, Uplug II, includes the Clue Aligner. It can be used via an experimental web interface (UplugWeb).

Lexical data extracted by the word aligners have been applied to different tasks in computational lexicography and machine translation. The use of word alignment in monolingual lexicography has been investigated in two studies. In a third study, the feasibility of using the extracted data in interactive machine translation has been demonstrated. Finally, extracted lexical data have been used for enhancing the lexical components of two machine translation systems.
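As a concrete illustration of the association-measure approach described above, here is a minimal sketch of word alignment using the Dice coefficient over sentence-pair co-occurrences; it is only a toy in the spirit of "knowledge-poor" aligners such as UWA, whose actual iterative procedure and heuristics are more elaborate:

```python
# A toy association-based word aligner using the Dice coefficient.
from collections import Counter
from itertools import product

bitext = [
    ("the house is red".split(), "das haus ist rot".split()),
    ("the house".split(), "das haus".split()),
    ("red wine".split(), "rotwein".split()),
]

src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
for src, tgt in bitext:
    src_freq.update(set(src))
    tgt_freq.update(set(tgt))
    pair_freq.update(product(set(src), set(tgt)))

def dice(s, t):
    # Co-occurrence strength of s and t across sentence pairs.
    return 2 * pair_freq[s, t] / (src_freq[s] + tgt_freq[t])

# Greedy one-to-one linking of the highest-scoring pairs in each sentence.
for src, tgt in bitext:
    scored = sorted(((dice(s, t), s, t) for s, t in product(src, tgt)),
                    reverse=True)
    linked_s, linked_t, links = set(), set(), []
    for score, s, t in scored:
        if s not in linked_s and t not in linked_t and score > 0.5:
            links.append((s, t, round(score, 2)))
            linked_s.add(s)
            linked_t.add(t)
    print(links)
```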

6

Joelsson, Jakob. "Translationese and Swedish-English Statistical Machine Translation." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-305199.

Abstract:
This thesis investigates how well machine-learned classifiers can identify translated text, and the effect translationese may have on Statistical Machine Translation -- all in a Swedish-to-English, and reverse, context. Translationese is a term used to describe the dialect of a target language that is produced when a source text is translated. The systems trained for this thesis are SVM-based classifiers for identifying translationese, as well as translation and language models for Statistical Machine Translation. The classifiers successfully identified translationese in relation to non-translated text, and to some extent, also what source language the texts were translated from. In the SMT experiments, variation of the translation model was what affected the results the most in the BLEU evaluation. Systems configured with non-translated source text and translationese target text performed better than their reversed counterparts. The language model experiments showed that models trained on known translationese and classified translationese performed better than those trained on known non-translated text, though classified translationese did not perform as well as the known translationese. Ultimately, the thesis shows that translationese can be identified by machine-learned classifiers and may affect the results of SMT systems.
7

Karlbom, Hannes. "Hybrid Machine Translation : Choosing the best translation with Support Vector Machines." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-304257.

Abstract:
In the field of machine translation there are various systems available, which have different strengths and weaknesses. This thesis investigates the combination of two systems, a rule-based one and a statistical one, to see if such a hybrid system can provide higher-quality translations. A classification approach was taken, where a support vector machine is used to choose which sentences from each of the two systems result in the best translation. To label the sentences in the collected data, a new method based on simulated annealing was applied and compared to previously tried heuristics. The results show that the hybrid system achieves an average BLEU score increase of 6.10%, or 1.86 points, over the single best system, and that using the labels created through simulated annealing, rather than heuristic rules, gives a significant improvement in classifier performance.
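The labelling-by-search idea can be sketched compactly. In the toy below, simulated annealing picks, per sentence, which of two systems' outputs to keep so as to maximise a corpus-level objective; the random per-sentence scores stand in for corpus BLEU, which is not sentence-decomposable and therefore motivates the search in the first place:

```python
# A toy simulated-annealing labeller for hybrid system selection.
import math, random

random.seed(0)
# Hypothetical sentence-level quality proxies for two systems' outputs.
sys_scores = [(random.random(), random.random()) for _ in range(50)]

def corpus_score(labels):
    # Stand-in for corpus BLEU over the selected outputs.
    return sum(sys_scores[i][l] for i, l in enumerate(labels))

labels = [0] * len(sys_scores)
current = corpus_score(labels)
temp = 1.0
while temp > 1e-3:
    i = random.randrange(len(labels))
    labels[i] ^= 1                       # flip one sentence's label
    cand = corpus_score(labels)
    if cand >= current or random.random() < math.exp((cand - current) / temp):
        current = cand                   # accept improvement or uphill move
    else:
        labels[i] ^= 1                   # revert
    temp *= 0.995                        # cool down

print(round(current, 3), labels[:10])
```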
8

Ahmadniaye, Bosari Benyamin. "Reliable training scenarios for dealing with minimal parallel-resource language pairs in statistical machine translation." Doctoral thesis, Universitat Autònoma de Barcelona, 2017. http://hdl.handle.net/10803/461204.

Abstract:
This thesis concerns high-quality Statistical Machine Translation (SMT) systems for language pairs with minimal parallel resources, and is entitled “Reliable Training Scenarios for Dealing with Minimal Parallel-Resource Language Pairs in Statistical Machine Translation”. The main challenge we target in our approaches is parallel data scarcity, and this challenge is faced in different solution scenarios. SMT is one of the preferred approaches to Machine Translation (MT), and various improvements can be observed in this approach, specifically in the output quality of a number of systems for language pairs, since advances in computational power have been made, together with the exploration of new methods and algorithms. When we ponder the development of SMT systems for many language pairs, the major bottleneck we find is the lack of parallel training data. Because a lot of time and effort is required to create these corpora, they are available in limited quantity, genre, and language. SMT models learn how to translate by examining a bilingual parallel corpus that contains sentences aligned with their human-produced translations. However, the output quality of SMT systems is heavily dependent on the availability of massive amounts of parallel text in the source and target languages. Hence, parallel resources play an important role in improving the quality of SMT systems. We define minimal parallel-resource SMT settings as possessing only small amounts of parallel data, a situation that can be seen for various pairs of languages. The performance achieved by current state-of-the-art minimal parallel-resource SMT is highly appreciable, but such approaches usually use monolingual text and do not fundamentally address the shortage of parallel training text. Enlarging the parallel training data without providing any sort of guarantee on the quality of the newly generated bilingual sentence pairs also raises concerns. The limitations that emerge during the training of minimal parallel-resource SMT show that current systems are incapable of producing high-quality translation output. In this thesis, we have proposed the “direct-bridge combination” scenario as well as the “round-trip training” scenario for dealing with minimal parallel-resource SMT systems; the former is based on the bridge-language technique, while the latter is based on a retraining approach. Our main aim in putting forward the direct-bridge combination scenario is to bring it closer to state-of-the-art performance. This scenario has been proposed to maximize the information gain by choosing the appropriate portions of the bridge-based translation system that do not interfere with the direct translation system, which is trusted more. Furthermore, the round-trip training scenario has been proposed to take advantage of readily available generated bilingual sentence pairs to build a high-quality SMT system iteratively: by selecting a high-quality subset of the generated sentence pairs on the target side, preparing their corresponding source sentences, and using them together with the original sentence pairs to retrain the SMT system. The proposed methods are intrinsically evaluated, and they are compared against baseline translation systems. We have also conducted experiments in the aforementioned scenarios with minimal initial bilingual data. We have demonstrated the improvement in performance achieved through the proposed methods while building high-quality SMT systems over the baseline for each scenario.
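A minimal, runnable caricature of the round-trip idea follows: translate monolingual source text, keep only pairs whose round-trip reconstruction looks good, and retrain. The word-by-word "SMT system" and the overlap-based quality filter are toy stand-ins; note how the naive filter happily accepts copied-through unknown words, which is exactly why reliable pair selection matters:

```python
# A toy round-trip training loop; every component is a stand-in.

def train(pairs):
    # "Training" here is naive positional word mapping from aligned pairs.
    fwd, bwd = {}, {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            fwd.setdefault(s, t)
            bwd.setdefault(t, s)
    return fwd, bwd

def translate(table, sent):
    return " ".join(table.get(w, w) for w in sent.split())

seed_pairs = [("das haus", "the house"), ("rot wein", "red wine")]
monolingual = ["das haus ist rot", "wein ist rot", "das rot haus"]

pairs = list(seed_pairs)
for _ in range(3):                              # iterative retraining
    fwd, bwd = train(pairs)
    for src in monolingual:
        tgt = translate(fwd, src)
        back = translate(bwd, tgt)              # round trip: target -> source
        overlap = (len(set(back.split()) & set(src.split()))
                   / len(set(src.split())))
        # Keep only "high-quality" synthetic pairs. Flaw on purpose: the
        # untranslated "ist" survives the round trip, so a real system
        # needs much stronger quality estimation than this.
        if overlap >= 0.75 and (src, tgt) not in pairs:
            pairs.append((src, tgt))

print(pairs)
```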
9

Davis, Paul C. "Stone Soup Translation: The Linked Automata Model." Connect to this title online, 2002. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1023806593.

Abstract:
Thesis (Ph. D.)--Ohio State University, 2002.
Title from first page of PDF file. Document formatted into pages; contains xvi, 306 p.; includes graphics. Includes abstract and vita. Advisor: Chris Brew, Dept. of Linguistics. Includes indexes. Includes bibliographical references (p. 284-293).
10

Martínez, Garcia Eva. "Document-level machine translation : ensuring translational consistency of non-local phenomena." Doctoral thesis, Universitat Politècnica de Catalunya, 2019. http://hdl.handle.net/10803/668473.

Abstract:
In this thesis, we study the automatic translation of documents by taking into account cross-sentence phenomena. This document-level information is typically ignored by most of the standard state-of-the-art Machine Translation (MT) systems, which focus on translating texts processing each of their sentences in isolation. Translating each sentence without looking at its surrounding context can lead to certain types of translation errors, such as inconsistent translations for the same word or for elements in a coreference chain. We introduce methods to attend to document-level phenomena in order to avoid those errors, and thus, reach translations that properly convey the original meaning. Our research starts by identifying the translation errors related to such document-level phenomena that commonly appear in the output of state-of-the-art Statistical Machine Translation (SMT) systems. For two of those errors, namely inconsistent word translations as well as gender and number disagreements among words, we design simple and yet effective post-processing techniques to tackle and correct them. Since these techniques are applied a posteriori, they can access the whole source and target documents, and hence, they are able to perform a global analysis and improve the coherence and consistency of the translation. Nevertheless, since following such a two-pass decoding strategy is not optimal in terms of efficiency, we also focus on introducing the context-awareness during the decoding process itself. To this end, we enhance a document-oriented SMT system with distributional semantic information in the form of bilingual and monolingual word embeddings. In particular, these embeddings are used as Semantic Space Language Models (SSLMs) and as a novel feature function. The goal of the former is to promote word translations that are semantically close to their preceding context, whereas the latter promotes the lexical choice that is closest to its surrounding context, for those words that have varying translations throughout the document. In both cases, the context extends beyond sentence boundaries. Recently, the MT community has transitioned to the neural paradigm. The final step of our research proposes an extension of the decoding process of a Neural Machine Translation (NMT) framework, independent of the model architecture, by shallow-fusing the information from a neural translation model and the context semantics enclosed in the previously studied SSLMs. The aim of this modification is to introduce the benefits of context information also into the decoding process of NMT systems, as well as to obtain an additional validation for the techniques we explored. The automatic evaluation of our approaches does not reflect significant variations. This is expected, since most automatic metrics are neither context- nor semantic-aware and because the phenomena we tackle are rare, leading to few modifications with respect to the baseline translations. On the other hand, manual evaluations demonstrate the positive impact of our approaches, since human evaluators tend to prefer the translations produced by our document-aware systems. Therefore, the changes introduced by our enhanced systems are important, since they are related to how humans perceive translation quality for long texts.
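The SSLM intuition can be sketched in a few lines: among candidate translations of an ambiguous word, prefer the one whose embedding lies closest to the averaged embedding of the preceding context. The two-dimensional hand-made vectors below are hypothetical stand-ins for trained bilingual or monolingual embeddings:

```python
# A toy Semantic Space Language Model for context-aware lexical choice.
import math

emb = {
    "bank":    [0.9, 0.1],   # financial sense
    "shore":   [0.1, 0.9],   # river sense
    "money":   [0.8, 0.2],
    "deposit": [0.7, 0.3],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def sslm_choice(context_words, candidates):
    # Average the context embeddings, then pick the closest candidate.
    ctx = [sum(emb[w][d] for w in context_words) / len(context_words)
           for d in range(2)]
    return max(candidates, key=lambda c: cos(emb[c], ctx))

print(sslm_choice(["money", "deposit"], ["bank", "shore"]))  # -> "bank"
```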
11

Кириченко, Олена Анатоліївна, Елена Анатольевна Кириченко, Olena Anatoliivna Kyrychenko, and Y. V. Kalashnyk. "Machine translation." Thesis, Видавництво СумДУ, 2011. http://essuir.sumdu.edu.ua/handle/123456789/12977.

12

Quernheim, Daniel. "Bimorphism Machine Translation." Doctoral thesis, Universitätsbibliothek Leipzig, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-223667.

Abstract:
The field of statistical machine translation has made tremendous progress due to the rise of statistical methods, making it possible to obtain a translation system automatically from a bilingual collection of text. Some approaches do not even need any kind of linguistic annotation, and can infer translation rules from raw, unannotated data. However, most state-of-the-art systems do linguistic structure little justice, and moreover many approaches that have been put forward use ad-hoc formalisms and algorithms. This inevitably leads to duplication of effort, and a separation between theoretical researchers and practitioners. In order to remedy the lack of motivation and rigor, the contributions of this dissertation are threefold: 1. After laying out the historical background and context, as well as the mathematical and linguistic foundations, a rigorous algebraic model of machine translation is put forward. We use regular tree grammars and bimorphisms as the backbone, introducing a modular architecture that allows different input and output formalisms. 2. The challenges of implementing this bimorphism-based model in a machine translation toolkit are then described, explaining in detail the algorithms used for the core components. 3. Finally, experiments where the toolkit is applied on real-world data and used for diagnostic purposes are described. We discuss how we use exact decoding to reason about search errors and model errors in a popular machine translation toolkit, and we compare output formalisms of different generative capacity.
13

Caglayan, Ozan. "Multimodal Machine Translation." Thesis, Le Mans, 2019. http://www.theses.fr/2019LEMA1016/document.

Abstract:
Machine translation aims at automatically translating documents from one language to another without human intervention. With the advent of deep neural networks (DNN), neural approaches to machine translation started to dominate the field, reaching state-of-the-art performance in many languages. Neural machine translation (NMT) also revived the interest in interlingual machine translation due to how it naturally fits the task into an encoder-decoder framework which produces a translation by decoding a latent source representation. Combined with the architectural flexibility of DNNs, this framework paved the way for further research in multimodality with the objective of augmenting the latent representations with other modalities such as vision or speech. This thesis focuses on a multimodal machine translation (MMT) framework that integrates a secondary visual modality to achieve better and visually grounded language understanding. I specifically worked with a dataset containing images and their translated descriptions, where visual context can be useful for word sense disambiguation, missing word imputation, or gender marking when translating from a language with gender-neutral nouns to one with a grammatical gender system, as is the case with English to French. I propose two main approaches to integrate the visual modality: (i) a multimodal attention mechanism that learns to take into account both sentence and convolutional visual representations, (ii) a method that uses global visual feature vectors to prime the sentence encoders and the decoders. Through automatic and human evaluation conducted on multiple language pairs, the proposed approaches were demonstrated to be beneficial. Finally, I further show that by systematically removing certain linguistic information from the input sentences, the true strength of both methods emerges as they successfully impute missing nouns and colours, and can even translate when parts of the source sentences are completely removed.
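A minimal sketch of the first approach, multimodal attention: a decoder state attends separately over source-text states and convolutional image-region vectors, and the two context vectors are fused. All vectors below are tiny hand-made stand-ins for learned representations, and the equal-weight fusion is an assumption made for illustration:

```python
# A toy multimodal attention step with hand-made 2-d vectors.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def attend(query, keys):
    # Dot-product attention: weight each key by similarity to the query,
    # then return the weighted average of the keys.
    weights = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    return [sum(w * key[d] for w, key in zip(weights, keys))
            for d in range(len(keys[0]))]

decoder_state = [0.5, 1.0]
text_states   = [[1.0, 0.0], [0.2, 0.8]]   # one vector per source word
image_regions = [[0.9, 0.1], [0.1, 0.9]]   # one vector per image region

text_ctx  = attend(decoder_state, text_states)
image_ctx = attend(decoder_state, image_regions)
fused = [0.5 * t + 0.5 * v for t, v in zip(text_ctx, image_ctx)]  # naive fusion
print(fused)
```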
14

Wang, Long Qi. "Translation accuracy comparison between machine translation and context-free machine natural language grammar–based translation." Thesis, University of Macau, 2018. http://umaclib3.umac.mo/record=b3950657.

15

Sim, Smith Karin M. "Coherence in machine translation." Thesis, University of Sheffield, 2018. http://etheses.whiterose.ac.uk/20083/.

Abstract:
Coherence ensures individual sentences work together to form a meaningful document. When properly translated, a coherent document in one language should result in a coherent document in another language. In Machine Translation, however, due to reasons of modeling and computational complexity, sentences are pieced together from words or phrases based on short context windows and with no access to extra-sentential context. In this thesis I propose ways to automatically assess the coherence of machine translation output. The work is structured around three dimensions: entity-based coherence, coherence as evidenced via syntactic patterns, and coherence as evidenced via discourse relations. For the first time, I evaluate existing monolingual coherence models on this new task, identifying issues and challenges that are specific to the machine translation setting. In order to address these issues, I adapted a state-of-the-art syntax model, which also resulted in improved performance for the monolingual task. The results clearly indicate how much more difficult the new task is than the task of detecting shuffled texts. I proposed a new coherence model, exploring the crosslingual transfer of discourse relations in machine translation. This model is novel in that it measures the correctness of the discourse relation by comparison to the source text rather than to a reference translation. I identified patterns of incoherence common across different language pairs, and created a corpus of machine translated output annotated with coherence errors for evaluation purposes. I then examined lexical coherence in a multilingual context, as a preliminary study for crosslingual transfer. Finally, I determine how the new and adapted models correlate with human judgements of translation quality and suggest that improvements in general evaluation within machine translation would benefit from having a coherence component that evaluated the translation output with respect to the source text.
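As an illustration of the entity-based dimension mentioned above, here is a minimal sketch of an entity grid of the kind underlying Barzilay-and-Lapata-style coherence models: entities are tracked across sentences by grammatical role, and the resulting transition profile serves as coherence features:

```python
# A toy entity grid and its role-transition profile.
from collections import Counter

# Toy document: per sentence, the grammatical role of each mentioned
# entity (S=subject, O=object, -=absent).
sentences = [
    {"microsoft": "S", "deal": "O"},
    {"microsoft": "S"},
    {"deal": "S"},
]
entities = {e for sent in sentences for e in sent}

# Entity grid: one row per entity, one column per sentence.
grid = {e: [sent.get(e, "-") for sent in sentences] for e in entities}

# Count role transitions between adjacent sentences for each entity.
transitions = Counter()
for e in entities:
    for a, b in zip(grid[e], grid[e][1:]):
        transitions[a, b] += 1

total = sum(transitions.values())
for (a, b), c in sorted(transitions.items()):
    print(f"{a}->{b}: {c/total:.2f}")   # transition profile used as features
```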
16

Sato, Satoshi. "Example-Based Machine Translation." Kyoto University, 1992. http://hdl.handle.net/2433/154652.

Abstract:
The full-text data is a PDF conversion of image files produced through the National Diet Library's FY2010 digitisation of doctoral dissertations.
Kyoto University (京都大学). Doctor of Engineering, thesis doctorate under the new system; degree no. 乙第7735号 (論工博第2539号). Examining committee: Prof. Makoto Nagao (chair), Prof. Shuji Doshita, and Prof. Katsuo Ikeda. Conferred under Article 4, Paragraph 2 of the Degree Regulations.
17

García, Martínez Mercedes. "Factored neural machine translation." Thesis, Le Mans, 2018. http://www.theses.fr/2018LEMA1002/document.

Abstract:
Communication between humans across the world is difficult due to the diversity of languages. Machine translation is a quick and cheap way to make translation accessible to everyone. Recently, Neural Machine Translation (NMT) has achieved impressive results. This thesis focuses on the Factored Neural Machine Translation (FNMT) approach, which is founded on the idea of using the morphological and grammatical decomposition of words (lemmas and linguistic factors) in the target language. This architecture addresses two well-known challenges occurring in NMT. Firstly, the limitation on the target vocabulary size, which is a consequence of the computationally expensive softmax function at the output layer of the network, leading to a high rate of unknown words. Secondly, data sparsity, which arises when we face a specific domain or a morphologically rich language. With FNMT, all the inflections of the words are supported and a larger vocabulary is modelled with similar computational cost. Moreover, new words not included in the training dataset can be generated. In this work, I developed different FNMT architectures using various dependencies between lemmas and factors. In addition, I enhanced the source-language side with factors as well. The FNMT model is evaluated on various languages, including morphologically rich ones. State-of-the-art models, some using Byte Pair Encoding (BPE), are compared to the FNMT model using small and big training datasets. We found that factored models are more robust in low-resource conditions. FNMT has been combined with BPE units, performing better than the pure FNMT model when trained with big data. We experimented with different domains, obtaining improvements with the FNMT models. Furthermore, the morphology of the translations is measured using a special test suite, showing the importance of explicitly modelling the target morphology. Our work shows the benefits of applying linguistic factors in NMT.
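The output side of a factored system can be sketched as follows: the network predicts a lemma plus linguistic factors, and a morphological generator recombines them into a surface form. The tiny paradigm table and the decoder output below are hypothetical stand-ins for a real morphological resource and a trained model:

```python
# A toy lemma-plus-factors generation step for factored NMT output.

paradigms = {
    ("aller", "V", "ind.pres", "3sg"): "va",
    ("aller", "V", "ind.pres", "3pl"): "vont",
    ("maison", "N", "fem", "pl"): "maisons",
}

def generate(lemma, *factors):
    # Fall back to the lemma for unseen combinations -- one way factored
    # models can emit words never observed as full surface forms.
    return paradigms.get((lemma, *factors), lemma)

# Hypothetical decoder output: one (lemma, factors...) tuple per word.
decoded = [("aller", "V", "ind.pres", "3pl"), ("maison", "N", "fem", "pl")]
print(" ".join(generate(*t) for t in decoded))   # -> "vont maisons"
```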
18

Fernández, Parra Maria Asunción. "Formulaic expressions in computer-assisted translation : a specialised translation approach." Thesis, Swansea University, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.579586.

19

Chen, Yuan Yuan. "A critical review of current E-to-C machine translation of academic abstracts." Thesis, University of Macau, 2012. http://umaclib3.umac.mo/record=b2586616.

20

Liu, Yan. "Translation hypotheses re-ranking for statistical machine translation." Thesis, University of Macau, 2017. http://umaclib3.umac.mo/record=b3691283.

21

Di, Gangi Mattia Antonino. "Neural Speech Translation: From Neural Machine Translation to Direct Speech Translation." Doctoral thesis, Università degli studi di Trento, 2020. http://hdl.handle.net/11572/259137.

Abstract:
Sequence-to-sequence learning led to significant improvements in machine translation (MT) and automatic speech recognition (ASR) systems. These advancements were first reflected in spoken language translation (SLT) by using a cascade of (at least) ASR and MT with the new "neural" models, then by using sequence-to-sequence learning to directly translate the input audio speech into text in the target language. In this thesis we cover both approaches to the SLT task. First, we show the limits of NMT in terms of robustness to input errors when compared to the previous phrase-based state of the art. We then focus on the NMT component to achieve better translation quality with higher computational efficiency by using a network based on weakly-recurrent units. Our last work involving a cascade explores the effects on NMT robustness of adding automatic transcripts to the training data. In order to move to the direct speech-to-text approach, we introduce MuST-C, the largest multilingual SLT corpus for training direct translation systems. MuST-C significantly increases the size of publicly available data for this task as well as their language coverage. With such availability of data, we adapted the Transformer architecture to the SLT task for its computational efficiency. Our adaptation, which we call S-Transformer, is meant to better model the audio input, and with it we set a new state of the art for MuST-C. Building on these positive results, we finally use S-Transformer with different data applications: i) one-to-many multilingual translation by training it on MuST-C; ii) participation in the IWSLT 2019 shared task with data augmentation; and iii) instance-based adaptation for using the training data at test time. The results in this thesis show a steady quality improvement in direct SLT. Our hope is that the presented resources and technological solutions will increase its adoption in the near future, so as to make multilingual information access easier in a globalized world.
22

Law, Mei In. "Assessing online translation systems using the BLEU score : Google Language Tools & SYSTRANBox." Thesis, University of Macau, 2011. http://umaclib3.umac.mo/record=b2525828.

23

Watanabe, Taro. "Example-Based Statistical Machine Translation." 京都大学 (Kyoto University), 2004. http://hdl.handle.net/2433/147584.

24

Naruedomkul, Kanlaya. "Generate and repair machine translation." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0016/NQ54676.pdf.

25

Levenberg, Abby D. "Stream-based statistical machine translation." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5760.

Abstract:
We investigate a new approach to SMT system training within the streaming model of computation. We develop and test incrementally retrainable models which, given an incoming stream of new data, can efficiently incorporate the stream data online. A naive approach using a stream would use an unbounded amount of space. Instead, our online SMT system can incorporate information from unbounded incoming streams while maintaining constant space and time. Crucially, we are able to match (or even exceed) the translation performance of comparable systems which are batch retrained and use unbounded space. Our approach is particularly suited to situations where there are arbitrarily large amounts of new training material and we wish to incorporate it efficiently and in small space. The novel contributions of this thesis are: 1. An online, randomised language model that can model unbounded input streams in constant space and time. 2. An incrementally retrainable translation model for both phrase-based and grammar-based systems. The model presented is efficient enough to incorporate novel parallel text at the single-sentence level. 3. Strategies for updating our stream-based language model and translation model which demonstrate how such components can be successfully used in a streaming translation setting. This operates both within a single streaming environment and also in the novel situation of having to translate multiple streams. 4. Demonstration that recent data from the stream is beneficial to translation performance. Our stream-based SMT system is efficient for tackling massive volumes of new training data and offers up new ways of thinking about translating web data and dealing with other natural language streams.
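As one way to make the constant-space idea concrete, here is a minimal sketch of approximate n-gram counting with a count-min sketch; this is an illustrative stand-in rather than the thesis's actual randomised language model, but it shows how bounded memory can summarise an unbounded stream:

```python
# A toy constant-space, randomised bigram counter for a text stream.
import hashlib

W, D = 1024, 4                      # fixed width/depth -> constant space
table = [[0] * W for _ in range(D)]

def h(row, key):
    return int(hashlib.md5(f"{row}:{key}".encode()).hexdigest(), 16) % W

def add(key):
    for d in range(D):
        table[d][h(d, key)] += 1

def count(key):
    # The minimum over rows can only overestimate the true count.
    return min(table[d][h(d, key)] for d in range(D))

stream = "the cat sat on the mat the cat ran".split()
for bigram in zip(stream, stream[1:]):
    add(bigram)

print(count(("the", "cat")))        # approximate count from bounded memory
```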
26

Hardmeier, Christian. "Discourse in Statistical Machine Translation." Doctoral thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-223798.

Full text
Abstract:
This thesis addresses the technical and linguistic aspects of discourse-level processing in phrase-based statistical machine translation (SMT). Connected texts can have complex text-level linguistic dependencies across sentences that must be preserved in translation. However, the models and algorithms of SMT are pervaded by locality assumptions. In a standard SMT setup, no model has more complex dependencies than an n-gram model. The popular stack decoding algorithm exploits this fact to implement efficient search with a dynamic programming technique. This is a serious technical obstacle to discourse-level modelling in SMT. From a technical viewpoint, the main contribution of our work is the development of a document-level decoder based on stochastic local search that translates a complete document as a single unit. The decoder starts with an initial translation of the document, created randomly or by running a stack decoder, and refines it with a sequence of elementary operations. After each step, the current translation is scored by a set of feature models with access to the full document context and its translation. We demonstrate the viability of this decoding approach for different document-level models. From a linguistic viewpoint, we focus on the problem of translating pronominal anaphora. After investigating the properties and challenges of the pronoun translation task both theoretically and by studying corpus data, a neural network model for cross-lingual pronoun prediction is presented. This network jointly performs anaphora resolution and pronoun prediction and is trained on bilingual corpus data only, with no need for manual coreference annotations. The network is then integrated as a feature model in the document-level SMT decoder and tested in an English–French SMT system. We show that the pronoun prediction network model more adequately represents discourse-level dependencies for less frequent pronouns than a simpler maximum entropy baseline with separate coreference resolution. By creating a framework for experimenting with discourse-level features in SMT, this work contributes to a long-term perspective that strives for more thorough modelling of complex linguistic phenomena in translation. Our results on pronoun translation shed new light on a challenging, but essential problem in machine translation that is as yet unsolved.
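A minimal sketch of the decoding strategy described above: stochastic local search over a whole document, accepting elementary operations that improve a document-level score. The scoring function here is a toy stand-in for the feature models:

    import random

    random.seed(0)

    def document_score(translation: list[str]) -> float:
        # Toy document-level feature: reward lexical consistency, i.e. reusing
        # the same target word across sentences (a real decoder combines many
        # feature models with access to the full document context).
        words = [w for sent in translation for w in sent.split()]
        return len(words) - len(set(words))

    def local_search(initial: list[str], alternatives: dict[str, list[str]], steps=100):
        current, best = list(initial), document_score(initial)
        for _ in range(steps):
            i = random.randrange(len(current))        # pick a sentence
            w = random.choice(current[i].split())     # pick a word in it
            for alt in alternatives.get(w, []):       # elementary operation:
                proposal = list(current)              # substitute one word
                proposal[i] = current[i].replace(w, alt, 1)
                if (s := document_score(proposal)) > best:
                    current, best = proposal, s
        return current

    doc = ["the lawyer spoke", "the attorney left"]
    alts = {"attorney": ["lawyer"], "lawyer": ["attorney"]}
    print(local_search(doc, alts))  # converges to a consistent word choice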
APA, Harvard, Vancouver, ISO, and other styles
28

Pirrelli, Vito. "Morphology, analogy and machine translation." Thesis, University of Salford, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.238781.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Yahyaei, Mohammad Sirvan. "Reordering in statistical machine translation." Thesis, Queen Mary, University of London, 2012. http://qmro.qmul.ac.uk/xmlui/handle/123456789/2517.

Full text
Abstract:
Machine translation is a challenging task whose difficulties arise from several characteristics of natural language. The main focus of this work is on reordering, one of the major problems in MT and in statistical MT, the method investigated in this research. The reordering problem in SMT originates from the fact that not all the words in a sentence can be translated consecutively. This means words must be skipped and translated out of their order in the source sentence to produce a fluent and grammatically correct sentence in the target language. The main reason reordering is needed is the fundamental word order differences between languages. Reordering therefore becomes a more dominant issue the more structurally different the source and target languages are. The aim of this thesis is to study the reordering phenomenon by proposing new methods of dealing with reordering in SMT decoders and by evaluating the effectiveness of these methods and the importance of reordering in the context of natural language processing tasks. In other words, we propose novel ways of performing the decoding to improve the reordering capabilities of the SMT decoder, and in addition we explore the effect of improved reordering on the quality of specific NLP tasks, namely named entity recognition and cross-lingual text association. Meanwhile, we go beyond reordering in text association and present a method to perform cross-lingual text fragment alignment based on models of divergence from randomness. The main contribution of this thesis is a novel method named dynamic distortion, which is designed to improve the ability of the phrase-based decoder to perform reordering by adjusting the distortion parameter based on the translation context. The model employs a discriminative reordering model, which combines several features, including lexical and syntactic ones, to predict the necessary distortion limit for each sentence and each hypothesis expansion. The discriminative reordering model is also integrated into the decoder as an extra feature. The method achieves substantial improvements over the baseline without increasing decoding time, by avoiding reordering in unnecessary positions. Another novel method is also presented to extend the phrase-based decoder to dynamically chunk, reorder, and apply phrase translations in tandem. Words inside the chunks are moved together to enable the decoder to make long-distance reorderings that capture the word order differences between languages with different sentence structures. Another aspect of this work is the task-based evaluation of the reordering methods and other translation algorithms used in phrase-based SMT systems. With more successful SMT systems, performing multi-lingual and cross-lingual tasks through translation becomes more feasible. We have devised a method to evaluate the performance of state-of-the-art named entity recognisers on text translated by an SMT decoder. Specifically, we investigated the effect of word reordering and of incorporating reordering models on the quality of named entity extraction. In addition to empirically investigating the effect of translation in the context of cross-lingual document association, we describe a text fragment alignment algorithm to find sections of two documents in different languages that are content-wise related. The algorithm uses similarity measures based on divergence from randomness and word-based translation models to perform text fragment alignment on a collection of documents in two different languages. All the methods proposed in this thesis are extensively empirically examined. We have tested all the algorithms on common translation collections used in different evaluation campaigns. Well-known automatic evaluation metrics are used to compare the suggested methods to a state-of-the-art baseline, and results are analysed and discussed.
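A minimal sketch of the dynamic distortion idea: predict a per-sentence distortion limit from cheap features instead of using one fixed limit. The features and weights below are illustrative assumptions, not values from the thesis:

    def sentence_features(source: list[str]) -> dict[str, float]:
        return {
            "length": len(source),
            "verbs": sum(1 for w in source if w.endswith("en")),  # toy POS proxy
        }

    def predict_distortion_limit(source: list[str]) -> int:
        f = sentence_features(source)
        # Hypothetical linear weights; a real model is trained discriminatively.
        raw = 2.0 + 0.2 * f["length"] + 1.5 * f["verbs"]
        return max(1, round(raw))

    def reordering_allowed(i: int, j: int, source: list[str]) -> bool:
        # The decoder may jump from source position i to j only within the limit.
        return abs(j - (i + 1)) <= predict_distortion_limit(source)

    src = "heute hat er das buch lesen wollen".split()
    print(predict_distortion_limit(src), reordering_allowed(1, 6, src))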
APA, Harvard, Vancouver, ISO, and other styles
30

Dudnyk, Tamara. "Machine translation advantages and disadvantages." Thesis, Київський національний університет технологій та дизайну, 2020. https://er.knutd.edu.ua/handle/123456789/15236.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Beaven, John L. "Lexicalist unification-based machine translation." Thesis, University of Edinburgh, 1992. http://hdl.handle.net/1842/19993.

Full text
Abstract:
A novel approach to Machine Translation (MT), called Shake-and-Bake, is presented, which exploits recent advances in Computational Linguistics arising from lexicalist unification-based grammar theories. It is argued that it overcomes many deficiencies of current methods, such as those based on transfer rules, interlingual representations, and isomorphic grammars. The key advantages are the greater modularity of the monolingual components, which can be written with great independence of each other, using purely monolingual considerations. They can be used for parsing and generation, and may be used in multi-lingual translation systems. The two monolingual components involved in translation are put into correspondence by means of a bilingual lexicon which contains information similar to what one might expect to find in an ordinary bilingual dictionary. The approach is demonstrated by presenting very different Unification Categorial Grammars for small fragments of English and Spanish. Although their coverage is small, they have been chosen to contain linguistically interesting phenomena known to be difficult in MT, such as word order variation and clitic placement. These monolingual grammars are put into correspondence by means of a bilingual lexicon. The Shake-and-Bake approach to MT consists of parsing the Source Language in any usual way, then looking up the words in the bilingual lexicon, and finally generating from the set of translations of these words, allowing the Target Language grammar to instantiate the relative word ordering and taking advantage of the fact that the parse produces lexical and phrasal signs which are highly constrained (specifically in the semantics). The main algorithm presented for generation is a variation on the well-known CKY algorithm used for parsing.
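A minimal sketch of Shake-and-Bake generation under toy assumptions (the lexicon and the bigram "grammar" are invented for illustration; the thesis uses unification grammars and a CKY-style generator rather than brute-force permutation):

    from itertools import permutations

    bilingual_lexicon = {"the": "el", "white": "blanco", "horse": "caballo"}

    # Toy target grammar: allowed adjacent pairs (real systems use unification
    # grammars whose signs constrain order via syntax and semantics).
    allowed_bigrams = {("el", "caballo"), ("caballo", "blanco")}

    def grammatical(sentence: tuple[str, ...]) -> bool:
        return all(pair in allowed_bigrams for pair in zip(sentence, sentence[1:]))

    def shake_and_bake(source: list[str]) -> list[tuple[str, ...]]:
        bag = [bilingual_lexicon[w] for w in source]          # "shake": look up words
        return [p for p in permutations(bag) if grammatical(p)]  # "bake": reorder

    print(shake_and_bake(["the", "white", "horse"]))
    # -> [('el', 'caballo', 'blanco')]  (adjective order fixed by target grammar)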
APA, Harvard, Vancouver, ISO, and other styles
32

Sabtan, Yasser Muhammad Naguib mahmoud. "Lexical selection for machine translation." Thesis, University of Manchester, 2011. https://www.research.manchester.ac.uk/portal/en/theses/lexical-selection-for-machine-translation(28ea687c-5eaf-4412-992a-16fc88b977c8).html.

Full text
Abstract:
Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the problem of knowledge acquisition. Statistical analysis of corpora can reveal trends and probabilities of occurrence, which have proved to be helpful in various ways. Machine Translation (MT) is no exception to this trend. Many MT researchers have attempted to extract knowledge from parallel bilingual corpora. The MT problem is generally decomposed into two sub-problems: lexical selection and reordering of the selected words. This research addresses the problem of lexical selection of open-class lexical items in the framework of MT. The work reported in this thesis investigates different methodologies to handle this problem, using a corpus-based approach. The current framework can be applied to any language pair, but we focus on Arabic and English. This is because Arabic words are hugely ambiguous and thus pose a challenge for the current task of lexical selection. We use a challenging Arabic-English parallel corpus, containing many long passages with no punctuation marks to denote sentence boundaries. This points to the robustness of the adopted approach. In our attempt to extract lexical equivalents from the parallel corpus we focus on the co-occurrence relations between words. The current framework adopts a lexicon-free approach towards the selection of lexical equivalents. This has the double advantage of investigating the effectiveness of different techniques without being distracted by the properties of the lexicon, while at the same time saving much time and effort, since constructing a lexicon is time-consuming and labour-intensive. Thus, we use as little, if any, hand-coded information as possible. The accuracy score could be improved by adding hand-coded information. The point of the work reported here is to see how well one can do without any such manual intervention. With this goal in mind, we carry out a number of preprocessing steps in our framework. First, we build a lexicon-free Part-of-Speech (POS) tagger for Arabic. This POS tagger uses a combination of rule-based, transformation-based learning (TBL) and probabilistic techniques. Similarly, we use a lexicon-free POS tagger for English. We use the two POS taggers to tag the bi-texts. Second, we develop lexicon-free shallow parsers for Arabic and English. The two parsers are then used to label the parallel corpus with dependency relations (DRs) for some critical constructions. Third, we develop stemmers for Arabic and English, adopting the same knowledge-free approach. These preprocessing steps pave the way for the main system (or proposer), whose task is to extract translational equivalents from the parallel corpus. The framework starts by automatically extracting a bilingual lexicon using unsupervised statistical techniques which exploit the notion of co-occurrence patterns in the parallel corpus. We then choose the target word that has the highest frequency of occurrence from among a number of translational candidates in the extracted lexicon in order to aid the selection of the contextually correct translational equivalent. These experiments are carried out on either raw or POS-tagged texts. Having labelled the bi-texts with DRs, we use them to extract a number of translation seeds to start a number of bootstrapping techniques to improve the proposer. These seeds are used as anchor points to resegment the parallel corpus and start the selection process once again.
The final F-score for the selection process is 0.701. We have also written an algorithm for detecting ambiguous words in a translation lexicon and obtained a precision score of 0.89.
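A minimal sketch of the co-occurrence-based extraction step, assuming toy data and Dice-coefficient scoring (the thesis explores several unsupervised statistical techniques; this shows only the general idea):

    from collections import Counter, defaultdict

    parallel = [
        ("the cat sleeps", "el gato duerme"),
        ("the cat eats", "el gato come"),
        ("the dog eats", "el perro come"),
    ]

    src_count, tgt_count = Counter(), Counter()
    cooc = defaultdict(Counter)
    for src, tgt in parallel:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        for s in s_words:
            for t in t_words:
                cooc[s][t] += 1  # s and t occur in the same aligned sentence pair

    def translate(word: str) -> str:
        # Dice coefficient rewards pairs that occur together and rarely apart,
        # so frequent function words like "el" do not win by raw frequency.
        return max(cooc[word],
                   key=lambda t: 2 * cooc[word][t] / (src_count[word] + tgt_count[t]))

    print(translate("cat"), translate("dog"))  # -> gato perro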
APA, Harvard, Vancouver, ISO, and other styles
33

Lopez, Adam David. "Machine translation by pattern matching." College Park, Md.: University of Maryland, 2008. http://hdl.handle.net/1903/8110.

Full text
Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2008.
Thesis research directed by: Dept. of Linguistics and Institute for Advanced Computer Studies. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
APA, Harvard, Vancouver, ISO, and other styles
34

Fomicheva, Marina. "The Role of human reference translation in machine translation evaluation." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/404987.

Full text
Abstract:
Both manual and automatic methods for Machine Translation (MT) evaluation heavily rely on professional human translation. In manual evaluation, human translation is often used instead of the source text in order to avoid the need for bilingual speakers, whereas the majority of automatic evaluation techniques measure string similarity between MT output and a human translation (commonly referred to as candidate and reference translations), assuming that the closer they are, the higher the MT quality. In spite of the crucial role of human reference translation in the assessment of MT quality, its fundamental characteristics have been largely disregarded. An inherent property of professional translation is the adaptation of the original text to the expectations of the target audience. As a consequence, human translation can be rather different from the original text, which, as will be shown throughout this work, has a strong impact on the results of MT evaluation. The first goal of our research was to assess the effects of using human translation as a benchmark for MT evaluation. To achieve this goal, we started with a theoretical discussion of the relation between original and translated texts. We identified the presence of optional translation shifts as one of the fundamental characteristics of human translation. We analyzed the impact of translation shifts on automatic and manual MT evaluation showing that in both cases quality assessment is strongly biased by the reference provided. The second goal of our work was to improve the accuracy of automatic evaluation in terms of the correlation with human judgments. Given the limitations of reference-based evaluation discussed in the first part of the work, instead of considering different aspects of similarity we focused on the differences between MT output and reference translation searching for criteria that would allow distinguishing between acceptable linguistic variation and deviations induced by MT errors. In the first place, we explored the use of local syntactic context for validating the matches between candidate and reference words. In the second place, to compensate for the lack of information regarding the MT segments for which no counterpart in the reference translation was found, we enhanced reference-based evaluation with fluency-oriented features. We implemented our approach as a family of automatic evaluation metrics that showed highly competitive performance in a series of well-known MT evaluation campaigns.
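A minimal sketch of context-validated matching in the spirit described above, with a simple word window standing in for the local syntactic context used in the thesis:

    def context(words: list[str], i: int, size: int = 1) -> set[str]:
        return set(words[max(0, i - size):i] + words[i + 1:i + 1 + size])

    def validated_matches(candidate: str, reference: str) -> list[str]:
        cand, ref = candidate.split(), reference.split()
        matches = []
        for i, w in enumerate(cand):
            for j, v in enumerate(ref):
                # Accept a lexical match only if the surrounding words also
                # overlap, so accidental matches in the wrong context do not
                # inflate the score.
                if w == v and context(cand, i) & context(ref, j):
                    matches.append(w)
                    break
        return matches

    print(validated_matches("the cat sat down", "the cat lay down"))
    # -> ['the', 'cat']  ('down' is rejected: its neighbours differ)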
APA, Harvard, Vancouver, ISO, and other styles
35

Mehay, Dennis Nolan. "Bean Soup Translation: Flexible, Linguistically-motivated Syntax for Machine Translation." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1345433807.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Moré, i. López Joaquim. "Machine Translationness: a Concept for Machine Translation Evaluation and Detection." Doctoral thesis, Universitat Oberta de Catalunya, 2015. http://hdl.handle.net/10803/305494.

Full text
Abstract:
Machine translationness (MTness) comprises the linguistic phenomena that make machine translations distinguishable from human translations. This thesis introduces MTness as a research object and presents an MT evaluation method based on determining whether a translation is machinelike, instead of determining its humanlikeness as in current evaluation approaches. The method rates the MTness of a translation with a metric, the MTS (Machine Translationness Score). The MTS calculation is in accordance with the results of an experimental study on machine translation perception by ordinary people. The MTS proved to correlate well with human ratings of translation quality. Moreover, our approach allows cheap evaluations to be performed, since expensive resources (e.g. reference translations, training corpora) are not needed. Machine translationness ratings can be applied beyond machine translation evaluation (detection of plagiarism and other forms of cheating, detection of unsupervised MT documents published on the Web, etc.).
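One way to operationalise such a machine-likeness score, as an illustrative assumption rather than the MTS formula itself, is the fraction of a translation's bigrams unattested in a target-language corpus:

    target_corpus = "the white horse runs . the horse is white".split()
    attested = set(zip(target_corpus, target_corpus[1:]))

    def mtness(translation: str) -> float:
        # Higher score = more machine-like (more unattested word sequences).
        words = translation.split()
        bigrams = list(zip(words, words[1:]))
        if not bigrams:
            return 0.0
        unattested = sum(1 for b in bigrams if b not in attested)
        return unattested / len(bigrams)

    print(mtness("the horse is white"))  # -> 0.0 (fluent, fully attested)
    print(mtness("the horse white is"))  # -> ~0.67 (machine-like word order)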
APA, Harvard, Vancouver, ISO, and other styles
37

Giménez, Linares Jesús Ángel. "Empirical machine translation and its evaluation." Doctoral thesis, Universitat Politècnica de Catalunya, 2008. http://hdl.handle.net/10803/6674.

Full text
Abstract:
In this thesis we have exploited current Natural Language Processing technology for Empirical Machine Translation and its Evaluation.

On the one hand, we have studied the problem of automatic MT evaluation. We have analyzed the main deficiencies of current evaluation methods, which arise, in our opinion, from the shallow quality principles upon which they are based. Instead of relying on the lexical dimension alone, we suggest a novel path towards heterogeneous evaluations. Our approach is based on the design of a rich set of automatic metrics devoted to capturing a wide variety of translation quality aspects at different linguistic levels (lexical, syntactic and semantic). These linguistic metrics have been evaluated over different scenarios. The most notable finding is that metrics based on deeper linguistic information (syntactic/semantic) are able to produce more reliable system rankings than metrics which limit their scope to the lexical dimension, especially when the systems under evaluation are different in nature. However, at the sentence level, some of these metrics suffer a significant performance decrease, which is mainly attributable to parsing errors. In order to improve sentence-level evaluation, apart from backing off to lexical similarity in the absence of parsing, we have also studied the possibility of combining the scores conferred by metrics at different linguistic levels into a single measure of quality. Two valid non-parametric strategies for metric combination have been presented. These offer the important advantage of not having to adjust the relative contribution of each metric to the overall score. As a complementary issue, we show how to use the heterogeneous set of metrics to obtain automatic and detailed linguistic error analysis reports.

On the other hand, we have studied the problem of lexical selection in Statistical Machine Translation. For that purpose, we have constructed a Spanish-to-English baseline phrase-based Statistical Machine Translation system and iterated across its development cycle, analyzing how to improve its performance through the incorporation of linguistic knowledge. First, we extended the system by combining shallow-syntactic translation models based on linguistic data views, obtaining a significant improvement. This system was further enhanced using dedicated discriminative phrase translation models. These models allow for a better representation of the translation context in which phrases occur, effectively yielding improved lexical choice. However, based on the proposed heterogeneous evaluation methods and the manual evaluations conducted, we have found that improvements in lexical selection do not necessarily imply an improved overall syntactic or semantic structure. The incorporation of dedicated predictions into the statistical framework therefore requires further study.

As a side question, we have studied one of the main criticisms against empirical MT systems, i.e., their strong domain dependence, and how its negative effects may be mitigated by properly combining outer knowledge sources when porting a system into a new domain. We have successfully ported an English-to-Spanish phrase-based Statistical Machine Translation system trained on the political domain to the domain of dictionary definitions.

The two parts of this thesis are tightly connected, since the hands-on development of an actual MT system has allowed us to experience in first person the role of the evaluation methodology in the development cycle of MT systems.
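A minimal sketch of non-parametric metric combination, assuming min-max normalisation and an unweighted average (so no per-metric contribution needs tuning); the scores are invented:

    def normalise(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        return {s: (v - lo) / (hi - lo) if hi > lo else 0.5 for s, v in scores.items()}

    # Hypothetical scores for three MT systems from metrics at different
    # linguistic levels (lexical, syntactic, semantic).
    metrics = {
        "lexical":   {"sysA": 0.31, "sysB": 0.28, "sysC": 0.35},
        "syntactic": {"sysA": 0.62, "sysB": 0.70, "sysC": 0.66},
        "semantic":  {"sysA": 0.44, "sysB": 0.51, "sysC": 0.43},
    }

    systems = ["sysA", "sysB", "sysC"]
    norm = {m: normalise(s) for m, s in metrics.items()}
    combined = {s: sum(norm[m][s] for m in metrics) / len(metrics) for s in systems}
    print(max(combined, key=combined.get), combined)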
APA, Harvard, Vancouver, ISO, and other styles
38

Kauchak, David. "Contributions to research on machine translation." Connect to a 24 p. preview or request complete full text in PDF format. Access restricted to UC campuses, 2006. http://wwwlib.umi.com/cr/ucsd/fullcit?p3237012.

Full text
Abstract:
Thesis (Ph. D.)--University of California, San Diego, 2006.
Title from first page of PDF file (viewed December 8, 2006). Available via ProQuest Digital Dissertations. Vita. Includes bibliographical references (p. 87-92).
APA, Harvard, Vancouver, ISO, and other styles
39

Shah, Kashif. "Model adaptation techniques in machine translation." Phd thesis, Université du Maine, 2012. http://tel.archives-ouvertes.fr/tel-00718226.

Full text
Abstract:
Nowadays several indicators suggest that the statistical approach to machine translation is the most promising. It allows fast development of systems for any language pair provided that sufficient training data is available. Statistical Machine Translation (SMT) systems use parallel texts, also called bitexts, as training material for creation of the translation model, and monolingual corpora for target language modeling. The performance of an SMT system heavily depends upon the quality and quantity of available data. In order to train the translation model, parallel texts are collected from various sources and domains. These corpora are usually concatenated, word alignments are calculated and phrases are extracted. However, parallel data is quite inhomogeneous in many practical applications with respect to several factors like data source, alignment quality, appropriateness to the task, etc. This means that the corpora are not weighted according to their importance to the domain of the translation task. Therefore, it is the domain of the training resources that influences the translations that are selected among several choices. This is in contrast to the training of the language model, for which well-known techniques are used to weight the various sources of texts. We have proposed novel methods to automatically weight the heterogeneous data to adapt the translation model. In a first approach, this is achieved with a resampling technique. A weight is assigned to each bitext to select the proportion of data from that corpus. The alignments coming from each bitext are resampled based on these weights. The weights of the corpora are directly optimized on the development data using a numerical method. Moreover, an alignment score of each aligned sentence pair is used as a confidence measurement. In an extended work, we obtain such a weighting by resampling alignments using weights that decrease with the temporal distance of bitexts to the test set. By these means, we can use all the available bitexts and still put an emphasis on the most recent ones. The main idea of our approach is to use a parametric form, or meta-weights, for the weighting of the different parts of the bitexts. This ensures that our approach has only a few parameters to optimize. In another work, we have proposed a generic framework which takes into account corpus-level and sentence-level "goodness scores" during the calculation of the phrase table, which results in a better distribution of the probability mass of the individual phrase pairs.
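A minimal sketch of the resampling step, with hypothetical corpus weights (in the thesis these are optimised on development data with a numerical method):

    import random

    random.seed(0)

    bitexts = {
        "news":      [("s1", "t1"), ("s2", "t2"), ("s3", "t3")],
        "subtitles": [("s4", "t4"), ("s5", "t5")],
        "web":       [("s6", "t6")],
    }
    weights = {"news": 0.6, "subtitles": 0.3, "web": 0.1}  # hypothetical values

    def resample(n: int) -> list[tuple[str, str]]:
        corpora = list(bitexts)
        sample = []
        for _ in range(n):
            # Draw a corpus in proportion to its weight, then one aligned pair.
            c = random.choices(corpora, weights=[weights[c] for c in corpora])[0]
            sample.append(random.choice(bitexts[c]))
        return sample

    # Phrase extraction would then run on the resampled alignments.
    print(resample(5))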
APA, Harvard, Vancouver, ISO, and other styles
40

Nakazawa, Toshiaki. "Fully Syntactic Example-based Machine Translation." 京都大学 (Kyoto University), 2010. http://hdl.handle.net/2433/120373.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Yamashita, Naomi. "Supporting machine translation mediated collaborative work." 京都大学 (Kyoto University), 2006. http://hdl.handle.net/2433/135939.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Payvar, Bamdad. "Machine Translation, universal languages and Descartes." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-3643.

Full text
Abstract:
The aim of this thesis is to explore Machine Translation and the problems these systems experience when translating between two different languages. The grammatical structures of English, Swedish and Persian are studied to find a common pattern that could relate different ideas in each language to each other. In addition, an interlingual MT system is developed according to René Descartes' principles, which not only produces translations into English, Persian and Swedish but also provides a new way of inputting text by clicking buttons, each representing a word or concept. The system is then presented to a group of selected users in order to study human interaction with the application, identify new problems associated with the newly developed system, and evaluate the results. The specific objectives concern the role of prepositions and other grammatical structures in determining the meaning of a text. The study also examines the possibility of using Descartes' theory to improve Machine Translation. The study was conducted at BTH. The data was collected through research, experiments, and self-reporting.
APA, Harvard, Vancouver, ISO, and other styles
43

Song, Xingyi. "Training machine translation for human acceptability." Thesis, University of Sheffield, 2016. http://etheses.whiterose.ac.uk/14284/.

Full text
Abstract:
Discriminative training, a.k.a. tuning, is an important part of Statistical Machine Translation. This step optimises the weights of the several statistical models and heuristics used in a machine translation system, in order to balance their relative effect on the translation output. Different weights lead to significant changes in the quality of translation outputs, and thus selecting appropriate weights is of key importance. This thesis addresses three major problems with current discriminative training methods in order to improve translation quality. First, we design more accurate automatic machine translation evaluation metrics that correlate better with human judgements. An automatic evaluation metric is used in the loss function in most discriminative training methods, but which metric is best for this purpose is still an open question. In this thesis we propose two novel evaluation metrics that achieve better correlation with human judgements than the current de facto standard, the BLEU metric. We show that these metrics can improve translation quality when used in discriminative training. Second, we design an algorithm to select sentence pairs for training the discriminative learner from large pools of freely available parallel sentences. These resources tend to be noisy and include translations of varying degrees of quality and suitability for the translation task at hand, especially if obtained using crowdsourcing methods. Nevertheless, they are crucial when professionally created training data is scarce or unavailable. There is very little previous research on data selection for discriminative training. Our novel data selection algorithm does not require knowledge of the test set nor use decoding outputs, and is thus more generally useful and efficient. Our experiments show that with this data selection algorithm, translation quality consistently improves over strong baselines. Finally, the third component of the thesis is a novel weighted ranking-based optimisation algorithm for discriminative training. In contrast to previous approaches, this technique assigns a different weight to each training instance according to its reachability and its relationship to the test sentence being decoded, a form of transductive learning. Our experimental results show improvements over a modern state-of-the-art method across different language pairs. Overall, the proposed approaches lead to better translation quality when compared to strong baselines in our experiments, both in isolation and when combined, and can be easily applied to most existing statistical machine translation approaches.
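A minimal sketch of decoder-independent data selection, using invented heuristics (length ratio and vocabulary coverage) that stand in for the thesis's actual algorithm:

    def pair_score(src: str, tgt: str, vocab: set[str]) -> float:
        s, t = src.split(), tgt.split()
        ratio = min(len(s), len(t)) / max(len(s), len(t))  # penalise misaligned pairs
        coverage = sum(w in vocab for w in s) / len(s)     # penalise rare/noisy text
        return ratio * coverage

    pool = [
        ("the cat sleeps", "el gato duerme"),
        ("the cat", "el gato duerme profundamente en la casa"),  # suspicious pair
        ("zxqv foo", "el gato"),                                 # noisy source
    ]
    vocab = {"the", "cat", "sleeps", "dog", "eats"}

    selected = sorted(pool, key=lambda p: pair_score(*p, vocab), reverse=True)[:1]
    print(selected)  # -> [('the cat sleeps', 'el gato duerme')]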
APA, Harvard, Vancouver, ISO, and other styles
44

Ueffing, Nicola. "Word confidence measures for machine translation." [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=97967669X.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Birch, Alexandra. "Reordering metrics for statistical machine translation." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5024.

Full text
Abstract:
Natural languages display a great variety of different word orders, and one of the major challenges facing statistical machine translation is in modelling these differences. This thesis is motivated by a survey of 110 different language pairs drawn from the Europarl project, which shows that word order differences account for more variation in translation performance than any other factor. This wide-ranging analysis provides compelling evidence for the importance of research into reordering. There has already been a great deal of research into improving the quality of the word order in machine translation output. However, there has been very little analysis of how best to evaluate this research. Current machine translation metrics are largely focused on evaluating the words used in translations, and their ability to measure the quality of word order has not been demonstrated. In this thesis we introduce novel metrics for quantitatively evaluating reordering. Our approach isolates the word order in translations by using word alignments. We reduce alignment information to permutations and apply standard distance metrics to compare the word order in the reference to that of the translation. We show that our metrics correlate more strongly with human judgements of word order quality than current machine translation metrics. We also show that a combined lexical and reordering metric, the LRscore, is useful for training translation model parameters. Humans prefer the output of models trained using the LRscore as the objective function over those trained with the de facto standard translation metric, the BLEU score. The LRscore thus provides researchers with a reliable metric for evaluating the impact of their research on the quality of word order.
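A minimal sketch of a permutation-based reordering metric of the kind described above, using normalised Kendall's tau distance against the reference order:

    from itertools import combinations

    def kendall_tau_distance(perm: list[int]) -> float:
        # Fraction of position pairs that are out of order relative to the
        # identity permutation (the reference word order).
        pairs = list(combinations(range(len(perm)), 2))
        discordant = sum(1 for i, j in pairs if perm[i] > perm[j])
        return discordant / len(pairs)

    def reordering_score(perm: list[int]) -> float:
        return 1.0 - kendall_tau_distance(perm)  # 1.0 = identical word order

    print(reordering_score([0, 1, 2, 3]))  # monotone translation -> 1.0
    print(reordering_score([1, 0, 3, 2]))  # two local swaps      -> ~0.67
    print(reordering_score([3, 2, 1, 0]))  # fully inverted       -> 0.0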
APA, Harvard, Vancouver, ISO, and other styles
46

Babych, Bogdan. "Information extraction technology in machine translation." Thesis, University of Leeds, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.416402.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Trujillo, Indalecio Arturo. "Lexicalist machine translation of spatial prepositions." Thesis, University of Cambridge, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.388507.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Bérard, Alexandre. "Neural machine translation architectures and applications." Thesis, Lille 1, 2018. http://www.theses.fr/2018LIL1I022/document.

Full text
Abstract:
This thesis is centered on two main objectives: adaptation of Neural Machine Translation techniques to new tasks, and research replication. Our efforts towards research replication have led to the production of two resources: MultiVec, a framework that facilitates the use of several techniques related to word embeddings (Word2vec, Bivec and Paragraph Vector); and a framework for Neural Machine Translation that implements several architectures and can be used for regular MT, Automatic Post-Editing, and Speech Recognition or Translation. These two resources are publicly available and now extensively used by the research community. We extend our NMT framework to work on three related tasks: Machine Translation (MT), Automatic Speech Translation (AST) and Automatic Post-Editing (APE). For the machine translation task, we replicate pioneering neural-based work, and do a case study on TED talks where we advance the state of the art. Automatic speech translation consists in translating speech in one language to text in another language. In this thesis, we focus on the unexplored problem of end-to-end speech translation, which does not use an intermediate source-language text transcription. We propose the first model for end-to-end AST and apply it to two benchmarks: translation of audiobooks and of basic travel expressions. Our final task is automatic post-editing, which consists in automatically correcting the outputs of an MT system in a black-box scenario, by training on data produced by human post-editors. We replicate and extend published results on the WMT 2016 and 2017 tasks, and propose new neural architectures for low-resource automatic post-editing.
APA, Harvard, Vancouver, ISO, and other styles
49

Logacheva, Varvara. "Human feedback in Statistical Machine Translation." Thesis, University of Sheffield, 2017. http://etheses.whiterose.ac.uk/18534/.

Full text
Abstract:
The thesis addresses the challenge of improving Statistical Machine Translation (SMT) systems via feedback given by humans on translation quality. The amount of human feedback available to systems is inherently low due to cost and time limitations. One of our goals is to simulate such information by automatically generating pseudo-human feedback. This is performed using Quality Estimation (QE) models. QE is a technique for predicting the quality of automatic translations without comparing them to oracle (human) translations, traditionally at the sentence or word levels. QE models are trained on a small collection of automatic translations manually labelled for quality, and then can predict the quality of any number of unseen translations. We propose a number of improvements for QE models in order to increase the reliability of pseudo-human feedback. These include strategies to artificially generate instances for settings where QE training data is scarce. We also introduce a new level of granularity for QE: the level of phrases. This level aims to improve the quality of QE predictions by better modelling inter-dependencies among errors at word level, and in ways that are tailored to phrase-based SMT, where the basic unit of translation is a phrase. This can thus facilitate work on incorporating human feedback during the translation process. Finally, we introduce approaches to incorporate pseudo-human feedback in the form of QE predictions in SMT systems. More specifically, we use quality predictions to select the best translation from a number of alternative suggestions produced by SMT systems, and integrate QE predictions into an SMT system decoder in order to guide the translation generation process.
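A minimal sketch of sentence-level QE as pseudo-human feedback, with invented features and toy data, assuming scikit-learn is available:

    from sklearn.linear_model import LinearRegression

    def features(source: str, translation: str) -> list[float]:
        s, t = source.split(), translation.split()
        return [
            len(t) / len(s),                  # length ratio
            sum(len(w) for w in t) / len(t),  # mean target word length
            len(set(t)) / len(t),             # type/token ratio (repetition cue)
        ]

    # Tiny labelled set: (source, MT output, human quality score in [0, 1]).
    train = [
        ("the cat sleeps", "el gato duerme", 0.9),
        ("the dog eats", "el perro perro come", 0.4),
        ("the horse runs", "caballo", 0.3),
    ]
    X = [features(s, t) for s, t, _ in train]
    y = [score for _, _, score in train]

    qe = LinearRegression().fit(X, y)
    # Predicted quality for an unseen translation, usable e.g. to rank
    # alternative MT outputs without a reference translation.
    print(qe.predict([features("the bird sings", "el pajaro canta")]))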
APA, Harvard, Vancouver, ISO, and other styles
50

Tapkanova, Elmira. "Machine Translation and Text Simplification Evaluation." Scholarship @ Claremont, 2016. http://scholarship.claremont.edu/scripps_theses/790.

Full text
Abstract:
Machine translation translates a text from one language to another, while text simplification converts a text from its original form to a simpler one, usually in the same language. This survey paper discusses the evaluation (manual and automatic) of both fields, providing an overview of existing metrics along with their strengths and weaknesses. The first chapter takes an in-depth look at machine translation evaluation metrics, namely BLEU, NIST, AMBER, LEPOR, MP4IBM1, TER, MMS, METEOR, TESLA, RTE, and HTER. The second chapter focuses more generally on text simplification, starting with a discussion of the theoretical underpinnings of the field (i.e. what "simple" means). Then, an overview of automatic evaluation metrics, namely BLEU and Flesch-Kincaid, is given, along with common approaches to text simplification. The paper concludes with a discussion of the future trajectory of both fields.
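A minimal sketch of the Flesch-Kincaid grade-level formula mentioned above, with a crude vowel-group heuristic for syllable counting (real implementations use dictionaries or better heuristics):

    import re

    def count_syllables(word: str) -> int:
        # Approximate syllables as runs of vowels; at least one per word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text: str) -> float:
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[a-zA-Z]+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (0.39 * len(words) / len(sentences)
                + 11.8 * syllables / len(words)
                - 15.59)

    simple = "The cat sat. The dog ran."
    complex_ = "Notwithstanding considerable methodological heterogeneity, outcomes converged."
    print(round(flesch_kincaid_grade(simple), 1), round(flesch_kincaid_grade(complex_), 1))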
APA, Harvard, Vancouver, ISO, and other styles
