Dissertations / Theses on the topic 'Multiwords'

Author: Grafiati

Published: 10 March 2023

Consult the top 36 dissertations / theses for your research on the topic 'Multiwords.'

1

Monti, Johanna. "Multi-word unit processing in machine translation. Developing and using language resources for multi-word unit processing in machine translation." Doctoral thesis, Universita degli studi di Salerno, 2015. http://hdl.handle.net/10556/2042.

2

Waszczuk, Jakub. "Leveraging MWEs in practical TAG parsing : towards the best of the two worlds." Thesis, Tours, 2017. http://www.theses.fr/2017TOUR4024/document.

Abstract:

In this thesis, we focus on multiword expressions (MWEs) and their relationship with syntactic parsing, the task of retrieving the syntactic relations holding between the words in a given sentence. The challenge of MWEs in this respect is that, in contrast to regular linguistic expressions, they exhibit various irregular properties which make them harder to deal with in natural language processing. In our work, we show that the challenge of MWE-related irregularities can be turned into an advantage in practical symbolic parsing. Namely, with tree adjoining grammars (TAGs), which provide first-class support for MWEs, and A* search strategies, considerable speed-up gains can be achieved by promoting MWE-based analyses with virtually no loss in syntactic parsing accuracy. This is in contrast to purely statistical state-of-the-art parsers, which, despite their efficiency, provide no satisfactory support for MWEs. We contribute a TAG-A*-MWE-aware parsing architecture with facilities (grammar compression and feature structures) enabling real-world applications, easily extensible to a probabilistic framework.

3

Su, Kim Nam. "Statistical modeling of multiword expressions." Connect to thesis, 2008. http://repository.unimelb.edu.au/10187/3147.

Abstract:

In natural languages, words can occur in single units called simplex words or in a group of simplex words that function as a single unit, called multiword expressions (MWEs). Although MWEs are similar to simplex words in their syntax and semantics, they pose their own sets of challenges (Sag et al. 2002). MWEs are arguably one of the biggest roadblocks in computational linguistics due to the bewildering range of syntactic, semantic, pragmatic and statistical idiomaticity they are associated with, and their high productivity. In addition, the large numbers in which they occur demand specialized handling. Moreover, dealing with MWEs has a broad range of applications, from syntactic disambiguation to semantic analysis in natural language processing (NLP) (Wacholder and Song 2003; Piao et al. 2003; Baldwin et al. 2004; Venkatapathy and Joshi 2006).
Our goals in this research are: to use computational techniques to shed light on the underlying linguistic processes giving rise to MWEs across constructions and languages; to generalize existing techniques by abstracting away from individual MWE types; and finally to exemplify the utility of MWE interpretation within general NLP tasks.
In this thesis, we target English MWEs due to resource availability. In particular, we focus on noun compounds (NCs) and verb-particle constructions (VPCs) due to their high productivity and frequency.
Challenges in processing noun compounds are: (1) interpreting the semantic relation (SR) that represents the underlying connection between the head noun and modifier(s); (2) resolving syntactic ambiguity in NCs comprising three or more terms; and (3) analyzing the impact of word sense on noun compound interpretation. Our basic approach to interpreting NCs relies on the semantic similarity of the NC components using firstly a nearest-neighbor method (Chapter 5), then verb semantics based on the observation that it is often an underlying verb that relates the nouns in NCs (Chapter 6), and finally semantic variation within NC sense collocations, in combination with bootstrapping (Chapter 7).
Challenges in dealing with verb-particle constructions are: (1) identifying VPCs in raw text data (Chapter 8); and (2) modeling the semantic compositionality of VPCs (Chapter 5). We place particular focus on identifying VPCs in context, and measuring the compositionality of unseen VPCs in order to predict their meaning. Our primary approach to the identification task is to adapt localized context information derived from linguistic features of VPCs to distinguish between VPCs and simple verb-PP combinations. To measure the compositionality of VPCs, we use semantic similarity among VPCs by testing the semantic contribution of each component.
Finally, we conclude the thesis with a chapter-by-chapter summary and outline of the findings of our work, suggestions of potential NLP applications, and a presentation of further research directions (Chapter 9).

4

Korkontzelos, Ioannis. "Unsupervised learning of multiword expressions." Thesis, University of York, 2010. http://etheses.whiterose.ac.uk/2091/.

Abstract:

Multiword expressions are expressions consisting of two or more words that correspond to some conventional way of saying things (Manning & Schütze 1999). Due to the idiomatic nature of many of them and their high frequency of occurrence in all sorts of text, they cause problems in many Natural Language Processing (NLP) applications and are frequently responsible for their shortcomings. Efficiently recognising multiword expressions and deciding the degree of their idiomaticity would be useful to all applications that require some degree of semantic processing, such as question-answering, summarisation, parsing, language modelling and language generation. In this thesis we investigate the issues of recognising multiword expressions, domain-specific or not, and of deciding whether they are idiomatic. Moreover, we inspect the extent to which multiword expressions can contribute to a basic NLP task such as shallow parsing, and ways in which the basic property of multiword expressions, idiomaticity, can be employed to define a novel task for Compositional Distributional Semantics (CDS). The results show that it is possible to recognise multiword expressions and decide their compositionality in an unsupervised manner, based on co-occurrence statistics and distributional semantics. Further, multiword expressions are beneficial for other fundamental applications of Natural Language Processing, either by direct integration or as an evaluation tool. In particular, termhood-based methods, which are based on nestedness information, are shown to outperform unithood-based methods, which measure the strength of association among the constituents of a multi-word candidate term. A simple heuristic was shown to perform better than more sophisticated methods. A new graph-based algorithm employing sense induction is proposed to address multiword expression compositionality and is shown to perform better than a standard vector space model. Its parameters were estimated by an unsupervised scheme based on graph connectivity. Multiword expressions are shown to contribute to shallow parsing. Moreover, they are used to define a new evaluation task for distributional semantic composition models.

5

Taslimipoor, Shiva. "Automatic identification and translation of multiword expressions." Thesis, University of Wolverhampton, 2018. http://hdl.handle.net/2436/622068.

Abstract:

Multiword Expressions (MWEs) belong to a class of phraseological phenomena that is ubiquitous in the study of language. They are heterogeneous lexical items consisting of more than one word and feature lexical, syntactic, semantic and pragmatic idiosyncrasies. Scholarly research on MWEs benefits both natural language processing (NLP) applications and end users. This thesis involves designing new methodologies to identify and translate MWEs. In order to deal with MWE identification, we first develop datasets of annotated verb-noun MWEs in context. We then propose a method which employs word embeddings to disambiguate between literal and idiomatic usages of the verb-noun expressions. Existence of expression types with various idiomatic and literal distributions leads us to re-examine their modelling and evaluation. We propose a type-aware train and test splitting approach to prevent models from overfitting and avoid misleading evaluation results. Identification of MWEs in context can be modelled with sequence tagging methodologies. To this end, we devise a new neural network architecture, which is a combination of convolutional neural networks and long-short term memories with an optional conditional random field layer on top. We conduct extensive evaluations on several languages demonstrating a better performance compared to the state-of-the-art systems. Experiments show that the generalisation power of the model in predicting unseen MWEs is significantly better than previous systems. In order to find translations for verb-noun MWEs, we propose a bilingual distributional similarity approach derived from a word embedding model that supports arbitrary contexts. The technique is devised to extract translation equivalents from comparable corpora which are an alternative resource to costly parallel corpora. We finally conduct a series of experiments to investigate the effects of size and quality of comparable corpora on automatic extraction of translation equivalents.
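The abstract does not spell out how the type-aware split is built. One plausible reading, sketched below in Python, is to hold out whole expression types so that no verb-noun type seen in training reappears at test time. The function name `type_aware_split`, the `(type, sentence, label)` tuple layout, and the toy data are assumptions made for illustration, not the thesis's implementation.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_ratio=0.2, seed=0):
    """Split instances so that no expression type is shared by train and test.

    Each instance is assumed to be a tuple (expression_type, sentence, label);
    this layout is hypothetical and chosen only for the sketch.
    """
    # Group all occurrences by their expression type.
    by_type = defaultdict(list)
    for inst in instances:
        by_type[inst[0]].append(inst)

    # Shuffle the types (not the instances) reproducibly.
    types = sorted(by_type)
    random.Random(seed).shuffle(types)

    # Hold out at least one whole type for testing.
    n_test = max(1, round(len(types) * test_ratio))
    test_types = set(types[:n_test])

    train = [i for t in types if t not in test_types for i in by_type[t]]
    test = [i for t in types if t in test_types for i in by_type[t]]
    return train, test


if __name__ == "__main__":
    # Toy data: 1 = idiomatic usage, 0 = literal usage (made-up examples).
    data = [
        ("take place", "The meeting took place on Monday.", 1),
        ("take place", "Take your place in the queue.", 0),
        ("make sense", "That does not make sense.", 1),
        ("have time", "Do you have time?", 0),
        ("give way", "The bridge gave way.", 1),
    ]
    train, test = type_aware_split(data, test_ratio=0.25)
    print(sorted({i[0] for i in train}), sorted({i[0] for i in test}))
```

Splitting by type rather than by instance means a model cannot score well on the test set simply by memorising which expressions tend to be idiomatic, which is the kind of misleading evaluation the abstract warns about.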

6

Cordeiro, Silvio Ricardo. "Distributional models of multiword expression compositionality prediction." Thesis, Aix-Marseille, 2017. http://www.theses.fr/2017AIXM0501/document.

Abstract:

Natural language processing systems often rely on the idea that language is compositional, that is, the meaning of a linguistic entity can be inferred from the meaning of its parts. This expectation fails in the case of multiword expressions (MWEs). For example, a person who is a "sitting duck" is neither a duck nor necessarily sitting. Modern computational techniques for inferring word meaning based on the distribution of words in the text have been quite successful at multiple tasks, especially since the rise of word embedding approaches. However, the representation of MWEs still remains an open problem in the field. In particular, it is unclear how one could predict from corpora whether a given MWE should be treated as an indivisible unit (e.g. "nut case") or as some combination of the meaning of its parts (e.g. "engine room"). This thesis proposes a framework of MWE compositionality prediction based on representations of distributional semantics, which we instantiate under a variety of parameters. We present a thorough evaluation of the impact of these parameters on three new datasets of MWE compositionality, encompassing English, French and Portuguese MWEs. Finally, we present an extrinsic evaluation of the predicted levels of MWE compositionality on the task of MWE identification. Our results suggest that the proper choice of distributional model and corpus parameters can produce compositionality predictions that are comparable to the state of the art.
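A common way to turn distributional vectors into a compositionality score, and one possible reading of the framework summarised above, is to compare the vector of the whole MWE with a vector composed from its parts. The Python sketch below uses additive composition and cosine similarity; the function names and the three-dimensional toy vectors are assumptions for illustration and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compositionality(mwe_vec, part_vecs):
    """Score an MWE by the similarity between its own vector and the sum of its parts."""
    composed = [sum(dims) for dims in zip(*part_vecs)]
    return cosine(mwe_vec, composed)


if __name__ == "__main__":
    # Hypothetical vectors: "engine room" lies close to engine + room,
    # while "nut case" lies far from nut + case.
    engine, room, engine_room = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]
    nut, case, nut_case = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
    print(compositionality(engine_room, [engine, room]))
    print(compositionality(nut_case, [nut, case]))
```

A score near 1 suggests the expression can be treated as a combination of its parts, and a score near 0 suggests it behaves as an indivisible unit. In practice the vectors would come from a distributional model trained on a corpus, and the choice of model and corpus parameters is exactly what the thesis evaluates.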

7

Cordeiro, Silvio Ricardo. "Distributional models of multiword expression compositionality prediction." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2018. http://hdl.handle.net/10183/174519.

Abstract:

Natural language processing systems often rely on the idea that language is compositional, that is, the meaning of a linguistic entity can be inferred from the meaning of its parts. This expectation fails in the case of multiword expressions (MWEs). For example, a person who is a sitting duck is neither a duck nor necessarily sitting. Modern computational techniques for inferring word meaning based on the distribution of words in the text have been quite successful at multiple tasks, especially since the rise of word embedding approaches. However, the representation of MWEs still remains an open problem in the field. In particular, it is unclear how one could predict from corpora whether a given MWE should be treated as an indivisible unit (e.g. nut case) or as some combination of the meaning of its parts (e.g. engine room). This thesis proposes a framework of MWE compositionality prediction based on representations of distributional semantics, which we instantiate under a variety of parameters. We present a thorough evaluation of the impact of these parameters on three new datasets of MWE compositionality, encompassing English, French and Portuguese MWEs. Finally, we present an extrinsic evaluation of the predicted levels of MWE compositionality on the task of MWE identification. Our results suggest that the proper choice of distributional model and corpus parameters can produce compositionality predictions that are comparable to the state of the art.

8

Alghamdi, Ayman Ahmad O. "A computational lexicon and representational model for Arabic multiword expressions." Thesis, University of Leeds, 2018. http://etheses.whiterose.ac.uk/22821/.

Abstract:

The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. 
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.

9

Obermeier, Andrew Stanton. "Multiword Units at the Interface: Deliberate Learning and Implicit Knowledge Gains." Diss., Temple University Libraries, 2015. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/360635.

Abstract:

Language Arts
Ed.D.
Multiword units (MWUs) is a term used in the current study to broadly cover what second language acquisition (SLA) researchers refer to as collocations, conventional expressions, chunks, idioms, formulaic sequences, or other such terms, depending on their research perspective. They are ubiquitous in language and essential in both first language (L1) and second language (L2) acquisition. Although MWUs are typically learned implicitly while using language naturally in both of these types of acquisition, the current study is an investigation of whether they are acquired in implicit knowledge when they are learned explicitly in a process called deliberate paired association learning. In SLA research, it is widely accepted that explicit knowledge is developed consciously and implicit knowledge is developed subconsciously. It is also believed that there is little crossover from explicit learning to implicit knowledge. However, recent research has cast doubt on this assumption. In a series of priming experiments, Elgort (2007, 2011) demonstrated that the formal and semantic lexical representations of deliberately learned pseudowords were accessed fluently and integrated into the mental lexicon, convincing evidence that deliberately learned words are immediately acquired in implicit knowledge. The current study aimed to extend these findings to MWUs in a psycholinguistic experiment that tested for implicit knowledge gains resulting from deliberate learning. Participants’ response times (RTs) were measured in three ways, on two testing instruments. First, subconscious formal recognition processing was measured in a masked repetition priming lexical decision task. In the second instrument, a self-paced reading task, both formulaic sequencing and semantic association gains were measured. The experiment was a counterbalanced, within-subjects design; so all comparisons were between conditions on items. 
Results were analyzed in a repeated measures linear mixed-effects model with participants and items as crossed random effects. The dependent variable was RTs on target words. The primary independent variable was learning condition: half of the critical MWUs were learned and half of them were not. The secondary independent variable was MWU composition at two levels: literal and figurative. The masked priming lexical decision task results showed that priming effects increased especially for learned figurative MWUs, evidence that implicit knowledge gains were made on their formal and semantic lexical representations as a result of deliberate learning. Results of the self-paced reading task were analyzed from two perspectives, but were less conclusive with regard to the effects of deliberate learning. Regarding formulaic sequencing gains, literal MWUs showed the most evidence of acquisition, but this happened as a result of both incidental and deliberate learning. With regard to semantic associations, it was shown that deliberate learning had similar effects on both literal and figurative MWUs. However, a serendipitous finding from this aspect of the self-paced reading results showed clearly that literal MWUs reliably primed semantic associations and sentence processing more strongly than figurative MWUs did, both before and after deliberate learning. In sum, results revealed that the difficulties learners have with developing fluent processing of figurative MWUs can be lessened by deliberate learning. On the other hand, for literal MWUs incidental learning is adequate for incrementally developing representation strength.
Temple University--Theses

10

GARRAO, MILENA DE UZEDA. "THE CORPUS NEVER LIES: ON THE IDENTIFICATION AND USE OF MULTIWORD EXPRESSIONS." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2006. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=8873@1.

Abstract:

COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
Much recent research on the identification and use of multiword expressions (MWEs) assumes a representational view of word meaning. In this study we claim that it is much more interesting to view MWEs from a non-representational perspective. By choosing this path, we avoid relying on time-consuming and controversial human intuitions for MWE identification and definition. Our methodology was tested on Brazilian Portuguese verbal phrases of the V+NP pattern, which is very frequent in Brazilian Portuguese. It is a statistical, corpus-based analysis that can be summed up in three steps: 1) a robust Brazilian Portuguese corpus as the basis of analysis; 2) application of a statistical test to the corpus, namely the log-likelihood test (Banerjee and Pedersen, 2003), in order to detect the most frequent MWEs of the V+NP pattern (such as tomar café) and to disregard casual syntactic co-occurrences of the same lexical items that are not otherwise motivated; 3) application of similarity measures (Baeza-Yates and Ribeiro-Neto, 1999) between all the paragraphs containing a certain MWE (for example, fazer campanha) and all the paragraphs containing its noun outside the MWE (campanha). This last step is used to assess the degree of compositionality of the MWE. We conclude that the higher the similarity between the paragraphs containing the MWE and the paragraphs containing the noun outside the expression, the more compositional the MWE is. Therefore, we believe that this work has both a practical and a theoretical impact on semantics.
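Steps 2 and 3 can be illustrated with a short Python sketch. It assumes that the log-likelihood test is the standard log-likelihood ratio statistic (G²) computed over a 2x2 contingency table of bigram counts, and that the similarity measure is cosine similarity over bag-of-words counts; the abstract does not confirm either choice. The function names and count arguments are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(c12, c1, c2, n):
    """G^2 association score for a word pair (w1, w2).

    c12: count of the pair, c1: count of w1 in first position,
    c2: count of w2 in second position, n: total number of pairs.
    """
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def bow_cosine(paragraphs_a, paragraphs_b):
    """Cosine similarity between the pooled word counts of two paragraph sets."""
    a = Counter(w for p in paragraphs_a for w in p.lower().split())
    b = Counter(w for p in paragraphs_b for w in p.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Made-up counts: a pair that co-occurs far more often than chance predicts.
    print(log_likelihood(c12=100, c1=100, c2=100, n=1000))
    # Made-up paragraphs: those containing the MWE vs. those with the noun alone.
    with_mwe = ["o candidato vai fazer campanha na cidade"]
    noun_only = ["a campanha do candidato chegou na cidade"]
    print(bow_cosine(with_mwe, noun_only))
```

Under these assumptions, a high `log_likelihood` score marks a V+NP pair as a candidate MWE rather than a chance co-occurrence, and a high `bow_cosine` between the two paragraph sets points to a more compositional expression, in line with the conclusion stated in the abstract.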

11

Alshaikhi, Adel Zain. "The Effects of Using Textual Enhancement on Processing and Learning Multiword Expressions." Scholar Commons, 2018. https://scholarcommons.usf.edu/etd/7464.

Abstract:

Multiword Expressions (MWEs) are crucial aspects of language use. Second language (L2) learners need to master these MWEs to be able to communicate effectively. In addition, mastering these MWEs helps L2 learners improve their cognitive processing of language input. In this study, my primary objectives were to explore the effectiveness of using Textual Enhancement (TE) to assist L2 speakers’ comprehension of MWEs, to explore whether there is a difference in comprehension between collocations and idioms, and finally, to explore how L2 speakers transact the MWEs’ meanings as presented in texts. While several researchers have explored how input enhancement in general helps L2 learners to learn collocations and idioms for productive use (e.g., Boers et al., 2017; Pam & Karimi, 2016), my focus in this study was to understand and explain in depth how the technique of TE helps L2 learners comprehend MWEs. I included in this study two types of MWEs: collocations and idioms. I also studied the differences in the comprehension between these two types to further understand the transparency factors in the comprehension process. I employed an explanatory sequential mixed methods design in which I used experimental quantitative methods and qualitative methods in one study. In phase one, I started with the experimental part and followed with the qualitative analysis to explain in depth the outcomes of the experimental part. In the qualitative section, I followed an explanatory descriptive case study approach to obtain a deeper understanding of how the participants transacted the meanings of the MWEs. A total of 26 adult Arabic-speaking students in a major Southeastern university in the United States of America volunteered to take part in this study. I collected data through: (1) a reading proficiency test, and (2) a brief survey to gather background information, self-evaluation of language proficiency, and previous experiences with MWEs. 
In the experimental part, I presented 20 paragraphs derived from online newspaper and magazine articles. Each paragraph contained a collocation or an idiom. Following each paragraph, I presented multiple-choice questions to measure the comprehension of the MWE in the paragraph and an open-ended question for the participants to describe how they had comprehended the MWE. I divided the participants into control and experimental groups in which the MWEs were textually enhanced in the experimental group using bolding, italicization, and highlighting. The results of the study demonstrated TE was effective in assisting the participants to comprehend idioms. In contrast, TE did not show a significant effect in leading the participants to comprehend the collocations. The qualitative data analysis showed the participants used contextual factors, guessing, constituents of the MWEs, and similarities of the MWEs with the first language (L1) as the major strategies to comprehend the MWEs meanings with different degrees between both groups.

12

Ramisch, Carlos Eduardo. "A generic and open framework for multiword expressions treatment : from acquisition to applications." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2012. http://hdl.handle.net/10183/65777.

Abstract:

The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. 
For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation about the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.

13

Acevedo, Giménez César Esteban. "Planificador consciente del almacenamiento para Multiworkflows en Cluster Galaxy." Doctoral thesis, Universitat Autònoma de Barcelona, 2017. http://hdl.handle.net/10803/456672.

Abstract:

In bioinformatics, experiments are carried out as sequences of application executions in which each application takes as its input file the output of the previous one. Such an analysis process, a list of applications forming a dependency chain, is called a workflow. Two relevant characteristics of bioinformatics workflows are the large volumes of data they handle and the complexity of their data dependencies. Many current resource managers ignore the location of files, which implies a high cost when the processing elements are not close to the files and the files have to be moved. The directed acyclic graph (DAG) model, used to represent the execution order of workflow jobs, does not help to establish the best location of input or temporary files for efficient execution. A solution to this challenge may be data-aware scheduling, in which an intelligent file placement strategy, combined with resource scheduling that uses this knowledge, helps to avoid the idle periods caused by processing elements waiting for files. With the computing power of current clusters, multiple workflows can be executed in parallel. In addition, clusters allow multiworkflows to share input and temporary files in the storage hierarchy. We propose a storage hierarchy composed of the distributed file system, a local RamDisk, a local disk and a local solid-state disk (SSD). To solve the assignment of multiworkflow applications to cluster resources, we extended the list-based heuristic for multiworkflows called HEFT (Heterogeneous Earliest Finish Time). It comprises two phases: first a task prioritization phase, and then a processor selection phase, which assigns each application to the node that minimizes its finish time. The proposed data-aware scheduler places the files in the storage hierarchy before execution starts.
Pre-placing files on the compute nodes allows the applications that use them to be assigned to the same node as the files, reducing disk access time. To determine the initial location of the input and temporary files, the scheduler merges all workflows into a single meta-workflow; the algorithm then chooses the appropriate storage level for each file within the hierarchy according to application precedence, file size and degree of sharing. The goal of this work is to implement a data-aware scheduling policy for multiworkflows that improves the makespan of data-intensive applications. To evaluate the scalability of the proposal and to compare it with other policies in the literature, we used simulators. This is a common method for validating scheduling heuristics and saving computation time when looking for the best option. To this end, we extended WorkflowSim with a scheduler that is aware of the storage hierarchy. The work was validated with synthetic workflows, built from the characterization of real bioinformatics applications, and with widely used benchmark workflows such as Montage and Epigenomics, which generate a large number of temporary files. The experiments were performed in two scenarios: a real cluster system with 128 cores and a simulated cluster in WorkflowSim with up to 1024 cores. In the real scenario, we achieved makespan improvements of up to 70%. In the simulated scenario, the makespan improvement was 69%, with errors between 0.9% and 3%.
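
The two HEFT phases named in the abstract (task prioritization, then processor selection) can be illustrated with a minimal sketch of the classic list-based heuristic. This is not the thesis's storage-aware extension: the toy workflow, the cost tables and the function names are all hypothetical, and the sketch appends tasks to the end of each node's schedule rather than using an insertion policy.

```python
from functools import lru_cache

# Hypothetical toy workflow: task -> list of successor tasks.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
# Hypothetical compute cost of each task on each of two nodes.
cost = {"A": [3, 5], "B": [4, 2], "C": [6, 3], "D": [2, 2]}
# Hypothetical transfer cost of each edge when its tasks run on different nodes.
comm = {("A", "B"): 2, ("A", "C"): 1, ("B", "D"): 3, ("C", "D"): 2}

# Predecessor lists derived from the successor lists.
pred = {t: [p for p in succ if t in succ[p]] for t in succ}


@lru_cache(maxsize=None)
def upward_rank(task):
    """Phase 1: average cost of the task plus the longest path to an exit task."""
    avg = sum(cost[task]) / len(cost[task])
    tail = max((comm[(task, s)] + upward_rank(s) for s in succ[task]), default=0.0)
    return avg + tail


def heft(num_nodes=2):
    """Phase 2: place tasks, in rank order, on the node with the earliest finish time."""
    order = sorted(succ, key=upward_rank, reverse=True)
    node_free = [0.0] * num_nodes   # time at which each node becomes idle
    placed = {}                     # task -> (node, finish time)
    for task in order:
        best = None
        for node in range(num_nodes):
            # Inputs are ready once every predecessor has finished and,
            # if it ran elsewhere, its output has been transferred.
            ready = max(
                (placed[p][1] + (0 if placed[p][0] == node else comm[(p, task)])
                 for p in pred[task]),
                default=0.0,
            )
            finish = max(ready, node_free[node]) + cost[task][node]
            if best is None or finish < best[1]:
                best = (node, finish)
        placed[task] = best
        node_free[best[0]] = best[1]
    return order, placed


order, placed = heft()
makespan = max(finish for _, finish in placed.values())
print(order, placed, makespan)
```

In this sketch the transfer cost is zero when producer and consumer share a node. A storage-aware variant along the lines of the abstract would, under the same assumptions, make that cost depend on where in the storage hierarchy each file was pre-placed.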

14

Ochieng, Dunlop. "Indirect Influence of English on Kiswahili: The Case of Multiword Duplicates between Kiswahili and English." Doctoral thesis, Universitätsbibliothek Chemnitz, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:ch1-qucosa-179613.

Abstract:

Some proverbs, idioms, nominal compounds, and slogans duplicate in form and meaning between several languages. An example between German and English is Liebe auf den ersten Blick and “love at first sight” (Flippo, 2009), whereas an example between Kiswahili and English is uchaguzi ulio huru na haki and “free and fair election.” Duplication of these strings of words between languages that are as different in descent and typology as Kiswahili and English is irregular. On this ground, Kiswahili academies and a number of experts of Kiswahili assumed, prior to the present study, that the Kiswahili versions of the expressions are derivatives of their congruent English counterparts. The assumption nonetheless lacked empirical evidence and also discounted other potential causes of the phenomenon, i.e. analogical extension, nativism and cognitive metaphoricalization (Makkai, 1972; Land, 1974; Lakoff & Johnson, 1980b; Ruhlen, 1987; Lakoff, 1987; Gleitman and Newport, 1995). Against this background, we assumed an academic obligation to investigate empirically what causes this formal and semantic duplication of strings of words (multiword expressions) between English and Kiswahili to a degree beyond chance expectations. In this endeavour, we administered a checklist to 24 respondents, an interview to 43, an online questionnaire to 102, a translation test to 47 and a translationality test to 8. The online questionnaire respondents were from 21 regions of Tanzania, whereas those for the other instruments were from Zanzibar, Dar es Salaam, Pwani, Lindi, Dodoma and Kigoma. Complementarily, we analysed the Chemnitz Corpus of Swahili (CCS), the Helsinki Swahili Corpus (HSC), and the Corpus of Contemporary American English (COCA) for clues on the sources and trends of expressions exhibiting this characteristic between Kiswahili and English.
Furthermore, we reviewed the Bible, dictionaries, encyclopaedias, books, articles, expression lists, wikis, and phrase books in pursuit of etymologies and histories of the concepts underlying the focus expressions. Our analysis shows that most of the Kiswahili versions of the focus expressions are a function of loan translation and loan rendition from English. We found that economic, political and technological changes, mostly induced by the liberalization policy of the 1990s in Tanzania, created lexical gaps in Kiswahili that needed to be filled. We discovered that Kiswahili, among other means, fills such gaps through loan translation and loan rendition of English phrases. Prototypical examples of notions whose English labels Kiswahili has translated word for word include “human rights”, “free and fair election”, “the World Cup” and “multiparty democracy”. We can conclude that Kiswahili finds it easier and more economical to translate the existing English labels for imported notions than to coin original labels for the concepts. Even so, our analysis revealed that a few of the Kiswahili duplicate multiword expressions might be a function of nativism, cognitive metaphoricalization and analogy. We observed, for instance, that the formulation of figurative meanings follows a more or less similar pattern across human languages, with the secondary meanings deriving from source domains. As long as the source domains are common in many humans' environments, we found it plausible for certain multiword expressions to duplicate spontaneously between several human languages. Academically, our study has demonstrated how multiword expressions that duplicate between several languages can be studied using primary data, corpora, documentary review and observation. In particular, the study has designed a framework for studying the sources of the expressions, as well as terminology for describing the phenomenon.
What's more, the study has collected a number of expressions that duplicate between Kiswahili and English, which other researchers can use in similar studies.

15

Acosta, Otavio Costa. "Identificação e tratamento de expressões multipalavras aplicado à recuperação de informação." Biblioteca Digital de Teses e Dissertações da UFRGS, 2011. http://hdl.handle.net/10183/134318.

Abstract:

The widespread use of Multiword Expressions (MWEs) in natural language texts calls for a detailed study of the subject, so that these expressions can subsequently be manipulated and processed robustly. An MWE typically conveys concepts and ideas that usually cannot be expressed by a single word, and it is estimated that the number of MWEs in the lexicon of a native speaker is similar to the number of single words. Most real applications simply ignore them or list possible compound terms, but identify and treat their lexical items individually rather than as a single conceptual unit. For a Natural Language Processing (NLP) application that involves semantic processing to succeed, these expressions require adequate treatment. In this work we investigate the hypothesis that appropriate identification of MWEs improves the results of an application such as Information Retrieval (IR). The objectives of this work are to compare techniques for automatic MWE extraction in order to create MWE dictionaries to be used for indexing in an IR engine. Experimental results show qualitative improvements in the retrieval of relevant documents when MWEs are identified and treated as a single indexing unit.
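
Treating an MWE as a single indexing unit can be sketched as follows. This is an illustrative toy, not the thesis's system: the MWE dictionary, the underscore-joined term format and both function names are assumptions, and matching is a simple greedy longest-match over lowercased whitespace tokens.

```python
from collections import defaultdict


def tokenize_with_mwes(text, mwes, max_len=3):
    """Split text into index terms, joining known MWEs into one term."""
    words = text.lower().split()
    terms, i = [], 0
    while i < len(words):
        # Try the longest candidate first (greedy longest match).
        for n in range(max_len, 1, -1):
            candidate = tuple(words[i:i + n])
            if len(candidate) == n and candidate in mwes:
                terms.append("_".join(candidate))
                i += n
                break
        else:
            # No MWE starts here: index the single word.
            terms.append(words[i])
            i += 1
    return terms


def build_index(docs, mwes):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize_with_mwes(text, mwes):
            index[term].add(doc_id)
    return index


# Hypothetical MWE dictionary and documents.
mwes = {("bus", "stop")}
docs = {1: "the bus stop is near", 2: "the bus did not stop"}
index = build_index(docs, mwes)
print(sorted(index["bus_stop"]), sorted(index["bus"]))
```

With this indexing, a query for the unit `bus_stop` retrieves only the document that contains the expression, while a document that merely contains both words separately is indexed under `bus` and `stop`.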

16

Ochieng, Dunlop [Verfasser], Josef [Akademischer Betreuer] Schmied, and Roy Bertus [Gutachter] Van. "Indirect Influence of English on Kiswahili: The Case of Multiword Duplicates between Kiswahili and English / Dunlop Ochieng ; Gutachter: Bertus Van Roy ; Betreuer: Josef Schmied." Chemnitz : Universitätsbibliothek Chemnitz, 2015. http://d-nb.info/1213813700/34.

17

Schreiner, Paulo. "Alinhamento léxico utilizando técnicas híbridas discriminativas e de pós-processamento." Biblioteca Digital de Teses e Dissertações da UFRGS, 2010. http://hdl.handle.net/10183/27658.

Abstract:

Automatic lexical alignment is an essential task for modern empirical machine translation techniques. The unsupervised generative approach has recently been giving way to a supervised, discriminative one that considerably facilitates the inclusion of linguistic knowledge from several sources. In this context, the present work describes a series of discriminative lexical aligners that incorporate post-processing heuristics with the goal of improving alignment quality for multiword expressions, which are one of the major challenges in natural language processing today. The evaluation is conducted using a gold standard obtained by annotating a parallel corpus of movie subtitles. The proposed aligners achieve an alignment quality superior both to our baseline and to a state-of-the-art generative aligner (Giza++), in the general case as well as for the expressions that are the focus of this work.

18

Bellanger, Cindy. "Mémorisation et reconnaissance de séquences multimots chez l'enfant et l'adulte : effets de la fréquence et de la variabilité interne." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAS047/document.

Abstract:

Models of written and spoken language perception give a central role to the mental lexicon. Among the many hierarchically organized cues that guide the segmentation of the continuous speech stream in adults and children, lexical cues are predominant. Throughout this work, we examine the specificities of multiword-sequence storage in the mental lexicon and the hypothesis that such sequences are memorized as a single unit. The work is divided into two parts, each composed of a set of experiments. The first part assesses the cues involved in the facilitation of noun recognition within the noun phrase. To that end, we compare the effect of the grammatical gender carried by determiners with the effect of the co-occurrence frequency of determiner-noun sequences on noun processing. We then test the effect of cohesiveness on the recognition of three-word sequences. The second part addresses the influence of the internal variability of determiner-noun combinations on the acquisition of noun-phrase structure in children aged 2 to 2.5 years. In a three-month longitudinal study, we contrast two main conceptions of first-language acquisition: Universal Grammar and Usage-Based approaches.

19

Al, Saied Hazem. "Analyse automatique par transitions pour l'identification des expressions polylexicales." Electronic Thesis or Diss., Université de Lorraine, 2019. http://www.theses.fr/2019LORR0206.

Abstract:

This thesis focuses on the identification of multi-word expressions, addressed through a transition-based system. A multi-word expression (MWE) is a linguistic construct composed of several elements whose combination shows irregularity at one or more linguistic levels. Identifying MWEs in context amounts to annotating the occurrences of MWEs in texts, i.e. detecting the sets of tokens that form such occurrences. For example, in the sentence "This has nothing to do with the book", the tokens "has", "to", "do" and "with" would be marked as forming an occurrence of the MWE "have to do with". Transition-based analysis is a well-known NLP technique that builds a structured output from a sequence of elements by applying a sequence of actions (called "transitions") chosen from a predefined set, so as to build the output structure incrementally. In this thesis, we propose a transition system dedicated to MWE identification within sentences represented as token sequences, and we study various architectures for the classifier that selects the transitions to apply in order to build the sentence analysis. The first variant of our system uses a linear support vector machine (SVM) classifier. The following variants use neural models: a simple multilayer perceptron (MLP), followed by variants integrating one or more recurrent layers. The preferred scenario is MWE identification without syntactic information, even though the two tasks are known to be related. We further study a multitask learning approach that jointly performs, and takes mutual advantage of, morphosyntactic tagging, transition-based MWE identification and transition-based dependency parsing. The thesis has a substantial experimental part. First, we studied which resampling techniques allow good learning stability despite random initializations. Second, we proposed a method for tuning the hyperparameters of our models by trend analysis within a random search over hyperparameter combinations.
We produce systems under the constraint of using the same hyperparameter combination for different languages. We use data from the two PARSEME international shared tasks on verbal MWEs. Our variants produce very good results, including state-of-the-art scores for many languages in the PARSEME 1.0 and 1.1 datasets. One of the variants ranked first for most languages in the PARSEME 1.0 shared task. However, our models perform poorly on MWEs that were not seen at training time.
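
The idea of building an MWE annotation incrementally from transitions can be sketched on the abstract's own example sentence. The transition names below (SHIFT, REDUCE, MERGE, MARK) and their exact semantics are a hypothetical, simplified inventory and not necessarily the one defined in the thesis. The transition sequence is written by hand here; in the system described above, a classifier would choose each transition.

```python
def apply_transitions(tokens, transitions):
    """Run a transition sequence and return the MWEs it marks.

    SHIFT  : move the next token from the buffer onto the stack
    REDUCE : discard the top stack element (it belongs to no MWE)
    MERGE  : combine the two top stack elements into one group
    MARK   : pop the top group and record it as an MWE occurrence
    """
    buffer = list(range(len(tokens)))  # indices of tokens not yet read
    stack = []                         # each element is a tuple of token indices
    mwes = []
    for action in transitions:
        if action == "SHIFT":
            stack.append((buffer.pop(0),))
        elif action == "REDUCE":
            stack.pop()
        elif action == "MERGE":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif action == "MARK":
            mwes.append(stack.pop())
        else:
            raise ValueError(f"unknown transition: {action}")
    return [[tokens[i] for i in group] for group in mwes]


tokens = "This has nothing to do with the book".split()
# Hand-written sequence standing in for the classifier's decisions.
transitions = [
    "SHIFT", "REDUCE",   # This
    "SHIFT",             # has
    "SHIFT", "REDUCE",   # nothing (skipped, which allows a discontinuous MWE)
    "SHIFT", "MERGE",    # has + to
    "SHIFT", "MERGE",    # ... + do
    "SHIFT", "MERGE",    # ... + with
    "MARK",
    "SHIFT", "REDUCE",   # the
    "SHIFT", "REDUCE",   # book
]
result = apply_transitions(tokens, transitions)
print(result)
```

Because "nothing" is shifted and then reduced while "has" stays on the stack, the sketch can mark a discontinuous occurrence without any syntactic information, which mirrors the preferred scenario described in the abstract.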

20

Candarli, Duygu. "A longitudinal study of multi-word units in L1 and L2 novice academic writing." Thesis, University of Manchester, 2017. https://www.research.manchester.ac.uk/portal/en/theses/a-longitudinal-study-of-multiword-units-in-l1-and-l2-novice-academic-writing(c57f2773-6965-4a96-9cfa-79e2b11e9408).html.

21

Ramisch, Carlos Eduardo. "Un environnement générique et ouvert pour le traitement des expressions polylexicales : de l'acquisition aux applications." PhD thesis, Université de Grenoble, 2012. http://tel.archives-ouvertes.fr/tel-00741147.

Abstract:

This thesis presents an open and flexible environment for the automatic acquisition of multiword expressions (MWEs) from monolingual text corpora. This research is motivated by the importance of MWEs for NLP applications. After briefly presenting the modules of the environment, the thesis presents intrinsic evaluation results using two applications: computer-assisted lexicography and statistical machine translation. Both applications can benefit from automatic MWE acquisition, and expressions acquired automatically from corpora can both speed them up and improve their quality. The promising results of our experiments encourage us to pursue further research on the optimal way to integrate MWE treatment into these and many other applications.

22

Pasquer, Caroline. "Garder la trace, mettre de l'ordre et relier les points : modéliser la variation et l'ambiguïté des expressions polylexicales." Thesis, Tours, 2019. http://www.theses.fr/2019TOUR4017.

Abstract:

Automatic identification of multiword expressions (MWEs) is a prerequisite for many natural language processing applications. This task is challenging because MWEs, especially verbal ones (VMWEs) such as 'to kick the bucket' (meaning 'to die'), exhibit highly variable surface forms ('no buckets were kicked'). However, compared with regular constructions, this variability is usually more restricted (e.g. some nouns cannot be modified by an adjective), hence distinct variability profiles. We address here a subproblem of VMWE identification, namely the identification of occurrences of VMWEs previously seen in corpora, whatever their surface form, which requires taking ambiguity into account to avoid literal readings ('he kicked the old bucket') or coincidental co-occurrences ('he kicked the ball and the bucket fell down'). To this end, we consider two main approaches: the first is based on a language-independent measure of VMWE variability; the second models the problem as a classification task on the basis of features relevant to the morphosyntactic variability of VMWEs, which led to a system (VarIDE) that participated in the PARSEME shared task on automatic identification of VMWEs in 2018.

23

Kyriakopoulou, Anthoula. "Elaboration de ressources électroniques pour les noms composés de type N (E+DET=G) N=G du grec moderne." PhD thesis, Université Paris-Est, 2011. http://pastel.archives-ouvertes.fr/pastel-00666189.

Abstract:

The aim of this research is the manual construction of lexical resources for Greek compound nouns defined by the morphosyntactic structure Noun (E+Determiner in the genitive) Noun in the genitive, noted N (E+DET:G) N:G (e.g. ζώνη ασφαλείας, 'seat belt'). The resources can be used for the automatic lexical recognition of these compounds in written texts and in other NLP applications. Our work is part of the development of the general lexicon-grammar of Modern Greek for the automatic analysis of written texts. The theoretical and methodological framework of the study is the lexicon-grammar (M. Gross 1975, 1977), which is based on Harrisian transformational grammar. Our work is organized in five parts. In the first part, we delimit the object of our work while trying to define the fundamental notion underlying our study, namely that of fixedness (figement). In the second part, we present the methodology used to collect our lexical data and study the variation phenomena observed within compound nouns of the type N (E+DET:G) N:G. The third part is devoted to presenting the subcategories of N (E+DET:G) N:G identified during the collection stage and to studying their internal lexical structure. The fourth part deals with the syntactic-semantic study of the N (E+DET:G) N:G. Finally, in the fifth part, we present the formal representation methods we propose for our lexical data with a view to their automatic lexical recognition in written texts. Representative samples of the resources are presented in the appendix.

24

Gonçalves, Carlos Jorge de Sousa. "Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora." Doctoral thesis, 2017. http://hdl.handle.net/10362/28488.

Abstract:

The amount of information available through the Internet has grown significantly in the last decade. The information can come from various sources, such as scientific experiments involving particle acceleration, flight data recorded from commercial aircraft, or sets of documents from a given domain such as medical articles, news headlines from a newspaper, or social network content. Due to the volume of data that must be analyzed, it is necessary to endow search engines with new tools that allow the user to obtain the desired information in a timely and accurate manner. One approach is the annotation of documents with their relevant expressions. The extraction of relevant expressions from natural language text documents can be accomplished by the use of semantic, syntactic, or statistical techniques. Although the latter tend to be less accurate, they have the advantage of being independent of the language. This investigation was performed in the context of LocalMaxs, which is a statistical method, thus language-independent, capable of extracting relevant expressions from natural language corpora. However, due to the large volume of data involved, the sequential implementations of the above techniques have severe limitations both in terms of execution time and memory space. In this thesis we propose a distributed architecture and strategies for parallel implementations of statistical-based extraction of relevant expressions from large corpora. A methodology was developed for modeling and evaluating those strategies based on empirical and theoretical approaches to estimate the statistical distribution of n-grams in natural language corpora. These approaches were applied to guide the design and evaluation of the behavior of LocalMaxs parallel and distributed implementations on cluster and cloud computing platforms.
The implementation alternatives were compared regarding their precision and recall, and their performance metrics, namely, execution time, parallel speedup and sizeup. The performance results indicate almost linear speedup and sizeup for the range of large corpora sizes.

25

Moszczyński, Radosław. "Formal approaches to multiword lexemes." Thesis, 2006. https://bc.klf.uw.edu.pl/246/1/3301-MGR-FL-A-25320.pdf.

26

Fazly, Afsaneh. "Automatic acquisition of lexical knowledge about multiword predicates." 2007. http://link.library.utoronto.ca/eir/EIRdetail.cfm?Resources__ID=478903&T=F.

27

Bai, Ming-Hong, and 白明弘. "Extraction of Bilingual Multiword Expressions with Application to Bilingual Concordancer." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/63001866540723370915.

Abstract:

Doctoral degree
National Tsing Hua University
Department of Computer Science
101
A bilingual concordancer is a computer-assisted translation tool that uses a parallel corpus as its knowledge base. Given a word or phrase, the bilingual concordancer retrieves aligned sentence pairs, which contain the word or phrase in the source sentences, from the parallel corpus. Then, it identifies the translation equivalents in the target sentences and reorders the sentence pairs according to the correlation between the query string and the translation equivalents. It helps not only in finding translation equivalents of the query but also in presenting various contexts of occurrence. As a result, it is extremely useful for bilingual lexicographers, human translators and second language learners. Extraction of bilingual multi-word expressions is the most important part of a bilingual concordancer. For example, highlighting translation equivalents in the target sentence and generating a list of translation equivalents both depend heavily on a high-quality extraction model. However, the existing models for extracting translation equivalents still have many problems and room for improvement. In this thesis, we discuss some problems of the existing models for extracting bilingual multi-word expressions, including the over-alignment problem and the under-alignment problem. Then, we propose a novel model that addresses these problems to improve the quality of the extracted translation equivalents. Further, we implement a bilingual concordancer that employs the proposed translation extraction model. To measure the performance of the bilingual concordancer, we use three types of multi-word expressions as our test targets. The results are compared with the existing statistical machine translation models.

28

Wu, Tzu-Wei, and 吳紫葦. "Extraction of Multiword Expressions related to Grammatical Collocation Based on Syntactic and Statistical Information." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/40531784349915519893.

Abstract:

Master's degree
National Tsing Hua University
Institute of Information Systems and Applications
94
This paper concentrates on the study of multiword expressions related to grammatical collocations. We propose a method to automatically extract grammatical collocations from a corpus. Our method involves selecting collocations that match certain structures, based on part-of-speech information and analyses of base phrases, and extracting meaningful grammatical collocations by statistical analysis of associativity. In addition to statistics and linguistic knowledge, we also rely on syntactic patterns of multiword expressions. Take the collocate pattern (“at”, “cost”) for example. Patterns of seed MWEs enable us to obtain multiword expressions like “at cost” or “at all costs”. We use mutual information (MI) to evaluate each collocation candidate and filter out those whose MI falls below a threshold trained on real data. Collocations with MI above this lower bound are further used to assist in the extraction of multiword expressions. The grammatical collocations and related multiword expressions can be used in many Natural Language Processing applications, including computer-assisted language learning, parsing, and machine translation.
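The MI-based filtering step described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: it assumes candidates are adjacent word pairs, scores them with pointwise mutual information in bits, and uses a hand-picked threshold where the thesis trains one on real data. The function names `pmi_scores` and `filter_candidates` and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information (bits).

    Assumption: candidates are simple adjacent bigrams; the thesis selects
    candidates with part-of-speech and base-phrase information instead.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[w1] / n_uni   # marginal probability of the first word
        p_y = unigrams[w2] / n_uni   # marginal probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def filter_candidates(scores, threshold):
    """Keep only the pairs whose score reaches the threshold."""
    return {pair: s for pair, s in scores.items() if s >= threshold}


# Toy corpus (hypothetical); a real run would use a large tagged corpus.
tokens = "we sell at cost and we buy at cost but never at all".split()
scores = pmi_scores(tokens)
kept = filter_candidates(scores, threshold=1.5)  # threshold chosen by hand here
```

On a corpus this small every pair is rare, so the scores are not meaningful; the sketch only shows the shape of the score-then-threshold step.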

29

Reynolds, Barry Lee, and 雷貝利. "Rethinking Frequency in Incidental Vocabulary Acquisition: The Effects of Word Form Variation and Multiword Patterns." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/9br86s.

Abstract:

Doctoral degree
National Central University
Graduate Institute of Learning and Instruction
101
Within language acquisition research there exists a substantial body of literature supporting extensive reading as a means of vocabulary growth for both L1 and L2 learners. Moreover, vocabulary acquisition through extensive reading has been considered as occurring incidentally because learners are focused on the task of reading instead of learning vocabulary. In recent years, a large number of extensive reading studies have been conducted investigating the effects of numerous variables on the incidental acquisition of vocabulary through reading. These experiments, however, leave two crucial language acquisition issues unaddressed. Namely, none of these studies investigated how word-internal form variation and/or word-external multiword pattern variation affects the incidental acquisition of target vocabulary through reading. The purpose of this dissertation, accordingly, is to investigate whether varying degrees of word form variation of target words (i.e., no variation, inflectional variation, and derivational variation) and the appearance of target word tokens in multiword patterns might affect the incidental acquisition of target vocabulary through reading. A group of L1 English-speaking and L2 English-speaking participants were given a copy of an unmodified English novel, The BFG, containing nonce words to read within two weeks. After reading, they were first given two surprise forms of assessment (meaning recall translation and meaning recognition multiple-choice) measuring acquisition of 49 target nonce words followed by an open-response reflective questionnaire. Five weeks later, post hoc interviews were conducted to gain a deeper understanding of both the L1 and L2 English-speaking participants' perceptions of the nonce words used as target words in the dissertation research. The results revealed various significant similarities and differences between L1 and L2 speakers.
As shown through the post hoc interviews, the L1 speakers did not perceive nonce words as worth learning, whereas the L2 speakers were less clear on whether they gave nonce words a different status than real English words. Moreover, data collected from an L1 control group suggests any interpretation of the L1 experimental group data must be done cautiously since the acquisition results cannot be totally attributed to incidental acquisition through reading. Analysis of target word acquisition data in terms of word form variation illuminated differences between L1 and L2 speakers, while analysis of target word acquisition data in terms of multiword patterns highlighted similarities between L1 and L2 speakers. Analysis of L2 speakers' target word acquisition results as shown on both assessments found an interaction effect between word form variation and frequency. For the meaning recall data, L2 speakers showed a statistically significant difference in acquisition between lower and higher frequency target words that exhibited derivational variation in form. However, for the meaning recognition data, L2 speakers showed a statistically significant difference in acquisition between lower and higher frequency target words that exhibited inflectional and derivational variation in form. Analysis of L1 speakers' target word acquisition results as shown on both assessments failed to find an interaction effect for word form variation and frequency. However, a statistically significant effect for word form variation was shown for both assessments. For the meaning recall, L1 speakers acquired more target words that did not vary in form than target words that exhibited inflectional or derivational variation in form. On the meaning recognition, L1 speakers acquired more target words that did not vary in form or exhibited derivational variation in form than target words that exhibited inflectional variation in form. 
Both groups of experimental participants acquired more target words that appeared in multiword patterns than target words that did not, regardless of assessment. Furthermore, an interaction effect between patterns and frequency was shown for both L1 and L2 participants' meaning recognition assessment results. While there was no significant difference in acquisition for lower frequency words that appeared in multiword patterns and lower frequency words that did not appear in multiword patterns, a significant difference in acquisition was shown between higher frequency words that appeared in multiword patterns and higher frequency words that did not appear in multiword patterns. Taking all the results together, the present dissertation research suggests: (1) frequency matters more for L2 speakers when encountering target words whose tokens exhibit inflectional and derivational variation in form, and (2) the appearance of target words in multiword patterns, especially higher frequency target words, matters to L1 and L2 speakers. Implications of the present dissertation research for the incidental vocabulary acquisition research community, corpus-derived analyses, teaching practices, materials development, and L2 vocabulary acquisition through extensive reading are discussed.

30

Ochieng, Dunlop. "Indirect Influence of English on Kiswahili: The Case of Multiword Duplicates between Kiswahili and English." Doctoral thesis, 2014. https://monarch.qucosa.de/id/qucosa%3A20316.

Abstract:

Some proverbs, idioms, nominal compounds, and slogans duplicate in form and meaning between several languages. An example of these between German and English is Liebe auf den ersten Blick and “love at first sight” (Flippo, 2009), whereas an example between Kiswahili and English is uchaguzi ulio huru na haki and “free and fair election.” Duplication of these strings of words between languages that are as different in descent and typology as Kiswahili and English is irregular. On this ground, Kiswahili academies and a number of experts of Kiswahili assumed – prior to the present study – that the Kiswahili versions of the expressions are derived from their English congruent counterparts. The assumption nonetheless lacked empirical evidence and also discounted other potential causes of the phenomenon, i.e. analogical extension, nativism and cognitive metaphoricalization (Makkai, 1972; Land, 1974; Lakoff & Johnson, 1980b; Ruhlen, 1987; Lakoff, 1987; Gleitman and Newport, 1995). Against this background, we assumed an academic obligation to investigate empirically what causes this formal and semantic duplication of strings of words (multiword expressions) between English and Kiswahili to a degree beyond chance expectations. In this endeavour, we administered a checklist to 24 respondents, an interview to 43, an online questionnaire to 102, a translation test to 47 and a translationality test to 8. Online questionnaire respondents were from 21 regions of Tanzania, whereas those for the other tools were from Zanzibar, Dar es Salaam, Pwani, Lindi, Dodoma and Kigoma. Complementarily, we analysed the Chemnitz Corpus of Swahili (CCS), the Helsinki Swahili Corpus (HSC), and the Corpus of Contemporary American English (COCA) for clues on the sources and trends of expressions exhibiting this characteristic between Kiswahili and English.
Furthermore, we reviewed the Bible, dictionaries, encyclopaedias, books, articles, expression lists, wikis, and phrase books in pursuit of etymologies and histories of the concepts underlying the focus expressions. Our analysis shows that most of the Kiswahili versions of the focus expressions are the result of loan translation and loan rendition from English. We found that economic, political and technological changes, mostly induced by the liberalization policy of the 1990s in Tanzania, created lexical gaps in Kiswahili that needed to be filled. We discovered that Kiswahili, among other means, fills such gaps through loan translation and loan rendition of English phrases. Prototypical examples of notions whose English labels Kiswahili has translated word for word are “human rights”, “free and fair election”, “the World Cup” and “multiparty democracy”. We can conclude that Kiswahili finds it easier and more economical to translate the existing English labels for imported notions than to create original labels for the concepts. Even so, our analysis revealed that a few of the Kiswahili duplicate multiword expressions might result from nativism, cognitive metaphoricalization and analogy. We observed, for instance, that the formulation of figurative meanings follows a more or less similar pattern across human languages, with the secondary meanings deriving from source domains. As long as the source domains are common to many humans' environments, we found it plausible for certain multiword expressions to duplicate spontaneously between several human languages. Academically, our study has demonstrated how multiword expressions that duplicate between several languages can be studied using primary data, corpora, documentary review and observation. In particular, the study has designed a framework for studying the sources of the expressions and even terminology for describing the phenomenon.
What's more, the study has collected a number of expressions that duplicate between Kiswahili and English, which other researchers can use in similar studies.

31

Scheepers, Ruth Angela. "Lexical levels and formulaic language : an exploration of undergraduate students' vocabulary and written production of delexical multiword units." Thesis, 2014. http://hdl.handle.net/10500/18245.

Abstract:

This study investigates undergraduate students’ vocabulary size, and their use of formulaic language. Using the Vocabulary Levels Test (Laufer and Nation 1995), it measures the vocabulary size of native and non-native speakers of English and explores relationships between this and course of study, gender, age and home language, and their academic performance. A corpus linguistic approach is then applied to compare student writers’ uses of three high-frequency verbs (have, make and take) relative to expert writers. Multiword units (MWUs) featuring these verbs are identified and analysed, focusing on delexical MWUs as one very specific aspect of depth of vocabulary knowledge. Student and expert use of these MWUs is compared. Grammatically and semantically deviant MWUs are also analysed. Finally, relationships between the size and depth of students’ vocabulary knowledge, and between the latter and academic performance, are explored. Findings reveal that Literature students had larger vocabularies than Law students, females knew more words than males, and older students knew more than younger ones. Importantly, results indicated a relationship between vocabulary size and academic performance. Literature students produced more correct MWUs and fewer errors than Law students. Correlations suggest that the smaller students’ vocabulary, the poorer the depth of their vocabulary is likely to be. Although no robust relationship between vocabulary depth and academic performance emerged, there was evidence of an indirect link between academic performance and correct use of MWUs. In bringing together traditional methods of measuring vocabulary size with an investigation of depth of vocabulary knowledge using corpus analysis methods, this study provides further evidence of the importance of vocabulary knowledge to academic performance. 
It contributes to debates on the value of a sound knowledge of high-frequency vocabulary and a developing knowledge of at least 5000 words to academic performance, and the analysis and quantification of errors in MWUs adds to our understanding of novice writers’ difficulties with these combinations. The study also explores new ways of investigating relationships between size and depth of vocabulary knowledge, and between depth of vocabulary knowledge and academic performance.
Linguistics and Modern Languages
D. Litt. et Phil. (Linguistics)

32

Bejček, Eduard. "Automatické propojování lexikografických zdrojů a korpusových dat." Doctoral thesis, 2015. http://www.nusl.cz/ntk/nusl-351016.

Abstract:

Along with the increasing development of language resources - i.e., new lexicons, lexical databases, corpora, treebanks - the need for their efficient interlinking is growing. With such a linking, one can easily benefit from all their properties and information. Considering the convergence of resources, universal lexicographic formats are frequently discussed. In the present thesis, we investigate and analyse methods of interlinking language resources automatically. We introduce a system for interlinking lexicons (such as VALLEX, PDT-Vallex, FrameNet or SemLex) that offer information on syntactic properties of their entries. The system is automated and can be used repeatedly with newer versions of lexicons under development. We also design a method for identification of multiword expressions in a parsed text based on syntactic information from the SemLex lexicon. An output that verifies feasibility of the used methods is, among others, the mapping between the VALLEX and the PDT-Vallex lexicons, resulting in tens of thousands of annotated treebank sentences from the PDT and the PCEDT treebanks added into VALLEX.

33

Jungwirthová, Klára. "Víceslovná pojmenování v italštině." Master's thesis, 2015. http://www.nusl.cz/ntk/nusl-340200.

Abstract:

The main topic of this thesis is multiword expressions in the Italian language. The thesis is divided into two parts, a theoretical part and an empirical part. The theoretical part deals with multiword expressions, syntagmas and idiomatic expressions. In the empirical part, the connections between the constituents of the multiword expressions are examined. Then four criteria are applied to the multiword expressions (head inflection, insertion of modifiers of the head, pronominalisation of the head, and dislocation and topicalization of the head). These transformations are verified with the aid of corpora and questionnaires. Based on the results of this research, it is decided whether the multiword expressions resemble syntagmas or idiomatic expressions.

34

Rybáková, Jana. "Kvantitativní a kvalitativní rozbor spojek ve vybrané dětské literatuře." Master's thesis, 2018. http://www.nusl.cz/ntk/nusl-382983.

Abstract:

This thesis describes the distribution of conjunctions in the first seven books of the "Diary of a Wimpy Kid" series. It compares and analyses these conjunctions against their usage in school lessons (through one part of the Czech National Corpus, SCHOLA2010) and in texts written by students (through the SKRIPT2012 corpus). It also compares them with the distribution of conjunctions in the textbook for 5th grade students named More vlast je v Evropě. It further examines which multiword conjunctions are used in these texts and how, and shows which conjunctions are used as the first element of a sentence. The thesis can thus indicate which conjunctions a fictitious 5th grade student hears, reads and uses in a school setting. It disregards language use in the family and in media communication. The analysed books may enrich students' vocabulary because they contain a large number of multiword conjunctions. The most frequently used conjunctions in the books and in the students' texts are "a", "že" and "ale". The textbook contains a large number of occurrences of "a" and "i".

35

Hubková, Helena. "Názvy současných profesí ve zdravotnictví." Master's thesis, 2016. http://www.nusl.cz/ntk/nusl-352469.

Abstract:

The presented thesis describes the typology of current names of professions in healthcare in terms of onomasiology and structural-morphology. The starting point of the thesis is the existing professional linguistic description of naming structures implemented as one-word names and nominal collocations. The language material consists of the names of professions in healthcare that are currently used. The names are classified in terms of word-formation, lexical semantics and the origin of words.

36

Lief, Eric. "Použití hlubokých kontextualizovaných slovních reprezentací založených na znacích pro neuronové sekvenční značkování." Master's thesis, 2019. http://www.nusl.cz/ntk/nusl-393167.

Abstract:

A family of Natural Language Processing (NLP) tasks such as part-of-speech (PoS) tagging, Named Entity Recognition (NER), and Multiword Expression (MWE) identification all involve assigning labels to sequences of words in text (sequence labeling). Most modern machine learning approaches to sequence labeling utilize word embeddings, learned representations of text, in which words with similar meanings have similar representations. Quite recently, contextualized word embeddings have garnered much attention because, unlike pretrained context-insensitive embeddings such as word2vec, they are able to capture word meaning in context. In this thesis, I evaluate the performance of different embedding setups (context-sensitive, context-insensitive word, as well as task-specific word, character, lemma, and PoS) on the three above-mentioned sequence labeling tasks using a deep learning model (BiLSTM) and Portuguese datasets.
