Dissertations / Theses: 'Semantics of word-forming base'

1

Sinha, Ravi Som. "Graph-based Centrality Algorithms for Unsupervised Word Sense Disambiguation." Thesis, University of North Texas, 2008. https://digital.library.unt.edu/ark:/67531/metadc9736/.

Abstract:

This thesis introduces an innovative methodology of combining some traditional dictionary-based approaches to word sense disambiguation (semantic similarity measures and overlap of word glosses, both based on WordNet) with some graph-based centrality methods, namely the degree of the vertices, PageRank, closeness, and betweenness. The approach is completely unsupervised, and is based on creating graphs for the words to be disambiguated. We experiment with several possible combinations of the semantic similarity measures as the first stage in our experiments. The next stage attempts to score individual vertices in the graphs previously created based on several graph connectivity measures. During the final stage, several voting schemes are applied to the results obtained from the different centrality algorithms. The most important contributions of this work are not only that it is a novel approach and it works well, but also that it has great potential in overcoming the new-knowledge-acquisition bottleneck which has apparently brought research in supervised WSD as an explicit application to a plateau. The type of research reported in this thesis, which does not require manually annotated data, holds considerable promise, and our work is one of the first steps, albeit a small one, in this direction. The complete system is built and tested on standard benchmarks, and is comparable with work done on graph-based word sense disambiguation as well as lexical chains. The evaluation indicates that the right combination of the above-mentioned metrics can be used to develop an unsupervised disambiguation engine as powerful as the state of the art in WSD.
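As a rough illustration of the pipeline this abstract describes, the sketch below builds a graph whose vertices are candidate senses, weights the edges by gloss overlap, and selects for each word the sense with the highest weighted degree. The toy glosses, the `overlap` measure, and all function names are assumptions made for illustration; the thesis itself works with WordNet-based measures and further centrality algorithms (PageRank, closeness, betweenness) that are not reproduced here.

```python
from itertools import combinations

# Toy sense inventory (hypothetical glosses, not taken from WordNet).
SENSES = {
    "bank": {
        "bank#finance": "institution that accepts money deposits and lends money",
        "bank#river": "sloping land beside a body of water",
    },
    "deposit": {
        "deposit#money": "money placed in an institution for safekeeping",
        "deposit#sediment": "matter laid down by water or ice",
    },
}


def overlap(gloss_a, gloss_b):
    """Number of distinct words shared by two glosses (a crude Lesk-style score)."""
    return len(set(gloss_a.split()) & set(gloss_b.split()))


def disambiguate(senses):
    """Pick, for each word, the sense vertex with the highest weighted degree."""
    degree = {s: 0 for cands in senses.values() for s in cands}
    # Only senses of different words are connected; senses of one word compete.
    for w1, w2 in combinations(senses, 2):
        for s1, g1 in senses[w1].items():
            for s2, g2 in senses[w2].items():
                weight = overlap(g1, g2)
                degree[s1] += weight
                degree[s2] += weight
    return {w: max(cands, key=lambda s: degree[s]) for w, cands in senses.items()}


result = disambiguate(SENSES)
print(result)
```

Weighted degree is the simplest of the centrality measures named in the abstract; swapping in another measure would only change how `degree` is computed.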

2

Grover, Ishaan. "A semantics based computational model for word learning." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/120694.

Abstract:

Thesis: S.M., Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2018.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 73-77).
Studies have shown that children's early literacy skills can impact their ability to achieve academic success, attain higher education and secure employment later in life. However, lack of resources and limited access to educational content causes a "knowledge gap" between children that come from different socio-economic backgrounds. To solve this problem, there has been a recent surge in the development of Intelligent Tutoring Systems (ITS) to provide learning benefits to children. However, before providing new content, an ITS must assess a child's existing knowledge. Several studies have shown that children learn new words by forming semantic relationships with words they already know. Human tutors often implicitly use semantics to assess a tutee's word knowledge from partial and noisy data. In this thesis, I present a cognitively inspired model that uses word semantics (semantics-based model) to make inferences about a child's vocabulary from partial information about their existing vocabulary. Using data from a one-to-one learning intervention between a robotic tutor and 59 children, I show that the proposed semantics-based model outperforms (on average) models that do not use word semantics (semantics-free models). A subject level analysis of results reveals that different models perform well for different children, thus motivating the need to combine predictions. To this end, I present two methods to combine predictions from semantics-based and semantics-free models and show that these methods yield better predictions of a child's vocabulary knowledge. Finally, I present an application of the semantics-based model to evaluate if a learning intervention was successful in teaching children new words while enhancing their semantic understanding. More concretely, I show that a personalized word learning intervention with a robotic tutor is better suited to enhance children's vocabulary when compared to a non-personalized intervention. 
These results motivate the use of semantics-based models to assess children's knowledge and build ITS that maximize children's semantic understanding of words.
"This research was supported by NSF IIP-1717362 and NSF IIS-1523118"--Page 10.
by Ishaan Grover.
S.M.

3

Burton, Marilyn Elizabeth. "Semantics of glory : a cognitive, corpus-based approach to Hebrew word meaning." Thesis, University of Edinburgh, 2014. http://hdl.handle.net/1842/9573.

Abstract:

The concept of ‘glory’ is one of the most significant themes in the Hebrew Bible, lying at the heart of God’s self-disclosure in biblical revelation. Yet, while the concept has received theological treatment, and while various relevant Hebrew roots have individually benefited from linguistic survey, the group of lexemes surrounding this concept is as yet untouched by a comprehensive semantic study. Through in-depth semantic study this thesis offers a clearer understanding of the interrelations and differences between the Classical Hebrew lexemes centring around the concept of ‘glory’. The first chapter opens with a critical examination of both structuralist and cognitivist approaches to semantic research, focussing particularly on their historical use and current applicability to the study of ancient languages. It outlines the superior claims of cognitive semantics accurately to model patterns of language usage, addressing the challenges inherent in the application of such an approach to ancient language. The proposed methodology is characterised as cognitive in nature, focussed on both lexical interrelations (relational) and the internal composition of lexemes (decompositional), exhaustive in relating lexemes to each other point by point, and based on the entirety of the Classical Hebrew corpus. Finally, this chapter discusses issues relating to the limited, diachronic and fragmentary nature of the Classical Hebrew corpus. The second chapter delineates the boundaries of the semantic domain of כבוד. It opens with a methodological discussion introducing parallel terms and word pairs as valuable tools in the objective identification of semantically related terms. Proposing the theory that members of a semantic domain will regularly co-occur, it systematically analyses firstly the extant word associations of כבוד itself and secondly of those lexemes recurring in association with it, accepting or rejecting each as a member of its semantic domain on the basis of word associations. This process results in the identification of eleven lexemes as members of the semantic domain of כבוד.

4

Sinha, Ravi Som Mihalcea Rada F. "Graph-based centrality algorithms for unsupervised word sense disambiguation." [Denton, Tex.] : University of North Texas, 2008. http://digital.library.unt.edu/permalink/meta-dc-9736.

5

Esin, Yunus Emre. "Improvement Of Corpus-based Semantic Word Similarity Using Vector Space Model." Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/12610759/index.pdf.

Abstract:

This study presents a new approach for finding semantically similar words from corpora using window-based context methods. Previous studies mainly concentrate on either finding new combinations of distance-weight measurement methods or proposing new context methods. The main difference of this new approach is that it reprocesses the outputs of the existing methods to update the representation of the related word vectors used for measuring semantic distance between words, in order to improve the results further. Moreover, this novel technique provides a solution to the data sparseness of vectors, which is a common problem in methods that use the vector space model. The main advantage of this new approach is that it is applicable to many of the existing word similarity methods using the vector space model. The other, and most important, advantage of this approach is that it improves the performance of some of these existing word similarity measuring methods.

6

Manion, Steve Lawrence. "Unsupervised Knowledge-based Word Sense Disambiguation: Exploration & Evaluation of Semantic Subgraphs." Thesis, University of Canterbury. Department of Mathematics & Statistics, 2014. http://hdl.handle.net/10092/10016.

Abstract:

Hypothetically, if you were told "Apple uses the apple as its logo", you would immediately detect two different senses of the word apple, these being the company and the fruit respectively. Making this distinction is the formidable challenge of Word Sense Disambiguation (WSD), which is a subtask of many Natural Language Processing (NLP) applications. This thesis is a multi-branched investigation into WSD that explores and evaluates unsupervised knowledge-based methods that exploit semantic subgraphs. The nature of the research covered by this thesis can be broken down into:
1. Mining data from the encyclopedic resource Wikipedia, to visually prove the existence of context embedded in semantic subgraphs
2. Achieving disambiguation in order to merge concepts that originate from heterogeneous semantic graphs
3. Participation in international evaluations of WSD across a range of languages
4. Treating WSD as a classification task that can be optimised through the iterative construction of semantic subgraphs
The contributions of each chapter vary, but can be summarised by what has been produced, learnt, and raised throughout the thesis. Furthermore, an API and several resources have been developed as a by-product of this research, all of which can be accessed by visiting the author’s home page at http://www.stevemanion.com. This should enable researchers to replicate the results achieved in this thesis and build on them if they wish.

7

Lilliehöök, Hampus. "Extraction of word senses from bilingual resources using graph-based semantic mirroring." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-91880.

Abstract:

In this thesis we retrieve semantic information that exists implicitly in bilingual data. We gather input data by repeatedly applying the semantic mirroring procedure. The data is then represented by vectors in a large vector space. A resource of synonym clusters is then constructed by performing K-means centroid-based clustering on the vectors. We evaluate the result manually, using dictionaries, and against WordNet, and discuss prospects and applications of this method.

8

Milajevs, Dmitrijs. "A study of model parameters for scaling up word to sentence similarity tasks in distributional semantics." Thesis, Queen Mary, University of London, 2018. http://qmro.qmul.ac.uk/xmlui/handle/123456789/36225.

Abstract:

Representation of sentences that captures semantics is an essential part of natural language processing systems, such as information retrieval or machine translation. The representation of a sentence is commonly built by combining the representations of the words that the sentence consists of. Similarity between words is widely used as a proxy to evaluate semantic representations. Word similarity models are well-studied and are shown to positively correlate with human similarity judgements. Current evaluation of models of sentential similarity builds on the results obtained in lexical experiments. The main focus is how the lexical representations are used, rather than what they should be. It is often assumed that the optimal representations for word similarity are also optimal for sentence similarity. This work discards this assumption and systematically looks for lexical representations that are optimal for similarity measurement between sentences. We find that the best representation for word similarity is not always the best for sentence similarity and vice versa. The best models in word similarity tasks perform best with additive composition. However, the best result on compositional tasks is achieved with Kronecker-based composition. There are representations that are equally good in both tasks when used with multiplicative composition. The systematic study of the parameters of similarity models reveals that the more information lexical representations contain, the more attention should be paid to noise. In particular, the word vectors in models with the feature size at the magnitude of the vocabulary size should be sparse, but if a small number of context features is used then the vectors should be dense. Given the right lexical representations, compositional operators achieve state-of-the-art performance, improving over models that use neural word embeddings.
To avoid overfitting, either several test datasets should be used or parameter selection should be based on parameters' average behaviours.
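To make the three composition operators named in this abstract concrete, here is a minimal sketch over plain Python lists. The function names and toy vectors are assumptions for illustration. In particular, `kronecker` shows only the basic Kronecker product of two vectors, not the full composition model evaluated in the thesis.

```python
import math


def additive(u, v):
    """Additive composition: element-wise sum of two word vectors."""
    return [a + b for a, b in zip(u, v)]


def multiplicative(u, v):
    """Multiplicative composition: element-wise product of two word vectors."""
    return [a * b for a, b in zip(u, v)]


def kronecker(u, v):
    """Kronecker product of two vectors, flattened to length len(u) * len(v)."""
    return [a * b for a in u for b in v]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Hypothetical two-dimensional word vectors.
red, car = [1.0, 2.0], [3.0, 4.0]
print(additive(red, car))
print(multiplicative(red, car))
print(kronecker(red, car))
```

Sentence similarity would then be measured by applying `cosine` to two composed vectors, which is where the choice of operator and of lexical representation interact.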

9

Islam, Md Aminul. "Applications of corpus-based semantic similarity and word segmentation to database schema matching." Thesis, University of Ottawa (Canada), 2006. http://hdl.handle.net/10393/27256.

Abstract:

In this thesis, we present a method for database schema matching, the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration/warehousing, and in Semantic Web applications. We first present two corpus-based methods: one method is for determining the semantic similarity of two target words and the other is for automatic word segmentation. Then we present a name-based element-level database schema matching method that exploits both the semantic similarity and the word segmentation methods. Our word similarity method uses Pointwise Mutual Information (PMI) to sort lists of important neighbor words of the two target words and distinguish the words which are common in both lists and aggregate their PMI values (from the opposite list) to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented" text. It also uses a modified forward-backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. For the database schema matching method, we also use normalized and modified versions of the Longest Common Subsequence (LCS) string matching algorithm with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
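Two of the building blocks mentioned in this abstract can be sketched briefly: Pointwise Mutual Information computed from co-occurrence counts, and a length-normalized Longest Common Subsequence score for comparing schema element names. The normalization used below (squared LCS length divided by the product of the string lengths) and the example element names are assumptions for illustration; the abstract does not specify the exact normalization or weighting the thesis uses.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI in bits: log2 of p(x, y) / (p(x) * p(y)), estimated from raw counts."""
    return math.log2((count_xy * total) / (count_x * count_y))


def lcs_length(a, b):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """Assumed normalization: LCS length squared over the product of lengths."""
    if not a or not b:
        return 0.0
    n = lcs_length(a, b)
    return n * n / (len(a) * len(b))


# Hypothetical schema element names.
print(normalized_lcs("custname", "customername"))
print(pmi(8, 16, 32, 1024))
```

A name-based matcher along these lines would combine the string score with the corpus-based similarity score using weight factors, as the abstract describes.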

10

Stigeborn, Olivia. "Text ranking based on semantic meaning of sentences." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-300442.

Abstract:

Finding a suitable candidate-to-client match is an important part of a consulting company's work. It takes a lot of time and effort for the recruiters at the company to read possibly hundreds of resumes to find a suitable candidate. Natural language processing is capable of performing a ranking task where the goal is to rank the resumes with the most suitable candidates ranked the highest. This ensures that the recruiters are only required to look at the top-ranked resumes and can quickly get candidates out in the field. Previous research has used methods that count specific keywords in resumes and can decide whether or not a candidate has a given experience. The main goal of this thesis is to use the semantic meaning of the text in the resumes to get a deeper understanding of a candidate’s level of experience. It also evaluates whether the model can run on-device and whether the database can contain a mix of English and Swedish resumes. An algorithm was created that uses the word embedding model DistilRoBERTa, which is capable of capturing the semantic meaning of text. The algorithm was evaluated by generating job descriptions from the resumes by creating a summary of each resume. The run time, the memory usage and the rank achieved by the wanted candidate were documented and used to analyze the results. When the candidate who was used to generate the job description was ranked in the top 10, the classification was considered to be correct. The accuracy was calculated using this method, and an accuracy of 68.3% was achieved. The results show that the algorithm is capable of ranking resumes. The algorithm is able to rank both Swedish and English resumes, with an accuracy of 67.7% for Swedish resumes and 74.7% for English. The run time was fast enough at an average of 578 ms, but the memory usage was too large to make it possible to use the algorithm on-device.
In conclusion, the semantic meaning of resumes can be used to rank resumes, and possible future work would be to combine this method with a method that counts keywords to investigate whether the accuracy would increase.
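The evaluation procedure described in this abstract (rank all resumes against a job description and count a hit when the source candidate lands in the top k) can be sketched as follows. The toy vectors stand in for real embeddings, and all names are hypothetical; the thesis obtains its vectors from DistilRoBERTa and uses k = 10, neither of which is reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def rank(query, resumes):
    """Return candidate ids ordered from most to least similar to the query."""
    return sorted(resumes, key=lambda cid: cosine(query, resumes[cid]), reverse=True)


def top_k_accuracy(queries, resumes, k):
    """Share of queries whose source candidate appears among the top k results."""
    hits = sum(1 for cid, q in queries.items() if cid in rank(q, resumes)[:k])
    return hits / len(queries)


# Hypothetical resume embeddings, keyed by candidate id.
resumes = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
# Hypothetical job-description embeddings, each generated from one candidate.
queries = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [1.0, 0.2]}

print(top_k_accuracy(queries, resumes, k=1))
print(top_k_accuracy(queries, resumes, k=2))
```

Raising k makes the criterion more lenient, which is worth keeping in mind when comparing reported accuracies across studies.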

11

Matikainen, Tiina Johanna. "Semantic Representation of L2 Lexicon in Japanese University Students." Diss., Temple University Libraries, 2011. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/133319.

Abstract:

CITE/Language Arts
Ed.D.
In a series of studies using semantic relatedness judgment response times, Jiang (2000, 2002, 2004a) has claimed that L2 lexical entries fossilize with their equivalent L1 content or something very close to it. In another study using a more productive test of lexical knowledge (Jiang 2004b), however, the evidence for this conclusion was less clear. The present study is a partial replication of Jiang (2004b) with Japanese learners of English. The aims of the study are to investigate the influence of the first language (L1) on second language (L2) lexical knowledge, to investigate whether lexical knowledge displays frequency-related, emergent properties, and to investigate the influence of the L1 on the acquisition of L2 word pairs that have a common L1 equivalent. A sentence completion task was completed by 244 participants, who were shown sentence contexts in which they chose between L2 word pairs sharing a common equivalent in the students' first language, Japanese. The data were analyzed using the statistical analyses available in the programming environment R to quantify the participants' ability to discriminate between synonymous and non-synonymous use of these L2 word pairs. The results showed a strong bias against synonymy for all word pairs; the participants tended to make a distinction between the two synonymous items by assigning each word a distinct meaning. With the non-synonymous items, lemma frequency was closely related to the participants' success in choosing the correct word in the word pair. In addition, lemma frequency and the degree of similarity between the words in the word pair were closely related to the participants' overall knowledge of the non-synonymous meanings of the vocabulary items. The results suggest that the participants had a stronger preference for non-synonymous options than for the synonymous option. This suggests that the learners might have adopted a one-word, one-meaning learning strategy (Willis, 1998).
The reasonably strong relationship between several of the usage-based statistics and the item measures from R suggests that, with exposure, learners are better able to use words in ways that are similar to native speakers of English, to differentiate between appropriate and inappropriate contexts, and to recognize the boundary separating semantic overlap and semantic uniqueness. Lexical similarity appears to play a secondary role, in combination with frequency, in learners' ability to differentiate between appropriate and inappropriate contexts when using L2 word pairs that have a single translation in the L1.
Temple University--Theses

12

Konduri, Aparna. "Clustering of Web Services Based on Semantic Similarity." University of Akron / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=akron1199657471.

13

Wang, Qianqian. "NATURAL LANGUAGE PROCESSING BASED GENERATOR OF TESTING INSTRUMENTS." CSUSB ScholarWorks, 2017. https://scholarworks.lib.csusb.edu/etd/576.

Abstract:

Natural Language Processing (NLP) is the field of study that focuses on the interactions between human language and computers. By “natural language” we mean a language that is used for everyday communication by humans. Unlike programming languages, natural languages are hard to define with precise rules. NLP is developing rapidly and has been widely used in different industries. Technologies based on NLP are becoming increasingly widespread; for example, Siri and Alexa are intelligent personal assistants that use built-in NLP algorithms to communicate with people. “Natural Language Processing Based Generator of Testing Instruments” is a stand-alone program that generates “plausible” multiple-choice selections by performing word sense disambiguation and calculating semantic similarity between two natural language entities. Its core is Word Sense Disambiguation (WSD), the task of identifying which sense of a word is used in a sentence when the word has multiple meanings. WSD is considered an AI-hard problem. The project presents several algorithms to resolve the WSD problem and compute semantic similarity, along with experimental results demonstrating their effectiveness.

14

Pierrejean, Bénédicte. "Qualitative evaluation of word embeddings : investigating the instability in neural-based models." Thesis, Toulouse 2, 2020. http://www.theses.fr/2020TOU20001.

Abstract:

Distributional semantics has been revolutionized by neural-based word embedding methods such as word2vec, which made semantic models more accessible by providing fast, efficient and easy-to-use training methods. These dense representations of lexical units, based on the unsupervised analysis of large corpora, are increasingly used in various types of applications. They are integrated as the input layer in deep learning models, or they are used to draw qualitative conclusions in corpus linguistics. However, despite their popularity, there is still no satisfactory evaluation method for word embeddings that provides a global yet precise view of the differences between models. In this PhD thesis, we propose a methodology to qualitatively evaluate word embeddings and provide a comprehensive study of models trained using word2vec. In the first part of this thesis, we give an overview of the evolution of distributional semantics and review the different methods that are currently used to evaluate word embeddings. We then identify the limits of the existing methods and propose to evaluate word embeddings using a different approach based on the variation of nearest neighbors. We experiment with the proposed method by evaluating models trained with different parameters or on different corpora. Because of the non-deterministic nature of neural-based methods, we acknowledge the limits of this approach and consider the problem of nearest-neighbor instability in word embedding models. Rather than avoiding this problem, we embrace it and use it as a means to better understand word embeddings. We show that the instability problem does not affect all words in the same way and that several linguistic features help explain part of this phenomenon. This is a step towards a better understanding of vector-based semantic models.
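The idea of comparing embedding models through their nearest neighbors can be illustrated with a small sketch: for a given word, take the n nearest neighbors in each of two models and measure how many they share. The overlap score, the toy vectors, and the function names are assumptions for illustration and are not claimed to match the exact variation measure defined in the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def nearest_neighbors(word, model, n):
    """The n words closest to `word` by cosine similarity, excluding itself."""
    others = [w for w in model if w != word]
    others.sort(key=lambda w: cosine(model[word], model[w]), reverse=True)
    return others[:n]


def neighbor_overlap(word, model_a, model_b, n):
    """Share of the n nearest neighbors of `word` that both models agree on."""
    a = set(nearest_neighbors(word, model_a, n))
    b = set(nearest_neighbors(word, model_b, n))
    return len(a & b) / n


# Two hypothetical models over the same vocabulary, e.g. two training runs.
model_a = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.1, 0.9]}
model_b = {"cat": [1.0, 0.0], "dog": [0.1, 0.9], "car": [0.9, 0.1], "bus": [0.0, 1.0]}

print(neighbor_overlap("cat", model_a, model_b, n=1))
print(neighbor_overlap("cat", model_a, model_b, n=2))
```

A low overlap for a word across runs that differ only in random initialization would mark that word as unstable in the sense discussed in the abstract.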

15

Al, Tayyar Musaid Seleh. "Arabic information retrieval system based on morphological analysis (AIRSMA) : a comparative study of word, stem, root and morpho-semantic methods." Thesis, De Montfort University, 2000. http://hdl.handle.net/2086/4126.

16

Pellén, Angelica. "Oh foxy lady, where art thou? : A corpus based analysis of the word foxy, from a gender stereotype perspective." Thesis, Växjö University, School of Humanities, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:vxu:diva-2569.

Abstract:

The aim of this essay is to establish whether or not the word foxy can serve to illustrate gender differences and gender stereotypes in English. The analysis is conducted using one American English corpus and one British English corpus in order to compare the two varieties of English. Apart from the comparative study, foxy is examined and categorized according to gender and a number of features to help answer the research questions, which are:

• What difference in meaning, if any, does the word foxy carry when used for males, females and inanimate things?

• Can the word foxy serve to illustrate gender stereotypes in English?

• Are there any differences regarding how foxy is used in American English compared to British English?

Throughout the essay, previous studies are presented, and the terms and tools used are defined and justified. One of the conclusions drawn in this study is that there is a significant difference in meaning when foxy is used in American English compared to British English. There are also differences in the use of foxy when referring to males, females and inanimate things.

Keywords: Collocation, corpus studies, foxy, gender, language, linguistics, semantic prosody, stereotypes.

17

Dergachyova, Olga. "Knowledge-based support for surgical workflow analysis and recognition." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S059/document.

Abstract:

L'assistance informatique est devenue une partie indispensable pour la réalisation de procédures chirurgicales modernes. Le désir de créer une nouvelle génération de blocs opératoires intelligents a incité les chercheurs à explorer les problèmes de perception et de compréhension automatique de la situation chirurgicale. Dans ce contexte de prise de conscience de la situation, un domaine de recherche en plein essor adresse la reconnaissance automatique du flux chirurgical. De grands progrès ont été réalisés pour la reconnaissance des phases et des gestes chirurgicaux. Pourtant, il existe encore un vide entre ces deux niveaux de granularité dans la hiérarchie du processus chirurgical. Très peu de recherche se concentre sur les activités chirurgicales portant des informations sémantiques vitales pour la compréhension de la situation. Deux facteurs importants entravent la progression. Tout d'abord, la reconnaissance et la prédiction automatique des activités chirurgicales sont des tâches très difficiles en raison de la courte durée d'une activité, de leur grand nombre et d'un flux de travail très complexe et une large variabilité. Deuxièmement, une quantité très limitée de données cliniques ne fournit pas suffisamment d'informations pour un apprentissage réussi et une reconnaissance précise. À notre avis, avant de reconnaître les activités chirurgicales, une analyse soigneuse des éléments qui composent l'activité est nécessaire pour choisir les bons signaux et les capteurs qui faciliteront la reconnaissance. Nous avons utilisé une approche d'apprentissage profond pour évaluer l'impact de différents éléments sémantiques de l'activité sur sa reconnaissance. Grâce à une étude approfondie, nous avons déterminé un ensemble minimum d'éléments suffisants pour une reconnaissance précise. Les informations sur la structure anatomique et l'instrument chirurgical sont de première importance. 
Nous avons également abordé le problème de la carence en matière de données en proposant des méthodes de transfert de connaissances à partir d'autres domaines ou chirurgies. Les méthodes de ''word embedding'' et d'apprentissage par transfert ont été proposées. Ils ont démontré leur efficacité sur la tâche de prédiction d'activité suivante offrant une augmentation de précision de 22%. De plus, des observations pertinentes
Computer assistance has become an indispensable part of modern surgical procedures. The desire to create a new generation of intelligent operating rooms has prompted researchers to explore the problems of automatic perception and understanding of surgical situations. Situation awareness includes automatic recognition of surgical workflow. Great progress has been achieved in the recognition of surgical phases and gestures. Yet there is still a gap between these two granularity levels in the hierarchy of the surgical process. Very little research focuses on surgical activities, which carry semantic information vital for situation understanding. Two important factors impede progress. First, automatic recognition and prediction of surgical activities is a highly challenging task because of the short duration of activities, their great number, and a very complex workflow with a multitude of possible ways of execution and sequencing. Second, the very limited amount of clinical data does not provide enough information for successful learning and accurate recognition. In our opinion, before recognizing surgical activities, a careful analysis of the elements that compose an activity is necessary in order to choose the right signals and sensors to facilitate recognition. We used a deep learning approach to assess the impact of different semantic elements of an activity on its recognition. Through an in-depth study, we determined a minimal set of elements sufficient for accurate recognition. Information about the operated anatomical structure and the surgical instrument was shown to be the most important. We also addressed the problem of data deficiency by proposing methods for transferring knowledge from other domains or surgeries. Word embedding and transfer learning methods were proposed. They demonstrated their effectiveness on the task of next-activity prediction, offering a 22% increase in accuracy. In addition, pertinent observations about surgical practice were made during the study.
In this work, we also addressed the problem of insufficient and improper validation of recognition methods. We proposed new validation metrics and approaches for assessing performance that connect methods to targeted applications and better characterize the capacities of a method. The work described in this thesis aims at clearing the obstacles blocking progress in the domain and proposes a new perspective on the problem of surgical workflow recognition.

18

Utgof, Darja. "The Perception of Lexical Similarities Between L2 English and L3 Swedish." Thesis, Linköping University, Department of Culture and Communication, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-15874.

Abstract:

The present study investigates lexical similarity perceptions by students of Swedish as a foreign language (L3) with a good yet non-native proficiency in English (L2). The general theoretical framework is provided by studies in transfer of learning and its specific instance, transfer in language acquisition.

It is accepted as true that all previous linguistic knowledge is facilitative in developing proficiency in a new language. However, a frequently reported phenomenon is that students see similarities between two systems in a different way than linguists and theoreticians of education do. As a consequence, the full facilitative potential of transfer remains unused.

The present research seeks to shed light on the similarity perceptions with the focus on the comprehension of a written text. In order to elucidate students’ views, a form involving similarity judgements and multiple choice questions for formally similar items has been designed, drawing on real language use as provided by corpora. 123 forms have been distributed in 6 groups of international students, 4 of them studying Swedish at Level I and 2 studying at Level II.

The test items in the form vary in the degree of formal, semantic and functional similarity from very close cognates, to similar words belonging to different word classes, to items exhibiting category membership and/or being in subordinate/superordinate relation to each other, to deceptive cognates. The author proposes expected similarity ratings and compares them to the results obtained. The objective measure of formal similarity is provided by a string matching algorithm, Levenshtein distance.
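For readers unfamiliar with the measure, Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version; the normalized similarity score and the example word pair are assumptions added for illustration and are not taken from the study.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,        # delete char_a
                current[j - 1] + 1,     # insert char_b
                previous[j - 1] + cost  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def formal_similarity(a, b):
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# English "house" versus Swedish "hus": two deletions.
print(levenshtein("house", "hus"))
print(formal_similarity("house", "hus"))
```

Normalizing by the longer word's length is one common choice; it keeps scores comparable across word pairs of different lengths.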

The similarity judgements indicate that intermediate similarity values can be considered problematic. Similarity ratings for somewhat similar items are usually lower than expected. Moreover, a difference in grammatical meaning lowers similarity values significantly even when lexical meaning nearly coincides. Thus, the results indicate that in order to use similarities to facilitate language learning, more attention should be paid to underlying similarities.

19

Chang, Chia-Yang, and 張家揚. "Plagiarism detection based on word semantic clustering." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/3w54sj.

Abstract:

Master's thesis
National Sun Yat-sen University
Department of Electrical Engineering
Academic year 106 (ROC calendar)
Plagiarism has become a common problem in recent years. With the growth of the Internet, it is increasingly easy to obtain other people's writing. Using such content without citation constitutes plagiarism, which infringes intellectual property rights, so plagiarism detection is an important problem. Current plagiarism detection methods are similar to near-duplicate detection methods, such as the vector space model (VSM) or bag-of-words. These methods cannot handle complex plagiarism techniques, e.g. word substitution and sentence rewriting, very well. Therefore, we focus on the semantics of words. In this paper, we propose a new method for plagiarism detection based on analyzing word semantics. Word2vec is a word embedding model proposed by a group at Google that represents each word as a vector. We use Word2vec to obtain word vectors and PCA for dimensionality reduction. We then use spherical K-means to cluster the words into concepts. By using Word2vec, we can take the semantics of words into account and cluster words into concepts in order to deal with complex plagiarism techniques. Finally, we present our experimental results and compare them with other methods. The experimental results show that our method performs well.
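The clustering step can be sketched as follows. This is a minimal illustration of spherical K-means, which clusters unit-length vectors by cosine similarity, and not the thesis's code: the word vectors are hypothetical stand-ins for Word2vec output, the PCA step is omitted, and initializing the centroids with the first k points is an assumption made to keep the example deterministic.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(vectors, k, iterations=20):
    """Cluster vectors by cosine similarity; returns (labels, centroids)."""
    points = [normalize(v) for v in vectors]
    centroids = [points[i] for i in range(k)]  # assumption: naive initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment: on unit vectors, the dot product is the cosine similarity.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        # Update: mean of the members, projected back onto the unit sphere.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels, centroids

# Hypothetical 2-d word vectors standing in for reduced Word2vec embeddings.
words = ["copy", "river", "duplicate", "stream"]
vectors = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
labels, _ = spherical_kmeans(vectors, k=2)
print(dict(zip(words, labels)))
```

Words that land in the same cluster can then be treated as one concept, so that replacing "copy" with "duplicate" no longer changes a document's concept-level representation.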

20

Mohammad, Saif. "Measuring Semantic Distance using Distributional Profiles of Concepts." Thesis, 2008. http://hdl.handle.net/1807/11238.

Abstract:

Semantic distance is a measure of how close or distant in meaning two units of language are. A large number of important natural language problems, including machine translation and word sense disambiguation, can be viewed as semantic distance problems. The two dominant approaches to estimating semantic distance are the WordNet-based semantic measures and the corpus-based distributional measures. In this thesis, I compare them, both qualitatively and quantitatively, and identify the limitations of each. This thesis argues that estimating semantic distance is essentially a property of concepts (rather than words) and that two concepts are semantically close if they occur in similar contexts. Instead of identifying the co-occurrence (distributional) profiles of words (distributional hypothesis), I argue that distributional profiles of concepts (DPCs) can be used to infer the semantic properties of concepts and indeed to estimate semantic distance more accurately. I propose a new hybrid approach to calculating semantic distance that combines corpus statistics and a published thesaurus (Macquarie Thesaurus). The algorithm determines estimates of the DPCs using the categories in the thesaurus as very coarse concepts and, notably, without requiring any sense-annotated data. Even though the use of only about 1000 concepts to represent the vocabulary of a language seems drastic, I show that the method achieves results better than the state-of-the-art in a number of natural language tasks. I show how cross-lingual DPCs can be created by combining text in one language with a thesaurus from another. Using these cross-lingual DPCs, we can solve problems in one, possibly resource-poor, language using a knowledge source from another, possibly resource-rich, language. I show that the approach is also useful in tasks that inherently involve two or more languages, such as machine translation and multilingual text summarization. 
The proposed approach is computationally inexpensive, it can estimate both semantic relatedness and semantic similarity, and it can be applied to all parts of speech. Extensive experiments on ranking word pairs as per semantic distance, real-word spelling correction, solving Reader's Digest word choice problems, determining word sense dominance, word sense disambiguation, and word translation show that the new approach is markedly superior to previous ones.
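A rough sketch of the idea behind distributional profiles of concepts is given below. It is a strong simplification under stated assumptions, not the thesis's algorithm: the thesaurus, the corpus, the sentence-wide co-occurrence window, the raw counts, and the cosine-based distance are all hypothetical choices. An ambiguous word simply contributes to every category it is listed under, which is how the profiles are built without sense-annotated data.

```python
import math
from collections import Counter, defaultdict

def build_profiles(sentences, thesaurus):
    """For each thesaurus category, count the words co-occurring with its members."""
    word_to_cats = defaultdict(set)
    for category, members in thesaurus.items():
        for word in members:
            word_to_cats[word].add(category)
    profiles = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            for category in word_to_cats.get(token, ()):
                for j, context in enumerate(tokens):
                    if j != i:
                        profiles[category][context] += 1
    return profiles, word_to_cats

def cosine(p, q):
    """Cosine similarity between two sparse count profiles."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def concept_distance(w1, w2, profiles, word_to_cats):
    """Distance of the closest concept pair of two words (1.0 if a word is unknown)."""
    cats1, cats2 = word_to_cats.get(w1, ()), word_to_cats.get(w2, ())
    return min((1.0 - cosine(profiles[c1], profiles[c2]) for c1 in cats1 for c2 in cats2),
               default=1.0)

# Hypothetical thesaurus categories and corpus.
thesaurus = {"FINANCE": ["bank", "money"], "RIVER": ["bank", "shore"]}
sentences = ["the bank lends money", "money earns interest",
             "the shore of the river", "the stream reaches the shore"]
profiles, word_to_cats = build_profiles(sentences, thesaurus)
print(concept_distance("money", "shore", profiles, word_to_cats))
```

Taking the minimum over concept pairs reflects the view that two words are close if any pair of their senses is close; other aggregation choices are possible.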

21

Chen, Hsiao-Yi, and 陳曉毅. "A Semantic Search over Encrypted Cloud Data based on Word Embedding." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/7b4m86.

Abstract:

Master's thesis
National Taiwan University of Science and Technology
Department of Computer Science and Information Engineering
Academic year 107 (ROC calendar)
Cloud storage services have become very popular in recent years. Because of their low cost and high capacity, people are inclined to move their data from local computers to remote facilities such as cloud servers. Most existing methods for searching data in the cloud rely on keyword-based search schemes. With rising awareness of information security, data owners want the data they place on a cloud server to be protected from snooping by untrusted users, and users want their query content not to be recorded by an untrusted server. Encrypting data and queries is therefore the most common approach. However, the encrypted ciphertext loses the relationships present in the original plaintext, which makes keyword search difficult. In addition, most existing search methods cannot efficiently infer, from the user's query keywords, the information the user is really interested in. To address these problems, this study proposes a word-embedding-based semantic search scheme for searching documents in the cloud. The word embedding model is implemented by a neural network, which learns the semantic relationships between words in the corpus and represents the words as vectors. Using the word embedding model, a document index vector and a query vector can be generated. The proposed scheme encrypts the query vector and the index vector into ciphertext, which preserves search efficiency while protecting the privacy of the user and the security of the documents.
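The abstract does not say how the vectors are built or encrypted, so the following is only an illustrative sketch under assumptions. Document and query vectors are formed by averaging hypothetical word embeddings, and the "encryption" is a stand-in: index vectors are multiplied by the transpose of a secret invertible matrix and query vectors by its inverse, so that the server can compute the original inner products without seeing the plaintext vectors. All names and values are invented for the example, and a bare linear transformation like this is not a secure scheme on its own.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(matrix, v):
    return [dot(row, v) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def text_vector(text, embeddings):
    """Average the embeddings of the known words in a text."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical secret key: an invertible matrix and its inverse.
SECRET = [[2.0, 1.0], [1.0, 1.0]]
SECRET_INV = [[1.0, -1.0], [-1.0, 2.0]]

def encrypt_index(v):
    return mat_vec(transpose(SECRET), v)   # M^T v, stored on the server

def encrypt_query(q):
    return mat_vec(SECRET_INV, q)          # M^-1 q, sent by the user

def search(encrypted_query, encrypted_index):
    """Rank document ids by inner product, computed on transformed vectors only."""
    return sorted(encrypted_index,
                  key=lambda d: dot(encrypted_index[d], encrypted_query),
                  reverse=True)

# Hypothetical 2-d embeddings and documents.
embeddings = {"cloud": [1.0, 0.0], "storage": [0.9, 0.1], "server": [0.8, 0.2],
              "neural": [0.0, 1.0], "network": [0.2, 0.8], "model": [0.1, 0.9]}
documents = {"d1": "cloud storage server", "d2": "neural network model"}
encrypted_index = {d: encrypt_index(text_vector(t, embeddings)) for d, t in documents.items()}
ranking = search(encrypt_query(text_vector("cloud server", embeddings)), encrypted_index)
print(ranking)
```

The scores are preserved because (Mᵀv)·(M⁻¹q) = vᵀMM⁻¹q = v·q, so ranking on the transformed vectors matches ranking on the plaintext ones.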
