Dissertations / Theses on the topic 'Cross-language information retrieval'

Consult the top 50 dissertations / theses for your research on the topic 'Cross-language information retrieval.'

1

Abusalah, Mustafa A. "Cross language information retrieval using ontologies." Thesis, University of Sunderland, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.505050.

Full text
Abstract:
The basic idea behind a Cross-Language Information Retrieval (CLIR) system is to retrieve documents in a language different from that of the query. Translation is therefore needed before query and document terms can be matched, and this translation process tends to reduce retrieval effectiveness compared with monolingual information retrieval systems. The research introduces a new CLIR approach: a CLIR system based on multilingual Arabic/English ontologies, where the ontology is used for query expansion and translation. The Arabic and English ontologies are mapped using automatic ontology mapping tools that are also introduced in this study. The research addresses lexical ambiguity problems caused by erroneous translations; to prevent these, the study proposes a CLIR system based on a multilingual ontology whose mapping resolves the lexical ambiguity problem. The study also uses ontology semantic relations to expand the query, producing a better-formulated query and better results. Finally, a weighting algorithm is applied to the result set of the proposed system, and results are compared to a state-of-the-art baseline CLIR system that uses a dictionary as its translation base. The CLIR system was implemented in the travel domain, two ontologies were developed, and an ontology mapping tool was built to map them. The experimental work described consists of the design, development, and evaluation of the proposed CLIR system. Two human-centred experiments with relevance judgements show that the proposed system is more effective than the baseline system.
2

Wang, Jianqiang. "Matching meaning for cross-language information retrieval." College Park, Md. : University of Maryland, 2005. http://hdl.handle.net/1903/3212.

Full text
Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2005.
Thesis research directed by: Library & Information Services. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
3

Nic, Gearailt Donnla Brighid. "Dictionary characteristics in cross-language information retrieval." Thesis, University of Cambridge, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.619885.

Full text
4

Nyman, Marie, and Maria Patja. "Cross-language information retrieval : sökfrågestruktur & sökfrågeexpansion." Thesis, Högskolan i Borås, Institutionen Biblioteks- och informationsvetenskap / Bibliotekshögskolan, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-18892.

Full text
Abstract:
This Master's thesis examines different retrieval strategies used in cross-language information retrieval (CLIR). The aim was to investigate whether there were any differences in retrieval effectiveness between baseline queries and translated queries, how retrieval effectiveness was affected by query structuring, and whether the results differed between languages. The languages used in this study were Swedish, English and Finnish. 30 topics from the TrecUta collection were translated into Swedish and Finnish. Baseline queries in Swedish and Finnish were created and translated into English using a dictionary, thereby simulating automatic translation. The queries were expanded by adding all the translations from the main entries to the queries. Two kinds of queries – structured and unstructured – were designed. The queries were fed into the InQuery IR system, which presented a list of retrieved documents in which the relevant ones were marked. The performance of the queries was analysed with the Query Performance Analyser (QPA). Average precision at seen relevant documents at DCV 10, average precision at DCV 10, and precision and recall at DCV 200 were used to measure retrieval effectiveness. Despite the morphological differences between Swedish and Finnish, no or only very small differences in retrieval performance were found, except when average precision at DCV 10 was used. The baseline queries produced the best results, and the structured queries performed better than the unstructured queries in both Swedish and Finnish. The results are consistent with previous research.
Thesis level: D
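The document cut-off value (DCV) measures used in the study above are easy to reproduce. The sketch below, with an invented ranking and relevance set, shows how precision and recall at a given DCV would be computed; it illustrates the measures only, not the Query Performance Analyser itself.

```python
def precision_recall_at_dcv(ranked_ids, relevant_ids, dcv):
    """Precision and recall computed over the top `dcv` retrieved documents."""
    top = ranked_ids[:dcv]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    precision = hits / dcv
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical ranking for one topic; 'd3', 'd7' and 'd11' are the relevant documents.
ranking = ["d1", "d3", "d4", "d7", "d9", "d2", "d5", "d6", "d8", "d10"]
relevant = {"d3", "d7", "d11"}

p10, r10 = precision_recall_at_dcv(ranking, relevant, 10)
print(f"P@10 = {p10:.2f}, R@10 = {r10:.2f}")
```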
5

Adriani, Mirna. "A query ambiguity model for cross-language information retrieval." Thesis, University of Glasgow, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.407678.

Full text
6

Loza, Christian. "Cross Language Information Retrieval for Languages with Scarce Resources." Thesis, University of North Texas, 2009. https://digital.library.unt.edu/ark:/67531/metadc12157/.

Full text
Abstract:
Our generation has experienced one of the most dramatic changes in how society communicates. Today, we have online information on almost any imaginable topic. However, most of this information is available in only a few dozen languages. In this thesis, I explore the use of parallel texts to enable cross-language information retrieval (CLIR) for languages with scarce resources. To build the parallel text I use the Bible. I evaluate different variables and their impact on the resulting CLIR system, specifically: (1) the CLIR results when using different amounts of parallel text; (2) the role of paraphrasing on the quality of the CLIR output; (3) the impact on accuracy when translating the query versus translating the collection of documents; and finally (4) how the results are affected by the use of different dialects. The results show that all these variables have a direct impact on the quality of the CLIR system.
7

Loza, Christian E. Mihalcea Rada F. "Cross language information retrieval for languages with scarce resources." [Denton, Tex.] : University of North Texas, 2009. http://digital.library.unt.edu/ark:/67531/metadc12157.

Full text
8

Lu, Chengye. "Peer to peer English/Chinese cross-language information retrieval." Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/26444/1/Chengye_Lu_Thesis.pdf.

Full text
Abstract:
Peer-to-peer systems are widely used on the Internet. However, most peer-to-peer information systems still lack some important features, for example cross-language IR (Information Retrieval) and collection selection / fusion. Cross-language IR is an active research area in the IR community, but it has not yet been used in real-world IR systems. Cross-language IR makes it possible to issue a query in one language and receive documents in other languages. In a typical peer-to-peer environment, users come from multiple countries and their collections are in multiple languages, so cross-language IR can help them find documents more easily: many Chinese researchers, for example, search for research papers in both Chinese and English, and with cross-language IR they can issue one query in Chinese and retrieve documents in both languages. The Out Of Vocabulary (OOV) problem is one of the key research areas in cross-language information retrieval. In recent years, web mining has been shown to be an effective approach to this problem. However, how to extract Multiword Lexical Units (MLUs) from web content and how to select the correct translations from the extracted candidate MLUs remain two difficult problems in web-mining-based automated translation approaches. Discovering resource descriptions and merging results obtained from remote search engines are two key issues in distributed information retrieval studies. In uncooperative environments, query-based sampling and normalized-score-based merging strategies are well-known approaches to these problems. However, such approaches consider only the content of the remote database and not the retrieval performance of the remote search engine. This thesis presents research on building a peer-to-peer IR system with cross-language IR and advanced collection profiling techniques for fusion. In particular, the thesis first presents a new Chinese term measurement and a new Chinese MLU extraction process that work well on small corpora, together with an approach for selecting MLUs more accurately. It then proposes a collection profiling strategy that can discover not only the content of a remote collection but also the retrieval performance of the remote search engine. Based on collection profiling, a web-based query classification method and two collection fusion approaches are developed and presented. Our experiments show that the proposed strategies are effective in merging results in uncooperative peer-to-peer environments. Here, an uncooperative environment is one in which each peer is autonomous: peers are willing to share documents but do not share collection statistics, which is typical of peer-to-peer IR. Finally, all these approaches are combined to build a secure peer-to-peer multilingual IR system that cooperates through X.509 certificates and an email system.
9

Lu, Chengye. "Peer to peer English/Chinese cross-language information retrieval." Queensland University of Technology, 2008. http://eprints.qut.edu.au/26444/.

Full text
10

Gupta, Parth Alokkumar. "Cross-view Embeddings for Information Retrieval." Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/78457.

Full text
Abstract:
In this dissertation, we deal with cross-view tasks related to information retrieval using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script IR, which deals with the challenges faced by an IR system when a language is written in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an abstract space, and the CAE provides state-of-the-art performance. We study a wide variety of models for cross-language information retrieval (CLIR) and propose a model based on compositional neural networks (XCNN) which overcomes the limitations of the existing methods and achieves the best results for many CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval and cross-language plagiarism detection. We empirically test the proposed models for these tasks on publicly available datasets and present the results with analyses. We also explore an effective method to incorporate contextual similarity for lexical selection in machine translation. Concretely, we investigate a feature based on the context available in the source sentence, calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over strong baselines for English-to-Spanish and English-to-Hindi translation tasks. Finally, we explore methods to evaluate the quality of autoencoder-generated representations of text data and analyse their architectural properties. For this, we propose two metrics based on the reconstruction capabilities of the autoencoders: the structure preservation index (SPI) and the similarity accumulation index (SAI). We also introduce the concept of a critical bottleneck dimensionality (CBD) below which structural information is lost, and present analyses linking CBD and language perplexity.
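The mixed-script representation described above rests on character n-grams, which form a small and finite feature space. The sketch below shows only that feature-extraction step, not the cross-view autoencoder itself; the padding symbol and n-gram sizes are assumptions.

```python
from collections import Counter

def char_ngrams(term, n_sizes=(2, 3), pad="#"):
    """Represent a term as a bag of character n-grams (hypothetical settings)."""
    padded = pad + term.lower() + pad
    grams = Counter()
    for n in n_sizes:
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams

# Two transliteration variants of the same word share many n-grams,
# which is what makes a common abstract space learnable.
print(char_ngrams("dhanyavad"))
print(char_ngrams("dhanyawad"))
```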
Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457
11

Zhang, Ying. "Improved Cross-language Information Retrieval via Disambiguation and Vocabulary Discovery." RMIT University. Computer Science and Information Technology, 2007. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20090224.114940.

Full text
Abstract:
Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR. In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from segmentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy. Good quality translation resources, especially bilingual dictionaries, are valuable for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information. We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval, but also have wider applications than CLIR.
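Dictionary-based query translation with disambiguation, as discussed above, can be reduced to a small illustration: each query term has several candidate translations, and the candidate best supported by co-occurrence with the other terms' candidates in a target-language corpus is kept. This is a generic co-occurrence heuristic rather than the Markov-model and web-based methods developed in the thesis, and the toy dictionary and counts are invented.

```python
# Toy bilingual dictionary: each source term maps to candidate translations.
dictionary = {
    "银行": ["bank", "shore"],
    "利率": ["interest rate"],
}

# Invented co-occurrence counts from a target-language corpus.
cooccurrence = {("bank", "interest rate"): 42, ("shore", "interest rate"): 1}

def cooc(a, b):
    return cooccurrence.get((a, b), 0) + cooccurrence.get((b, a), 0)

def translate_query(terms):
    """Pick, for each term, the candidate best supported by the other terms' candidates."""
    translation = []
    for term in terms:
        candidates = dictionary.get(term, [term])  # keep OOV terms untranslated
        others = [c for t in terms if t != term for c in dictionary.get(t, [t])]
        best = max(candidates, key=lambda c: sum(cooc(c, o) for o in others))
        translation.append(best)
    return translation

print(translate_query(["银行", "利率"]))  # -> ['bank', 'interest rate']
```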
12

Tang, Ling-Xiang. "Link discovery for Chinese/English cross-language web information retrieval." Thesis, Queensland University of Technology, 2012. https://eprints.qut.edu.au/58416/1/Ling-Xiang_Tang_Thesis.pdf.

Full text
Abstract:
Nowadays people heavily rely on the Internet for information and knowledge. Wikipedia is an online multilingual encyclopaedia that contains a very large number of detailed articles covering most written languages. It is often considered to be a treasury of human knowledge, and it includes extensive hypertext links between documents of the same language for easy navigation. However, the pages in different languages are rarely cross-linked except for direct equivalent pages on the same subject. This can pose serious difficulties to users seeking information or knowledge from sources in different languages, or where there is no equivalent page in one language or another. In this thesis, a new information retrieval task, cross-lingual link discovery (CLLD), is proposed to tackle the problem of the lack of cross-lingual anchored links in a knowledge base such as Wikipedia. In contrast to traditional information retrieval tasks, cross-lingual link discovery algorithms actively recommend a set of meaningful anchors in a source document and establish links to documents in an alternative language. In other words, cross-lingual link discovery is a way of automatically finding hypertext links between documents in different languages, which is particularly helpful for knowledge discovery across language domains. This study is specifically focused on Chinese / English link discovery (C/ELD), a special case of the cross-lingual link discovery task that involves natural language processing (NLP), cross-lingual information retrieval (CLIR) and cross-lingual link discovery. To assess the effectiveness of CLLD, a standard evaluation framework is also proposed. The evaluation framework includes topics, document collections, a gold standard dataset, evaluation metrics, and toolkits for run pooling, link assessment and system evaluation. With the evaluation framework, the performance of CLLD approaches and systems can be quantified. This thesis contributes to the research on natural language processing and cross-lingual information retrieval in CLLD: 1) a new, simple but effective Chinese segmentation method, n-gram mutual information, is presented for determining the boundaries of Chinese text; 2) a voting mechanism for named entity translation is demonstrated that achieves high precision in English / Chinese machine translation; 3) a link mining approach that mines the existing link structure for anchor probabilities achieves encouraging results in suggesting cross-lingual Chinese / English links in Wikipedia. This approach was examined in the experiments on better, automatic generation of cross-lingual links that were carried out as part of the study. The overall major contribution of this thesis is the provision of a standard evaluation framework for cross-lingual link discovery research. It is important in CLLD evaluation to have this framework, which helps in benchmarking the performance of various CLLD systems and in identifying good CLLD realisation approaches. The evaluation methods and the evaluation framework described in this thesis have been utilised to quantify system performance in the NTCIR-9 Crosslink task, which is the first information retrieval track of its kind.
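The n-gram mutual information segmentation mentioned in contribution 1) can be sketched as follows: estimate how strongly adjacent characters are associated in a corpus and place a boundary wherever the association is weak. The statistic below is plain pointwise mutual information over character bigrams with a toy corpus; the exact measure, smoothing and threshold used in the thesis may differ.

```python
import math
from collections import Counter

def pmi_segment(text, corpus, threshold=0.0):
    """Segment text by placing boundaries where adjacent characters have low PMI."""
    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    total_uni, total_bi = sum(unigrams.values()), sum(bigrams.values())

    def pmi(a, b):
        p_ab = (bigrams.get(a + b, 0) + 0.01) / total_bi  # light smoothing for unseen pairs
        p_a = unigrams.get(a, 1) / total_uni
        p_b = unigrams.get(b, 1) / total_uni
        return math.log(p_ab / (p_a * p_b))

    segments, current = [], text[0]
    for left, right in zip(text, text[1:]):
        if pmi(left, right) > threshold:
            current += right            # strongly associated: same segment
        else:
            segments.append(current)    # weakly associated: boundary here
            current = right
    segments.append(current)
    return segments

corpus = "北京大学是著名大学北京是首都"   # placeholder corpus; a real one would be far larger
print(pmi_segment("著名首都", corpus))    # -> ['著名', '首都']
```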
13

Orengo, Viviane Moreira. "Assessing relevance using automatically translated documents for cross-language information retrieval." Thesis, Middlesex University, 2004. http://eprints.mdx.ac.uk/13606/.

Full text
Abstract:
This thesis focuses on the Relevance Feedback (RF) process, and the scenario considered is that of a Portuguese-English Cross-Language Information Retrieval (CLIR) system. CLIR deals with the retrieval of documents in one natural language in response to a query expressed in another language. RF is an automatic process for query reformulation. The idea behind it is that users are unlikely to produce perfect queries, especially if given just one attempt. The process aims at improving the query specification, which will lead to more relevant documents being retrieved. The method consists of asking the user to analyse an initial sample of documents retrieved in response to a query and judge them for relevance. In that context, two main questions were posed. The first relates to the user's ability to assess the relevance of texts in a foreign language, texts hand-translated into their language, and texts automatically translated into their language. The second concerns the relationship between the accuracy of the participant's judgements and the improvement achieved through the RF process. In order to answer those questions, this work performed an experiment in which Portuguese speakers were asked to judge the relevance of English documents, documents hand-translated to Portuguese, and documents automatically translated to Portuguese. The results show that machine translation is as effective as hand translation in aiding users to assess relevance. In addition, the impact of misjudged documents on the performance of RF is overall just moderate, and varies greatly for different query topics. This work advances the existing research on RF by considering a CLIR scenario and carrying out user experiments, which analyse aspects of RF and CLIR that remained unexplored until now. The contributions of this work also include: the investigation of CLIR using a new language pair; the design and implementation of a stemming algorithm for Portuguese; and the carrying out of several experiments using Latent Semantic Indexing, which contribute data points to CLIR theory.
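The relevance feedback process described above reformulates the query from user judgements on an initially retrieved sample. The classical Rocchio formulation is one standard way to do this and is used here purely as an illustration; the thesis does not necessarily use Rocchio, and the weights below are textbook defaults.

```python
from collections import defaultdict

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query reformulation over term-weight dictionaries (a sketch)."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant_docs)
    # Terms that end up with negative weight are usually dropped.
    return {t: w for t, w in new_query.items() if w > 0}

query = {"amazon": 1.0, "rainforest": 1.0}
judged_relevant = [{"rainforest": 0.8, "deforestation": 0.6}]
judged_nonrelevant = [{"amazon": 0.9, "shopping": 0.7}]
print(rocchio(query, judged_relevant, judged_nonrelevant))
```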
14

Wigder, Chaya. "Word embeddings for monolingual and cross-language domain-specific information retrieval." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233028.

Full text
Abstract:
Various studies have shown the usefulness of word embedding models for a wide variety of natural language processing tasks. This thesis examines how word embeddings can be incorporated into domain-specific search engines for both monolingual and cross-language search. This is done by testing various embedding model hyperparameters, as well as methods for weighting the relative importance of words to a document or query. In addition, methods for generating domain-specific bilingual embeddings are examined and tested. The system was compared to a baseline that used cosine similarity without word embeddings, and for both the monolingual and bilingual search engines the use of monolingual embedding models improved performance above the baseline. However, bilingual embeddings, especially for domain-specific terms, tended to be of too poor quality to be used directly in the search engines.
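Both the baseline and the embedding-based ranking described above come down to a cosine similarity between vector representations of the query and a document. A minimal sketch of the embedding variant, averaging per-word vectors (the term-weighting schemes tested in the thesis are omitted), with made-up toy vectors:

```python
import math

# Toy word vectors; a real system would load pretrained embeddings.
embeddings = {
    "heart":   [0.9, 0.1, 0.0],
    "cardiac": [0.8, 0.2, 0.1],
    "attack":  [0.1, 0.9, 0.2],
    "arrest":  [0.2, 0.8, 0.3],
}

def doc_vector(tokens):
    """Average the embeddings of the tokens that have one (unweighted sketch)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * 3  # dimensionality of the toy vectors above
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = doc_vector(["heart", "attack"])
document = doc_vector(["cardiac", "arrest"])
print(f"cosine similarity = {cosine(query, document):.3f}")
```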
15

Hieber, Felix [Verfasser], and Stefan [Akademischer Betreuer] Riezler. "Translation-based Ranking in Cross-Language Information Retrieval / Felix Hieber ; Betreuer: Stefan Riezler." Heidelberg : Universitätsbibliothek Heidelberg, 2015. http://d-nb.info/1180396189/34.

Full text
16

Cederlund, Petter. "Cross-Language Information Retrieval : En granskning av tre översättningsmetoder använda i experimentell CLIR-forskning." Thesis, Högskolan i Borås, Institutionen Biblioteks- och informationsvetenskap / Bibliotekshögskolan, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-20775.

Full text
Abstract:
The purpose of this paper is to examine the three main translation methods used in experimental Cross-Language Information Retrieval (CLIR) research today, namely translation using machine-readable dictionaries, machine translation systems, or corpus-based methods. Working notes from research groups participating in the Text Retrieval Conference (TREC) and the Cross-Language Evaluation Forum (CLEF) between 1997 and 2000 provided the main source material used to discuss the possible advantages and drawbacks that each method presents. It appears that all three approaches have their pros and cons, and because the different researchers tend to favour their own chosen method, it is not possible to establish a "winner approach" to CLIR translation by studying the working notes alone. One should remember, however, that the present interest in cross-language applications of information retrieval arose as late as the 1990s, and thus the research is still in its early stages. The methods discussed in this paper may well be improved, or perhaps replaced by others in the future.
Thesis level: D
17

Boström, Anna. "Cross-Language Information Retrieval : En studie av lingvistiska problem och utvecklade översättningsmetoder för lösningar angående informationsåtervinning över språkliga gränser." Thesis, Umeå University, Sociology, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-1017.

Full text
Abstract:
This essay deals with information retrieval across languages by examining different types of literature in the research areas of linguistics and multilingual information retrieval. The essay argues that the many different languages that co-exist around the globe must be recognised as an essential obstacle for information science. The language barrier today remains a major impediment to the expansion of international information retrieval otherwise made technically and theoretically possible over the last few years by new technical developments, the Internet, digital libraries, globalisation, and the many political changes in several countries around the world. The first part of the essay explores linguistic differences and difficulties related to general translation from one language to another, using examples mainly from European languages. It is suggested that these problems and differences must also be acknowledged and regarded as highly important when it comes to information retrieval across languages. The essay continues by reporting on Cross-Language Information Retrieval (CLIR), a relatively new research area where methods for multilingual information retrieval are studied and developed. The object of CLIR is that people shall in the future be able to search for information in their native tongue, yet still find relevant information in more than one language. Another goal is the possibility of translating complete documents into a person's language of preference. The essay reports on four different CLIR methods currently established for automatically translating queries, subject headings, or, in some cases, complete documents, and thus aiding people with little or no knowledge of the language in which they are looking for information. The four methods – identified as machine translation, translation using a multilingual thesaurus or a manually produced machine-readable dictionary, corpus-based translation, and no translation – are discussed in relation to the linguistic translation difficulties mentioned in the paper's initial part. The conclusion drawn is that language is exceedingly complex and that, while the different CLIR methods currently developed can often solve one or two of the acknowledged linguistic difficulties, none is able to overcome them all. The essay also shows, however, that CLIR scientists are highly aware of the limitations of the different translation methods and that many are trying to come to terms with this by incorporating several sources of translation in one single CLIR system. The essay finally concludes by looking at CLIR scientists' expectations and hopes for the future.

18

Richardson, W. Ryan. "Using Concept Maps as a Tool for Cross-Language Relevance Determination." Diss., Virginia Tech, 2007. http://hdl.handle.net/10919/28191.

Full text
Abstract:
Concept maps, introduced by Novak, aid learners' understanding. I hypothesize that concept maps also can function as a summary of large documents, e.g., electronic theses and dissertations (ETDs). I have built a system that automatically generates concept maps from English-language ETDs in the computing field. The system also will provide Spanish translations of these concept maps for native Spanish speakers. Using machine translation techniques, my approach leads to concept maps that could allow researchers to discover pertinent dissertations in languages they cannot read, helping them to decide if they want a potentially relevant dissertation translated. I am using a state-of-the-art natural language processing system, called Relex, to extract noun phrases and noun-verb-noun relations from ETDs, and then produce concept maps automatically. I also have incorporated information from the table of contents of ETDs to create novel styles of concept maps. I have conducted five user studies to evaluate user perceptions about these different map styles. I am using several methods to translate node and link text in concept maps from English to Spanish. Nodes labeled with single words from a given technical area can be translated using wordlists, but phrases in specific technical fields can be difficult to translate. Thus I have amassed a collection of about 580 Spanish-language ETDs from Scirus and two Mexican universities and I am using this corpus to mine phrase translations that I could not find otherwise. The usefulness of the automatically-generated and translated concept maps has been assessed in an experiment at Universidad de las Americas (UDLA) in Puebla, Mexico. This experiment demonstrated that concept maps can augment abstracts (translated using a standard machine translation package) in helping Spanish speaking users find ETDs of interest.
Ph. D.
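The concept maps described above are essentially graphs built from noun-verb-noun relations. The sketch below assumes the relation triples have already been extracted (the thesis uses the Relex parser for that step, which is not reproduced here) and simply assembles and prints a small map; the triples are hypothetical.

```python
from collections import defaultdict

def build_concept_map(triples):
    """Group noun-verb-noun triples into an adjacency structure: concept -> [(link, concept)]."""
    concept_map = defaultdict(list)
    for subject, verb, obj in triples:
        concept_map[subject].append((verb, obj))
    return concept_map

# Hypothetical triples, as they might come out of a relation extractor.
triples = [
    ("information retrieval", "uses", "inverted index"),
    ("inverted index", "maps", "terms to documents"),
    ("information retrieval", "evaluated by", "precision and recall"),
]

for concept, links in build_concept_map(triples).items():
    for verb, target in links:
        print(f"{concept} --{verb}--> {target}")
```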
19

Franco, Salvador Marc. "A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning." Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/84285.

Full text
Abstract:
Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human languages. One of its most challenging aspects involves enabling computers to derive meaning from human natural language. To do so, several meaning or context representations have been proposed with competitive performance. However, these representations still have room for improvement when working in a cross-domain or cross-language scenario. In this thesis we study the use of knowledge graphs as a cross-domain and cross-language representation of text and its meaning. A knowledge graph is a graph that expands and relates the original concepts belonging to a set of words. We obtain its characteristics using a wide-coverage multilingual semantic network as a knowledge base. This provides coverage of hundreds of languages and millions of general and specific human concepts. As the starting point of our research we employ knowledge graph-based features - along with other traditional ones and meta-learning - for the NLP task of single- and cross-domain polarity classification. The analysis and conclusions of that work provide evidence that knowledge graphs capture meaning in a domain-independent way. The next part of our research takes advantage of the multilingual semantic network and focuses on cross-language Information Retrieval (IR) tasks. First, we propose a fully knowledge graph-based model of similarity analysis for cross-language plagiarism detection. Next, we improve that model to cover out-of-vocabulary words and verbal tenses and apply it to cross-language document retrieval, categorisation, and plagiarism detection. Finally, we study the use of knowledge graphs for the NLP tasks of community question answering, native language identification, and language variety identification. The contributions of this thesis demonstrate the potential of knowledge graphs as a cross-domain and cross-language representation of text and its meaning for NLP and IR tasks. These contributions have been published in several international conferences and journals.
Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285
20

Bergstedt, Kenneth. "Lost in translation? En empirisk undersökning av användningen av tesaurer vid queryexpansion inom Cross Language Information Retrieval." Thesis, Högskolan i Borås, Institutionen Biblioteks- och informationsvetenskap / Bibliotekshögskolan, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-16903.

Full text
Abstract:
The purpose of this thesis is to examine the performance of queries that are expanded before translation, in comparison with queries that are only translated using a bilingual dictionary, and also to see whether the number of terms used to expand the queries is of any importance, i.e. whether many terms from a thesaurus help or harm a query. To answer these questions I used two online thesauri, Roget's Thesaurus and Merriam-Webster Online, and one printed bilingual dictionary, Norstedts English-Swedish dictionary. Even though the number of examined queries is too small to draw any definite conclusions, the results suggest that expanding with a general thesaurus may have a negative effect on the queries. The reason is that the number of words produced by the expansion and the translation makes the queries more ambiguous and thereby increases the noise in the search, which leads to the loss of relevant documents.
Thesis level: D
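The pipeline examined above, expansion with thesaurus synonyms followed by dictionary translation, can be illustrated with a minimal sketch. The thesaurus and bilingual dictionary below are invented stand-ins for Roget's, Merriam-Webster Online and Norstedts; the growing, increasingly ambiguous output is the noise effect the study measures.

```python
# Invented resources standing in for a general thesaurus and a bilingual dictionary.
thesaurus = {"car": ["automobile", "vehicle"], "accident": ["crash", "mishap"]}
dictionary = {
    "car": ["bil"], "automobile": ["bil"], "vehicle": ["fordon", "åkdon"],
    "accident": ["olycka"], "crash": ["krasch", "kollision"], "mishap": ["missöde"],
}

def expand_then_translate(query_terms):
    """Expand each term with thesaurus entries, then translate every resulting term."""
    expanded = []
    for term in query_terms:
        expanded.append(term)
        expanded.extend(thesaurus.get(term, []))
    translated = []
    for term in expanded:
        translated.extend(dictionary.get(term, [term]))  # keep untranslatable terms as-is
    return translated

# The longer, more ambiguous output illustrates how expansion can add noise.
print(expand_then_translate(["car", "accident"]))
```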
21

Geraldo, André Pinto. "Aplicando algoritmos de mineração de regras de associação para recuperação de informações multilíngues." Biblioteca Digital de Teses e Dissertações da UFRGS, 2009. http://hdl.handle.net/10183/26506.

Full text
Abstract:
This work proposes the use of algorithms for mining association rules as an approach for Cross-Language Information Retrieval. These algorithms have been widely used to analyze market basket data. The idea is to map the problem of finding associations between sales items to the problem of finding term translations over a parallel corpus. The proposal was validated by means of experiments using different languages, queries and corpora. The results show that the performance of our proposed approach is comparable to the performance of the monolingual baseline and to query translation via machine translation, even though these systems employ more complex Natural Language Processing techniques. A prototype for cross-language web querying was implemented to test the proposed method. The system accepts keywords in Portuguese, translates them into English and submits the query to several web-sites that provide search functionalities.
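The mapping described above treats each aligned sentence pair of a parallel corpus as a transaction and looks for source-target term pairs with high support and confidence, just as association rule mining does for sales items. A small self-contained sketch of that idea follows; the corpus and thresholds are invented, and the thesis applies a full association-rule miner to much larger data.

```python
from collections import Counter

# A tiny Portuguese-English "parallel corpus": each pair is one transaction.
parallel = [
    ("banco central", "central bank"),
    ("banco de dados", "data bank"),
    ("taxa de juros do banco", "bank interest rate"),
]

def mine_translations(corpus, min_support=2, min_confidence=0.6):
    """Mine source -> target term rules by support and confidence over sentence pairs."""
    source_counts, pair_counts = Counter(), Counter()
    for src_sentence, tgt_sentence in corpus:
        src_terms, tgt_terms = set(src_sentence.split()), set(tgt_sentence.split())
        for s in src_terms:
            source_counts[s] += 1
            for t in tgt_terms:
                pair_counts[(s, t)] += 1
    rules = []
    for (s, t), support in pair_counts.items():
        confidence = support / source_counts[s]
        if support >= min_support and confidence >= min_confidence:
            rules.append((s, t, support, confidence))
    return rules

for s, t, sup, conf in mine_translations(parallel):
    print(f"{s} -> {t} (support={sup}, confidence={conf:.2f})")
```

A real setup would also filter stopwords such as "de", whose spurious rule passes the thresholds in this toy example.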
22

Asian, Jelita. "Effective Techniques for Indonesian Text Retrieval." RMIT University. Computer Science and Information Technology, 2007. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080110.084651.

Full text
Abstract:
The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on finding information has focused on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to the search of Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes fewer than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of the target documents. We also address the problem of automatic creation of parallel corpora, collections of documents that are direct translations of each other, which are essential for cross-lingual information retrieval tasks. Well-curated parallel corpora are rare, and for many languages, such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure used to discriminate good matches of parallel documents from bad matches, by approximately ten percentage points. We also investigate the applicability of our identification algorithms to other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
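The parallel-document identification described above is based on the Needleman-Wunsch algorithm for global alignment. The sketch below computes the standard dynamic-programming alignment score for two token sequences; the scoring values and the choice of sentence lengths as tokens are placeholders, not the exact configuration used in the thesis.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two sequences (standard dynamic programming)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

# Aligning, say, sequences of sentence lengths from two documents:
doc_id = [12, 7, 30, 5, 18]   # hypothetical Indonesian document
doc_en = [12, 7, 29, 5, 18]   # hypothetical English counterpart
print(needleman_wunsch(doc_id, doc_en))
```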
APA, Harvard, Vancouver, ISO, and other styles
23

Qureshi, Karl. "Att maskinöversätta sökfrågor : En studie av Google Translate och Bing Translators förmåga att översätta svenska sammansättningar i ett CLIR-perspektiv." Thesis, Umeå universitet, Sociologiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-131813.

Full text
Abstract:
The aim of this thesis is to examine how well Google Translate and Bing Translator perform when translating search queries containing Swedish compounds, and to determine whether there is any relationship between the outcome and the complexity of the compounds. The test environment is the European Parliament's public register of documents; the study is, however, limited to the documents of the European Council, which number 1,334 in Swedish and 1,368 in English. The data were analysed partly in terms of precision and recall, and partly through a contrastive analysis, in order to give a more unified picture of the phenomenon under study. The results show that the mean varies between 0.287 and 0.506 for precision and between 0.400 and 0.614 for recall, depending on word type and translation service. The results further show that there appears to be no clear relationship between effectiveness and the complexity of the compounds. Instead, the lower values seem to be due to synonymy, often within the compound itself, and to hyponymy. In the latter case, this is caused partly by the translation services' inability to produce suitable translations, and partly by the tendency of English to form compounds with loose noun attributes.
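For readers less familiar with the two effectiveness measures used in the study, here is a minimal, generic sketch of set-based precision and recall for a single query; the document identifiers are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of 5 retrieved documents are relevant,
# out of 6 relevant documents in total.
p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"],
                        ["d1", "d3", "d5", "d7", "d8", "d9"])
print(p, r)  # 0.6 0.5
```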
APA, Harvard, Vancouver, ISO, and other styles
24

Wilhelm, Thomas. "Entwurf und Implementierung eines Frameworks zur Analyse und Evaluation von Verfahren im Information Retrieval." Master's thesis, [S.l. : s.n.], 2008. https://monarch.qucosa.de/id/qucosa%3A18962.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Feldman, Anna. "Portable language technology a resource-light approach to morpho-syntactic tagging /." Columbus, Ohio : Ohio State University, 2006. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1153344391.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Li, Bo. "Mesurer et améliorer la qualité des corpus comparables." Thesis, Grenoble, 2012. http://www.theses.fr/2012GRENM069.

Full text
Abstract:
Bilingual corpora are an essential resource used to cross the language barrier in multilingual Natural Language Processing (NLP) tasks. Most current work makes use of parallel corpora, which are mainly available for major languages and constrained domains. Comparable corpora, text collections comprised of documents covering overlapping information, are however less expensive to obtain in high volume. Previous work has shown that using comparable corpora is beneficial for several NLP tasks. In parallel with those studies, this thesis seeks to improve the quality of comparable corpora so as to improve the performance of applications exploiting them; the idea is advantageous since it can work with any existing method making use of comparable corpora. We first discuss the notion of comparability, inspired by experience in using bilingual corpora. The notion motivates several implementations of a comparability measure in a probabilistic framework, as well as a methodology to evaluate the ability of comparability measures to capture gold-standard comparability levels. The comparability measures are also examined in terms of robustness to dictionary changes. The experiments show that a symmetric measure relying on vocabulary overlap can correlate very well with gold-standard comparability levels and is robust to dictionary changes. Based on this comparability measure, two methods, namely the greedy approach and the clustering approach, are then developed to improve the quality of any given comparable corpus. The general idea of these two methods is to choose the high-quality subpart of the original corpus and to enrich the low-quality subpart with external resources. The experiments show that both methods can improve the quality, in terms of comparability scores, of a given comparable corpus, with the clustering approach being more efficient than the greedy approach. The enhanced comparable corpus further results in better bilingual lexicons extracted with the standard extraction algorithm. Lastly, we investigate the task of Cross-Language Information Retrieval (CLIR) and the application of comparable corpora to CLIR. We develop novel CLIR models extending the recently proposed information-based models in monolingual IR; the information-based CLIR model is shown to give the best overall performance. Bilingual lexicons extracted from comparable corpora are then combined with an existing bilingual dictionary and used in CLIR experiments, which results in significant improvement of the CLIR system.
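The symmetric, vocabulary-overlap style of comparability measure mentioned above can be sketched roughly as follows. This is a simplification that approximates comparability by dictionary-mediated vocabulary coverage in both directions; it is not the probabilistic formulation developed in the thesis, and the toy dictionary is invented:

```python
def coverage(vocab_a, vocab_b, translations):
    """Fraction of words in vocab_a with at least one translation in vocab_b."""
    if not vocab_a:
        return 0.0
    covered = sum(1 for w in vocab_a if translations.get(w, set()) & vocab_b)
    return covered / len(vocab_a)

def comparability(src_vocab, tgt_vocab, dictionary):
    """Symmetric vocabulary-overlap score; `dictionary` maps source words
    to sets of target words (all data here is illustrative)."""
    reverse = {}
    for s, targets in dictionary.items():
        for t in targets:
            reverse.setdefault(t, set()).add(s)
    return 0.5 * (coverage(src_vocab, tgt_vocab, dictionary)
                  + coverage(tgt_vocab, src_vocab, reverse))

# Toy example with a two-entry French-English dictionary
d = {"maison": {"house", "home"}, "chat": {"cat"}}
print(comparability({"maison", "chat", "voiture"}, {"house", "cat", "dog"}, d))
```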
APA, Harvard, Vancouver, ISO, and other styles
27

Pollettini, Juliana Tarossi. "Auxílio na prevenção de doenças crônicas por meio de mapeamento e relacionamento conceitual de informações em biomedicina." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/95/95131/tde-24042012-223141/.

Full text
Abstract:
Genomic medicine has suggested that exposure to risk factors from conception onwards may influence gene expression and consequently induce the development of chronic diseases in adulthood. Scientific papers reporting these discoveries indicate that epigenetics must be exploited to prevent diseases of high prevalence, such as cardiovascular diseases, diabetes and obesity. The large amount of scientific literature published daily burdens health care professionals who want to stay up to date, since searches for accurate information become complex and expensive in terms of the time spent searching and analysing the results. Computational techniques can support the management of large biomedical information repositories and the discovery of knowledge. This study investigates the automatic retrieval of scientific papers that relate chronic diseases to risk factors detected in a patient's clinical record, and presents a software framework for surveillance systems that alert health professionals about problems in human development. As a contribution, healthcare professionals will be able to establish a routine with the family, setting up the best conditions for growth. According to Butte (2008), the effective transformation of results from biomedical research into knowledge that actually improves public health has been considered an important domain of informatics, called Translational Bioinformatics. Since chronic diseases are a serious health problem worldwide and lead the causes of mortality with 60% of all deaths, this work may enable the results of such research to directly benefit public health and can itself be considered a work of Translational Bioinformatics.
APA, Harvard, Vancouver, ISO, and other styles
28

Luk, Wing-kong. "Concept space approach for cross-lingual information retrieval /." Hong Kong : University of Hong Kong, 2000. http://sunzi.lib.hku.hk/hkuto/record.jsp?B2275345X.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

陸穎剛 and Wing-kong Luk. "Concept space approach for cross-lingual information retrieval." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2000. http://hub.hku.hk/bib/B30147724.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Magableh, Murad. "A generic architecture for semantic enhanced tagging systems." Thesis, De Montfort University, 2011. http://hdl.handle.net/2086/5172.

Full text
Abstract:
The Social Web, or Web 2.0, has recently gained popularity because of its low cost and ease of use. Social tagging sites (e.g. Flickr and YouTube) offer new ways for end-users to publish and classify their content (data). Tagging systems contain free keywords (tags) generated by end-users to annotate and categorise data. Lack of semantics is the main drawback of social tagging because of its unstructured vocabulary; tagging systems therefore suffer from shortcomings such as low precision, lack of collocation, synonymy, multilinguality, and use of shorthands. Consequently, relevant content is not visible, and thus not retrievable, when searching in tag-based systems. The Semantic Web, so-called Web 3.0, on the other hand provides a rich semantic infrastructure, with ontologies as its key enabling technology. Ontologies can be integrated with the Social Web to overcome the lack of semantics in tagging systems. In the work presented in this thesis, we build an architecture to address a number of tagging system drawbacks. In particular, we make use of the controlled vocabularies provided by ontologies to improve information retrieval in tag-based systems. Based on the tags provided by end-users, we introduce the idea of adding "system tags" from semantic, as well as social, resources. The "system tags" are comprehensive and wide-ranging in comparison with the limited "user tags", and are used to fill the gap between the user tags and the search terms used when searching tag-based systems. We restricted the scope of our work to the following tagging system shortcomings: the lack of semantic relations between user tags and search terms (e.g. synonymy, hypernymy); the lack of translation mediums between user tags and search terms (multilinguality); and the lack of context to define emergent shorthand-writing user tags. To address the first shortcoming, we use the WordNet ontology as a semantic lingual resource from which system tags are extracted. For the second shortcoming, we use the MultiWordNet ontology to recognise cross-language linkages between different languages. Finally, to address the third shortcoming, we use tag clusters obtained from the Social Web to create a context for defining the meaning of shorthand-writing tags. A prototype of the architecture was implemented. In the prototype system, we built our own database to host videos imported from a real tag-based system (YouTube). The user tags associated with these videos were also imported and stored in the database. For each user tag, our algorithm adds a number of system tags that come either from semantic ontologies (WordNet or MultiWordNet) or from tag clusters imported from the Flickr website. Each system tag added to annotate the imported videos therefore has a relationship with one of the user tags on that video: synonymy, hypernymy, similar term, related term, translation, or a clustering relation. To evaluate the suitability of the proposed system tags, we developed an online environment where participants submit search terms and retrieve two groups of videos to be evaluated. Each group is produced from one distinct type of tag, user tags or system tags; the videos in the two groups are produced from the same database and are evaluated by the same participants in order to obtain a consistent and reliable evaluation. Since user tags are what real tag-based systems currently use for search, we take their efficiency as the reference against which we compare the efficiency of the new system tags. To compare the relevance of each group of retrieved videos to the search terms, we applied a statistical test. According to the Wilcoxon signed-rank test, there was no significant difference between using system tags and user tags. The findings revealed that using system tags in search is as efficient as using user tags; the two types of tags produce different results, but at the same level of relevance to the submitted search terms.
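As an illustration of the kind of "system tag" generation described above, the sketch below derives synonym and hypernym tags for a user tag from WordNet via NLTK. It is only a rough approximation of one part of the architecture (MultiWordNet and the Flickr tag clusters are not shown), and the tag "car" is a made-up example:

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def system_tags(user_tag, max_tags=10):
    """Collect synonym and hypernym tags for a user tag from WordNet."""
    tags = set()
    for synset in wn.synsets(user_tag):
        # Synonymy: other lemma names in the same synset
        tags.update(l.replace("_", " ") for l in synset.lemma_names())
        # Hypernymy: lemma names of the direct hypernyms
        for hyper in synset.hypernyms():
            tags.update(l.replace("_", " ") for l in hyper.lemma_names())
    tags.discard(user_tag)
    return sorted(tags)[:max_tags]

print(system_tags("car"))  # e.g. ['auto', 'automobile', 'motor vehicle', ...]
```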
APA, Harvard, Vancouver, ISO, and other styles
31

Saad, Motaz. "Fouille de documents et d'opinions multilingue." Thesis, Université de Lorraine, 2015. http://www.theses.fr/2015LORR0003/document.

Full text
Abstract:
The aim of this thesis is to study sentiments in comparable documents. First, we collect English, French and Arabic comparable corpora from Wikipedia and Euronews, and we align each corpus at the document level. We further gather English-Arabic news documents from local and foreign news agencies; the English documents are collected from the BBC website and the Arabic documents from the Al-Jazeera website. Second, we present a cross-lingual document similarity measure to automatically retrieve and align comparable documents. Then, we propose a cross-lingual sentiment annotation method to label source and target documents with sentiments. Finally, we use statistical measures to compare the agreement of sentiments between the source and target documents of each comparable pair. The methods presented in this thesis are language-independent and can be applied to any language pair.
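A very reduced sketch of a dictionary-based cross-lingual document similarity, in the spirit of the measure described above (map target-language tokens into the source language with a bilingual lexicon, then compare bags of words); the thesis's actual measure is not reproduced here, and the mini-lexicon and documents are invented:

```python
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    """Cosine similarity between two bags of words."""
    common = set(c1) & set(c2)
    num = sum(c1[w] * c2[w] for w in common)
    den = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def crosslingual_similarity(src_tokens, tgt_tokens, lexicon):
    """Map target-language tokens into the source language, then compare."""
    mapped = [lexicon[w] for w in tgt_tokens if w in lexicon]
    return cosine(Counter(src_tokens), Counter(mapped))

# Toy French->English lexicon and documents
lexicon = {"accord": "agreement", "paix": "peace", "signé": "signed"}
print(crosslingual_similarity(
    "peace agreement signed today".split(),
    "accord de paix signé hier".split(),
    lexicon))
```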
APA, Harvard, Vancouver, ISO, and other styles
32

Saad, Motaz. "Fouille de documents et d'opinions multilingue." Electronic Thesis or Diss., Université de Lorraine, 2015. http://www.theses.fr/2015LORR0003.

Full text
Abstract:
The aim of this thesis is to study sentiments in comparable documents. First, we collect English, French and Arabic comparable corpora from Wikipedia and Euronews, and we align each corpus at the document level. We further gather English-Arabic news documents from local and foreign news agencies; the English documents are collected from the BBC website and the Arabic documents from the Al-Jazeera website. Second, we present a cross-lingual document similarity measure to automatically retrieve and align comparable documents. Then, we propose a cross-lingual sentiment annotation method to label source and target documents with sentiments. Finally, we use statistical measures to compare the agreement of sentiments between the source and target documents of each comparable pair. The methods presented in this thesis are language-independent and can be applied to any language pair.
APA, Harvard, Vancouver, ISO, and other styles
33

Beltrame, Walber Antonio Ramos. "Um sistema de disseminação seletiva da informação baseado em Cross-Document Structure Theory." Universidade Federal do Espírito Santo, 2011. http://repositorio.ufes.br/handle/10/6414.

Full text
Abstract:
A Selective Dissemination of Information system is a type of information system that aims to channel new intellectual products, from any source, to environments where the probability of interest is high. The inherent challenge is to establish a computational model that maps specific information needs, for a large audience, in a personalized way. This requires structuring the information unit so that it captures the plurality of attributes to be considered by the content-selection process. Recent publications propose systems based on text markup data (metadata models), so that information processing sits between the computation of semi-structured data and inference mechanisms over meta-models. Such approaches use only the data structure associated with the profile of interest. To improve on this, this work proposes a system for selective dissemination of information based on the analysis of multiple discourses, through the automatic generation of conceptual graphs from texts, thereby also bringing unstructured data (the texts themselves) into the solution. The proposal is motivated by Cross-Document Structure Theory, recently introduced in Natural Language Processing with a focus on the automatic generation of summaries. The model aims to establish semantic correlations between discourses, for example whether information is identical, additional, or contradictory across multiple texts. One of the points discussed in this dissertation is that these correlations can be used in the content-selection process, as has already been shown in related work. Additionally, the algorithm of the original model is revised in order to make it easier to apply.
APA, Harvard, Vancouver, ISO, and other styles
34

Kralisch, Anett. "The impact of culture and language on the use of the internet." Doctoral thesis, Humboldt-Universität zu Berlin, Wirtschaftswissenschaftliche Fakultät, 2006. http://dx.doi.org/10.18452/15501.

Full text
Abstract:
This thesis analyses the impact of culture and language on Internet use. Three main areas were investigated: (1) the impact of culture and language on preferences for information presentation and search options, (2) the impact of culture on the need for specific website content, and (3) language as a barrier to information access and as a determinant of website satisfaction. In order to test the 33 hypotheses, data were gathered by means of logfile analyses, online surveys, and laboratory studies. Culture clearly correlated with patterns of navigation behaviour and with the use of search options, whereas results concerning the impact of culture on the need for website content were less conclusive. Results concerning language showed that significantly fewer L1 users than L2 users accessed a website; this can be explained by language-related cognitive effort and by the fact that websites in different languages are less interlinked than websites in the same language. With regard to search option use, a strong mediating effect of domain knowledge was found. Furthermore, the results revealed correlations between user satisfaction and language proficiency, as well as between satisfaction and the perceived amount of native-language information online.
APA, Harvard, Vancouver, ISO, and other styles
35

Kubalík, Jakub. "Mining of Textual Data from the Web for Speech Recognition." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237170.

Full text
Abstract:
The primary goal of this project was to study language modelling for speech recognition and techniques for obtaining text data from the Web. The text introduces the basic techniques of speech recognition and describes in more detail language models based on statistical methods. In particular, the work deals with criteria for evaluating the quality of language models and of speech recognition systems. The text further describes models and techniques of data mining, especially information retrieval. Problems connected with obtaining data from the Web are then presented, and the Google search engine is introduced in contrast to them. Part of the project was the design and implementation of a system for obtaining text from the Web, which is described in detail. The main goal of the work, however, was to verify whether data obtained from the Web can be of any benefit for speech recognition. The described techniques therefore try to find the optimal way to use Web data to improve sample language models, as well as models deployed in real recognition systems.
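The quality criterion most commonly used for such statistical language models is perplexity on held-out text. A minimal, generic sketch of an add-one-smoothed bigram model and its perplexity (illustrative only; the models and toolkits used in the thesis are not reproduced here):

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Bigram and unigram counts from a training token stream."""
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def perplexity(tokens, bigrams, unigrams, vocab_size):
    """Perplexity of an add-one-smoothed bigram model on held-out tokens."""
    log_prob, n = 0.0, 0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

train = "the cat sat on the mat the cat ate".split()
heldout = "the cat sat on the mat".split()
bi, uni = train_bigram(train)
print(perplexity(heldout, bi, uni, vocab_size=len(uni)))
```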
APA, Harvard, Vancouver, ISO, and other styles
36

"Information fusion for monolingual and cross-language spoken document retrieval." 2002. http://library.cuhk.edu.hk/record=b6073504.

Full text
Abstract:
Lo Wai-kit.
"October 2002."
Thesis (Ph.D.)--Chinese University of Hong Kong, 2002.
Includes bibliographical references (p. 170-184).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Mode of access: World Wide Web.
Abstracts in English and Chinese.
APA, Harvard, Vancouver, ISO, and other styles
37

Ting, Yu-Chun, and 丁鈺純. "The Establishment of English-Chinese Cross-language Information Retrieval System." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/07372899416113106412.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Institute of Information Management
Academic year 101 (ROC calendar)
In recent years, the fast flow of information and the convenience of information sharing have led to information overload, and obtaining the information a user actually needs from large amounts of data has become important. Information retrieval systems perform well in monolingual retrieval, but less well in cross-language retrieval. In today's globalized environment, users who want to understand foreign-language terms or documents often need to search across languages to obtain related information in their native language, which helps them overcome reading difficulties. It is therefore necessary to build cross-language information retrieval (CLIR) systems that help users find relevant documents in other languages. Previous work indicates that query translation and query expansion can improve the retrieval accuracy of CLIR. However, the ambiguity of query terms and the growing number of out-of-vocabulary (OOV) terms easily lead to translation errors. Cheng et al. (2004) use Web resources to translate query terms and perform well on OOV terms, but not on general terms. This study therefore uses a bilingual corpus, Google search results, and Wikipedia to extract correct query translations, reducing word ambiguity and obtaining translations of OOV terms. In addition, to improve CLIR performance, Google search results and Wikipedia are used to obtain expansion terms related to the query. The method is evaluated on the NTCIR-8 dataset, and the results show that it can effectively improve retrieval accuracy.
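One of the building blocks mentioned above, using Wikipedia to translate a (possibly out-of-vocabulary) query term, can be sketched via inter-language links and the public MediaWiki API. This is only an illustration of the idea, not the system built in the thesis, and the example term is arbitrary:

```python
import requests

def wikipedia_translation(term, source="en", target="zh"):
    """Look up a term's title in another language edition via inter-language links."""
    resp = requests.get(
        f"https://{source}.wikipedia.org/w/api.php",
        params={"action": "query", "titles": term, "prop": "langlinks",
                "lllang": target, "format": "json", "redirects": 1},
        timeout=10)
    pages = resp.json()["query"]["pages"]
    for page in pages.values():
        for link in page.get("langlinks", []):
            return link["*"]  # the target-language title
    return None

print(wikipedia_translation("Information retrieval"))  # e.g. 信息检索
```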
APA, Harvard, Vancouver, ISO, and other styles
38

Ballesteros, Lisa Ann. "Resolving ambiguity for cross -language information retrieval: A dictionary approach." 2001. https://scholarworks.umass.edu/dissertations/AAI3027176.

Full text
Abstract:
The global exchange of information has been facilitated by the rapid expansion in the size and use of the Internet, which has led to a large increase in the availability of on-line texts. Expanded international collaboration, the increase in the availability of electronic foreign-language texts, the growing number of non-English-speaking users, and the lack of a common language of discourse compel us to develop cross-language information retrieval (CLIR) tools capable of bridging the language barrier. Cross-language retrieval bridges this gap by enabling a person to search in one language and retrieve documents across languages. There are several goals for the research described herein. The first is to gain a clear understanding of the problems associated with the cross-language task and to develop techniques for addressing them. Empirical work shows that ambiguity and lack of lexical resources are the main hurdles. Second, we show that cross-language effectiveness does not depend upon linguistic analysis. We demonstrate how statistical techniques can be used to significantly reduce the effects of ambiguity. We also show that combining these techniques is as effective as or more effective than a reasonable machine translation system. Third, we show that an approach based on multi-lingual dictionaries and statistical analysis can be used as the foundation for a cross-language retrieval architecture that circumvents the problem of limited resources.
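One statistical technique commonly used in this dictionary-based line of work is to prefer, for each query term, the candidate translation that co-occurs most strongly with the other terms' candidates in a target-language corpus. A much-simplified sketch (the co-occurrence counts here are invented; the dissertation's own techniques, such as phrase translation and pre-/post-translation expansion, are not shown):

```python
def disambiguate(candidates, cooccur):
    """Pick one translation per query term, maximising co-occurrence with
    the other terms' candidate translations.

    candidates: list of candidate-translation lists, one list per query term.
    cooccur:    dict mapping frozenset({w1, w2}) to a co-occurrence count.
    """
    chosen = []
    for i, cands in enumerate(candidates):
        others = [w for j, c in enumerate(candidates) if j != i for w in c]
        best = max(cands, key=lambda w: sum(cooccur.get(frozenset((w, o)), 0)
                                            for o in others))
        chosen.append(best)
    return chosen

# Toy example: "banque" -> bank/shore, "compte" -> account/count
cands = [["bank", "shore"], ["account", "count"]]
counts = {frozenset(("bank", "account")): 120, frozenset(("shore", "account")): 2,
          frozenset(("bank", "count")): 5, frozenset(("shore", "count")): 1}
print(disambiguate(cands, counts))  # ['bank', 'account']
```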
APA, Harvard, Vancouver, ISO, and other styles
39

Nel, Johannes Gerhardus. "Zulu-English cross-language information retrieval : an analysis of errors." Diss., 2004. http://hdl.handle.net/2263/27720.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Liang, Je-Wei, and 梁哲瑋. "Resolving Translation Ambiguity By Ontological Chain for Cross Language Information Retrieval." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/73784787270503597835.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Department of Computer and Information Science
Academic year 92 (ROC calendar)
Bilingual dictionaries have been commonly used for query translation in cross-language information retrieval (CLIR). However, translation ambiguity arises during query translation. Recent studies suggest traversing WordNet to select appropriate translations. This paper proposes an ontological chain approach to resolve translation ambiguity. First, we find the most similar ontology nodes for each query. Second, we construct a semantic graph according to the semantic distances between these nodes. Finally, we select the connected component with the highest score as our ontological chain. We show that our approach reaches 81% of the effectiveness of a monolingual information retrieval system, and that when there are many candidate translations our system performs better than the monolingual system.
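A much-simplified reading of this approach can be sketched with WordNet synsets as ontology nodes, path similarity standing in for (inverse) semantic distance, and a component score taken as the sum of its edge weights; the thesis's actual node selection, distance measure, and scoring are not reproduced here, and the threshold and example query are made up:

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from itertools import combinations
from nltk.corpus import wordnet as wn

def best_chain(query_terms, threshold=0.2):
    """Pick the highest-scoring connected component of a synset graph."""
    nodes = [s for t in query_terms for s in wn.synsets(t)[:3]]  # a few nodes per term
    edges = {}
    for a, b in combinations(nodes, 2):
        sim = a.path_similarity(b) or 0.0
        if sim >= threshold:
            edges.setdefault(a, {})[b] = sim
            edges.setdefault(b, {})[a] = sim
    # Connected components by depth-first search; score = sum of edge weights
    seen, best, best_score = set(), [], 0.0
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.append(n)
            stack.extend(edges.get(n, {}))
        score = sum(edges.get(a, {}).get(b, 0.0) for a, b in combinations(comp, 2))
        if score > best_score:
            best, best_score = comp, score
    return best

print(best_chain(["bank", "money", "loan"]))
```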
APA, Harvard, Vancouver, ISO, and other styles
41

Lee, Chia-Jung, and 李佳蓉. "The Impact of Query Term Translation on Cross-language Information Retrieval." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/69548766821333713446.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
Academic year 98 (ROC calendar)
Query translation is an important task in cross-language information retrieval (CLIR), aiming to translate queries into the languages used in the documents. The purpose of this paper is to investigate the necessity of translating query terms, which might differ from one term to another. Some untranslated terms cause an irreparable performance drop, while others do not. We propose an approach to estimate the translation probability of a query term, which helps decide whether it should be translated or not. The approach learns regression and classification models based on a rich set of linguistic and statistical properties of the term. Experiments on the NTCIR-4 and NTCIR-5 English-Chinese CLIR tasks demonstrate that the proposed approach can significantly improve CLIR performance. An in-depth analysis is provided of the impact of untranslated out-of-vocabulary (OOV) query terms and of the translation quality of non-OOV query terms on CLIR performance. We also scrutinize how translation accuracy is related to translation quality, which eventually influences translation necessity.
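The decision of whether a term should be translated can be framed, as described, as supervised learning over term features. A minimal sketch with scikit-learn and made-up features and labels (the paper's actual feature set, based on rich linguistic and statistical properties, is far larger):

```python
# Requires: pip install scikit-learn
from sklearn.linear_model import LogisticRegression

# Hypothetical features per query term:
# [is_named_entity, frequency of the untranslated form in target documents (scaled),
#  dictionary coverage of the term]
X = [[1, 0.9, 0.1],   # named entity, appears as-is in target docs, poorly covered
     [0, 0.1, 0.9],   # common word, well covered by the dictionary
     [1, 0.8, 0.2],
     [0, 0.2, 0.8]]
y = [0, 1, 0, 1]      # 1 = translate the term, 0 = leave it untranslated

model = LogisticRegression().fit(X, y)
print(model.predict([[0, 0.3, 0.7]]))  # -> likely "translate"
```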
APA, Harvard, Vancouver, ISO, and other styles
42

"Using web resources for effective English-to-Chinese cross language information retrieval." Thesis, 2005. http://library.cuhk.edu.hk/record=b6074036.

Full text
Abstract:
A web-aided query translation expansion method for Cross-Language Information Retrieval (CLIR) is presented in this study. The method is applied to the English/Chinese language pair, in which queries are expressed in English and the documents returned are in Chinese. CLIR methods fall into three main categories: machine translation (MT), dictionary translation using a machine-readable dictionary (MRD), and parallel corpora; our method is based on the second. The MRD-based method is easy to implement, but it faces a resource limitation problem: the dictionary is often incomplete, leading to poor translations and hence undesirable results. By combining an MRD with the web-aided query translation expansion technique, good retrieval performance can be achieved. The performance gain is largely due to the successful extraction, from online texts, of translations of words relevant to a query term. A new Chinese word discovery algorithm, which extracts words from continuous strings of Chinese characters, was designed and used for this purpose. The extracted relevant words include not only the precise translation of a query term, but also words that are relevant to that term in the source language.
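The Chinese word discovery step, extracting word candidates from unsegmented character strings, can be approximated very roughly by counting frequent character n-grams; the thesis's actual algorithm is more sophisticated, and the sample text below is made up:

```python
from collections import Counter

def candidate_words(text, min_len=2, max_len=4, min_count=2):
    """Frequent character n-grams in unsegmented Chinese text as word candidates."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [(w, c) for w, c in counts.most_common() if c >= min_count]

text = "資訊檢索系統資訊檢索技術跨語言資訊檢索"
print(candidate_words(text)[:5])  # e.g. [('資訊', 3), ('檢索', 3), ...]
```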
Jin Honglan.
"October 2005."
Adviser: Kam Fai Wong.
Source: Dissertation Abstracts International, Volume: 67-07, Section: B, page: 3899.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2005.
Includes bibliographical references (p. 115-121).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract in English and Chinese.
School code: 1307.
APA, Harvard, Vancouver, ISO, and other styles
43

"A corpus-based approach for cross-lingual information retrieval." 2004. http://library.cuhk.edu.hk/record=b6073674.

Full text
Abstract:
Li Kar Wing.
"July 2004."
Thesis (Ph.D.)--Chinese University of Hong Kong, 2004.
Includes bibliographical references (p. 127-139).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Mode of access: World Wide Web.
Abstracts in English and Chinese.
APA, Harvard, Vancouver, ISO, and other styles
44

Ha, Yoo Jin. "Accessing and using multilanguage information by users searching in different information retrieval systems." 2008. http://hdl.rutgers.edu/1782.2/rucore10001600001.ETD.000051091.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

丁肇君. "A Chinese-English Cross-Language Information Retrieval System for On-line News Articles." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/41429848292650750654.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
Master's Program, Department of Electrical Engineering
Academic year 90 (ROC calendar)
The accelerated growth of the Internet and of on-line news in English allows non-native English speakers to access English-language news more frequently. However, Chinese-speaking Internet users have difficulty retrieving relevant topics from the enormous amount of news, both because formulating a precise query in English is hard and because limited vocabulary makes relevant English news difficult to find. This study proposes a novel Chinese-English cross-language information retrieval system for on-line news articles, so that Chinese-speaking Internet users can formulate queries in Chinese and retrieve relevant news in English. The proposed system first collects on-line news from Chinese and English news web sites daily. The Chinese query is then segmented, and through this segmentation the original Chinese query can be expanded into additional Chinese queries. Finally, these Chinese queries are translated into English queries and the relevant English news is retrieved. Additionally, the relation between the publication dates of Chinese and English news reporting the same event is considered to enhance the precision of the proposed system.
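The date constraint mentioned at the end can be illustrated very simply: restrict candidate English articles to those published within a small window around the Chinese article's date. The window size and the article records below are made up:

```python
from datetime import date

def within_window(zh_date, en_date, days=2):
    """True if the English article was published within +/- `days` of the Chinese one."""
    return abs((en_date - zh_date).days) <= days

candidates = [("Summit opens", date(2002, 5, 3)), ("Old story", date(2002, 4, 20))]
zh_article_date = date(2002, 5, 4)
print([title for title, d in candidates if within_window(zh_article_date, d)])
# ['Summit opens']
```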
APA, Harvard, Vancouver, ISO, and other styles
46

"Multi-lingual text retrieval and mining." 2003. http://library.cuhk.edu.hk/record=b5891637.

Full text
Abstract:
Law Yin Yee.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 130-134).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Cross-Lingual Information Retrieval (CLIR) --- p.2
Chapter 1.2 --- Bilingual Term Association Mining --- p.5
Chapter 1.3 --- Our Contributions --- p.6
Chapter 1.3.1 --- CLIR --- p.6
Chapter 1.3.2 --- Bilingual Term Association Mining --- p.7
Chapter 1.4 --- Thesis Organization --- p.8
Chapter 2 --- Related Work --- p.9
Chapter 2.1 --- CLIR Techniques --- p.9
Chapter 2.1.1 --- Existing Approaches --- p.9
Chapter 2.1.2 --- Difference Between Our Model and Existing Approaches --- p.13
Chapter 2.2 --- Bilingual Term Association Mining Techniques --- p.13
Chapter 2.2.1 --- Existing Approaches --- p.13
Chapter 2.2.2 --- Difference Between Our Model and Existing Approaches --- p.17
Chapter 3 --- Cross-Lingual Information Retrieval (CLIR) --- p.18
Chapter 3.1 --- Cross-Lingual Query Processing and Translation --- p.18
Chapter 3.1.1 --- Query Context and Document Context Generation --- p.20
Chapter 3.1.2 --- Context-Based Query Translation --- p.23
Chapter 3.1.3 --- Query Term Weighting --- p.28
Chapter 3.1.4 --- Final Weight Calculation --- p.30
Chapter 3.2 --- Retrieval on Documents and Automated Summaries --- p.32
Chapter 4 --- Experiments on Cross-Lingual Information Retrieval --- p.38
Chapter 4.1 --- Experimental Setup --- p.38
Chapter 4.2 --- Results of English-to-Chinese Retrieval --- p.45
Chapter 4.2.1 --- Using Mono-Lingual Retrieval as the Gold Standard --- p.45
Chapter 4.2.2 --- Using Human Relevance Judgments as the Gold Stan- dard --- p.49
Chapter 4.3 --- Results of Chinese-to-English Retrieval --- p.53
Chapter 4.3.1 --- Using Mono-lingual Retrieval as the Gold Standard --- p.53
Chapter 4.3.2 --- Using Human Relevance Judgments as the Gold Stan- dard --- p.57
Chapter 5 --- Discovering Comparable Multi-lingual Online News for Text Mining --- p.61
Chapter 5.1 --- Story Representation --- p.62
Chapter 5.2 --- Gloss Translation --- p.64
Chapter 5.3 --- Comparable News Discovery --- p.67
Chapter 6 --- Mining Bilingual Term Association Based on Co-occurrence --- p.75
Chapter 6.1 --- Bilingual Term Cognate Generation --- p.75
Chapter 6.2 --- Term Mining Algorithm --- p.77
Chapter 7 --- Phonetic Matching --- p.87
Chapter 7.1 --- Algorithm Design --- p.87
Chapter 7.2 --- Discovering Associations of English Terms and Chinese Terms --- p.93
Chapter 7.2.1 --- Converting English Terms into Phonetic Representation --- p.93
Chapter 7.2.2 --- Discovering Associations of English Terms and Man- darin Chinese Terms --- p.100
Chapter 7.2.3 --- Discovering Associations of English Terms and Can- tonese Chinese Terms --- p.104
Chapter 8 --- Experiments on Bilingual Term Association Mining --- p.111
Chapter 8.1 --- Experimental Setup --- p.111
Chapter 8.2 --- Result and Discussion of Bilingual Term Association Mining Based on Co-occurrence --- p.114
Chapter 8.3 --- Result and Discussion of Phonetic Matching --- p.121
Chapter 9 --- Conclusions and Future Work --- p.126
Chapter 9.1 --- Conclusions --- p.126
Chapter 9.1.1 --- CLIR --- p.126
Chapter 9.1.2 --- Bilingual Term Association Mining --- p.127
Chapter 9.2 --- Future Work --- p.128
Bibliography --- p.134
Chapter A --- Original English Queries --- p.135
Chapter B --- Manual translated Chinese Queries --- p.137
Chapter C --- Pronunciation symbols used by the PRONLEX Lexicon --- p.139
Chapter D --- Initial Letter-to-Phoneme Tags --- p.141
Chapter E --- English Sounds with their Chinese Equivalents --- p.143
APA, Harvard, Vancouver, ISO, and other styles
47

"Named entity translation matching and learning with mining from multilingual news." 2004. http://library.cuhk.edu.hk/record=b5892099.

Full text
Abstract:
Cheung Pik Shan.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2004.
Includes bibliographical references (leaves 79-82).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Named Entity Translation Matching --- p.2
Chapter 1.2 --- Mining New Translations from News --- p.3
Chapter 1.3 --- Thesis Organization --- p.4
Chapter 2 --- Related Work --- p.5
Chapter 3 --- Named Entity Matching Model --- p.9
Chapter 3.1 --- Problem Nature --- p.9
Chapter 3.2 --- Matching Model Investigation --- p.12
Chapter 3.3 --- Tokenization --- p.15
Chapter 3.4 --- Hybrid Semantic and Phonetic Matching Algorithm --- p.16
Chapter 4 --- Phonetic Matching Model --- p.22
Chapter 4.1 --- Generating Phonetic Representation for English --- p.22
Chapter 4.1.1 --- Phoneme Generation --- p.22
Chapter 4.1.2 --- Training the Tagging Lexicon and Transformation Rules --- p.25
Chapter 4.2 --- Generating Phonetic Representation for Chinese --- p.29
Chapter 4.3 --- Phonetic Matching Algorithm --- p.31
Chapter 5 --- Learning Phonetic Similarity --- p.37
Chapter 5.1 --- The Widrow-Hoff Algorithm --- p.39
Chapter 5.2 --- The Exponentiated-Gradient Algorithm --- p.41
Chapter 5.3 --- The Genetic Algorithm --- p.42
Chapter 6 --- Experiments on Named Entity Matching Model --- p.43
Chapter 6.1 --- Results for Learning Phonetic Similarity --- p.44
Chapter 6.2 --- Results for Named Entity Matching --- p.46
Chapter 7 --- Mining New Entity Translations from News --- p.48
Chapter 7.1 --- Metadata Generation --- p.52
Chapter 7.2 --- Discovering Comparable News Cluster --- p.54
Chapter 7.2.1 --- News Preprocessing --- p.54
Chapter 7.2.2 --- Gloss Translation --- p.55
Chapter 7.2.3 --- Comparable News Cluster Discovery --- p.62
Chapter 7.3 --- Named Entity Cognate Generation --- p.64
Chapter 7.4 --- Entity Matching --- p.66
Chapter 7.4.1 --- Matching Algorithm --- p.66
Chapter 7.4.2 --- Matching Result Production --- p.68
Chapter 8 --- Experiments on Mining New Translations --- p.69
Chapter 9 --- Experiments on Context-based Gloss Translation --- p.72
Chapter 9.1 --- Results on Chinese News Translation --- p.73
Chapter 9.2 --- Results on Arabic News Translation --- p.75
Chapter 10 --- Conclusions and Future Work --- p.77
Bibliography --- p.79
A --- p.83
B --- p.85
C --- p.87
D --- p.89
E --- p.91
F --- p.94
G --- p.95
APA, Harvard, Vancouver, ISO, and other styles
48

Bian, Guo-Wei, and 邊國維. "The Study of Query Translation and Document Translation in a Cross-Language Information Retrieval System." Thesis, 1999. http://ndltd.ncl.edu.tw/handle/42168106915261766587.

Full text
Abstract:
Doctoral dissertation
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
Academic year 87 (ROC calendar)
The Internet and digital libraries make heterogeneous collections in various languages available and provide many useful and powerful information dissemination services. However, about 80% of Web sites are in English, while about 40% of Internet users do not speak English, so the language barrier becomes a major problem when people search, retrieve, and understand materials in different languages. How to incorporate machine translation technologies with text processing has been shown to be very important in the information age. In this dissertation, we first present a general model of a multilingual information access system that integrates text processing systems and language translation systems. A distributed English-Chinese system on the WWW is introduced to illustrate how to integrate query translation, search engines, and a web translation system; it can help users access and retrieve documents on the WWW in their native language(s). This dissertation deals with the translation ambiguity and target polysemy problems together. For translation disambiguation, we describe a new hybrid approach combining dictionary-based and corpus-based approaches to Chinese-English Cross-Language Information Retrieval (CLIR). The bilingual dictionary provides the translation equivalents of each query term, and word co-occurrence information trained from the target document collection or a monolingual corpus is used to disambiguate the translation. Further, we investigate the roles of phrase-level translation and short queries by comparing them with word-level translation and long queries under different selection strategies. Several experiments on query translation for CLIR have been simulated and have shown the applicability of the approach to short queries on the WWW. We also discuss the multiplication effects of translation ambiguity and target polysemy in cross-language information retrieval, and a new translation method is proposed to resolve these problems: two monolingual balanced corpora are employed to learn word co-occurrence for translation ambiguity resolution and augmented translation restrictions for target polysemy resolution. We analyze the two factors of word sense ambiguity in the source language (translation ambiguity) and word sense ambiguity in the target language (target polysemy); the statistics of word sense ambiguities show that target polysemy resolution is critical in Chinese-English information retrieval. The capability of machine translation (MT) is also brought to the World Wide Web: an on-line, real-time English-to-Chinese machine translation system has been developed and evaluated. It can be treated as a Chinese document generating system that dynamically produces Chinese or bilingual English-Chinese versions of English web pages. A quantitative study of 100,000 web pages and the 30 most requested WWW sites reflects the importance of the trade-off between speed and translation quality for document translation. On average, HTML analysis and machine translation take 4.66 seconds, and the correct rates of tagging and lexical selection are 97.36% and 85.37%, respectively. For end-users, the system can be used as a multilingual information access system or a cross-language information retrieval system on the Internet, assisting them in retrieving and understanding web pages during navigation on the WWW. Since July 1997, more than 90,000 users have accessed our system and about 450,000 English web pages have been translated into Chinese or bilingual English-Chinese versions; the average user satisfaction at the document level is 67.47%.
APA, Harvard, Vancouver, ISO, and other styles
49

Wang, Yu-Chun, and 王昱鈞. "Web-based Named Entity Translation Method for Korean-Chinese and Japanese-Chinese Cross-language Information Retrieval." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/32466877086955994100.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Electrical Engineering
Academic year 96 (ROC calendar)
Named entity (NE) translation plays an important role in many applications, such as information retrieval and machine translation. In this paper, we focus on translating NEs from Korean/Japanese to Chinese in order to improve Korean-Chinese and Japanese-Chinese cross-language information retrieval. The ideographic nature of Chinese makes NE translation difficult because one syllable may map to several Chinese characters. We propose a hybrid NE translation system. First, we integrate two online databases to extend the coverage of our bilingual dictionaries. We use Wikipedia as a translation tool based on the inter-language links between the Korean/Japanese edition and the Chinese or English editions. We also use Naver.com’s people search engine to find a query name’s Chinese or English translation. The second component of our system learns Korean-Chinese (K-C), Korean-English (K-E), and English-Chinese (E-C) translation patterns from the web; these patterns can be used to extract K-C, K-E and E-C pairs from Google snippets. We also use Japanese-Chinese (J-C) and Japanese-English (J-E) translation patterns for translating Japanese NEs. We found CLIR performance using this hybrid configuration to be over five times better than that of a dictionary-based configuration using only the bilingual dictionary: mean average precision was as high as 0.3385 and recall reached 0.7578. Our method can handle Chinese, Japanese, Korean, and non-CJK NE translation and improves CLIR performance substantially.
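As a rough illustration of the pattern idea (picking up a Chinese rendering that appears next to a Korean name in web snippets), the sketch below applies a single hand-written parenthesis pattern; the system described above learns such translation patterns automatically, and the snippet text is invented:

```python
import re

# One simple pattern: a run of Hangul immediately followed by a parenthesised
# run of CJK ideographs, e.g. "이명박(李明博)".
PAIR = re.compile(r"([\uac00-\ud7a3]+)\s*[\(（]([\u4e00-\u9fff]+)[\)）]")

def extract_pairs(snippet):
    """Extract (Korean, Chinese) name pairs from a web snippet."""
    return PAIR.findall(snippet)

print(extract_pairs("이명박(李明博) 대통령이 ..."))  # [('이명박', '李明博')]
```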
APA, Harvard, Vancouver, ISO, and other styles
50

Wang, Yu-Chun. "Web-based Named Entity Translation Method for Korean-Chinese and Japanese-Chinese Cross-language Information Retrieval." 2008. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0001-2207200819095000.

Full text
APA, Harvard, Vancouver, ISO, and other styles