Dissertations / Theses on the topic 'Arabic language – Data processing'


Consult the top 50 dissertations / theses for your research on the topic 'Arabic language – Data processing.'


1. Hamrouni, Nadia. "Structure and Processing in Tunisian Arabic: Speech Error Data." Diss., The University of Arizona, 2010. http://hdl.handle.net/10150/195969.

Abstract:
This dissertation presents experimental research on speech errors in Tunisian Arabic (TA). The central empirical questions revolve around properties of 'exchange errors', which can mis-order lexical, morphological, or sound elements in a variety of patterns. TA's nonconcatenative morphology shows interesting interactions of phrasal and lexical constraints with morphological structure during language production, and affords different and revealing error potentials linking the production system with linguistic knowledge. The dissertation studies expand and test generalizations based on Abd-El-Jawad and Abu-Salim's (1987) study of spontaneous speech errors in Jordanian Arabic by experimentally examining apparent regularities in the data from a real-time language processing perspective. The studies address alternative accounts of error phenomena that have figured prominently in accounts of production processing. Three experiments were designed and conducted based on an error elicitation paradigm used by Ferreira and Humphreys (2001). Experiment 1 tested within-phrase exchange errors, focusing on root versus non-root exchanges and on lexical versus non-lexical outcomes for root and non-root errors. Experiments 2 and 3 addressed between-phrase exchange errors, focusing on violations of the Grammatical Category Constraint (GCC). The study of exchange potentials for the within-phrase items (Experiment 1) contrasted lexical and non-lexical outcomes. The expectation was that these would include a significant number of root exchanges and that the lexical status of the resulting forms would not preclude error. Results show that root and vocalic pattern exchanges were very rare and that word forms rather than root forms were the dominant influence in the experimental performance. The study of exchange errors across phrasal boundaries for items that do or do not correspond in grammatical category (Experiments 2 and 3), on the other hand, pursued two principal questions, one concerning the error rate and the second concerning the error elements. The expectation was that the errors would predominantly come from grammatical category matches; that outcome would reinforce the interpretation that processing operations reflect the assignment of syntactically labeled elements to their locations in phrasal structures. Results corroborated this expectation. However, exchange errors involving words of different grammatical categories were also frequent, which has implications for speech monitoring models and for the automaticity of the GCC.

2. Bakheet, Mohammed. "Improving Speech Recognition for Arabic language Using Low Amounts of Labeled Data." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176437.

Abstract:
The importance of Automatic Speech Recognition (ASR) systems, whose job is to generate text from audio, is increasing as the number of applications of these systems rapidly grows. Training ASR systems, however, is difficult and rather tedious, largely because of the lack of training data: ASRs require huge amounts of annotated training data containing the audio files and the corresponding, accurately written transcript files. Such annotated (labeled) training data is very difficult to find for most languages; producing it usually requires manual annotation, which, apart from its monetary cost, is error-prone. A fully supervised training task is impractical in this scenario. Arabic is one of the languages that lack an abundance of labeled data, which makes the accuracy of its ASR systems very low compared to resource-rich languages such as English, French, or Spanish. In this research, we take advantage of unlabeled voice data by learning general data representations from unlabeled training data (audio files only) in a self-supervised task, or pre-training phase. This phase is carried out with the wav2vec 2.0 framework, which masks the input in the latent space and solves a contrastive task. The model is then fine-tuned on a small amount of labeled data. We also exploit models that have been pre-trained on different languages with wav2vec 2.0, fine-tuning them on Arabic using annotated Arabic data. We show that using the wav2vec 2.0 framework for pre-training on Arabic is considerably time- and resource-consuming: it took the model 21.5 days (about 3 weeks) to complete 662 epochs and reach a validation accuracy of 58%. Arabic is a right-to-left (RTL) language with many diacritics that indicate how letters should be pronounced; these two features make it difficult for Arabic to fit into these models, as they require heavy pre-processing of the transcript files. We demonstrate that we can fine-tune a cross-lingual model, trained on raw speech waveforms in multiple languages, on Arabic data and obtain a word error rate as low as 36.53%. We also show that by fine-tuning the model parameters we can increase the accuracy, decreasing the word error rate from 54.00% to 36.69%.
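
In the same spirit as the fine-tuning phase described above, the sketch below shows one CTC training step for a pretrained wav2vec 2.0 checkpoint on a labeled Arabic utterance, using the HuggingFace Transformers API. The checkpoint name, learning rate, and data handling are illustrative assumptions, not the thesis's actual setup.

```python
# Minimal sketch: one CTC fine-tuning step for a pretrained wav2vec 2.0 model
# on a (16 kHz audio, Arabic transcript) pair. Checkpoint name is a placeholder.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CHECKPOINT = "your-multilingual-wav2vec2-checkpoint"  # assumed, not from the thesis
processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT)
model.freeze_feature_encoder()             # common practice: CNN front-end stays fixed
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(waveform, transcript):
    """One gradient step on a single labeled utterance."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(inputs.input_values, labels=labels).loss   # CTC loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Held-out performance is then reported as word error rate, e.g. with jiwer.wer(reference, hypothesis), to compare fine-tuning regimes.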

3. Al-Nashashibi, May Y. A. "Arabic Language Processing for Text Classification. Contributions to Arabic Root Extraction Techniques, Building An Arabic Corpus, and to Arabic Text Classification Techniques." Thesis, University of Bradford, 2012. http://hdl.handle.net/10454/6326.

Abstract:
The impact and dynamics of Internet-based resources for Arabic-speaking users are growing in significance, depth and breadth at a faster pace than ever, and thus require updated mechanisms for the computational processing of Arabic texts. Arabic is a complex language and as such requires in-depth investigation for the analysis and improvement of available automatic processing techniques, such as root extraction methods or text classification techniques, and for developing text collections that are already labeled, whether with single or multiple labels. This thesis proposes new ideas and methods to improve available automatic processing techniques for Arabic texts. Any automatic processing technique requires data in order to be used, critically reviewed and assessed, so an attempt to develop a labeled Arabic corpus is also made. The thesis is composed of three parts: 1- Arabic corpus development; 2- proposing, improving and implementing root extraction techniques; and 3- proposing and investigating the effect of different pre-processing methods on single-label text classification methods for Arabic. The thesis first develops an Arabic corpus prepared for testing root extraction methods as well as single-label text classification techniques. It then enhances a rule-based root extraction method by handling irregular cases (which appear in about 34% of texts). It proposes and implements two expanded algorithms as well as an adjustment to a weight-based method, adds the irregular-case handling algorithm to all of them, and compares the performance of these proposed methods with the originals. The thesis also develops a root extraction system that handles foreign Arabized words by constructing a list of about 7,000 foreign words. The technique with the best accuracy in extracting the correct stem and root for words in texts, an enhanced rule-based method, is used in the third part of the thesis. Finally, the thesis proposes and implements a variant term frequency-inverse document frequency (TF-IDF) weighting method, and investigates the effect of different choices of features in document representation (words, stems or roots, as well as these choices extended with their respective phrases) on single-label text classification performance. Forty-seven classifiers are applied to all proposed representations and their performances compared. One challenge for researchers in Arabic text processing is that root extraction techniques reported in the literature are either not accessible or take a long time to reproduce, while no labeled benchmark Arabic text corpus is fully available online. Moreover, few machine learning techniques have so far been investigated for Arabic with the usual preprocessing steps before classification. These challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques. Results show that the proposed algorithm for handling irregular words in Arabic improved the performance of all implemented root extraction techniques. The performance of this algorithm is evaluated in terms of accuracy improvement and execution time; its efficiency is investigated with different document lengths and is empirically found to be linear in time for document lengths below about 8,000.
The rule-based technique improves the most among the implemented root extraction methods when the irregular-case handling algorithm is included. The thesis validates that choosing roots or stems instead of words in document representations significantly improves single-label classification performance for most of the classifiers used. However, extending such representations with their respective phrases yields no significant improvement in single-label text classification performance. Many classifiers, such as the ripple-down rule classifier, had not previously been tested on Arabic. Comparing the classifiers' performances shows that the Bayesian network classifier is significantly the best in terms of accuracy, training time, and root mean square error values across all proposed and implemented representations.
Petra University, Amman (Jordan)
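
As a flavor of the third part, the sketch below compares word-based and root-based document representations under TF-IDF weighting with scikit-learn. The extract_root hook is a hypothetical stand-in for the thesis's enhanced rule-based root extractor, and the naive Bayes learner is just one of many classifiers one could plug in (the thesis tests forty-seven, finding a Bayesian network classifier best).

```python
# Sketch: the same documents indexed by surface words versus roots, weighted
# with TF-IDF, and fed to a classifier. extract_root is a hypothetical hook.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def extract_root(token):
    """Placeholder: swap in a real rule-based Arabic root extractor here."""
    return token

def root_analyzer(doc):
    return [extract_root(tok) for tok in doc.split()]

# Two competing representations of the same documents.
word_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
root_pipeline = make_pipeline(TfidfVectorizer(analyzer=root_analyzer), MultinomialNB())

# word_pipeline.fit(train_docs, train_labels)
# root_pipeline.fit(train_docs, train_labels)
# Comparing held-out accuracy of the two pipelines mirrors the thesis's
# words-versus-roots/stems experiments.
```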

4. Alabbas, Maytham Abualhail Shahed. "Textual entailment for modern standard Arabic." Thesis, University of Manchester, 2013. https://www.research.manchester.ac.uk/portal/en/theses/textual-entailment-for-modern-standard-arabic(9e053b1a-0570-4c30-9100-3d9c2ba86d8c).html.

Abstract:
This thesis explores a range of approaches to the task of recognising textual entailment (RTE), i.e. determining whether one text snippet entails another, for Arabic, where we are faced with an exceptional level of lexical and structural ambiguity. To the best of our knowledge, this is the first attempt to carry out this task for Arabic. Tree edit distance (TED) has been widely used as a component of natural language processing (NLP) systems that attempt to achieve the goal above, with the distance between pairs of dependency trees being taken as a measure of the likelihood that one entails the other. Such a technique relies on having accurate linguistic analyses, which are notoriously difficult to obtain for Arabic. To overcome these problems we have investigated strategies for improving tagging and parsing based on system combination techniques; these strategies lead to substantially better performance than any of the contributing tools. We also describe a semi-automatic technique for creating a first RTE dataset for Arabic using an extension of the 'headline-lead paragraph' technique, because, again to the best of our knowledge, no such datasets are available. We sketch the difficulties inherent in judgments by volunteer annotators, and describe a regime to ameliorate some of these. The major contribution of this thesis is the introduction of two ways of improving the standard TED: (i) we present a novel approach, extended TED (ETED), which extends the standard TED algorithm for calculating the distance between two trees by allowing operations to apply to subtrees, rather than just to single nodes. This leads to useful improvements over the performance of the standard TED for determining entailment. The key here is that subtrees tend to correspond to single information units. By treating operations on subtrees as less costly than the corresponding set of individual node operations, ETED concentrates on entire information units, which are a more appropriate granularity than individual words for considering entailment relations; and (ii) we use the artificial bee colony (ABC) algorithm to automatically estimate the cost of edit operations for single nodes and subtrees and to determine thresholds, since manually assigning an appropriate cost to each edit operation can become a tricky task. The current findings are encouraging: these extensions substantially improve the F-score and accuracy, achieving a better RTE model when compared with a number of string-based algorithms and the standard TED approaches. The relative performance of the standard techniques on our Arabic test set replicates the results reported for these techniques on English test sets. We have also applied ETED with ABC to the English RTE2 test set, where it again outperforms the standard TED.
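
The baseline that ETED improves on can be pictured in a few lines of Python: compute the tree edit distance between the dependency trees of the text and the hypothesis, then threshold it. The toy trees and threshold below are invented for illustration; the Zhang-Shasha algorithm from the zss package stands in for the TED component, and none of the ETED subtree operations or ABC-tuned costs are reproduced.

```python
# Minimal TED-based entailment check with the zss package's Zhang-Shasha
# implementation. Trees and threshold are illustrative only.
from zss import Node, simple_distance

# Toy dependency trees for a text/hypothesis pair.
text = Node("wrote").addkid(Node("author")).addkid(Node("book").addkid(Node("new")))
hyp = Node("wrote").addkid(Node("author")).addkid(Node("book"))

THRESHOLD = 2.0  # assumed; in practice estimated on a development set

def entails(t, h, threshold=THRESHOLD):
    """Decide entailment by thresholding the tree edit distance."""
    return simple_distance(t, h) <= threshold

print(entails(text, hyp))  # True: the hypothesis is one node deletion away
```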

5. Khaliq, Bilal. "Unsupervised learning of Arabic non-concatenative morphology." Thesis, University of Sussex, 2015. http://sro.sussex.ac.uk/id/eprint/53865/.

Abstract:
Unsupervised approaches to learning the morphology of a language play an important role in the computer processing of language, from both a practical and a theoretical perspective, due to their minimal reliance on manually produced linguistic resources and human annotation. Such approaches have been widely researched for the problem of concatenative affixation, but less attention has been paid to the intercalated (non-concatenative) morphology exhibited by Arabic and other Semitic languages. The aim of this research is to learn the root and pattern morphology of Arabic, with accuracy comparable to manually built morphological analysis systems. The approach is kept free from human supervision or manual parameter settings, assuming only that roots and patterns intertwine to form a word. Promising results were obtained by applying a technique adapted from previous work in concatenative morphology learning, which uses machine learning to determine relatedness between words. The output, with probabilistic relatedness values between words, was then used to rank all possible roots and patterns to form a lexicon. Analysis using trilateral roots resulted in a correct root identification accuracy of approximately 86% for inflected words. Although the machine learning-based approach is effective, it is conceptually complex, so a simpler and more computationally efficient alternative was then devised to obtain morpheme scores based on comparative counts of roots and patterns. In this approach, root and pattern scores are defined in terms of each other in a mutually recursive relationship, converging to an optimized morpheme ranking. This technique gives slightly better accuracy while being conceptually simpler and more efficient. The approach, after further enhancements, was evaluated on a version of the Quranic Arabic Corpus, attaining a final accuracy of approximately 93%. A comparative evaluation shows this to be superior to two existing, widely used, manually built Arabic stemmers, thus demonstrating the practical feasibility of unsupervised learning of non-concatenative morphology.
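
The second, counting-based approach lends itself to a compact sketch: every word contributes candidate (root, pattern) segmentations, and root and pattern scores reinforce each other iteratively until the ranking stabilises. Everything below (the brute-force segmentation, the update rule, the normalisation) is a simplified assumption for illustration, not the thesis's exact formulation.

```python
# Toy version of mutually recursive root/pattern scoring.
from collections import defaultdict
from itertools import combinations

def candidate_analyses(word):
    """Every choice of 3 letter positions as a root; the rest form the pattern."""
    for idx in combinations(range(len(word)), 3):
        root = "".join(word[i] for i in idx)
        pattern = "".join("*" if i in idx else c for i, c in enumerate(word))
        yield root, pattern

def rank_morphemes(words, iterations=10):
    """Roots are scored by the patterns they occur with, and vice versa."""
    pairs = [(r, p) for w in words for r, p in candidate_analyses(w)]
    root_score = defaultdict(lambda: 1.0)
    pat_score = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        new_root, new_pat = defaultdict(float), defaultdict(float)
        for r, p in pairs:
            new_root[r] += pat_score[p]    # a root is as good as its patterns
            new_pat[p] += root_score[r]    # and a pattern as good as its roots
        for score in (new_root, new_pat):  # normalise so scores stay bounded
            total = sum(score.values())
            for k in score:
                score[k] /= total
        root_score, pat_score = new_root, new_pat
    return root_score, pat_score
```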

6. Grinman, Alex J. "Natural language processing on encrypted patient data." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/113438.

Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 85-86).
While many industries can benefit from machine learning techniques for data analysis, they often have neither the technical expertise nor the computational power to do so. Therefore, many organizations would benefit from outsourcing their data analysis. Yet stringent data privacy policies prevent outsourcing sensitive data and may stop the delegation of data analysis in its tracks. In this thesis, we put forth a two-party system where one party capable of powerful computation can run certain machine learning algorithms from the natural language processing domain on the second party's data, with the first party limited to learning only specific functions of the second party's data and nothing else. Our system provides simple cryptographic schemes for locating keywords, matching approximate regular expressions, and computing frequency analysis on encrypted data. We present a full implementation of this system in the form of an extensible software library and a command line interface. Finally, we discuss a medical case study where we used our system to run a suite of unmodified machine learning algorithms on encrypted free-text patient notes.
by Alex J. Grinman.
M. Eng.
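
To give a feel for the keyword-location primitive, the sketch below uses keyed HMAC tokens so a server can match queries against encrypted notes without seeing plaintext. This is a deliberately simplified illustration of the general idea, not the scheme implemented in the thesis (which also covers approximate regular expressions and frequency analysis); deterministic tokens of this kind leak repetition patterns that a production scheme must address.

```python
# Sketch of searchable-encryption-style keyword matching with HMAC tokens.
import hmac
import hashlib

KEY = b"owner-secret-key"  # held by the data owner only

def token(word: str) -> str:
    """Deterministic search token for one word under the owner's key."""
    return hmac.new(KEY, word.lower().encode(), hashlib.sha256).hexdigest()

def encrypt_note(note: str) -> list[str]:
    """What the server stores: tokens instead of plaintext words."""
    return [token(w) for w in note.split()]

def server_search(stored: list[str], query_token: str) -> list[int]:
    """The server matches tokens without ever seeing the plaintext."""
    return [i for i, t in enumerate(stored) if t == query_token]

stored = encrypt_note("patient reports chest pain after exercise")
print(server_search(stored, token("pain")))  # -> [3]
```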

7. Alamry, Ali. "Grammatical Gender Processing in Standard Arabic as a First and a Second Language." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/39965.

Abstract:
The present dissertation investigates grammatical gender representation and processing in Modern Standard Arabic (MSA) as a first (L1) and a second (L2) language. It mainly examines whether L2 speakers can process gender agreement in a native-like manner, and the extent to which L2 processing is influenced by the properties of the speakers' L1. Additionally, it examines whether L2 gender agreement processing is influenced by noun animacy (animate and inanimate) and word order (verb-subject and subject-verb). A series of experiments using both online and offline techniques were conducted to address these questions; in all of them, gender agreement between verbs and nouns was examined. The first series of experiments examined native speakers of MSA (n=49) using a self-paced reading (SPR) task, an event-related potential (ERP) experiment, and a grammaticality judgment (GJ) task. These experiments revealed that native speakers were sensitive to grammatical violations: they showed longer reaction times (RTs) in the SPR task, and a P600 effect in the ERP experiment, in response to sentences with mismatched gender agreement as compared to sentences with matched gender agreement, and they performed at ceiling in the GJ task. The second series of experiments examined L2 speakers of MSA (n=74) using an SPR task and a GJ task. Both experiments included adult L2 speakers who were divided into two subgroups, -Gender and +Gender, based on whether or not their L1s have a grammatical gender system. The results of both experiments revealed that both groups were sensitive to gender agreement violations. The L2 speakers showed longer RTs in the SPR task in response to sentences with mismatched gender agreement as compared to sentences with matched gender agreement, with no difference between the L2 groups. The L2 speakers also performed well in the GJ task, correctly identifying the grammatical and ungrammatical sentences. Interestingly, in this task the -Gender group outperformed the +Gender group, which could be due to proficiency in the L2, as the former group obtained a better score on the proficiency task, or to negative transfer from the +Gender group's L1s. Based on the results of these two experiments, this dissertation argues that late L2 speakers are not restricted to their L1 grammar and are thus able to acquire the gender agreement system of their L2 even if this feature is not instantiated in their L1. The results provide converging evidence for the FTFA model rather than the FFFH model, as it appears that the -Gender group was able to reset their L1 gender parameter according to the L2 gender values. Although the L2 speakers were advanced, they showed slower RTs than the native speakers in the SPR task, and lower accuracy in the GJ task. However, it is possible that they are still in the process of acquiring the gender agreement of MSA and have not reached their final stage of acquisition; this is supported by the fact that some L2 speakers from both the -Gender and +Gender groups performed as well as native speakers in both the SPR and GJ tasks. Regarding the effect of animacy, the L2 speakers had slower RTs and lower accuracy on sentences with inanimate nouns than on those with animate ones, in line with previous L2 studies (Anton-Mendez, 1999; Alarcón, 2009; Gelin & Bugaiska, 2014). The native speakers, on the other hand, showed no effect of animacy in either the SPR task or the GJ task.
Further, no N400 effect was observed as a result of semantic gender agreement violations in the ERP experiment. Finally, the results revealed a potential effect of word order. Both the native and L2 speakers showed longer RTs on VS word order than on SV word order in the SPR task, and the native speakers showed an earlier and greater P600 effect on VS word order than on SV word order in the ERP experiment. This suggests that processing a gender agreement violation is more complex in the VS word order than in the SV word order, due to the inherent asymmetry of the subject-verb agreement system between the two word orders in MSA.

8. Yu, Ming-lung (余銘龍). "Automatic processing of Chinese language bank cheques." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2002. http://hub.hku.hk/bib/B31225548.

9. Hellmann, Sebastian. "Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data." Doctoral thesis, Universitätsbibliothek Leipzig, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-157932.

Abstract:
This thesis is a compendium of scientific works and engineering specifications that have been contributed to a large community of stakeholders to be copied, adapted, mixed, built upon and exploited in any way possible to achieve a common goal: Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data. The explosion of information technology in the last two decades has led to a substantial growth in quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked with each other, and the last few years have seen the emergence of numerous approaches in various disciplines concerned with linguistic resources and NLP tools. It is the challenge of our time to store, interlink and exploit this wealth of data accumulated in more than half a century of computational linguistics, of empirical, corpus-based study of language, and of computational lexicography in all its heterogeneity. The vision of the Giant Global Graph (GGG) was conceived by Tim Berners-Lee, aiming at connecting all data on the Web and allowing new relations between this openly accessible data to be discovered. This vision has been pursued by the Linked Open Data (LOD) community, where the cloud of published datasets comprises 295 data repositories and more than 30 billion RDF triples (as of September 2011). RDF is based on globally unique and accessible URIs, and it was specifically designed to establish links between such URIs (or resources). This is captured in the Linked Data paradigm that postulates four rules: (1) referred entities should be designated by URIs, (2) these URIs should be resolvable over HTTP, (3) data should be represented by means of standards such as RDF, and (4) a resource should include links to other resources. Although it is difficult to precisely identify the reasons for the success of the LOD effort, advocates generally argue that open licenses as well as open access are key enablers for the growth of such a network, as they provide a strong incentive for collaboration and contribution by third parties. In his keynote at BNCOD 2011, Chris Bizer argued that with RDF the overall data integration effort can be "split between data publishers, third parties, and the data consumer", a claim that can be substantiated by observing the evolution of many large data sets constituting the LOD cloud. As written in the acknowledgement section, parts of this thesis have received extensive feedback from other scientists, practitioners and industry in many different ways. The main contributions of this thesis are summarized here: Part I - Introduction and Background. During his keynote at the Language Resource and Evaluation Conference in 2012, Sören Auer stressed the decentralized, collaborative, interlinked and interoperable nature of the Web of Data. The keynote provides strong evidence that Semantic Web technologies such as Linked Data are on their way to becoming mainstream for the representation of language resources. The jointly written companion publication for the keynote was later extended as a book chapter in The People's Web Meets NLP and serves as the basis for "Introduction" and "Background", outlining some stages of the Linked Data publication and refinement chain. Both chapters stress the importance of open licenses and open access as an enabler for collaboration, the ability to interlink data on the Web as a key feature of RDF, and provide a discussion about scalability issues and decentralization.
Furthermore, we elaborate on how conceptual interoperability can be achieved by (1) re-using vocabularies, (2) agile ontology development, (3) meetings to refine and adapt ontologies and (4) tool support to enrich ontologies and match schemata. Part II - Language Resources as Linked Data. "Linked Data in Linguistics" and "NLP & DBpedia, an Upward Knowledge Acquisition Spiral" summarize the results of the Linked Data in Linguistics (LDL) Workshop in 2012 and the NLP & DBpedia Workshop in 2013 and give a preview of the MLOD special issue. In total, five proceedings – three published at CEUR (OKCon 2011, WoLE 2012, NLP & DBpedia 2013), one Springer book (Linked Data in Linguistics, LDL 2012) and one journal special issue (Multilingual Linked Open Data, MLOD, to appear) – have been (co-)edited to create incentives for scientists to convert and publish Linked Data and thus to contribute open and/or linguistic data to the LOD cloud. Based on the disseminated call for papers, 152 authors contributed one or more accepted submissions to our venues and 120 reviewers were involved in peer-reviewing. "DBpedia as a Multilingual Language Resource" and "Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked Data Cloud" contain this thesis' contribution to the DBpedia Project in order to further increase the size and inter-linkage of the LOD Cloud with lexical-semantic resources. Our contribution comprises extracted data from Wiktionary (an online, collaborative dictionary similar to Wikipedia) in more than four languages (now six) as well as language-specific versions of DBpedia, including a quality assessment of inter-language links between Wikipedia editions and internationalized content negotiation rules for Linked Data. In particular, this work created the foundation for a DBpedia Internationalisation Committee with members from over 15 different languages, with the common goal to push DBpedia as a free and open multilingual language resource. Part III - The NLP Interchange Format (NIF). "NIF 2.0 Core Specification", "NIF 2.0 Resources and Architecture" and "Evaluation and Related Work" constitute one of the main contributions of this thesis. The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The core specification describes which URI schemes and RDF vocabularies must be used for (parts of) natural language texts and annotations in order to create an RDF/OWL-based interoperability layer with NIF built upon Unicode Code Points in Normal Form C. Classes and properties of the NIF Core Ontology are described to formally define the relations between text, substrings and their URI schemes, followed by the evaluation of NIF. In a questionnaire, we asked questions to 13 developers using NIF. UIMA, GATE and Stanbol are extensible NLP frameworks, and NIF was not yet able to provide off-the-shelf NLP domain ontologies for all possible domains, but only for the plugins used in this study. After inspecting the software, the developers agreed, however, that NIF is adequate enough to provide a generic RDF output based on NIF using literal objects for annotations. All developers were able to map the internal data structure to NIF URIs to serialize RDF output (Adequacy).
The development effort in hours (ranging between 3 and 40 hours) as well as the number of code lines (ranging between 110 and 445) suggest that the implementation of NIF wrappers is easy and fast for an average developer. Furthermore, the evaluation contains a comparison to other formats and an evaluation of the available URI schemes for web annotation. In order to collect input from the wide group of stakeholders, a total of 16 presentations were given with extensive discussions and feedback, which has led to a constant improvement of NIF from 2010 until 2013. After the release of NIF (Version 1.0) in November 2011, a total of 32 vocabulary employments and implementations for different NLP tools and converters were reported (8 by the (co-)authors, including the Wiki-link corpus, 13 by people participating in our survey and 11 more, of which we have heard). Several roll-out meetings and tutorials were held (e.g. in Leipzig and Prague in 2013) and are planned (e.g. at LREC 2014). Part IV - The NLP Interchange Format in Use. "Use Cases and Applications for NIF" and "Publication of Corpora using NIF" describe 8 concrete instances where NIF has been successfully used. One major contribution is the usage of NIF as the recommended RDF mapping in the Internationalization Tag Set (ITS) 2.0 W3C standard and the conversion algorithms from ITS to NIF and back. One outcome of the discussions in the standardization meetings and telephone conferences for ITS 2.0 was the conclusion that there was no alternative RDF format or vocabulary other than NIF with the required features to fulfill the working group charter. Five further uses of NIF are described for the Ontology of Linguistic Annotations (OLiA), the RDFaCE tool, the Tiger Corpus Navigator, the OntosFeeder and visualisations of NIF using the RelFinder tool. These 8 instances provide an implemented proof-of-concept of the features of NIF. We then describe the conversion and hosting of the huge Google Wikilinks corpus with 40 million annotations for 3 million web sites; the resulting RDF dump contains 477 million triples in a 5.6 GB compressed dump file in turtle syntax. We also describe how NIF can be used to publish extracted facts from news feeds in the RDFLiveNews tool as Linked Data. Part V - Conclusions, providing lessons learned for NIF, conclusions and an outlook on future work. Most of the contributions are already summarized above. One particular aspect worth mentioning is the increasing number of NIF-formatted corpora for Named Entity Recognition (NER) that have come into existence after the publication of the main NIF paper Integrating NLP using Linked Data at ISWC 2013. These include the corpora converted by Steinmetz, Knuth and Sack for the NLP & DBpedia workshop and an OpenNLP-based CoNLL converter by Brümmer. Furthermore, we are aware of three LREC 2014 submissions that leverage NIF: NIF4OGGD - NLP Interchange Format for Open German Governmental Data, N^3 – A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format, and Global Intelligent Content: Active Curation of Language Resources using Linked Data, as well as an early implementation of a GATE-based NER/NEL evaluation framework by Dojchinovski and Kliegr. Further funding for the maintenance, interlinking and publication of Linguistic Linked Data as well as support and improvements of NIF is available via the expiring LOD2 EU project, as well as the CSA EU project called LIDER, which started in November 2013.
Based on the evidence of successful adoption presented in this thesis, we can expect a decent to high chance of reaching critical mass of Linked Data technology as well as the NIF standard in the field of Natural Language Processing and Language Resources.
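
To make the NIF conventions concrete, here is a minimal, hedged sketch (using rdflib) of what NIF-style output looks like: the whole text becomes a nif:Context resource whose URI carries character offsets, and an annotated substring becomes a nif:String anchored to that context. The base URI is an invented example, and real NIF output uses more of the vocabulary than shown here.

```python
# Minimal NIF-style RDF output with rdflib; base URI is an assumption.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
BASE = "http://example.org/doc1"   # assumed document URI

text = "Leipzig is a city."
g = Graph()
g.bind("nif", NIF)

# The whole text is a nif:Context with offsets in its URI (RFC 5147 style).
ctx = URIRef(f"{BASE}#char=0,{len(text)}")
g.add((ctx, RDF.type, NIF.Context))
g.add((ctx, NIF.isString, Literal(text)))

# An annotation over the substring "Leipzig" (characters 0-7).
ent = URIRef(f"{BASE}#char=0,7")
g.add((ent, RDF.type, NIF.String))
g.add((ent, NIF.referenceContext, ctx))
g.add((ent, NIF.anchorOf, Literal(text[0:7])))
g.add((ent, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((ent, NIF.endIndex, Literal(7, datatype=XSD.nonNegativeInteger)))

print(g.serialize(format="turtle"))
```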

10. Kan'an, Tarek Ghaze. "Arabic News Text Classification and Summarization: A Case of the Electronic Library Institute SeerQ (ELISQ)." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/74272.

Abstract:
Arabic news articles in heterogeneous electronic collections are difficult for users to work with. Two problems are that the articles are not categorized in a way that would aid browsing, and that there are no summaries or detailed metadata records that would be easier to work with than full articles. To address the first problem, schema mapping techniques were adapted to construct a simple taxonomy for Arabic news stories that is compatible with the subject codes of the International Press Telecommunications Council. So that each article would be labeled with the proper taxonomy category, automatic classification methods were researched to identify the most appropriate one. Experiments showed that the best classification features resulted from a new tailored stemming approach (a new Arabic light stemmer called P-Stemmer). When coupled with binary classification using SVMs, the newly developed approach proved superior to state-of-the-art techniques. To address the second problem, summarization, preliminary work was done with English corpora in the context of a new Problem Based Learning (PBL) course wherein students produced template summaries of big text collections. The techniques used in the course were extended to work with Arabic news. Due to the lack of high-quality tools for Named Entity Recognition (NER) and topic identification for Arabic, two new tools were constructed: RenA, an Arabic NER tool, and ALDA, an Arabic topic extraction tool (based on Latent Dirichlet Allocation). Controlled experiments with each of RenA and ALDA, involving Arabic speakers and a randomly selected corpus of 1000 Qatari news articles, showed that the tools produced very good results (i.e., names, organizations, locations, and topics). The categorization, NER, topic identification, and additional information extraction techniques were then combined to produce approximately 120,000 summaries for Qatari news articles, which are searchable, along with the articles, using LucidWorks Fusion, which builds upon Solr. Evaluation of the summaries showed high ratings based on the 1000-article test corpus. Contributions of this research with Arabic news articles thus include a new test corpus, taxonomy, light stemmer, classification approach, NER tool, topic identification tool, and template-based summarizer, all shown through experimentation to be highly effective.
Ph. D.
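
For intuition, light stemming of the kind P-Stemmer performs can be sketched as stripping a small set of frequent Arabic prefixes and suffixes without full morphological analysis. The affix lists and length guards below are illustrative assumptions only; they are not the published P-Stemmer rules.

```python
# Hedged sketch of Arabic light stemming; affix inventories are assumed.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ل"]  # longest first
SUFFIXES = ["هما", "ات", "ون", "ين", "ان", "ها", "ية", "ه", "ة", "ي"]

def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, keeping a stem of 3+ letters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والمكتبات"))  # strips "وال" then "ات", leaving "مكتب"
```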

11. Smith, Sydney. "Approaches to Natural Language Processing." Scholarship @ Claremont, 2018. http://scholarship.claremont.edu/cmc_theses/1817.

Abstract:
This paper explores topic modeling through the example text of Alice in Wonderland. It explores both singular value decomposition and non-negative matrix factorization as methods for feature extraction. The paper goes on to explore methods for a partially supervised implementation of topic modeling through introducing themes. A large portion of the paper also focuses on the implementation of these techniques in Python, as well as visualizations of the results, which use a combination of Python, HTML and JavaScript along with the d3 framework. The paper concludes by presenting a mixture of SVD, NMF and partially supervised NMF as a possible way to improve topic modeling.
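
A compact scikit-learn version of the paper's two factorizations is sketched below: TF-IDF features decomposed with truncated SVD (LSA) and with NMF, printing the top words per topic. The four toy documents stand in for the Alice in Wonderland text; component counts and preprocessing are arbitrary choices.

```python
# TF-IDF term-document features decomposed with SVD and NMF.
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["alice follows the white rabbit", "the queen shouts off with her head",
        "alice talks to the cheshire cat", "the mad hatter hosts a tea party"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

for name, model in [("SVD", TruncatedSVD(n_components=2)),
                    ("NMF", NMF(n_components=2, init="nndsvd"))]:
    model.fit(X)
    for k, comp in enumerate(model.components_):
        top = [terms[i] for i in comp.argsort()[-3:][::-1]]  # 3 strongest terms
        print(f"{name} topic {k}: {top}")
```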

12. Benajiba, Yassine. "Arabic named entity recognition." Doctoral thesis, Universitat Politècnica de València, 2010. http://hdl.handle.net/10251/8318.

Abstract:
This doctoral thesis describes research carried out with the aim of determining the best techniques for building an Arabic Named Entity Recognizer. Such a system would have the ability to identify and classify the named entities found in open-domain Arabic text. The Named Entity Recognition (NER) task helps other Natural Language Processing tasks (for example, Information Retrieval, Question Answering, Machine Translation, etc.) achieve better results thanks to the enrichment it adds to the text. The literature contains various works investigating the NER task for a specific language or from a language-independent perspective. However, to date, very few works studying this task for Arabic have been published. Arabic has a special orthography and a complex morphology, and these aspects bring new challenges to NER research. A complete investigation of NER for Arabic would not only provide the techniques needed to achieve high performance, but would also supply an error analysis and a discussion of the results that benefit the NER research community. The main goal of this thesis is to satisfy that need. To do so we have: 1. Carried out a study of the different aspects of Arabic related to this task; 2. Analyzed the state of the art in NER; 3. Performed a comparison of the results obtained by different machine learning techniques; 4. Developed a method based on the combination of different classifiers, where each classifier handles a single class of named entities and employs the feature set and machine learning technique best suited to that class. Our experiments were evaluated on nine test sets.
Benajiba, Y. (2009). Arabic named entity recognition [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8318
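
The combination idea at the heart of the thesis, one classifier per named-entity class, each with its own features and learning algorithm, merged into a single decision, can be sketched as follows. The feature functions, learners, and toy training data are invented stand-ins; the thesis selects the best feature set and learner per class empirically.

```python
# One (feature map, learner) pair per entity class; decisions merged at the end.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def person_features(tok): return {"word": tok, "has_title": tok in {"Dr", "Mr"}}
def location_features(tok): return {"word": tok, "capitalised": tok[:1].isupper()}

classifiers = {
    "PER": (person_features, make_pipeline(DictVectorizer(), LinearSVC())),
    "LOC": (location_features, make_pipeline(DictVectorizer(), LogisticRegression())),
}

# Toy per-class training: each classifier answers "is this token my class?"
tokens = ["Ahmed", "Cairo", "visited", "Doha"]
targets = {"PER": [1, 0, 0, 0], "LOC": [0, 1, 0, 1]}
for label, (feats, clf) in classifiers.items():
    clf.fit([feats(t) for t in tokens], targets[label])

def tag(token):
    """Merge the per-class yes/no decisions into one label (first match wins)."""
    for label, (feats, clf) in classifiers.items():
        if clf.predict([feats(token)])[0] == 1:
            return label
    return "O"

print(tag("Cairo"))   # expected "LOC" on this toy data
```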

13. Farran, Lama K. "The Relationship between Language and Reading in Bilingual English-Arabic Children." Digital Archive @ GSU, 2010. http://digitalarchive.gsu.edu/ece_diss/13.

Abstract:
This dissertation examined the relationship between language and reading in bilingual English-Arabic children. The dissertation followed a two chapter Review and Research Format. Chapter One presents a review of research that examined the relationship between oral language and reading development in bilingual English-Arabic children. Chapter Two describes the study that examined this same relationship. Participants were 83 third-, fourth-, and fifth-grade children who attended a charter school in a large school district in the Southeastern portion of the US. The school taught Arabic as a second language daily in the primary and elementary grades. This cross-sectional quantitative study used norm-referenced assessments and experimental measures. Data were analyzed using simultaneous and hierarchical regression to identify language predictors of reading. Analysis of covariance was used to examine whether the language groups differed in their Arabic reading comprehension scores, while controlling for age. Results indicated that phonological awareness in Arabic was related to phonological awareness in English. However, morphological awareness in Arabic was not related to morphological awareness in English. Results also revealed that phonological awareness predicted word reading, pseudoword decoding, and complex word reading fluency within Arabic and English; morphological awareness predicted complex word reading fluency in Arabic but not in English; and vocabulary predicted reading comprehension within Arabic and English. Further analyses indicated that children with high vocabulary differed from children with low vocabulary in their reading comprehension scores and that this difference was driven by children's ability to read unvowelized words. Consistent with the extended version of the Triangle Model of Reading (Bishop & Snowling, 2004), the results suggest a division of labor among various language components in the process of word reading and reading comprehension. Implications for research, instruction, and early intervention with bilingual English-Arabic children are discussed.

14. Mahfoudhi, Abdessatar. "Morphological and phonological units in the Arabic mental lexicon: Implications for theories of morphology and lexical processing." Thesis, University of Ottawa (Canada), 2005. http://hdl.handle.net/10393/29232.

Abstract:
This dissertation investigates the cognitive relevance of selected morphological and phonological units in the Arabic mental lexicon. The morphological units are sound and weak roots, etymons, phonetic matrices, and sound and weak patterns. The phonological units are vowels and consonants. The work is motivated by a controversy in Arabic morphology that is paralleled by a cross-linguistic debate in lexical processing. There are two views in Arabic morphology, the stem-based theory and the morpheme-based theory that is represented by two sub-theories. The first sub-theory argues that derivations are based on roots and patterns and the second proposes that the root should be replaced by the etymon and the phonetic matrix. The morpheme-based theory is congruent with lexical processing hypotheses that propose that complex words are accessed and represented as morphemes. The stem-based theory maintains that derivation is stem or word-based and is in line with the whole word hypothesis of lexical processing. These theoretical positions on Arabic morphology and lexical processing were tested in six priming experiments. One objective of these experiments was to test which of these morphemes prime word recognition. Another objective was to test the prediction of connectionism, another lexical processing hypothesis, that priming time correlates with prime-target overlap. A third objective was to examine how abstract the processing of these morphemes could be. The cognitive status of vowels and consonants was tested using a letter-circling task. The results of the online studies have shown that both roots and etymons facilitate word recognition significantly more than orthographic controls. However, non-ordered etymons, phonetic matrices, and patterns did not facilitate word recognition. Weak roots had priming effects only when primes and targets shared a vague semantic relationship. There was no correlation between priming time and meaning and/or form overlap. The lack of priming with non-ordered etymons suggests that there could be limits on abstractness in lexical processing. The results of the offline task suggest that root consonants are more salient than other letters. On the whole, the results support a morpheme-based theory of Arabic morphology and a localist view of lexical processing that assumes a morphemic stage in word recognition.

15. Hu, Jin. "Explainable Deep Learning for Natural Language Processing." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254886.

Abstract:
Deep learning methods achieve impressive performance in many Natural Language Processing (NLP) tasks, but it is still difficult to know what happens inside a deep neural network. This thesis gives a general overview of explainable AI and of how explainable deep learning methods are applied to NLP tasks. It then introduces the Bi-directional LSTM and CRF (BiLSTM-CRF) model for the Named Entity Recognition (NER) task, as well as the approach taken to make this model explainable. An approach is proposed to visualize the importance of neurons in the Bi-LSTM layer of the NER model using Layer-wise Relevance Propagation (LRP), which can measure how neurons contribute to each prediction of a word in a sequence. Ideas about how to measure the influence of the CRF layer of the BiLSTM-CRF model are also described.
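
The relevance-redistribution step at the core of LRP can be shown for a single linear layer: output relevance flows back to the inputs in proportion to their contributions a_j * w_jk (the epsilon rule). Propagating through a full Bi-LSTM, as the thesis does, repeats this bookkeeping through gates and time steps; this numpy sketch shows only the one-layer rule.

```python
# Epsilon-rule LRP for one linear layer; relevance is approximately conserved.
import numpy as np

def lrp_linear(a, W, b, R_out, eps=1e-6):
    """a: (J,) inputs, W: (J,K) weights, b: (K,) bias, R_out: (K,) relevance."""
    z = a @ W + b                          # pre-activations z_k
    z = z + eps * np.sign(z)               # stabiliser keeps the division safe
    contrib = a[:, None] * W               # contribution matrix a_j * w_jk
    R_in = (contrib / z[None, :]) @ R_out  # each input's share of each z_k
    return R_in                            # (J,) relevance per input neuron

np.random.seed(0)
a = np.array([1.0, 2.0, 0.5])
W = np.random.randn(3, 4)
b = np.zeros(4)
R_out = np.ones(4)
print(lrp_linear(a, W, b, R_out).sum(), R_out.sum())  # sums roughly match
```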

16. Wang, Zongyan. "Implementation of distributed data processing in a database programming language." Thesis, McGill University, 2002. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=79201.

Abstract:
This thesis discusses the design and implementation of integrating Internet capability into the database programming language JRelix, so that it not only possesses the data organization, storage and indexing capabilities of a normal DBMS, but also possesses remote data processing capabilities across the Internet.
A URL-based name extension to database elements in a database programming language is adopted, which gives it collaborative and distributed capability over the Internet with no changes in syntax or semantics apart from the new structure in names. Relations, computations, statements (or queries) and relational expressions are treated uniformly as database elements in our implementation. These database elements can be accessed or executed remotely. As a result, remote data accessing and processing, as well as Remote Procedure Call (RPC), are supported.
Resource sharing is a main achievement of the implementation. In addition, site autonomy and performance transparency are accomplished, distributed view management is provided, sites need not be geographically distant, and security management is implemented.

17. Ives, Zachary G. "Efficient query processing for data integration." Thesis, University of Washington, 2002. http://hdl.handle.net/1773/6864.

18. Shui, William Miao. "On Efficient processing of XML data and their applications." University of New South Wales, Computer Science & Engineering, 2007. http://handle.unsw.edu.au/1959.4/40502.

Abstract:
The development of high-throughput genome sequencing and protein structure determination techniques has provided researchers with a wealth of biological data. However, providing an integrated analysis can be difficult due to the incompatibilities of data formats between providers and applications, the strict schema constraints imposed by data providers, and the lack of infrastructure for easily accommodating new semantic information. To address these issues, this thesis first proposes to use Extensible Markup Language (XML) [26] and its supporting query languages as the underlying technology for facilitating seamless, integrated access to the sum of heterogeneous biological data and services. XML is used because of its semi-structured nature and its ability to easily encapsulate both contextual and semantic information. The tree representation of an XML document enables applications to traverse and access data within the document without prior knowledge of its schema. In the process of constructing this framework, however, we identified a number of issues related to the performance of XML technologies, more specifically of the XML query processor, the data store and the transformation processor. Hence, this thesis also focuses on new solutions to these issues. For the XML query processor, we propose an efficient structural join algorithm that can be implemented on top of existing relational databases; experiments show the proposed method outperforms previous work in both queries and updates. For complicated XML query patterns, a new twig join algorithm called CTwigStack is proposed, which in essence only produces and merges partial solution nodes that satisfy the entire twig query pattern tree. Experiments show the proposed algorithm outperforms previous methods in most cases. For more general cases, a mixed mode twig join is proposed, which combines CTwigStack with existing twig join algorithms; extensive experimental results show the superior effectiveness of both CTwigStack and the mixed mode twig join. Combined with existing system information, the mixed mode twig join can serve as a framework for plan selection during XML query optimization. For the XML transformation component, a novel stand-alone, memory-conscious XSLT processor is proposed that requires only a single pass over the input XML dataset, consequently enabling fast transformation of streaming XML data and better handling of complicated XPath selection patterns, including aggregate predicate functions such as the XPath count function. Ultimately, based on the nature of the proposed framework, we believe that solving the performance issues related to the underlying XML components can lead to a more robust framework for integrating heterogeneous biological data sources and services.
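
The structural join at the base of this line of work has a compact expression once elements are region-encoded: label every node with its (start, end) positions in document order, and ancestor-descendant matching becomes interval containment. The merge below is a simplified single-pass version of that idea (stack-based algorithms and twig joins such as CTwigStack refine it); the region labels in the example are invented.

```python
# Structural join over region-encoded XML nodes (relies on proper nesting:
# in well-formed XML, an open region that overlaps a descendant contains it).
def structural_join(ancestors, descendants):
    """Inputs: lists of (start, end) region labels, sorted by start.
    Returns (ancestor, descendant) pairs where the ancestor contains the descendant."""
    out, open_regions, ai = [], [], 0
    for d in sorted(descendants):
        # open every ancestor region that starts before this descendant
        while ai < len(ancestors) and ancestors[ai][0] < d[0]:
            open_regions.append(ancestors[ai])
            ai += 1
        # drop regions that already closed; all remaining ones contain d
        open_regions = [a for a in open_regions if a[1] > d[0]]
        out.extend((a, d) for a in open_regions)
    return out

sections = [(1, 12), (2, 5), (8, 11)]   # e.g., <section> elements
keywords = [(3, 4), (9, 10)]            # e.g., <keyword> elements
print(structural_join(sections, keywords))
# [((1, 12), (3, 4)), ((2, 5), (3, 4)), ((1, 12), (9, 10)), ((8, 11), (9, 10))]
```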

19. Law, Hin-cheung Hubert (羅憲璋). "A language model for Mandarin Chinese." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1997. http://hub.hku.hk/bib/B29913391.

20. Cheung, Siu-nang Bruce (張少能). "A theory of automatic language acquisition." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1994. http://hub.hku.hk/bib/B31233521.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Al-Muhtaseb, Husni Abdulghani. "Arabic text recognition of printed manuscripts : efficient recognition of off-line printed Arabic text using Hidden Markov Models, Bigram Statistical Language Model, and post-processing." Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/4426.

Full text
Abstract:
Arabic text recognition has not been researched as thoroughly as that of other natural languages, and the need for automatic Arabic text recognition is clear. In addition to traditional applications like postal address reading, check verification in banks, and office automation, there is large interest in searching scanned documents available on the internet and in searching handwritten manuscripts. Other possible applications are building digital libraries, recognizing text on digitized maps, recognizing vehicle license plates, serving as a first phase in text readers for visually impaired people, and understanding filled forms. This research work aims to contribute to the current research in the field of optical character recognition (OCR) of printed Arabic text by developing novel techniques and schemes to advance the performance of state-of-the-art Arabic OCR systems. Statistical and analytical analysis of Arabic text was carried out to estimate the probabilities of occurrence of Arabic characters for use with Hidden Markov Models (HMMs) and other techniques. Since there is no publicly available dataset of printed Arabic text for recognition purposes, it was decided to create one. In addition, a minimal Arabic script is proposed. The proposed script contains all basic shapes of Arabic letters and provides an efficient representation of Arabic text in terms of effort and time. Based on the success of using HMMs for speech and text recognition, their use for the automatic recognition of Arabic text was investigated. The HMM technique adapts to noise and font variations and does not require word or character segmentation of Arabic line images. In the feature extraction phase, experiments were conducted with a number of different features to investigate their suitability for HMMs. Finally, a novel set of features, which resulted in high recognition rates for different fonts, was selected. The developed techniques do not need word or character segmentation before the classification phase, as segmentation is a byproduct of recognition. This seems to be the most advantageous feature of using HMMs for Arabic text, as segmentation tends to produce errors which are usually propagated to the classification phase. Eight different Arabic fonts were used in the classification phase. The recognition rates ranged from 98% to 99.9%, depending on the font. As far as we know, these are new results in their context. Moreover, the proposed technique could be used for other languages: a proof-of-concept experiment was conducted on English characters with a recognition rate of 98.9% using the same HMM setup, and the same techniques were applied to Bangla characters with a recognition rate above 95%. The recognition of multi-font printed Arabic text was also conducted using the same technique, with fonts categorized into different groups, and new high recognition results were achieved. To enhance the recognition rate further, a post-processing module was developed to correct the OCR output through character-level and word-level post-processing. The use of this module increased the recognition rate by more than 1%.
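The word-level post-processing stage lends itself to a compact illustration. The Python sketch below is a hypothetical simplification, not the thesis's actual module: it trains a character-bigram language model with add-one smoothing and uses it to choose among candidate OCR outputs, which is the basic idea behind bigram-driven correction.

```python
import math
from collections import defaultdict

def train_bigram_lm(words):
    """Return a function scoring a word by its character-bigram
    log-probability, estimated with add-one smoothing."""
    bigrams, unigrams, vocab = defaultdict(int), defaultdict(int), set()
    for word in words:
        chars = ["<s>"] + list(word) + ["</s>"]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    v = len(vocab)
    def logprob(word):
        chars = ["<s>"] + list(word) + ["</s>"]
        return sum(math.log((bigrams[(p, c)] + 1) / (unigrams[p] + v))
                   for p, c in zip(chars, chars[1:]))
    return logprob

# Usage sketch: prefer the correction candidate the model finds most likely.
# logprob = train_bigram_lm(training_corpus_words)
# best = max(candidate_corrections, key=logprob)
```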
APA, Harvard, Vancouver, ISO, and other styles
23

Al-Muhtaseb, Husni A. "Arabic text recognition of printed manuscripts. Efficient recognition of off-line printed Arabic text using Hidden Markov Models, Bigram Statistical Language Model, and post-processing." Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/4426.

Full text
King Fahd University of Petroleum and Minerals (KFUPM)
APA, Harvard, Vancouver, ISO, and other styles
24

Guven, Ahmet. "Speeding up a path-based policy language compiler." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 2003. http://library.nps.navy.mil/uhtbin/hyperion-image/03Mar%5FGuven.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Trotter, William. "Translation Salience: A Model of Equivalence in Translation (Arabic/English)." University of Sydney. School of European, Asian and Middle Eastern Languages, 2000. http://hdl.handle.net/2123/497.

Full text
Abstract:
The term equivalence describes the relationship between a translation and the text from which it is translated. Translation is generally viewed as indeterminate insofar as there is no single acceptable translation, but many. Despite this, the rationalist metaphor of translation equivalence prevails. Rationalist approaches view translation as a process in which an original text is analysed to a level of abstraction, then transferred into a second representation from which a translation is generated. At the deepest level of abstraction, representations for analysis and generation are identical and transfer becomes redundant, while at the surface level it is said that surface textual features are transferred directly. Such approaches do not provide a principled explanation of how or why abstraction takes place in translation. They also fail to resolve the dilemma of specifying the depth of transfer appropriate for a given translation task. By focusing on the translator's role as mediator of communication, equivalence can be understood as the coordination of information about situations and states of mind. A fundamental opposition is posited between the transfer of rule-like or codifiable aspects of equivalence and those non-codifiable aspects in which salient information is coordinated. The Translation Salience model proposes that Transfer and Salience constitute bipolar extremes of a continuum. The model offers a principled account of the translator's interlingual attunement to multi-placed coordination, proposing that salient information can be accounted for with three primary notions: markedness, implicitness and localness. Chapter Two develops the Translation Salience model. The model is supported with empirical evidence from published translations of Arabic and English texts. Salience is illustrated in Chapter Three through contextualized interpretations associated with various Arabic communication resources (repetition, code switching, agreement, address in relative clauses, and the disambiguation of presentative structures). Measurability of the model is addressed in Chapter Four with reference to emerging computational techniques. Further research is suggested in connection with theme and focus, text type, cohesion and collocation relations.
APA, Harvard, Vancouver, ISO, and other styles
26

O'Sullivan, John J. D. "Teach2Learn : gamifying education to gather training data for natural language processing." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/117320.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 65-66).
Teach2Learn is a website which crowd-sources the problem of labeling natural text samples using gamified education as an incentive. Students assign labels to text samples from an unlabeled data set, thereby teaching supervised machine learning algorithms how to interpret new samples. In return, students can learn how that algorithm works by unlocking lessons written by researchers. This aligns the incentives of researchers and learners to help both achieve their goals. The application used current best practices in gamification to create a motivating structure around that labeling task. Testing showed that 27.7% of the user base (5/18 users) engaged with the content and labeled enough samples to unlock all of the lessons, suggesting that learning modules are sufficient motivation for the right users. Attempts to grow the platform through paid social media advertising were unsuccessful, likely because users aren't looking for a class when they browse those sites. Unpaid posts on subreddits discussing related topics, where users were more likely to be searching for learning opportunities, were more successful. Future research should seek users through comparable sites and explore how Teach2Learn can be used as an additional learning resource in classrooms.
by John J.D. O'Sullivan
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
27

Al-jasser, Faisal M. A. "Phonotactic probability and phonotactic constraints : processing and lexical segmentation by Arabic learners of English as a foreign language." Thesis, University of Newcastle Upon Tyne, 2008. http://hdl.handle.net/10443/537.

Full text
Abstract:
A fundamental skill in listening comprehension is the ability to recognize words. The ability to accurately locate word boundaries (i.e. to lexically segment) is an important contributor to this skill. Research has shown that English native speakers use various cues in the signal in lexical segmentation. One such cue is phonotactic constraints; more specifically, the presence of illegal English consonant sequences such as AV and MY signals word boundaries. It has also been shown that phonotactic probability (i.e. the frequency of segments and sequences of segments in words) affects native speakers' processing of English. However, the role that phonotactic probability and phonotactic constraints play in the EFL classroom has hardly been studied, while much attention has been devoted to teaching listening comprehension in EFL. This thesis reports on an intervention study which investigated the effect of teaching English phonotactics upon Arabic speakers' lexical segmentation of running speech in English. The study involved a native English group (N=12), a non-native speaking control group (N=20), and a non-native speaking experimental group (N=20). Each of the groups took three tests, namely Non-word Rating, Lexical Decision and Word Spotting. These tests probed how sensitive the subjects were to English phonotactic probability and to the presence of illegal sequences of phonemes in English, and investigated whether they used these sequences in the lexical segmentation of English. The non-native groups were post-tested with the same tasks after only the experimental group had been given a treatment consisting of explicit teaching of relevant English phonotactic constraints and related activities for 8 weeks. The gains made by the experimental group are discussed, with implications for teaching both pronunciation and listening comprehension in an EFL setting.
APA, Harvard, Vancouver, ISO, and other styles
28

Pham, Son Bao Computer Science & Engineering Faculty of Engineering UNSW. "Incremental knowledge acquisition for natural language processing." Awarded by: University of New South Wales. School of Computer Science and Engineering, 2006. http://handle.unsw.edu.au/1959.4/26299.

Full text
Abstract:
Linguistic patterns have been used widely in shallow methods to develop numerous NLP applications. Approaches for acquiring linguistic patterns can be broadly categorised into three groups: supervised learning, unsupervised learning and manual methods. In supervised learning approaches, a large annotated training corpus is required for the learning algorithms to achieve decent results. However, annotated corpora are expensive to obtain and usually available only for established tasks. Unsupervised learning approaches usually start with a few seed examples and gather some statistics based on a large unannotated corpus to detect new examples that are similar to the seed ones. Most of these approaches either populate lexicons for predefined patterns or learn new patterns for extracting general factual information; hence they are applicable to only a limited number of tasks. Manually creating linguistic patterns has the advantage of utilising an expert's knowledge to overcome the scarcity of annotated data. In tasks with no annotated data available, the manual way seems to be the only choice. One typical problem that occurs with manual approaches is that the combination of multiple patterns, possibly being used at different stages of processing, often causes unintended side effects. Existing approaches, however, do not focus on the practical problem of acquiring those patterns but rather on how to use linguistic patterns for processing text. A systematic way to support the process of manually acquiring linguistic patterns in an efficient manner is long overdue. This thesis presents KAFTIE, an incremental knowledge acquisition framework that strongly supports experts in creating linguistic patterns manually for various NLP tasks. KAFTIE addresses difficulties in manually constructing knowledge bases of linguistic patterns, or rules in general, often faced in existing approaches by: (1) offering a systematic way to create new patterns while ensuring they are consistent; (2) alleviating the difficulty in choosing the right level of generality when creating a new pattern; (3) suggesting how existing patterns can be modified to improve the knowledge base's performance; (4) making the effort in creating a new pattern, or modifying an existing pattern, independent of the knowledge base's size. KAFTIE, therefore, makes it possible for experts to efficiently build large knowledge bases for complex tasks. This thesis also presents the KAFDIS framework for discourse processing using new representation formalisms: the level-of-detail tree and the discourse structure graph.
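The exception-structured, incremental rule base described above can be sketched in a few lines. The toy Python class below is an assumption about the general mechanism, in the spirit of ripple-down rules; KAFTIE's actual patterns operate over linguistic annotations, not raw strings as here. The point it illustrates is that a wrong conclusion is patched locally with an exception rule, so the effort of a fix stays independent of the knowledge base's size.

```python
class Rule:
    """A rule whose conclusion holds unless a more specific
    exception rule, added later, fires on the same input."""
    def __init__(self, cond, conclusion):
        self.cond = cond              # predicate over the input text
        self.conclusion = conclusion
        self.exceptions = []          # refinements acquired incrementally

    def classify(self, text, default=None):
        if not self.cond(text):
            return default
        for exc in self.exceptions:   # a matching exception overrides us
            verdict = exc.classify(text)
            if verdict is not None:
                return verdict
        return self.conclusion

    def add_exception(self, cond, conclusion):
        """Patch a misclassification without touching other rules."""
        self.exceptions.append(Rule(cond, conclusion))

# Toy usage with hypothetical string conditions:
rule = Rule(lambda t: "significant improvement" in t, "positive-finding")
rule.add_exception(lambda t: "no significant improvement" in t, "negative-finding")
```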
APA, Harvard, Vancouver, ISO, and other styles
29

Shutova, Ekaterina. "Computational approaches to figurative language." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609681.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Kakavandy, Hanna, and John Landeholt. "How natural language processing can be used to improve digital language learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281693.

Full text
Abstract:
The world is facing globalization and, with that, companies are growing and need to hire according to their needs. A great obstacle to this is the language barrier between job applicants and employers who want to hire competent candidates. One spark of light in this challenge is Lingio, which provides a digital product that teaches profession-specific Swedish. Lingio intends to make its existing product more interactive, and this research paper aims to research aspects involved in that. This study evaluates the system utterances planned for use in Lingio's product for language learners to practise with, and studies the feasibility of using cosine similarity as a natural-language measure for classifying the correctness of answers to these utterances. This report also looks at whether it is best to use crowd-sourced material or a golden standard as the benchmark for a correct answer. The results indicate that there are a number of improvements and developments that need to be made to the model in order for it to accurately classify answers, due to its formulation and the complexity of human language. It is also concluded that the utterances by Lingio might need to be further developed in order to be effective for language learning, and that crowd-sourced material works better than a golden standard. The study makes several interesting observations from the collected data and analysis, aiming to contribute to further research in natural language engineering when it comes to text classification and digital language learning.
Globalization brings several consequences for growing companies. One of the challenges companies face is hiring enough competent staff. For many companies, the language barrier stands between them and hiring competent people: job seekers often lack the language skills needed to manage the job. Lingio is a company working on exactly this; its product is a digital application that teaches profession-specific Swedish, an effective solution for those who want to focus their language learning on an upcoming job. The aim is to help Lingio in the development of its product, more precisely in the work of making it more interactive. This is done by examining the effectiveness of the application's utterances used for learning purposes and by using a language-technology model to classify a user's answer to an utterance. It is further analysed whether it is best to use a golden standard or material collected through surveys as the reference point for a correct utterance. The results show that the model has several weaknesses and needs further development in order to classify correctly, and that there is room for improvement when it comes to the utterances. It is also shown that material collected through surveys works better than a golden standard.
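The classification step under study reduces to comparing a learner's answer with one or more reference answers. A minimal sketch of that idea follows; the bag-of-words vectorisation and the 0.6 threshold are illustrative assumptions, as the abstract does not fix those details.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two texts as bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def is_acceptable(answer, references, threshold=0.6):
    """Accept the answer if it is close enough to any reference,
    whether a crowd-sourced set or a single golden standard."""
    return max(cosine_similarity(answer, r) for r in references) >= threshold
```

Comparing against a set of crowd-sourced references rather than one golden standard amounts to taking the maximum over several reference points, which is one plausible reading of why the crowd-sourced benchmark fared better.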
APA, Harvard, Vancouver, ISO, and other styles
31

Lee, Chi-yin. "A pure orthographic stage in processing Chinese characters: evidence from data of sub-morphemic processing in preschool children." Click to view the E-thesis via HKU Scholars Hub, 2003. http://lookup.lib.hku.hk/lookup/bib/B38888919.

Full text
Abstract:
Thesis (B.Sc.)--University of Hong Kong, 2003.
"A dissertation submitted in partial fulfilment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, April 30, 2003." Includes bibliographical references (p. 28-30) Also available in print.
APA, Harvard, Vancouver, ISO, and other styles
32

Sabtan, Yasser Muhammad Naguib mahmoud. "Lexical selection for machine translation." Thesis, University of Manchester, 2011. https://www.research.manchester.ac.uk/portal/en/theses/lexical-selection-for-machine-translation(28ea687c-5eaf-4412-992a-16fc88b977c8).html.

Full text
Abstract:
Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the problem of knowledge acquisition. Statistical analysis of corpora can reveal trends and probabilities of occurrence, which have proved to be helpful in various ways. Machine Translation (MT) is no exception to this trend. Many MT researchers have attempted to extract knowledge from parallel bilingual corpora. The MT problem is generally decomposed into two sub-problems: lexical selection and reordering of the selected words. This research addresses the problem of lexical selection of open-class lexical items in the framework of MT. The work reported in this thesis investigates different methodologies to handle this problem, using a corpus-based approach. The current framework can be applied to any language pair, but we focus on Arabic and English. This is because Arabic words are hugely ambiguous and thus pose a challenge for the current task of lexical selection. We use a challenging Arabic-English parallel corpus, containing many long passages with no punctuation marks to denote sentence boundaries. This points to the robustness of the adopted approach. In our attempt to extract lexical equivalents from the parallel corpus we focus on the co-occurrence relations between words. The current framework adopts a lexicon-free approach towards the selection of lexical equivalents. This has the double advantage of investigating the effectiveness of different techniques without being distracted by the properties of the lexicon and at the same time saving much time and effort, since constructing a lexicon is time-consuming and labour-intensive. Thus, we use as little, if any, hand-coded information as possible. The accuracy score could be improved by adding hand-coded information. The point of the work reported here is to see how well one can do without any such manual intervention. With this goal in mind, we carry out a number of preprocessing steps in our framework. First, we build a lexicon-free Part-of-Speech (POS) tagger for Arabic. This POS tagger uses a combination of rule-based, transformation-based learning (TBL) and probabilistic techniques. Similarly, we use a lexicon-free POS tagger for English. We use the two POS taggers to tag the bi-texts. Second, we develop lexicon-free shallow parsers for Arabic and English. The two parsers are then used to label the parallel corpus with dependency relations (DRs) for some critical constructions. Third, we develop stemmers for Arabic and English, adopting the same knowledge-free approach. These preprocessing steps pave the way for the main system (or proposer) whose task is to extract translational equivalents from the parallel corpus. The framework starts with automatically extracting a bilingual lexicon using unsupervised statistical techniques which exploit the notion of co-occurrence patterns in the parallel corpus. We then choose the target word that has the highest frequency of occurrence from among a number of translational candidates in the extracted lexicon in order to aid the selection of the contextually correct translational equivalent. These experiments are carried out on either raw or POS-tagged texts. Having labelled the bi-texts with DRs, we use them to extract a number of translation seeds to start a number of bootstrapping techniques to improve the proposer. These seeds are used as anchor points to resegment the parallel corpus and start the selection process once again.
The final F-score for the selection process is 0.701. We have also written an algorithm for detecting ambiguous words in a translation lexicon and obtained a precision score of 0.89.
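The co-occurrence-based extraction at the core of the proposer can be illustrated with a small sketch. The Python below scores source-target word pairs with the Dice coefficient over a sentence-aligned corpus and keeps the best candidate per source word; the thesis's actual pipeline additionally uses frequency-based selection over POS-tagged text and bootstrapping from dependency-labelled seeds, so this shows the underlying idea only.

```python
from collections import Counter

def extract_lexicon(aligned_pairs, min_count=3):
    """aligned_pairs: iterable of (source_sentence, target_sentence)
    strings. Returns {source_word: (best_target_word, dice_score)}."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        src_words, tgt_words = set(src_sent.split()), set(tgt_sent.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update((s, t) for s in src_words for t in tgt_words)
    lexicon = {}
    for (s, t), c in pair_freq.items():
        if c < min_count:
            continue                      # ignore unreliable rare pairs
        dice = 2.0 * c / (src_freq[s] + tgt_freq[t])
        if dice > lexicon.get(s, (None, 0.0))[1]:
            lexicon[s] = (t, dice)        # keep the strongest association
    return lexicon
```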
APA, Harvard, Vancouver, ISO, and other styles
33

Lameris, Harm. "Homograph Disambiguation and Diacritization for Arabic Text-to-Speech Using Neural Networks." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-446509.

Full text
Abstract:
Pre-processing Arabic text for Text-to-Speech (TTS) systems poses major challenges, as Arabic omits short vowels in writing. This omission leads to a large number of homographs, and means that Arabic text needs to be diacritized to disambiguate these homographs, in order to be matched up with the intended pronunciation. Diacritizing Arabic has generally been achieved by using rule-based, statistical, or hybrid methods that combine rule-based and statistical methods. Recently, diacritization methods involving deep learning have shown promise in reducing error rates. These deep-learning methods are not yet commonly used in TTS engines, however. To examine neural diacritization methods for use in TTS engines, we normalized and pre-processed a version of the Tashkeela corpus, a large diacritized corpus containing largely Classical Arabic texts, for TTS purposes. We then trained and tested three state-of-the-art Recurrent-Neural-Network-based models on this data set. Additionally we tested these models on the Wiki News corpus, a test set that contains Modern Standard Arabic (MSA) news articles and thus more closely resembles most TTS queries. The models were evaluated by comparing the Diacritic Error Rate (DER) and Word Error Rate (WER) achieved for each data set to one another and to the DER and WER reported in the original papers. Moreover, the per-diacritic accuracy was examined, and a manual evaluation was performed. For the Tashkeela corpus, all models achieved a lower DER and WER than reported in the original papers. This was largely the result of using more training data in addition to the TTS pre-processing steps that were performed on the data. For the Wiki News corpus, the error rates were higher, largely due to the domain gap between the data sets. We found that for both data sets the models overfit on common patterns and the most common diacritic. For the Wiki News corpus the models struggled with Named Entities and loanwords. Purely neural models generally outperformed the model that combined deep learning with rule-based and statistical corrections. These findings highlight the usability of deep learning methods for Arabic diacritization in TTS engines as well as the need for diacritized corpora that are more representative of Modern Standard Arabic.
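The evaluation metric central to the study, the Diacritic Error Rate, can be stated concretely. The sketch below counts mismatched diacritics between gold and predicted strings over the same base text; it simplifies by assuming at most one diacritic per base character, whereas Arabic allows stacked marks such as shadda plus a short vowel.

```python
def diacritic_error_rate(gold, predicted, diacritics):
    """DER: fraction of base characters whose attached diacritic
    differs between the gold and predicted diacritizations."""
    def units(text):
        out = []                          # (base_char, diacritic_or_None)
        for ch in text:
            if ch in diacritics and out:
                out[-1] = (out[-1][0], ch)
            else:
                out.append((ch, None))
        return out
    g, p = units(gold), units(predicted)
    assert [b for b, _ in g] == [b for b, _ in p], "base text must match"
    errors = sum(dg != dp for (_, dg), (_, dp) in zip(g, p))
    return errors / len(g) if g else 0.0
```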
APA, Harvard, Vancouver, ISO, and other styles
34

González, Alejandro. "A Swedish Natural Language Processing Pipeline For Building Knowledge Graphs." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254363.

Full text
Abstract:
The concept of knowledge is proper only to the human being, thanks to the faculty of understanding. Immaterial concepts, independent of the material causes of experience, constitute evident proof of the existence of the rational soul that makes the human being a spiritual being, in a way independent of the material. Nowadays, research efforts in the field of Artificial Intelligence are trying to mimic this human capacity using computers, by 'teaching' them how to read and understand human language using Machine Learning techniques related to the processing of human language. However, there is still a significant number of challenges, such as how to represent this knowledge so it can be used by a machine to infer conclusions or provide answers. This thesis presents a Natural Language Processing pipeline that is capable of building a knowledge representation of the information contained in Swedish human-generated text. The result is a system that, given Swedish text in its raw format, builds a representation, in the form of a Knowledge Graph, of the knowledge or information contained in that text.
The awareness of knowledge is part of what defines the modern human being (who knows that she knows). Immaterial concepts, independent of material attributes, are part of the proof that the human being is a spiritual creature that is to some extent independent of the material. At present, research efforts in artificial intelligence try to imitate human behaviour using computers by 'teaching' them how to read and understand human language, using machine learning techniques related to the processing of human language. There remain, however, a significant number of challenges, for example how to represent this knowledge so that a machine can use it to draw conclusions or provide answers. This thesis presents a study of the use of Natural Language Processing in a pipeline that can generate a knowledge representation of information with the Swedish language as its basis. The result is a system that, given Swedish text in raw format, builds a representation, in the form of a knowledge graph, of the knowledge or information in that text.
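The last step of such a pipeline, turning parsed sentences into graph triples, can be sketched briefly. The Python below pulls (subject, relation, object) triples from spaCy dependency parses; the English model name is a stand-in assumption, since the thesis targets Swedish and does not name its components here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in; a Swedish pipeline would be used

def extract_triples(text):
    """Very rough subject-verb-object triples, i.e. candidate
    (node, edge, node) entries for a knowledge graph."""
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ == "VERB":
                subjects = [c for c in tok.children if c.dep_ == "nsubj"]
                objects = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
                if subjects and objects:
                    triples.append((subjects[0].text, tok.lemma_, objects[0].text))
    return triples
```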
APA, Harvard, Vancouver, ISO, and other styles
35

洪進德 and Chun-tak Hung. "Chinese workbench: an integrated environment for Chinese writers." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1992. http://hub.hku.hk/bib/B31210314.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Mok, Yuen-kwan Sally, and 莫婉君. "Multilingual information retrieval on the world wide web: the development of a Cantonese-Dagaare-English trilingual electronic lexicon." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2006. http://hub.hku.hk/bib/B36399085.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Alkathiri, Abdul Aziz. "Decentralized Large-Scale Natural Language Processing Using Gossip Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281277.

Full text
Abstract:
The field of Natural Language Processing in machine learning has seen rising popularity and use in recent years. The nature of Natural Language Processing, which deals with natural human language and computers, has led to the research and development of many algorithms that produce word embeddings. One of the most widely used of these algorithms is Word2Vec. With the abundance of data generated by users and organizations and the complexity of machine learning and deep learning models, performing training on a single machine becomes unfeasible. The advancement of distributed machine learning offers a solution to this problem. Unfortunately, for reasons concerning data privacy and regulations, in some real-life scenarios the data must not leave its local machine. This limitation has led to the development of techniques and protocols that are massively parallel and data-private. The most popular of these protocols is federated learning. However, due to its centralized nature, it still poses some security and robustness risks. Consequently, this led to the development of massively parallel, data-private, decentralized approaches, such as gossip learning. In the gossip learning protocol, every once in a while each node in the network randomly chooses a peer for information exchange, which eliminates the need for a central node. This research intends to test the viability of gossip learning for large-scale, real-world applications. In particular, it focuses on the implementation and evaluation of a Natural Language Processing application using gossip learning. The results show that the application of Word2Vec in a gossip learning framework is viable and yields results comparable to its non-distributed, centralized counterpart in various scenarios, with an average loss in quality of 6.904%.
The field of Natural Language Processing (NLP) in machine learning has seen rising popularity and use in recent years. The nature of Natural Language Processing, which deals with natural human language and computers, has led to the research and development of many algorithms that produce word embeddings. One of the most widely used of these algorithms is Word2Vec. With the abundance of data generated by users and organizations and the complexity of machine learning and deep learning models, it becomes impossible to carry out training on a single machine. Advances in distributed machine learning offer a solution to this problem, but unfortunately, for reasons of privacy and data regulation, in some real-world scenarios the data may not leave its local machine. This limitation has led to the development of techniques and protocols that are massively parallel and data-private. The most popular of these protocols is federated learning, but due to its centralized nature it still poses certain security and robustness risks. Consequently, this led to the development of massively parallel, data-private and decentralized approaches, such as gossip learning. In the gossip learning protocol, each node in the network randomly chooses a peer for information exchange, which eliminates the need for a central node. The purpose of this research is to test the viability of gossip learning in large-scale, real-world applications. In particular, the research focuses on the implementation and evaluation of an NLP application using gossip learning. The results show that the application of Word2Vec in a gossip learning framework is viable and yields results comparable to its non-distributed, centralized counterpart in various scenarios, with an average loss in quality of 6.904%.
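The protocol itself is simple to sketch. The Python below runs simplified gossip rounds in which every node averages its parameter vector with a random peer's; in full gossip learning, models are sent one-way and each node interleaves local training on its private data, which is only indicated by a comment here.

```python
import random

def gossip_round(nodes, mix=0.5):
    """One simplified gossip round over nodes holding parameter lists."""
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        node["model"] = [(1 - mix) * a + mix * b
                         for a, b in zip(node["model"], peer["model"])]
        # here each node would take local SGD steps on its private data

# Toy usage: three nodes drift toward a common model with no central server.
nodes = [{"model": [0.0, 1.0]}, {"model": [2.0, 3.0]}, {"model": [4.0, 5.0]}]
for _ in range(20):
    gossip_round(nodes)
```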
APA, Harvard, Vancouver, ISO, and other styles
38

Al-Hadlaq, Mohammed S. "Retention of words learned incidentally by Saudi EFL learners through working on vocabulary learning tasks constructed to activate varying depths of processing." Virtual Press, 2003. http://liblink.bsu.edu/uhtbin/catkey/1263891.

Full text
Abstract:
This study investigated the effectiveness of four vocabulary learning tasks on 104 Saudi EFL learners' retention of ten previously unencountered lexical items. These four tasks were: 1) writing original sentences (WS), 2) writing an original text (i.e. composition) (WT), 3) filling in the blanks of single sentences (FS), and 4) filling in the blanks of a text (FT). Different results were obtained depending on whether the amount of time required by these tasks was considered in the analysis or not. When time was not considered in the analysis, the WT group outperformed the other groups while the FS group obtained the lowest score. No significant differences were found between WS and FT. The picture, however, changed dramatically when time was considered in the analysis. The analysis of the ratio of score to time taken revealed no significant differences between the four groups except between FT and FS, in favor of FT. The differences in vocabulary gains between the four groups were ascribed to the level (or depth) of processing these tasks required the subjects to perform and to the richness of the context available in two of the four exercises, namely WT and FT. The researcher concluded that composition writing was the most helpful task for vocabulary retention and also for general language learning, followed by FT. Sentence fill-in was considered the least useful activity in this regard.
Department of English
APA, Harvard, Vancouver, ISO, and other styles
39

Chen, Yong. "Constructing a language model based on data mining techniques for a Chinese character recognition system /." View the Table of Contents & Abstract, 2004. http://sunzi.lib.hku.hk/hkuto/record/B30708527.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Chen, Yong, and 陳勇. "Constructing a language model based on data mining techniques for a Chinese character recognition system." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B44570193.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Ilberg, Peter. "Floyd : a functional programming language with distributed scope." Thesis, Georgia Institute of Technology, 1998. http://hdl.handle.net/1853/8187.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Wong, Kun-wing Peter, and 黃冠榮. "Breaking the learning barrier of Chinese Changjei input method." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1998. http://hub.hku.hk/bib/B31961198.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Lee, Hiu-wing Doris, and 李曉穎. "A study of automatic expansion of Chinese abbreviations." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2005. http://hub.hku.hk/bib/B31609338.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Yiu, Lai Kuen Candy. "Chinese character synthesis : towards universal Chinese information exchange." HKBU Institutional Repository, 2003. http://repository.hkbu.edu.hk/etd_ra/477.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Mountaki, Youness. "The Relative Effects of Processing Instruction and Traditional Output Instruction on the Acquisition of the Arabic Subjunctive." Scholar Commons, 2016. http://scholarcommons.usf.edu/etd/6330.

Full text
Abstract:
The role of input and output in the acquisition of language has been a source of controversy in Second Language Acquisition (SLA) research. The present study aimed to investigate the relative effects of processing instruction (PI), a type of input-based instruction, and traditional instruction (TI), a type of output-based instruction. Specifically, this experiment examined whether PI and TI bring about any improvement in comprehension and production of the Arabic subjunctive by beginner-level learners of Arabic. The PI instructional technique was based on the principles of input processing suggested by VanPatten (1993, 2002, 2004). It has three main elements: (a) an explicit explanation of grammar, (b) information on processing strategies, and (c) structured input activities. The study involved second-semester students of Arabic and aimed at assessing the impact of PI and traditional output instruction on the interpretation and production of the Arabic subjunctive on immediate and delayed posttests. One instructional package was developed for the PI group and another for the TI group. To assess the effects of instruction, a pretest/posttest/delayed-posttest procedure with three tests was used. Each test included: 1) an interpretation task with sixteen multiple-choice items and 2) a production task with sixteen sentence-completion items. The results showed that participants who received PI outperformed participants who received TI as measured by the interpretation tasks of the subjunctive. However, the performance of the two groups was statistically similar as measured by the production tasks of the subjunctive. These results support those of previous research that compared PI with TI (Benati, 2001, 2005; Cadierno, 1995; VanPatten & Cadierno, 1993a, 1993b; VanPatten & Wong, 2004).
APA, Harvard, Vancouver, ISO, and other styles
46

Caines, Andrew Paul. "You talking to me? : zero auxiliary constructions in British English." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609153.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Cheng, James Sheung-Chak. "The development of a structural index tree for processing XML data." View abstract or full-text, 2004. http://library.ust.hk/cgi/db/thesis.pl?COMP%202004%20CHENG.

Full text
Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2004.
Includes bibliographical references (leaves 80-86). Also available in electronic version. Access restricted to campus users.
APA, Harvard, Vancouver, ISO, and other styles
48

Li, Jianxin. "Adaptive query relaxation and processing over heterogeneous xml data sources." Swinburne Research Bank, 2009. http://hdl.handle.net/1959.3/66874.

Full text
Abstract:
Thesis (Ph.D) - Swinburne University of Technology, Faculty of Information & Communication Technologies, 2009.
A dissertation submitted to the Faculty of Information and Communication Technologies, Swinburne University of Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2009. Typescript. "August 2009". Bibliography p. 161-171.
APA, Harvard, Vancouver, ISO, and other styles
49

Lauretig, Adam M. "Natural Language Processing, Statistical Inference, and American Foreign Policy." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1562147711514566.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Tempfli, Peter. "Preprocessing method comparison and model tuning for natural language data." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34438.

Full text
Abstract:
Twitter and other microblogging services are a valuable source for almost real-time marketing, public-opinion and brand-related consumer information mining. As such, the collection and analysis of user-generated natural language content is a focus of research on automated sentiment analysis. The most successful approach in the field is supervised machine learning, where the three key problems are data cleaning and transformation, feature generation, and model choice and training-parameter selection. Papers in recent years have thoroughly examined the field, and there is agreement that relatively simple techniques, such as a bag-of-words transformation of the text and a naive Bayes model, can generate acceptable results (between 75% and 85% F1-score for an average dataset), and that fine-tuning can be difficult and yields relatively small gains. However, a few percentage points of performance, even on a middle-sized dataset, can mean thousands of better-classified documents, which can mean thousands of missed sales or angry customers in any business domain. Thus this work presents and demonstrates a framework for better-tailored, fine-tuned models for analysing Twitter data. The experiments show that naive Bayes classifiers with domain-specific stopword selection work best (up to 88% F1-score); however, the performance decreases dramatically if the data is unbalanced or the classes are not binary. Filtering stopwords is crucial to increasing prediction performance, and the experiments show that a stopword set should be domain-specific. The conclusion is that there is no single best way to perform model training and stopword selection in sentiment analysis. Thus the work suggests that there is space for using a comparison framework to fine-tune prediction models to a given problem: such a framework should compare different training settings on the same dataset, so that the best-trained models can be found for a given real-life problem.
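The best-performing configuration reported above, a naive Bayes classifier over bag-of-words features with a domain-specific stopword list, can be assembled in a few lines of scikit-learn. The stopword list below is a hypothetical placeholder; per the thesis's conclusion, it should be tuned to the domain at hand by comparing settings on the same dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

domain_stopwords = ["rt", "via", "amp"]   # hypothetical, domain-specific list

model = make_pipeline(
    CountVectorizer(stop_words=domain_stopwords),  # bag-of-words features
    MultinomialNB(),                               # naive Bayes classifier
)
# model.fit(train_tweets, train_labels)
# predictions = model.predict(test_tweets)
```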
APA, Harvard, Vancouver, ISO, and other styles