Academic literature on the topic 'Arabic language – Data processing'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Arabic language – Data processing.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Journal articles on the topic "Arabic language – Data processing"

1

Bouziane, Abdelghani, Djelloul Bouchiha, Redha Rebhi, Giulio Lorenzini, Noureddine Doumi, Younes Menni, and Hijaz Ahmad. "ARALD: Arabic Annotation Using Linked Data." Ingénierie des systèmes d'information 26, no. 2 (April 30, 2021): 143–49. http://dx.doi.org/10.18280/isi.260201.

Full text
Abstract:
The evolution of the traditional Web into the Semantic Web makes the machine a first-class citizen on the Web and increases the discovery and accessibility of unstructured Web-based data. This development makes it possible to use Linked Data technology as the background knowledge base for unstructured data, especially texts, now available in massive quantities on the Web. Given any text, the main challenge is determining DBpedia's most relevant information with minimal effort and time. However, DBpedia annotation tools, such as DBpedia Spotlight, have mainly targeted English and other Latin-script DBpedia versions. The current situation of the Arabic language is less bright: Arabic Web content does not reflect the importance of this language. Thus, we have developed an approach to annotate Arabic texts with Linked Open Data, particularly DBpedia. This approach uses natural language processing and machine learning techniques for interlinking Arabic text with Linked Open Data. Despite the high complexity of the domain-independent knowledge base and the limited resources in Arabic natural language processing, the evaluation results of our approach were encouraging.
APA, Harvard, Vancouver, ISO, and other styles
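As a rough illustration of the interlinking idea (not the authors' ARALD pipeline, which combines NLP and machine learning), the sketch below looks up candidate DBpedia resources for an Arabic surface form via the public SPARQL endpoint; the endpoint choice, query shape, and example phrase are assumptions.

```python
# Minimal sketch: find DBpedia resources whose Arabic rdfs:label matches a
# given surface form. Requires the SPARQLWrapper package and network access.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_candidates(arabic_phrase: str, limit: int = 5):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT DISTINCT ?resource WHERE {{
            ?resource rdfs:label "{arabic_phrase}"@ar .
        }} LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["resource"]["value"] for b in results["results"]["bindings"]]

# Example: candidate resources for the Arabic label of "Algeria"
print(dbpedia_candidates("الجزائر"))
```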
2

Tachicart, Ridouane, and Karim Bouzoubaa. "Moroccan Data-Driven Spelling Normalization Using Character Neural Embedding." Vietnam Journal of Computer Science 08, no. 01 (October 5, 2020): 113–31. http://dx.doi.org/10.1142/s2196888821500044.

Full text
Abstract:
With the increase of Web use in Morocco today, the Internet has become an important source of information. Specifically, across social media, Moroccan people use several languages in their communication, leaving behind unstructured user-generated text (UGT) that presents several opportunities for Natural Language Processing. Among the languages found in this data, Moroccan Arabic (MA) stands out with substantial content and several distinctive features. In this paper, we investigate online written text generated by Moroccan users on social media, with an emphasis on Moroccan Arabic. For this purpose, we follow several steps, using tools such as a language identification system, in order to conduct a deep study of this data. The most interesting findings that emerged are the use of code-switching, multiple scripts, and a low number of words in Moroccan UGT. Moreover, we used the investigated data to build a new Moroccan language resource: a lexicon of Moroccan word orthographic variants, built following an unsupervised approach using character neural embeddings. This lexicon can be useful for several NLP tasks, such as spelling normalization.
APA, Harvard, Vancouver, ISO, and other styles
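The character-embedding idea can be illustrated with a minimal gensim FastText sketch: because FastText represents words as bags of character n-grams, spelling variants of the same dialect word tend to land close together. The toy corpus and hyperparameters below are assumptions, not the paper's setup.

```python
# Hedged sketch of character-level embeddings for grouping spelling variants.
from gensim.models import FastText

corpus = [
    ["salam", "slm", "salaam", "labas", "lbas", "bikhir"],
    ["salam", "labas", "bikhir", "hamdullah", "hmdlh"],
]  # placeholder romanised Moroccan Arabic posts
model = FastText(corpus, vector_size=50, window=3, min_count=1,
                 min_n=2, max_n=4, epochs=50)  # char n-grams of length 2-4

# Nearest neighbours of a word serve as orthographic-variant candidates
for word in ["salam", "labas"]:
    print(word, "->", model.wv.most_similar(word, topn=3))
```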
3

Essam, Nader, Abdullah M. Moussa, Khaled M. Elsayed, Sherif Abdou, Mohsen Rashwan, Shaheen Khatoon, Md Maruf Hasan, Amna Asif, and Majed A. Alshamari. "Location Analysis for Arabic COVID-19 Twitter Data Using Enhanced Dialect Identification Models." Applied Sciences 11, no. 23 (November 30, 2021): 11328. http://dx.doi.org/10.3390/app112311328.

Full text
Abstract:
The recent surge of social media networks has provided a channel to gather and publish vital medical and health information. The focal role of these networks has become more prominent in periods of crisis, such as the recent COVID-19 pandemic. These social networks have been the leading platform for broadcasting health news updates, precaution instructions, and governmental procedures. They also provide an effective means for gathering public opinion and tracking breaking events and stories. To achieve location-based analysis for social media input, the location information of the users must be captured. Most of the time, this information is either missing or hidden. For some languages, such as Arabic, the users' location can be predicted from their dialects. The Arabic language has many local dialects across most Arab countries. Natural Language Processing (NLP) techniques have provided several approaches for dialect identification. Recent advanced language models using contextual word representations in the continuous domain, such as BERT models, have provided significant improvements for many NLP applications. In this work, we present our efforts to use BERT-based models to improve the dialect identification of Arabic text. We report the results of the developed models in recognizing the source Arab country, or Arab region, from Twitter data. Our results show a 3.4% absolute enhancement in dialect identification accuracy at the regional level over the state-of-the-art result. When we excluded the Modern Standard Arabic (MSA) set, which is formal Arabic, we achieved a 3% absolute gain in accuracy among the three major Arabic dialects over the state-of-the-art level. Finally, we applied the developed models to a recently collected resource of COVID-19 Arabic tweets to recognize the source country from the users' tweets. We achieved a weighted average accuracy of 97.36%, which suggests the models can serve as a tool for policymakers to support country-level disaster-related activities.
APA, Harvard, Vancouver, ISO, and other styles
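A minimal skeleton of BERT-based dialect classification in the spirit of the entry above, not the authors' exact setup: the checkpoint name and label set are assumptions (any Arabic BERT checkpoint from the Hugging Face hub could be substituted), and the classification head is untrained here.

```python
# Illustrative sketch: Arabic BERT with a sequence-classification head for
# dialect identification. Requires the transformers and torch packages.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "aubmindlab/bert-base-arabertv02"   # assumed Arabic BERT checkpoint
dialects = ["EGY", "GLF", "LEV", "NOR", "MSA"]  # illustrative label set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(dialects))

batch = tokenizer(["شلونك شخبارك؟"], padding=True, truncation=True,
                  return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
# With an untrained head this prediction is arbitrary; fine-tuning on
# labeled dialect data is the step the paper's results depend on.
print(dialects[int(logits.argmax(dim=-1))])
```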
4

Mahmoudi, Omayma, Mouncef Filali Bouami, and Mustapha Badri. "Arabic Language Modeling Based on Supervised Machine Learning." Revue d'Intelligence Artificielle 36, no. 3 (June 30, 2022): 467–73. http://dx.doi.org/10.18280/ria.360315.

Full text
Abstract:
Misinformation and misleading actions appeared as soon as COVID-19 vaccination campaigns were launched, regardless of a country's literacy level or growth index. In such a situation, supervised machine learning techniques for classification appear to be a suitable solution for modeling the value and veracity of data, especially in the Arabic language, a language used by millions of people around the world. To achieve this task, we collected data manually from social media platforms such as Facebook and Twitter and from Arabic news websites. This paper aims to classify Arabic-language news into fake news and real news by creating a Machine Learning (ML) model that detects Arabic fake news (DAFN) about COVID-19 vaccination. To achieve our goal, we use Natural Language Processing (NLP) techniques, which is especially challenging since NLP library support for Arabic is not common. We use the NLTK package in Python to preprocess the data, and then an ML model for the classification.
APA, Harvard, Vancouver, ISO, and other styles
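A minimal sketch of the kind of pipeline the abstract describes, assuming NLTK's Arabic stopword list and ISRI stemmer for preprocessing and a scikit-learn classifier; the two-example dataset is a placeholder, not the paper's manually collected data.

```python
# Hedged sketch of an NLTK + scikit-learn Arabic fake-news classifier.
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nltk.download("stopwords", quiet=True)
arabic_stops = set(stopwords.words("arabic"))
stemmer = ISRIStemmer()

def preprocess(text: str) -> str:
    # Remove stopwords and reduce tokens with the ISRI Arabic stemmer
    tokens = [stemmer.stem(t) for t in text.split() if t not in arabic_stops]
    return " ".join(tokens)

texts = ["اللقاح آمن وفعال", "اللقاح يغير الحمض النووي"]   # placeholder data
labels = ["real", "fake"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit([preprocess(t) for t in texts], labels)
print(clf.predict([preprocess("اللقاح آمن")]))
```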
5

Aflisia, Noza, Mohamad Erihadiana, and Nur Balqis. "Teacher’s Perception toward the Readiness to Face Multiculturalism in Arabic Teaching and Learning." Izdihar : Journal of Arabic Language Teaching, Linguistics, and Literature 3, no. 3 (December 31, 2020): 197–210. http://dx.doi.org/10.22219/jiz.v3i3.14117.

Full text
Abstract:
The presence of multiculturalism requires an appropriate response from Arabic teachers, so that Arabic is easily accepted and embraced by various groups. This research aimed to analyze the efforts of Arabic teachers in dealing with multiculturalism and the obstacles encountered in applying multicultural education in Arabic language learning. This qualitative descriptive research was conducted with interviews and documentation. The data analysis and processing steps used in this study were: preparing the data for analysis, reading through all the data, coding the data, using the codes to describe the settings, people, categories, and themes analyzed, and presenting the themes in a narrative/qualitative report. The results revealed that the efforts of Arabic teachers to confront multiculturalism were reaffirming Arabic as a unifying language for Muslims, confirming Arabic as an international language, learning the essence of multiculturalism, improving didactic and methodological competencies, attending training, and modeling. The constraints on the application of multicultural education in Arabic language learning were lack of understanding of the essence of multiculturalism, lack of knowledge of learning methods and strategies, lack of literature, lack of syllabi and teaching materials containing multicultural education, lack of support from institutions, and lack of training and guidance.
APA, Harvard, Vancouver, ISO, and other styles
6

Hizbullah, Nur, Zakiyah Arifa, Yoke Suryadarma, Ferry Hidayat, Luthfi Muhyiddin, and Eka Kurnia Firmansyah. "SOURCE-BASED ARABIC LANGUAGE LEARNING: A CORPUS LINGUISTIC APPROACH." Humanities & Social Sciences Reviews 8, no. 3 (June 17, 2020): 940–54. http://dx.doi.org/10.18510/hssr.2020.8398.

Full text
Abstract:
Purpose: The study explores the process of using Arabic websites for Arabic language learning, utilising the Arabic corpus linguistics approach. This approach enables data-mining of websites, systematically compiling the mined data, and processing the data for the express purpose of Arabic language teaching and its related clusters, such as Arabic pragmatics, Arabic linguistics, and Arabic translation teaching. Methodology: The research is descriptive and utilises qualitative methods for analysing the process and the step-by-step procedures to be executed to make good use of the data. Main Findings: This study is conducted based on the theory of source-based teaching, while the process of utilising the websites is systematically elaborated through the corpus linguistics mechanism. The research concludes that almost all Arabic websites can be employed as authentic, reliable teaching sources. The sources can be put to good use for teaching the four language competencies, as objects of linguistic study, and for translation, particularly through websites whose contents are bilingual or multilingual. Implications/Applications: The utilisation of the corpus for teaching and learning still needs to be disseminated and promoted, both among practitioners and among researchers of the Arabic language in Indonesia. Novelty/Originality of this study: This study highlights that almost all Arabic-language websites are among the richest sources of learning. These learning resources can be used for language learning and various other dimensions of Arabic scholarship. Corpus linguistics has many benefits for learners and teachers in Arabic language learning. This study offers a new approach to Arabic teaching and learning using website resources, and shows the dynamics of Arabic learning using technology.
APA, Harvard, Vancouver, ISO, and other styles
7

Langlois, D., M. Saad, and K. Smaili. "Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data." Natural Language Engineering 24, no. 5 (June 19, 2018): 677–94. http://dx.doi.org/10.1017/s1351324918000232.

Full text
Abstract:
The objective of this article is to address the issue of the comparability of documents, which are extracted from different sources and written in different languages. These documents are not necessarily translations of each other. This material is referred to as multilingual comparable corpora. These language resources are useful for multilingual natural language processing applications, especially for low-resourced language pairs. In this paper, we collect different data in Arabic, English, and French. Two corpora are built by using available hyperlinks for Wikipedia and Euronews. Euronews is an aligned multilingual (Arabic, English, and French) corpus of 34k documents collected from the Euronews website. A more challenging issue is to build a comparable corpus from two different and independent media outlets having two distinct editorial lines, such as the British Broadcasting Corporation (BBC) and Al Jazeera (JSC). To build such a corpus, we propose to use the cross-lingual latent semantic approach. For this purpose, documents were harvested from the BBC and JSC websites for each month of the years 2012 and 2013. The comparability is calculated for each Arabic–English couple of documents of each month. This automatic task was then validated by hand. This led to a multilingual (Arabic–English) aligned corpus of 305 pairs of documents (233k English words and 137k Arabic words). In addition, a study is presented in this paper to analyze the performance of three methods from the literature for measuring the comparability of documents on the multilingual reference corpora. A recall at rank 1 of 50.16 per cent is achieved with the cross-lingual LSI approach for the BBC–JSC test corpus, while the dictionary-based method reaches a recall of only 35.41 per cent.
APA, Harvard, Vancouver, ISO, and other styles
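The cross-lingual LSI idea can be sketched as follows: fit a latent space on concatenated aligned document pairs so that translation-equivalent terms share dimensions, then score the comparability of new pairs by cosine similarity. The three-pair corpus and component count below are toy assumptions.

```python
# Rough sketch of cross-lingual LSI comparability scoring with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

aligned_pairs = [
    ("the president visited paris", "الرئيس زار باريس"),
    ("oil prices fell sharply", "أسعار النفط انخفضت بشدة"),
    ("the match ended in a draw", "انتهت المباراة بالتعادل"),
]
# One pseudo-document per aligned pair, so EN and AR terms co-occur
pseudo_docs = [en + " " + ar for en, ar in aligned_pairs]

vec = TfidfVectorizer().fit(pseudo_docs)
lsi = TruncatedSVD(n_components=2).fit(vec.transform(pseudo_docs))

def comparability(en_doc: str, ar_doc: str) -> float:
    en_v = lsi.transform(vec.transform([en_doc]))
    ar_v = lsi.transform(vec.transform([ar_doc]))
    return cosine_similarity(en_v, ar_v)[0, 0]

print(comparability("the president visited paris", "الرئيس زار باريس"))
```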
8

Chaimae, Azroumahli, Yacine El Younoussi, Otman Moussaoui, and Youssra Zahidi. "An Arabic Dialects Dictionary Using Word Embeddings." International Journal of Rough Sets and Data Analysis 6, no. 3 (July 2019): 18–31. http://dx.doi.org/10.4018/ijrsda.2019070102.

Full text
Abstract:
Dialectal Arabic and Modern Standard Arabic lack sufficient standardized language resources to enable Arabic language processing tasks, despite this being an active research area. This work addresses the issue by first highlighting the steps and the issues related to building a multi-dialect Arabic corpus using web data from blogs and social media platforms (e.g., Facebook, Twitter). The corpus is then used to create a vectorized dictionary of the crawled data using word embeddings. In other terms, the goal of this article is to build an updated multi-dialect dataset and then extract an annotated corpus from it.
APA, Harvard, Vancouver, ISO, and other styles
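A minimal sketch of the "vectorized dictionary" step under toy assumptions: train word embeddings on tokenised dialect posts and persist the word-to-vector mapping for later use. The corpus and parameters are placeholders.

```python
# Hedged sketch: build and save a word -> vector dictionary with gensim.
from gensim.models import Word2Vec

crawled = [
    ["kifach", "dayer", "lyoum"],
    ["chou", "akhbarak", "lyoum"],
]  # placeholder tokenised social media posts
model = Word2Vec(crawled, vector_size=50, window=2, min_count=1, epochs=40)

model.wv.save("dialect_dictionary.kv")   # the vectorized dictionary on disk
print(model.wv.most_similar("lyoum", topn=2))
```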
9

Alothman, Manal Othman, Muhammad Badruddin Khan, and Mozaherul Hoque Abul Hasanat. "Review of Researches on Arabic Social Media Text Mining." Journal of Intelligent Systems and Computing 2, no. 1 (March 31, 2021): 20–33. http://dx.doi.org/10.51682/jiscom.00201005.2021.

Full text
Abstract:
Social media sites and applications have allowed people to share their comments, opinions, and points of view in different languages on a mass scale. Arabic is one of the languages that has seen a huge surge in the production of digital textual content. The Arabic content and its metadata are a goldmine of useful information for a wide variety of applications. A large number of researchers are working on Arabic data in various domains of research, such as natural language processing, sentiment analysis, event detection, named entity recognition, etc. This article presents a review of a number of such studies conducted between 2014 and 2019 that drew their data from social media websites. We found that Twitter was the most used source of data for dataset construction among Arabic text mining researchers. Our study also found that the Support Vector Machine (SVM) and Naïve Bayesian (NB) classifiers were the most used classifiers in previous research. Moreover, the results of the previous studies indicate that the SVM classifier provided the best performance compared to other classifiers.
APA, Harvard, Vancouver, ISO, and other styles
10

Bessou, Sadik, and Racha Sari. "Efficient Discrimination between Arabic Dialects." Recent Advances in Computer Science and Communications 13, no. 4 (October 19, 2020): 725–30. http://dx.doi.org/10.2174/2213275912666190716115604.

Full text
Abstract:
Background: With the explosion of communication technologies and the accompanying pervasive use of social media, we notice an outstanding proliferation of posts, reviews, comments, and other forms of expression in different languages. This content has attracted researchers from different fields: economics, political science, social science, psychology, and particularly language processing. One of the prominent subjects is the discrimination between similar languages and dialects using natural language processing and machine learning techniques. The problem is usually addressed by formulating the identification as a classification task. Methods: The approach is based on machine learning classification methods to discriminate between Modern Standard Arabic (MSA) and four regional Arabic dialects: Egyptian, Levantine, Gulf, and North-African. Several models were trained to discriminate between the studied dialects on large corpora mined from online Arabic newspapers and manually annotated. Results: Experimental results showed that n-gram features could substantially improve performance. Logistic regression based on a character and word n-gram model using count vectors identified the handled dialects with an overall accuracy of 95%. The best results were achieved with a linear support vector classifier using TF-IDF vectors trained on character-based unigrams, bigrams, and trigrams, and word-based unigrams and bigrams, with an overall accuracy of 95.1%. Conclusion: The results showed that n-gram features could substantially improve performance. Additionally, we noticed that the kind of data representation can provide a significant performance boost compared to simpler representations.
APA, Harvard, Vancouver, ISO, and other styles
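A compact sketch of the best-performing configuration reported above, TF-IDF over character and word n-grams feeding a linear SVM; the two-sentence corpus is an illustrative placeholder, not the paper's newspaper data.

```python
# Hedged sketch: character + word n-gram TF-IDF features with a linear SVM.
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])
clf = make_pipeline(features, LinearSVC())

texts = ["إزيك عامل إيه", "شلونك شخبارك"]   # Egyptian vs Gulf placeholders
labels = ["EGY", "GLF"]
clf.fit(texts, labels)
print(clf.predict(["عامل إيه يا باشا"]))
```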

Dissertations / Theses on the topic "Arabic language – Data processing"

1

Hamrouni, Nadia. "Structure and Processing in Tunisian Arabic: Speech Error Data." Diss., The University of Arizona, 2010. http://hdl.handle.net/10150/195969.

Full text
Abstract:
This dissertation presents experimental research on speech errors in Tunisian Arabic (TA). The central empirical questions revolve around properties of 'exchange errors'. These errors can mis-order lexical, morphological, or sound elements in a variety of patterns. TA's nonconcatenative morphology shows interesting interactions of phrasal and lexical constraints with morphological structure during language production and affords different and revealing error potentials linking the production system with linguistic knowledge. The dissertation studies expand and test generalizations based on Abd-El-Jawad and Abu-Salim's (1987) study of spontaneous speech errors in Jordanian Arabic by experimentally examining apparent regularities in the data from a real-time language processing perspective. The studies address alternative accounts of error phenomena that have figured prominently in accounts of production processing. Three experiments were designed and conducted based on an error elicitation paradigm used by Ferreira and Humphreys (2001). Experiment 1 tested within-phrase exchange errors, focusing on root versus non-root exchanges and lexical versus non-lexical outcomes for root and non-root errors. Experiments 2 and 3 addressed between-phrase exchange errors, focusing on violations of the Grammatical Category Constraint (GCC). The study of exchange potentials for the within-phrase items (experiment 1) contrasted lexical and non-lexical outcomes. The expectation was that these would include a significant number of root exchanges and that the lexical status of the resulting forms would not preclude error. Results show that root and vocalic pattern exchanges were very rare and that word forms rather than root forms were the dominant influence in the experimental performance. On the other hand, the study of exchange errors across phrasal boundaries of items that do or do not correspond in grammatical category (experiments 2 and 3) pursued two principal questions, one concerning the error rate and the second concerning the error elements. The expectation was that the errors would predominantly come from grammatical category matches. That outcome would reinforce the interpretation that processing operations reflect the assignment of syntactically labeled elements to their location in phrasal structures. Results corroborated this expectation. However, exchange errors involving words of different grammatical categories were also frequent. This has implications for speech monitoring models and the automaticity of the GCC.
APA, Harvard, Vancouver, ISO, and other styles
2

Bakheet, Mohammed. "Improving Speech Recognition for Arabic language Using Low Amounts of Labeled Data." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176437.

Full text
Abstract:
The importance of Automatic Speech Recognition (ASR) systems, whose job is to generate text from audio, is increasing as the number of applications of these systems rapidly grows. However, when it comes to training ASR systems, the process is difficult and rather tedious, which can be attributed to the lack of training data. ASRs require huge amounts of annotated training data containing the audio files and the corresponding accurately written transcript files. This annotated (labeled) training data is very difficult to find for most languages; it usually requires people to perform the annotation manually, which, apart from the monetary cost, is error-prone. A supervised training task is impractical for this scenario. The Arabic language is one of the languages that do not have an abundance of labeled data, which makes its ASR systems' accuracy very low compared to resource-rich languages such as English, French, or Spanish. In this research, we take advantage of unlabeled voice data by learning general data representations from unlabeled training data (audio files only) in a self-supervised task or pre-training phase. This phase is done using the wav2vec 2.0 framework, which masks out input in the latent space and solves a contrastive task. The model is then fine-tuned on small amounts of labeled data. We also exploit models that have been pre-trained on different languages with wav2vec 2.0, fine-tuning them on Arabic using annotated Arabic data. We show that using the wav2vec 2.0 framework for pre-training on Arabic is considerably time- and resource-consuming. It took the model 21.5 days (about 3 weeks) to complete 662 epochs and reach a validation accuracy of 58%. Arabic is a right-to-left (RTL) language with many diacritics that indicate how letters should be pronounced; these two features make it difficult for Arabic to fit into these models, as it requires heavy pre-processing of the transcript files. We demonstrate that we can fine-tune a cross-lingual model, trained on raw waveforms of speech in multiple languages, on Arabic data and obtain a word error rate of 36.53%. We also show that by fine-tuning the model parameters we can increase the accuracy, and thus decrease the word error rate, from 54.00% to 36.69%.
APA, Harvard, Vancouver, ISO, and other styles
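An inference-side sketch of the approach: a wav2vec 2.0 model pre-trained cross-lingually and fine-tuned on labeled Arabic speech, decoded greedily. The checkpoint name and audio file are assumptions; any Arabic-fine-tuned wav2vec 2.0 checkpoint could be substituted.

```python
# Hedged sketch: greedy CTC transcription with a fine-tuned wav2vec 2.0
# model. Requires the transformers and torchaudio packages.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

CKPT = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"  # assumed checkpoint
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT)

waveform, sr = torchaudio.load("arabic_sample.wav")     # placeholder file
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])   # greedy CTC decoding
```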
3

Al-Nashashibi, May Y. A. "Arabic Language Processing for Text Classification. Contributions to Arabic Root Extraction Techniques, Building An Arabic Corpus, and to Arabic Text Classification Techniques." Thesis, University of Bradford, 2012. http://hdl.handle.net/10454/6326.

Full text
Abstract:
The impact and dynamics of Internet-based resources for Arabic-speaking users are increasing in significance, depth, and breadth at a higher pace than ever, and thus require updated mechanisms for the computational processing of Arabic texts. Arabic is a complex language and as such requires in-depth investigation for the analysis and improvement of available automatic processing techniques, such as root extraction methods or text classification techniques, and for developing text collections that are already labeled, whether with single or multiple labels. This thesis proposes new ideas and methods to improve available automatic processing techniques for Arabic texts. Any automatic processing technique requires data in order to be used, critically reviewed, and assessed, and here an attempt to develop a labeled Arabic corpus is also proposed. The thesis is composed of three parts: 1) Arabic corpus development, 2) proposing, improving, and implementing root extraction techniques, and 3) proposing and investigating the effect of different pre-processing methods on single-labeled text classification methods for Arabic. The thesis first develops an Arabic corpus that is prepared to be used here for testing root extraction methods as well as single-label text classification techniques. It also enhances a rule-based root extraction method by handling irregular cases (which appear in about 34% of texts). It proposes and implements two expanded algorithms as well as an adjustment for a weight-based method, incorporates the irregular-case handling algorithm into all of them, and compares the performance of these proposed methods with the original ones. The thesis thus develops a root extraction system that handles foreign Arabized words by constructing a list of about 7,000 foreign words. The technique with the best accuracy in extracting the correct stem and root for the respective words in texts, an enhanced rule-based method, is used in the third part of the thesis. The thesis finally proposes and implements a variant term frequency-inverse document frequency (TF-IDF) weighting method, and investigates the effect of using different choices of features in document representation (words, stems, or roots, as well as extending these choices with their respective phrases) on single-label text classification performance. The thesis applies forty-seven classifiers to all proposed representations and compares their performances. One challenge for researchers in Arabic text processing is that the root extraction techniques reported in the literature are either not accessible or require a long time to be reproduced, while a labeled benchmark Arabic text corpus is not fully available online. Also, to date few machine learning techniques have been investigated for Arabic with the usual preprocessing steps before classification. Such challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques. Results show that proposing and implementing an algorithm that handles irregular words in Arabic did improve the performance of all implemented root extraction techniques. The performance of the algorithm that handles such irregular cases is evaluated in terms of accuracy improvement and execution time. Its efficiency is investigated with different document lengths and is empirically found to be linear in time for document lengths less than about 8,000.
The rule-based technique improves the most among the implemented root extraction methods when the irregular-case handling algorithm is included. The thesis validates that choosing roots or stems instead of words in document representations indeed improves single-label classification performance significantly for most of the classifiers used. However, extending such representations with their respective phrases shows no significant improvement in single-label text classification performance. Many classifiers, such as the ripple-down rule classifier, had not yet been tested on Arabic. The comparison of the classifiers' performances concludes that the Bayesian network classifier is significantly the best in terms of accuracy, training time, and root mean square error values for all proposed and implemented representations.
Petra University, Amman (Jordan)
APA, Harvard, Vancouver, ISO, and other styles
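As a toy illustration of rule-based Arabic stemming (far simpler than the enhanced method with irregular-case handling developed in the thesis), a light stemmer might strip common prefixes and suffixes, longest first; the affix lists below are illustrative assumptions.

```python
# Hedged sketch of rule-based Arabic light stemming.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ب", "ك", "ف", "ل"]
SUFFIXES = ["ات", "ون", "ين", "ها", "هم", "ة", "ه", "ي"]

def light_stem(word: str) -> str:
    # Strip at most one prefix and one suffix, keeping at least 3 letters
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والمكتبات"))   # -> مكتب (after stripping وال and ات)
```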
4

Alabbas, Maytham Abualhail Shahed. "Textual entailment for modern standard Arabic." Thesis, University of Manchester, 2013. https://www.research.manchester.ac.uk/portal/en/theses/textual-entailment-for-modern-standard-arabic(9e053b1a-0570-4c30-9100-3d9c2ba86d8c).html.

Full text
Abstract:
This thesis explores a range of approaches to the task of recognising textual entailment (RTE), i.e. determining whether one text snippet entails another, for Arabic, where we are faced with an exceptional level of lexical and structural ambiguity. To the best of our knowledge, this is the first attempt to carry out this task for Arabic. Tree edit distance (TED) has been widely used as a component of natural language processing (NLP) systems that attempt to achieve the goal above, with the distance between pairs of dependency trees being taken as a measure of the likelihood that one entails the other. Such a technique relies on having accurate linguistic analyses. Obtaining such analyses for Arabic is notoriously difficult. To overcome these problems we have investigated strategies for improving tagging and parsing based on system combination techniques. These strategies lead to substantially better performance than any of the contributing tools. We also describe a semi-automatic technique for creating a first dataset for RTE for Arabic, using an extension of the ‘headline-lead paragraph’ technique, because there are, again to the best of our knowledge, no such datasets available. We sketch the difficulties inherent in judgments by volunteer annotators, and describe a regime to ameliorate some of these. The major contribution of this thesis is the introduction of two ways of improving the standard TED: (i) we present a novel approach, extended TED (ETED), for extending the standard TED algorithm for calculating the distance between two trees by allowing operations to apply to subtrees, rather than just to single nodes. This leads to useful improvements over the performance of the standard TED for determining entailment. The key here is that subtrees tend to correspond to single information units. By treating operations on subtrees as less costly than the corresponding set of individual node operations, ETED concentrates on entire information units, which are a more appropriate granularity than individual words for considering entailment relations; and (ii) we use the artificial bee colony (ABC) algorithm to automatically estimate the cost of edit operations for single nodes and subtrees and to determine thresholds, since assigning an appropriate cost to each edit operation manually can become a tricky task. The current findings are encouraging. These extensions can substantially affect the F-score and accuracy and achieve a better RTE model when compared with a number of string-based algorithms and the standard TED approaches. The relative performance of the standard techniques on our Arabic test set replicates the results reported for these techniques on English test sets. We have also applied ETED with ABC to the English RTE2 test set, where it again outperforms the standard TED.
APA, Harvard, Vancouver, ISO, and other styles
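The TED component at the core of such a system can be sketched with the Zhang-Shasha algorithm from the `zss` package; the thesis's ETED extension to subtree operations is not shown, and the toy dependency trees are placeholders.

```python
# Hedged sketch: node-level tree edit distance for a text/hypothesis pair.
from zss import Node, simple_distance

# Toy dependency trees (labels are placeholder romanisations)
text = (Node("zara")                  # "visited"
        .addkid(Node("alwazir"))      # subject
        .addkid(Node("almadina")))    # object
hypothesis = (Node("zara")
              .addkid(Node("alwazir")))

# A low distance suggests the hypothesis is close to being entailed
print(simple_distance(text, hypothesis))   # -> 1.0 (one node deleted)
```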
5

Khaliq, Bilal. "Unsupervised learning of Arabic non-concatenative morphology." Thesis, University of Sussex, 2015. http://sro.sussex.ac.uk/id/eprint/53865/.

Full text
Abstract:
Unsupervised approaches to learning the morphology of a language play an important role in computer processing of language from both practical and theoretical perspectives, due to their minimal reliance on manually produced linguistic resources and human annotation. Such approaches have been widely researched for the problem of concatenative affixation, but less attention has been paid to the intercalated (non-concatenative) morphology exhibited by Arabic and other Semitic languages. The aim of this research is to learn the root and pattern morphology of Arabic, with accuracy comparable to manually built morphological analysis systems. The approach is kept free from human supervision or manual parameter setting, assuming only that roots and patterns intertwine to form a word. Promising results were obtained by applying a technique adapted from previous work in concatenative morphology learning, which uses machine learning to determine relatedness between words. The output, with probabilistic relatedness values between words, was then used to rank all possible roots and patterns to form a lexicon. Analysis using triliteral roots resulted in a correct root identification accuracy of approximately 86% for inflected words. Although the machine learning-based approach is effective, it is conceptually complex. So an alternative, simpler and computationally more efficient approach was then devised to obtain morpheme scores based on comparative counts of roots and patterns. In this approach, root and pattern scores are defined in terms of each other in a mutually recursive relationship, converging to an optimized morpheme ranking. This technique gives slightly better accuracy while being conceptually simpler and more efficient. The approach, after further enhancements, was evaluated on a version of the Quranic Arabic Corpus, attaining a final accuracy of approximately 93%. A comparative evaluation shows this to be superior to two existing, widely used, manually built Arabic stemmers, thus demonstrating the practical feasibility of unsupervised learning of non-concatenative morphology.
APA, Harvard, Vancouver, ISO, and other styles
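A simplified sketch of the mutually recursive scoring idea: every choice of three letter positions in a word yields a (root, pattern) candidate; root scores are summed pattern scores and vice versa, iterated toward a fixed point. Real systems add many constraints omitted here, and the word list is a placeholder.

```python
# Hedged sketch of mutually recursive root/pattern scoring.
from itertools import combinations
from collections import defaultdict

words = ["كتب", "كاتب", "مكتوب", "درس", "دارس", "مدروس"]

def candidates(word):
    """All (root, pattern) pairs from choosing 3 letter positions as root."""
    for idx in combinations(range(len(word)), 3):
        root = "".join(word[i] for i in idx)
        pattern = "".join("R" if i in idx else c for i, c in enumerate(word))
        yield root, pattern

root_score = defaultdict(lambda: 1.0)
patt_score = defaultdict(lambda: 1.0)
for _ in range(10):                       # iterate toward a fixed point
    new_r, new_p = defaultdict(float), defaultdict(float)
    for w in words:
        for r, p in candidates(w):
            new_r[r] += patt_score[p]     # root score from its patterns
            new_p[p] += root_score[r]     # pattern score from its roots
    root_score, patt_score = new_r, new_p

def best(w):
    return max(candidates(w),
               key=lambda rp: root_score[rp[0]] * patt_score[rp[1]])

print(best("مكتوب"))   # expected to favour the root كتب
```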
6

Grinman, Alex J. "Natural language processing on encrypted patient data." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/113438.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 85-86).
While many industries can benefit from machine learning techniques for data analysis, they often have neither the technical expertise nor the computational power to do so. Therefore, many organizations would benefit from outsourcing their data analysis. Yet, stringent data privacy policies prevent outsourcing sensitive data and may stop the delegation of data analysis in its tracks. In this thesis, we put forth a two-party system where one party capable of powerful computation can run certain machine learning algorithms from the natural language processing domain on the second party's data, where the first party is limited to learning only specific functions of the second party's data and nothing else. Our system provides simple cryptographic schemes for locating keywords, matching approximate regular expressions, and computing frequency analysis on encrypted data. We present a full implementation of this system in the form of an extensible software library and a command-line interface. Finally, we discuss a medical case study where we used our system to run a suite of unmodified machine learning algorithms on encrypted free-text patient notes.
by Alex J. Grinman.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
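A toy sketch in the spirit of the keyword-location scheme (not the thesis's actual construction): both parties derive deterministic HMAC tokens from a shared key, so the server can match query tokens against an uploaded index without ever seeing plaintext. The key and note are placeholders.

```python
# Hedged sketch of searchable-encryption-style keyword matching, stdlib only.
import hmac
import hashlib

KEY = b"shared-secret-key"             # placeholder shared key

def token(word: str) -> str:
    """Deterministic keyed token for a word; equal words give equal tokens."""
    return hmac.new(KEY, word.lower().encode(), hashlib.sha256).hexdigest()

# Client side: upload only tokens, never plaintext
note = "patient reports chest pain and dizziness"
encrypted_index = {token(w) for w in note.split()}

# Server side: match a query token against the index
query = token("dizziness")
print(query in encrypted_index)        # -> True
```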
7

Alamry, Ali. "Grammatical Gender Processing in Standard Arabic as a First and a Second Language." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/39965.

Full text
Abstract:
The present dissertation investigates grammatical gender representation and processing in Modern Standard Arabic (MSA) as a first (L1) and a second (L2) language. It mainly examines whether L2 speakers can process gender agreement in a native-like manner, and the extent to which L2 processing is influenced by the properties of the L2 speakers' L1. Additionally, it examines whether L2 gender agreement processing is influenced by noun animacy (animate and inanimate) and word order (verb-subject and subject-verb). A series of experiments using both online and offline techniques were conducted to address these questions. In all of the experiments, gender agreement between verbs and nouns was examined. The first series of experiments examined native speakers of MSA (n=49) using a self-paced reading (SPR) task, an event-related potential (ERP) experiment, and a grammaticality judgment (GJ) task. Results of these experiments revealed that native speakers were sensitive to grammatical violations. Native speakers showed longer reaction times (RTs) in the SPR task, and a P600 effect in the ERP experiment, in response to sentences with mismatched gender agreement as compared to sentences with matched gender agreement. They also performed at ceiling in the GJ task. The second series of experiments examined L2 speakers of MSA (n=74) using an SPR task and a GJ task. Both experiments included adult L2 speakers who were divided into two subgroups, -Gender and +Gender, based on whether or not their L1s have a grammatical gender system. The results of both experiments revealed that both groups were sensitive to gender agreement violations. The L2 speakers showed longer RTs, in the SPR task, in response to sentences with mismatched gender agreement as compared to sentences with matched gender agreement. No difference was found between the L2 groups in this task. The L2 speakers also performed well in the GJ task, as they were able to correctly identify the grammatical and ungrammatical sentences. Interestingly, in this task the -Gender group outperformed the +Gender group, which could be due to proficiency in the L2, as the former group obtained a better score on the proficiency task, or to negative transfer in the +Gender group from their L1s. Based on the results of these two experiments, this dissertation argues that late L2 speakers are not restricted to their L1 grammar, and thus are able to acquire the gender agreement system of their L2 even if this feature is not instantiated in their L1. The results provide converging evidence for the FTFA rather than the FFFH model, as it appears that the -Gender group was able to reset their L1 gender parameter according to the L2 gender values. Although the L2 speakers were advanced, they showed slower RTs than the native speakers in the SPR task, and lower accuracy in the GJ task. However, it is possible that they are still in the process of acquiring the gender agreement system of MSA and have not reached the final stage of acquisition. This is supported by the fact that some L2 speakers from both the -Gender and +Gender groups performed as well as native speakers in both the SPR and GJ tasks. Regarding the effect of animacy, the L2 speakers had slower RTs and lower accuracy on sentences with inanimate nouns than on those with animate ones, which is in line with previous L2 studies (Anton-Medez, 1999; Alarcón, 2009; Gelin & Bugaiska, 2014). The native speakers, on the other hand, showed no effect of animacy in either the SPR task or the GJ task.
Further, no N400 effect was observed as a result of semantic gender agreement violations in the ERP experiment. Finally, the results revealed a potential effect of word order. Both the native and L2 speakers showed longer RTs in VS word order than in SV word order in the SPR task. Further, the native speakers showed an earlier and greater P600 effect in VS word order than in SV word order in the ERP experiment. This result suggests that processing gender agreement violations is more complex in the VS word order than in the SV word order, due to the inherent asymmetry in the subject-verb agreement system between the two word orders in MSA.
APA, Harvard, Vancouver, ISO, and other styles
8

余銘龍 and Ming-lung Yu. "Automatic processing of Chinese language bank cheques." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2002. http://hub.hku.hk/bib/B31225548.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Hellmann, Sebastian. "Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data." Doctoral thesis, Universitätsbibliothek Leipzig, 2015. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-157932.

Full text
Abstract:
This thesis is a compendium of scientific works and engineering specifications that have been contributed to a large community of stakeholders to be copied, adapted, mixed, built upon and exploited in any way possible to achieve a common goal: Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data The explosion of information technology in the last two decades has led to a substantial growth in quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked with each other and the last few years have seen the emergence of numerous approaches in various disciplines concerned with linguistic resources and NLP tools. It is the challenge of our time to store, interlink and exploit this wealth of data accumulated in more than half a century of computational linguistics, of empirical, corpus-based study of language, and of computational lexicography in all its heterogeneity. The vision of the Giant Global Graph (GGG) was conceived by Tim Berners-Lee, aiming at connecting all data on the Web and allowing the discovery of new relations between this openly-accessible data. This vision has been pursued by the Linked Open Data (LOD) community, where the cloud of published datasets comprises 295 data repositories and more than 30 billion RDF triples (as of September 2011). RDF is based on globally unique and accessible URIs and it was specifically designed to establish links between such URIs (or resources). This is captured in the Linked Data paradigm that postulates four rules: (1) Referred entities should be designated by URIs, (2) these URIs should be resolvable over HTTP, (3) data should be represented by means of standards such as RDF, (4) and a resource should include links to other resources. Although it is difficult to precisely identify the reasons for the success of the LOD effort, advocates generally argue that open licenses as well as open access are key enablers for the growth of such a network as they provide a strong incentive for collaboration and contribution by third parties. In his keynote at BNCOD 2011, Chris Bizer argued that with RDF the overall data integration effort can be “split between data publishers, third parties, and the data consumer”, a claim that can be substantiated by observing the evolution of many large data sets constituting the LOD cloud. As written in the acknowledgement section, parts of this thesis have received extensive feedback from other scientists, practitioners and industry in many different ways. The main contributions of this thesis are summarized here: Part I – Introduction and Background. During his keynote at the Language Resource and Evaluation Conference in 2012, Sören Auer stressed the decentralized, collaborative, interlinked and interoperable nature of the Web of Data. The keynote provides strong evidence that Semantic Web technologies such as Linked Data are on their way to become main stream for the representation of language resources. The jointly written companion publication for the keynote was later extended as a book chapter in The People’s Web Meets NLP and serves as the basis for “Introduction” and “Background”, outlining some stages of the Linked Data publication and refinement chain. Both chapters stress the importance of open licenses and open access as an enabler for collaboration, the ability to interlink data on the Web as a key feature of RDF as well as provide a discussion about scalability issues and decentralization. 
Furthermore, we elaborate on how conceptual interoperability can be achieved by (1) re-using vocabularies, (2) agile ontology development, (3) meetings to refine and adapt ontologies and (4) tool support to enrich ontologies and match schemata. Part II - Language Resources as Linked Data. “Linked Data in Linguistics” and “NLP & DBpedia, an Upward Knowledge Acquisition Spiral” summarize the results of the Linked Data in Linguistics (LDL) Workshop in 2012 and the NLP & DBpedia Workshop in 2013 and give a preview of the MLOD special issue. In total, five proceedings – three published at CEUR (OKCon 2011, WoLE 2012, NLP & DBpedia 2013), one Springer book (Linked Data in Linguistics, LDL 2012) and one journal special issue (Multilingual Linked Open Data, MLOD to appear) – have been (co-)edited to create incentives for scientists to convert and publish Linked Data and thus to contribute open and/or linguistic data to the LOD cloud. Based on the disseminated call for papers, 152 authors contributed one or more accepted submissions to our venues and 120 reviewers were involved in peer-reviewing. “DBpedia as a Multilingual Language Resource” and “Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked Data Cloud” contain this thesis’ contribution to the DBpedia Project in order to further increase the size and inter-linkage of the LOD Cloud with lexical-semantic resources. Our contribution comprises extracted data from Wiktionary (an online, collaborative dictionary similar to Wikipedia) in more than four languages (now six) as well as language-specific versions of DBpedia, including a quality assessment of inter-language links between Wikipedia editions and internationalized content negotiation rules for Linked Data. In particular the work described in created the foundation for a DBpedia Internationalisation Committee with members from over 15 different languages with the common goal to push DBpedia as a free and open multilingual language resource. Part III - The NLP Interchange Format (NIF). “NIF 2.0 Core Specification”, “NIF 2.0 Resources and Architecture” and “Evaluation and Related Work” constitute one of the main contribution of this thesis. The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The core specification is included in and describes which URI schemes and RDF vocabularies must be used for (parts of) natural language texts and annotations in order to create an RDF/OWL-based interoperability layer with NIF built upon Unicode Code Points in Normal Form C. In , classes and properties of the NIF Core Ontology are described to formally define the relations between text, substrings and their URI schemes. contains the evaluation of NIF. In a questionnaire, we asked questions to 13 developers using NIF. UIMA, GATE and Stanbol are extensible NLP frameworks and NIF was not yet able to provide off-the-shelf NLP domain ontologies for all possible domains, but only for the plugins used in this study. After inspecting the software, the developers agreed however that NIF is adequate enough to provide a generic RDF output based on NIF using literal objects for annotations. All developers were able to map the internal data structure to NIF URIs to serialize RDF output (Adequacy). 
The development effort in hours (ranging between 3 and 40 hours) as well as the number of code lines (ranging between 110 and 445) suggest that the implementation of NIF wrappers is easy and fast for an average developer. Furthermore the evaluation contains a comparison to other formats and an evaluation of the available URI schemes for web annotation. In order to collect input from the wide group of stakeholders, a total of 16 presentations were given with extensive discussions and feedback, which has led to a constant improvement of NIF from 2010 until 2013. After the release of NIF (Version 1.0) in November 2011, a total of 32 vocabulary employments and implementations for different NLP tools and converters were reported (8 by the (co-)authors, including Wiki-link corpus, 13 by people participating in our survey and 11 more, of which we have heard). Several roll-out meetings and tutorials were held (e.g. in Leipzig and Prague in 2013) and are planned (e.g. at LREC 2014). Part IV - The NLP Interchange Format in Use. “Use Cases and Applications for NIF” and “Publication of Corpora using NIF” describe 8 concrete instances where NIF has been successfully used. One major contribution in is the usage of NIF as the recommended RDF mapping in the Internationalization Tag Set (ITS) 2.0 W3C standard and the conversion algorithms from ITS to NIF and back. One outcome of the discussions in the standardization meetings and telephone conferences for ITS 2.0 resulted in the conclusion that there was no alternative RDF format or vocabulary other than NIF with the required features to fulfill the working group charter. Five further uses of NIF are described for the Ontology of Linguistic Annotations (OLiA), the RDFaCE tool, the Tiger Corpus Navigator, the OntosFeeder and visualisations of NIF using the RelFinder tool. These 8 instances provide an implemented proof-of-concept of the features of NIF. starts with describing the conversion and hosting of the huge Google Wikilinks corpus with 40 million annotations for 3 million web sites. The resulting RDF dump contains 477 million triples in a 5.6 GB compressed dump file in turtle syntax. describes how NIF can be used to publish extracted facts from news feeds in the RDFLiveNews tool as Linked Data. Part V - Conclusions. provides lessons learned for NIF, conclusions and an outlook on future work. Most of the contributions are already summarized above. One particular aspect worth mentioning is the increasing number of NIF-formated corpora for Named Entity Recognition (NER) that have come into existence after the publication of the main NIF paper Integrating NLP using Linked Data at ISWC 2013. These include the corpora converted by Steinmetz, Knuth and Sack for the NLP & DBpedia workshop and an OpenNLP-based CoNLL converter by Brümmer. Furthermore, we are aware of three LREC 2014 submissions that leverage NIF: NIF4OGGD - NLP Interchange Format for Open German Governmental Data, N^3 – A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format and Global Intelligent Content: Active Curation of Language Resources using Linked Data as well as an early implementation of a GATE-based NER/NEL evaluation framework by Dojchinovski and Kliegr. Further funding for the maintenance, interlinking and publication of Linguistic Linked Data as well as support and improvements of NIF is available via the expiring LOD2 EU project, as well as the CSA EU project called LIDER, which started in November 2013. 
Based on the evidence of successful adoption presented in this thesis, we can expect a decent to high chance of reaching critical mass of Linked Data technology as well as the NIF standard in the field of Natural Language Processing and Language Resources.
APA, Harvard, Vancouver, ISO, and other styles
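A minimal rdflib sketch of NIF-style annotation: a context and a substring identified by character-offset URIs and typed with NIF Core classes. The document URI and text are placeholders; the namespace is the published NIF Core namespace.

```python
# Hedged sketch: NIF-style RDF annotation of a substring with rdflib.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

NIF = Namespace("http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#")
text = "Leipzig is a city."

g = Graph()
doc = URIRef("http://example.org/doc#char=0,18")       # whole-text context
g.add((doc, RDF.type, NIF.Context))
g.add((doc, NIF.isString, Literal(text)))

mention = URIRef("http://example.org/doc#char=0,7")    # the span "Leipzig"
g.add((mention, RDF.type, NIF.String))
g.add((mention, NIF.referenceContext, doc))
g.add((mention, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((mention, NIF.endIndex, Literal(7, datatype=XSD.nonNegativeInteger)))
g.add((mention, NIF.anchorOf, Literal("Leipzig")))

print(g.serialize(format="turtle"))
```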

Books on the topic "Arabic language – Data processing"

1

al-Fattāḥ, Abrāham ʻAbd, and Kullīyat al-Ādāb bi-Manūbah, eds. Proceedings of the international symposium on processing Arabic, April 18th–20th, 2002. [Manouba, Tunisia]: University of Manouba, Faculty of Arts, 2005.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Descout, Raymond, ed. Applied Arabic linguistics and signal & information processing. Washington: Hemisphere Pub. Corp., 1987.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
3

MacKay, Pierre A., ed. Computers and the Arabic language. New York: Hemisphere Pub. Corp., 1990.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
4

Mūsá, Nuhād. al-ʻArabīyah: Naḥwa tawṣīf jadīd fī ḍawʾ al-lisānīyāt al-ḥāsūbīyah. Bayrūt: al-Muʾassasah al-ʻArabīyah lil-Dirāsāt wa-al-Nashr, 2000.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
5

al-ʻArabīyah: Naḥwa tawṣīf jadīd fī ḍawʼ al-lisānīyāt al-ḥāsūbīyah. Bayrūt: al-Muʼassasah al-ʻArabīyah lil-Dirāsāt wa-al-Nashr, 2000.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
6

Haddad, E. W. A dictionary of data processing and computer terms: English-French-Arabic: with French-English vocabulary and Arabic index. [Beirut]: Libr. du Liban, 1987.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
7

Haddad, E. W. A new dictionary of computer and data processing terms: English-Arabic with English abbreviations and Arabic glossary. Beirut: Librairie du Liban, 1988.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
8

Farghaly, Ali Ahmed Sabry. Arabic computational linguistics. Stanford, Calif: CSLI Publications, Center for the Study of Language and Information, 2010.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
9

al-Rasūl, Nāẓim Ibrāhīm ʻAbd. Al masdar: Glossary of computer terms. English-Arabic. [Great Britain]: Sahara Publications, 1985.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
10

Haddad, E. W. Dictionnaire de l'informatique français-arabe: Avec abréviations françaises et glossaire arabe. Beyrouth: Librairie du Liban, 1989.

Find full text
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Arabic language – Data processing"

1

Habash, Nizar Y. "Arabic Script." In Introduction to Arabic Natural Language Processing, 5–26. Cham: Springer International Publishing, 2010. http://dx.doi.org/10.1007/978-3-031-02139-8_2.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Habash, Nizar Y. "Arabic Syntax." In Introduction to Arabic Natural Language Processing, 93–112. Cham: Springer International Publishing, 2010. http://dx.doi.org/10.1007/978-3-031-02139-8_6.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Habash, Nizar Y. "Arabic Morphology." In Introduction to Arabic Natural Language Processing, 39–63. Cham: Springer International Publishing, 2010. http://dx.doi.org/10.1007/978-3-031-02139-8_4.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Habash, Nizar Y. "What is “Arabic”?" In Introduction to Arabic Natural Language Processing, 1–4. Cham: Springer International Publishing, 2010. http://dx.doi.org/10.1007/978-3-031-02139-8_1.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Siddiqui, Sanjeera, Azza Abdel Monem, and Khaled Shaalan. "Sentiment Analysis in Arabic." In Natural Language Processing and Information Systems, 409–14. Cham: Springer International Publishing, 2016. http://dx.doi.org/10.1007/978-3-319-41754-7_41.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Habash, Nizar Y. "Arabic Phonology and Orthography." In Introduction to Arabic Natural Language Processing, 27–37. Cham: Springer International Publishing, 2010. http://dx.doi.org/10.1007/978-3-031-02139-8_3.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

el Aissati, Abderrahman. "Language shift and sentence processing in Moroccan Arabic." In Language Choices, 77. Amsterdam: John Benjamins Publishing Company, 1997. http://dx.doi.org/10.1075/impact.1.08ais.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Nehar, Attia, Djelloul Ziadi, and Hadda Cherroun. "Rational Kernels for Arabic Text Classification." In Statistical Language and Speech Processing, 176–87. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-39593-2_16.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Ramsay, Allan, and Hanady Mansour. "Local Constraints on Arabic Word Order." In Advances in Natural Language Processing, 447–57. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006. http://dx.doi.org/10.1007/11816508_45.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Boudabous, Mohamed Mahdi, Mohamed Hédi Maaloul, and Lamia Hadrich Belguith. "Digital Learning for Summarizing Arabic Documents." In Advances in Natural Language Processing, 79–84. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-14770-8_10.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Arabic language – Data processing"

1

El Kah, Anoual, Imad Zeroual, and Abdelhak Lakhouaja. "Application of Arabic language processing in language learning." In BDCA'17: 2nd international Conference on Big Data, Cloud and Applications. New York, NY, USA: ACM, 2017. http://dx.doi.org/10.1145/3090354.3090390.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Gad-elrab, Mohamed H., Mohamed Amir Yosef, and Gerhard Weikum. "EDRAK: Entity-Centric Data Resource for Arabic Knowledge." In Proceedings of the Second Workshop on Arabic Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2015. http://dx.doi.org/10.18653/v1/w15-3224.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Huang, Fei. "Improved Arabic Dialect Classification with Social Media Data." In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2015. http://dx.doi.org/10.18653/v1/d15-1254.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Baly, Ramy, Gilbert Badaro, Georges El-Khoury, Rawan Moukalled, Rita Aoun, Hazem Hajj, Wassim El-Hajj, Nizar Habash, and Khaled Shaban. "A Characterization Study of Arabic Twitter Data with a Benchmarking for State-of-the-Art Opinion Mining Models." In Proceedings of the Third Arabic Natural Language Processing Workshop. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017. http://dx.doi.org/10.18653/v1/w17-1314.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Asbayou, Omar. "Arabic Location Name Annotations and Applications." In 9th International Conference on Natural Language Processing (NLP 2020). AIRCC Publishing Corporation, 2020. http://dx.doi.org/10.5121/csit.2020.101405.

Full text
Abstract:
This paper shows how location named entity (LNE) extraction and annotation, which is part of our named entity recognition (NER) systems, is an important task in managing large amounts of data. We explain the linguistic approach behind our rule-based LNE recognition and classification system, which relies on syntactico-semantic patterns. To reach good results, we take into account morpho-syntactic information provided by a morpho-syntactic analysis based on the DIINAR database, together with a syntactico-semantic classification of both location name trigger words (TW) and their extensions. Formally, different trigger word senses imply different syntactic entity structures. We also show the semantic data that our LNE recognition and classification system can provide to both information extraction (IE) and information retrieval (IR). The XML database output of the LNE system constitutes an important resource for IE and IR. Future work will improve this processing output in order to exploit it in computer-assisted translation (CAT).
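Purely to make the trigger-word idea concrete, here is a toy Python sketch of rule-based location extraction. The trigger words, the English example and the single pattern are invented for illustration; the authors' actual system works on Arabic text with DIINAR-based morpho-syntactic analysis and a much richer set of syntactico-semantic rules.

```python
import re

# Toy trigger words for location named entities (illustrative only; the
# real system uses Arabic trigger words classified by sense).
LOCATION_TRIGGERS = {"city", "province", "village", "district"}

# One toy pattern: a trigger word, optionally followed by "of", then a
# capitalized extension, e.g. "city of Manouba".
PATTERN = re.compile(
    r"\b(?P<trigger>" + "|".join(sorted(LOCATION_TRIGGERS)) + r")\s+(?:of\s+)?"
    r"(?P<extension>[A-Z][\w-]+)"
)

def extract_lnes(text: str) -> list[dict]:
    """Return location named entities as trigger/extension pairs with offsets."""
    return [
        {
            "trigger": m.group("trigger"),
            "extension": m.group("extension"),
            "span": m.span(),
        }
        for m in PATTERN.finditer(text)
    ]

print(extract_lnes("The faculty is in the city of Manouba."))
# [{'trigger': 'city', 'extension': 'Manouba', 'span': (22, 37)}]
```

A real system of this kind replaces the single regular expression with pattern families keyed to trigger-word senses, since, as the abstract notes, different senses imply different syntactic entity structures.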
APA, Harvard, Vancouver, ISO, and other styles
6

Abdulrahim, Dana. "Annotating corpus data for a quantitative, constructional analysis of motion verbs in Modern Standard Arabic." In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP). Stroudsburg, PA, USA: Association for Computational Linguistics, 2014. http://dx.doi.org/10.3115/v1/w14-3604.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Elmahdy, Mohamed, Rainer Gruhn, Wolfgang Minker, and Slim Abdennadher. "Effect of gaussian densities and amount of training data on grapheme-based acoustic modeling for Arabic." In 2009 International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE). IEEE, 2009. http://dx.doi.org/10.1109/nlpke.2009.5313727.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Castiñeira, David, Robert Toronyi, and Nansen Saleri. "Machine Learning and Natural Language Processing for Automated Analysis of Drilling and Completion Data." In SPE Kingdom of Saudi Arabia Annual Technical Symposium and Exhibition. Society of Petroleum Engineers, 2018. http://dx.doi.org/10.2118/192280-ms.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Khader, Mariam, Arafat Awajan, and Ghazi Al-Naymat. "The Effects of Natural Language Processing on Big Data Analysis: Sentiment Analysis Case Study." In 2018 International Arab Conference on Information Technology (ACIT). IEEE, 2018. http://dx.doi.org/10.1109/acit.2018.8672697.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Malmasi, Shervin, and Mark Dras. "Arabic Native Language Identification." In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP). Stroudsburg, PA, USA: Association for Computational Linguistics, 2014. http://dx.doi.org/10.3115/v1/w14-3625.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Reports on the topic "Arabic language – Data processing"

1

Tratz, Stephen C. Arabic Natural Language Processing System Code Library. Fort Belvoir, VA: Defense Technical Information Center, June 2014. http://dx.doi.org/10.21236/ada603814.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Davidson, Robert B., and Richard L. Hopely. Foreign Language Optical Character Recognition, Phase II: Arabic and Persian Training and Test Data Sets. Fort Belvoir, VA: Defense Technical Information Center, May 1997. http://dx.doi.org/10.21236/ada325444.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Onanian, Janice S. A Signal Processing Language for Coarse Grain Data flow Multiprocessors. Fort Belvoir, VA: Defense Technical Information Center, June 1989. http://dx.doi.org/10.21236/ada213863.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Salter, R., Quyen Dong, Cody Coleman, Maria Seale, Alicia Ruvinsky, LaKenya Walker, and W. Bond. Data Lake Ecosystem Workflow. Engineer Research and Development Center (U.S.), April 2021. http://dx.doi.org/10.21079/11681/40203.

Full text
Abstract:
The Engineer Research and Development Center, Information Technology Laboratory’s (ERDC-ITL’s) Big Data Analytics team specializes in the analysis of large-scale datasets with capabilities across four research areas that require vast amounts of data to inform and drive analysis: large-scale data governance, deep learning and machine learning, natural language processing, and automated data labeling. Unfortunately, data transfer between government organizations is a complex and time-consuming process requiring coordination of multiple parties across multiple offices and organizations. Past successes in large-scale data analytics have placed a significant demand on ERDC-ITL researchers, highlighting that few individuals fully understand how to successfully transfer data between government organizations; future project success therefore depends on a small group of individuals who can efficiently execute a complicated process. The Big Data Analytics team set out to develop a standardized workflow for the transfer of large-scale datasets to ERDC-ITL, in part to educate peers and future collaborators on the process required to transfer datasets between government organizations. Researchers also aim to increase workflow efficiency while protecting data integrity. This report provides an overview of the created Data Lake Ecosystem Workflow by focusing on the six phases required to efficiently transfer large datasets to supercomputing resources located at ERDC-ITL.
APA, Harvard, Vancouver, ISO, and other styles
5

Fuentes, Anthony, Michelle Michaels, and Sally Shoop. Methodology for the analysis of geospatial and vehicle datasets in the R language. Cold Regions Research and Engineering Laboratory (U.S.), November 2021. http://dx.doi.org/10.21079/11681/42422.

Full text
Abstract:
The challenge of autonomous off-road operations necessitates a robust understanding of the relationships between remotely sensed terrain data and vehicle performance. The implementation of statistical analyses on large geospatial datasets often requires the transition between multiple software packages that may not be open-source. The lack of a single, modular, and open-source analysis environment can reduce the speed and reliability of an analysis due to an increased number of processing steps. Here we present the capabilities of a workflow, developed in R, to perform a series of spatial and statistical analyses on vehicle and terrain datasets to quantify the relationship between sensor data and vehicle performance in winter conditions. We implemented the R-based workflow on datasets from a large, coordinated field campaign aimed at quantifying the response of military vehicles on snow-covered terrains. This script greatly reduces processing times of these datasets by combining the GIS, data-assimilation and statistical analyses steps into one efficient and modular interface.
APA, Harvard, Vancouver, ISO, and other styles
6

Volkova, Nataliia P., Nina O. Rizun, and Maryna V. Nehrey. Data science: opportunities to transform education. [s.n.], September 2019. http://dx.doi.org/10.31812/123456789/3241.

Full text
Abstract:
The article concerns the implementation of data science tools, including text mining and natural language processing algorithms, for increasing the value of higher education for the development of a modern and technologically flexible society. Data science is the field of study that involves tools, algorithms, and knowledge of mathematics and statistics to discover knowledge from raw data. Data science is developing fast and penetrating all spheres of life, and more people understand its importance and the need to implement it in everyday life. Data science is used in business for business analytics and production; in sales for offerings and sales forecasting; in marketing for customer personalization, purchasing recommendations, and digital marketing; in banking and insurance for risk assessment, fraud detection, and scoring; in medicine for disease forecasting, process automation, and patient health monitoring; and in tourism for price analysis, flight safety, opinion mining, etc. However, data science applications in education have been relatively limited, and many opportunities for advancing the field remain unexplored.
APA, Harvard, Vancouver, ISO, and other styles
7

Leavy, Michelle B., Danielle Cooke, Sarah Hajjar, Erik Bikelman, Bailey Egan, Diana Clarke, Debbie Gibson, Barbara Casanova, and Richard Gliklich. Outcome Measure Harmonization and Data Infrastructure for Patient-Centered Outcomes Research in Depression: Report on Registry Configuration. Agency for Healthcare Research and Quality (AHRQ), November 2020. http://dx.doi.org/10.23970/ahrqepcregistryoutcome.

Full text
Abstract:
Background: Major depressive disorder is a common mental disorder. Many pressing questions regarding depression treatment and outcomes exist, and new, efficient research approaches are necessary to address them. The primary objective of this project is to demonstrate the feasibility and value of capturing the harmonized depression outcome measures in the clinical workflow and submitting these data to different registries. Secondary objectives include demonstrating the feasibility of using these data for patient-centered outcomes research and developing a toolkit to support registries interested in sharing data with external researchers.

Methods: The harmonized outcome measures for depression were developed through a multi-stakeholder, consensus-based process supported by AHRQ. For this implementation effort, the PRIME Registry, sponsored by the American Board of Family Medicine, and PsychPRO, sponsored by the American Psychiatric Association, each recruited 10 pilot sites from existing registry sites, added the harmonized measures to the registry platform, and submitted the project for institutional review board review.

Results: The process of preparing each registry to calculate the harmonized measures produced three major findings. First, some clarifications were necessary to make the harmonized definitions operational. Second, some data necessary for the measures are not routinely captured in structured form (e.g., PHQ-9 item 9, adverse events, suicide ideation and behavior, and mortality data). Finally, capture of the PHQ-9 requires operational and technical modifications. The next phase of this project will focus on collection of the baseline and follow-up PHQ-9s, as well as other supporting clinical documentation. In parallel to the data collection process, the project team will examine the feasibility of using natural language processing to extract information on PHQ-9 scores, adverse events, and suicidal behaviors from unstructured data.

Conclusion: This pilot project represents the first practical implementation of the harmonized outcome measures for depression. Initial results indicate that it is feasible to calculate the measures within the two patient registries, although some challenges were encountered related to the harmonized definition specifications, the availability of the necessary data, and the clinical workflow for collecting the PHQ-9. The ongoing data collection period, combined with an evaluation of the utility of natural language processing for these measures, will produce more information about the practical challenges, value, and burden of using the harmonized measures in the primary care and mental health setting. These findings will be useful to inform future implementations of the harmonized depression outcome measures.
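As a rough sketch of what extracting PHQ-9 scores from unstructured notes might involve (an invented toy baseline, not the project's actual natural language processing pipeline), a pattern-based first pass could look like this:

```python
import re

# Hypothetical pattern: matches mentions such as "PHQ-9 score today: 14",
# "PHQ9: 7" or "PHQ-9 = 21" within 20 characters of the instrument name.
PHQ9_PATTERN = re.compile(r"\bPHQ[-\s]?9\b[^0-9]{0,20}?(\d{1,2})\b", re.IGNORECASE)

def extract_phq9_scores(note: str) -> list[int]:
    """Return plausible PHQ-9 total scores found in a free-text clinical note."""
    scores = []
    for match in PHQ9_PATTERN.finditer(note):
        value = int(match.group(1))
        if 0 <= value <= 27:  # valid range for a PHQ-9 total score
            scores.append(value)
    return scores

note = "Patient reports low mood. PHQ-9 score today: 14 (moderate)."
print(extract_phq9_scores(note))  # [14]
```

A regex baseline like this would miss negations, item-level scores and historical mentions, which is precisely why the project evaluates full natural language processing for the task.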
APA, Harvard, Vancouver, ISO, and other styles
8

Zelenskyi, Arkadii A. Relevance of research of programs for semantic analysis of texts and review of methods of their realization. [s.n.], December 2018. http://dx.doi.org/10.31812/123456789/2884.

Full text
Abstract:
One of the main tasks of applied linguistics is solving the problem of high-quality automated processing of natural language. The most effective approaches to processing natural-language text for extracting and representing semantics are systems based on an efficient combination of linguistic analysis technologies and analysis methods. Among the existing methods for analyzing text data, one widely used approach relies on a vector model. Another effective and relevant means of extracting semantics from text and representing it is latent semantic analysis (LSA). The LSA method has been tested and has confirmed its effectiveness in such areas of natural language processing as modeling the conceptual knowledge of a person and information retrieval, where LSA shows much better results than conventional vector methods.
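To make the two methods concrete, the following short Python sketch (using scikit-learn; an illustration, not tooling from the report) builds a classic vector-space representation and then applies truncated SVD, the matrix factorization at the core of LSA:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Step 1: the conventional vector model - a TF-IDF term-document matrix.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(documents)

# Step 2: latent semantic analysis - truncated SVD projects the sparse
# TF-IDF vectors onto a small number of latent semantic dimensions.
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)

# In the latent space, documents on the same topic tend to end up close
# together even when they share few surface words.
print(cosine_similarity(X_lsa[2:3], X_lsa[3:4]))  # two finance documents
print(cosine_similarity(X_lsa[0:1], X_lsa[3:4]))  # animal vs. finance document
```

The dimensionality reduction is what lets LSA outperform plain vector matching in information retrieval: queries and documents can be compared through shared latent dimensions rather than exact term overlap.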
APA, Harvard, Vancouver, ISO, and other styles
9

Murdick, Dewey, Daniel Chou, Ryan Fedasiuk, and Emily Weinstein. The Public AI Research Portfolio of China’s Security Forces. Center for Security and Emerging Technology, March 2021. http://dx.doi.org/10.51593/20200057.

Full text
Abstract:
New analytic tools are used in this data brief to explore the public artificial intelligence (AI) research portfolio of China’s security forces. The methods contextualize Chinese-language scholarly papers that claim a direct working affiliation with components of the Ministry of Public Security, People's Armed Police Force, and People’s Liberation Army. The authors review potential uses of computer vision, robotics, natural language processing and general AI research.
APA, Harvard, Vancouver, ISO, and other styles
10

Price, Roz. Climate Change Risks and Opportunities in Yemen. Institute of Development Studies, May 2022. http://dx.doi.org/10.19088/k4d.2022.096.

Full text
Abstract:
This rapid review provides insight into the effects of climate change in the Republic of Yemen (Yemen), with particular attention on key sectors of concern, including food security, water, energy and health. Many contextual and background factors are relevant when discussing climate-related impacts and potential priorities in Yemen. Limited studies and tools that provide climate data for Yemen exist, and there is a clear lack of recent and reliable climate data and statistics for past and future climates in Yemen, both at the national and more local levels (downscaled). Country-level information in this report is drawn mostly from information reported in Yemen’s UNFCCC reporting (Republic of Yemen, 2013, 2015) and other sources, which tend to be donor climate change country profiles, such as a USAID (2017) climate change risk profile for Yemen and a Climate Service Center Germany (GERICS) (2015) climate fact sheet on Yemen. Many of these are based on projections from older sources. Studies more commonly tend to look at water scarcity or food insecurity issues in relation to Yemen, with climate change mentioned as a factor (one of many) but not the main focus. Regional information is taken from the latest Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6) report in relation to the Arabian Peninsula (and hence Yemen). Academic sources as well as donor, research institutes and intergovernmental organisations sources are also included. It was outside the scope of this report to review literature in the Arabic language.
APA, Harvard, Vancouver, ISO, and other styles