To view other types of publications on this topic, follow the link: Low-Resourced language.

Journal articles on the topic "Low-Resourced language"

Format your citation in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic "Low-Resourced language".

Next to every entry in the bibliography you will find an "Add to bibliography" button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style of your choice: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication as a .pdf file and read its online abstract, provided such details are available in the metadata.

Browse journal articles across a wide range of disciplines and compile your bibliography correctly.

1

Allah, Fadoua Ataa, and Siham Boulaknadel. "NEW TRENDS IN LESS-RESOURCED LANGUAGE PROCESSING: CASE OF AMAZIGH LANGUAGE." International Journal on Natural Language Computing 12, no. 2 (April 29, 2023): 75–89. http://dx.doi.org/10.5121/ijnlc.2023.12207.

Full text of the source
Abstract:
The coronavirus (COVID-19) pandemic has dramatically changed lifestyles in much of the world and forced people to profoundly rethink their relationships with digital technologies. People, however, prefer using these technologies in their own languages. Unfortunately, most languages are considered low- or less-resourced and cannot keep pace with these new needs. This study therefore explores how such languages, chiefly Amazigh, will fare in a wholly digital environment and what new trends to expect. In contrast to past decades, the research gap for low- and less-resourced languages is steadily narrowing. Nonetheless, a review of the literature reveals the need for innovative research that revisits their informatization roadmap while rethinking, in a useful way, people's behaviour in this rapidly changing environment. In this work, we first introduce the challenges of technology access and explain how natural language processing helps overcome them. We then give an overview of existing studies and research on the informatization of under- and less-resourced languages, with an emphasis on the Amazigh language. Finally, drawing on these studies and the agile revolution, we present a new roadmap.
Styles: APA, Harvard, Vancouver, ISO, etc.
2

Kipyatkova, Irina, and Ildar Kagirov. "Deep Models for Low-Resourced Speech Recognition: Livvi-Karelian Case." Mathematics 11, no. 18 (September 5, 2023): 3814. http://dx.doi.org/10.3390/math11183814.

Full text of the source
Abstract:
Recently, there has been a growth in the number of studies addressing the automatic processing of low-resource languages. The lack of speech and text data significantly hinders the development of speech technologies for such languages. This paper introduces an automatic speech recognition system for Livvi-Karelian. Acoustic models based on artificial neural networks with time delays and hidden Markov models were trained using a limited speech dataset of 3.5 h. To augment the data, pitch and speech-rate perturbation, SpecAugment, and their combinations were employed. Language models based on 3-grams and neural networks were trained using written texts and transcripts. The achieved word error rate of 22.80% is comparable to that of other low-resource languages. To the best of our knowledge, this is the first speech recognition system for Livvi-Karelian. The results obtained may be of significance for the development of automatic speech recognition and machine translation systems, not only for Livvi-Karelian but also for other low-resource languages. Future work includes experiments with Karelian data using techniques such as transfer learning and DNN language models.
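The word error rate cited above is the standard edit-distance metric; as a minimal illustration (not the authors' implementation), it can be computed as:

```python
# Word error rate: word-level Levenshtein distance divided by the
# number of reference words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# one substitution and one deletion over six reference words
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```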
3

Singh, Pranaydeep, Orphée De Clercq, and Els Lefever. "Distilling Monolingual Models from Large Multilingual Transformers." Electronics 12, no. 4 (February 18, 2023): 1022. http://dx.doi.org/10.3390/electronics12041022.

Full text of the source
Abstract:
Although language modeling has been trending upwards steadily, models available for low-resourced languages are limited to large multilingual models such as mBERT and XLM-RoBERTa, which come with significant overheads for deployment vis-à-vis their model size, inference speeds, etc. We attempt to tackle this problem by proposing a novel methodology to apply knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model. We demonstrate the viability of this methodology on two downstream tasks each for six languages. We further dive into the possible modifications to the basic setup for low-resourced languages by exploring ideas to tune the final vocabulary of the distilled models. Lastly, we perform a detailed ablation study to understand the different components of the setup better and find out what works best for the two under-resourced languages, Swahili and Slovene.
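As context for the distillation setup described above, the classic soft-label objective (a temperature-scaled KL divergence between teacher and student distributions) can be sketched as follows; this is a generic Hinton-style formulation, not necessarily the paper's exact loss:

```python
import math

def softened(logits, T):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation term: T^2 * KL(teacher || student)
    over temperature-softened class distributions."""
    p = softened(teacher_logits, T)   # teacher's soft targets
    q = softened(student_logits, T)   # student's predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In practice this term is mixed with the ordinary cross-entropy on hard labels.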
4

Mabokela, Koena Ronny, Mpho Primus, and Turgay Celik. "Explainable Pre-Trained Language Models for Sentiment Analysis in Low-Resourced Languages." Big Data and Cognitive Computing 8, no. 11 (November 15, 2024): 160. http://dx.doi.org/10.3390/bdcc8110160.

Full text of the source
Abstract:
Sentiment analysis is a crucial tool for measuring public opinion and understanding human communication across digital social media platforms. However, due to linguistic complexities and limited data or computational resources, it is under-represented in many African languages. While state-of-the-art Afrocentric pre-trained language models (PLMs) have been developed for various natural language processing (NLP) tasks, their applications in eXplainable Artificial Intelligence (XAI) remain largely unexplored. In this study, we propose a novel approach that combines Afrocentric PLMs with XAI techniques for sentiment analysis. We demonstrate the effectiveness of incorporating attention mechanisms and visualization techniques in improving the transparency, trustworthiness, and decision-making capabilities of transformer-based models when making sentiment predictions. To validate our approach, we employ the SAfriSenti corpus, a multilingual sentiment dataset for South African under-resourced languages, and perform a series of sentiment analysis experiments. These experiments enable comprehensive evaluations, comparing the performance of Afrocentric models against mainstream PLMs. Our results show that the Afro-XLMR model outperforms all other models, achieving an average F1-score of 71.04% across five tested languages, and the lowest error rate among the evaluated models. Additionally, we enhance the interpretability and explainability of the Afro-XLMR model using Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP). These XAI techniques ensure that sentiment predictions are not only accurate and interpretable but also understandable, fostering trust and reliability in AI-driven NLP technologies, particularly in the context of African languages.
5

Shafiq, Nida, Isma Hamid, Muhammad Asif, Qamar Nawaz, Hanan Aljuaid, and Hamid Ali. "Abstractive text summarization of low-resourced languages using deep learning." PeerJ Computer Science 9 (January 13, 2023): e1176. http://dx.doi.org/10.7717/peerj-cs.1176.

Full text of the source
Abstract:
Background: Humans must cope with the huge amounts of information produced by the information technology revolution. As a result, automatic text summarization is employed in a range of industries to help individuals identify the most important information. Two approaches to text summarization are mainly considered: extractive and abstractive. The extractive approach selects chunks of sentences from source documents, while the abstractive approach generates a summary based on mined keywords. For low-resourced languages such as Urdu, extractive summarization uses various models and algorithms; the study of abstractive summarization in Urdu, however, remains a challenging task. Because there are so many literary works in Urdu, producing abstractive summaries demands extensive research. Methodology: This article proposes a deep learning model for the Urdu language using the Urdu 1 Million news dataset and compares its performance with two widely used machine learning methods, support vector machine (SVM) and logistic regression (LR). The results show that the proposed deep learning model performs better than the other two approaches. The summaries produced by extractive methods are processed with the encoder-decoder paradigm to create an abstractive summary. Results: The system-generated summaries were validated with the help of Urdu language specialists, demonstrating the proposed model's improvement and accuracy.
6

Pandit, Rajat, Saptarshi Sengupta, Sudip Kumar Naskar, Niladri Sekhar Dash, and Mohini Mohan Sardar. "Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language." Informatics 6, no. 2 (May 5, 2019): 19. http://dx.doi.org/10.3390/informatics6020019.

Full text of the source
Abstract:
Semantic similarity is a long-standing problem in natural language processing (NLP). It is a topic of great interest, as its understanding can provide a look into how human beings comprehend meaning and make associations between words. However, when this problem is viewed from the standpoint of machine understanding, particularly for under-resourced languages, it poses a different problem altogether. In this paper, semantic similarity is explored in Bangla, a less-resourced language. To ameliorate the situation for such languages, the most rudimentary method (path-based) and the latest state-of-the-art method (Word2Vec) for semantic similarity calculation were augmented using cross-lingual resources in English, and the results obtained are truly astonishing. Two semantic similarity approaches are explored in Bangla, namely the path-based and distributional models, and their cross-lingual counterparts were synthesized in light of the English WordNet and corpora. The proposed methods were evaluated on a dataset comprising 162 Bangla word pairs annotated by five expert raters. The correlation scores obtained between the four metrics and the human evaluation scores demonstrate the marked enhancement that the cross-lingual approach brings to semantic similarity calculation for Bangla.
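A path-based measure of the kind mentioned above scores two senses by the length of the shortest path linking them in an IS-A hierarchy. The sketch below uses a tiny hypothetical taxonomy for illustration; the paper itself works with the English WordNet:

```python
# Path-based similarity on a toy IS-A taxonomy:
# sim = 1 / (1 + length of the shortest path between two nodes).
# The child -> parent table below is purely illustrative.
PARENT = {
    "dog": "canine", "canine": "mammal", "cat": "feline",
    "feline": "mammal", "mammal": "animal", "bird": "animal",
}

def ancestors(node):
    """Node followed by its chain of ancestors up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def path_similarity(a, b):
    chain_a = ancestors(a)
    depths_a = {n: i for i, n in enumerate(chain_a)}
    # walk up from b until we hit the lowest common ancestor
    for j, n in enumerate(ancestors(b)):
        if n in depths_a:
            return 1.0 / (1 + depths_a[n] + j)
    return 0.0  # no common ancestor

# dog -> canine -> mammal <- feline <- cat: path length 4
print(path_similarity("dog", "cat"))  # 1 / (1 + 4) = 0.2
```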
7

Badawi, Soran. "Transformer-Based Neural Network Machine Translation Model for the Kurdish Sorani Dialect." UHD Journal of Science and Technology 7, no. 1 (January 15, 2023): 15–21. http://dx.doi.org/10.21928/uhdjst.v7n1y2023.pp15-21.

Full text of the source
Abstract:
The transformer model is one of the most recently developed models for translating texts into another language. The model uses the principle of the attention mechanism, surpassing previous models, such as sequence-to-sequence, in terms of performance. It has performed well for highly resourced languages such as English, French, and German. Using this architecture, we investigate training a modified version of the model on a low-resourced language, Kurdish. This paper presents the first-ever transformer-based neural machine translation model for the Kurdish language, utilizing vocabulary units shared across the dataset. For this purpose, we combine all the existing parallel corpora of Kurdish and English into a large corpus and train the proposed transformer model on it. The outcome indicates that the suggested transformer model works well with Kurdish texts, achieving a score of 0.45 on the bilingual evaluation understudy (BLEU) metric. By the BLEU standard, this score indicates a high-quality translation.
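For reference, sentence-level BLEU, the metric used above, is the brevity-penalized geometric mean of clipped n-gram precisions. A minimal unsmoothed sketch (corpus-level aggregation and smoothing, as found in standard toolkits, are omitted):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # clip each hypothesis n-gram count by its reference count
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_prec += math.log(clipped / total) / max_n
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)
```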
8

Kapočiūtė-Dzikienė, Jurgita, and Senait Gebremichael Tesfagergish. "Part-of-Speech Tagging via Deep Neural Networks for Northern-Ethiopic Languages." Information Technology And Control 49, no. 4 (December 19, 2020): 482–94. http://dx.doi.org/10.5755/j01.itc.49.4.26808.

Full text of the source
Abstract:
Deep Neural Networks (DNNs) have proven especially successful in Natural Language Processing (NLP), including Part-Of-Speech (POS) tagging, the process of mapping words to their corresponding POS labels depending on the context. Despite recent developments in language technologies, low-resourced languages, such as the East African Tigrinya language, have received too little attention. We investigate the effectiveness of Deep Learning (DL) solutions for the low-resourced Tigrinya language of the Northern-Ethiopic branch. We selected Tigrinya as the testbed and tested state-of-the-art DL approaches, seeking to build the most accurate POS tagger. We evaluated DNN classifiers (Feed-Forward Neural Network, FFNN; Long Short-Term Memory, LSTM; Bidirectional LSTM; and Convolutional Neural Network, CNN) on top of neural word2vec word embeddings, with a small training corpus known as the Nagaoka Tigrinya Corpus. To determine the best DNN classifier type, its architecture, and its hyper-parameter set, both manual and automatic hyper-parameter tuning were performed. The BiLSTM method proved the most suitable for our task: it achieved the highest accuracy, equal to 92%, which is 65% above the random baseline.
9

Nitu, Melania, and Mihai Dascalu. "Natural Language Processing Tools for Romanian – Going Beyond a Low-Resource Language." Interaction Design and Architecture(s), no. 60 (March 15, 2024): 7–26. http://dx.doi.org/10.55612/s-5002-060-001sp.

Full text of the source
Abstract:
Advances in Natural Language Processing bring innovative instruments to the educational field to improve the quality of the didactic process by addressing challenges like language barriers and creating personalized learning experiences. Most research in the domain is dedicated to high-resource languages, such as English, while languages with limited coverage, like Romanian, are still underrepresented in the field. Operating on low-resource languages is essential to ensure equitable access to educational opportunities and to preserve linguistic diversity. Through continuous investments in developing Romanian educational instruments, we are rapidly going beyond a low-resource language. This paper presents recent educational instruments and frameworks dedicated to Romanian, leveraging state-of-the-art NLP techniques, such as building advanced Romanian language models and benchmarks encompassing tools for language learning, text comprehension, question answering, automatic essay scoring, and information retrieval. The methods and insights gained are transferable to other low-resource languages, emphasizing methodological adaptability, collaborative frameworks, and technology transfer to address similar challenges in diverse linguistic contexts. Two use cases are presented, focusing on assessing student performance in Moodle courses and extracting main ideas from students’ feedback. These practical applications in Romanian academic settings serve as examples for enhancing educational practices in other less-resourced languages.
10

Ngué Um, Emmanuel, Émilie Eliette, Caroline Ngo Tjomb Assembe, and Francis Morton Tyers. "Developing a Rule-Based Machine-Translation System, Ewondo–French–Ewondo." International Journal of Humanities and Arts Computing 16, no. 2 (October 2022): 166–81. http://dx.doi.org/10.3366/ijhac.2022.0289.

Full text of the source
Abstract:
Machine translation (MT) significantly contributes to democratizing access to textual information across multiple languages and is established as a dynamic language service in the global multilingual society. Not surprisingly, the attractiveness of the MT market has stirred up spectacular innovations, driven by artificial intelligence in the digital technology industry. The commercial stakes of the industry have led to massive investments in the development of automatic translation systems for languages of wider communication and to the increased marginalization of minority languages in this arena. This article reports on the ongoing development of a low-tech, rule-based MT system for Ewondo, a Bantu low-resourced language spoken in Cameroon. The project aims to fill the gap in access to MT services in the target minority language community and to generate parallel corpora from and into the Ewondo language.
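A rule-based transfer system of the kind described typically chains lexical lookup with structural transfer rules. The toy sketch below uses a hypothetical French-to-English fragment purely for illustration (the project itself targets Ewondo and French), with a single transfer rule swapping noun-adjective order:

```python
# Minimal rule-based transfer pipeline: lexical lookup plus one
# structural rule (source noun-adjective order -> target adjective-noun).
# Toy French -> English lexicon for illustration only.
LEXICON = {"le": ("the", "DET"), "chat": ("cat", "NOUN"),
           "noir": ("black", "ADJ"), "dort": ("sleeps", "VERB")}

def translate(sentence):
    # lexical transfer: look up each word and its category
    tagged = [LEXICON.get(w, (w, "UNK")) for w in sentence.split()]
    # structural transfer: swap NOUN ADJ -> ADJ NOUN
    out, i = [], 0
    while i < len(tagged):
        if (i + 1 < len(tagged)
                and tagged[i][1] == "NOUN" and tagged[i + 1][1] == "ADJ"):
            out += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)

print(translate("le chat noir dort"))  # -> "the black cat sleeps"
```

Real systems add morphological analysis and generation around these two steps.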
11

Agbesi, Victor Kwaku, Wenyu Chen, Sophyani Banaamwini Yussif, Md Altab Hossin, Chiagoziem C. Ukwuoma, Noble A. Kuadey, Colin Collinson Agbesi, Nagwan Abdel Samee, Mona M. Jamjoom, and Mugahed A. Al-antari. "Pre-Trained Transformer-Based Models for Text Classification Using Low-Resourced Ewe Language." Systems 12, no. 1 (December 19, 2023): 1. http://dx.doi.org/10.3390/systems12010001.

Full text of the source
Abstract:
Despite a few attempts to automatically crawl Ewe text from online news portals and magazines, the African Ewe language, notwithstanding its rich morphology and complex "unique" structure, remains underdeveloped in NLP terms. This is due to the poor-quality, unbalanced, and largely religious nature of the crawled Ewe texts, which makes it challenging to preprocess them and perform any NLP task with current transformer-based language models. In this study, we present a well-preprocessed Ewe dataset for low-resource text classification to the research community. Additionally, we have developed an Ewe-based word embedding to leverage the low-resource semantic representation. Finally, we fine-tuned seven transformer-based models, namely BERT-based (cased and uncased), DistilBERT-based (cased and uncased), RoBERTa, DistilRoBERTa, and DeBERTa, using the preprocessed Ewe dataset. Extensive experiments indicate that the fine-tuned BERT-base-cased model outperforms all baseline models with an accuracy of 0.972, precision of 0.969, recall of 0.970, a loss score of 0.021, and an F1-score of 0.970. This performance demonstrates the model's ability to capture the low-resourced Ewe semantic representation better than all other models, establishing the fine-tuned BERT-based model as the benchmark for the proposed Ewe dataset.
12

Abigail Rai. "Part-of-Speech (POS) Tagging of Low-Resource Language (Limbu) with Deep learning." Panamerican Mathematical Journal 35, no. 1s (November 13, 2024): 149–57. http://dx.doi.org/10.52783/pmj.v35.i1s.2297.

Full text of the source
Abstract:
POS tagging is a basic Natural Language Processing (NLP) task that labels the words in an input text according to their grammatical values. Although POS tagging is a well-established application for resource-rich languages, it remains largely unexplored for languages such as Limbu, owing to the scarcity of tagged datasets and linguistic resources. This research project uses deep learning techniques, transfer learning, and the BiLSTM-CRF model to develop an accurate POS-tagging system for the Limbu language. Drawing on annotated and unannotated language data, we assemble a small yet informative dataset of Limbu text. A pre-trained multilingual model was adapted to improve performance in the low-resource setting. The proposed model attains 90% accuracy, considerably better than traditional rule-based and machine learning methods for Limbu POS tagging. The results indicate that deep learning methods can address the linguistic issues facing low-resource languages even with limited data. This study thus provides a cornerstone for follow-up NLP applications for Limbu and similar low-resource languages, demonstrating how deep learning can fill the gap where data is scarce.
13

Masethe, Hlaudi Daniel, Mosima Anna Masethe, Sunday Olusegun Ojo, Fausto Giunchiglia, and Pius Adewale Owolawi. "Word Sense Disambiguation for Morphologically Rich Low-Resourced Languages: A Systematic Literature Review and Meta-Analysis." Information 15, no. 9 (September 4, 2024): 540. http://dx.doi.org/10.3390/info15090540.

Full text of the source
Abstract:
In natural language processing, word sense disambiguation (WSD) continues to be a major difficulty, especially for low-resource languages, where linguistic variation and a lack of data make model training and evaluation more difficult. The goal of this systematic review and meta-analysis is to summarize the body of knowledge on WSD techniques for low-resource languages, emphasizing the advantages and disadvantages of different strategies. A thorough search of several databases for relevant literature produced articles assessing WSD methods in low-resource languages. Effect sizes and performance measures were extracted from a subset of studies; pooled effect estimates were computed by meta-analysis, and heterogeneity was evaluated. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were used to develop the process for selecting papers for extraction. The meta-analysis included 32 studies, encompassing a range of WSD methods and low-resourced languages. The overall pooled effect size indicated moderate effectiveness of WSD techniques. Heterogeneity among studies was high, with an I² value of 82.29%, suggesting substantial variability in WSD performance across studies. The τ² value of 5.819 further reflects the extent of between-study variance. This variability underscores the challenges in generalizing findings and highlights the influence of diverse factors such as language-specific characteristics, dataset quality, and methodological differences. The p-values from the meta-regression (0.454) and the meta-analysis (0.440) suggest that the variability in WSD performance is not statistically significantly associated with the investigated moderators, indicating that the performance differences may be influenced by factors not fully captured in the current analysis. The absence of significant p-values raises the possibility that the problems presented by low-resource settings are not yet well addressed by the models and techniques in use.
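For readers unfamiliar with the heterogeneity statistic above, I² is derived from Cochran's Q and the degrees of freedom. A back-of-the-envelope check follows; note that Q ≈ 175 is inferred here from the reported I² and study count, not taken from the paper:

```python
def i_squared(Q, k):
    """Higgins' I^2: the percentage of variation across studies that
    is due to heterogeneity rather than chance. Q is Cochran's Q
    statistic and k is the number of studies (df = k - 1); negative
    values are clamped to zero by convention."""
    df = k - 1
    return max(0.0, (Q - df) / Q) * 100.0

# With 32 studies (df = 31), Q ≈ 175 reproduces the reported I².
print(round(i_squared(175.0, 32), 2))  # 82.29
```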
14

Shaukat, Saima, Muhammad Asad, and Asmara Akram. "Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach." Applied Sciences 13, no. 8 (April 19, 2023): 5103. http://dx.doi.org/10.3390/app13085103.

Full text of the source
Abstract:
Lemmatization aims at returning the root form of a word. The lemmatizer is envisioned as a vital instrument that can assist in many Natural Language Processing (NLP) tasks. These tasks include Information Retrieval, Word Sense Disambiguation, Machine Translation, Text Reuse, and Plagiarism Detection. Previous studies in the literature have focused on developing lemmatizers using rule-based approaches for English and other highly-resourced languages. However, there have been no thorough efforts for the development of a lemmatizer for most South Asian languages, specifically Urdu. Urdu is a morphologically rich language with many inflectional and derivational forms. This makes the development of an efficient Urdu lemmatizer a challenging task. A standardized lemmatizer would contribute towards establishing much-needed methodological resources for this low-resourced language, which are required to boost the performance of many Urdu NLP applications. This paper presents a lemmatization system for the Urdu language, based on a novel dictionary lookup approach. The contributions made through this research are the following: (1) the development of a large benchmark corpus for the Urdu language, (2) the exploration of the relationship between parts of speech tags and the lemmatizer, and (3) the development of standard approaches for an Urdu lemmatizer. Furthermore, we experimented with the impact of Part of Speech (PoS) on our proposed dictionary lookup approach. The empirical results showed that we achieved the best accuracy score of 76.44% through the proposed dictionary lookup approach.
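The dictionary-lookup approach described above can be sketched as a PoS-aware table lookup with fallbacks. The tiny English table below is purely illustrative; the actual system uses a large Urdu dictionary:

```python
# PoS-aware dictionary lookup with two fallbacks: a PoS-agnostic
# entry, then the surface form itself. Toy English entries only.
LEMMA_DICT = {
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("books", None): "book",
}

def lemmatize(form, pos=None):
    if (form, pos) in LEMMA_DICT:      # exact (form, PoS) entry first
        return LEMMA_DICT[(form, pos)]
    if (form, None) in LEMMA_DICT:     # PoS-agnostic fallback
        return LEMMA_DICT[(form, None)]
    return form                        # out of dictionary: unchanged
```

Passing a PoS tag, as the paper's experiments do, lets ambiguous forms resolve to different lemmas.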
15

Nazir, Shahzad, Muhammad Asif, Mariam Rehman, and Shahbaz Ahmad. "Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language." PeerJ Computer Science 10 (January 31, 2024): e1704. http://dx.doi.org/10.7717/peerj-cs.1704.

Full text of the source
Abstract:
In text applications, pre-processing is deemed a significant factor in enhancing the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal pre-processing procedures whose importance cannot be overstated. Text normalization refers to transforming raw text into a standardized script form, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most of the world's widely spoken languages; however, the world's 10th most widely spoken language has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, separating digits, etc. For word tokenization, core features are defined and extracted for each character of the text, and a machine learning model, combined with specified handcrafted rules, predicts spaces and tokenizes the text. For this experiment, the largest human-annotated dataset in Urdu script, covering five different domains, was created. The results were evaluated using precision, recall, F-measure, and accuracy, and compared with the state of the art. The normalization approach produced a 20% improvement and the tokenization approach a 6% improvement.
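Normalization rules of the kind listed (diacritic removal, digit separation) are typically implemented with Unicode decomposition and regular expressions. A toy sketch, not the paper's rule set:

```python
import re
import unicodedata

def normalize(text):
    """Toy normalization pipeline: strip combining diacritics and
    put spaces around digit runs glued to letters."""
    # canonical decomposition, then drop combining marks (diacritics)
    text = "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(ch))
    # insert a space at every letter/digit boundary, e.g. "vol2" -> "vol 2"
    text = re.sub(r"(?<=\D)(?=\d)|(?<=\d)(?=\D)", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("café2023"))  # -> "cafe 2023"
```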
16

Păiș, Vasile, Verginica Barbu Mititelu, Elena Irimia, Radu Ion, and Dan Tufiș. "Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language." Applied Sciences 14, no. 19 (October 7, 2024): 9043. http://dx.doi.org/10.3390/app14199043.

Full text of the source
Abstract:
This paper introduces the USPDATRO dataset. This is a speech dataset, in the Romanian language, constructed from open data, focusing on under-represented voice types (children, young and old people, and female voices). The paper covers the methodology behind the dataset construction, specific details regarding the dataset, and evaluation of existing Romanian Automatic Speech Recognition (ASR) systems, with different architectures. Results indicate that more under-represented speech content is needed in the training of ASR systems. Our approach can be extended to other low-resourced languages, as long as open data are available.
17

JP, Sanjanasri, Vijay Krishna Menon, Soman KP, Rajendran S, and Agnieszka Wolk. "Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way." Electronics 10, no. 12 (June 8, 2021): 1372. http://dx.doi.org/10.3390/electronics10121372.

Full text of the source
Abstract:
Linguists have been focused on a qualitative comparison of the semantics from different languages. Evaluation of the semantic interpretation among disparate language pairs like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability across other target languages of the proposed model was assessed via pre-trained Word2Vec embeddings from Hindi and Chinese languages. We empirically prove that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.
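The transfer function mentioned above is, in its simplest form, a linear map fitted on seed dictionary pairs (in the style of Mikolov et al.). The sketch below solves the normal equations by hand for 2-D toy vectors; real embeddings would use numpy.linalg.lstsq or an orthogonal Procrustes solution:

```python
# Fit W minimising ||xW - y||^2 over seed dictionary pairs (x, y),
# via the normal equations W = (X^T X)^-1 X^T Y. Deliberately
# restricted to 2-D vectors so the Gram matrix inverse is explicit.
def fit_linear_map(X, Y):
    d = len(X[0])  # must be 2 for the manual inverse below
    G = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    B = [[sum(x[i] * y[j] for x, y in zip(X, Y)) for j in range(d)]
         for i in range(d)]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    Ginv = [[G[1][1] / det, -G[0][1] / det],
            [-G[1][0] / det, G[0][0] / det]]
    return [[sum(Ginv[i][k] * B[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def apply_map(W, x):
    """Project a source-space vector into the target space."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W))]
```

Given pairs related by a coordinate swap, the fit recovers the swap matrix, and apply_map then projects unseen source vectors.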
18

Ramesh, Akshai, Venkatesh Balavadhani Parthasarathy, Rejwanul Haque, and Andy Way. "Comparing Statistical and Neural Machine Translation Performance on Hindi-To-Tamil and English-To-Tamil." Digital 1, no. 2 (April 2, 2021): 86–102. http://dx.doi.org/10.3390/digital1020007.

Full text of the source
Abstract:
Phrase-based statistical machine translation (PB-SMT) has been the dominant paradigm in machine translation (MT) research for more than two decades. Deep neural MT models have been producing state-of-the-art performance across many translation tasks for four to five years. To put it another way, neural MT (NMT) took the place of PB-SMT a few years back and currently represents the state-of-the-art in MT research. Translation to or from under-resourced languages has been historically seen as a challenging task. Despite producing state-of-the-art results in many translation tasks, NMT still poses many problems such as performing poorly for many low-resource language pairs mainly because of its learning task’s data-demanding nature. MT researchers have been trying to address this problem via various techniques, e.g., exploiting source- and/or target-side monolingual data for training, augmenting bilingual training data, and transfer learning. Despite some success, none of the present-day benchmarks have entirely overcome the problem of translation in low-resource scenarios for many languages. In this work, we investigate the performance of PB-SMT and NMT on two rarely tested under-resourced language pairs, English-To-Tamil and Hindi-To-Tamil, taking a specialised data domain into consideration. This paper demonstrates our findings and presents results showing the rankings of our MT systems produced via a social media-based human evaluation scheme.
19

Zia, Haris Bin, Ignacio Castro, Arkaitz Zubiaga, and Gareth Tyson. "Improving Zero-Shot Cross-Lingual Hate Speech Detection with Pseudo-Label Fine-Tuning of Transformer Language Models." Proceedings of the International AAAI Conference on Web and Social Media 16 (May 31, 2022): 1435–39. http://dx.doi.org/10.1609/icwsm.v16i1.19402.

Full text of the source
Abstract:
Hate speech has proliferated on social media platforms in recent years. While this has been the focus of many studies, most works have exclusively focused on a single language, generally English. Low-resourced languages have been neglected due to the dearth of labeled resources. These languages, however, represent an important portion of the data due to the multilingual nature of social media. This work presents a novel zero-shot, cross-lingual transfer learning pipeline based on pseudo-label fine-tuning of Transformer Language Models for automatic hate speech detection. We employ our pipeline on benchmark datasets covering English (source) and 6 different non-English (target) languages written in 3 different scripts. Our pipeline achieves an average improvement of 7.6% (in terms of macro-F1) over previous zero-shot, cross-lingual models. This demonstrates the feasibility of high accuracy automatic hate speech detection for low-resource languages. We release our code and models at https://github.com/harisbinzia/ZeroshotCrosslingualHateSpeech.
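Pseudo-label fine-tuning, as used above, first harvests confident model predictions on unlabelled target-language text and treats them as training labels. A schematic selection step (the probability function here stands in for a real cross-lingual classifier):

```python
# Keep only high-confidence predictions as pseudo-labels for a
# subsequent fine-tuning round; uncertain examples are discarded.
def pseudo_label(unlabelled, predict_proba, threshold=0.9):
    labelled = []
    for text in unlabelled:
        probs = predict_proba(text)  # e.g. {"hate": 0.95, "not": 0.05}
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf >= threshold:
            labelled.append((text, label))
    return labelled
```

The threshold trades label noise against the size of the pseudo-labelled set.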
APA, Harvard, Vancouver, ISO, and other citation styles
20

Poudel, Guru Prasad. "Speaking in English Language Classroom: Teachers’ Strategies and Confronting Problems." NELTA Bagmati Journal 3, no. 1 (December 31, 2022): 1–18. http://dx.doi.org/10.3126/nbj.v3i1.53412.

Full text source
Abstract:
The speaking components included in the English language curriculum and textbooks aim at developing students' communication abilities. Effective practice in dealing with those components can enhance students' efficiency in communication. However, the existing practice of teaching speaking in public schools of Nepal, especially those located in rural and low-resourced areas, does not seem supportive of developing students' communication abilities. In this connection, the present article aims to examine the practice of teaching speaking in rural and low-resourced schools, the use of strategies in dealing with speaking components, and the problems faced by teachers. It has been developed from insights collected through case study research. The required information was gathered through classroom observation. During the observation, it was found that the speaking component was handled through whole-class discussion, pair work, group work, individual work, picture description, loud reading and repetition drills, storytelling, oral games, problem-solving, and sharing experiences. The teachers faced the problems of student-to-student chatting in the mother tongue during group discussion, low participation, hesitation and unwillingness to take part in interaction, lack of sufficient and effective materials, difficulty comprehending ideas expressed in English, and lack of exposure to speaking in English. The findings call for innovative and technologically enhanced activities for the practice of speaking. The study concludes that participatory action research is required to explore ways of addressing the problems in developing speaking efficiency.
APA, Harvard, Vancouver, ISO, and other citation styles
21

Kalluri, Kartheek. "ADAPTING LLMs FOR LOW RESOURCE LANGUAGES-TECHNIQUES AND ETHICAL CONSIDERATIONS." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 12 (December 30, 2024): 1–6. https://doi.org/10.55041/isjem00140.

Full text source
Abstract:
This study examines techniques for adapting large language models (LLMs) to resource-scarce languages and analyzes the ethical considerations involved. It employs a mixed-methods design consisting of a literature review, corpus collection, expert interviews, and stakeholder meetings. The adaptation techniques examined include data augmentation, multilingual pre-training, architectural modification, and parameter-efficient fine-tuning. The quantitative analysis indicated model-performance improvements for under-resourced languages, particularly through cross-lingual knowledge transfer and data augmentation; however, results varied across languages and tasks. The qualitative analysis surfaced ethical issues and articulated an ethical framework built around inclusiveness and transparency, with stakeholder involvement along the lines of bias, cultural sensitivity, data privacy, and impacts on linguistic diversity. Finally, although transfer learning and data augmentation lend themselves well to adapting LLMs to low-resource languages, careful consideration must still be given to their implications to ensure fair and contextually appropriate use. Keywords: adaptive large language models (LLMs), resource-scarce languages, data augmentation, multilingual pre-training, cross-lingual knowledge transfer, ethical considerations, cultural sensitivity
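As a back-of-the-envelope illustration of why parameter-efficient fine-tuning matters for such adaptation, the following sketch (an illustrative example, not from the study) counts trainable parameters for a full update of one weight matrix versus a rank-r low-rank (LoRA-style) update W + A @ B:

```python
def lora_param_counts(d, k, r):
    """Trainable parameters for full fine-tuning of a d x k weight matrix
    versus a rank-r low-rank update W + A @ B, where A is d x r and
    B is r x k; only A and B are trained in the low-rank case."""
    full = d * k
    low_rank = r * (d + k)
    return full, low_rank

full, low = lora_param_counts(d=4096, k=4096, r=8)
print(full, low)  # 16777216 vs 65536 trainable parameters
```

At typical transformer dimensions the low-rank update trains roughly 0.4% of the parameters of the full update, which is what makes adaptation feasible when target-language data and compute are scarce.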
APA, Harvard, Vancouver, ISO, and other citation styles
22

Adjeisah, Michael, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, and Jinling Song. "Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation." Computational Intelligence and Neuroscience 2021 (April 11, 2021): 1–10. http://dx.doi.org/10.1155/2021/6682385.

Full text source
Abstract:
Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains an open challenge. This research contributes to the domain with a low-resource English-Twi translation study based on filtered synthetic parallel corpora. It is often difficult to determine what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpora demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores.
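The filtering idea can be sketched as follows; this is a simplified diagonal-covariance version of the squared Mahalanobis distance (the paper works over general sentence-pair features, so the names and dimensions here are assumptions for illustration):

```python
def squared_mahalanobis_diag(x, mean, var):
    """Squared Mahalanobis distance under a diagonal-covariance
    simplification: sum_i (x_i - mu_i)^2 / var_i.  Sentence pairs whose
    distance from the statistics of trusted parallel data exceeds a
    threshold are treated as likely non-parallel and filtered out."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))

# toy 3-dimensional "sentence-pair feature" example
d2 = squared_mahalanobis_diag([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [1.0, 4.0, 9.0])
print(d2)  # 1/1 + 4/4 + 9/9 = 3.0
```

Pairs whose distance exceeds a chosen threshold would be discarded before the synthetic corpus is injected into training.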
APA, Harvard, Vancouver, ISO, and other citation styles
23

Mon, Aye Nyein, Win Pa Pa, and Ye Kyaw Thu. "UCSY-SC1: A Myanmar speech corpus for automatic speech recognition." International Journal of Electrical and Computer Engineering (IJECE) 9, no. 4 (August 1, 2019): 3194. http://dx.doi.org/10.11591/ijece.v9i4.pp3194-3202.

Full text source
Abstract:
This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research has been conducted by researchers around the world to improve their language technologies. Speech corpora are important in developing ASR, and their creation is necessary especially for low-resourced languages. Myanmar can be regarded as a low-resourced language because of the lack of pre-existing resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) is created for Myanmar ASR research. The corpus covers two domains, news and daily conversations, and its total size is over 42 hours: 25 hours of web news and 17 hours of recorded conversational data. The corpus was collected from 177 females and 84 males for the news data and 42 females and 4 males for the conversational domain, and was used as training data for developing Myanmar ASR. Three different types of acoustic models, Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN), were built and their results compared. Experiments were conducted on different data sizes, and evaluation was done on two test sets: TestSet1, web news, and TestSet2, recorded conversational data. Myanmar ASR systems trained on this corpus gave satisfactory results on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
APA, Harvard, Vancouver, ISO, and other citation styles
24

Sirora, Leslie Wellington, and Mainford Mutandavari. "A Deep Learning Automatic Speech Recognition Model for Shona Language." International Journal of Innovative Research in Computer and Communication Engineering 12, no. 09 (September 25, 2024): 1–14. http://dx.doi.org/10.15680/ijircce.2024.1209019.

Full text source
Abstract:
This study presented the development of a deep learning-based Automatic Speech Recognition (ASR) system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. The research aimed to address the challenges posed by limited training data, a lack of labelled data, and the intricate tonal nuances present in Shona speech, with the objective of achieving significant improvements in recognition accuracy compared to traditional statistical models. Motivated by the limitations of existing approaches, the research addressed three key questions. Firstly, it explored the feasibility of using deep learning to develop an accurate ASR system for Shona. Secondly, it investigated the specific challenges involved in designing and implementing deep learning architectures for Shona speech recognition and proposed strategies to mitigate these challenges. Lastly, it compared the performance of the deep learning-based model with existing statistical models in terms of accuracy. The developed ASR system utilized a hybrid architecture consisting of a Convolutional Neural Network (CNN) for acoustic modelling and a Long Short-Term Memory (LSTM) network for language modelling. To overcome the scarcity of data, the research employed data augmentation techniques and transfer learning. Attention mechanisms were also incorporated to accommodate the tonal nature of Shona speech. The resulting ASR system achieved impressive results, with a Word Error Rate (WER) of 29%, Phoneme Error Rate (PER) of 12%, and an overall accuracy of 74%. These metrics indicated a significant improvement over existing statistical models, highlighting the potential of deep learning to enhance ASR accuracy for under-resourced languages like Shona. This research contributed to the advancement of ASR technology for under-resourced languages like Shona, ultimately fostering improved accessibility and communication for Shona speakers worldwide.
APA, Harvard, Vancouver, ISO, and other citation styles
25

Mohamed Basheer K. P., Rizwana Kallooravi Thandil, Muneer V. K. ,. "Utilizing BiLSTM For Fine-Grained Aspect-Based Travel Recommendations Using Travel Reviews In Low Resourced Language." Journal of Electrical Systems 20, no. 2s (April 4, 2024): 233–40. http://dx.doi.org/10.52783/jes.1133.

Full text source
Abstract:
Recommender systems have become an essential tool for enhancing user experiences by providing personalized recommendations. In this study, we present a novel approach to constructing a recommender system specifically tailored for Malayalam travel reviews. Our objective was to extract relevant features from these reviews and employ a bidirectional Long Short-Term Memory (BiLSTM) architecture to construct a robust and accurate recommendation model. We focused on four key features extracted from the travel reviews: travel mode, travel type, location climate, and location type. The travel mode feature encompassed the mode of transport opted for the travel such as bus, car, train, etc., while the travel type captured the nature of the trip, including family, friends, or solo travel. Additionally, we considered the climate of the location, including rainy, snowy, hot, and dry, among others, and the location type, such as beach, hilly, or forest destinations. To construct our recommender system, we implemented a BiLSTM architecture, a powerful deep-learning model known for effectively capturing temporal dependencies in sequential data. This architecture allowed us to process the extracted features and learn the underlying patterns within the Malayalam travel reviews. Our experiments were conducted on a comprehensive dataset of Malayalam travel reviews, carefully curated for this study. The dataset encompassed a diverse range of travel experiences, enabling our model to learn from a wide variety of user preferences and recommendations. The performance evaluation of our recommender system yielded promising results. With an accuracy of 83.65 percent, our model showcased its ability to accurately predict and recommend travel options based on the extracted features from the reviews. 
The high accuracy achieved by our model underscores the effectiveness of the BiLSTM architecture in capturing the nuances of the Malayalam language and understanding the subtle preferences expressed in travel reviews. The practical implications of our work are significant, as it offers a valuable tool for travelers seeking personalized recommendations based on their travel preferences. The use of the Malayalam language in this context expands the reach of recommender systems to a wider audience, catering specifically to individuals who prefer to consume content and make decisions in their native language.
APA, Harvard, Vancouver, ISO, and other citation styles
26

Waite, S. "Low-resourced self-access with EAP in the developing world: the great enabler?" ELT Journal 48, no. 3 (July 1, 1994): 233–42. http://dx.doi.org/10.1093/elt/48.3.233.

Full text source
APA, Harvard, Vancouver, ISO, and other citation styles
27

Liapis, Charalampos M., Konstantinos Kyritsis, Isidoros Perikos, Nikolaos Spatiotis, and Michael Paraskevas. "A Hybrid Ensemble Approach for Greek Text Classification Based on Multilingual Models." Big Data and Cognitive Computing 8, no. 10 (October 14, 2024): 137. http://dx.doi.org/10.3390/bdcc8100137.

Full text source
Abstract:
The present study explores the field of text classification in the Greek language. A novel ensemble classification scheme based on generated embeddings from Greek text made by the multilingual capabilities of the E5 model is presented. Our approach incorporates partial transfer learning by using pre-trained models to extract embeddings, enabling the evaluation of classical classifiers on Greek data. Additionally, we enhance the predictive capability while maintaining the costs low by employing a soft voting combination scheme that exploits the strengths of XGBoost, K-nearest neighbors, and logistic regression. This method significantly improves all classification metrics, demonstrating the superiority of ensemble techniques in handling the complexity of Greek textual data. Our study contributes to the field of natural language processing by proposing an effective ensemble framework for the categorization of Greek texts, leveraging the advantages of both traditional and modern machine learning techniques. This framework has the potential to be applied to other less-resourced languages, thereby broadening the impact of our research beyond Greek language processing.
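The soft-voting combination the authors describe can be sketched in a few lines of plain Python (the class labels and probabilities below are made up for illustration; the actual system averages XGBoost, k-nearest-neighbors, and logistic-regression probabilities computed over E5 embeddings):

```python
def soft_vote(prob_dicts, weights=None):
    """Average the class-probability distributions produced by several
    classifiers (optionally weighted) and return the top-scoring class."""
    weights = weights or [1.0] * len(prob_dicts)
    totals = {}
    for probs, w in zip(prob_dicts, weights):
        for cls, p in probs.items():
            totals[cls] = totals.get(cls, 0.0) + w * p
    return max(totals, key=totals.get)

# three hypothetical base models voting on one Greek document
votes = [
    {"sports": 0.6, "politics": 0.4},
    {"sports": 0.3, "politics": 0.7},
    {"sports": 0.8, "politics": 0.2},
]
print(soft_vote(votes))  # "sports": 0.6+0.3+0.8 > 0.4+0.7+0.2
```

Because probabilities rather than hard labels are averaged, a single confident model can outvote two uncertain ones, which is what lets the ensemble exploit the complementary strengths of its members.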
APA, Harvard, Vancouver, ISO, and other citation styles
28

Bhagath, Parabattina, Malempati Shanmukha, and Pradip K. Das. "Hindi spoken digit analysis for native and non-native speakers." IAES International Journal of Artificial Intelligence (IJ-AI) 14, no. 2 (April 1, 2025): 1561. https://doi.org/10.11591/ijai.v14.i2.pp1561-1567.

Full text source
Abstract:
Automated speech recognition (ASR) is the process of using an algorithm or automated system to recognize and translate the spoken words of a specific language. ASR has various applications in fields such as mobile speech recognition, the Internet of Things, and human-machine interaction. Researchers have been working on issues related to ASR for more than 60 years. One of the many use cases of ASR is designing applications, such as digit recognition, that aid differently-abled individuals, children, and elderly people. However, the lack of spoken-language data in under-developed and low-resourced languages presents difficulties. Although this is not a pivotal issue for highly established languages like English, it has a significant impact on less commonly spoken languages. In this paper, we discuss the development of a Hindi spoken-digit dataset and benchmark spoken-digit models using convolutional neural networks (CNNs). The dataset includes both native and non-native Hindi speakers. The CNN models achieve accuracies of 88.44%, 95.15%, and 89.41% for non-native, native, and combined speakers, respectively.
APA, Harvard, Vancouver, ISO, and other citation styles
29

Obosu, Gideon Kwesi, Irene Vanderpuye, Nana Afia Opoku-Asare, and Timothy Olufemi Adigun. "A Qualitative Inquiry into the Factors that Influence Deaf Children's Early Sign Language Acquisition among Deaf Children in Ghana." Sign Language Studies 23, no. 4 (June 2023): 527–54. http://dx.doi.org/10.1353/sls.2023.a905538.

Full text source
Abstract:
Abstract: The linguistic and cognitive importance of early language exposure for deaf children is well reported in the literature. However, most such studies have been conducted in industrialized countries, with fewer conducted in developing and non-industrialized countries such as Ghana. Therefore, grounded in the social interactionist theory of language development, this study explored the factors that influence early acquisition of sign language among deaf children in a low-resource setting in Ghana. Ten mothers of deaf children from these communities were purposively selected for the study. Data were gathered through observation, focus group discussion, and face-to-face interviews using a semi-structured interview guide, and were subsequently analyzed thematically. Parents' knowledge about their children's deafness, sociocultural beliefs, and the parents' interactions with their deaf children at home were found to be the core factors influencing early acquisition of sign language among deaf children in these low-resourced communities. Based on these findings, appropriate recommendations are made for policy and practice.
APA, Harvard, Vancouver, ISO, and other citation styles
30

Pasini, Tommaso, Alessandro Raganato, and Roberto Navigli. "XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 15 (May 18, 2021): 13648–56. http://dx.doi.org/10.1609/aaai.v35i15.17609.

Full text source
Abstract:
Transformer-based architectures brought a breeze of change to Word Sense Disambiguation (WSD), improving models' performances by a large margin. The fast development of new approaches has been further encouraged by a well-framed evaluation suite for English, which has allowed their performances to be kept track of and compared fairly. However, other languages have remained largely unexplored, as testing data are available for a few languages only and the evaluation setting is rather matted. In this paper, we untangle this situation by proposing XL-WSD, a cross-lingual evaluation benchmark for the WSD task featuring sense-annotated development and test sets in 18 languages from six different linguistic families, together with language-specific silver training data. We leverage XL-WSD datasets to conduct an extensive evaluation of neural and knowledge-based approaches, including the most recent multilingual language models. Results show that the zero-shot knowledge transfer across languages is a promising research direction within the WSD field, especially when considering low-resourced languages where large pre-trained multilingual models still perform poorly. We make the evaluation suite and the code for performing the experiments available at https://sapienzanlp.github.io/xl-wsd/.
APA, Harvard, Vancouver, ISO, and other citation styles
31

Dunđer, Ivan. "Machine Translation System for the Industry Domain and Croatian Language." Journal of information and organizational sciences 44, no. 1 (June 25, 2020): 33–50. http://dx.doi.org/10.31341/jios.44.1.2.

Full text source
Abstract:
Machine translation is increasingly becoming a hot research topic in information and communication sciences, computer science and computational linguistics, due to the fact that it enables communication and transferring of meaning across different languages. As the Croatian language can be considered low-resourced in terms of available services and technology, development of new domain-specific machine translation systems is important, especially due to raised interest and needs of industry, academia and everyday users. Machine translation is not perfect, but it is crucial to assure acceptable quality, which is purpose-dependent. In this research, different statistical machine translation systems were built – but one system utilized domain adaptation in particular, with the intention of boosting the output of machine translation. Afterwards, extensive evaluation has been performed – in form of applying several automatic quality metrics and human evaluation with focus on various aspects. Evaluation is done in order to assess the quality of specific machine-translated text.
APA, Harvard, Vancouver, ISO, and other citation styles
32

Bani, Rkia, Samir Amri, Lahbib Zenkouar, and Zouhair Guennoun. "Toward accurate Amazigh part-of-speech tagging." IAES International Journal of Artificial Intelligence (IJ-AI) 13, no. 1 (March 1, 2024): 572. http://dx.doi.org/10.11591/ijai.v13.i1.pp572-580.

Full text source
Abstract:
Part-of-speech (POS) tagging is the process of assigning to each word in a text its corresponding grammatical category (POS). It is an important pre-processing step in other natural language processing (NLP) tasks, hence the objective of finding the most accurate tagger. Earlier approaches were based on traditional machine learning algorithms; later, with the development of deep learning, more POS taggers were adopted. While POS tagging accuracy reaches 97%, even with traditional machine learning, for highly resourced languages like English and French, this is far from the case for low-resource languages like Amazigh, where the most common approaches remain traditional machine learning and the results fall short of those for resource-rich languages. In this paper, we present a new POS tagger for the Amazigh language based on bidirectional long short-term memory; experiments on a real dataset show that it outperforms existing machine learning methods.
APA, Harvard, Vancouver, ISO, and other citation styles
33

Gamage, Buddhi, Randil Pushpananda, Thilini Nadungodage, and Ruvan Weerasinghe. "Applicability of End-to-End Deep Neural Architecture to Sinhala Speech Recognition." International Journal on Advances in ICT for Emerging Regions (ICTer) 17, no. 1 (May 31, 2024): 17–21. http://dx.doi.org/10.4038/icter.v17i1.7273.

Full text source
Abstract:
This research presents a study on the application of end-to-end deep learning models for Automatic Speech Recognition in the Sinhala language, which is characterized by its high inflection and limited resources. We explore two end-to-end architectures, namely the end-to-end Lattice-Free Maximum Mutual Information model and the Recurrent Neural Network model, using a restricted dataset. Statistical models trained on 40 hours of data are established as baselines for evaluation. Our pre-trained end-to-end Automatic Speech Recognition models achieved a Word Error Rate of 23.38%, by far the best reported for the low-resourced Sinhala language. Our models demonstrate greater contextual independence and faster processing, making them more suitable for general-purpose speech-to-text translation in Sinhala.
APA, Harvard, Vancouver, ISO, and other citation styles
34

Khan, Muzammil, Kifayat Ullah, Yasser Alharbi, Ali Alferaidi, Talal Saad Alharbi, Kusum Yadav, Naif Alsharabi, and Aakash Ahmad. "Understanding the Research Challenges in Low-Resource Language and Linking Bilingual News Articles in Multilingual News Archive." Applied Sciences 13, no. 15 (July 25, 2023): 8566. http://dx.doi.org/10.3390/app13158566.

Full text source
Abstract:
The developed world has focused more on Web preservation than the developing world has, especially news preservation for future generations. However, news published online is volatile because of constant changes in the technologies used to disseminate information and in the formats used for publication. News preservation became more complicated and challenging when the archive began to contain articles from low-resourced and morphologically complex languages, like Urdu and Arabic, along with English news articles. The digital news story preservation framework is enriched with eighteen sources for Urdu, Arabic, and English news. This study presents the challenges of low-resource languages (LRLs), the research challenges, and the details of how the framework is enhanced. In this paper, we introduce a multilingual news archive and discuss the digital news story extractor, which addresses major issues in handling low-resource languages and facilitates normalized format migration. Extraction results are presented in detail for a high-resource language, English, and two low-resource languages, Urdu and Arabic. LRLs encountered a higher error rate during preservation than high-resource languages (HRLs), 10% versus 3%, respectively. The extraction results also show that a few news sources are not regularly updated and release few new news stories online. LRLs require more detailed study for accurate news content extraction and archiving for future access. LRLs and HRLs enrich the digital news story preservation (DNSP) framework, and the Digital News Stories Archive (DNSA) preserves a huge number of news articles from multiple news sources in both. This paper presents the research challenges encountered during the preservation of Urdu- and Arabic-language news articles to create a multilingual news archive.
The second part of the paper compares two bilingual linking mechanisms for Urdu-to-English-language news articles in the DNSA: the common ratio measure for dual language (CRMDL) and the similarity measure based on transliteration words (SMTW) with the cosine similarity measure (CSM) baseline technique. The experimental results show that the SMTW is more effective than the CRMDL and CSM for linking Urdu-to-English news articles. The precision improved from 46% and 50% to 60%, and the recall improved from 64% and 67% to 82% for CSM, CRMDL, and SMTW, respectively, with improved impact of common terms as well.
APA, Harvard, Vancouver, ISO, and other citation styles
35

Silber Varod, Vered, Ingo Siegert, Oliver Jokisch, Yamini Sinha, and Nitza Geri. "A cross-language study of speech recognition systems for English, German, and Hebrew." Online Journal of Applied Knowledge Management 9, no. 1 (July 26, 2021): 1–15. http://dx.doi.org/10.36965/ojakm.2021.9(1)1-15.

Full text source
Abstract:
Despite the growing importance of Automatic Speech Recognition (ASR), its application is still challenging, limited, language-dependent, and resource-intensive. The resources required for ASR are not only technical; they also need to reflect technological trends and cultural diversity. The purpose of this research is to explore ASR performance gaps through a comparative study of American English, German, and Hebrew. Apart from different languages, we also investigate different speaking styles: utterances from spontaneous dialogues and utterances from frontal lectures (a TED-like genre). The analysis compares the performance of four ASR engines (Google Cloud, Google Search, IBM Watson, and WIT.ai) using four commonly used metrics: Word Error Rate (WER), Character Error Rate (CER), Word Information Lost (WIL), and Match Error Rate (MER). As expected, findings suggest that English ASR systems provide the best results. Contrary to our hypothesis regarding ASR's low performance for under-resourced languages, we found that the Hebrew and German ASR systems perform similarly. Overall, our findings suggest that ASR performance is language-dependent and system-dependent; furthermore, ASR may be genre-sensitive, as our results showed for German. This research contributes valuable insight toward improving ubiquitous global consumption and management of knowledge, and calls for the corporate social responsibility of commercial companies to develop ASR under Fair, Reasonable, and Non-Discriminatory (FRAND) terms.
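Of the four metrics, Word Error Rate is the most widely reported; a self-contained reference implementation (a sketch, not the evaluation code used in the study) computes it as word-level edit distance divided by reference length:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) divided by the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))       # 0.0
print(wer("the cat sat", "the hat sat down"))  # 2 errors / 3 words
```

CER is the same computation over characters instead of words, and MER and WIL are derived from the same alignment counts, which is why the four metrics usually rank systems consistently.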
APA, Harvard, Vancouver, ISO, and other citation styles
36

Khan, Muzammil, Sarwar Shah Khan, Yasser Alharbi, Ali Alferaidi, Talal Saad Alharbi, and Kusum Yadav. "The Role of Transliterated Words in Linking Bilingual News Articles in an Archive." Applied Sciences 13, no. 7 (March 31, 2023): 4435. http://dx.doi.org/10.3390/app13074435.

Full text source
Abstract:
Retrieving a specific digital information object from a multi-lingual huge and evolving news archives is challenging and complicated against a user query. The processing becomes more difficult to understand and analyze when low-resourced and morphologically complex languages like Urdu and Arabic scripts are included in the archive. Computing similarity against a query and among news articles in huge and evolving collections may be inaccurate and time-consuming at run time. This paper introduces a Similarity Measure based on Transliteration Words (SMTW) from the English language in the Urdu scripts for linking news articles extracted from multiple online sources during the preservation process. The SMTW link Urdu-to-English news articles using an upgraded Urdu-to-English lexicon, including transliteration words. The SMTW was exhaustively evaluated to assess the effectiveness using different size datasets and the results were compared with the Common Ratio Measure for Dual Language (CRMDL). The experimental results show that the SMTW was more effective than the CRMDL for linking Urdu-to-English news articles. The precision improved from 50% to 60%, recall improved from 67% to 82%, and the impact of common terms also improved.
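For reference, a cosine-similarity baseline of the CSM kind can be sketched over term-frequency vectors as follows; the tokenization and example strings are illustrative assumptions, and the SMTW's contribution is to enlarge the shared vocabulary with transliterated English terms before this comparison is made:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents represented as
    term-frequency vectors over their shared vocabulary."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("cricket match report", "cricket match report"))  # ~1.0
print(cosine_similarity("cricket match", "election results"))             # 0.0
```

Without transliteration, an Urdu and an English article about the same event share almost no surface terms, so the dot product, and hence the similarity, collapses toward zero; expanding the lexicon with transliterated words restores the overlap this measure depends on.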
APA, Harvard, Vancouver, ISO, and other citation styles
37

Cadwell, Patrick, Sharon O’Brien, and Eric DeLuca. "More than tweets." Translation Spaces 8, no. 2 (November 5, 2019): 300–333. http://dx.doi.org/10.1075/ts.19018.cad.

Full text source
Abstract:
Abstract The application of machine translation (MT) in crisis settings is of increasing interest to humanitarian practitioners. We collaborated with industry and non-profit partners: (1) to develop and test the utility of an MT system trained specifically on crisis-related content in an under-resourced language combination (French-to-Swahili); and (2) to evaluate the extent to which speakers of both French and Swahili without post-editing experience could be mobilized to post-edit the output of this system effectively. Our small study carried out in Kenya found that our system performed well, provided useful output, and was positively evaluated by inexperienced post-editors. We use the study to discuss the feasibility of MT use in crisis settings for low-resource language combinations and make recommendations on data selection and domain consideration for future crisis-related MT development.
APA, Harvard, Vancouver, ISO, and other citation styles
38

Iyengar, Radhika. "Using Cognitive Neuroscience Principles to Design Efficient Reading Programs: Case Studies from India and Malawi Cognitive Neuroscience to Design Literacy Programs." International Journal of Contemporary Education 2, no. 2 (July 21, 2019): 38. http://dx.doi.org/10.11114/ijce.v2i2.4394.

Full text source
Abstract:
The hidden crisis in education has come to light since the past decade. Millions of school-going children remain illiterate, even after spending 2-3 years in school. This paper explores a cognitive neuroscience driven method to improve children’s reading in two local languages--Chichewa (Malawi) and Telugu (Telangana, India). The paper first presents the science behind how children learn using this science-driven model. It then presents the process of contextualization of this literacy method for Malawi and Telangana, India. The contextualization and adaptation processes lead to some generalized principles that could be applied to other local language literacy programs. The study looks at sequencing of letters, font size and type, teacher training modalities as well as classroom delivery processes, which are all key components for any early literacy intervention. The study also focuses on cost-cutting measures to aid in full implementation and scale-up for a low resourced educational setting.
APA, Harvard, Vancouver, ISO, and other styles
39

Jayawardena, Asitha D. L., Zelda J. Ghersin, Marcos Mirambeaux, Jose A. Bonilla, Ernesto Quiñones, Evelyn Zablah, Kevin Callans, et al. "A Sustainable and Scalable Multidisciplinary Airway Teaching Mission: The Operation Airway 10-Year Experience." Otolaryngology–Head and Neck Surgery 163, no. 5 (June 30, 2020): 971–78. http://dx.doi.org/10.1177/0194599820935042.

Full text of the source
Abstract:
Objective: To address whether a multidisciplinary team of pediatric otolaryngologists, anesthesiologists, pediatric intensivists, speech-language pathologists, and nurses can achieve safe and sustainable surgical outcomes in low-resourced settings when conducting a pediatric airway surgical teaching mission that features a program of progressive autonomy. Study Design: Consecutive case series with chart review. Setting: This study reviews 14 consecutive missions from 2010 to 2019 in Ecuador, El Salvador, and the Dominican Republic. Methods: Demographic data, diagnostic and operative details, and operative outcomes were collected. A country's program met graduation criteria if its multidisciplinary team developed the ability to autonomously manage the preoperative huddle, operating room discussion and setup, operative procedure, and postoperative multidisciplinary pediatric intensive care unit and floor care decision making. This was assessed by direct observation and assessment of surgical outcomes. Results: A total of 135 procedures were performed on 90 patients in Ecuador (n = 24), the Dominican Republic (n = 51), and El Salvador (n = 39). Five patients required transport to the United States to receive quaternary-level care. Thirty-six laryngotracheal reconstructions were completed: 6 single-stage, 12 one-and-a-half-stage, and 18 double-stage cases. We achieved a decannulation rate of 82%. Two programs (Ecuador and the Dominican Republic) met graduation criteria and have become self-sufficient. No mortalities were recorded. Conclusion: This is the largest longitudinal description of an airway reconstruction teaching mission in low- and middle-income countries. Airway reconstruction can be safe and effective in low-resourced settings with a thoughtful multidisciplinary team led by local champions.
APA, Harvard, Vancouver, ISO, and other styles
40

Kumar Nayak, Subrat, Ajit Kumar Nayak, Smitaprava Mishra, Prithviraj Mohanty, Nrusingha Tripathy, and Kumar Surjeet Chaudhury. "Exploring Speech Emotion Recognition in Tribal Language with Deep Learning Techniques." International journal of electrical and computer engineering systems 16, no. 1 (January 2, 2025): 53–64. https://doi.org/10.32985/ijeces.16.1.6.

Full text of the source
Abstract:
Emotion is fundamental to interpersonal interaction since it aids mutual understanding. Developing human-computer interaction and related digital products depends heavily on emotion recognition, so deep learning models for recognizing emotion in speech are an essential area of research. Most speech emotion recognition systems have been deployed only for European and a few Asian languages, and for a low-resource tribal language like KUI no dataset is available; we therefore created a dataset and applied augmentation techniques to increase its size. This study accordingly addresses speech emotion recognition on a low-resourced KUI speech dataset and compares results with and without augmentation. The recordings were made in a studio setting for better speech quality and are labeled with six perceived emotions: ସଡାଙ୍ଗି (angry), େରହା (happy), ଆଜି (fear), ବିକାଲି (sad), ବିଜାରି (disgust), and େଡ଼କ୍‌ (surprise). Mel-frequency cepstral coefficients (MFCCs) are used for feature extraction. Deep learning offers an alternative to traditional methods of recognizing emotion in speech; this study uses a hybrid architecture of Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNNs) for classification. Compared against existing benchmark models, the experiments demonstrate that the proposed hybrid model achieved an accuracy of 96% without augmentation and 97% with augmentation.
APA, Harvard, Vancouver, ISO, and other styles
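The dataset augmentation step described in the abstract above is commonly implemented with simple waveform transforms such as noise injection and time shifting. The sketch below illustrates that general idea only; the noise factor and shift amount are assumed values, not the authors' settings.

```python
import random

def add_noise(samples, noise_factor=0.005):
    """Inject low-amplitude Gaussian noise into a waveform (list of floats)."""
    return [s + noise_factor * random.gauss(0.0, 1.0) for s in samples]

def time_shift(samples, shift):
    """Shift the waveform right by `shift` samples, zero-padding the front."""
    if shift <= 0:
        return list(samples)
    return [0.0] * shift + list(samples[:-shift])

def augment(corpus, noise_factor=0.005, shift=160):
    """Return each clip plus one noisy and one time-shifted copy of it."""
    out = []
    for clip in corpus:
        out.append(list(clip))
        out.append(add_noise(clip, noise_factor))
        out.append(time_shift(clip, shift))
    return out

corpus = [[0.0, 0.1, 0.2, 0.1, 0.0]]  # one toy 5-sample "clip"
augmented = augment(corpus, shift=2)
print(len(augmented))  # 3 clips: original, noisy, shifted
```

Each transform preserves clip length, so downstream MFCC extraction sees uniformly shaped inputs while the corpus size triples.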
41

Kiros, Atakilti Brhanu, and Petros Ukbagergis Aray. "Tigrigna language spellchecker and correction system for mobile phone devices." International Journal of Electrical and Computer Engineering (IJECE) 11, no. 3 (June 1, 2021): 2307. http://dx.doi.org/10.11591/ijece.v11i3.pp2307-2314.

Full text of the source
Abstract:
This paper presents the implementation of a spellchecker and correction system for the low-resourced Tigrigna language on mobile phone devices such as smartphones. Designing and developing a spellchecker for Tigrigna is a challenging task: the Tigrigna script has more than 32 base letters, each combining with seven vowels, so every base letter has six derived forms, and word formation depends mainly on root-and-pattern morphology, exhibiting prefixes, suffixes, and infixes. A few projects have addressed Tigrigna spellchecking as desktop applications and studied the nature of Ethiopic characters. In this work, we propose a system model for Tigrigna spellchecking, error detection, and correction, built on a corpus of 430,379 Tigrigna words. To demonstrate the validity of the designed spellchecker and corrector model and algorithm, a prototype was developed. In testing, the prototype achieved 92% accuracy for Tigrigna spellchecking and correction on mobile phone devices. This result shows that the system model is efficient at flagging misspelled input words and suggesting relevant corrections when writing Tigrigna on mobile phone devices.
APA, Harvard, Vancouver, ISO, and other styles
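The detect-and-suggest cycle a spellchecker like the one above performs can be sketched in a few lines: words absent from the corpus word list are flagged, and in-lexicon candidates are ranked by Levenshtein edit distance. This is a minimal illustration, not the authors' algorithm; the toy Latin-script word list stands in for the 430,379-word Tigrigna corpus.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, max_suggestions=3):
    """Return the closest in-lexicon words; an exact hit needs no correction."""
    if word in lexicon:
        return []
    return sorted(lexicon, key=lambda w: levenshtein(word, w))[:max_suggestions]

lexicon = {"selam", "hagerawi", "mehari", "senay"}  # hypothetical word list
print(suggest("selan", lexicon))  # 'selam' ranks first (distance 1)
```

A production system for a morphologically rich script would additionally normalize the seven vowel orders and strip affixes before lookup, which is where most of the real complexity lies.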
42

Mulla, Rahesha, and B. Suresh Kumar. "Text-Independent Automatic Dialect Recognition of Marathi Language using Spectro-Temporal Characteristics of Voice." International Journal on Recent and Innovation Trends in Computing and Communication 10, no. 2s (December 31, 2022): 313–21. http://dx.doi.org/10.17762/ijritcc.v10i2s.5949.

Full text of the source
Abstract:
A text-independent dialect recognition system for the Marathi language is proposed in this paper. India is rich in language varieties, and each language in turn has its unique dialect variations. Marathi is the official language of Maharashtra and a co-official language of Goa. The literature contains very few studies on Indian language recognition, and fewer still on dialect recognition, so research on regional languages such as Marathi is extremely limited. As part of this research, a case study of dialect recognition for the low-resourced Marathi language is presented. The study was carried out using the Marathi speech corpus provided by the Linguistic Data Consortium for Indian Languages (LDC-IL), which includes four major dialects of Marathi. The explored spectral (rhythmic) and temporal features were evaluated on the classification task. We investigated the performance of six different classifiers: K-nearest neighbor (KNN), Naïve Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), Stochastic Gradient Descent (SGD), and Ridge Classifier (RC). Experimental results demonstrate that the RC classifier performed best, with 84.24% accuracy on fifteen spectral and temporal features; with twelve MFCCs, SGD outperformed all other classifiers at 80.63% accuracy. For dimensionality reduction, a prominent feature subset was identified using the chi-square, mutual information, and ANOVA f-test methods; of these, chi-square-based feature selection proved the best.
APA, Harvard, Vancouver, ISO, and other styles
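As a sketch of how the chi-square feature-selection step named above works, the statistic for a single feature can be computed from a 2x2 term/class contingency table; features whose occurrence depends strongly on the class score high. The counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a 2x2 contingency table:
    [[feature & class, feature & not-class],
     [no feature & class, no feature & not-class]]."""
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total  # counts expected under independence
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts: how often a feature fires inside vs. outside one dialect class.
table = [[40, 10], [20, 30]]
print(round(chi_square(table), 2))  # ~16.67: strong feature/class dependence
```

Ranking all features by this statistic and keeping the top-k is exactly the dimensionality-reduction step the abstract compares against mutual information and the ANOVA f-test.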
43

Phaladi, Amanda, and Thipe Modipa. "The Evaluation of a Code-Switched Sepedi-English Automatic Speech Recognition System." International Journal on Cybernetics & Informatics 13, no. 2 (March 10, 2024): 33–44. http://dx.doi.org/10.5121/ijci.2024.130203.

Full text of the source
Abstract:
Speech technology is a field that encompasses various techniques and tools used to enable machines to interact with speech, such as automatic speech recognition (ASR) and spoken dialog systems, allowing a device to capture spoken words from a human speaker through a microphone. End-to-end approaches such as Connectionist Temporal Classification (CTC) and attention-based methods are the most widely used for developing ASR systems. However, these techniques have mostly been applied to high-resourced languages with large amounts of speech data for training and evaluation, leaving low-resource languages relatively underdeveloped. While the CTC method has been used successfully for other languages, its effectiveness for the Sepedi language remains uncertain. In this study, we present the evaluation of a Sepedi-English code-switched automatic speech recognition system. This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach, and its performance was evaluated on both the NCHLT Sepedi test corpus and the Sepedi Prompted Code Switching corpus. The model produced a lowest WER of 41.9%; however, it faced challenges in recognizing Sepedi-only text.
APA, Harvard, Vancouver, ISO, and other styles
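The word error rate (WER) figure quoted above is the word-level edit distance between reference and hypothesis transcripts, normalized by the reference length. A minimal sketch follows; the code-switched example pair is invented for illustration and is not from the corpus.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences rather than characters.
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (rw != hw))) # substitution
        prev = cur
    return prev[-1] / len(ref)

# Hypothetical code-switched reference/hypothesis pair, for illustration only.
print(wer("ke rata di chips", "ke rata le chips"))  # 1 substitution / 4 words = 0.25
```

A reported WER of 41.9% means roughly two of every five reference words needed such an edit.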
44

Aravinthan, Archchitha, and Charles Eugene. "Exploring Recent NLP Advances for Tamil: Word Vectors and Hybrid Deep Learning Architectures." International Journal on Advances in ICT for Emerging Regions (ICTer) 17, no. 2 (October 9, 2024): 85–101. http://dx.doi.org/10.4038/icter.v17i2.7279.

Full text of the source
Abstract:
The advancements of deep learning methods and the availability of large corpora and data sets have led to an exponential increase in the performance of Natural Language Processing (NLP) methods, resulting in successful NLP applications for various day-to-day tasks such as language translation, voice-to-text, grammar checking, and sentiment analysis. These advancements enabled well-resourced languages to adapt to the digital era while the gap for low-resource languages widened. This research work explores the suitability of recent NLP advancements for Tamil, a low-resource language spoken mainly in South India, Sri Lanka, and Malaysia. The literature survey revealed a lack of comprehensive studies on the effect of recent NLP advancements on Tamil text. To fill this gap, this work analysed the performance of deep learning based text representation and classification approaches, namely word embeddings, Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs), on Tamil text classification tasks. Pretrained word2Vec and FastText word vectors of different dimensions were built for Tamil, and their effectiveness for text classification was evaluated. The study found that the pretrained 300-dimensional FastText word vectors performed better than the other pretrained word vectors for Tamil text classification. Further, four simple hybrid CNN and Bi-GRU models were proposed for Tamil text classification and evaluated; these hybrid models performed better than classical machine learning models, individual CNN and RNN models, and the Multilingual BERT model. These results confirm that jointly learned embeddings combined with deep learning architectures like CNN and RNN can achieve remarkable results for Tamil text classification, and that deep learning approaches can be successful for NLP on Tamil text.
APA, Harvard, Vancouver, ISO, and other styles
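Word-vector methods like the word2Vec and FastText embeddings evaluated above compare words via cosine similarity of their dense vectors. A stdlib sketch with made-up 4-dimensional toy vectors (the paper's vectors are 300-dimensional, and these numbers are illustrative, not real embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings; real FastText vectors for Tamil words would be 300-dimensional.
vec_a = [0.2, 0.1, 0.0, 0.4]
vec_b = [0.2, 0.1, 0.0, 0.4]
vec_c = [-0.4, 0.0, 0.1, -0.2]

print(cosine(vec_a, vec_b))  # identical direction -> similarity ~1.0
print(cosine(vec_a, vec_c))  # opposing direction -> negative similarity
```

In a classification pipeline these vectors are stacked into an embedding matrix that the CNN and Bi-GRU layers then consume.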
45

Zgank, Andrej. "Influence of Highly Inflected Word Forms and Acoustic Background on the Robustness of Automatic Speech Recognition for Human–Computer Interaction." Mathematics 10, no. 5 (February 24, 2022): 711. http://dx.doi.org/10.3390/math10050711.

Full text of the source
Abstract:
Automatic speech recognition is essential for establishing natural communication with a human–computer interface. Speech recognition accuracy strongly depends on the complexity of the language, and highly inflected word forms are a type of unit present in some languages. The acoustic background presents an additional important degradation factor influencing speech recognition accuracy. While the acoustic background has been studied extensively, highly inflected word forms and the combined influence of the two still present a major research challenge. Thus, a novel type of analysis is proposed, in which a dedicated speech database comprised solely of highly inflected word forms is constructed and used for tests. Dedicated test sets with various acoustic backgrounds were generated and evaluated with the Slovenian UMB BN speech recognition system. The baseline word accuracies of 93.88% and 98.53% were reduced to as low as 23.58% and 15.14%, respectively, under the various acoustic backgrounds. The analysis shows that the word accuracy degradation depends on and changes with the acoustic background type and level. Without background noise, the highly inflected word forms' test sets decreased word accuracy from 93.3% to only 63.3% in the worst case. The impact of highly inflected word forms on speech recognition accuracy diminished at higher acoustic background levels and was, in those cases, similar to that of the non-highly-inflected test sets. The results indicate that alternative methods of constructing speech databases, particularly for the low-resourced Slovenian language, could be beneficial.
APA, Harvard, Vancouver, ISO, and other styles
46

Hoque, Md Nesarul, and Umme Salma. "Detecting Level of Depression from Social Media Posts for the Low-resource Bengali Language." Journal of Engineering Advancements 4, no. 02 (June 28, 2023): 49–56. http://dx.doi.org/10.38032/jea.2023.02.003.

Full text of the source
Abstract:
Depression is a mental illness that affects people's thoughts and daily activities. In extreme cases, it can lead to self-harm or suicide. Beyond the individual, depression harms the victim's family, society, and working environment. Therefore, before psychological treatment, it is essential to identify depressed people first. As social media platforms like Facebook pervade everyday life, depressed people share their personal feelings and opinions through posts or comments. Many research works experiment on such text messages in English and other highly-resourced languages, but we identified only limited work in low-resource languages like Bengali. In addition, most of these works deal with a binary classification problem. In this investigation, we classify Bengali depression text into four classes: non-depressive, mild, moderate, and severe. We first developed a depression dataset of 2,598 entries. Then, we applied pre-processing tasks, feature selection techniques, and three types of machine learning (ML) models: classical ML, deep learning (DL), and transformer-based pre-trained models. The XLM-RoBERTa-based pre-trained model outperforms existing works with a 61.11% F1-score and 60.89% accuracy on the four-level depression classification problem. Our proposed machine learning-based automatic detection system can recognize the various stages of depression, from low to high, and may assist psychologists and others in providing level-wise counseling to help depressed people return to ordinary life.
APA, Harvard, Vancouver, ISO, and other styles
47

Shashi Shekhar, Rashmi Gupta, and Jeetendra Kumar. "Hindi Abstractive Text Summarization using Transliteration with Pre-trained Model." Journal of Electrical Systems 20, no. 3s (March 31, 2024): 2089–110. http://dx.doi.org/10.52783/jes.1810.

Full text of the source
Abstract:
Automatic text summarization is a subarea of natural language processing that generates a summary of a text while keeping its key points. Research on summarizing low-resourced language text is very limited. In India, Hindi is spoken by central and north Indian people, and only a few research works have addressed abstractive summarization of the Hindi language. The presence of matras in Hindi makes tokenization difficult, which in turn makes abstractive summarization of Hindi text difficult. In the proposed method, abstractive Hindi text summarization is done using transliteration and fine-tuning, and the model is trained to generate both summaries and headlines. The ROUGE score and BERT score are used to check summary quality. A new semantic-similarity-based performance measure is also proposed to quantify the similarity between reference and predicted summaries. Using the proposed method, we achieved a best ROUGE score of 55.16, a BERT score of 0.80, and a similarity score of 0.98. Along with these performance measures, human evaluation of the predicted summaries was also carried out, finding that the generated summaries and headlines were of human-acceptable quality.
APA, Harvard, Vancouver, ISO, and other styles
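The ROUGE score used for evaluation above measures n-gram overlap between a predicted and a reference summary. A minimal ROUGE-1 F-score sketch follows; the transliterated Hindi strings are invented for illustration and are not from the paper's data.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Unigram-overlap ROUGE-1 F1 between whitespace-tokenized strings."""
    ref, pred = Counter(reference.split()), Counter(prediction.split())
    overlap = sum((ref & pred).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(pred.values())
    return 2 * precision * recall / (precision + recall)

# Same token multiset in a different order -> perfect unigram overlap.
print(rouge1_f("bharat me monsoon der se aaya", "monsoon bharat me der se aaya"))  # 1.0
```

Since ROUGE-1 ignores word order, a semantic-similarity measure of the kind the paper proposes complements it by crediting paraphrases that share meaning but not surface tokens.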
48

Rananga, Seani, Bassey Isong, Abiodun Modupe, and Vukosi Marivate. "Misinformation Detection: A Review for High and Low-Resource Languages." Journal of Information Systems and Informatics 6, no. 4 (December 31, 2024): 2892–922. https://doi.org/10.51519/journalisi.v6i4.931.

Full text of the source
Abstract:
The rapid spread of misinformation on platforms like Twitter and Facebook, and in news headlines, highlights the urgent need for effective ways to detect it. Researchers are increasingly using machine learning (ML) and deep learning (DL) techniques to tackle misinformation detection (MID) because of their proven success. However, this task is still challenging due to the complexity of deceptive language, digital editing tools, and the lack of reliable linguistic resources for non-English languages. This paper provides a comprehensive analysis of relevant research, offering insights into advanced techniques for MID. It covers dataset assessments, the importance of using multiple forms of data (multimodality), and different language representations. By applying the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) methodology, the study identified and analyzed literature from 2019 to 2024 across five databases: Google Scholar, Springer, Elsevier, ACM, and IEEE Xplore. The study selected thirty-one papers and examined the effectiveness of various ML and DL approaches, focusing on performance metrics, datasets, and the challenges of detecting false or misleading information. The findings indicate that most current MID models are heavily dependent on DL techniques, with approximately 81% of studies preferring these over traditional ML methods. In addition, most studies are text-based, with much less attention given to audio, speech, images, and videos. The most effective models are mainly designed for high-resource languages, with English datasets being the most used (67%), followed by Arabic (14%), Chinese (11%), and others; less than 10% of the studies focus on low-resource languages (LRLs). The study therefore highlighted the need for robust datasets and interpretable, scalable MID models for LRLs. It emphasizes the critical need to prioritize and advance MID research for LRLs across all data types, including text, audio, speech, images, videos, and multimodal approaches. This study aims to support ongoing efforts to combat misinformation and promote a more informed understanding of under-resourced African languages.
APA, Harvard, Vancouver, ISO, and other styles
49

Choi, Carolyn Areum. "Transperipheral Educational Mobility: Less Privileged South Korean Young Adults Pursuing English Language Study in a Peripheral City in the Philippines." positions: asia critique 30, no. 2 (May 1, 2022): 377–407. http://dx.doi.org/10.1215/10679847-9573396.

Full text of the source
Abstract:
Abstract: The pursuit of overseas English language education by South Korean youth has resulted in a hierarchy of educational destinations, with migrants studying English in the Global North attaining higher cultural capital compared to those learning English in the Global South. This article examines the experiences of South Korean youth who pursue education in English language schools in the provincial Philippines. Using in-depth interviews and participant observation with South Korean educational migrants in the Philippines and South Korea, it outlines class and regional dynamics in a pattern of youth mobility the author calls “transperipheral educational mobility.” This type of mobility refers to the transnational movement of less-privileged, low-resourced South Korean youth from peripheral regions in South Korea to peripheral cities in the Philippines for the purpose of pursuing English language education in a budget program. Despite being considered “less legitimate” than the credentials earned by their counterparts in destinations in the Global North, the pursuit of English language education in the Global South, as this article shows, provides forms of precultural capital, compensatory middle-class consumption, and entrepreneurial inspiration that strategically and creatively challenge working-class migrants' marginal positions within South Korea's highly stratified and increasingly neoliberal society.
APA, Harvard, Vancouver, ISO, and other styles
50

Bogdanović, Miloš, Milena Frtunić Gligorijević, Jelena Kocić, and Leonid Stoimenov. "Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT." Applied Sciences 15, no. 2 (January 10, 2025): 615. https://doi.org/10.3390/app15020615.

Full text of the source
Abstract:
Producing a new high-quality text corpus is a big challenge due to the complexity and labor expenses involved. High-quality datasets, a prerequisite for many supervised machine learning algorithms, are often available only in very limited quantities. This in turn limits the capabilities of many advanced technologies when applied in a specific field of research and development. This is also the case for the Serbian language, which is considered low-resourced in terms of digitized language resources. In this paper, we address this issue for the Serbian language through a novel approach for generating high-quality text corpora by improving text recognition accuracy for scanned documents belonging to Serbian legal heritage. Our approach integrates three components to deliver high-quality results: a BERT-based large language model built specifically for Serbian legal texts, a high-quality open-source optical character recognition (OCR) model, and a word-level similarity measure for Serbian Cyrillic developed for this research and used to generate the necessary correction suggestions. The approach was evaluated manually on scanned legal documents sampled from three different epochs between 1970 and 2002, with more than 14,500 test cases. We demonstrate that our approach can correct up to 88% of the terms inaccurately extracted by the OCR model from Serbian legal texts.
APA, Harvard, Vancouver, ISO, and other styles
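A word-level similarity measure for ranking OCR correction suggestions, of the kind the abstract describes, can be sketched with Python's standard-library difflib. This is not the authors' measure: the tiny Serbian Cyrillic word list below is an illustrative stand-in for a legal-domain lexicon, and the cutoff value is an assumption.

```python
import difflib

# Hypothetical in-lexicon Serbian Cyrillic legal terms.
LEXICON = ["закон", "законик", "законодавство", "суд", "судија"]

def correction_candidates(ocr_word, lexicon=LEXICON, cutoff=0.6):
    """Rank lexicon words by SequenceMatcher ratio against the OCR output."""
    return difflib.get_close_matches(ocr_word, lexicon, n=3, cutoff=cutoff)

# 'закoн' with a Latin look-alike 'o' - a typical OCR confusion for Cyrillic text.
print(correction_candidates("закoн"))  # 'закон' should rank first
```

In the pipeline described above, a language model would then pick among such candidates using sentence context rather than string similarity alone.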