Academic literature on the topic 'Low-Resourced language'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Low-Resourced language.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Journal articles on the topic "Low-Resourced language"

1

Allah, Fadoua Ataa, and Siham Boulaknadel. "NEW TRENDS IN LESS-RESOURCED LANGUAGE PROCESSING: CASE OF AMAZIGH LANGUAGE." International Journal on Natural Language Computing 12, no. 2 (April 29, 2023): 75–89. http://dx.doi.org/10.5121/ijnlc.2023.12207.

Full text
Abstract:
The coronavirus (COVID-19) pandemic has dramatically changed lifestyles in much of the world. It forced people to profoundly review their relationships with and interactions through digital technologies. Nevertheless, people prefer using these technologies in their favorite languages. Unfortunately, most languages are considered low- or less-resourced, and they do not have the potential to keep up with the new needs. Therefore, this study explores how languages of this kind, mainly Amazigh, will behave in a wholly digital environment, and what new trends to expect. Contrary to past decades, the research gap for low- and less-resourced languages is continually narrowing. Nonetheless, the literature review unveils the need for innovative research to revise their informatization roadmap, while rethinking, in a valuable way, people's behaviors in this rapidly changing environment. Through this work, we first introduce the technology access challenges and explain how natural language processing contributes to overcoming them. Then, we give an overview of existing studies and research related to the informatization of under- and less-resourced languages, with an emphasis on the Amazigh language. Finally, based on these studies and the agile revolution, a new roadmap is presented.
APA, Harvard, Vancouver, ISO, and other styles
2

Kipyatkova, Irina, and Ildar Kagirov. "Deep Models for Low-Resourced Speech Recognition: Livvi-Karelian Case." Mathematics 11, no. 18 (September 5, 2023): 3814. http://dx.doi.org/10.3390/math11183814.

Full text
Abstract:
Recently, there has been a growth in the number of studies addressing the automatic processing of low-resource languages. The lack of speech and text data significantly hinders the development of speech technologies for such languages. This paper introduces an automatic speech recognition system for Livvi-Karelian. Acoustic models based on artificial neural networks with time delays and hidden Markov models were trained using a limited speech dataset of 3.5 h. To augment the data, pitch and speech rate perturbation, SpecAugment, and their combinations were employed. Language models based on 3-grams and neural networks were trained using written texts and transcripts. The achieved word error rate of 22.80% is comparable to that of other low-resource languages. To the best of our knowledge, this is the first speech recognition system for Livvi-Karelian. The results obtained can be of significance for the development of speech technologies not only for Livvi-Karelian but also for other low-resource languages, in fields including speech recognition and machine translation. Future work includes experiments with Karelian data using techniques such as transfer learning and DNN language models.
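The word error rate reported above is the standard Levenshtein edit distance computed over words, divided by the reference length. A minimal sketch (not the authors' code; the function name is mine):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance with dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 2 edits over 6 reference words
```

A WER of 22.80% thus means roughly one word in four or five needs correction.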
3

Singh, Pranaydeep, Orphée De Clercq, and Els Lefever. "Distilling Monolingual Models from Large Multilingual Transformers." Electronics 12, no. 4 (February 18, 2023): 1022. http://dx.doi.org/10.3390/electronics12041022.

Full text
Abstract:
Although language modeling has been trending upwards steadily, models available for low-resourced languages are limited to large multilingual models such as mBERT and XLM-RoBERTa, which come with significant overheads for deployment vis-à-vis their model size, inference speeds, etc. We attempt to tackle this problem by proposing a novel methodology to apply knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model. We demonstrate the viability of this methodology on two downstream tasks each for six languages. We further dive into the possible modifications to the basic setup for low-resourced languages by exploring ideas to tune the final vocabulary of the distilled models. Lastly, we perform a detailed ablation study to understand the different components of the setup better and find out what works best for the two under-resourced languages, Swahili and Slovene.
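Knowledge distillation of the kind described above trains a small student to match the teacher's temperature-softened output distribution. A minimal sketch of the classic distillation loss (the general technique, scaled by T² as in Hinton et al.'s formulation; not the authors' exact setup):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, times T^2.
    A higher temperature exposes more of the teacher's 'dark knowledge'
    about relative class similarities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Identical logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
```

In practice this term is combined with the ordinary cross-entropy on gold labels; vocabulary filtering, as explored in the paper, decides which teacher dimensions the student keeps.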
4

Mabokela, Koena Ronny, Mpho Primus, and Turgay Celik. "Explainable Pre-Trained Language Models for Sentiment Analysis in Low-Resourced Languages." Big Data and Cognitive Computing 8, no. 11 (November 15, 2024): 160. http://dx.doi.org/10.3390/bdcc8110160.

Full text
Abstract:
Sentiment analysis is a crucial tool for measuring public opinion and understanding human communication across digital social media platforms. However, due to linguistic complexities and limited data or computational resources, it is under-represented in many African languages. While state-of-the-art Afrocentric pre-trained language models (PLMs) have been developed for various natural language processing (NLP) tasks, their applications in eXplainable Artificial Intelligence (XAI) remain largely unexplored. In this study, we propose a novel approach that combines Afrocentric PLMs with XAI techniques for sentiment analysis. We demonstrate the effectiveness of incorporating attention mechanisms and visualization techniques in improving the transparency, trustworthiness, and decision-making capabilities of transformer-based models when making sentiment predictions. To validate our approach, we employ the SAfriSenti corpus, a multilingual sentiment dataset for South African under-resourced languages, and perform a series of sentiment analysis experiments. These experiments enable comprehensive evaluations, comparing the performance of Afrocentric models against mainstream PLMs. Our results show that the Afro-XLMR model outperforms all other models, achieving an average F1-score of 71.04% across five tested languages, and the lowest error rate among the evaluated models. Additionally, we enhance the interpretability and explainability of the Afro-XLMR model using Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP). These XAI techniques ensure that sentiment predictions are not only accurate and interpretable but also understandable, fostering trust and reliability in AI-driven NLP technologies, particularly in the context of African languages.
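The abstract reports an average F1-score of 71.04% across five languages. Assuming a standard macro-averaged F1 per language (my reading; the abstract does not specify the averaging scheme), the metric can be computed as:

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1: compute per-class F1, then average,
    so small classes weigh as much as large ones."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy 3-way sentiment example
gold = ["pos", "neg", "neu", "pos"]
pred = ["pos", "neg", "pos", "pos"]
print(macro_f1(gold, pred, ["pos", "neg", "neu"]))
```

Averaging this score over the five evaluated languages would yield the cross-lingual figure the paper reports.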
5

Shafiq, Nida, Isma Hamid, Muhammad Asif, Qamar Nawaz, Hanan Aljuaid, and Hamid Ali. "Abstractive text summarization of low-resourced languages using deep learning." PeerJ Computer Science 9 (January 13, 2023): e1176. http://dx.doi.org/10.7717/peerj-cs.1176.

Full text
Abstract:
Background: Humans must be able to cope with the huge amounts of information produced by the information technology revolution. As a result, automatic text summarization is being employed in a range of industries to assist individuals in identifying the most important information. For text summarization, two approaches are mainly considered: extractive and abstractive summarization. The extractive approach selects chunks of sentences from the source documents, while the abstractive approach can generate a summary based on mined keywords. For low-resourced languages, e.g., Urdu, extractive summarization uses various models and algorithms. However, the study of abstractive summarization in Urdu is still a challenging task. Because there are so many literary works in Urdu, producing abstractive summaries demands extensive research. Methodology: This article proposes a deep learning model for the Urdu language using the Urdu 1 Million news dataset and compares its performance with two widely used machine learning methods, support vector machine (SVM) and logistic regression (LR). The results show that the suggested deep learning model performs better than the other two approaches. The summaries produced by extractive summarization are processed using the encoder-decoder paradigm to create an abstractive summary. Results: With the help of Urdu language specialists, the system-generated summaries were validated, showing the proposed model's improvement and accuracy.
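As a toy illustration of the extractive step mentioned above (not the paper's model, which is a deep learning system), sentences can be scored by the document-level frequency of their words and the top scorers kept in original order:

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Score each sentence by the average document frequency of its
    words and keep the top-scoring sentences, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)
    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in ranked)

print(extractive_summary("Cats are great. Cats sleep a lot. Dogs bark.", 1))  # Cats are great.
```

An abstractive system, by contrast, would feed such selected content through an encoder-decoder to generate new wording, as the paper does for Urdu.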
6

Pandit, Rajat, Saptarshi Sengupta, Sudip Kumar Naskar, Niladri Sekhar Dash, and Mohini Mohan Sardar. "Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language." Informatics 6, no. 2 (May 5, 2019): 19. http://dx.doi.org/10.3390/informatics6020019.

Full text
Abstract:
Semantic similarity is a long-standing problem in natural language processing (NLP). It is a topic of great interest as its understanding can provide a look into how human beings comprehend meaning and make associations between words. However, when this problem is looked at from the viewpoint of machine understanding, particularly for under-resourced languages, it poses a different problem altogether. In this paper, semantic similarity is explored in Bangla, a less-resourced language. For ameliorating the situation in such languages, the most rudimentary method (path-based) and the latest state-of-the-art method (Word2Vec) for semantic similarity calculation were augmented using cross-lingual resources in English, and the results obtained are truly astonishing. In the presented paper, two semantic similarity approaches have been explored in Bangla, namely the path-based and distributional models, and their cross-lingual counterparts were synthesized in light of the English WordNet and corpora. The proposed methods were evaluated on a dataset comprising 162 Bangla word pairs, which were annotated by five expert raters. The correlation scores obtained between the four metrics and human evaluation scores demonstrate a marked enhancement that the cross-lingual approach brings into the process of semantic similarity calculation for Bangla.
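The two families of similarity measures compared in this paper can be sketched in a few lines: a WordNet-style path-based score and a distributional (Word2Vec-style) cosine score. This is a generic illustration, not the authors' implementation:

```python
import math

def cosine_similarity(u, v):
    """Distributional (Word2Vec-style) similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def path_similarity(path_length):
    """WordNet-style path measure: 1 / (1 + length of the shortest path
    between two senses in the taxonomy); identical senses score 1.0."""
    return 1.0 / (1.0 + path_length)

print(path_similarity(3))  # 0.25
```

The cross-lingual augmentation in the paper amounts to computing such scores over English resources (WordNet paths, English embeddings) when the Bangla ones are too sparse.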
7

Badawi, Soran. "Transformer-Based Neural Network Machine Translation Model for the Kurdish Sorani Dialect." UHD Journal of Science and Technology 7, no. 1 (January 15, 2023): 15–21. http://dx.doi.org/10.21928/uhdjst.v7n1y2023.pp15-21.

Full text
Abstract:
The transformer model is one of the most recently developed models for translating texts into another language. The model uses the principle of attention mechanism, surpassing previous models, such as sequence-to-sequence, in terms of performance. It performed well with highly resourced English, French, and German languages. Using the model architecture, we investigate training the modified version of the model in a low-resourced language such as the Kurdish language. This paper presents the first-ever transformer-based neural machine translation model for the Kurdish language by utilizing vocabulary dictionary units that share vocabulary across the dataset. For this purpose, we combine all the existing parallel corpora of Kurdish – English by building a large corpus and training it on the proposed transformer model. The outcome indicated that the suggested transformer model works well with Kurdish texts by scoring (0.45) on bilingual evaluation understudy (BLEU). According to the BLEU standard, the score indicates a high-quality translation.
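BLEU, the metric reporting the 0.45 score above, is the geometric mean of clipped n-gram precisions times a brevity penalty. A simplified single-reference, sentence-level sketch (real evaluations typically use corpus-level BLEU with smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

On this 0-to-1 scale, 0.45 is indeed a strong score for a low-resourced language pair, though BLEU values are only comparable within the same test setup.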
8

Kapočiūtė-Dzikienė, Jurgita, and Senait Gebremichael Tesfagergish. "Part-of-Speech Tagging via Deep Neural Networks for Northern-Ethiopic Languages." Information Technology And Control 49, no. 4 (December 19, 2020): 482–94. http://dx.doi.org/10.5755/j01.itc.49.4.26808.

Full text
Abstract:
Deep Neural Networks (DNNs) have proven to be especially successful in the area of Natural Language Processing (NLP) and Part-Of-Speech (POS) tagging—which is the process of mapping words to their corresponding POS labels depending on the context. Despite recent development of language technologies, low-resourced languages (such as the East African Tigrinya language) have received too little attention. We investigate the effectiveness of Deep Learning (DL) solutions for the low-resourced Tigrinya language of the Northern-Ethiopic branch. We have selected Tigrinya as the testbed example and have tested state-of-the-art DL approaches seeking to build the most accurate POS tagger. We have evaluated DNN classifiers (Feed-Forward Neural Network – FFNN, Long Short-Term Memory – LSTM, Bidirectional LSTM, and Convolutional Neural Network – CNN) on top of neural word2vec word embeddings with a small training corpus known as the Nagaoka Tigrinya Corpus. To determine the best DNN classifier type, architecture, and hyper-parameter set, both manual and automatic hyper-parameter tuning were performed. The BiLSTM method proved to be the most suitable for this task: it achieved the highest accuracy, equal to 92%, which is 65% above the random baseline.
9

Nitu, Melania, and Mihai Dascalu. "Natural Language Processing Tools for Romanian – Going Beyond a Low-Resource Language." Interaction Design and Architecture(s), no. 60 (March 15, 2024): 7–26. http://dx.doi.org/10.55612/s-5002-060-001sp.

Full text
Abstract:
Advances in Natural Language Processing bring innovative instruments to the educational field to improve the quality of the didactic process by addressing challenges like language barriers and creating personalized learning experiences. Most research in the domain is dedicated to high-resource languages, such as English, while languages with limited coverage, like Romanian, are still underrepresented in the field. Operating on low-resource languages is essential to ensure equitable access to educational opportunities and to preserve linguistic diversity. Through continuous investments in developing Romanian educational instruments, we are rapidly going beyond a low-resource language. This paper presents recent educational instruments and frameworks dedicated to Romanian, leveraging state-of-the-art NLP techniques, such as building advanced Romanian language models and benchmarks encompassing tools for language learning, text comprehension, question answering, automatic essay scoring, and information retrieval. The methods and insights gained are transferable to other low-resource languages, emphasizing methodological adaptability, collaborative frameworks, and technology transfer to address similar challenges in diverse linguistic contexts. Two use cases are presented, focusing on assessing student performance in Moodle courses and extracting main ideas from students’ feedback. These practical applications in Romanian academic settings serve as examples for enhancing educational practices in other less-resourced languages.
10

Ngué Um, Emmanuel, Émilie Eliette, Caroline Ngo Tjomb Assembe, and Francis Morton Tyers. "Developing a Rule-Based Machine-Translation System, Ewondo–French–Ewondo." International Journal of Humanities and Arts Computing 16, no. 2 (October 2022): 166–81. http://dx.doi.org/10.3366/ijhac.2022.0289.

Full text
Abstract:
Machine translation (MT) significantly contributes to democratizing access to textual information across multiple languages and is established as a dynamic language service in the global multilingual society. Not surprisingly, the attractiveness of the MT market has stirred up spectacular innovations, driven by artificial intelligence, in the digital technology industry. The commercial stakes of the industry have led to massive investments in the development of automatic translation systems for languages of wider communication and an increased marginalization of minority languages in this avenue. This article reports on the ongoing development of a low-tech, rule-based MT system for Ewondo, a Bantu low-resourced language spoken in Cameroon. The project aims to fill the gap in access to MT services in the target minority language community and to generate parallel corpora from and into the Ewondo language.
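A rule-based MT system of the kind described pairs a bilingual lexicon with transfer rules. The toy sketch below uses French-to-English (not Ewondo, for which no data is given here) to illustrate the lexical-lookup-plus-reordering pipeline; the lexicon and the single rule are invented for the example:

```python
def rule_based_translate(sentence, lexicon, reorder_rules):
    """Look up each word in a bilingual lexicon, then apply word-order
    transfer rules (a toy illustration of the RBMT pipeline)."""
    words = [lexicon.get(w, w) for w in sentence.lower().split()]
    for rule in reorder_rules:
        words = rule(words)
    return " ".join(words)

# Toy French -> English lexicon and a noun-adjective swap rule
lexicon = {"la": "the", "maison": "house", "blanche": "white"}
adjectives = {"white"}

def swap_noun_adjective(words):
    # French places most adjectives after the noun; English before it.
    out = list(words)
    for i in range(len(out) - 1):
        if out[i + 1] in adjectives:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(rule_based_translate("la maison blanche", lexicon, [swap_noun_adjective]))  # the white house
```

A real system such as the one described for Ewondo adds morphological analysis and generation around this transfer core, but the low-tech philosophy is the same.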
More sources

Dissertations / Theses on the topic "Low-Resourced language"

1

Aufrant, Lauriane. "Training parsers for low-resourced languages : improving cross-lingual transfer with monolingual knowledge." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS089/document.

Full text
Abstract:
As a result of the recent blossoming of Machine Learning techniques, the Natural Language Processing field faces an increasingly thorny bottleneck: the most efficient algorithms entirely rely on the availability of large training data. These technological advances remain consequently unavailable for the 7,000 languages in the world, most of which are low-resourced. One way to bypass this limitation is the approach of cross-lingual transfer, whereby resources available in another (source) language are leveraged to help build accurate systems in the desired (target) language. However, despite promising results in research settings, the standard transfer techniques lack the flexibility regarding cross-lingual resources needed to be fully usable in real-world scenarios: exploiting very sparse resources, or assorted arrays of resources. This limitation strongly diminishes the applicability of that approach. This thesis consequently proposes to combine multiple sources and resources for transfer, with an emphasis on selectivity: can we estimate which resource of which language is useful for which input? This strategy is put into practice in the frame of transition-based dependency parsing. To this end, a new transfer framework is designed, with a cascading architecture: it enables the desired combination, while ensuring better-targeted exploitation of each resource, down to the level of the word. Empirical evaluation indeed dampens the enthusiasm for the purely cross-lingual approach -- it remains in general preferable to annotate just a few target sentences -- but also highlights its complementarity with other approaches. Several metrics are developed to precisely characterize cross-lingual similarities, syntactic idiosyncrasies, and the added value of cross-lingual information compared to monolingual training. The substantial benefits of typological knowledge are also explored.
The whole study relies on a series of technical improvements regarding the parsing framework: this work includes the release of a new open-source software package, PanParser, which revisits the so-called dynamic oracles to extend their use cases. Several purely monolingual contributions complete this work, including an exploration of monolingual cascading, which offers promising perspectives with easy-then-hard strategies.
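Transition-based dependency parsing, the applicative frame of this thesis, builds a tree through SHIFT, LEFT-ARC, and RIGHT-ARC actions over a stack and a buffer. A minimal arc-standard sketch driven by a static oracle against gold heads (PanParser itself learns these decisions and generalizes dynamic oracles; this shows only the underlying transition system, assuming a projective tree):

```python
def arc_standard_parse(words, gold_heads):
    """Arc-standard parsing with a static oracle: at each step choose
    LEFT-ARC, RIGHT-ARC, or SHIFT by consulting the gold heads.
    Assumes a projective gold tree; the root (gold head -1) stays None."""
    stack, buffer = [], list(range(len(words)))
    heads = [None] * len(words)

    def children_attached(i):
        # True once every gold dependent of token i has received its head
        return all(heads[j] is not None
                   for j, h in enumerate(gold_heads) if h == i)

    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            if gold_heads[s1] == s0:                            # LEFT-ARC
                heads[s1] = s0
                stack.pop(-2)
                continue
            if gold_heads[s0] == s1 and children_attached(s0):  # RIGHT-ARC
                heads[s0] = s1
                stack.pop()
                continue
        stack.append(buffer.pop(0))                             # SHIFT

    return heads

# "the" -> "cat", "cat" -> "sleeps", "sleeps" is the root
print(arc_standard_parse(["the", "cat", "sleeps"], [1, 2, -1]))  # [1, 2, None]
```

In a trained parser, a classifier replaces the gold-head lookups; the thesis' cascading architecture then decides, per word, which cross-lingual resource informs that classifier.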
2

Susman, Derya. "Turkish Large Vocabulary Continuous Speech Recognition By Using Limited Audio Corpus." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12614207/index.pdf.

Full text
Abstract:
Speech recognition for the Turkish language is a challenging problem from several perspectives. Most of the challenges are related to the morphological structure of the language. Since Turkish is an agglutinative language, it is possible to generate many words from a single stem by using suffixes. This characteristic of the language increases the number of out-of-vocabulary (OOV) words, which degrade the performance of a speech recognizer dramatically. Also, the Turkish language allows words to be ordered in a free manner, which makes it difficult to generate robust language models. In this thesis, the existing models and approaches which address the problem of Turkish LVCSR (Large Vocabulary Continuous Speech Recognition) are explored. Different recognition units (words, morphs, stems and endings) are used in generating the n-gram language models. 3-gram and 4-gram language models are generated with respect to the recognition unit. Since the solution domain of speech recognition involves machine learning, the performance of the recognizer depends on the sufficiency of the audio data used in acoustic model training. However, it is difficult to obtain rich audio corpora for the Turkish language. In this thesis, existing approaches are used to solve the problem of Turkish LVCSR using a limited audio corpus. We also propose several data selection approaches in order to improve the robustness of the acoustic model.
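The OOV problem described above is usually quantified as the fraction of test tokens absent from the training vocabulary; sub-word units (morphs, stem+ending) shrink it because unseen words decompose into seen pieces. A minimal sketch (illustrative, not from the thesis):

```python
def oov_rate(train_vocab, test_tokens):
    """Fraction of test tokens unseen in training: the quantity that
    word-level models suffer from in agglutinative languages, and that
    morph- or stem+ending-based units are meant to reduce."""
    vocab = set(train_vocab)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

# Toy Turkish-like example: "evlerde" (in the houses) was never seen as a
# full word, even though its stem "ev" (house) was.
print(oov_rate(["ev", "evler", "geldi"], ["ev", "evlerde", "geldi", "gelmedi"]))  # 0.5
```

Re-tokenizing both sides into morphs ("ev + ler + de") would drive this rate toward zero, which is why the thesis compares recognition units.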
3

Karim, Hiva. "Best way for collecting data for low-resourced languages." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945.

Full text
Abstract:
Low-resource languages possess a limited number of digitized texts, making it challenging to generate a satisfactory language audio corpus and information retrieval services. Low-resource languages, especially those spoken exclusively in African countries, lack a well-defined and annotated language corpus, making it a big obstacle for experts to provide a comprehensive text processing system. In this study, I identified the best practices for producing and collecting data for such zero/low-resource languages by means of crowd-sourcing. For the purpose of this study, a number of research articles (n=260) were extracted from Google Scholar, Microsoft Academic, and ScienceDirect. From these articles, only the 60 that met the demands of the inclusion criteria were considered for the eligibility review. A full-text version of these research articles was downloaded and then carefully screened to ensure eligibility. As a result of the eligibility assessment of the 60 potentially eligible full-text articles, only 25 were selected and qualified for inclusion in the final review. From the final pool of selected articles concerning data generation practices and collection for low-resource languages, it can be concluded that speech-based audio data is one of the most common and accessible data types. It can be contended that the collection of audio data from speech-based resources, such as native speakers of the intended language and available audio recordings, by taking advantage of new technologies, is the most practical, cost-effective, and common method for collecting data for low-resource languages.
4

Cordova, Johanna. "Le quechua dans les outils numériques, un défi pour le TAL ? Développement de ressources linguistiques et numériques pour le quechua ancashino." Electronic Thesis or Diss., Paris, INALCO, 2024. http://www.theses.fr/2024INAL0031.

Full text
Abstract:
Quechua languages are one of the Amerindian language families with the largest number of native speakers. In Peru, according to the 2017 census, 13.9% of the population have Quechua as their first language, and around 20% speak it. However, the language is almost totally absent from digital tools. In natural language processing (NLP), it is an under-resourced language, with a strong disparity in the amount of resources depending on the variety of Quechua considered. The aim of this thesis is to develop a set of fundamental tools for the automatic processing of a variety of central Quechua, Ancash Quechua, spoken by around 400,000 people and in danger of extinction according to the UNESCO classification. This process involves three stages: digitisation of the resources available in this variety (dictionaries, written corpora), implementation of a morphological analyser, and development of a treebank for morpho-syntactic analysis. These resources will be made available on the web via applications, in particular a search engine that can be used to query the dictionaries available for this language. In a global context of preservation movement of native languages, and while ambitious policies related to linguistic rights are being deployed in the countries of the Andean region, the presence of Quechua in technologies would be an important lever to strengthen its practice and facilitate its teaching
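A morphological analyser for an agglutinative language like Quechua maps a surface form to a stem plus a suffix sequence. The toy sketch below uses greedy suffix stripping with two common Quechua suffixes, -kuna (plural) and -ta (accusative); these glosses are my assumption from general Quechua grammar, not from the thesis, whose analyser is far richer:

```python
# Toy suffix inventory; real Ancash Quechua morphology is far larger.
SUFFIXES = [("kuna", "PL"), ("ta", "ACC")]

def analyze(word, stems):
    """Greedy right-to-left suffix stripping: peel known suffixes off the
    word end, then check that what remains is a known stem.
    Returns (stem, [tags in surface order]) or None."""
    tags = []
    while True:
        for suffix, tag in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)]
                tags.append(tag)
                break
        else:
            break  # no suffix matched
    return (word, list(reversed(tags))) if word in stems else None

print(analyze("wasikunata", {"wasi"}))  # ('wasi', ['PL', 'ACC'])
```

Production analysers are typically finite-state transducers that also handle allomorphy and suffix-ordering constraints, which plain string stripping cannot capture.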
5

Samson, Juan Sarah Flora. "Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia." Thesis, Université Grenoble Alpes (ComUE), 2015. http://www.theses.fr/2015GREAM061/document.

Full text
Abstract:
Languages in Malaysia are dying at an alarming rate. As of today, 15 languages are in danger while two languages are extinct. One of the methods to save languages is to document them, but it is a tedious task when performed manually. An Automatic Speech Recognition (ASR) system could be a tool to help speed up the process of documenting speech from native speakers. However, building ASR systems for a target language requires a large amount of training data, as current state-of-the-art techniques are based on empirical approaches. Hence, there are many challenges in building ASR for languages that have limited data available. The main aim of this thesis is to investigate the effects of using data from closely related languages to build ASR for low-resource languages in Malaysia. Past studies have shown that cross-lingual and multilingual methods can improve the performance of low-resource ASR. In this thesis, we try to answer several questions concerning these approaches: How do we know which language is beneficial for our low-resource language? How does the relationship between source and target languages influence speech recognition performance? Is pooling language data an optimal approach for a multilingual strategy? Our case study is Iban, an under-resourced language spoken on the island of Borneo. We study the effects of using data from Malay, a dominant local language close to Iban, for developing Iban ASR under different resource constraints. We have proposed several approaches to adapt Malay data to obtain pronunciation and acoustic models for Iban speech. Building a pronunciation dictionary from scratch is time consuming, as one needs to properly define the sound units of each word in a vocabulary, so we developed a semi-supervised approach to quickly build a pronunciation dictionary for Iban.
It was based on bootstrapping techniques for improving Malay data to match Iban pronunciations. To increase the performance of low-resource acoustic models, we explored two acoustic modelling techniques: Subspace Gaussian Mixture Models (SGMM) and Deep Neural Networks (DNN). We applied cross-lingual strategies in both frameworks to adapt out-of-language data to Iban speech. Results show that using Malay data is beneficial for increasing the performance of Iban ASR. We also tested SGMM and DNN to improve low-resource non-native ASR. We proposed a fine merging strategy for obtaining an optimal multi-accent SGMM. In addition, we developed an accent-specific DNN using native speech data. After applying both methods, we obtained significant improvements in ASR accuracy. From our study, we observe that using SGMM and DNN in a cross-lingual strategy is effective when training data is very limited.

Books on the topic "Low-Resourced language"

1

Multilingual processing in eastern and southern EU languages: Low-resourced technologies and translation. Newcastle upon Tyne, UK: Cambridge Scholars Publishing, 2012.


Book chapters on the topic "Low-Resourced language"

1

Pattnaik, Sagarika, and Ajit Kumar Nayak. "An Automatic Summarizer for a Low-Resourced Language." In Advances in Intelligent Systems and Computing, 285–95. Singapore: Springer Singapore, 2020. http://dx.doi.org/10.1007/978-981-15-1081-6_24.

2

Mbaye, Derguene, Moussa Diallo, and Thierno Ibrahima Diop. "Low-Resourced Machine Translation for Senegalese Wolof Language." In Proceedings of Eighth International Congress on Information and Communication Technology, 243–55. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-3236-8_19.

3

Rögnvaldsson, Eiríkur. "Language Report Icelandic." In European Language Equality, 159–62. Cham: Springer International Publishing, 2023. http://dx.doi.org/10.1007/978-3-031-28819-7_21.

Abstract:
In 2019, the Icelandic Government launched a three-year Language Technology Programme for Icelandic (LTPI). Within this programme, a number of language resources and tools have been built from scratch and several pre-existing resources and tools have been enhanced and improved. This programme is now finished and the situation for Icelandic with respect to language technology has improved considerably. In spite of this, Icelandic still remains a low-resourced language compared to most official European languages.
4

Adda-Decker, Martine, Lori Lamel, Gilles Adda, and Thomas Lavergne. "A First LVCSR System for Luxembourgish, a Low-Resourced European Language." In Human Language Technology Challenges for Computer Science and Linguistics, 479–90. Cham: Springer International Publishing, 2014. http://dx.doi.org/10.1007/978-3-319-14120-6_39.

5

Datta, Goutam, Nisheeth Joshi, and Kusum Gupta. "Analysis of Automatic Evaluation Metric on Low-Resourced Language: BERTScore vs BLEU Score." In Speech and Computer, 155–62. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-20980-2_14.

6

Thandil, Rizwana Kallooravi, K. P. Mohamed Basheer, and V. K. Muneer. "A Multi-feature Analysis of Accented Multisyllabic Malayalam Words—a Low-Resourced Language." In Lecture Notes in Networks and Systems, 243–51. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-1203-2_21.

7

Thandil, Rizwana Kallooravi, K. P. Mohamed Basheer, and V. K. Muneer. "End-to-End Unified Accented Acoustic Model for Malayalam-A Low Resourced Language." In Communications in Computer and Information Science, 346–54. Cham: Springer International Publishing, 2023. http://dx.doi.org/10.1007/978-3-031-33231-9_25.

8

Bani, Rkia, Samir Amri, Lahbib Zenkouar, and Zouhair Guennoun. "Part of Speech Tagging of Amazigh Language as a Very Low-Resourced Language: Particularities and Challenges." In Artificial Intelligence and Industrial Applications, 172–82. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-43520-1_15.

9

Anagha, H. M., Karthik Sairam, Janya Mahesh, and H. R. Mamatha. "Paraphrase Generation and Deep Learning Models for Paraphrase Detection in a Low-Resourced Language: Kannada." In Advances in Data-Driven Computing and Intelligent Systems, 283–93. Singapore: Springer Nature Singapore, 2024. http://dx.doi.org/10.1007/978-981-99-9531-8_23.

10

Thandil, Rizwana Kallooravi, K. P. Mohamed Basheer, and V. K. Muneer. "Deep Spectral Feature Representations Via Attention-Based Neural Network Architectures for Accented Malayalam Speech—A Low-Resourced Language." In Proceedings of Data Analytics and Management, 1–13. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-6553-3_1.


Conference papers on the topic "Low-Resourced language"

1

Jayakody, Ravindu, and Gihan Dias. "Performance of Recent Large Language Models for a Low-Resourced Language." In 2024 International Conference on Asian Language Processing (IALP), 162–67. IEEE, 2024. http://dx.doi.org/10.1109/ialp63756.2024.10661169.

2

Pal, Vaishali, Evangelos Kanoulas, Andrew Yates, and Maarten de Rijke. "Table Question Answering for Low-resourced Indic Languages." In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 75–92. Stroudsburg, PA, USA: Association for Computational Linguistics, 2024. http://dx.doi.org/10.18653/v1/2024.emnlp-main.5.

3

Masethe, Mosima Anna, Hlaudi Daniel Masethe, and Sunday O. Ojo. "Context-Based Question Answering Using Large Language BERT Variant Models for Low Resourced Sesotho sa Leboa Language." In 2024 4th International Multidisciplinary Information Technology and Engineering Conference (IMITEC), 507–13. IEEE, 2024. https://doi.org/10.1109/imitec60221.2024.10850997.

4

Dong, Lukuan, Donghong Qin, Fengbo Bai, Fanhua Song, Yan Liu, Chen Xu, and Zhijian Ou. "Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-Based Multilingual Pretraining." In 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), 264–68. IEEE, 2024. https://doi.org/10.1109/iscslp63861.2024.10800186.

5

Zhang, Jiajie, Shulin Cao, Linmei Hu, Ling Feng, Lei Hou, and Juanzi Li. "KB-Plugin: A Plug-and-play Framework for Large Language Models to Induce Programs over Low-resourced Knowledge Bases." In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2868–82. Stroudsburg, PA, USA: Association for Computational Linguistics, 2024. http://dx.doi.org/10.18653/v1/2024.emnlp-main.168.

6

Masethe, Hlaudi Daniel, Lawrence M. Mothapo, Sunday O. Ojo, Pius A. Owolawi, Mosima Anna Masethe, and Fausto Giunchigilia. "Machine Translation for Morphologically Rich Low-Resourced South African Languages." In 2024 4th International Multidisciplinary Information Technology and Engineering Conference (IMITEC), 71–78. IEEE, 2024. https://doi.org/10.1109/imitec60221.2024.10850972.

7

Abisado, Mideth B., Maria Luisa G. Bautista, Marilen F. Pacis, Joseph Marvin R. Imperial, Ramon L. Rodriguez, Bernie S. Fabito, Jean V. Malolos, Mico C. Magtira, and Mariecar G. Alfon. "RespiratoryPH: Empowering Low-Resourced Languages Through Multilingual and Multi-Labeled Social Media Dataset Towards Intelligent Public Health Disease Surveillance." In 2024 IEEE International Conference on Progress in Informatics and Computing (PIC), 1–6. IEEE, 2024. https://doi.org/10.1109/pic62406.2024.10892651.

8

Gupta, Akshat. "On Building Spoken Language Understanding Systems for Low Resourced Languages." In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Stroudsburg, PA, USA: Association for Computational Linguistics, 2022. http://dx.doi.org/10.18653/v1/2022.sigmorphon-1.1.

9

Anagha, H. M., Karthik Sairam, Janya Mahesh, and H. R. Mamatha. "Paraphrase Detection in a Low Resourced Language: Kannada." In 2023 IEEE 8th International Conference for Convergence in Technology (I2CT). IEEE, 2023. http://dx.doi.org/10.1109/i2ct57861.2023.10126391.

10

Varga, István, and Shoichi Yokoyama. "Bilingual dictionary generation for low-resourced language pairs." In the 2009 Conference. Morristown, NJ, USA: Association for Computational Linguistics, 2009. http://dx.doi.org/10.3115/1699571.1699625.
