Journal articles on the topic 'Multilingual Modeling'

To see the other types of publications on this topic, follow the link: Multilingual Modeling.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'Multilingual Modeling.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Haas, Alison, Scott E. Grapin, Lorena Llosa, and Okhee Lee. "Computational Modeling With Multilingual Learners." Science and Children 60, no. 7 (September 2023): 64–70. http://dx.doi.org/10.1080/00368148.2023.12315941.

2

Santhosh Kumar, C., and V. P. Mohandas. "Robust features for multilingual acoustic modeling." International Journal of Speech Technology 14, no. 3 (May 11, 2011): 147–55. http://dx.doi.org/10.1007/s10772-011-9092-6.

3

Grutman, Rainier. "The Missing Link: Modeling Readers of Multilingual Writing." Journal of Literary Multilingualism 1, no. 1 (May 2023): 15–36. http://dx.doi.org/10.1163/2667324x-20230103.

Abstract:
This contribution tries to fill the gap concerning the place and role of readers in multilingual studies by focusing on the ways in which multilingual texts both do and do not create multilingual readers. Three scenarios are illustrated with two examples each. So-called ‘shared multilingualism’ implies bilingual competence (and excludes monolingual readers) by juxtaposing languages with little overlap. Other texts exhibit more than one language yet construct a monolingual reader, while others still reward bilingual competence and at the same time accommodate monolingual incompetence.
4

Park, Hyunji Hayley, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. "Morphology Matters: A Multilingual Language Modeling Analysis." Transactions of the Association for Computational Linguistics 9 (March 17, 2021): 261–76. http://dx.doi.org/10.1162/tacl_a_00365.

Abstract:
Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.
5

Lindén, Krister. "Multilingual modeling of cross-lingual spelling variants." Information Retrieval 9, no. 3 (June 2006): 295–310. http://dx.doi.org/10.1007/s10791-006-1541-5.

6

Han, Yao Jun, and Xue Mei Luo. "Modeling and Analysis of Multilingual Information Parallel Downloads in Data Grid." Applied Mechanics and Materials 263-266 (December 2012): 1424–28. http://dx.doi.org/10.4028/www.scientific.net/amm.263-266.1424.

Abstract:
Parallel downloading of multilingual information calls for powerful graphical and analytical tools, as information in a variety of languages is distributed across different Web pages and the underlying databases in a data grid are heterogeneous and uneven. Petri nets are a powerful graphical and mathematical tool for describing concurrent, asynchronous, and dynamic events. The parallel downloading of multilingual information was modeled and analyzed using an extended timed colored Petri net (ETSdCPN). In the ETSdCPN model, colors represent information in different languages, and the time duration, associated with places rather than transitions, is a function of tokens rather than a constant. The reachable parallel download graph (RPDG) of the ETSdCPN is defined. Finally, important results such as the satisfaction rate and makespan of multilingual information parallel downloads are obtained by analyzing the reachability of the Petri net.
7

Song, Guizhe, Degen Huang, and Zhifeng Xiao. "A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution." Information 12, no. 5 (May 12, 2021): 205. http://dx.doi.org/10.3390/info12050205.

Abstract:
Multilingual characteristics, lack of annotated data, and imbalanced sample distribution are the three main challenges for toxic comment analysis in a multilingual setting. This paper proposes a multilingual toxic text classifier which adopts a novel fusion strategy that combines different loss functions and multiple pre-training models. Specifically, the proposed learning pipeline starts with a series of pre-processing steps, including translation, word segmentation, purification, text digitization, and vectorization, to convert word tokens to a vectorized form suitable for the downstream tasks. Two models, multilingual bidirectional encoder representation from transformers (MBERT) and XLM-RoBERTa (XLM-R), are employed for pre-training through Masking Language Modeling (MLM) and Translation Language Modeling (TLM), which incorporate semantic and contextual information into the models. We train six base models and fuse them to obtain three fusion models using the F1 scores as the weights. The models are evaluated on the Jigsaw Multilingual Toxic Comment dataset. Experimental results show that the best fusion model outperforms the two state-of-the-art models, MBERT and XLM-R, in F1 score by 5.05% and 0.76%, respectively, verifying the effectiveness and robustness of the proposed fusion strategy.
8

Hao, Shudong, and Michael J. Paul. "An Empirical Study on Crosslingual Transfer in Probabilistic Topic Models." Computational Linguistics 46, no. 1 (March 2020): 95–134. http://dx.doi.org/10.1162/coli_a_00369.

Abstract:
Probabilistic topic modeling is a common first step in crosslingual tasks to enable knowledge transfer and extract multilingual features. Although many multilingual topic models have been developed, their assumptions about the training corpus are quite varied, and it is not clear how well the different models can be utilized under various training conditions. In this article, the knowledge transfer mechanisms behind different multilingual topic models are systematically studied, and through a broad set of experiments with four models on ten languages, we provide empirical insights that can inform the selection and future development of multilingual topic models.
9

Rahimi, Razieh, Azadeh Shakery, and Irwin King. "Multilingual information retrieval in the language modeling framework." Information Retrieval Journal 18, no. 3 (May 6, 2015): 246–81. http://dx.doi.org/10.1007/s10791-015-9255-1.

10

Mitchell, Joan S., Marcia Lei Zeng, and Maja Žumer. "Modeling Classification Systems in Multicultural and Multilingual Contexts." Cataloging & Classification Quarterly 52, no. 1 (December 18, 2013): 90–101. http://dx.doi.org/10.1080/01639374.2013.845620.

11

Teferra, Solomon, Martha Yifiru, and Tanja Schultz. "DNN-based Multilingual Acoustic Modeling for Four Ethiopian Languages." SINET: Ethiopian Journal of Science 46, no. 3 (March 27, 2024): 237–49. http://dx.doi.org/10.4314/sinet.v46i3.2.

Abstract:
In this paper, we present the results of experiments conducted on multilingual acoustic modeling in the development of an Automatic Speech Recognition (ASR) system using speech data of phonetically closely related Ethiopian languages (Amharic, Tigrigna, Oromo, and Wolaytta) with multilingual (ML) mix and multitask approaches. The use of speech data from only phonetically closely related languages brought improvement over results reported in a previous work that used 26 languages (including the four languages). A maximum Word Error Rate (WER) reduction from 25.03% (in the previous work) to 21.52% has been achieved for Wolaytta, which is a relative WER reduction of 14.02%. As a result of using multilingual acoustic modeling for the development of an automatic speech recognition (ASR) system, a relative WER reduction of up to 7.36% (a WER reduction from 23.23% to 21.52%) has been achieved over a monolingual ASR. Compared to the ML mix, the multitask approach brought a better performance improvement (a relative WER reduction of up to 5.9%). Experiments have also been conducted using Amharic and Tigrigna in one pair and Oromo and Wolaytta in another pair. The results of the experiments showed that the languages with relatively better language resources for lexical and language modeling (Amharic and Tigrigna) benefited from the use of speech data from only two languages. Generally, the findings show that the use of speech corpora of phonetically related languages with the multitask multilingual modeling approach for the development of ASR systems for less-resourced languages is a promising solution.
12

Pian, Weiguo, Hanyu Peng, Xunzhu Tang, Tiezhu Sun, Haoye Tian, Andrew Habib, Jacques Klein, and Tegawendé F. Bissyandé. "MetaTPTrans: A Meta Learning Approach for Multilingual Code Representation Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (June 26, 2023): 5239–47. http://dx.doi.org/10.1609/aaai.v37i4.25654.

Abstract:
Representation learning of source code is essential for applying machine learning to software engineering tasks. Learning code representation from a multilingual source code dataset has been shown to be more effective than learning from single-language datasets separately, since more training data from multilingual dataset improves the model's ability to extract language-agnostic information from source code. However, existing multilingual training overlooks the language-specific information which is crucial for modeling source code across different programming languages, while only focusing on learning a unified model with shared parameters among different languages for language-agnostic information modeling. To address this problem, we propose MetaTPTrans, a meta learning approach for multilingual code representation learning. MetaTPTrans generates different parameters for the feature extractor according to the specific programming language type of the input code snippet, enabling the model to learn both language-agnostic and language-specific information with dynamic parameters in the feature extractor. We conduct experiments on the code summarization and code completion tasks to verify the effectiveness of our approach. The results demonstrate the superiority of our approach with significant improvements on state-of-the-art baselines.
13

Lewoniewski, Włodzimierz, Krzysztof Węcel, and Witold Abramowicz. "Modeling Popularity and Reliability of Sources in Multilingual Wikipedia." Information 11, no. 5 (May 13, 2020): 263. http://dx.doi.org/10.3390/info11050263.

Abstract:
One of the most important factors impacting the quality of content in Wikipedia is the presence of reliable sources. By following references, readers can verify facts or find more details about the described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users, so information about the same topic may be inconsistent. This also applies to the use of references in different language versions of a particular article, so the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We presented 10 models for the assessment of the popularity and reliability of the sources based on an analysis of meta information about the references in Wikipedia articles, page views, and authors of the articles. Using DBpedia and Wikidata, we automatically identified the alignment of the sources to a specific domain. Additionally, we analyzed the changes in popularity and reliability over time and identified growth leaders in each of the considered months. The results can be used for quality improvements of the content in different language versions of Wikipedia.
14

Hermann, Enno, Herman Kamper, and Sharon Goldwater. "Multilingual and unsupervised subword modeling for zero-resource languages." Computer Speech & Language 65 (January 2021): 101098. http://dx.doi.org/10.1016/j.csl.2020.101098.

15

Natvig, David. "Modeling Heritage Language Phonetics and Phonology: Toward an Integrated Multilingual Sound System." Languages 6, no. 4 (December 14, 2021): 209. http://dx.doi.org/10.3390/languages6040209.

Abstract:
Although heritage language phonology is often argued to be fairly stable, heritage language speakers often sound noticeably different from both monolinguals and second-language learners. In order to model these types of asymmetries, I propose a theoretical framework—an integrated multilingual sound system—based on modular representations of an integrated set of phonological contrasts. An examination of general findings in laryngeal (voicing, aspiration, etc.) phonetics and phonology for heritage languages shows that procedures for pronouncing phonemes are variable and plastic, even if abstract representations may remain stable. Furthermore, an integrated multilingual sound system predicts that use of one language may require a subset of the available representations, which illuminates the mechanisms that underlie phonological transfer, attrition, and acquisition.
16

Shliazhko, Oleh, Alena Fenogenova, Maria Tikhonova, Anastasia Kozlova, Vladislav Mikhailov, and Tatiana Shavrina. "mGPT: Few-Shot Learners Go Multilingual." Transactions of the Association for Computational Linguistics 12 (2024): 58–79. http://dx.doi.org/10.1162/tacl_a_00633.

Abstract:
This paper introduces mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 Corpus. We detail the design and pretraining procedure. The models undergo an intrinsic and extrinsic evaluation: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world knowledge probing in 23 languages. The in-context learning abilities are on par with the contemporaneous language models while covering a larger number of languages, including underrepresented and low-resource languages of the Commonwealth of Independent States and the indigenous peoples in Russia. The source code and the language models are publicly available under the MIT license.
17

Li, Rui, Liyang He, Qi Liu, Yuze Zhao, Zheng Zhang, Zhenya Huang, Yu Su, and Shijin Wang. "CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (March 24, 2024): 8679–87. http://dx.doi.org/10.1609/aaai.v38i8.28713.

Abstract:
Multilingual code retrieval aims to find code snippets relevant to a user's query from a multilingual codebase, which plays a crucial role in software development and expands their application scenarios compared to classical monolingual code retrieval. Despite the performance improvements achieved by previous studies, two crucial problems are overlooked in the multilingual scenario. First, certain programming languages face data scarcity in specific domains, resulting in limited representation capabilities within those domains. Second, different programming languages can be used interchangeably within the same domain, making it challenging for multilingual models to accurately identify the intended programming language of a user's query. To address these issues, we propose the CommONalities and SpecIalties Driven Multilingual CodE Retrieval Framework (CONSIDER), which includes two modules. The first module enhances the representation of various programming languages by modeling pairwise and global commonalities among them. The second module introduces a novel contrastive learning negative sampling algorithm that leverages language confusion to automatically extract specific language features. Through our experiments, we confirm the significant benefits of our model in real-world multilingual code retrieval scenarios in various aspects. Furthermore, an evaluation demonstrates the effectiveness of our proposed CONSIDER framework in monolingual scenarios as well. Our source code is available at https://github.com/smsquirrel/consider.
18

Choi, Sung-Kwon, and Younggil Kim. "Linguistic Modeling for Multilingual Machine Translation based on Common Transfer." Language and Information 18, no. 1 (June 30, 2014): 77–97. http://dx.doi.org/10.29403/li.18.1.4.

19

Nejad, Gholamali, and Mohammadreza Rostamzadeh. "Towards an Evaluation Framework for Multilingual Supported Data Modeling Patterns." International Journal of Computer Applications 143, no. 10 (June 17, 2016): 9–13. http://dx.doi.org/10.5120/ijca2016910364.

20

Stepykin, N. I. "Experience in Modeling Associative Fields (Project “Multilingual Associative Thesaurus of Politeness”)." Nauchnyi dialog, no. 3 (March 27, 2021): 106–20. http://dx.doi.org/10.24224/2227-1295-2021-3-106-120.

Abstract:
The article is devoted to modeling associative fields vezhlivaya (f) and vezhlivyy (m) (polite) based on the materials of the project “Multilingual associative thesaurus of politeness”. The relevance of the study is due to the need to identify the structure and content of the associative-verbal network of a native speaker, which is possible when referring to the data of a free associative experiment. The author considers the combination of stimulus-response as a speech action. The novelty of the research lies in the fact that the analysis of associative data is carried out based on the operational model of speech production of distributive activation, which makes it possible to explain the presence of various reactions in the structure of the associative field. When analyzing each speech action and operation, the principle of approaching the word as a unity of the acoustic image and concept is considered. This indissoluble unity is manifested in the simultaneous mechanism of speech actions of conceptualization and internal articulation. A comparative analysis of the associations of the respondents in the masculine and female groups based on the operational model of speech production of distributive activation made it possible to identify universal and gender-specific features in the structure and content of the analyzed associative fields. It is concluded that it is possible to use the speech production model developed by the author in modeling associative fields.
21

Jayanna, H. S., and B. G. Nagaraja. "An Experimental Comparison of Modeling Techniques and Combination of Speaker – Specific Information from Different Languages for Multilingual Speaker Identification." Journal of Intelligent Systems 25, no. 4 (October 1, 2016): 529–38. http://dx.doi.org/10.1515/jisys-2014-0128.

Abstract:
Most of the state-of-the-art speaker identification systems work on a monolingual (preferably English) scenario. Therefore, English-language autocratic countries can use the system efficiently for speaker recognition. However, there are many countries, including India, that are multilingual in nature. People in such countries have habituated to speak multiple languages. The existing speaker identification system may yield poor performance if a speaker’s train and test data are in different languages. Thus, developing a robust multilingual speaker identification system is an issue in many countries. In this work, an experimental evaluation of the modeling techniques, including self-organizing map (SOM), learning vector quantization (LVQ), and Gaussian mixture model-universal background model (GMM-UBM) classifiers for multilingual speaker identification, is presented. The monolingual and crosslingual speaker identification studies are conducted using 50 speakers of our own database. It is observed from the experimental results that the GMM-UBM classifier gives better identification performance than the SOM and LVQ classifiers. Furthermore, we propose a combination of speaker-specific information from different languages for crosslingual speaker identification, and it is observed that the combination feature gives better performance in all the crosslingual speaker identification experiments.
22

Kim, Hyunah, Christine Barron, Jeanne Sinclair, and Eunice Eunhee Jang. "Change in home language environment and English literacy achievement over time: A multi-group latent growth curve modeling investigation." Language Testing 37, no. 4 (June 30, 2020): 573–99. http://dx.doi.org/10.1177/0265532220930348.

Abstract:
In most studies investigating the educational outcomes of linguistically diverse students, variables that identify this population have been considered as static. In reality, owing to the dynamic nature of students and their families, students’ home language environments change over time. This study aims to understand how elementary school students’ home language environments change over time, and how longitudinal patterns of English literacy achievement across grades 3, 6, and 10 differ among students with various home language shift patterns in Ontario, Canada. The longitudinal cohort data of 89,609 students between grades 3 and 10 from the provincial assessments were analyzed for changes in their home language environment. A subsample of 18,000 students was used to examine different patterns of relative literacy performance over time and their associations with immigration background and early intervention programming using multi-group latent growth curve modeling. Our findings suggest a strong movement toward an English-dominant home language environment among multilingual students; yet, students whose homes remained as multilingual demonstrated the highest literacy achievement in the early grade as well as the highest improvement in relative performance over time. The paper draws implications for promoting students’ home language, instilling a positive view of multilingual competence.
23

Singh, Pranaydeep, Orphée De Clercq, and Els Lefever. "Distilling Monolingual Models from Large Multilingual Transformers." Electronics 12, no. 4 (February 18, 2023): 1022. http://dx.doi.org/10.3390/electronics12041022.

Abstract:
Although language modeling has been trending upwards steadily, models available for low-resourced languages are limited to large multilingual models such as mBERT and XLM-RoBERTa, which come with significant overheads for deployment vis-à-vis their model size, inference speeds, etc. We attempt to tackle this problem by proposing a novel methodology to apply knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model. We demonstrate the viability of this methodology on two downstream tasks each for six languages. We further dive into the possible modifications to the basic setup for low-resourced languages by exploring ideas to tune the final vocabulary of the distilled models. Lastly, we perform a detailed ablation study to understand the different components of the setup better and find out what works best for the two under-resourced languages, Swahili and Slovene.
24

Mercha, El Mahdi, Houda Benbrahim, and Mohammed Erradi. "Heterogeneous text graph for comprehensive multilingual sentiment analysis: capturing short- and long-distance semantics." PeerJ Computer Science 10 (February 23, 2024): e1876. http://dx.doi.org/10.7717/peerj-cs.1876.

Abstract:
Multilingual sentiment analysis (MSA) involves the task of comprehending people’s opinions, sentiments, and emotions in multilingual written texts. This task has garnered considerable attention due to its importance in extracting insights for decision-making across diverse fields such as marketing, finance, and politics. Several studies have explored MSA using deep learning methods. Nonetheless, a majority of these studies depend on sequential-based approaches, which focus on capturing short-distance semantics within adjacent word sequences, but they overlook long-distance semantics, which can provide more profound insights for analysis. In this work, we propose an approach for multilingual sentiment analysis, namely MSA-GCN, leveraging a graph convolutional network to effectively capture both short- and long-distance semantics. MSA-GCN involves the comprehensive modeling of the multilingual sentiment analysis corpus through a unified heterogeneous text graph. Subsequently, a slightly deep graph convolutional network is employed to acquire predictive representations for all nodes by encouraging the transfer learning across languages. Extensive experiments are carried out on various language combinations using different benchmark datasets to assess the efficiency of the proposed approach. These datasets include Multilingual Amazon Reviews Corpus (MARC), Internet Movie Database (IMDB), Allociné, and Muchocine. The achieved results reveal that MSA-GCN significantly outperformed all baseline models in almost all datasets with a p-value < 0.05 based on student t-test. In addition, such approach shows prominent results in a variety of language combinations, revealing the robustness of the approach against language variation.
25

Fu, Hui. "Gaussian Mixture Modeling of Neighbor Characters for Multilingual Text Extraction in Images." Journal of Computer Research and Development 44, no. 11 (2007): 1920. http://dx.doi.org/10.1360/crad20071115.

26

Perrier, Pascal, and Susanne Fuchs. "Speed‐curvature relations in speech production: a multilingual experimental and modeling study." Journal of the Acoustical Society of America 123, no. 5 (May 2008): 3330. http://dx.doi.org/10.1121/1.2933840.

27

Chauhan, Uttam, and Apurva Shah. "Topic Modeling Using Latent Dirichlet allocation." ACM Computing Surveys 54, no. 7 (September 30, 2022): 1–35. http://dx.doi.org/10.1145/3462478.

Abstract:
We are not able to deal with a mammoth text corpus without summarizing it into a relatively small subset, and a computational tool is sorely needed to understand such a gigantic pool of text. Probabilistic Topic Modeling discovers and explains an enormous collection of documents by reducing them to a topical subspace. In this work, we study the background and advancement of topic modeling techniques. We first introduce the preliminaries of the topic modeling techniques and review their extensions and variations, such as topic modeling over various domains, hierarchical topic modeling, word-embedded topic models, and topic models in multilingual perspectives. Besides, research work on topic modeling in a distributed environment and topic visualization approaches has also been explored. We also cover the implementation and evaluation techniques for topic models in brief. Comparison matrices are presented over the experimental results of the various categories of topic modeling. Diverse technical challenges and future directions are discussed.
28

Kreutzer, Julia, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets." Transactions of the Association for Computational Linguistics 10 (2022): 50–72. http://dx.doi.org/10.1162/tacl_a_00447.

Abstract:
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
29

Majewska, Olga, Evgeniia Razumovskaia, Edoardo M. Ponti, Ivan Vulić, and Anna Korhonen. "Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation." Transactions of the Association for Computational Linguistics 11 (2023): 139–56. http://dx.doi.org/10.1162/tacl_a_00539.

Abstract:
Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, its potential is not fully realized, as current multilingual ToD datasets—both for modular and end-to-end modeling—suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations we propose a novel outline-based annotation process for multilingual ToD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing dialogues by providing instructions about each turn’s intents and slots. Through this process we annotate a new large-scale dataset for evaluation of multilingual and cross-lingual ToD systems. Our Cross-lingual Outline-based Dialogue dataset (cod) enables natural language understanding, dialogue state tracking, and end-to-end dialogue evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of cod versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual ToD, setting reference scores for future work and demonstrating that cod prevents over-inflated performance, typically met with prior translation-based ToD datasets.
APA, Harvard, Vancouver, ISO, and other styles
30

Blake, John, Natalia Bogach, Akemi Kusakari, Iurii Lezhenin, Veronica Khaustova, Son Luu Xuan, Van Nhi Nguyen, et al. "An Open CAPT System for Prosody Practice: Practical Steps towards Multilingual Setup." Languages 9, no. 1 (January 12, 2024): 27. http://dx.doi.org/10.3390/languages9010027.

Full text
Abstract:
This paper discusses the challenges posed in creating a Computer-Assisted Pronunciation Training (CAPT) environment for multiple languages. By selecting one language from each of three different language families, we show that a single environment may be tailored to cater for different target languages. We detail the challenges faced during the development of a multimodal CAPT environment comprising a toolkit that manages mobile applications using speech signal processing, visualization, and estimation algorithms. Since the applied underlying mathematical and phonological models, as well as the feedback production algorithms, are based on sound signal processing and modeling rather than on particular languages, the system is language-agnostic and serves as an open toolkit for developing phrasal intonation training exercises for an open selection of languages. However, it was necessary to tailor the CAPT environment to the language-specific particularities in the multilingual setups, especially the additional requirements for adequate and consistent speech evaluation and feedback production. In our work, we describe our response to the challenges in visualizing and segmenting recorded pitch signals and modeling the language melody and rhythm necessary for such a multilingual adaptation, particularly for tonal syllable-timed and mora-timed languages.
APA, Harvard, Vancouver, ISO, and other styles
31

Amara, Amina, Mohamed Ali Hadj Taieb, and Mohamed Ben Aouicha. "Multilingual topic modeling for tracking COVID-19 trends based on Facebook data analysis." Applied Intelligence 51, no. 5 (February 13, 2021): 3052–73. http://dx.doi.org/10.1007/s10489-020-02033-3.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

KITA, KENJI. "Reconstructing the Language Family Tree from Multilingual Corpus Based on Probabilistic Language Modeling." Journal of Natural Language Processing 4, no. 3 (1997): 71–82. http://dx.doi.org/10.5715/jnlp.4.3_71.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Bouselmi, G., D. Fohr, and I. Illina. "Multilingual recognition of non-native speech using acoustic model transformation and pronunciation modeling." International Journal of Speech Technology 15, no. 2 (March 8, 2012): 203–13. http://dx.doi.org/10.1007/s10772-012-9134-8.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Vulić, Ivan, Wim De Smet, Jie Tang, and Marie-Francine Moens. "Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications." Information Processing & Management 51, no. 1 (January 2015): 111–47. http://dx.doi.org/10.1016/j.ipm.2014.08.003.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Miller, R. A., R. H. Baud, J. R. Scherrer, and A. M. Rassinoux. "Modeling Concepts in Medicine for Medical Language Understanding." Methods of Information in Medicine 37, no. 04/05 (October 1998): 361–72. http://dx.doi.org/10.1055/s-0038-1634561.

Full text
Abstract:
Abstract Over the past two decades, the construction of models for medical concept representation and for understanding the deep meaning of medical narrative texts has been a challenging area of medical informatics research. This review highlights how these two inter-related domains have evolved, emphasizing aspects of medical modeling as a tool for medical language understanding. A representation schema that balances partial but accurate with complete but complex representations of domain-specific knowledge must be developed to facilitate language understanding. Representative examples are drawn from two major independent efforts undertaken by the authors: the elaboration and subsequent adjustment of the RECIT multilingual analyzer to include a robust medical concept model, and the recasting of a frame-based interlingua system, originally developed to map equivalent concepts between controlled clinical vocabularies, to invoke a similar concept model.
APA, Harvard, Vancouver, ISO, and other styles
36

Nagaraja, B. G., and H. S. Jayanna. "Multilingual Speaker Identification by Combining Evidence from LPR and Multitaper MFCC." Journal of Intelligent Systems 22, no. 3 (September 1, 2013): 241–51. http://dx.doi.org/10.1515/jisys-2013-0038.

Full text
Abstract:
Abstract In this work, the significance of combining the evidence from multitaper mel-frequency cepstral coefficients (MFCC), linear prediction residual (LPR), and linear prediction residual phase (LPRP) features for multilingual speaker identification under the constraint of limited data is demonstrated. The LPR is derived from linear prediction analysis, and the LPRP is obtained by dividing the LPR by its Hilbert envelope. The sine-weighted cepstrum estimators (SWCE) with six tapers are considered for multitaper MFCC feature extraction. The Gaussian mixture model–universal background model is used for modeling each speaker for the different types of evidence. The evidence is then combined at the scoring level to improve performance. Monolingual, crosslingual, and multilingual speaker identification studies were conducted using 30 randomly selected speakers from the IITG multivariability speaker recognition database. The experimental results show that the combined evidence improves performance by nearly 8–10% compared with individual evidence.
APA, Harvard, Vancouver, ISO, and other styles
37

Alexandrowicz, Viviana, and Bobbi Hansen. "Addressing Multilingual Learners’ Language Needs Through Engaging Inquiry-Based Science." English Language Teaching 16, no. 10 (September 30, 2023): 73. http://dx.doi.org/10.5539/elt.v16n10p73.

Full text
Abstract:
This article presents an overview of a 3-year series of workshops on teaching and providing multilingual learners (MLs) with access to science by utilizing effective language development strategies. The workshops were delivered to primary teachers in California and included three different modules: (a) Think and Question Like a Scientist, (b) Observe and Record Like a Scientist through Science Notebooking, and (c) Argue Like a Scientist. The activities showcase effective, research-based second language acquisition (SLA) strategies, including providing comprehensible input via paraphrasing, using visual and media resources, gestures, and the student’s native language, and modeling tasks. Additionally, scaffolding academic language through personal dictionaries, sentence frames, and native language support constitutes some of the ideas shared. Detailed descriptions highlight the “how” of addressing the needs of MLs at a variety of proficiency levels.
APA, Harvard, Vancouver, ISO, and other styles
38

Mykhalchuk, Nataliia, Pavlo Levchuk, Ernest Ivashkevych, and Alexander Nabochuk. "Dynamic Models of Multilingualism on the Territory of Western Ukraine." PSYCHOLINGUISTICS 33, no. 2 (February 21, 2023): 114–44. http://dx.doi.org/10.31470/2309-1797-2023-33-2-114-144.

Full text
Abstract:
The purpose of the article is to study lexical units, with the help of which it becomes possible to build up the models of multilingualism, which are dominant among the population on the territory of Western Ukraine. Methods. Theoretical methods – categorical and structurally-functional analysis of the texts, the methods of systematization, modeling, generalization; empirical ones – the analysis of lexical units, the experiment. For the purpose of studying the models of multilingualism we used “The Methodology of studying the models of multilingualism on the territory of Western Ukraine (by the influence of Russian, English and German)” (Mykhalchuk & Ivashkevych, 2022). Results. Dynamic models of multilingualism on the territory of Western Ukraine are: the Model of Balanced Ambilingualism and the Model of Unbalanced or Asymmetric Bilingualism. There are two types of Balanced Ambilingualism: (1) the Model of Ambilingual Balanced Bilingualism. It emphasizes that both language systems are developed to the highest level of perfect mastery of the language as mastering a native one; (2) the Model of Non-Ambilingual Balanced Bilingualism implies that both language systems aren’t at the same level of their development. Unbalanced or Asymmetric Bilingualism is presented by two sub-models: (1) Transitional Bilingualism; (2) Stable Dominant Multilingualism. Conclusions. Any multilingual system is not reduced to the summation of different monolingual systems. Multilingual psycholinguistic systems of the person are open ones. The bilingual’s metalinguistic abilities show a strengthening effect when the person is studying not only the second, but also the third or more languages.
Accumulating such advantages as cognitive variability (mobility), metalinguistic abilities, metapragmatic and sociocultural “awareness”, multilinguals also accumulate some disadvantages: a deficit in the level of language proficiency due to interlanguage interactions; limitations in language acquisition and language efforts.
APA, Harvard, Vancouver, ISO, and other styles
39

Tachbelie, Martha Yifiru, Solomon Teferra Abate, and Tanja Schultz. "Multilingual speech recognition for GlobalPhone languages." Speech Communication 140 (May 2022): 71–86. http://dx.doi.org/10.1016/j.specom.2022.03.006.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Khademi Zahedi, Reza, Naif Alajlan, Hooman Khademi Zahedi, and Timon Rabczuk. "Multilingual Sentiment Mining System to Prognosticate Governance." Computers, Materials & Continua 71, no. 1 (2022): 389–406. http://dx.doi.org/10.32604/cmc.2022.021384.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

K. Alnahdi, Amany. "A Framework for Building a Multilingual Industrial Ontology: Methodology and a Case Study for Building Smartphone English-Arabic Ontology." International journal of Web & Semantic Technology 12, no. 03 (July 31, 2021): 15–21. http://dx.doi.org/10.5121/ijwest.2021.12302.

Full text
Abstract:
As Web 3.0 is blooming, ontologies augment the semantic Web with semi-structured knowledge. Industrial ontologies can help in improving online commercial communication and marketing. In addition, conceptualizing enterprise knowledge can improve information retrieval for industrial applications. Having ontologies combine multiple languages can help in delivering the knowledge to a broad sector of Internet users, and multilingual ontologies can also help in commercial transactions. This research paper provides a framework model for building industrial multilingual ontologies, which includes Corpus Determination, Filtering, Analysis, Ontology Building, and Ontology Evaluation. It also addresses factors to be considered when modeling multilingual ontologies. A case study for building a bilingual English-Arabic ontology for smartphones is presented. The ontology was illustrated using an ontology editor and visualization tool. The built ontology consists of 67 classes and 18 instances presented in both Arabic and English. In addition, applications for using the ontology are presented, along with future research directions for the built industrial ontology.
APA, Harvard, Vancouver, ISO, and other styles
42

Lee, Jaeseong, Dohyeon Lee, and Seung-won Hwang. "Script, Language, and Labels: Overcoming Three Discrepancies for Low-Resource Language Specialization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 11 (June 26, 2023): 13004–13. http://dx.doi.org/10.1609/aaai.v37i11.26528.

Full text
Abstract:
Although multilingual pretrained models (mPLMs) have enabled natural language processing in diverse languages, their limited coverage of 100+ languages leaves 6,500+ languages ‘unseen’. One common approach for an unseen language is specializing the model for it as the target, by performing additional masked language modeling (MLM) with the target language corpus. However, we argue that, due to the discrepancy from multilingual MLM pretraining, such a naive specialization can be suboptimal. Specifically, we pose three discrepancies to overcome. Script and linguistic discrepancies of the target language from the related seen languages hinder positive transfer, for which we propose to maximize representation similarity, unlike existing approaches that maximize overlaps. In addition, the label space for MLM prediction can vary across languages, for which we propose to reinitialize the top layers for more effective adaptation. Experiments over four different language families and three tasks show that our method improves the task performance of unseen languages with statistical significance, while the previous approach fails to.
APA, Harvard, Vancouver, ISO, and other styles
43

Gutiérrez-Fandiño, Asier, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Ksenia Kharitonova, and Zoraida Callejas. "esCorpius-m: A Massive Multilingual Crawling Corpus with a Focus on Spanish." Applied Sciences 13, no. 22 (November 8, 2023): 12155. http://dx.doi.org/10.3390/app132212155.

Full text
Abstract:
In recent years, transformer-based models have played a significant role in advancing language modeling for natural language processing. However, they require substantial amounts of data and there is a shortage of high-quality non-English corpora. Some recent initiatives have introduced multilingual datasets obtained through web crawling. However, there are notable limitations in the results for some languages, including Spanish. These datasets are either smaller compared to other languages or suffer from lower quality due to insufficient cleaning and deduplication. In this paper, we present esCorpius-m, a multilingual corpus extracted from around 1 petabyte of Common Crawl data. It is the most extensive corpus for some languages with such a level of high-quality content extraction, cleanliness, and deduplication. Our data curation process involves an efficient cleaning pipeline and various deduplication methods that maintain the integrity of document and paragraph boundaries. We also ensure compliance with EU regulations by retaining both the source web page URL and the WARC shared origin URL.
APA, Harvard, Vancouver, ISO, and other styles
44

Liu, Xiabi, Hui Fu, and Yunde Jia. "Gaussian mixture modeling and learning of neighboring characters for multilingual text extraction in images." Pattern Recognition 41, no. 2 (February 2008): 484–93. http://dx.doi.org/10.1016/j.patcog.2007.06.004.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Grapin, Scott E., Sharon Dudek, and Okhee Lee. "Justice-Centered STEM Education With Multilingual Learners: Computational Modeling to Address COVID-19 Disparities." Science Scope 46, no. 5 (May 2023): 36–44. http://dx.doi.org/10.1080/19434901.2023.12290258.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Longpre, Shayne, Yi Lu, and Joachim Daiber. "MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering." Transactions of the Association for Computational Linguistics 9 (2021): 1389–406. http://dx.doi.org/10.1162/tacl_a_00433.

Full text
Abstract:
Abstract Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). Answers are based on a heavily curated, language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to date for evaluating question answering. We benchmark a variety of state-of-the-art methods and baselines for generative and extractive question answering, trained on Natural Questions, in zero-shot and translation settings. Results indicate this dataset is challenging even in English, but especially in low-resource languages.
APA, Harvard, Vancouver, ISO, and other styles
47

Lupancu, Viorica-Camelia, and Adrian Iftene. "Multilingual Fine-Grained Named Entity Recognition." Computer Science Journal of Moldova 31, no. 3(93) (December 2023): 321–39. http://dx.doi.org/10.56415/csjm.v31.16.

Full text
Abstract:
The “MultiCoNER II Multilingual Complex Named Entity Recognition” task within the SemEval 2023 competition focuses on identifying complex named entities (NEs), such as the titles of creative works (e.g., songs, books, movies), people with different titles (e.g., politicians, scientists, artists, athletes), different categories of products (e.g., food, drinks, clothing), and so on, in several languages. In the context of SemEval, our team, FII_Better, presented an exploration of a base transformer model’s capabilities regarding the task, focused more specifically on five languages (English, Spanish, Swedish, German, and Italian). We took DistilBERT (a distilled version of BERT) and BERT (Bidirectional Encoder Representations from Transformers) as two examples of basic transformer models, using DistilBERT as a baseline and BERT as the platform for creating an improved model. In this process, we managed to get fair results in the chosen languages: moderate results in the English track (ranked 17th out of 34), while our results in the other tracks (overall third to last) could be further improved in the future.
APA, Harvard, Vancouver, ISO, and other styles
48

Del Río, Miguel, Corey Miller, Ján Profant, Jennifer Drexler-Fox, Quinn Mcnamara, Nishchal Bhandari, Natalie Delworth, et al. "Accents in Speech Recognition through the Lens of a World Englishes Evaluation Set." Research in Language 21, no. 3 (December 28, 2023): 225–44. http://dx.doi.org/10.18778/1731-7533.21.3.02.

Full text
Abstract:
Automatic Speech Recognition (ASR) systems generalize poorly on accented speech, creating bias issues for users and providers. The phonetic and linguistic variability of accents presents challenges for ASR systems in both data collection and modeling strategies. We present two promising approaches to accented speech recognition, custom vocabulary and multilingual modeling, and highlight key challenges in the space. Among these, the lack of a standard benchmark makes research and comparison difficult. We address this with a novel corpus of accented speech: Earnings-22, a 125-file, 119-hour corpus of English-language earnings calls gathered from global companies. We compare commercial models, showing variation in performance when country of origin is taken into consideration, and demonstrate targeted improvements using the methods we introduce.
APA, Harvard, Vancouver, ISO, and other styles
49

Radke, Sarah C., Sara E. Vogel, Jasmine Y. Ma, Christopher Hoadley, and Laura Ascenzi-Moreno. "Emergent Bilingual Middle Schoolers’ Syncretic Reasoning in Statistical Modeling." Teachers College Record: The Voice of Scholarship in Education 124, no. 5 (May 2022): 206–28. http://dx.doi.org/10.1177/01614681221104141.

Full text
Abstract:
Background/Context: Bi/multilingual students’ STEM learning is better supported when educators leverage their language and cultural practices as resources, but STEM subject divisions have been historically constructed based on oppressive, dominant values and exclude the ways of knowing of nondominant groups. Truly promoting equity requires expanding and transforming STEM disciplines. Purpose/Objective/Research Question/Focus of Study: This article contributes to efforts to illuminate emergent bi/multilingual students’ ways of knowing, languaging, and doing in STEM. We follow the development of syncretic literacies in relation to translanguaging practices, asking, How do knowledges and practices from different communities get combined and reorganized by students and teachers in service of new modeling practices? Setting and Participants: We focus on a seventh-grade science classroom, deliberately designed to support syncretic literacies and translanguaging practices, where computer science concepts were infused into the curriculum through modeling activities. The majority of the students in the bilingual program had arrived in the United States at most three years before enrolling, from the Caribbean and Central and South America. Research Design: We analyze one lesson that was part of a larger research–practice partnership focused on teaching computer science through leveraging translanguaging practices and syncretic literacies. The lesson was a modeling and computing activity codesigned by the teacher and two researchers about post–Hurricane María outmigration from Puerto Rico. Analysis used microethnographic methods to trace how students assembled translanguaging, social, and schooled practices to make sense of and construct models. Findings/Results: Findings show how students assembled representational forms from a variety of practices as part of accomplishing and negotiating both designed and emergent goals. 
These included sensemaking, constructing, explaining, justifying, and interpreting both the physical and computational models of migration. Conclusions/Recommendations: Implications support the development of theory and pedagogy that intentionally make space for students to engage in meaning-making through translanguaging and syncretic practices in order to provide new possibilities for lifting up STEM learning that may include, but is not constrained by, disciplinary learning. Additional implications for teacher education and student assessment practices call for reconceptualizing schooling beyond day-to-day curriculum as part of making an ontological shift away from prioritizing math, science, and CS disciplinary and language objectives as defined by and for schooling, and toward celebrating, supporting, and centering students’ diverse, syncretic knowledges and knowledge use.
APA, Harvard, Vancouver, ISO, and other styles
50

Bovi, Claudio Delli, and Roberto Navigli. "Multilingual semantic dictionaries for natural language processing: The case of BabelNet." Encyclopedia with Semantic Computing and Robotic Intelligence 01, no. 01 (March 2017): 1630015. http://dx.doi.org/10.1142/s2425038416300159.

Full text
Abstract:
Accurate semantic modeling lies at the very core of today’s Natural Language Processing (NLP). Getting a handle on the various phenomena that regulate the meaning of linguistic utterances can pave the way for solving many compelling and ambitious tasks in the field, from Machine Translation to Question Answering and Information Retrieval. A complete semantic model of language, however, needs first of all reliable building blocks. In the last two decades, research in lexical semantics (which focuses on the meaning of individual linguistic elements, i.e., words and expressions), has produced increasingly comprehensive and effective machine-readable dictionaries in multiple languages: like humans, NLP systems can now leverage these sources of lexical knowledge to discriminate among various senses of a given lexeme, thereby improving their performances on downstream tasks and applications. In this paper, we focus on the case study of BabelNet, a large multilingual encyclopedic dictionary and semantic network, to describe in detail how such knowledge resources are built, improved and exploited for crucial NLP tasks such as Word Sense Disambiguation, Entity Linking and Semantic Similarity.
APA, Harvard, Vancouver, ISO, and other styles