Journal articles on the topic 'Pre-training corpora'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 journal articles for your research on the topic 'Pre-training corpora.'
Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.
Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.
Sun, Yu, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8968–75. http://dx.doi.org/10.1609/aaai.v34i05.6428.
Moodaley, Wayne, and Arnesh Telukdarie. "A Conceptual Framework for Subdomain Specific Pre-Training of Large Language Models for Green Claim Detection." European Journal of Sustainable Development 12, no. 4 (October 1, 2023): 319. http://dx.doi.org/10.14207/ejsd.2023.v12n4p319.
Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. "Multilingual Denoising Pre-training for Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 726–42. http://dx.doi.org/10.1162/tacl_a_00343.
Dean, Roger Thornton, and Marcus Thomas Pearce. "Algorithmically-generated Corpora that use Serial Compositional Principles Can Contribute to the Modeling of Sequential Pitch Structure in Non-tonal Music." Empirical Musicology Review 11, no. 1 (July 8, 2016): 27. http://dx.doi.org/10.18061/emr.v11i1.4900.
Yuan, Sha, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models." AI Open 2 (2021): 65–68. http://dx.doi.org/10.1016/j.aiopen.2021.06.001.
Kreutzer, Julia, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets." Transactions of the Association for Computational Linguistics 10 (2022): 50–72. http://dx.doi.org/10.1162/tacl_a_00447.
Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing 12, no. 2 (April 29, 2023): 1–12. http://dx.doi.org/10.5121/ijnlc.2023.12201.
Jiang, Xiaoze, Yaobo Liang, Weizhu Chen, and Nan Duan. "XLM-K: Improving Cross-Lingual Language Model Pre-training with Multilingual Knowledge." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10840–48. http://dx.doi.org/10.1609/aaai.v36i10.21330.
Kajiwara, Tomoyuki, Biwa Miura, and Yuki Arase. "Monolingual Transfer Learning via Bilingual Translators for Style-Sensitive Paraphrase Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8042–49. http://dx.doi.org/10.1609/aaai.v34i05.6314.
Kryeziu, Labehat, and Visar Shehu. "Pre-Training MLM Using Bert for the Albanian Language." SEEU Review 18, no. 1 (June 1, 2023): 52–62. http://dx.doi.org/10.2478/seeur-2023-0035.
Shi, Peng, Patrick Ng, Zhiguo Wang, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Cicero Nogueira dos Santos, and Bing Xiang. "Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 15 (May 18, 2021): 13806–14. http://dx.doi.org/10.1609/aaai.v35i15.17627.
Alruwaili, Awatif. "An online training course on the use of corpora for teachers in public schools." JALT CALL Journal 19, no. 1 (April 2023): 53–70. http://dx.doi.org/10.29140/jaltcall.v19n1.675.
Luo, Da, Yanglei Gan, Rui Hou, Run Lin, Qiao Liu, Yuxiang Cai, and Wannian Gao. "Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 18742–50. http://dx.doi.org/10.1609/aaai.v38i17.29838.
Li, Zhen, Dan Qu, Chaojie Xie, Wenlin Zhang, and Yanxia Li. "Language Model Pre-training Method in Machine Translation Based on Named Entity Recognition." International Journal on Artificial Intelligence Tools 29, no. 07n08 (November 30, 2020): 2040021. http://dx.doi.org/10.1142/s0218213020400217.
Liu, Peng, Lemei Zhang, and Jon Atle Gulla. "Pre-train, Prompt, and Recommendation: A Comprehensive Survey of Language Modeling Paradigm Adaptations in Recommender Systems." Transactions of the Association for Computational Linguistics 11 (2023): 1553–71. http://dx.doi.org/10.1162/tacl_a_00619.
Maruyama, Takumi, and Kazuhide Yamamoto. "Extremely Low-Resource Text Simplification with Pre-trained Transformer Language Model." International Journal of Asian Language Processing 30, no. 01 (March 2020): 2050001. http://dx.doi.org/10.1142/s2717554520500010.
Zheng, Yinhe, Rongsheng Zhang, Minlie Huang, and Xiaoxi Mao. "A Pre-Training Based Personalized Dialogue Generation Model with Persona-Sparse Data." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 9693–700. http://dx.doi.org/10.1609/aaai.v34i05.6518.
Mao, Zhuoyuan, Chenhui Chu, and Sadao Kurohashi. "Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 4 (July 31, 2022): 1–29. http://dx.doi.org/10.1145/3491065.
Ai, Xi, and Bin Fang. "Empirical Regularization for Synthetic Sentence Pairs in Unsupervised Neural Machine Translation." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 14 (May 18, 2021): 12471–79. http://dx.doi.org/10.1609/aaai.v35i14.17479.
Fromont, Robert, and Kevin Watson. "Factors influencing automatic segmental alignment of sociophonetic corpora." Corpora 11, no. 3 (November 2016): 401–31. http://dx.doi.org/10.3366/cor.2016.0101.
Zhu, Quan, Xiaoyin Wang, Xuan Liu, Wanru Du, and Xingxing Ding. "Multi-task learning for aspect level semantic classification combining complex aspect target semantic enhancement and adaptive local focus." Mathematical Biosciences and Engineering 20, no. 10 (2023): 18566–91. http://dx.doi.org/10.3934/mbe.2023824.
Siddhant, Aditya, Anuj Goyal, and Angeliki Metallinou. "Unsupervised Transfer Learning for Spoken Language Understanding in Intelligent Agents." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4959–66. http://dx.doi.org/10.1609/aaai.v33i01.33014959.
Gao, Yunfan, Yun Xiong, Siqi Wang, and Haofen Wang. "GeoBERT: Pre-Training Geospatial Representation Learning on Point-of-Interest." Applied Sciences 12, no. 24 (December 16, 2022): 12942. http://dx.doi.org/10.3390/app122412942.
Chiang, Cheng-Han, and Hung-yi Lee. "On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10518–25. http://dx.doi.org/10.1609/aaai.v36i10.21295.
Li, Yucheng, Frank Guerin, and Chenghua Lin. "LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 18600–18607. http://dx.doi.org/10.1609/aaai.v38i17.29822.
Karimzadeh, Morteza, and Alan MacEachren. "GeoAnnotator: A Collaborative Semi-Automatic Platform for Constructing Geo-Annotated Text Corpora." ISPRS International Journal of Geo-Information 8, no. 4 (March 27, 2019): 161. http://dx.doi.org/10.3390/ijgi8040161.
Bae, Jae Kwon. "A Study on Application of the Artificial Intelligence-Based Pre-trained Language Model." Academic Society of Global Business Administration 21, no. 2 (April 30, 2024): 64–83. http://dx.doi.org/10.38115/asgba.2024.21.2.64.
Fang, Liuqin, Qing Ma, and Jiahao Yan. "The effectiveness of corpus-based training on collocation use in L2 writing for Chinese senior secondary school students." Journal of China Computer-Assisted Language Learning 1, no. 1 (August 1, 2021): 80–109. http://dx.doi.org/10.1515/jccall-2021-2004.
Kang, Yu, Tianqiao Liu, Hang Li, Yang Hao, and Wenbiao Ding. "Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10875–83. http://dx.doi.org/10.1609/aaai.v36i10.21334.
He, Wanwei, Yinpei Dai, Yinhe Zheng, Yuchuan Wu, Zheng Cao, Dermot Liu, Peng Jiang, et al. "GALAXY: A Generative Pre-trained Model for Task-Oriented Dialog with Semi-supervised Learning and Explicit Policy Injection." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10749–57. http://dx.doi.org/10.1609/aaai.v36i10.21320.
Garrido-Muñoz, Ismael, Arturo Montejo-Ráez, Fernando Martínez-Santiago, and L. Alfonso Ureña-López. "A Survey on Bias in Deep NLP." Applied Sciences 11, no. 7 (April 2, 2021): 3184. http://dx.doi.org/10.3390/app11073184.
Perkowski, Ernest, Rui Pan, Tuan Dung Nguyen, Yuan-Sen Ting, Sandor Kruk, Tong Zhang, Charlie O’Neill, et al. "AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets." Research Notes of the AAS 8, no. 1 (January 8, 2024): 7. http://dx.doi.org/10.3847/2515-5172/ad1abe.
Wang, Ke, Xiutian Zhao, and Wei Peng. "Learning from Failure: Improving Meeting Summarization without Good Samples." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 19153–61. http://dx.doi.org/10.1609/aaai.v38i17.29883.
Pota, Marco, Mirko Ventura, Rosario Catelli, and Massimo Esposito. "An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian." Sensors 21, no. 1 (December 28, 2020): 133. http://dx.doi.org/10.3390/s21010133.
González-Docasal, Ander, and Aitor Álvarez. "Enhancing Voice Cloning Quality through Data Selection and Alignment-Based Metrics." Applied Sciences 13, no. 14 (July 10, 2023): 8049. http://dx.doi.org/10.3390/app13148049.
Vu, Dang Thanh, Gwanghyun Yu, Chilwoo Lee, and Jinyoung Kim. "Text Data Augmentation for the Korean Language." Applied Sciences 12, no. 7 (March 28, 2022): 3425. http://dx.doi.org/10.3390/app12073425.
Qi, Kunxun, and Jianfeng Du. "Translation-Based Matching Adversarial Network for Cross-Lingual Natural Language Inference." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8632–39. http://dx.doi.org/10.1609/aaai.v34i05.6387.
Brenes, Jose A., Javier Ferrández-Pastor, José M. Cámara-Zapata, and Gabriela Marín-Raventós. "Use of Hough Transform and Homography for the Creation of Image Corpora for Smart Agriculture." International Journal on Cybernetics & Informatics 12, no. 6 (October 7, 2023): 9–19. http://dx.doi.org/10.5121/ijci.2023.120602.
Yang, Tiancheng, Ilia Sucholutsky, Kuang-Yu Jen, and Matthias Schonlau. "exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies." PeerJ Computer Science 10 (February 28, 2024): e1888. http://dx.doi.org/10.7717/peerj-cs.1888.
Li, Lei, Yongfeng Zhang, and Li Chen. "Personalized Prompt Learning for Explainable Recommendation." ACM Transactions on Information Systems 41, no. 4 (March 23, 2023): 1–26. http://dx.doi.org/10.1145/3580488.
Panboonyuen, Teerapong, Kulsawasd Jitkajornwanich, Siam Lawawirojwong, Panu Srestasathiern, and Peerapon Vateekul. "Semantic Segmentation on Remotely Sensed Images Using an Enhanced Global Convolutional Network with Channel Attention and Domain Specific Transfer Learning." Remote Sensing 11, no. 1 (January 4, 2019): 83. http://dx.doi.org/10.3390/rs11010083.
Panboonyuen, Teerapong, Kulsawasd Jitkajornwanich, Siam Lawawirojwong, Panu Srestasathiern, and Peerapon Vateekul. "Transformer-Based Decoder Designs for Semantic Segmentation on Remotely Sensed Images." Remote Sensing 13, no. 24 (December 15, 2021): 5100. http://dx.doi.org/10.3390/rs13245100.
Liu, Rui, and Barzan Mozafari. "Transformer with Memory Replay." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 7 (June 28, 2022): 7567–75. http://dx.doi.org/10.1609/aaai.v36i7.20722.
Liu, Weijie, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. "K-BERT: Enabling Language Representation with Knowledge Graph." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 03 (April 3, 2020): 2901–8. http://dx.doi.org/10.1609/aaai.v34i03.5681.
Peng, Baolin, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. "Soloist: Building Task Bots at Scale with Transfer Learning and Machine Teaching." Transactions of the Association for Computational Linguistics 9 (2021): 807–24. http://dx.doi.org/10.1162/tacl_a_00399.
Palagin, O. V., V. Yu Velychko, K. S. Malakhov, and O. S. Shchurov. "Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach." Problems in Programming, no. 2-3 (September 2020): 341–51. http://dx.doi.org/10.15407/pp2020.02-03.341.
Choi, Yong-Seok, Yo-Han Park, Seung Yun, Sang-Hun Kim, and Kong-Joo Lee. "Factors Behind the Effectiveness of an Unsupervised Neural Machine Translation System between Korean and Japanese." Applied Sciences 11, no. 16 (August 21, 2021): 7662. http://dx.doi.org/10.3390/app11167662.
Zayed, Abdelrahman, Prasanna Parthasarathi, Gonçalo Mordido, Hamid Palangi, Samira Shabanian, and Sarath Chandar. "Deep Learning on a Healthy Data Diet: Finding Important Examples for Fairness." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 12 (June 26, 2023): 14593–601. http://dx.doi.org/10.1609/aaai.v37i12.26706.
Keung, Phillip, Julian Salazar, Yichao Lu, and Noah A. Smith. "Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings." Transactions of the Association for Computational Linguistics 8 (December 2020): 828–41. http://dx.doi.org/10.1162/tacl_a_00348.
Laucis, Rolands, and Gints Jēkabsons. "Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora." Applied Computer Systems 26, no. 2 (December 1, 2021): 132–38. http://dx.doi.org/10.2478/acss-2021-0016.