Academic literature on the topic "Pre-training corpora"

Author: Grafiati

Published on 25 May 2024

Contents

Journal articles
Theses
Books
Book chapters
Conference proceedings

Browse topical lists of journal articles, books, theses, conference papers, and other academic sources on the topic "Pre-training corpora".

Journal articles on the topic "Pre-training corpora"

Sun, Yu, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8968–75. http://dx.doi.org/10.1609/aaai.v34i05.6428.

Abstract:

Recently, pre-trained models have achieved state-of-the-art results in various language understanding tasks. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring information, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named ERNIE 2.0 which incrementally builds pre-training tasks and then learns pre-trained models on these constructed tasks via continual multi-task learning. Based on this framework, we construct several tasks and train the ERNIE 2.0 model to capture lexical, syntactic and semantic aspects of information in the training data. Experimental results demonstrate that the ERNIE 2.0 model outperforms BERT and XLNet on 16 tasks including English tasks on GLUE benchmarks and several similar tasks in Chinese. The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.

Moodaley, Wayne, and Arnesh Telukdarie. "A Conceptual Framework for Subdomain Specific Pre-Training of Large Language Models for Green Claim Detection." European Journal of Sustainable Development 12, no. 4 (October 1, 2023): 319. http://dx.doi.org/10.14207/ejsd.2023.v12n4p319.

Abstract:

Detection of false or misleading green claims (referred to as “greenwashing”) within company sustainability disclosures is challenging for a number of reasons, which include the textual and qualitative nature, volume, and complexity of such disclosures. In recent years, notable progress made in the fields of artificial intelligence and specifically, large language models (LLMs), has showcased the capacity of these tools to effectively analyse extensive and intricate textual data, including the contents of sustainability disclosures. Transformer-based LLMs, such as Google’s BERT architecture, were trained on general domain text corpora. Subsequent research has shown that further pre-training of such LLMs on specific domains, such as the climate or sustainability domains, may improve performance. However, previous research often uses text corpora that exhibit significant variation across topics and language and which often consist of heterogeneous subdomains. We therefore propose a conceptual framework for further pre-training of transformer based LLMs using text corpora relating to specific sustainability subdomains i.e. subdomain specific pre-training. We do so as a basis for the improved performance of such models in analysing sustainability disclosures. The main contribution is a conceptual framework to advance the use of LLMs for the reliable identification of green claims and ultimately, greenwashing. Keywords: greenwashing, artificial intelligence, sustainability, sustainability reporting, sustainability disclosures.

Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. "Multilingual Denoising Pre-training for Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 726–42. http://dx.doi.org/10.1162/tacl_a_00343.

Abstract:

This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019). mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.

Dean, Roger Thornton, and Marcus Thomas Pearce. "Algorithmically-generated Corpora that use Serial Compositional Principles Can Contribute to the Modeling of Sequential Pitch Structure in Non-tonal Music." Empirical Musicology Review 11, no. 1 (July 8, 2016): 27. http://dx.doi.org/10.18061/emr.v11i1.4900.

Abstract:

We investigate whether pitch sequences in non-tonal music can be modeled by an information-theoretic approach using algorithmically-generated melodic sequences, made according to 12-tone serial principles, as the training corpus. This is potentially useful, because symbolic corpora of non-tonal music are not readily available. A non-tonal corpus of serially-composed melodies was constructed algorithmically using classic principles of 12-tone music, including prime, inversion, retrograde and retrograde inversion transforms. A similar algorithm generated a tonal melodic corpus of tonal transformations, in each case based on a novel tonal melody and expressed in alternating major keys. A cognitive model of auditory expectation (IDyOM) was used first to analyze the sequential pitch structure of the corpora, in some cases with pre-training on established tonal folk-song corpora (Essen, Schaffrath, 1995). The two algorithmic corpora can be distinguished in terms of their information content, and they were quite different from random corpora and from the folk-song corpus. We then demonstrate that the algorithmic serial corpora can assist modeling of canonical non-tonal compositions by Webern and Schoenberg, and also non-tonal segments of improvisations by skilled musicians. Separately, we developed the process of algorithmic melody composition into a software system (the Serial Collaborator) capable of generating multi-stranded serial keyboard music. Corpora of such keyboard compositions based either on the non-tonal or the tonal melodic corpora were generated and assessed for their information-theoretic modeling properties.

Yuan, Sha, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models." AI Open 2 (2021): 65–68. http://dx.doi.org/10.1016/j.aiopen.2021.06.001.

Kreutzer, Julia, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets." Transactions of the Association for Computational Linguistics 10 (2022): 50–72. http://dx.doi.org/10.1162/tacl_a_00447.

Abstract:

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing 12, no. 2 (April 29, 2023): 01–12. http://dx.doi.org/10.5121/ijnlc.2023.12201.

Abstract:

The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-training + further pre-training + fine-tuning”. Further pre-training generally refers to continual learning on task-specific or domain-relevant corpora before being applied to target tasks, which aims at bridging the gap in data distribution between the phases of pre-training and fine-tuning. Our work is based on a Chinese dataset named STORAL-ZH that is composed of 4k human-written story-moral pairs. Furthermore, we design a two-step process of domain-adaptive pre-training in the intermediary phase. The first step depends on a newly-collected Chinese dataset of Confucian moral culture. The second step is based on the Chinese version of a frequently-used commonsense knowledge graph (i.e. ATOMIC) to enrich the backbone model with inferential knowledge besides morality. By comparison with several advanced models including BERT-base, RoBERTa-base and T5-base, experimental results on two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.

Jiang, Xiaoze, Yaobo Liang, Weizhu Chen, and Nan Duan. "XLM-K: Improving Cross-Lingual Language Model Pre-training with Multilingual Knowledge." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10840–48. http://dx.doi.org/10.1609/aaai.v36i10.21330.

Abstract:

Cross-lingual pre-training has achieved great successes using monolingual and bilingual plain text corpora. However, most pre-trained models neglect multilingual knowledge, which is language agnostic but comprises abundant cross-lingual structure alignment. In this paper, we propose XLM-K, a cross-lingual language model incorporating multilingual knowledge in pre-training. XLM-K augments existing multilingual pre-training with two knowledge tasks, namely Masked Entity Prediction Task and Object Entailment Task. We evaluate XLM-K on MLQA, NER and XNLI. Experimental results clearly demonstrate significant improvements over existing multilingual language models. The results on MLQA and NER exhibit the superiority of XLM-K in knowledge related tasks. The success in XNLI shows a better cross-lingual transferability obtained in XLM-K. What is more, we provide a detailed probing analysis to confirm the desired knowledge captured in our pre-training regimen. The code is available at https://github.com/microsoft/Unicoder/tree/master/pretraining/xlmk.

Kajiwara, Tomoyuki, Biwa Miura, and Yuki Arase. "Monolingual Transfer Learning via Bilingual Translators for Style-Sensitive Paraphrase Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8042–49. http://dx.doi.org/10.1609/aaai.v34i05.6314.

Abstract:

We tackle the low-resource problem in style transfer by employing transfer learning that utilizes abundantly available raw corpora. Our method consists of two steps: pre-training learns to generate a semantically equivalent sentence with an input assured grammaticality, and fine-tuning learns to add a desired style. Pre-training has two options, auto-encoding and machine translation based methods. Pre-training based on AutoEncoder is a simple way to learn these from a raw corpus. If machine translators are available, the model can learn more diverse paraphrasing via roundtrip translation. After these, fine-tuning achieves high-quality paraphrase generation even in situations where only 1k sentence pairs of the parallel corpus for style transfer are available. Experimental results of formality style transfer indicated the effectiveness of both pre-training methods, and the method based on roundtrip translation achieves state-of-the-art performance.

Kryeziu, Labehat, and Visar Shehu. "Pre-Training MLM Using Bert for the Albanian Language." SEEU Review 18, no. 1 (June 1, 2023): 52–62. http://dx.doi.org/10.2478/seeur-2023-0035.

Abstract:

Language is often used as a classifier of human intelligence, and the development of systems that understand human language remains a persistent challenge (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study, where transformers have a key role. Transformers function based on neural networks and they are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models like ELMo (Matthew, et al., 2018) and BERT (Devlin, et al., 2019) are trained on large corpora of unlabeled text and as a result learning from text representations has achieved good performance on many of the underlying tasks on datasets from different domains. Pre-training in the language model has proven that there has been an improvement in some aspects of natural language processing, based on the paper (Dai & Le, 2015). In the present paper, we will pre-train BERT on the task of Masked Language Modeling (MLM) with the Albanian language dataset (alb_dataset) that we have created for this purpose (Kryeziu et al., 2022). We will compare two approaches: training of BERT using the available OSCAR dataset and using our alb_dataset that we have collected. The paper shows some discrepancies during training, especially while evaluating the performance of the model.

Theses on the topic "Pre-training corpora"

Ortiz Suarez, Pedro. "A Data-driven Approach to Natural Language Processing for Contemporary and Historical French." Electronic Thesis or Diss., Sorbonne Université, 2022. http://www.theses.fr/2022SORUS155.

Abstract:

In recent years, neural methods for Natural Language Processing (NLP) have consistently and repeatedly improved the state of the art in a wide variety of NLP tasks. One of the main contributing reasons for this steady improvement is the increased use of transfer learning techniques. These methods consist in taking a pre-trained model and reusing it, with little to no further training, to solve other tasks. Even though these models have clear advantages, their main drawback is the amount of data that is needed to pre-train them. The lack of availability of large-scale data previously hindered the development of such models for contemporary French, and even more so for its historical states. In this thesis, we focus on developing corpora for the pre-training of these transfer learning architectures. This approach proves to be extremely effective, as we are able to establish a new state of the art for a wide range of tasks in NLP for contemporary, medieval and early modern French as well as for six other contemporary languages. Furthermore, we are able to determine, not only that these models are extremely sensitive to pre-training data quality, heterogeneity and balance, but we also show that these three features are better predictors of the pre-trained models' performance in downstream tasks than the pre-training data size itself. In fact, we determine that the importance of the pre-training dataset size was largely overestimated, as we are able to repeatedly show that such models can be pre-trained with corpora of a modest size.

Books on the topic "Pre-training corpora"

Humphreys, S. C. Kinship in Ancient Athens. Oxford University Press, 2018. http://dx.doi.org/10.1093/oso/9780198788249.001.0001.

Abstract:

The book covers Athenian kinship from Drakon and Solon to Menander (with some references to later developments). It uses a wide range of sources: epigraphic, literary/forensic, and archaeological. It provides an ethnographic ‘thick description’ of Athenians’ interaction with their kin in all contexts: legal relations (adoption, guardianship, marriage, inheritance, disputes in and out of court); economic interaction (property, economic independence/dependence of sons in relation to fathers); training in specialist skills (doctors, actors, artists), loans, guarantees, etc.; rituals (naming, rites de passage, funerals and commemoration, dedications, cultic associations); war (military commands, organization of land and sea forces); and political contexts, both informal (hetaireiai) and formal (Assembly, Council). Volume II deals with corporate groups recruited by patrifiliation: tribes and trittyes (both pre-Kleisthenic and Kleisthenic), phratries, genê, and demes. The section on the demes stresses variety rather than common features, and provides up-to-date information on location and prosopography.

Peters, Thomas A. Library Programs Online. ABC-CLIO, LLC, 2009. http://dx.doi.org/10.5040/9798400679216.

Abstract:

Meet your library patrons where they increasingly live and work: online. This guide introduces you to the exciting possibilities online programs offer, and shows you how to set up online programs in your library, whether one-time stand-alone or half-day, full-day, or multi-day workshops and conferences. Public programs, from lectures, demonstrations, and interviews to book discussions and story hours, can be delivered in real time (live) primarily over the web, utilizing a variety of interactive communication tools, including voice-over-IP, text chatting, and co-browsing. Furthermore, online programming can be used for district-wide staff training. The author explains how to integrate pre-recorded components of a program into a live, online public program; shows how to extend the reach and appeal of online public programs with podcasting and audio recordings; and explains how to use voice-over-IP and video-over-IP to enhance online programs. In addition to outlining the costs of starting and operating a public online program, Peters also provides cost recovery methods and scenarios. Online public programs can extend your library's reach into the service population, grab the attention of some early adopters and opinion leaders in the community you serve, and convey to patrons and other libraries that your library is moving boldly into the digital future. Plus, many people are more likely to attend an online library program than an in-library public program. And because online programs are easily recorded and redistributed on demand, your library gets more bang for each buck it invests in its public programming outreach. Distance education programs in higher education, corporate and governmental training efforts, and other sectors of society have become commonplace, but this is the first guide to focus on how libraries (public, academic, school, and special) and library-related organizations (associations, consortia, etc.) can and are developing exciting online programs for library users and librarians.

Book chapters on the topic "Pre-training corpora"

Mahamoud, Ibrahim Souleiman, Mickaël Coustaty, Aurélie Joseph, Vincent Poulain d’Andecy, and Jean-Marc Ogier. "KAP: Pre-training Transformers for Corporate Documents Understanding." In Document Analysis and Recognition – ICDAR 2023 Workshops, 65–79. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-41501-2_5.

Siva Raju, S., and Khushboo Ahire. "Enhancing the Quality of Pre-school Education Through Training of Anganwadi Workers: A CSR Initiative." In Corporate Social Responsibility in India, 81–95. Singapore: Springer Singapore, 2017. http://dx.doi.org/10.1007/978-981-10-3902-7_5.

Stevens, Meg, Georgina Kennedy, and Timothy Churches. "Applying and Improving a Publicly Available Medication NER Pipeline in a Clinical Cancer EMR." In Studies in Health Technology and Informatics. IOS Press, 2024. http://dx.doi.org/10.3233/shti231051.

Abstract:

Clinical NLP can be applied to extract medication information from free-text notes in EMRs, using NER pipelines. Publicly available annotated data for clinical NLP are scarce, and research annotation budgets are often low. Fine-tuning pre-trained pipelines containing a Transformer layer can produce quality results with relatively small training corpora. We examine the transferability of a publicly available, pre-trained NER pipeline with a Transformer layer for medication targets. The pipeline performs poorly when directly validated but achieves an F1-score of 92% for drug names after fine-tuning with 1,565 annotated samples from a clinical cancer EMR – highlighting the benefits of the Transformer architecture in this setting. Performance was largely influenced by inconsistent annotation – reinforcing the need for innovative annotation processes in clinical NLP applications.

Jiang, Eric P. "Automatic Text Classification from Labeled and Unlabeled Data." In Intelligent Data Analysis for Real-Life Applications, 249–64. IGI Global, 2012. http://dx.doi.org/10.4018/978-1-4666-1806-0.ch013.

Abstract:

Automatic text classification is a process that applies information retrieval technology and machine learning algorithms to build models from pre-labeled training samples and then deploys the models to previously unseen documents for classification. Text classification has been widely applied in many fields ranging from Web page indexing, document filtering, and information security, to business intelligence mining. This chapter presents a semi-supervised text classification framework that is based on radial basis function (RBF) neural networks. The framework integrates an Expectation Maximization (EM) process into an RBF network and can learn for classification effectively from a very small quantity of labeled training samples and a large pool of additional unlabeled documents. The effectiveness of the framework is demonstrated and confirmed by experiments on two popular text classification corpora.

Syed, Mahanazuddin, Shaymaa Al-Shukri, Shorabuddin Syed, Kevin Sexton, Melody L. Greer, Meredith Zozus, Sudeepa Bhattacharyya, and Fred Prior. "DeIDNER Corpus: Annotation of Clinical Discharge Summary Notes for Named Entity Recognition Using BRAT Tool." In Studies in Health Technology and Informatics. IOS Press, 2021. http://dx.doi.org/10.3233/shti210195.

Abstract:

Named Entity Recognition (NER), which aims to identify and classify entities into predefined categories, is a critical pre-processing task in the Natural Language Processing (NLP) pipeline. Readily available off-the-shelf NER algorithms or programs are trained on a general corpus and often need to be retrained when applied to a different domain. The end model’s performance depends on the quality of named entities generated by these NER models used in the NLP task. To improve NER model accuracy, researchers build domain-specific corpora for both model training and evaluation. However, in the clinical domain, there is a dearth of training data because of privacy reasons, forcing many studies to use NER models that are trained in the non-clinical domain to generate the NER feature-set, thus influencing the performance of downstream NLP tasks like information extraction and de-identification. In this paper, our objective is to create a high quality annotated clinical corpus for training NER models that can be easily generalizable and can be used in a downstream de-identification task to generate a named entities feature-set.

Revenko, Artem, Victor Mireles, Anna Breit, Peter Bourgonje, Julian Moreno-Schneider, Maria Khvalchik, and Georg Rehm. "Learning Ontology Classes from Text by Clustering Lexical Substitutes Derived from Language Models." In Towards a Knowledge-Aware AI. IOS Press, 2022. http://dx.doi.org/10.3233/ssw220018.

Abstract:

Many tools for knowledge management and the Semantic Web presuppose the existence of an arrangement of instances into classes, i.e. an ontology. Creating such an ontology, however, is a labor-intensive task. We present an unsupervised method to learn an ontology from text. We rely on pre-trained language models to generate lexical substitutes of given entities and then use matrix factorization to induce new classes and their entities. Our method differs from previous approaches in that (1) it captures the polysemy of entities; (2) it produces interpretable labels of the induced classes; (3) it does not require any particular structure of the text; (4) no re-training is required. We evaluate our method on German and English WikiNER corpora and demonstrate the improvements over state-of-the-art approaches.

Iyer, Usha. "Introduction." In Dancing Women, 1–26. Oxford University Press, 2020. http://dx.doi.org/10.1093/oso/9780190938734.003.0001.

Abstract:

The Introduction sets up the primary analytic frameworks of this book, plotting, through the opening example of the spectacular dance number, “Muqabla humse na karo,” issues of labor, collaboration, and technology that film dance activates. Through attention to gesture, movement vocabulary, training, fame, and erasure, this chapter posits the need for a corporeal history of Hindi cinema that is peopled by many laboring bodies. Such a history takes into account acclaimed and invisibilized performers and celebrates a range of dancing women as co-choreographers of female mobility. The Introduction also provides a brief history of dance in pre-playback Hindi film, and a historical account of responses to the cine-corporeal transformations wrought by dance in Indian cinema.

Arya, Ali. "Content Description for Face Animation." In Encyclopedia of Information Science and Technology, First Edition, 546–49. IGI Global, 2005. http://dx.doi.org/10.4018/978-1-59140-553-5.ch096.

Abstract:

Face animation is a challenging area of computer graphics and multimedia systems research (Parke, 1996). Realistic personalized face animation is the basis for virtual software agents that can be used in many applications, including video conferencing, online training and customer service, visual effects in movies, and interactive games. A software agent can play the role of a trainer, a corporate representative, a specific person in an interactive virtual world, and even a virtual actor. Using this technology, movie producers can create new scenes including people who are not physically available. Furthermore, communication systems can represent a caller without any need to transmit high volume multimedia data over limited bandwidth lines. Adding intelligence to these agents makes them ideal for interactive applications such as online games and customer service. In general, the ability to generate new and realistic multimedia data for a specific character is of particular importance in cases where pre-recorded footage is unavailable, difficult, or expensive to generate, or simply too limited due to the interactive nature of the application.

Bier, Ada, and Elena Borsetto. "Bisogni e preoccupazioni del corpo docente impegnato in English Medium Instruction (EMI) Una prospettiva italiana post-pandemia." In La linguistica educativa tra ricerca e sperimentazione Scritti in onore di Carmel Mary Coonan. Venice: Fondazione Università Ca’ Foscari, 2023. http://dx.doi.org/10.30687/978-88-6969-683-1/018.

Abstract:

As a global phenomenon, internationalisation exerts a great impact on Higher Education Institutions (HEIs) all over the world. Among the stakeholders most affected are the academic staff. The situation is especially critical in Italy, where teacher training was not a priority in pre-Covid times and where the outbreak and effects of the pandemic have been severe. These circumstances make Italian university teachers’ needs and concerns in post-pandemic EMI (English Medium Instruction) an area particularly worth exploring. This contribution investigates the case of a middle-sized public university in Northern Italy, where a needs analysis questionnaire was sent to the academic staff to understand how they address the additional challenges that internationalisation poses to teaching, with special regard to the provision of EMI. We also inquired whether, and in which ways, the pandemic has modified the teachers’ styles of teaching.

Conference proceedings on the topic "Pre-training corpora"

Vu, Thuy-Trang, Xuanli He, Gholamreza Haffari, and Ehsan Shareghi. “Koala: An Index for Quantifying Overlaps with Pre-training Corpora.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.emnlp-demo.7.

Liu, Zhuang, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. “FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining.” In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI-20). California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/622.

Abstract:

There is growing interest in financial text mining. Over the past few years, Natural Language Processing (NLP) based on deep learning has advanced rapidly, and deep learning has shown promising results on financial text mining models. However, because NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. Unlike BERT, FinBERT uses six pre-training tasks covering more knowledge, trained simultaneously on general corpora and financial domain corpora, which enables the model to better capture language knowledge and semantic information. The results show that our FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. “Knowledge-Enriched Moral Understanding upon Continual Pre-training.” In 10th International Conference on Computer Networks & Communications (CCNET 2023). Academy and Industry Research Collaboration Center (AIRCC), 2023. http://dx.doi.org/10.5121/csit.2023.130414.

Abstract:

The aim of moral understanding is to comprehend the abstract concepts hidden in a story by seeing through concrete events and vivid characters. Specifically, the story is summarized in one sentence that mentions none of the characters in the original story, which requires the machine to behave more intelligently, with the abilities of moral perception and commonsense reasoning. The paradigm of “pre-training + fine-tuning” is generally accepted for applying neural language models. In this paper, we suggest adding an intermediate stage to build the flow of “pre-training + continual pre-training + fine-tuning”. Continual pre-training refers to further training on task-relevant or domain-specific corpora with the aim of bridging the data distribution gap between pre-training and fine-tuning. Experiments are based on a new moral story dataset, STORAL-ZH, which consists of 4,209 Chinese story-moral pairs. We collect a moral corpus about Confucian theory to enrich the T5 model with moral knowledge. Furthermore, we leverage a Chinese commonsense knowledge graph to enhance the model with commonsense knowledge. Experimental results demonstrate the effectiveness of our method, compared with several state-of-the-art models including BERT-base, RoBERTa-base and T5-base.

Lu, Jinliang, Yu Lu, and Jiajun Zhang. “Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only.” In Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA, USA: Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.findings-emnlp.190.

Wang, Xin'ao, Huan Li, Ke Chen, and Lidan Shou. “FedBFPT: An Efficient Federated Learning Framework for Bert Further Pre-training.” In Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23). California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/483.

Abstract:

This study proposes FEDBFPT (Federated BERT Further Pre-Training), a Federated Learning (FL) framework for further pre-training the BERT language model in specialized domains while addressing privacy concerns. FEDBFPT enables multiple clients to collaboratively train the shallower layers of BERT, which are crucial in the pre-training stage, without the need to share private data. To achieve this, FEDBFPT involves building a local model for each client, progressively training the shallower layers of local models while sampling deeper layers, and aggregating trained parameters on a server to create the final global model. This approach uses multiple smaller local models to further pre-train a global model targeted at specific tasks via fine-tuning, reducing resource usage while maintaining model accuracy. Theoretical analysis is conducted to support the efficiency of FEDBFPT, and experiments are conducted on corpora across domains such as medicine, biology, and computer science. Results indicate that FEDBFPT achieves performance levels comparable to traditional FL methods while reducing computation and communication costs by 46.70% and 7.04%, respectively, even approaching the performance of centralized training models. The source code is released at https://github.com/Hanzhouu/FedBFPT.
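
The server-side aggregation step mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain size-weighted averaging of client parameters (in the style of federated averaging), which the abstract does not confirm; the function name `aggregate_shallow_layers`, the parameter names, and the `shallow_prefixes` argument are hypothetical and are not taken from the FedBFPT code.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def aggregate_shallow_layers(
    global_params: Params,
    client_params: Sequence[Params],
    client_sizes: Sequence[int],
    shallow_prefixes: Sequence[str],
) -> Params:
    """Average only the shallow-layer parameters across clients.

    Assumption: size-weighted averaging; the paper's actual rule may differ.
    """
    total = sum(client_sizes)
    # Start from a copy of the global model so deeper layers stay unchanged.
    updated = {name: list(values) for name, values in global_params.items()}
    for name in global_params:
        if not name.startswith(tuple(shallow_prefixes)):
            continue  # deeper layers keep their global values
        for i in range(len(global_params[name])):
            updated[name][i] = sum(
                params[name][i] * size / total
                for params, size in zip(client_params, client_sizes)
            )
    return updated
```

In this sketch, clients holding more data contribute proportionally more to each averaged shallow-layer parameter, and any parameter whose name does not match a shallow prefix is left as it was in the global model.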

Qu, Yuanbin, Peihan Liu, Wei Song, Lizhen Liu, and Miaomiao Cheng. “A Text Generation and Prediction System: Pre-training on New Corpora Using BERT and GPT-2.” In 2020 IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC). IEEE, 2020. http://dx.doi.org/10.1109/iceiec49280.2020.9152352.

Zan, Daoguang, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. “CERT: Continual Pre-training on Sketches for Library-oriented Code Generation.” In Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22). California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/329.

Abstract:

Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the success of pre-training techniques, large language models are trained on large unlabelled code corpora and perform well in generating code. In this paper, we investigate how to leverage an unlabelled code corpus to train a model for library-oriented code generation. Programmers commonly reuse third-party libraries, and in that setting text-code paired data are harder to obtain because of the huge number of libraries. We observe that library-oriented code snippets are more likely to share similar code sketches. Hence, we present CERT with two steps: a sketcher generates the sketch, then a generator fills the details in the sketch. Both the sketcher and generator are continually pre-trained upon a base model using unlabelled data. Also, we carefully craft two benchmarks to evaluate library-oriented code generation named PandasEval and NumpyEval. Experimental results have shown the impressive performance of CERT. For example, it surpasses the base model by an absolute 15.67% improvement in terms of pass@1 on PandasEval. Our work is available at https://github.com/microsoft/PyCodeGPT.
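
To make the notion of a code sketch more concrete, the following is a minimal illustration. It assumes that a sketch is obtained by replacing string and number literals with placeholder tokens; the abstract does not specify the anonymization rules, so the function `make_sketch` and the placeholders `STRING` and `NUMBER` are hypothetical and may differ from what CERT actually does.

```python
import io
import tokenize

# Hypothetical placeholder names; not taken from the CERT paper or code.
PLACEHOLDERS = {tokenize.STRING: "STRING", tokenize.NUMBER: "NUMBER"}


def make_sketch(code: str) -> str:
    """Replace single-line string and number literals with placeholders."""
    lines = code.splitlines(keepends=True)
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        # Multi-line literals are skipped to keep the sketch simple.
        if tok.type in PLACEHOLDERS and tok.start[0] == tok.end[0]:
            row = tok.start[0] - 1
            edits.append((row, tok.start[1], tok.end[1], PLACEHOLDERS[tok.type]))
    # Apply edits right to left so earlier column offsets stay valid.
    for row, start, end, text in sorted(edits, reverse=True):
        lines[row] = lines[row][:start] + text + lines[row][end:]
    return "".join(lines)
```

Two snippets that call the same library functions with different file names or constants map to the same sketch under this scheme, which is the kind of shared structure the abstract refers to.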

Edwards, Aleksandra, Jose Camacho-Collados, Hélène De Ribaupierre, and Alun Preece. “Go Simple and Pre-Train on Domain-Specific Corpora: On the Role of Training Data for Text Classification.” In Proceedings of the 28th International Conference on Computational Linguistics. Stroudsburg, PA, USA: International Committee on Computational Linguistics, 2020. http://dx.doi.org/10.18653/v1/2020.coling-main.481.

Florencio, Felipe de A., Matheus S. de Lacerda, Anderson P. Cavalcanti, and Vitor Rolim. “Three-Layer Denoiser: Denoising Parallel Corpora for NMT Systems.” In Encontro Nacional de Inteligência Artificial e Computacional. Sociedade Brasileira de Computação - SBC, 2023. http://dx.doi.org/10.5753/eniac.2023.234268.

Abstract:

In recent years, the field of Machine Translation has witnessed the emergence and growing popularity of Neural Machine Translation (NMT) systems, especially those constructed using transformer architectures. A critical factor in developing an effective NMT model is not just the volume, but also the quality of data. However, removing noise from parallel corpora, which involves the intricacies of two distinct languages, presents a significant challenge. In this paper, we introduce and assess a method for eliminating such noise, known as the Three-layer Denoiser. The first layer of this process, termed textual normalization, involves data cleaning using predetermined rules. The second layer incorporates a text feature extractor and a binary classifier, while the third layer evaluates the quality of sentence pairs using a pre-trained transformer model. Experimental results, obtained from training various NMT models with both clean and raw data, indicate a rise of up to 2.64 BLEU points in the models trained with sentence pairs that were filtered by the Denoiser.
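
The first layer described in this abstract, rule-based textual normalization, can be sketched as follows. The specific rules here (Unicode NFKC normalization, stripping HTML-like tags, collapsing whitespace, and a length-ratio filter) are assumptions chosen for illustration; the abstract does not list the authors' rules, and the names `normalize`, `keep_pair`, and `max_ratio` are hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply simple cleaning rules to one side of a sentence pair."""
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML-like tags
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Return True if a (source, target) pair passes the rule-based checks."""
    src, tgt = normalize(src), normalize(tgt)
    # Reject empty sides and untranslated copies.
    if not src or not tgt or src == tgt:
        return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Reject pairs whose word counts differ too much.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio
```

A filter of this kind would run before the classifier and transformer-based layers that the abstract describes, so that the later, more expensive stages see only pairs that pass the basic checks.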
