Relevant bibliographies by topics / Pre-training corpora

Journal articles
Dissertations / Theses
Books
Book chapters
Conference papers

Academic literature on the topic 'Pre-training corpora'

Author: Grafiati

Published: 25 May 2024

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Pre-training corpora.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Journal articles on the topic "Pre-training corpora"

Sun, Yu, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8968–75. http://dx.doi.org/10.1609/aaai.v34i05.6428.

Full text

Abstract:

Recently pre-trained models have achieved state-of-the-art results in various language understanding tasks. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring information, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named ERNIE 2.0 which incrementally builds pre-training tasks and then learn pre-trained models on these constructed tasks via continual multi-task learning. Based on this framework, we construct several tasks and train the ERNIE 2.0 model to capture lexical, syntactic and semantic aspects of information in the training data. Experimental results demonstrate that ERNIE 2.0 model outperforms BERT and XLNet on 16 tasks including English tasks on GLUE benchmarks and several similar tasks in Chinese. The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.

APA, Harvard, Vancouver, ISO, and other styles

Moodaley, Wayne, and Arnesh Telukdarie. "A Conceptual Framework for Subdomain Specific Pre-Training of Large Language Models for Green Claim Detection." European Journal of Sustainable Development 12, no. 4 (October 1, 2023): 319. http://dx.doi.org/10.14207/ejsd.2023.v12n4p319.

Full text

Abstract:

Detection of false or misleading green claims (referred to as “greenwashing”) within company sustainability disclosures is challenging for a number of reasons, which include the textual and qualitative nature, volume, and complexity of such disclosures. In recent years, notable progress made in the fields of artificial intelligence and specifically, large language models (LLMs), has showcased the capacity of these tools to effectively analyse extensive and intricate textual data, including the contents of sustainability disclosures. Transformer-based LLMs, such as Google’s BERT architecture, were trained on general domain text corpora. Subsequent research has shown that further pre-training of such LLMs on specific domains, such as the climate or sustainability domains, may improve performance. However, previous research often uses text corpora that exhibit significant variation across topics and language and which often consist of heterogeneous subdomains. We therefore propose a conceptual framework for further pre-training of transformer based LLMs using text corpora relating to specific sustainability subdomains i.e. subdomain specific pre-training. We do so as a basis for the improved performance of such models in analysing sustainability disclosures. The main contribution is a conceptual framework to advance the use of LLMs for the reliable identification of green claims and ultimately, greenwashing. Keywords: greenwashing, artificial intelligence, sustainability, sustainability reporting, sustainability disclosures.

APA, Harvard, Vancouver, ISO, and other styles

Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. "Multilingual Denoising Pre-training for Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 726–42. http://dx.doi.org/10.1162/tacl_a_00343.

Full text

Abstract:

This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART—a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019 ). mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task- specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training. 1

APA, Harvard, Vancouver, ISO, and other styles

Dean, Roger Thornton, and Marcus Thomas Pearce. "Algorithmically-generated Corpora that use Serial Compositional Principles Can Contribute to the Modeling of Sequential Pitch Structure in Non-tonal Music." Empirical Musicology Review 11, no. 1 (July 8, 2016): 27. http://dx.doi.org/10.18061/emr.v11i1.4900.

Full text

Abstract:

We investigate whether pitch sequences in non-tonal music can be modeled by an information-theoretic approach using algorithmically-generated melodic sequences, made according to 12-tone serial principles, as the training corpus. This is potentially useful, because symbolic corpora of non-tonal music are not readily available. A non-tonal corpus of serially-composed melodies was constructed algorithmically using classic principles of 12-tone music, including prime, inversion, retrograde and retrograde inversion transforms. A similar algorithm generated a tonal melodic corpus of tonal transformations, in each case based on a novel tonal melody and expressed in alternating major keys. A cognitive model of auditory expectation (IDyOM) was used first to analyze the sequential pitch structure of the corpora, in some cases with pre-training on established tonal folk-song corpora (Essen, Schaffrath, 1995). The two algorithmic corpora can be distinguished in terms of their information content, and they were quite different from random corpora and from the folk-song corpus. We then demonstrate that the algorithmic serial corpora can assist modeling of canonical non-tonal compositions by Webern and Schoenberg, and also non-tonal segments of improvisations by skilled musicians. Separately, we developed the process of algorithmic melody composition into a software system (the Serial Collaborator) capable of generating multi-stranded serial keyboard music. Corpora of such keyboard compositions based either on the non-tonal or the tonal melodic corpora were generated and assessed for their information-theoretic modeling properties.

APA, Harvard, Vancouver, ISO, and other styles

Yuan, Sha, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models." AI Open 2 (2021): 65–68. http://dx.doi.org/10.1016/j.aiopen.2021.06.001.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Kreutzer, Julia, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets." Transactions of the Association for Computational Linguistics 10 (2022): 50–72. http://dx.doi.org/10.1162/tacl_a_00447.

Full text

Abstract:

Abstract With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

APA, Harvard, Vancouver, ISO, and other styles

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing 12, no. 2 (April 29, 2023): 01–12. http://dx.doi.org/10.5121/ijnlc.2023.12201.

Full text

Abstract:

The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pretraining + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-training + further pre-training + finetuning”. Further pre-training generally refers to continual learning on task-specific or domain-relevant corpora before being applied to target tasks, which aims at bridging the gap in data distribution between the phases of pre-training and fine-tuning. Our work is based on a Chinese dataset named STORAL-ZH that composes of 4k human-written story-moral pairs. Furthermore, we design a two-step process of domain-adaptive pre-training in the intermediary phase. The first step depends on a newly-collected Chinese dataset of Confucian moral culture. And the second step bases on the Chinese version of a frequently-used commonsense knowledge graph (i.e. ATOMIC) to enrich the backbone model with inferential knowledge besides morality. By comparison with several advanced models including BERTbase, RoBERTa-base and T5-base, experimental results on two understanding tasks demonstrate the effectiveness of our proposed three-phase paradigm.

APA, Harvard, Vancouver, ISO, and other styles

Jiang, Xiaoze, Yaobo Liang, Weizhu Chen, and Nan Duan. "XLM-K: Improving Cross-Lingual Language Model Pre-training with Multilingual Knowledge." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10840–48. http://dx.doi.org/10.1609/aaai.v36i10.21330.

Full text

Abstract:

Cross-lingual pre-training has achieved great successes using monolingual and bilingual plain text corpora. However, most pre-trained models neglect multilingual knowledge, which is language agnostic but comprises abundant cross-lingual structure alignment. In this paper, we propose XLM-K, a cross-lingual language model incorporating multilingual knowledge in pre-training. XLM-K augments existing multilingual pre-training with two knowledge tasks, namely Masked Entity Prediction Task and Object Entailment Task. We evaluate XLM-K on MLQA, NER and XNLI. Experimental results clearly demonstrate significant improvements over existing multilingual language models. The results on MLQA and NER exhibit the superiority of XLM-K in knowledge related tasks. The success in XNLI shows a better cross-lingual transferability obtained in XLM-K. What is more, we provide a detailed probing analysis to confirm the desired knowledge captured in our pre-training regimen. The code is available at https://github.com/microsoft/Unicoder/tree/master/pretraining/xlmk.

APA, Harvard, Vancouver, ISO, and other styles

Kajiwara, Tomoyuki, Biwa Miura, and Yuki Arase. "Monolingual Transfer Learning via Bilingual Translators for Style-Sensitive Paraphrase Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8042–49. http://dx.doi.org/10.1609/aaai.v34i05.6314.

Full text

Abstract:

We tackle the low-resource problem in style transfer by employing transfer learning that utilizes abundantly available raw corpora. Our method consists of two steps: pre-training learns to generate a semantically equivalent sentence with an input assured grammaticality, and fine-tuning learns to add a desired style. Pre-training has two options, auto-encoding and machine translation based methods. Pre-training based on AutoEncoder is a simple way to learn these from a raw corpus. If machine translators are available, the model can learn more diverse paraphrasing via roundtrip translation. After these, fine-tuning achieves high-quality paraphrase generation even in situations where only 1k sentence pairs of the parallel corpus for style transfer is available. Experimental results of formality style transfer indicated the effectiveness of both pre-training methods and the method based on roundtrip translation achieves state-of-the-art performance.

APA, Harvard, Vancouver, ISO, and other styles

Kryeziu, Labehat, and Visar Shehu. "Pre-Training MLM Using Bert for the Albanian Language." SEEU Review 18, no. 1 (June 1, 2023): 52–62. http://dx.doi.org/10.2478/seeur-2023-0035.

Full text

Abstract:

Abstract Knowing that language is often used as a classifier of human intelligence and the development of systems that understand human language remains a challenge all the time (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study, where transformers have a key role. Transformers function based on neural networks and they are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models like ELMo (Matthew, et al., 2018) and BERT (Delvin, et al., 2019) are trained on large corpora of unlabeled text and as a result learning from text representations has achieved good performance on many of the underlying tasks on datasets from different domains. Pre-training in the language model has proven that there has been an improvement in some aspects of natural language processing, based on the paper (Dai & Le, 2015). In present paper, we will pre-train BERT on the task of Masked Language Modeling (MLM) with the Albanian language dataset (alb_dataset) that we have created for this purpose (Kryeziu et al., 2022). We will compare two approaches: training of BERT using the available OSCAR dataset and using our alb_dataset that we have collected. The paper shows some discrepancies during training, especially while evaluating the performance of the model.

APA, Harvard, Vancouver, ISO, and other styles

More sources

Dissertations / Theses on the topic "Pre-training corpora"

Ortiz, Suarez Pedro. "A Data-driven Approach to Natural Language Processing for Contemporary and Historical French." Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS155.

Full text

Abstract:

Depuis plusieurs années, les approches neuronales ont régulièrement amélioré l'état de l'art du traitement automatique des langues (TAL) sur une grande variété de tâches. L'un des principaux facteurs ayant permis ces progrès continus est l'utilisation de techniques d'apprentissage par transfert. Ces méthodes consistent à partir d'un modèle pré-entraîné et à le réutiliser, avec peu ou pas d'entraînement supplémentaire, pour traiter d'autres tâches. Même si ces modèles présentent des avantages évidents, leur principal inconvénient est la quantité de données nécessaire pour les pré-entraîner. Ainsi, le manque de données disponibles à grande échelle a freiné le développement de tels modèles pour le français contemporain et a fortiori pour ses états de langue plus anciens.Cette thèse met l'accent sur le développement de corpus pour le pré-entraînement de telles architectures. Cette approche s'avère extrêmement efficace car nous sommes en mesure d'améliorer l'état de l'art pour un large éventail de tâches de TAL pour le français contemporain et historique, ainsi que pour six autres langues contemporaines. De plus, nous montrons que ces modèles sont extrêmement sensibles à la qualité, à l'hétérogénéité et à l'équilibre des données de pré-entraînement et montrons que ces trois caractéristiques sont de meilleurs prédicteurs de la performance des modèles que la taille des données de pré-entraînement. Nous montrons également que l'importance de la taille des données de pré-entraînement a été surestimée en démontrant à plusieurs reprises que l'on peut pré-entraîner de tels modèles avec des corpus de taille assez modeste
In recent years, neural methods for Natural Language Processing (NLP) have consistently and repeatedly improved the state of the art in a wide variety of NLP tasks. One of the main contributing reasons for this steady improvement is the increased use of transfer learning techniques. These methods consist in taking a pre-trained model and reusing it, with little to no further training, to solve other tasks. Even though these models have clear advantages, their main drawback is the amount of data that is needed to pre-train them. The lack of availability of large-scale data previously hindered the development of such models for contemporary French, and even more so for its historical states.In this thesis, we focus on developing corpora for the pre-training of these transfer learning architectures. This approach proves to be extremely effective, as we are able to establish a new state of the art for a wide range of tasks in NLP for contemporary, medieval and early modern French as well as for six other contemporary languages. Furthermore, we are able to determine, not only that these models are extremely sensitive to pre-training data quality, heterogeneity and balance, but we also show that these three features are better predictors of the pre-trained models' performance in downstream tasks than the pre-training data size itself. In fact, we determine that the importance of the pre-training dataset size was largely overestimated, as we are able to repeatedly show that such models can be pre-trained with corpora of a modest size

APA, Harvard, Vancouver, ISO, and other styles

Books on the topic "Pre-training corpora"

Humphreys, S. C. Kinship in Ancient Athens. Oxford University Press, 2018. http://dx.doi.org/10.1093/oso/9780198788249.001.0001.

Full text

Abstract:

The book covers Athenian kinship from Drakon and Solon to Menander (with some references to later developments). It uses a wide range of sources: epigraphic, literary/forensic, and archaeological. It provides an ethnographic ‘thick description’ of Athenians’ interaction with their kin in all contexts: legal relations (adoption, guardianship, marriage, inheritance, disputes in and out of court); economic interaction (property, economic independence/dependence of sons in relation to fathers); training in specialist skills (doctors, actors, artists), loans, guarantees, etc.; rituals (naming, rites de passage, funerals and commemoration, dedications, cultic associations); war (military commands, organization of land and sea forces); and political contexts, both informal (hetaireiai) and formal (Assembly, Council). Volume II deals with corporate groups recruited by patrifiliation: tribes and trittyes (both pre-Kleisthenic and Kleisthenic), phratries, genê, and demes. The section on the demes stresses variety rather than common features, and provides up-to-date information on location and prosopography.

APA, Harvard, Vancouver, ISO, and other styles

Peters, Thomas A. Library Programs Online. ABC-CLIO, LLC, 2009. http://dx.doi.org/10.5040/9798400679216.

Full text

Abstract:

Meet your library patrons where they increasingly live and work-online. This guide introduces you to the exciting possibilities online programs offer, and shows you how to set up online programs in your library-whether one-time stand-alone or half-day, full-day, or multi-day workshops and conferences. Public programs-from lectures, demonstrations, and interviews to book discussions and story hours can be delivered in real time (live) primarily over the web, utilizing a variety of interactive communication tools, including voice-over-IP, text chatting, and co-browsing. Furthermore, online programming can be used for district-wide staff training. The author explains how to integrate pre-recorded components of a program into a live, online public program; shows how to extend the reach and appeal of online public programs with podcasting and audiorecordings; and explains how to use voice-over-IP and video-over-IP to enhance online programs. In addition to outlining the costs of staring and operating a public online program, Peters also provides cost recovery methods and scenarios. Online public programs can extend your library's reach into the service population, grab the attention of some early adopters and opinion leaders in the community you serve, and convey to patrons and other libraries that your library is moving boldly into the digital future. Plus, many people are more likely to attend an online library program than an in-library public program. And because online programs are easily recorded and redistributed on demand, your library gets more bang for each buck it invests in its public programming outreach. Distance education programs in higher education, corporate and governmental training efforts, and other sectors of society have become commonplace, but this is the first guide to focus on how libraries (public, academic, school, and special) and library-related organizations (associations, consortia, etc.) can and are developing exciting online programs for library users and librarians.

APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Pre-training corpora"

Mahamoud, Ibrahim Souleiman, Mickaël Coustaty, Aurélie Joseph, Vincent Poulain d’Andecy, and Jean-Marc Ogier. "KAP: Pre-training Transformers for Corporate Documents Understanding." In Document Analysis and Recognition – ICDAR 2023 Workshops, 65–79. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-41501-2_5.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Siva Raju, S., and Khushboo Ahire. "Enhancing the Quality of Pre-school Education Through Training of Anganwadi Workers: A CSR Initiative." In Corporate Social Responsibility in India, 81–95. Singapore: Springer Singapore, 2017. http://dx.doi.org/10.1007/978-981-10-3902-7_5.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Stevens, Meg, Georgina Kennedy, and Timothy Churches. "Applying and Improving a Publicly Available Medication NER Pipeline in a Clinical Cancer EMR." In Studies in Health Technology and Informatics. IOS Press, 2024. http://dx.doi.org/10.3233/shti231051.

Full text

Abstract:

Clinical NLP can be applied to extract medication information from free-text notes in EMRs, using NER pipelines. Publicly available annotated data for clinical NLP are scarce, and research annotation budgets are often low. Fine-tuning pre-trained pipelines containing a Transformer layer can produce quality results with relatively small training corpora. We examine the transferability of a publicly available, pre-trained NER pipeline with a Transformer layer for medication targets. The pipeline performs poorly when directly validated but achieves an F1-score of 92% for drug names after fine-tuning with 1,565 annotated samples from a clinical cancer EMR – highlighting the benefits of the Transformer architecture in this setting. Performance was largely influenced by inconsistent annotation – reinforcing the need for innovative annotation processes in clinical NLP applications.

APA, Harvard, Vancouver, ISO, and other styles

Jiang, Eric P. "Automatic Text Classification from Labeled and Unlabeled Data." In Intelligent Data Analysis for Real-Life Applications, 249–64. IGI Global, 2012. http://dx.doi.org/10.4018/978-1-4666-1806-0.ch013.

Full text

Abstract:

Automatic text classification is a process that applies information retrieval technology and machine learning algorithms to build models from pre-labeled training samples and then deploys the models to previously unseen documents for classification. Text classification has been widely applied in many fields ranging from Web page indexing, document filtering, and information security, to business intelligence mining. This chapter presents a semi-supervised text classification framework that is based on the radial basis function (RBF) neural networks. The framework integrates an Expectation Maximization (EM) process into a RBF network and can learn for classification effectively from a very small quantity of labeled training samples and a large pool of additional unlabeled documents. The effectiveness of the framework is demonstrated and confirmed by some experiments of the framework on two popular text classification corpora.

APA, Harvard, Vancouver, ISO, and other styles

Syed, Mahanazuddin, Shaymaa Al-Shukri, Shorabuddin Syed, Kevin Sexton, Melody L. Greer, Meredith Zozus, Sudeepa Bhattacharyya, and Fred Prior. "DeIDNER Corpus: Annotation of Clinical Discharge Summary Notes for Named Entity Recognition Using BRAT Tool." In Studies in Health Technology and Informatics. IOS Press, 2021. http://dx.doi.org/10.3233/shti210195.

Full text

Abstract:

Named Entity Recognition (NER) aims to identify and classify entities into predefined categories is a critical pre-processing task in Natural Language Processing (NLP) pipeline. Readily available off-the-shelf NER algorithms or programs are trained on a general corpus and often need to be retrained when applied on a different domain. The end model’s performance depends on the quality of named entities generated by these NER models used in the NLP task. To improve NER model accuracy, researchers build domain-specific corpora for both model training and evaluation. However, in the clinical domain, there is a dearth of training data because of privacy reasons, forcing many studies to use NER models that are trained in the non-clinical domain to generate NER feature-set. Thus, influencing the performance of the downstream NLP tasks like information extraction and de-identification. In this paper, our objective is to create a high quality annotated clinical corpus for training NER models that can be easily generalizable and can be used in a downstream de-identification task to generate named entities feature-set.

APA, Harvard, Vancouver, ISO, and other styles

Revenko, Artem, Victor Mireles, Anna Breit, Peter Bourgonje, Julian Moreno-Schneider, Maria Khvalchik, and Georg Rehm. "Learning Ontology Classes from Text by Clustering Lexical Substitutes Derived from Language Models1." In Towards a Knowledge-Aware AI. IOS Press, 2022. http://dx.doi.org/10.3233/ssw220018.

Full text

Abstract:

Many tools for knowledge management and the Semantic Web presuppose the existence of an arrangement of instances into classes, i. e. an ontology. Creating such an ontology, however, is a labor-intensive task. We present an unsupervised method to learn an ontology from text. We rely on pre-trained language models to generate lexical substitutes of given entities and then use matrix factorization to induce new classes and their entities. Our method differs from previous approaches in that (1) it captures the polysemy of entities; (2) it produces interpretable labels of the induced classes; (3) it does not require any particular structure of the text; (4) no re-training is required. We evaluate our method on German and English WikiNER corpora and demonstrate the improvements over state of the art approaches.

APA, Harvard, Vancouver, ISO, and other styles

Iyer, Usha. "Introduction." In Dancing Women, 1–26. Oxford University Press, 2020. http://dx.doi.org/10.1093/oso/9780190938734.003.0001.

Full text

Abstract:

The Introduction sets up the primary analytic frameworks of this book, plotting, through the opening example of the spectacular dance number, “Muqabla humse na karo,” issues of labor, collaboration, and technology that film dance activates. Through attention to gesture, movement vocabulary, training, fame, and erasure, this chapter posits the need for a corporeal history of Hindi cinema that is peopled by many laboring bodies. Such a history takes into account acclaimed and invisibilized performers and celebrates a range of dancing women as co-choreographers of female mobility. The Introduction also provides a brief history of dance in pre-playback Hindi film, and a historical account of responses to the cine-corporeal transformations wrought by dance in Indian cinema.

APA, Harvard, Vancouver, ISO, and other styles

Arya, Ali. "Content Description for Face Animation." In Encyclopedia of Information Science and Technology, First Edition, 546–49. IGI Global, 2005. http://dx.doi.org/10.4018/978-1-59140-553-5.ch096.

Full text

Abstract:

Face animation is a challenging area of computer graphics and multimedia systems research (Parke, 1996). Realistic personalized face animation is the basis for virtual software agents that can be used in many applications, including video conferencing, online training and customer service, visual effects in movies, and interactive games. A software agent can play the role of a trainer, a corporate representative, a specific person in an interactive virtual world, and even a virtual actor. Using this technology, movie producers can create new scenes including people who are not physically available. Furthermore, communication systems can represent a caller without any need to transmit high volume multimedia data over limited bandwidth lines. Adding intelligence to these agents makes them ideal for interactive applications such as online games and customer service. In general, the ability to generate new and realistic multimedia data for a specific character is of particular importance in cases where pre-recorded footage is unavailable, difficult, or expensive to generate, or simply too limited due to the interactive nature of the application.

APA, Harvard, Vancouver, ISO, and other styles

Bier, Ada, and Elena Borsetto. "Bisogni e preoccupazioni del corpo docente impegnato in English Medium Instruction (EMI) Una prospettiva italiana post-pandemia." In La linguistica educativa tra ricerca e sperimentazione Scritti in onore di Carmel Mary Coonan. Venice: Fondazione Università Ca’ Foscari, 2023. http://dx.doi.org/10.30687/978-88-6969-683-1/018.

Full text

Abstract:

As a global phenomenon, internationalisation exerts a great impact on Higher Education Institutions (HEIs) all over the world. Among the stakeholders mostly affected are the academic staff. The situation is especially critical in Italy, where teacher training was not a priority in pre-Covid times and where the outbreak and effects of the pandemic have been severe. These circumstances contribute to making Italian university teachers’ needs and concerns in post-pandemic EMI (English Medium Instruction) an area particularly worth exploring. This contribution investigates the case of a middle-sized public University in Northern Italy, where a needs analysis questionnaire was sent to the academic staff to understand how they address the additional challenges that internationalisation poses to teaching, with special regard to the provision of EMI. We also inquired into whether the pandemic has modified the teachers’ styles of teaching, and in which ways.

APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Pre-training corpora"

Vu, Thuy-Trang, Xuanli He, Gholamreza Haffari, and Ehsan Shareghi. "Koala: An Index for Quantifying Overlaps with Pre-training Corpora." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.emnlp-demo.7.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Liu, Zhuang, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. "FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/622.

Full text

Abstract:

There is growing interest in the tasks of financial text mining. Over the past few years, the progress of Natural Language Processing (NLP) based on deep learning advanced rapidly. Significant progress has been made with deep learning showing promising results on financial text mining models. However, as NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining) that is a domain specific language model pre-trained on large-scale financial corpora. In FinBERT, different from BERT, we construct six pre-training tasks covering more knowledge, simultaneously trained on general corpora and financial domain corpora, which can enable FinBERT model better to capture language knowledge and semantic information. The results show that our FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.

APA, Harvard, Vancouver, ISO, and other styles

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Knowledge-Enriched Moral Understanding upon Continual Pre-training." In 10th International Conference on Computer Networks & Communications (CCNET 2023). Academy and Industry Research Collaboration Center (AIRCC), 2023. http://dx.doi.org/10.5121/csit.2023.130414.

Full text

Abstract:

The aim of moral understanding is to comprehend the abstract concepts that hide in a story by seeing through concrete events and vivid characters. To be specific, the story is highly summarized in one sentence without covering any characters in the original story, which requires the machine to behave more intelligently with the abilities of moral perception and commonsense reasoning. The paradigm of “pre-training + fine-tuning” is generally accepted for applying neural language models. In this paper, we suggest adding an intermediate stage to build the flow of “pre-training + continual pre-training + finetuning”. Continual pre-training refers to further training on task-relevant or domainspecific corpora with the aim of bridging the data distribution gap between pre-training and fine-tuning. Experiments are basing on a new moral story dataset, STORAL-ZH, that composes of 4,209 Chinese story-moral pairs. We collect a moral corpus about Confucius theory to enrich the T5 model with moral knowledge. Furthermore, we leverage a Chinese commonsense knowledge graph to enhance the model with commonsense knowledge. Experimental results demonstrate the effectiveness of our method, compared with several state-of-the-art models including BERT-base, RoBERTa-base and T5-base.

APA, Harvard, Vancouver, ISO, and other styles

Lu, Jinliang, Yu Lu, and Jiajun Zhang. "Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only." In Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA, USA: Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.findings-emnlp.190.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Wang, Xin'ao, Huan Li, Ke Chen, and Lidan Shou. "FedBFPT: An Efficient Federated Learning Framework for Bert Further Pre-training." In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/483.

Full text

Abstract:

This study proposes FEDBFPT (Federated BERT Further Pre-Training), a Federated Learning (FL) framework for further pre-training the BERT language model in specialized domains while addressing privacy concerns. FEDBFPT enables multiple clients to collaboratively train the shallower layers of BERT, which are crucial in the pre-training stage, without the need to share private data. To achieve this, FEDBFPT involves building a local model for each client, progressively training the shallower layers of local models while sampling deeper layers, and aggregating trained parameters on a server to create the final global model. This approach utilizes multiple smaller local models to further pre-train a global model targeted at specific tasks via fine-tuning, resulting in a reduction in resource usage while maintaining model accuracy. Theoretical analysis is conducted to support the efficiency of FEDBFPT, and experiments are conducted on corpora across domains such as medicine, biology, and computer science. Results indicate that FEDBFPT achieves performance levels comparable to traditional FL methods while reducing computation and communication costs by 46.70% and 7.04%, respectively, even approaching the performance of centralized training models. The Source code is released at https://github.com/Hanzhouu/FedBFPT.

APA, Harvard, Vancouver, ISO, and other styles

Qu, Yuanbin, Peihan Liu, Wei Song, Lizhen Liu, and Miaomiao Cheng. "A Text Generation and Prediction System: Pre-training on New Corpora Using BERT and GPT-2." In 2020 IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC). IEEE, 2020. http://dx.doi.org/10.1109/iceiec49280.2020.9152352.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Zan, Daoguang, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. "CERT: Continual Pre-training on Sketches for Library-oriented Code Generation." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/329.

Full text

Abstract:

Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the success of pre-training techniques, large language models are trained on large unlabelled code corpora and perform well in generating code. In this paper, we investigate how to leverage an unlabelled code corpus to train a model for library-oriented code generation. Since it is a common practice for programmers to reuse third-party libraries, in which case the text-code paired data are harder to obtain due to the huge number of libraries. We observe that library-oriented code snippets are more likely to share similar code sketches. Hence, we present CERT with two steps: a sketcher generates the sketch, then a generator fills the details in the sketch. Both the sketcher and generator are continually pre-trained upon a base model using unlabelled data. Also, we carefully craft two benchmarks to evaluate library-oriented code generation named PandasEval and NumpyEval. Experimental results have shown the impressive performance of CERT. For example, it surpasses the base model by an absolute 15.67% improvement in terms of pass@1 on PandasEval. Our work is available at https://github.com/microsoft/PyCodeGPT.

APA, Harvard, Vancouver, ISO, and other styles

Edwards, Aleksandra, Jose Camacho-Collados, Hélène De Ribaupierre, and Alun Preece. "Go Simple and Pre-Train on Domain-Specific Corpora: On the Role of Training Data for Text Classification." In Proceedings of the 28th International Conference on Computational Linguistics. Stroudsburg, PA, USA: International Committee on Computational Linguistics, 2020. http://dx.doi.org/10.18653/v1/2020.coling-main.481.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Full text

APA, Harvard, Vancouver, ISO, and other styles

Florencio, Felipe de A., Matheus S. de Lacerda, Anderson P. Cavalcanti, and Vitor Rolim. "Three-Layer Denoiser: Denoising Parallel Corpora for NMT Systems." In Encontro Nacional de Inteligência Artificial e Computacional. Sociedade Brasileira de Computação - SBC, 2023. http://dx.doi.org/10.5753/eniac.2023.234268.

Full text

Abstract:

In recent years, the field of Machine Translation has witnessed the emergence and growing popularity of Neural Machine Translation (NMT) systems, especially those constructed using transformer architectures. A critical factor in developing an effective NMT model is not just the volume, but also the quality of data. However, removing noise from parallel corpora, which involves the intricacies of two distinct languages, presents a significant challenge. In this paper, we introduce and assess a method for eliminating such noise, known as the Three-layer Denoiser. The first layer of this process, termed textual normalization, involves data cleaning using predetermined rules. The second layer incorporates a text feature extractor and a binary classifier, while the third layer evaluates the quality of sentence pairs using a pre-trained transformer model. Experimental results, obtained from training various NMT models with both clean and raw data, indicate a rise of up to 2.64 BLEU points in the models trained with sentence pairs that were filtered by the Denoiser.

APA, Harvard, Vancouver, ISO, and other styles

We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!

Contents

Academic literature on the topic 'Pre-training corpora'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Journal articles on the topic "Pre-training corpora"

Dissertations / Theses on the topic "Pre-training corpora"

Books on the topic "Pre-training corpora"

Book chapters on the topic "Pre-training corpora"

Conference papers on the topic "Pre-training corpora"