Academic literature on the topic "Pre-training corpora"

Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles


Browse the thematic lists of articles, books, theses, conference proceedings, and other academic sources on the topic "Pre-training corpora".

Next to each source in the reference list there is an "Add to bibliography" button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Pre-training corpora"

1

Sun, Yu, Shuohuan Wang, Yukun Li, et al. "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8968–75. http://dx.doi.org/10.1609/aaai.v34i05.6428.

Abstract
Recently pre-trained models have achieved state-of-the-art results in various language understanding tasks. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring information, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named E
2

Moodaley, Wayne, and Arnesh Telukdarie. "A Conceptual Framework for Subdomain Specific Pre-Training of Large Language Models for Green Claim Detection." European Journal of Sustainable Development 12, no. 4 (2023): 319. http://dx.doi.org/10.14207/ejsd.2023.v12n4p319.

Abstract
Detection of false or misleading green claims (referred to as “greenwashing”) within company sustainability disclosures is challenging for a number of reasons, which include the textual and qualitative nature, volume, and complexity of such disclosures. In recent years, notable progress made in the fields of artificial intelligence and specifically, large language models (LLMs), has showcased the capacity of these tools to effectively analyse extensive and intricate textual data, including the contents of sustainability disclosures. Transformer-based LLMs, such as Google’s BERT architecture, w
3

Hussain, Rida Ghafoor. "RiskBERT: A Pre-Trained Insurance-Based Language Model for Text Classification." International Journal of Innovative Technology and Exploring Engineering 14, no. 7 (2025): 12–18. https://doi.org/10.35940/ijitee.f1097.14070625.

Abstract
The rapid growth of insurance-related documents has increased the need for efficient and accurate text classification techniques. Advances in natural language processing (NLP) and deep learning have enabled the extraction of valuable insights from textual data, particularly in specialised domains such as insurance, legal, and scientific documents. While Bidirectional Encoder Representations from Transformers (BERT) models have demonstrated state-of-the-art performance across various NLP tasks, their application to domain-specific corpora often results in suboptimal accuracy due to linguistic an
4

Liu, Yinhan, Jiatao Gu, Naman Goyal, et al. "Multilingual Denoising Pre-training for Neural Machine Translation." Transactions of the Association for Computational Linguistics 8 (November 2020): 726–42. http://dx.doi.org/10.1162/tacl_a_00343.

Abstract
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART—a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019). mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete mod
5

Dean, Roger Thornton, and Marcus Thomas Pearce. "Algorithmically-generated Corpora that use Serial Compositional Principles Can Contribute to the Modeling of Sequential Pitch Structure in Non-tonal Music." Empirical Musicology Review 11, no. 1 (2016): 27. http://dx.doi.org/10.18061/emr.v11i1.4900.

Abstract
We investigate whether pitch sequences in non-tonal music can be modeled by an information-theoretic approach using algorithmically-generated melodic sequences, made according to 12-tone serial principles, as the training corpus. This is potentially useful, because symbolic corpora of non-tonal music are not readily available. A non-tonal corpus of serially-composed melodies was constructed algorithmically using classic principles of 12-tone music, including prime, inversion, retrograde and retrograde inversion transforms. A similar algorithm generated a tonal melodic corpus of tonal transform
6

Kreutzer, Julia, Isaac Caswell, Lisa Wang, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets." Transactions of the Association for Computational Linguistics 10 (2022): 50–72. http://dx.doi.org/10.1162/tacl_a_00447.

Abstract
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/am
7

Yuan, Sha, Hanyu Zhao, Zhengxiao Du, et al. "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models." AI Open 2 (2021): 65–68. http://dx.doi.org/10.1016/j.aiopen.2021.06.001.

8

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing 12, no. 2 (2023): 01–12. http://dx.doi.org/10.5121/ijnlc.2023.12201.

Abstract
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pretraining + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-training + further pre
9

Jing, Qian, Yue Yong, Atkinson Katie, and Li Gangmin. "Understanding Chinese Moral Stories with Further Pre-Training." International Journal on Natural Language Computing (IJNLC) 12, no. 2 (2023): 12. https://doi.org/10.5281/zenodo.7929155.

Abstract
The goal of moral understanding is to grasp the theoretical concepts embedded in a narrative by delving beyond the concrete occurrences and dynamic personas. Specifically, the narrative is compacted into a single statement without involving any characters within the original text, necessitating a more astute language model that can comprehend connotative morality and exhibit commonsense reasoning. The “pre-training + fine-tuning” paradigm is widely embraced in neural language models. In this paper, we propose an intermediary phase to establish an improved paradigm of “pre-tra
10

Chukhno, Olena, and Nataliia Tuchyna. "OVERCOMING DIFFICULTIES IN USING LINGUISTIC CORPORA FOR TEACHING ENGLISH TO PRE-SERVICE TEACHERS." Education. Innovation. Practice 12, no. 7 (2024): 91–105. http://dx.doi.org/10.31110/2616-650x-vol12i7-014.

Abstract
The rapid pace of technological advancement necessitates that Ukrainian graduates possess advanced digital literacy and critical thinking skills as well as lifelong learning abilities. Within this context, using linguistic corpora can be considered an effective approach which contributes to developing professional communicative competence by engaging students with authentic language data and promoting critical analysis and independent learning. The article addresses the challenges of integrating the direct corpus-based approach into pre-service English language teacher education. These include

Theses on the topic "Pre-training corpora"

1

Ortiz Suarez, Pedro. "A Data-driven Approach to Natural Language Processing for Contemporary and Historical French." Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS155.

Abstract
For several years now, neural approaches have steadily improved the state of the art in natural language processing (NLP) across a wide variety of tasks. One of the main factors behind this continuous progress is the use of transfer learning techniques. These methods consist of starting from a pre-trained model and reusing it, with little or no additional training, to tackle other tasks. Although these models have clear advantages, their main drawback is the amount of data needed to pre-train them.

Books on the topic "Pre-training corpora"

1

Humphreys, S. C. Kinship in Ancient Athens. Oxford University Press, 2018. http://dx.doi.org/10.1093/oso/9780198788249.001.0001.

Abstract
The book covers Athenian kinship from Drakon and Solon to Menander (with some references to later developments). It uses a wide range of sources: epigraphic, literary/forensic, and archaeological. It provides an ethnographic ‘thick description’ of Athenians’ interaction with their kin in all contexts: legal relations (adoption, guardianship, marriage, inheritance, disputes in and out of court); economic interaction (property, economic independence/dependence of sons in relation to fathers); training in specialist skills (doctors, actors, artists), loans, guarantees, etc.; rituals (naming, rite
2

Peters, Thomas A. Library Programs Online. ABC-CLIO, LLC, 2009. http://dx.doi.org/10.5040/9798400679216.

Abstract
Meet your library patrons where they increasingly live and work: online. This guide introduces you to the exciting possibilities online programs offer, and shows you how to set up online programs in your library, whether one-time stand-alone or half-day, full-day, or multi-day workshops and conferences. Public programs, from lectures, demonstrations, and interviews to book discussions and story hours, can be delivered in real time (live) primarily over the web, utilizing a variety of interactive communication tools, including voice-over-IP, text chatting, and co-browsing. Furthermore, online progr

Book chapters on the topic "Pre-training corpora"

1

Perełkiewicz, Michał, and Rafał Poświata. "A Review of the Challenges with Massive Web-Mined Corpora Used in Large Language Models Pre-training." In Lecture Notes in Computer Science. Springer Nature Switzerland, 2025. https://doi.org/10.1007/978-3-031-81596-6_14.

2

Nag, Arijit, Bidisha Samanta, Animesh Mukherjee, Niloy Ganguly, and Soumen Chakrabarty. "Effect of Unknown and Fragmented Tokens on the Performance of Multilingual Language Models at Low-Resource Tasks." In Event Analytics across Languages and Communities. Springer Nature Switzerland, 2024. http://dx.doi.org/10.1007/978-3-031-64451-1_5.

Abstract
Multilingual language models (MLLMs) like mBERT promise to extend the benefits of NLP research to low-resource languages (LRLs). However, LRL vocabulary is often seriously under-represented in the wordpiece dictionaries of MLLMs. This leads to many LRL words being replaced by UNK (unknown tokens) or concatenated from morphologically unrelated wordpieces, consequently leading to low task accuracy. Pre-training MLLMs after including LRL documents is extremely resource-intensive in terms of both human inputs and computational resources. In this chapter, we study intuitive strategies to se
3

Mahamoud, Ibrahim Souleiman, Mickaël Coustaty, Aurélie Joseph, Vincent Poulain d’Andecy, and Jean-Marc Ogier. "KAP: Pre-training Transformers for Corporate Documents Understanding." In Document Analysis and Recognition – ICDAR 2023 Workshops. Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-41501-2_5.

4

Siva Raju, S., and Khushboo Ahire. "Enhancing the Quality of Pre-school Education Through Training of Anganwadi Workers: A CSR Initiative." In Corporate Social Responsibility in India. Springer Singapore, 2017. http://dx.doi.org/10.1007/978-981-10-3902-7_5.

5

Naumenko, Maksym, Iryna Hrashchenko, Tetiana Tsalko, Svitlana Nevmerzhytska, Svitlana Krasniuk, and Yurii Kulynych. "Innovative technological modes of data mining and modelling for adaptive project management of food industry competitive enterprises in crisis conditions." In PROJECT MANAGEMENT: INDUSTRY SPECIFICS. TECHNOLOGY CENTER PC, 2024. https://doi.org/10.15587/978-617-8360-03-0.ch2.

Abstract
The scientific and practical applied project solutions developed in this research regarding Data Mining for enterprises and companies (using the food industry as an example) involve the application of advanced cybernetic computing methods/algorithms, technological modes and scenarios (for integration, pre-processing, machine learning, testing and in-depth comprehensive interpretation of the results) for the analysis and analytics of large structured and semi-structured data sets to train high-quality descriptive, predictive and even prescriptive models. The authors' proposed multi-mode adaptive Data Mi
6

Ho, Shaun. "Impacts of Continued Legal Pre-Training and IFT on LLMs’ Latent Representations of Human-Defined Legal Concepts." In Frontiers in Artificial Intelligence and Applications. IOS Press, 2024. https://doi.org/10.3233/faia241259.

Abstract
This paper aims to offer AI & Law researchers and practitioners a more detailed understanding of whether and how continued pre-training and instruction fine-tuning (IFT) of large language models (LLMs) on legal corpora increases their utilization of human-defined legal concepts when developing global contextual representations of input sequences. We compared three models: Mistral 7B, SaulLM-7B-Base (Mistral 7B with continued pre-training on legal corpora), and SaulLM-7B-Instruct (with further IFT). This preliminary assessment examined 7 distinct text sequences from recent AI & Law lite
7

Tufiş, Dan. "Algorithms and Data Design Issues for Basic NLP Tools." In NATO Science for Peace and Security Series - D: Information and Communication Security. IOS Press, 2009. https://doi.org/10.3233/978-1-58603-954-7-3.

Abstract
This chapter presents some of the basic language engineering pre-processing steps (tokenization, part-of-speech tagging, lemmatization, and sentence and word alignment). Tagging is among the most important processing steps and its accuracy significantly influences any further processing. Therefore, tagset design, validation and correction of training data and the various techniques for improving the tagging quality are discussed in detail. Since sentence and word alignment are prerequisite operations for exploiting parallel corpora for a multitude of purposes such as machine translation, bilin
8

Stevens, Meg, Georgina Kennedy, and Timothy Churches. "Applying and Improving a Publicly Available Medication NER Pipeline in a Clinical Cancer EMR." In Studies in Health Technology and Informatics. IOS Press, 2024. http://dx.doi.org/10.3233/shti231051.

Abstract
Clinical NLP can be applied to extract medication information from free-text notes in EMRs, using NER pipelines. Publicly available annotated data for clinical NLP are scarce, and research annotation budgets are often low. Fine-tuning pre-trained pipelines containing a Transformer layer can produce quality results with relatively small training corpora. We examine the transferability of a publicly available, pre-trained NER pipeline with a Transformer layer for medication targets. The pipeline performs poorly when directly validated but achieves an F1-score of 92% for drug names after fine-tun
9

Jiang, Eric P. "Automatic Text Classification from Labeled and Unlabeled Data." In Intelligent Data Analysis for Real-Life Applications. IGI Global, 2012. http://dx.doi.org/10.4018/978-1-4666-1806-0.ch013.

Abstract
Automatic text classification is a process that applies information retrieval technology and machine learning algorithms to build models from pre-labeled training samples and then deploys the models to previously unseen documents for classification. Text classification has been widely applied in many fields ranging from Web page indexing, document filtering, and information security, to business intelligence mining. This chapter presents a semi-supervised text classification framework that is based on the radial basis function (RBF) neural networks. The framework integrates an Expectation Maxi
10

Liu, Ran, Ming Liu, Min Yu, et al. "GLIMMER: Incorporating Graph and Lexical Features in Unsupervised Multi-Document Summarization." In Frontiers in Artificial Intelligence and Applications. IOS Press, 2024. http://dx.doi.org/10.3233/faia240930.

Abstract
Pre-trained language models are increasingly being used in multi-document summarization tasks. However, these models need large-scale corpora for pre-training and are domain-dependent. Other non-neural unsupervised summarization approaches mostly rely on key sentence extraction, which can lead to information loss. To address these challenges, we propose a lightweight yet effective unsupervised approach called GLIMMER: a Graph and LexIcal features based unsupervised Multi-docuMEnt summaRization approach. It first constructs a sentence graph from the source documents, then automatically identifi

Conference proceedings on the topic "Pre-training corpora"

1

Vu, Thuy-Trang, Xuanli He, Gholamreza Haffari, and Ehsan Shareghi. "Koala: An Index for Quantifying Overlaps with Pre-training Corpora." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.emnlp-demo.7.

2

Liu, Zhuang, Degen Huang, Kaiyu Huang, Zhuang Li, and Jun Zhao. "FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI-20). International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/622.

Abstract
There is growing interest in the tasks of financial text mining. Over the past few years, the progress of Natural Language Processing (NLP) based on deep learning advanced rapidly. Significant progress has been made with deep learning showing promising results on financial text mining models. However, as NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining) that is a domain specific la
3

Qian, Jing, Yong Yue, Katie Atkinson, and Gangmin Li. "Knowledge-Enriched Moral Understanding upon Continual Pre-training." In 10th International Conference on Computer Networks & Communications (CCNET 2023). Academy and Industry Research Collaboration Center (AIRCC), 2023. http://dx.doi.org/10.5121/csit.2023.130414.

Abstract
The aim of moral understanding is to comprehend the abstract concepts that hide in a story by seeing through concrete events and vivid characters. To be specific, the story is highly summarized in one sentence without covering any characters in the original story, which requires the machine to behave more intelligently with the abilities of moral perception and commonsense reasoning. The paradigm of “pre-training + fine-tuning” is generally accepted for applying neural language models. In this paper, we suggest adding an intermediate stage to build the flow of “pre-training + continual pre-tra
4

Lu, Jinliang, Yu Lu, and Jiajun Zhang. "Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only." In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023. http://dx.doi.org/10.18653/v1/2023.findings-emnlp.190.

5

Xu, Yipei, Dakuan Lu, Jiaqing Liang, et al. "Source Prompt: Coordinated Pre-training of Language Models on Diverse Corpora from Multiple Sources." In CIKM '24: The 33rd ACM International Conference on Information and Knowledge Management. ACM, 2024. http://dx.doi.org/10.1145/3627673.3679835.

6

Wang, Xin'ao, Huan Li, Ke Chen, and Lidan Shou. "FedBFPT: An Efficient Federated Learning Framework for Bert Further Pre-training." In Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23). International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/483.

Abstract
This study proposes FEDBFPT (Federated BERT Further Pre-Training), a Federated Learning (FL) framework for further pre-training the BERT language model in specialized domains while addressing privacy concerns. FEDBFPT enables multiple clients to collaboratively train the shallower layers of BERT, which are crucial in the pre-training stage, without the need to share private data. To achieve this, FEDBFPT involves building a local model for each client, progressively training the shallower layers of local models while sampling deeper layers, and aggregating trained parameters on a server to cre
7

Qu, Yuanbin, Peihan Liu, Wei Song, Lizhen Liu, and Miaomiao Cheng. "A Text Generation and Prediction System: Pre-training on New Corpora Using BERT and GPT-2." In 2020 IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC). IEEE, 2020. http://dx.doi.org/10.1109/iceiec49280.2020.9152352.

8

Zan, Daoguang, Bei Chen, Dejian Yang, et al. "CERT: Continual Pre-training on Sketches for Library-oriented Code Generation." In Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22). International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/329.

Abstract
Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the success of pre-training techniques, large language models are trained on large unlabelled code corpora and perform well in generating code. In this paper, we investigate how to leverage an unlabelled code corpus to train a model for library-oriented code generation. Since it is a common practice for programmers to reuse third-party libraries, in which case the t
9

Edwards, Aleksandra, Jose Camacho-Collados, Hélène De Ribaupierre, and Alun Preece. "Go Simple and Pre-Train on Domain-Specific Corpora: On the Role of Training Data for Text Classification." In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 2020. http://dx.doi.org/10.18653/v1/2020.coling-main.481.


Reports on the topic "Pre-training corpora"

1

Rosenblat, Sruly, Tim O'Reilly, and Ilan Strauss. Beyond Public Access in LLM Pre-Training Data: Non-public book content in OpenAI’s Models. AI Disclosures Project, Social Science Research Council, 2025. https://doi.org/10.35650/aidp.4111.d.2025.

Abstract
Using a legally obtained dataset of 34 copyrighted O’Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI’s large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content (AUROC = 82%), compared to OpenAI’s earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples. GPT-4o Mini, as a much smaller model, shows no kno
2

Strauss, Ilan, Isobel Moure, Tim O’Reilly, and Sruly Rosenblat. The State of AI Governance Research: AI Safety and Reliability in Real World Commercial Deployment. AI Disclosures Project, Social Science Research Council, 2025. https://doi.org/10.35650/aidp.4112.d.2025.

Abstract
Drawing on 1,178 safety and reliability papers from 9,439 generative AI papers (January 2020 - March 2025), we compare research outputs of leading AI companies (Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI) and AI universities (CMU, MIT, NYU, Stanford, UC Berkeley, and University of Washington). We find that corporate AI research increasingly concentrates on pre-deployment areas — model alignment and testing & evaluation — while attention to deployment-stage issues, such as model bias, has waned, as commercial imperatives and existential risks have come into focus. We fi