
Journal articles on the topic "Generative audio models"

Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic "Generative audio models."

Next to each source in the list of references there is an "Add to bibliography" button. Press this button, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication in PDF format and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Evans, Zach, Scott H. Hawley, and Katherine Crowson. "Musical audio samples generated from joint text embeddings." Journal of the Acoustical Society of America 152, no. 4 (2022): A178. http://dx.doi.org/10.1121/10.0015956.

Abstract
The field of machine learning has benefited from the appearance of diffusion-based generative models for images and audio. While text-to-image models have become increasingly prevalent, text-to-audio generative models are currently an active area of research. We present work on generating short samples of musical instrument sounds generated by a model which was conditioned on text descriptions and the file structure labels of large sample libraries. Preliminary findings indicate that generation of wide-spectrum sounds such as percussion are not difficult, while the generation of harmonic music
2

Kang, Hyunju, Geonhee Han, Yoonjae Jeong, and Hogun Park. "AudioGenX: Explainability on Text-to-Audio Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 17 (2025): 17733–41. https://doi.org/10.1609/aaai.v39i17.33950.

Abstract
Text-to-audio generation models (TAG) have achieved significant advances in generating audio conditioned on text descriptions. However, a critical challenge lies in the lack of transparency regarding how each textual input impacts the generated audio. To address this issue, we introduce AudioGenX, an Explainable AI (XAI) method that provides explanations for text-to-audio generation models by highlighting the importance of input tokens. AudioGenX optimizes an Explainer by leveraging factual and counterfactual objective functions to provide faithful explanations at the audio token level. This m
3

Samson, Grzegorz. "Perspectives on Generative Sound Design: A Generative Soundscapes Showcase." Arts 14, no. 3 (2025): 67. https://doi.org/10.3390/arts14030067.

Abstract
Recent advancements in generative neural networks, particularly transformer-based models, have introduced novel possibilities for sound design. This study explores the use of generative pre-trained transformers (GPT) to create complex, multilayered soundscapes from textual and visual prompts. A custom pipeline is proposed, featuring modules for converting the source input into structured sound descriptions and subsequently generating cohesive auditory outputs. As a complementary solution, a granular synthesizer prototype was developed to enhance the usability of generative audio samples by ena
4

Jeong, Yujin, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee. "Read, Watch and Scream! Sound Generation from Text and Video." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 17 (2025): 17590–98. https://doi.org/10.1609/aaai.v39i17.33934.

Abstract
Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and limits the flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods generate high-quality audio but pose challenges in ensuring comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Especially, our method esti
5

Wang, Heng, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. "V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (2024): 15492–501. http://dx.doi.org/10.1609/aaai.v38i14.29475.

Abstract
Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. Their representative and generative abilities learnt from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without extra training from scratch. However, leveraging FMs in cross-modal generation remains under-researched when audio modality is involved. On the other hand, automatically generating semantically-relevant sound from visual input is an important problem in cross-modal generation studies. To solve this vision-
6

Ji, Wenliang, Ming Jin, and Yixin Chen. "Optimization of Digital Media Content Generation and Communication Effect Combined with Deep Learning Technology." Journal of Combinatorial Mathematics and Combinatorial Computing 127a (April 15, 2025): 1449–66. https://doi.org/10.61091/jcmcc127a-084.

Abstract
The combination of deep learning and digital media technology provides great scope for content creation. The article uses Generative Adversarial Network (GAN) in deep learning for content generation. Based on the three major forms of digital media content (image, audio, and video), image, audio, and video are generated by U-Net_GAN model, MAS-GAN model, and SSFLVGAN model, respectively, to construct a digital media content generation model based on generative adversarial networks. Subsequently, the model is validated for performance and the generated images, audio and video are evaluated for e
7

Sakirin, Tam, and Siddartha Kusuma. "A Survey of Generative Artificial Intelligence Techniques." Babylonian Journal of Artificial Intelligence 2023 (March 10, 2023): 10–14. http://dx.doi.org/10.58496/bjai/2023/003.

Abstract
Generative artificial intelligence (AI) refers to algorithms capable of creating novel, realistic digital content autonomously. Recently, generative models have attained groundbreaking results in domains like image and audio synthesis, spurring vast interest in the field. This paper surveys the landscape of modern techniques powering the rise of creative AI systems. We structurally examine predominant algorithmic approaches including generative adversarial networks (GANs), variational autoencoders (VAEs), and autoregressive models. Architectural innovations and illustrations of generated outpu
8

Broad, Terence, Frederic Fol Leymarie, and Mick Grierson. "Network Bending: Expressive Manipulation of Generative Models in Multiple Domains." Entropy 24, no. 1 (2021): 28. http://dx.doi.org/10.3390/e24010028.

Abstract
This paper presents the network bending framework, a new approach for manipulating and interacting with deep generative models. We present a comprehensive set of deterministic transformations that can be inserted as distinct layers into the computational graph of a trained generative neural network and applied during inference. In addition, we present a novel algorithm for analysing the deep generative model and clustering features based on their spatial activation maps. This allows features to be grouped together based on spatial similarity in an unsupervised fashion. This results in the mean
9

Cao, Yongnian, Xuechun Yang, and Rui Sun. "Generative AI Models Theoretical Foundations and Algorithmic Practices." Journal of Industrial Engineering and Applied Science 3, no. 1 (2025): 1–9. https://doi.org/10.70393/6a69656173.323633.

Abstract
Generative models in AI are an entirely new paradigm for machine learning, allowing computers to create realistic data in all kinds of categories, like text (NLP), images, and even physics simulations. In this paper this formalism is used to guide the theory, algorithms and applications of generative models, with particular focus on a few well established techniques like VAEs, GANs, and diffusion models. It stresses the importance of probabilistic generative modelling and information theory (I.e. KL divergence, ELBO, adversarial optimization, etc.) We cover algorithmic practices such as optimi
10

Aldausari, Nuha, Arcot Sowmya, Nadine Marcus, and Gelareh Mohammadi. "Video Generative Adversarial Networks: A Review." ACM Computing Surveys 55, no. 2 (2023): 1–25. http://dx.doi.org/10.1145/3487891.

Abstract
With the increasing interest in the content creation field in multiple sectors such as media, education, and entertainment, there is an increased trend in the papers that use AI algorithms to generate content such as images, videos, audio, and text. Generative Adversarial Networks (GANs) is one of the promising models that synthesizes data samples that are similar to real data samples. While the variations of GANs models in general have been covered to some extent in several survey papers, to the best of our knowledge, this is the first paper that reviews the state-of-the-art video GANs models
11

Dzwonczyk, Luke, Carmine-Emanuele Cella, and David Ban. "Generating Music Reactive Videos by Applying Network Bending to Stable Diffusion." Journal of the Audio Engineering Society 73, no. 6 (2025): 388–98. https://doi.org/10.17743/jaes.2022.0210.

Abstract
This paper presents the first steps toward the creation of a tool which enables artists to create music visualizations using pretrained, generative, machine learning models. First, the authors investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. A number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools, are identified. The aut
12

Neto, Wilson A. de Oliveira, Elloá B. Guedes, and Carlos Maurício S. Figueiredo. "Anomaly Detection in Sound Activity with Generative Adversarial Network Models." Journal of Internet Services and Applications 15, no. 1 (2024): 313–24. http://dx.doi.org/10.5753/jisa.2024.3897.

Abstract
In state-of-art anomaly detection research, prevailing methodologies predominantly employ Generative Adversarial Networks and Autoencoders for image-based applications. Despite the efficacy demonstrated in the visual domain, there remains a notable dearth of studies showcasing the application of these architectures in anomaly detection within the sound domain. This paper introduces tailored adaptations of cutting-edge architectures for anomaly detection in audio and conducts a comprehensive comparative analysis to substantiate the viability of this novel approach. The evaluation is performed o
13

Shen, Qiwei, Junjie Xu, Jiahao Mei, Xingjiao Wu, and Daoguo Dong. "EmoStyle: Emotion-Aware Semantic Image Manipulation with Audio Guidance." Applied Sciences 14, no. 8 (2024): 3193. http://dx.doi.org/10.3390/app14083193.

Abstract
With the flourishing development of generative models, image manipulation is receiving increasing attention. Rather than text modality, several elegant designs have delved into leveraging audio to manipulate images. However, existing methodologies mainly focus on image generation conditional on semantic alignment, ignoring the vivid affective information depicted in the audio. We propose an Emotion-aware StyleGAN Manipulator (EmoStyle), a framework where affective information from audio can be explicitly extracted and further utilized during image manipulation. Specifically, we first leverage
14

Gupta, Jyoti, Monica Bhutani, Pramod Kumar, et al. "A comprehensive review of recent advances and future prospects of generative AI." Journal of Information and Optimization Sciences 46, no. 1 (2025): 205–11. https://doi.org/10.47974/jios-1864.

Abstract
Generative AI has evolved rapidly and demonstrated accuracy in creating content with diverse yet too realistic styles. This paper provides a complete overview of the field, starting with its core principles and continuing with some recent results and potential future applications. This also covers requirements for new task-specific and data models, including our critical generative model generation (GANs, VAE, and more) in four image audio text videos. The paper emphasizes that generative AI has the potential to transform industries and lists some of these possible applications. It also review
15

Meshram, Sahil. "Genius AI A Unified Platform for Text, Image, Audio, Video, and Code AI." International Journal for Research in Applied Science and Engineering Technology 13, no. 6 (2025): 825–29. https://doi.org/10.22214/ijraset.2025.71461.

Abstract
The rapid evolution of artificial intelligence (AI) has led to the development of specialized models across different modalities such as text, image, video, audio, and program code. This paper presents the design and conceptual framework for a multimodal AI platform that harmoniously brings together multiple AI systems into a single, user-friendly. The proposed platform leverages state-of-the-art AI models, each tailored for a specific modality—Natural Language Processing (NLP) models for text understanding and generation, Computer Vision models for image analysis and synthesis, Generative Vid
16

Assudani, Purshottam J., Balakrishnan P, A. Anny Leema, and Rajesh K Nasare. "Generative AI-Powered Framework for Audio Analysis and Conversational Exploration." Metallurgical and Materials Engineering 31, no. 4 (2025): 206–11. https://doi.org/10.63278/1425.

Abstract
This paper introduces a hybrid deep learning system for complex audio interpretation and post time communication utilizing associated hidden Convolutional Neural Networks (CNNs) with transformer based Large Language Models (LLMs) over spectrogram. The system inputs raw audio input in the form of audio signals, and maps them into spectrograms, extracts high level features using CNNs, and asks for fusion of LLM-produced embeddings with it, for adding semantic understanding, and contextual discussions. The multimodal attention technique helps in crossing the audio-linguistic gap and therefore, it
17

S, Manimala. "GenNarrate: AI-Powered Story Synthesis with Visual and Audio Outputs." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 2352–58. https://doi.org/10.22214/ijraset.2025.70567.

Abstract
The emergence of generative artificial intelligence has redefined the boundaries of digital content creation, particularly in the domain of computational storytelling. This paper presents GenNarrate, a modular, multi-modal generative AI system engineered to synthesize coherent narratives augmented with corresponding visual and auditory elements. The architecture leverages advanced machine learning models, including LLaMA2 for text generation, DALL·E for image synthesis, and a combination of Google Text-to-Speech (GTTS) and AudioLDM for expressive audio narration and sound design. Gen
18

Andreu, Sergi, and Monica Villanueva Aylagas. "Neural Synthesis of Sound Effects Using Flow-Based Deep Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18, no. 1 (2022): 2–9. http://dx.doi.org/10.1609/aiide.v18i1.21941.

Abstract
Creating variations of sound effects for video games is a time-consuming task that grows with the size and complexity of the games themselves. The process usually comprises recording source material and mixing different layers of sound to create sound effects that are perceived as diverse during gameplay. In this work, we present a method to generate controllable variations of sound effects that can be used in the creative process of sound designers. We adopt WaveFlow, a generative flow model that works directly on raw audio and has proven to perform well for speech synthesis. Using a lower-di
19

Lattner, Stefan, and Javier Nistal. "Stochastic Restoration of Heavily Compressed Musical Audio Using Generative Adversarial Networks." Electronics 10, no. 11 (2021): 1349. http://dx.doi.org/10.3390/electronics10111349.

Abstract
Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible in human perception. Under high compression rates, such codecs may introduce a variety of impairments in the audio signal. Many works have tackled the problem of audio enhancement and compression artifact removal using deep-learning techniques. However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In such a scenario, there is no unique solution for the restoration of the original signal. Therefore, in this study, we test a s
20

Thorat, Ms Madhuri. "From Words to Wonders: AI-Generated Multimedia for Poetry Learning." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 3382–94. https://doi.org/10.22214/ijraset.2025.70946.

Abstract
The rise of Generative AI has led to the development of various tools that present new opportunities for businesses and professionals engaged in content creation. The education sector is undergoing a significant transformation in the methods of content development and delivery. AI models and tools facilitate the creation of customized learning materials and effective visuals that enhance and simplify the educational experience. The advent of Large Language Models (LLMs) such as GPT and Text-to-Image models like Stable Diffusion, Flux-Schnell has fundamentally changed and expedited the content
21

Giudici, Gregorio Andrea, Franco Caspe, Leonardo Gabrielli, Stefano Squartini, and Luca Turchet. "Distilling DDSP: Exploring Real-Time Audio Generation on Embedded Systems." Journal of the Audio Engineering Society 73, no. 6 (2025): 331–45. https://doi.org/10.17743/jaes.2022.0211.

Abstract
This paper investigates the feasibility of running neural audio generative models on embedded systems, by comparing the performance of various models and evaluating their trade-offs in audio quality, inference speed, and memory usage. This work focuses on differentiable digital signal processing (DDSP) models, due to their hybrid architecture, which combines the efficiency and interoperability of traditional DSP with the flexibility of neural networks. In addition, the application of knowledge distillation (KD) is explored to improve the performance of smaller models. Two types of distillation
22

G, Ananya. "RAG based Chatbot using LLMs." INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 06 (2024): 1–5. http://dx.doi.org/10.55041/ijsrem35600.

Abstract
Historically, Artificial Intelligence (AI) was used to understand and recommend information. Now, Generative AI can also help us create new content. Generative AI builds on existing technologies, like Large Language Models (LLMs) which are trained on large amounts of text and learn to predict the next word in a sentence. Generative AI can not only create new text, but also images, videos, or audio. This project focuses on the implementation of a chatbot based the concepts of Generative AI and Large Language Models which can answer any query regarding the content provided in the PDFs. The prima
23

Yang, Junpeng, and Haoran Zhang. "Development And Challenges of Generative Artificial Intelligence in Education and Art." Highlights in Science, Engineering and Technology 85 (March 13, 2024): 1334–47. http://dx.doi.org/10.54097/vaeav407.

Abstract
Thanks to the rapid development of generative deep learning models, Artificial Intelligence Generated Content (AIGC) has attracted more and more research attention in recent years, which aims to learn models from massive data to generate relevant content based on input conditions. Different from traditional single-modal generation tasks that focus on content generation for a particular modality, such as image generation, text generation, or semantic generation, AIGC trains a single model that can simultaneously understand language, images, videos, audio, and more. AIGC marks the transition fro
24

Choi, Ha-Yeong, Sang-Hoon Lee, and Seong-Whan Lee. "DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (2024): 17862–70. http://dx.doi.org/10.1609/aaai.v38i16.29740.

Abstract
Diffusion-based generative models have recently exhibited powerful generative performance. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, we introduce decoupled denoising diffusion models (DDDMs) with disentangled representations, which can enable effective style transfers for each attribute in generative models. In particular, we apply DDDMs for voice conversion (VC) tasks,
25

Zhou, Zhenghao, Yongjie Liu, and Chen Cao. "Advancing Audio-Based Text Generation with Imbalance Preference Optimization." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 26120–28. https://doi.org/10.1609/aaai.v39i24.34808.

Abstract
Human feedback in generative systems is a highly active frontier of research that aims to improve the quality of generated content and align it with subjective preferences. Existing efforts predominantly focus on text-only large language models (LLMs) or text-based image generation, while cross-modal generation between audio and text remains largely unexplored. Moreover, there is currently no open-source preference dataset to support the deployment of alignment algorithms in this domain. In this work, we take audio speech translation (AST) and audio captioning (AAC) tasks as examples to explor
26

Singh, Viomesh. "VidTextBot using Generative AI." Journal of Information Systems Engineering and Management 10, no. 18s (2025): 128–32. https://doi.org/10.52783/jisem.v10i18s.2894.

Abstract
Introduction: This research paper presents the design and implementation of a VidTextBot , it is a cutting-edge system that is used to integrate the video-to-text conversion using generative AI for analyzing the video content. The system will allow the users to upload the video or the youtube link. This youtube link or the video is processed to extract the audio, transcribe it into text, and extract subtitles if available. These outputs are stored into the database for smooth future reference and efficient data retrieval. By utilizing advanced NLP models like ChatGPT, the chatBot will help the
27

Gupta, Jyoti, Monica Bhutani, Mahesh Kumar, Aman Dureja, Shyla Singh, and Mohit Dayal. "State-of-the-art review and critical analysis of emerging trends in generative artificial intelligence." Journal of Information and Optimization Sciences 46, no. 5 (2025): 1691–704. https://doi.org/10.47974/jios-1945.

Abstract
Generative AI technology now leads the way as a transformative innovation that produces diverse, realistic content across various content modalities. This research extensively reviews generative AI by examining its core mechanisms, architectural improvements, and current breakthroughs in making images, text, audio, and videos. The paper discusses three main generative models, GANs, VAEs and diffusion models, to explain their distinctive advantages and technical constraints. The document demonstrates industrial uses in different sectors yet focuses on essential issues connected to biased data,
28

Gupta, Chitralekha, Shreyas Sridhar, Denys J. C. Matthies, Christophe Jouffrais, and Suranga Nanayakkara. "SonicVista: Towards Creating Awareness of Distant Scenes through Sonification." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, no. 2 (2024): 1–32. http://dx.doi.org/10.1145/3659609.

Abstract
Spatial awareness, particularly awareness of distant environmental scenes known as vista-space, is crucial and contributes to the cognitive and aesthetic needs of People with Visual Impairments (PVI). In this work, through a formative study with PVIs, we establish the need for vista-space awareness amongst people with visual impairments, and the possible scenarios where this awareness would be helpful. We investigate the potential of existing sonification techniques as well as AI-based audio generative models to design sounds that can create awareness of vista-space scenes. Our first user stud
29

Lin, Hong, Xuan Liu, Chaomurilige Chaomurilige, et al. "LongMergent: Pioneering audio mixing strategies for exquisite music generation." Computer Software and Media Applications 8, no. 1 (2025): 11516. https://doi.org/10.24294/csma11516.

Abstract
Artificial intelligence-empowered music processing is a domain that involves the use of artificial intelligence technologies to enhance music analysis, understanding, and generation. This field encompasses a variety of tasks from music generation to music comprehension. In practical applications, the complexity of interwoven tasks, differences in data representation, scattered distribution of tool resources, and the threshold of professional music knowledge often become barriers that hinder developers from smoothly carrying out generative tasks. Therefore, it is essential to establish a system
30

Yang, Chenyu, Shuai Wang, Hangting Chen, et al. "SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 25597–605. https://doi.org/10.1609/aaai.v39i24.34750.

Abstract
The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models are capable of synthesizing both vocals and accompaniment tracks up to several minutes long concurrently, research about partial adjustments or editing of existing songs is still underexplored, which allows for more flexible and effective production. In this paper, we present SongEditor, the first song editing paradigm that introduces the editing capabilities into language-modeling song generation approaches, facilit
31

Adithya, Suresh, A. Faras, Habeeba K. M. Ummu, Eldho Anu, J. George Asha, and Roy Meckamalil Rotney. "Autism Detection Using Self-Stimulatory Behaviors." Advancement in Image Processing and Pattern Recognition 8, no. 3 (2025): 13–24. https://doi.org/10.5281/zenodo.15516090.

Abstract
This paper introduces a novel video-audio-based model for the early detection of Autism Spectrum Disorder (ASD), focusing on analyzing self-stimulatory behaviors (stimming) such as arm flapping, head banging, and spinning, which are critical diagnostic markers. Traditional diagnostic approaches often depend on subjective clinical observations, leading to inconsistencies, delays, and limited accessibility in diverse settings. The proposed model combines video analysis with audio detection to address these shortcomings, supported by a generative AI-based method to create an audio datase
32

Prudhvi, Y., T. Adinarayana, T. Chandu, S. Musthak, and G. Sireesha. "Vocal Visage: Crafting Lifelike 3D Talking Faces from Static Images and Sound." International Journal of Innovative Research in Computer Science and Technology 11, no. 6 (2023): 13–17. http://dx.doi.org/10.55524/ijircst.2023.11.6.3.

Abstract
In the field of computer graphics and animation, the challenge of generating lifelike and expressive talking face animations has historically necessitated extensive 3D data and complex facial motion capture systems. However, this project presents an innovative approach to tackle this challenge, with the primary goal of producing realistic 3D motion coefficients for stylized talking face animations driven by a single reference image synchronized with audio input. Leveraging state-of-the-art deep learning techniques, including generative models, image-to-image translation networks, and audio pro
33

A M, Vandana Pranavi, and Nagaraj G. Cholli. "Comprehensive Survey On Generative AI, Plethora Of Applications And Impacts." IOSR Journal of Computer Engineering 26, no. 5 (2024): 06–15. http://dx.doi.org/10.9790/0661-2605020615.

Abstract
The primary objective for the AI subfield of "generative artificial intelligence" is to develop systems that may produce new, novel and creative content including text, photos, audio, music, and movies. These models are able to generate fresh content that nearly mimics realistic content created by humans by utilizing deep learning techniques. These models of Gen AI have gained significant importance in research and have plethora of applications in wide varieties of fields. The impact of GenAI is not just on abled but also on disabled communities who are sometimes unnoticed. This survey provide
34

Liang, Kai, and Haijun Zhao. "Application of Generative Adversarial Nets (GANs) in Active Sound Production System of Electric Automobiles." Shock and Vibration 2020 (October 28, 2020): 1–10. http://dx.doi.org/10.1155/2020/8888578.

Abstract
To improve the diversity and quality of sound mimicry of electric automobile engines, a generative adversarial network (GAN) model was used to construct an active sound production model for electric automobiles. The structure of each layer in the network in this model and the size of its convolution kernel were designed. The gradient descent in network training was optimized using the adaptive moment estimation (Adam) algorithm. To demonstrate the quality difference of the generated samples from different input signals, two GAN models with different inputs were constructed. The experimental re
35

Li, Lianghao. "Overview of Multimodal Generative Models in Natural Language Processing and Computer Vision." Journal of Computer Technology and Applied Mathematics 1, no. 4 (2024): 69–78. https://doi.org/10.5281/zenodo.13988327.

Abstract
Multimodal generative models have become essential in the deep learning renaissance, as they provide unparalleled flexibility across a diverse range of applications within Natural Language Processing (NLP) and Computer Vision (CV). In this paper, we systematically review the basic concepts and technical improvements in multimodal generative models by discussing their applications across different modalities such as text, images, audio, and video. These models augment the strength of AI to comprehend and perform complicated tasks by coalescing data from various modalities. In this paper,
36

Agarwal, Pratham. "MedBot: A GenAI-based Chatbot for Healthcare." International Journal of Scientific Research in Engineering and Management 08, no. 06 (2024): 1–5. http://dx.doi.org/10.55041/ijsrem35757.

Abstract
Generative Artificial intelligence (GenAI) is transforming the healthcare industry by providing innovative solutions for patient care and information retrieval. MedBot is an innovative GenAI-driven chatbot designed to improve healthcare services by providing accurate and timely medical information. Utilizing advanced generative AI models, MedBot can respond to text, image, and audio queries, making it a versatile tool for diverse healthcare needs. The chatbot offers functionalities such as document summarization and insight extraction, aiding users in comprehending complex medical data. MedBot
37

Li, Jing, Zhengping Li, Ying Li, and Lijun Wang. "P‐2.12: A Comprehensive Study of Content Generation Using Diffusion Model." SID Symposium Digest of Technical Papers 54, S1 (2023): 522–24. http://dx.doi.org/10.1002/sdtp.16346.

Abstract
The essence of the Metaverse is the process of integrating a large number of existing technologies to virtualize and digitize the real world. With the development of artificial intelligence technology, a large amount of digitally native content in the Metaverse will need to be created by artificial intelligence. Current artificial intelligence technology allows computers to automatically and efficiently generate text, pictures, audio, video, and even 3D models. With the further development of natural language processing technology and generative network models, future artificial intelligence g
38

Cheng, Liehai, Zhenli Zhang, Giuseppe Lacidogna, Xiao Wang, Mutian Jia, and Zhitao Liu. "Sound Sensing: Generative and Discriminant Model-Based Approaches to Bolt Loosening Detection." Sensors 24, no. 19 (2024): 6447. http://dx.doi.org/10.3390/s24196447.

Abstract
The detection of bolt looseness is crucial to ensuring the integrity and safety of bolted connection structures. Percussion-based bolt looseness detection provides a simple and cost-effective approach. However, this method has some inherent shortcomings that limit its application: for example, it depends heavily on the inspector’s hearing and experience and is easily affected by ambient noise. In this article, a whole set of signal processing procedures is proposed and a new kind of damage index vector is constructed to strengthen the reliability and robustness of this method. Firstly, a se
39

Liu, Yunyi, and Craig Jin. "Impact on quality and diversity from integrating a reconstruction loss into neural audio synthesis." Journal of the Acoustical Society of America 154, no. 4_supplement (2023): A99. http://dx.doi.org/10.1121/10.0022922.

Abstract
In digital media or games, sound effects are typically recorded or synthesized. While there are a great many digital synthesis tools, the synthesized audio quality is generally not on par with sound recordings. Nonetheless, sound synthesis techniques provide a popular means to generate new sound variations. In this research, we study sound effects synthesis using generative models that are inspired by the models used for high-quality speech and music synthesis. In particular, we explore the trade-off between synthesis quality and variation. With regard to quality, we integrate a reconstruction
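The quality-versus-variation trade-off this abstract studies comes from weighting a reconstruction term against the generative objective. The sketch below is a generic illustration of that pattern; the L1 magnitude-spectrogram loss and the weighting scheme are assumptions for illustration, not the authors' exact formulation:

```python
# Generic sketch: add a weighted reconstruction term to a generative loss.
# The L1 magnitude loss and lam weighting are illustrative assumptions.
def l1_spectral_loss(pred_mag, target_mag):
    """Mean absolute error between magnitude-spectrogram frames
    (each argument is a list of frames, each frame a list of bins)."""
    n = sum(len(frame) for frame in target_mag)
    return sum(abs(p - t)
               for pf, tf in zip(pred_mag, target_mag)
               for p, t in zip(pf, tf)) / n

def total_loss(gen_loss, pred_mag, target_mag, lam=1.0):
    """Generative objective plus weighted reconstruction term; larger
    lam favors fidelity to the target, smaller lam favors variation."""
    return gen_loss + lam * l1_spectral_loss(pred_mag, target_mag)
```

Sweeping `lam` is one simple way to trace out the quality/diversity frontier the abstract describes.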
40

Cheng, Hsu-Yung, Chia-Cheng Su, Chi-Lun Jiang, and Chih-Chang Yu. "Pose Transfer with Multi-Scale Features Combined with Latent Diffusion Model and ControlNet." Electronics 14, no. 6 (2025): 1179. https://doi.org/10.3390/electronics14061179.

Abstract
In recent years, generative AI has become popular in areas like natural language processing, as well as image and audio processing, significantly expanding AI’s creative capabilities. Particularly in the realm of image generation, diffusion models have achieved remarkable success across various applications, such as image synthesis and transformation. However, traditional diffusion models operate at the pixel level when learning image features, which inevitably demands significant computational resources. To address this issue, this paper proposes a pose transfer model that integrates the late
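The diffusion models this abstract builds on are trained against a fixed forward-noising process, x_t = sqrt(ᾱ_t)·x₀ + sqrt(1 − ᾱ_t)·ε. The sketch below shows that forward step with a linear beta schedule, a common illustrative choice rather than the cited paper's exact configuration:

```python
import math
import random

# Sketch of the forward noising process behind diffusion training:
# x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.
# The linear beta schedule is a common illustrative choice.
def alpha_bars(steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, prod = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def q_sample(x0, t, bars, rng=random):
    """Sample x_t from q(x_t | x_0) for a scalar signal value x0."""
    a = bars[t]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)

bars = alpha_bars()
```

Because ᾱ_t decays toward zero, early timesteps stay close to the clean signal while late timesteps are nearly pure noise — the property the reverse (denoising) model is trained to invert.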
41

Sheikh, Dr Shagufta Mohammad Sayeed. "Empowering Learning: Crafting Educational Podcasts with GEN AI." International Journal for Research in Applied Science and Engineering Technology 13, no. 4 (2025): 4517–28. https://doi.org/10.22214/ijraset.2025.69144.

Abstract
The integration of Generative AI has facilitated innovative tools that are transforming content creation in the educational landscape, ushering in a shift towards more efficient and accessible learning paradigms. In this project, AI is leveraged to automate podcast production, transforming text-based educational content into high-fidelity audio that caters to varied learning needs. Leveraging advanced frameworks like Large Language Models (LLMs) and Text-to-Speech (TTS) technologies, the system streamlines otherwise time-consuming processes like scripting, recording, and editing, thus s
42

B, Yeshitha, Vinitha V, Anubha Mittal, Harshitha Reddy P., and Katiyar Rajani. "Emotion Detection and Voice-Emotion Conversions using Deep Learning." International Journal of Microsystems and IoT 2, no. 3 (2024): 685–91. https://doi.org/10.5281/zenodo.11159090.

Abstract
Emotion, especially through speech, is a powerful tool humans possess that conveys much more information than any text can describe. Using artificial intelligence to tap into this can have a big positive impact on a variety of industries, including audio mining, customer service applications, security and forensics, and more. A growing field of research, spoken emotion recognition has relied heavily on models that employ audio data to create effective classifiers. This paper presents convolutional neural network as a deep learning classification algorithm t
43

He, Yibo, Kah Phooi Seng, and Li Minn Ang. "Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild." Sensors 23, no. 4 (2023): 1834. http://dx.doi.org/10.3390/s23041834.

Abstract
This paper investigates multimodal sensor architectures with deep learning for audio-visual speech recognition, focusing on in-the-wild scenarios. The term “in the wild” is used to describe AVSR for unconstrained natural-language audio streams and video-stream modalities. Audio-visual speech recognition (AVSR) is a speech-recognition task that leverages both an audio input of a human voice and an aligned visual input of lip motions. However, since in-the-wild scenarios can include more noise, AVSR’s performance is affected. Here, we propose new improvements for AVSR models by incorporating dat
44

Xi, Wang, Guillaume Devineau, Fabien Moutarde, and Jie Yang. "Generative Model for Skeletal Human Movements Based on Conditional DC-GAN Applied to Pseudo-Images." Algorithms 13, no. 12 (2020): 319. http://dx.doi.org/10.3390/a13120319.

Abstract
Generative models for images, audio, text, and other low-dimension data have achieved great success in recent years. Generating artificial human movements can also be useful for many applications, including improvement of data augmentation methods for human gesture recognition. The objective of this research is to develop a generative model for skeletal human movement, allowing to control the action type of generated motion while keeping the authenticity of the result and the natural style variability of gesture execution. We propose to use a conditional Deep Convolutional Generative Adversari
45

R, Arun Kumar, Lisa C, Rashmi V R, and Sandhya K. "GENERATIVE ADVERSARIAL NETWORKS (GANs) IN MULTIMODAL AI USING BRIDGING TEXT, IMAGE, AND AUDIO DATA FOR ENHANCED MODEL PERFORMANCE." ICTACT Journal on Soft Computing 15, no. 3 (2025): 3567–77. https://doi.org/10.21917/ijsc.2025.0497.

Abstract
The integration of multimodal data is critical in advancing artificial intelligence models capable of interpreting diverse and complex inputs. While standalone models excel in processing individual data types like text, image, or audio, they often fail to achieve comparable performance when these modalities are combined. Generative Adversarial Networks (GANs) have emerged as a transformative approach in this domain due to their ability to synthesize and learn across disparate data types effectively. This study addresses the challenge of bridging multimodal datasets to improve the generalizatio
46

Gong, Yuan, Cheng-I. Lai, Yu-An Chung, and James Glass. "SSAST: Self-Supervised Audio Spectrogram Transformer." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 10699–709. http://dx.doi.org/10.1609/aaai.v36i10.21315.

Abstract
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer mod
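The self-supervised pretraining this abstract refers to rests on a masked-patch pretext task: cut the spectrogram into fixed-size patches, hide a random subset, and train the model to predict the hidden ones. The sketch below shows only that data-preparation step; patch size, spectrogram shape, and mask count are illustrative assumptions, not the paper's hyperparameters:

```python
import random

# Sketch of the masked-patch pretext task for self-supervised
# spectrogram pretraining: tile the (time x freq) spectrogram into
# patches and pick a random subset for the model to predict.
def patchify(spec, patch=16):
    """Cut a 2-D spectrogram (list of rows) into patch x patch tiles."""
    tiles = []
    for r in range(0, len(spec) - patch + 1, patch):
        for c in range(0, len(spec[0]) - patch + 1, patch):
            tiles.append([row[c:c + patch] for row in spec[r:r + patch]])
    return tiles

def mask_indices(n_tiles, n_masked, rng):
    """Pick which patch positions the encoder must reconstruct."""
    return sorted(rng.sample(range(n_tiles), n_masked))

spec = [[0.0] * 128 for _ in range(128)]   # stand-in log-mel spectrogram
tiles = patchify(spec)                     # 8 x 8 = 64 patches
masked = mask_indices(len(tiles), 16, random.Random(0))
```

During pretraining the masked positions are replaced by a learnable mask embedding; the loss is computed only at those positions, so no labels are needed.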
47

Appiani, Andrea, and Cigdem Beyan. "VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection." Information 16, no. 3 (2025): 233. https://doi.org/10.3390/info16030233.

Abstract
Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in an audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individu
48

Juby Nedumthakidiyil Zacharias. "Generative product content using vision-language models: Transforming e-commerce experiences." World Journal of Advanced Engineering Technology and Sciences 15, no. 3 (2025): 1130–37. https://doi.org/10.30574/wjaets.2025.15.3.1046.

Abstract
Vision-language models (VLMs) are fundamentally transforming product content creation in e-commerce, representing a paradigm shift in how digital retail platforms manage product information. These sophisticated systems, which leverage dual-encoder architectures and contrastive learning methods, establish meaningful connections between visual attributes and textual descriptions to generate comprehensive product content directly from images. By analyzing product photographs, these models automatically create detailed descriptions, ingredient lists, and usage recommendations with remarkable accur
49

Davis, Jason. "In a Digital World With Generative AI Detection Will Not be Enough." Newhouse Impact Journal 1, no. 1 (2024): 9–12. http://dx.doi.org/10.14305/jn.29960819.2024.1.1.01.

Abstract
Recent and dramatic improvements in AI driven large language models (LLMs), image generators, audio and video have fed an exponential growth in Generative AI applications and accessibility. The disruptive ripples of this rapid evolution have already begun to fundamentally impact how we create and consume content on a global scale. While the use of Generative AI has and will continue to enable massive increases in the speed and efficiency of content creation, it has come at the cost of uncomfortable conversations about transparency and the erosion of digital trust. To have any chance at actuall
50

Armstrong Joseph J and Senthil S. "The Dark Side of Generative AI: Ethical, Security, and Social Concerns." International Research Journal on Advanced Engineering Hub (IRJAEH) 3, no. 04 (2025): 1720–23. https://doi.org/10.47392/irjaeh.2025.0247.

Abstract
Generative Artificial Intelligence (AI) represents a significant leap in technology, enabling the creation of novel content from text, images, videos, and audio. While its potential to drive innovation and improve productivity is immense, the risks associated with its application are equally formidable. This paper explores the darker aspects of generative AI, focusing on ethical dilemmas, social implications, security threats, and the potential for misuse. We examine issues such as misinformation, biases in AI models, job displacement, and the dangers posed by AI-driven automation. Finally, we