Academic literature on the topic 'Neural audio synthesis'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Neural audio synthesis.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of an academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Neural audio synthesis":

1

Li, Dongze, Kang Zhao, Wei Wang, Bo Peng, Yingya Zhang, Jing Dong, and Tieniu Tan. "AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 4 (March 24, 2024): 3037–45. http://dx.doi.org/10.1609/aaai.v38i4.28086.

Abstract:
Audio-driven talking head synthesis is a promising topic with wide applications in digital humans, film making, and virtual reality. Recent NeRF-based approaches have shown superiority in quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only a few seconds of talking video are available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., the mouth is audio related, while the ear is audio independent. In this paper, we present Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues, which can generate realistic portraits of a new speaker from a few-shot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of audio between reference and target image. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio-related and audio-independent regions respectively, with a dual-NeRF framework. Extensive experiments have shown that AE-NeRF surpasses the state of the art in image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or limited training iterations.
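To make the Audio Aware Aggregation idea above concrete, the sketch below weights each reference frame's image features by how similar its audio embedding is to the target frame's audio. This is only an illustrative reading of the abstract: the feature shapes, the cosine-similarity measure, and the softmax temperature are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def audio_aware_aggregation(target_audio, ref_audio, ref_feats, temperature=0.1):
    """Fuse reference image features, weighting each reference by how similar
    its audio embedding is to the target frame's audio embedding.

    target_audio: (D,)         audio embedding of the target frame
    ref_audio:    (N, D)       audio embeddings of N reference frames
    ref_feats:    (N, C, H, W) image features of the N reference frames
    Returns a fused (C, H, W) feature map.
    """
    # Cosine similarity between target audio and each reference audio.
    sim = F.cosine_similarity(target_audio.unsqueeze(0), ref_audio, dim=-1)   # (N,)
    # Turn similarities into aggregation weights.
    weights = F.softmax(sim / temperature, dim=0)                             # (N,)
    # Weighted sum of the reference feature maps.
    return (weights.view(-1, 1, 1, 1) * ref_feats).sum(dim=0)                 # (C, H, W)

# Example with random tensors standing in for real embeddings and features.
fused = audio_aware_aggregation(torch.randn(64), torch.randn(4, 64), torch.randn(4, 128, 32, 32))
print(fused.shape)  # torch.Size([128, 32, 32])
```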
2

Vyawahare, Prof D. G. "Image to Audio Conversion for Blind People Using Neural Network." International Journal for Research in Applied Science and Engineering Technology 11, no. 12 (December 31, 2023): 1949–57. http://dx.doi.org/10.22214/ijraset.2023.57712.

Abstract: The development of an image-to-audio conversion system represents a significant stride towards enhancing accessibility and autonomy for visually impaired individuals. This innovative technology leverages computer vision and audio synthesis techniques to convert visual information from images into auditory cues, enabling blind users to interpret and comprehend their surroundings more effectively. The core of this system relies on advanced computer vision algorithms that process input images, recognizing objects, text, and scene elements. These algorithms employ deep learning models to extract meaningful visual features and convert them into a structured representation of the image content. Simultaneously, natural language processing techniques are employed to extract and interpret textual information within the image, such as signs, labels, or written instructions. Once the image content is comprehended, an audio synthesis engine generates a corresponding auditory output. This auditory output is designed to convey the information in a clear and intuitive manner. Additionally, the system can adapt its output based on user preferences and environmental context, providing a customizable and dynamic auditory experience. It empowers blind individuals to independently access visual information from a variety of sources, including printed materials, digital displays, and real-world scenes. Moreover, it promotes inclusion by reducing the reliance on sighted assistance and fostering greater self-reliance and confidence among visually impaired individuals. By harnessing computer vision and audio synthesis, it provides a means for blind individuals to access and interpret visual information independently, thereby enhancing their autonomy, inclusion, and overall quality of life. This innovative solution underscores the potential of technology to bridge accessibility gaps and empower individuals with disabilities.
3

Kiefer, Chris. "Sample-level sound synthesis with recurrent neural networks and conceptors." PeerJ Computer Science 5 (July 8, 2019): e205. http://dx.doi.org/10.7717/peerj-cs.205.

Abstract:
Conceptors are a recent development in the field of reservoir computing; they can be used to influence the dynamics of recurrent neural networks (RNNs), enabling generation of arbitrary patterns based on training data. Conceptors allow interpolation and extrapolation between patterns, and also provide a system of boolean logic for combining patterns together. Generation and manipulation of arbitrary patterns using conceptors has significant potential as a sound synthesis method for applications in computer music but has yet to be explored. Conceptors are untested with the generation of multi-timbre audio patterns, and little testing has been done on scalability to longer patterns required for audio. A novel method of sound synthesis based on conceptors is introduced. Conceptular Synthesis is based on granular synthesis; sets of conceptors are trained to recall varying patterns from a single RNN, then a runtime mechanism switches between them, generating short patterns which are recombined into a longer sound. The quality of sound resynthesis using this technique is experimentally evaluated. Conceptor models are shown to resynthesise audio with a comparable quality to a close equivalent technique using echo state networks with stored patterns and output feedback. Conceptor models are also shown to excel in their malleability and potential for creative sound manipulation, in comparison to echo state network models which tend to fail when the same manipulations are applied. Examples are given demonstrating creative sonic possibilities, by exploiting conceptor pattern morphing, boolean conceptor logic and manipulation of RNN dynamics. Limitations of conceptor models are revealed with regards to reproduction quality, and pragmatic limitations are also shown, where rises in computation and memory requirements preclude the use of these models for training with longer sound samples. The techniques presented here represent an initial exploration of the sound synthesis potential of conceptors, demonstrating possible creative applications in sound design; future possibilities and research questions are outlined.
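For readers unfamiliar with conceptors, the core computation is compact: from the state correlation matrix R of a driven reservoir, the conceptor is C = R(R + α⁻²I)⁻¹, and multiplying the running state by C constrains the network toward the learned pattern. The sketch below, with a random reservoir and an assumed aperture α, only illustrates that mechanism; it is not the paper's Conceptular Synthesis system, which additionally loads the driven dynamics into the weights and trains audio readouts.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                        # reservoir size
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N))    # recurrent weights
W_in = rng.normal(0, 0.5, (N, 1))              # input weights

def run_reservoir(signal):
    """Drive the reservoir with a 1-D signal and collect its states."""
    x, states = np.zeros(N), []
    for u in signal:
        x = np.tanh(W @ x + W_in[:, 0] * u)
        states.append(x.copy())
    return np.array(states)                    # shape (T, N)

# Compute a conceptor from one driving pattern (here: a sine wave).
pattern = np.sin(2 * np.pi * 5 * np.arange(500) / 500)
X = run_reservoir(pattern)
R = X.T @ X / len(X)                           # state correlation matrix
alpha = 10.0                                   # aperture (assumed value)
C = R @ np.linalg.inv(R + alpha ** -2 * np.eye(N))

# Conceptor-constrained autonomous update. A complete system would also "load"
# the driven dynamics into W and train output weights that read samples back out.
x = X[-1]
for _ in range(200):
    x = C @ np.tanh(W @ x)
```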
4

Liu, Yunyi, and Craig Jin. "Impact on quality and diversity from integrating a reconstruction loss into neural audio synthesis." Journal of the Acoustical Society of America 154, no. 4_supplement (October 1, 2023): A99. http://dx.doi.org/10.1121/10.0022922.

Abstract:
In digital media or games, sound effects are typically recorded or synthesized. While there are a great many digital synthesis tools, the synthesized audio quality is generally not on par with sound recordings. Nonetheless, sound synthesis techniques provide a popular means to generate new sound variations. In this research, we study sound effects synthesis using generative models that are inspired by the models used for high-quality speech and music synthesis. In particular, we explore the trade-off between synthesis quality and variation. With regard to quality, we integrate a reconstruction loss into the original training objective to penalize imperfect audio reconstruction and compare it with neural vocoders and traditional spectrogram inversion methods. We use a Wasserstein GAN (WGAN) as an example model to explore the synthesis quality of generated sound effects, such as footsteps, birds, guns, rain, and engine sounds. In addition to synthesis quality, we also consider the range of sound variation that is possible with our generative model. We report on the trade-off that we obtain with our model regarding the quality and diversity of synthesized sound effects.
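The trade-off studied here comes from adding a reconstruction penalty to the adversarial objective. A minimal sketch of such a combined generator loss follows; the L1 waveform distance, the weighting term lambda_rec, and the assumption that the generator is conditioned on an encoding of the reference audio are all illustrative choices, not the paper's exact configuration.

```python
import torch

def generator_loss_with_reconstruction(critic, generator, cond, real_audio, lambda_rec=10.0):
    """WGAN generator loss augmented with a reconstruction penalty.

    critic(x)        -> scalar realness score per example
    generator(cond)  -> synthesized audio with the same shape as real_audio; `cond` is
                        assumed to be an encoding of real_audio (e.g., its spectrogram),
                        so that reconstruction is well defined
    lambda_rec       -> weight of the reconstruction term (assumed value)
    """
    fake_audio = generator(cond)
    adv_loss = -critic(fake_audio).mean()                       # standard WGAN generator term
    rec_loss = torch.mean(torch.abs(fake_audio - real_audio))   # L1 reconstruction penalty
    return adv_loss + lambda_rec * rec_loss
```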
5

Khandelwal, Karan, Krishiv Pandita, Kshitij Priyankar, Kumar Parakram, and Tejaswini K. "Svara Rachana - Audio Driven Facial Expression Synthesis." International Journal for Research in Applied Science and Engineering Technology 12, no. 5 (May 31, 2024): 2024–29. http://dx.doi.org/10.22214/ijraset.2024.62019.

Abstract:
Svara Rachana is a fusion of artificial intelligence and facial animation that aims to revolutionize the field of digital communication. Harnessing the ever-evolving power of neural networks in the form of a Long Short-Term Memory (LSTM) model, Svara Rachana offers a cutting-edge, interactive web application designed to synchronize human speech with realistic 3D facial animation. Users can upload or record an audio file containing human speech through the web interface, with the core functionality being the generation of synchronized lip movements on a 3D avatar. Special emphasis is placed on the accuracy of the generated facial animation movements. By providing an interactive, human-like 3D model, Svara Rachana aims to make machine-to-human interaction a more impactful experience by blurring the lines between humans and machines.
6

Voitko, Viktoriia, Svitlana Bevz, Sergii Burbelo, and Pavlo Stavytskyi. "Audio Generation Technology of a System of Synthesis and Analysis of Music Compositions." Herald of Khmelnytskyi National University 305, no. 1 (February 23, 2022): 64–67. http://dx.doi.org/10.31891/2307-5732-2022-305-1-64-67.

Abstract:
System of audio synthesis and analysis of music compositions is considered. It consists of two primary parts, the audio analysis component, and the music synthesis component. An audio generation component implements various ways of creating audio sequences. One of them is aimed to record melodies played with voice and transform them into sequences played with selected musical instruments. In addition, an audio input created with a human voice can be utilized as a seed, that is used to generate similar music sequences using artificial intelligence. Finally, a manual approach for music generation and editing is available. After automatic mechanisms for composition generation are used, the results of their work are presented on a two-dimensional plane which represents the dependence of music note pitches on time. It is possible to manually adjust the result of audio generation or create new music sequences with this approach. A creation process could be used iteratively to create multiple parallel music sequences that are to be played as a single audio composition. To implement a seed-based audio synthesis, a deep learning architecture based on a variational autoencoder is used to train a neural network that can reproduce input-like data. When using such an approach an additional important step must be considered. All the input data must be converted from a raw audio format to spectrograms which are represented as grayscale images. Moreover, the result of a sound generation is also represented in a spectrogram and therefore, must be converted back to an output audio format that can be played using speakers. This is required as using spectrograms helps to discard redundant data that raw audio format contains and thus significantly reduces resources consumption and increases overall synthesis speed.
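The pre- and post-processing step this abstract emphasizes (raw audio to spectrogram before the variational autoencoder, then back to audio for playback) can be sketched as below. The mel parameters and Griffin-Lim inversion are assumptions standing in for whatever the system actually uses, and `vae` is a hypothetical placeholder model.

```python
import librosa

def audio_to_logmel(path, sr=22050, n_mels=128):
    """Load audio and convert it to a log-mel spectrogram (a grayscale 'image')."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)

def logmel_to_audio(logmel, sr=22050):
    """Invert a log-mel spectrogram back to a waveform (Griffin-Lim under the hood)."""
    mel = librosa.db_to_power(logmel)
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)

# Hypothetical usage around a trained variational autoencoder `vae` (not defined here):
#   seed = audio_to_logmel("hummed_melody.wav")
#   z = vae.encode(seed)                 # latent representation of the voice-recorded seed
#   similar = vae.decode(z + 0.1)        # decode a nearby latent point -> a similar sequence
#   waveform = logmel_to_audio(similar)  # convert back to audio for playback
```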
7

Li, Naihan, Yanqing Liu, Yu Wu, Shujie Liu, Sheng Zhao, and Ming Liu. "RobuTrans: A Robust Transformer-Based Text-to-Speech Model." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8228–35. http://dx.doi.org/10.1609/aaai.v34i05.6337.

Abstract:
Recently, neural-network-based speech synthesis has achieved outstanding results, producing synthesized audio of excellent quality and naturalness. However, current neural TTS models suffer from a robustness issue, which results in abnormal audio (bad cases), especially for unusual text (unseen context). To build a neural model that can synthesize both natural and stable audio, in this paper we make a deep analysis of why previous neural TTS models are not robust, based on which we propose RobuTrans (Robust Transformer), a robust neural TTS model based on the Transformer. Compared to TransformerTTS, our model first converts input texts to linguistic features, including phonemic and prosodic features, then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. Besides, the position embedding is replaced with a 1-D CNN, since it constrains the maximum length of the synthesized audio. With these modifications, our model not only fixes the robustness problem but also achieves an on-par MOS (4.36) with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.
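The duration-based hard attention described above amounts to expanding each encoder state by its predicted duration so that every output frame attends to exactly one phoneme. A minimal sketch of that expansion (a "length regulator" in FastSpeech terminology) is shown below as an illustration of the idea, not as the paper's implementation.

```python
import torch

def expand_by_duration(encoder_states, durations):
    """Repeat each phoneme-level encoder state by its predicted duration.

    encoder_states: (num_phonemes, hidden_dim)
    durations:      (num_phonemes,) integer frame counts
    Returns frame-level states of shape (sum(durations), hidden_dim), giving
    each output frame a single, hard-aligned source state.
    """
    return torch.repeat_interleave(encoder_states, durations, dim=0)

states = torch.randn(4, 256)               # 4 phonemes
durations = torch.tensor([3, 5, 2, 6])     # predicted frames per phoneme
frame_states = expand_by_duration(states, durations)
print(frame_states.shape)                  # torch.Size([16, 256])
```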
8

Hryhorenko, N., N. Larionov, and V. Bredikhin. "RESEARCH OF THE PROCESS OF VISUAL ART TRANSMISSION IN MUSIC AND THE CREATION OF COLLECTIONS FOR PEOPLE WITH VISUAL IMPAIRMENTS." Municipal economy of cities 6, no. 180 (December 4, 2023): 2–6. http://dx.doi.org/10.33042/2522-1809-2023-6-180-2-6.

Abstract:
This article explores the creation of music through the automated generation of sounds from images. The developed automatic image sound generation method is based on the joint use of neural networks and light-music theory. Translating visual art into music using machine learning models can be used to make extensive museum collections accessible to the visually impaired by translating artworks from an inaccessible sensory modality (sight) to an accessible one (hearing). Studies of other audio-visual models have shown that previous research has focused on improving model performance with multimodal information, as well as improving the accessibility of visual information through audio presentation, so the work process consists of two parts. The result of the work of the first part of the algorithm for determining the tonality of a piece is a graphic annotation of the transformation of the graphic image into a musical series using all colour characteristics, which is transmitted to the input of the neural network. While researching sound synthesis methods, we considered and analysed the most popular ones: additive synthesis, FM synthesis, phase modulation, sampling, table-wave synthesis, linear-arithmetic synthesis, subtractive synthesis, and vector synthesis. Sampling was chosen to implement the system. This method gives the most realistic sound of instruments, which is an important characteristic. The second task of generating music from an image is performed by a recurrent neural network with a two-layer batch LSTM network with 512 hidden units in each LSTM cell, which assembles spectrograms from the input line of the image and converts it into an audio clip. Twenty-nine compositions of modern music were used to train the network. To test the network, we compiled a set of ten test images of different types (abstract images, landscapes, cities, and people) on which the original musical compositions were obtained and stored. In conclusion, it should be noted that the composition generated from abstract images is more pleasant to the ear than the generation from landscapes. In general, the overall impression of the generated compositions is positive. Keywords: recurrent neural network, light music theory, spectrogram, generation of compositions.
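A rough, hedged sketch of the generation model described above (a two-layer LSTM with 512 hidden units per cell that turns lines of an input image into spectrogram frames) is given below; the input width, output dimensionality, and row-by-row mapping are assumptions inferred only from the abstract.

```python
import torch
import torch.nn as nn

class ImageToSpectrogramLSTM(nn.Module):
    """Two-layer LSTM mapping image rows (treated as a sequence) to spectrogram frames."""
    def __init__(self, image_width=256, n_freq_bins=513, hidden_size=512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=image_width, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden_size, n_freq_bins)

    def forward(self, image_rows):
        # image_rows: (batch, num_rows, image_width) -- each image row is one timestep
        out, _ = self.lstm(image_rows)
        return self.proj(out)              # (batch, num_rows, n_freq_bins) spectrogram frames

model = ImageToSpectrogramLSTM()
spec = model(torch.randn(1, 128, 256))     # a 128-row grayscale image
print(spec.shape)                          # torch.Size([1, 128, 513])
```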
9

Andreu, Sergi, and Monica Villanueva Aylagas. "Neural Synthesis of Sound Effects Using Flow-Based Deep Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18, no. 1 (October 11, 2022): 2–9. http://dx.doi.org/10.1609/aiide.v18i1.21941.

Abstract:
Creating variations of sound effects for video games is a time-consuming task that grows with the size and complexity of the games themselves. The process usually comprises recording source material and mixing different layers of sound to create sound effects that are perceived as diverse during gameplay. In this work, we present a method to generate controllable variations of sound effects that can be used in the creative process of sound designers. We adopt WaveFlow, a generative flow model that works directly on raw audio and has proven to perform well for speech synthesis. Using a lower-dimensional mel spectrogram as the conditioner allows both user controllability and a way for the network to generate more diversity. Additionally, it gives the model style transfer capabilities. We evaluate several models in terms of the quality and variability of the generated sounds using both quantitative and subjective evaluations. The results suggest that there is a trade-off between quality and diversity. Nevertheless, our method achieves a quality level similar to that of the training set while generating perceivable variations according to a perceptual study that includes game audio experts.
10

Li, Naihan, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. "Neural Speech Synthesis with Transformer Network." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6706–13. http://dx.doi.org/10.1609/aaai.v33i01.33016706.

Abstract:
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long dependencies using current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different times are connected directly by a self-attention mechanism, which solves the long-range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results. Experiments are conducted to test the efficiency and performance of our new network. For efficiency, our Transformer TTS network can speed up training by about 4.25 times compared with Tacotron2. For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 with a gap of 0.048) and is very close to human quality (4.39 vs. 4.44 in MOS).
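As a rough illustration of the pipeline described above (phoneme sequence in, mel-spectrogram frames out, with a neural vocoder applied afterwards), the sketch below wires a phoneme embedding into a standard encoder-decoder Transformer. The dimensions, vocabulary size, and plain linear mel projection are placeholders, and the prenet/postnet, positional encodings, and stop-token prediction of the actual model are omitted.

```python
import torch
import torch.nn as nn

class MiniTransformerTTS(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.mel_prenet = nn.Linear(n_mels, d_model)   # projects previous mel frames
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.mel_proj = nn.Linear(d_model, n_mels)

    def forward(self, phonemes, prev_mels):
        # phonemes:  (batch, src_len) integer phoneme ids
        # prev_mels: (batch, tgt_len, n_mels) teacher-forced previous frames
        src = self.phoneme_emb(phonemes)
        tgt = self.mel_prenet(prev_mels)
        causal_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.mel_proj(out)   # predicted mel frames; a vocoder turns these into audio

model = MiniTransformerTTS()
mels = model(torch.randint(0, 100, (2, 12)), torch.randn(2, 40, 80))
print(mels.shape)                  # torch.Size([2, 40, 80])
```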

Dissertations / Theses on the topic "Neural audio synthesis":

1

Lundberg, Anton. "Data-Driven Procedural Audio : Procedural Engine Sounds Using Neural Audio Synthesis." Thesis, KTH, Datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-280132.

Abstract:
The currently dominating approach for rendering audio content in interactive media, such as video games and virtual reality, involves playback of static audio files. This approach is inflexible and requires management of large quantities of audio data. An alternative approach is procedural audio, where sound models are used to generate audio in real time from live inputs. While providing many advantages, procedural audio has yet to find widespread use in commercial productions, partly due to the audio produced by many of the proposed models not meeting industry standards. This thesis investigates how procedural audio can be performed using data-driven methods. We do this by specifically investigating how to generate the sound of car engines using neural audio synthesis. Building on a recently published method that integrates digital signal processing with deep learning, called Differentiable Digital Signal Processing (DDSP), our method obtains sound models by training deep neural networks to reconstruct recorded audio examples from interpretable latent features. We propose a method for incorporating engine cycle phase information, as well as a differentiable transient synthesizer. Our results illustrate that DDSP can be used for procedural engine sounds; however, further work is needed before our models can generate engine sounds without undesired artifacts and before they can be used in live real-time applications. We argue that our approach can be useful for procedural audio in more general contexts, and discuss how our method can be applied to other sound sources.
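DDSP-style models such as the one used in this thesis drive a differentiable synthesizer (typically a bank of harmonic oscillators plus filtered noise) with network-predicted controls. The sketch below shows only a generic harmonic-oscillator core, written so it stays differentiable in PyTorch; the control shapes and sample rate are assumptions, and the thesis's engine-cycle phase conditioning and transient synthesizer are not reproduced.

```python
import math
import torch

def harmonic_synth(f0, harmonic_amps, sample_rate=16000):
    """Differentiable additive synthesis of a harmonic signal.

    f0:            (T,)   fundamental frequency per sample, in Hz
    harmonic_amps: (T, K) amplitude of each of K harmonics per sample
    Returns a (T,) waveform: the sum over k of amps[:, k] * sin(phase of harmonic k).
    """
    K = harmonic_amps.shape[1]
    harmonic_numbers = torch.arange(1, K + 1, dtype=f0.dtype)       # (K,)
    freqs = f0.unsqueeze(-1) * harmonic_numbers                     # (T, K) per-harmonic Hz
    # Silence any harmonic that would alias above the Nyquist frequency.
    harmonic_amps = torch.where(freqs < sample_rate / 2, harmonic_amps,
                                torch.zeros_like(harmonic_amps))
    # Phase is the running integral (cumulative sum) of angular frequency.
    phases = 2 * math.pi * torch.cumsum(freqs / sample_rate, dim=0)
    return (harmonic_amps * torch.sin(phases)).sum(dim=-1)          # (T,)

# Half a second of a 100 Hz tone with 20 harmonics of small, equal amplitude.
audio = harmonic_synth(torch.full((8000,), 100.0), torch.full((8000, 20), 0.02))
print(audio.shape)  # torch.Size([8000])
```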
2

Nistal, Hurlé Javier. "Exploring generative adversarial networks for controllable musical audio synthesis." Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAT009.

Abstract:
Audio synthesizers are electronic musical instruments that generate artificial sounds under some parametric control. While synthesizers have evolved since they were popularized in the 70s, two fundamental challenges are still unresolved: 1) the development of synthesis systems responding to semantically intuitive parameters; 2) the design of "universal," source-agnostic synthesis techniques. This thesis researches the use of Generative Adversarial Networks (GAN) towards building such systems. The main goal is to research and develop novel tools for music production that afford intuitive and expressive means of sound manipulation, e.g., by controlling parameters that respond to perceptual properties of the sound and other high-level features. Our first work studies the performance of GANs when trained on various common audio signal representations (e.g., waveform, time-frequency representations). These experiments compare different forms of audio data in the context of tonal sound synthesis. Results show that the Magnitude and Instantaneous Frequency of the phase and the complex-valued Short-Time Fourier Transform achieve the best results. Building on this, our following work presents DrumGAN, a controllable adversarial audio synthesizer of percussive sounds. By conditioning the model on perceptual features describing high-level timbre properties, we demonstrate that intuitive control can be gained over the generation process. This work results in the development of a VST plugin generating full-resolution audio and compatible with any Digital Audio Workstation (DAW). We show extensive musical material produced by professional artists from Sony ATV using DrumGAN. The scarcity of annotations in musical audio datasets challenges the application of supervised methods to conditional generation settings. Our third contribution employs a knowledge distillation approach to extract such annotations from a pre-trained audio tagging system. DarkGAN is an adversarial synthesizer of tonal sounds that employs the output probabilities of such a system (so-called “soft labels”) as conditional information. Results show that DarkGAN can respond moderately to many intuitive attributes, even with out-of-distribution input conditioning. Applications of GANs to audio synthesis typically learn from fixed-size two-dimensional spectrogram data analogously to the "image data" in computer vision; thus, they cannot generate sounds with variable duration. In our fourth paper, we address this limitation by exploiting a self-supervised method for learning discrete features from sequential data. Such features are used as conditional input to provide step-wise time-dependent information to the model. Global consistency is ensured by fixing the input noise z (characteristic in adversarial settings). Results show that, while models trained on a fixed-size scheme obtain better audio quality and diversity, ours can competently generate audio of any duration. One interesting direction for research is the generation of audio conditioned on preexisting musical material, e.g., the generation of some drum pattern given the recording of a bass line. Our fifth paper explores a simple pretext task tailored at learning such types of complex musical relationships. Concretely, we study whether a GAN generator, conditioned on highly compressed MP3 musical audio signals, can generate outputs resembling the original uncompressed audio. 
Results show that the GAN can improve the quality of the audio signals over the MP3 versions for very high compression rates (16 and 32 kbit/s). As a direct consequence of applying artificial intelligence techniques in musical contexts, we ask how AI-based technology can foster innovation in musical practice. Therefore, we conclude this thesis by providing a broad perspective on the development of AI tools for music production, informed by theoretical considerations and reports from real-world AI tool usage by professional artists
3

Andreux, Mathieu. "Foveal autoregressive neural time-series modeling." Electronic Thesis or Diss., Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEE073.

Abstract:
This dissertation studies unsupervised time-series modelling. We first focus on the problem of linearly predicting future values of a time-series under the assumption of long-range dependencies, which requires to take into account a large past. We introduce a family of causal and foveal wavelets which project past values on a subspace which is adapted to the problem, thereby reducing the variance of the associated estimators. We then investigate under which conditions non-linear predictors exhibit better performances than linear ones. Time-series which admit a sparse time-frequency representation, such as audio ones, satisfy those requirements, and we propose a prediction algorithm using such a representation. The last problem we tackle is audio time-series synthesis. We propose a new generation method relying on a deep convolutional neural network, with an encoder-decoder architecture, which allows to synthesize new realistic signals. Contrary to state-of-the-art methods, we explicitly use time-frequency properties of sounds to define an encoder with the scattering transform, while the decoder is trained to solve an inverse problem in an adapted metric

Books on the topic "Neural audio synthesis":

1

Nakagawa, Seiichi. Speech, hearing and neural network models. Tokyo: Ohmsha, 1995.

2

Shikano, K., and Y. Tohkura. Speech, Hearing and Neural Network Models (Biomedical and Health Research). IOS Press, 1995.


Book chapters on the topic "Neural audio synthesis":

1

Eppe, Manfred, Tayfun Alpay, and Stefan Wermter. "Towards End-to-End Raw Audio Music Synthesis." In Artificial Neural Networks and Machine Learning – ICANN 2018, 137–46. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-01424-7_14.

2

Tarjano, Carlos, and Valdecy Pereira. "Neuro-Spectral Audio Synthesis: Exploiting Characteristics of the Discrete Fourier Transform in the Real-Time Simulation of Musical Instruments Using Parallel Neural Networks." In Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series, 362–75. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-30490-4_30.

3

Singh, Harman, Parminder Singh, and Manjot Kaur Gill. "Statistical Parametric Speech Synthesis for Punjabi Language using Deep Neural Network." In SCRS Conference Proceedings on Intelligent Systems, 431–41. Soft Computing Research Society, 2021. http://dx.doi.org/10.52458/978-93-91842-08-6-41.

Abstract:
In recent years, speech technology has advanced rapidly, making speech synthesis an interesting area of study for researchers. A Text-To-Speech (TTS) system generates speech from text using synthesis techniques such as concatenative, formant, articulatory, or Statistical Parametric Speech Synthesis (SPSS). Deep Neural Network (DNN) based SPSS for the Punjabi language is used in this research work. The database used for this work contains 674 audio files and a single text file containing 674 sentences; it was created at the Language Technologies Institute at Carnegie Mellon University (CMU) and is provided under the Festvox distribution. The Ossian toolkit is used as a front-end for text processing. Two DNNs are modeled using the Merlin toolkit: the duration DNN maps linguistic features to duration features of speech, and the acoustic DNN maps linguistic features to acoustic features. A subjective evaluation using the Mean Opinion Score (MOS) shows that this TTS system achieves good naturalness, at 80.2%.
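The two networks described above map linguistic features to durations and to acoustic features, respectively. A compact sketch of such a pair of feed-forward models is shown below; the feature dimensionalities and layer sizes are placeholders, not the values used in the chapter.

```python
import torch.nn as nn

def make_dnn(in_dim, out_dim, hidden=512, layers=4):
    """Simple feed-forward stack used for both SPSS models."""
    blocks, dim = [], in_dim
    for _ in range(layers):
        blocks += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    blocks.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*blocks)

linguistic_dim = 300                              # assumed per-phoneme linguistic feature size
duration_dnn = make_dnn(linguistic_dim, 1)        # linguistic features -> phoneme duration
acoustic_dnn = make_dnn(linguistic_dim + 1, 187)  # + frame position -> acoustic features
# The predicted acoustic features (e.g., mel-cepstra, F0, aperiodicity) are then
# passed to a vocoder to produce the final waveform.
```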
4

Tits, Noé, Kevin El Haddad, and Thierry Dutoit. "The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach." In Human 4.0 - From Biology to Cybernetic. IntechOpen, 2021. http://dx.doi.org/10.5772/intechopen.89849.

Abstract:
As part of the Human-Computer Interaction field, Expressive speech synthesis is a very rich domain as it requires knowledge in areas such as machine learning, signal processing, sociology, and psychology. In this chapter, we will focus mostly on the technical side. From the recording of expressive speech to its modeling, the reader will have an overview of the main paradigms used in this field, through some of the most prominent systems and methods. We explain how speech can be represented and encoded with audio features. We present a history of the main methods of Text-to-Speech synthesis: concatenative, parametric and statistical parametric speech synthesis. Finally, we focus on the last one, with the last techniques modeling Text-to-Speech synthesis as a sequence-to-sequence problem. This enables the use of Deep Learning blocks such as Convolutional and Recurrent Neural Networks as well as Attention Mechanism. The last part of the chapter intends to assemble the different aspects of the theory and summarize the concepts.
5

Min, Zeping, Qian Ge, and Zhong Li. "CAMP: A Unified Data Solution for Mandarin Speech Recognition Tasks." In Advances in Transdisciplinary Engineering. IOS Press, 2023. http://dx.doi.org/10.3233/atde230552.

Abstract:
Speech recognition, the transformation of spoken language into written text, is becoming increasingly vital across a broad range of applications. Despite the advancements in end-to-end Neural Network (NN) based speech recognition systems, the requirement for large volumes of annotated audio data tailored to specific scenarios remains a significant challenge. To address this, we introduce a novel approach, the Character Audio Mix-up (CAMP), which synthesizes scenario-specific audio data for Mandarin at a significantly reduced cost and effort. This method concatenates the audio segments of each character’s Pinyin in the text, obtained through force alignment on an existing annotated dataset, to synthesize the audio. These synthesized audios are then used to train the Automatic Speech Recognition (ASR) models. Experiments conducted on the AISHELL-3, and AIDATATANG datasets validate the effectiveness of CAMP, with ASR models trained on CAMP synthesized data performing relatively well compared to those trained with actual data from these datasets. Further, our ablation study reveals that while synthesized audio data can significantly reduce the need for real annotated audio specific to each scenario, it cannot entirely replace real audio. Thus, the importance of real annotated audio data in specific application scenarios is emphasized.
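As described, CAMP builds new utterance audio by concatenating per-character audio segments cut from a force-aligned corpus. A minimal sketch of that concatenation step follows; the segment-bank format (Pinyin syllable mapped to a list of waveforms) and the random choice among candidate segments are assumptions made for illustration.

```python
import random
import numpy as np

def synthesize_from_segments(pinyin_sequence, segment_bank, pause_samples=800):
    """Concatenate per-syllable audio segments cut from a force-aligned corpus.

    pinyin_sequence: e.g. ["ni3", "hao3", "shi4", "jie4"] -- the target text's Pinyin
    segment_bank:    dict mapping a Pinyin syllable to a list of 1-D waveforms, each
                     extracted via forced alignment from an existing annotated dataset
    pause_samples:   short silence inserted between syllables (assumed length)
    """
    pause = np.zeros(pause_samples, dtype=np.float32)
    pieces = []
    for syllable in pinyin_sequence:
        pieces.append(random.choice(segment_bank[syllable]))  # pick one aligned occurrence
        pieces.append(pause)
    return np.concatenate(pieces)

# The synthesized waveform, paired with its text, is then added to the ASR
# training set as scenario-specific data.
```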

Conference papers on the topic "Neural audio synthesis":

1

Pons, Jordi, Santiago Pascual, Giulio Cengarle, and Joan Serra. "Upsampling Artifacts in Neural Audio Synthesis." In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. http://dx.doi.org/10.1109/icassp39728.2021.9414913.

2

Yang, Zih-Syuan, and Jason Hockman. "A Plugin for Neural Audio Synthesis of Impact Sound Effects." In AM '23: Audio Mostly 2023. New York, NY, USA: ACM, 2023. http://dx.doi.org/10.1145/3616195.3616221.

3

Ezzerg, Abdelhamid, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, and Viacheslav Klimkov. "Enhancing audio quality for expressive Neural Text-to-Speech." In 11th ISCA Speech Synthesis Workshop (SSW 11). ISCA: ISCA, 2021. http://dx.doi.org/10.21437/ssw.2021-14.

4

Shimba, Taiki, Ryuhei Sakurai, Hirotake Yamazoe, and Joo-Ho Lee. "Talking heads synthesis from audio with deep neural networks." In 2015 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2015. http://dx.doi.org/10.1109/sii.2015.7404961.

5

Ramos, Vania Miriam Ortiz, and Sukhan Lee. "Synthesis of Disparate Audio Species via Recurrent Neural Embedding." In 2023 IEEE International Symposium on Multimedia (ISM). IEEE, 2023. http://dx.doi.org/10.1109/ism59092.2023.00036.

6

Antognini, Joseph M., Matt Hoffman, and Ron J. Weiss. "Audio Texture Synthesis with Random Neural Networks: Improving Diversity and Quality." In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019. http://dx.doi.org/10.1109/icassp.2019.8682598.

7

Guo, Yudong, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, and Juyong Zhang. "AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis." In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021. http://dx.doi.org/10.1109/iccv48922.2021.00573.

8

Huang, Mincong (Jerry), Samuel Chabot, and Jonas Braasch. "Panoptic Reconstruction of Immersive Virtual Soundscapes Using Human-Scale Panoramic Imagery with Visual Recognition." In ICAD 2021: The 26th International Conference on Auditory Display. icad.org: International Community for Auditory Display, 2021. http://dx.doi.org/10.21785/icad2021.043.

Abstract:
This work, situated at Rensselaer’s Collaborative-Research Augmented Immersive Virtual Environment Laboratory (CRAIVE-Lab), uses panoramic image datasets for spatial audio display. A system is developed for the room-centered immersive virtual reality facility to analyze panoramic images on a segment-by-segment basis, using pre-trained neural network models for semantic segmentation and object detection, thereby generating audio objects with respective spatial locations. These audio objects are then mapped with a series of synthetic and recorded audio datasets and populated within a spatial audio environment as virtual sound sources. The resulting audiovisual outcomes are then displayed using the facility’s human-scale panoramic display, as well as the 128-channel loudspeaker array for wave field synthesis (WFS). Performance evaluation indicates effectiveness for real-time enhancements, with potentials for large-scale expansion and rapid deployment in dynamic immersive virtual environments.
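One step this abstract describes, turning an object detected in a panoramic image into a virtual sound source with a spatial position, reduces to mapping the object's pixel location to a direction around the listener. The sketch below shows that mapping under assumed conventions (equirectangular panorama, azimuth measured from the image centre); it is not the CRAIVE-Lab implementation.

```python
def panorama_pixel_to_direction(x_pixel, y_pixel, image_width, image_height):
    """Map a pixel in an equirectangular panorama to (azimuth, elevation) in degrees.

    Azimuth spans the panorama's full 360 degrees (0 = image centre = straight ahead);
    elevation runs from +90 at the top of the image to -90 at the bottom.
    """
    azimuth = (x_pixel / image_width) * 360.0 - 180.0
    elevation = 90.0 - (y_pixel / image_height) * 180.0
    return azimuth, elevation

# A detected object whose bounding-box centre sits at (3000, 400) in an
# 8192 x 2048 panorama would be spatialized from roughly here:
print(panorama_pixel_to_direction(3000, 400, 8192, 2048))  # approximately (-48.2, 54.8)
```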
9

Kazakova, Sophia A., Anastasia A. Zorkina, Armen M. Kocharyan, Aleksei N. Svischev, and Sergey V. Rybin. "Expressive Audio Data Augmentation Based on Speech Synthesis and Deep Neural Networks." In 2023 International Conference on Quality Management, Transport and Information Security, Information Technologies (IT&QM&IS). IEEE, 2023. http://dx.doi.org/10.1109/itqmtis58985.2023.10346366.

10

Liu, Yunyi, and Craig Jin. "Impact on quality and diversity from integrating a reconstruction loss into neural audio synthesis." In 185th Meeting of the Acoustical Society of America. ASA, 2023. http://dx.doi.org/10.1121/2.0001871.

