Journal articles on the topic 'Neural audio synthesis'

Author: Grafiati

Published: 1 June 2024

Consult the top 50 journal articles for your research on the topic 'Neural audio synthesis.'

1

Li, Dongze, Kang Zhao, Wei Wang, Bo Peng, Yingya Zhang, Jing Dong, and Tieniu Tan. "AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 4 (March 24, 2024): 3037–45. http://dx.doi.org/10.1609/aaai.v38i4.28086.

Abstract:

Audio-driven talking head synthesis is a promising topic with wide applications in digital humans, filmmaking and virtual reality. Recent NeRF-based approaches have shown superior quality and fidelity compared to previous studies. However, when it comes to few-shot talking head generation, a practical scenario where only a few seconds of talking video are available for one identity, two limitations emerge: 1) they either have no base model, which serves as a facial prior for fast convergence, or ignore the importance of audio when building the prior; 2) most of them overlook the degree of correlation between different face regions and audio, e.g., the mouth is audio related, while the ear is audio independent. In this paper, we present Audio Enhanced Neural Radiance Field (AE-NeRF) to tackle the above issues; it can generate realistic portraits of a new speaker from a few-shot dataset. Specifically, we introduce an Audio Aware Aggregation module into the feature fusion stage of the reference scheme, where the weight is determined by the similarity of audio between the reference and target images. Then, an Audio-Aligned Face Generation strategy is proposed to model the audio-related and audio-independent regions separately, with a dual-NeRF framework. Extensive experiments have shown that AE-NeRF surpasses the state of the art on image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or few training iterations.

2

Vyawahare, Prof D. G. "Image to Audio Conversion for Blind People Using Neural Network." International Journal for Research in Applied Science and Engineering Technology 11, no. 12 (December 31, 2023): 1949–57. http://dx.doi.org/10.22214/ijraset.2023.57712.

Abstract:

The development of an image-to-audio conversion system represents a significant stride towards enhancing accessibility and autonomy for visually impaired individuals. This innovative technology leverages computer vision and audio synthesis techniques to convert visual information from images into auditory cues, enabling blind users to interpret and comprehend their surroundings more effectively. The core of this system relies on advanced computer vision algorithms that process input images, recognizing objects, text, and scene elements. These algorithms employ deep learning models to extract meaningful visual features and convert them into a structured representation of the image content. Simultaneously, natural language processing techniques are employed to extract and interpret textual information within the image, such as signs, labels, or written instructions. Once the image content is comprehended, an audio synthesis engine generates a corresponding auditory output. This auditory output is designed to convey the information in a clear and intuitive manner. Additionally, the system can adapt its output based on user preferences and environmental context, providing a customizable and dynamic auditory experience. It empowers blind individuals to independently access visual information from a variety of sources, including printed materials, digital displays, and real-world scenes. Moreover, it promotes inclusion by reducing the reliance on sighted assistance and fostering greater self-reliance and confidence among visually impaired individuals. By harnessing computer vision and audio synthesis, it provides a means for blind individuals to access and interpret visual information independently, thereby enhancing their autonomy, inclusion, and overall quality of life. This innovative solution underscores the potential of technology to bridge accessibility gaps and empower individuals with disabilities.

3

Kiefer, Chris. "Sample-level sound synthesis with recurrent neural networks and conceptors." PeerJ Computer Science 5 (July 8, 2019): e205. http://dx.doi.org/10.7717/peerj-cs.205.

Abstract:

Conceptors are a recent development in the field of reservoir computing; they can be used to influence the dynamics of recurrent neural networks (RNNs), enabling generation of arbitrary patterns based on training data. Conceptors allow interpolation and extrapolation between patterns, and also provide a system of boolean logic for combining patterns together. Generation and manipulation of arbitrary patterns using conceptors has significant potential as a sound synthesis method for applications in computer music but has yet to be explored. Conceptors are untested with the generation of multi-timbre audio patterns, and little testing has been done on scalability to longer patterns required for audio. A novel method of sound synthesis based on conceptors is introduced. Conceptular Synthesis is based on granular synthesis; sets of conceptors are trained to recall varying patterns from a single RNN, then a runtime mechanism switches between them, generating short patterns which are recombined into a longer sound. The quality of sound resynthesis using this technique is experimentally evaluated. Conceptor models are shown to resynthesise audio with a comparable quality to a close equivalent technique using echo state networks with stored patterns and output feedback. Conceptor models are also shown to excel in their malleability and potential for creative sound manipulation, in comparison to echo state network models which tend to fail when the same manipulations are applied. Examples are given demonstrating creative sonic possibilities, by exploiting conceptor pattern morphing, boolean conceptor logic and manipulation of RNN dynamics. Limitations of conceptor models are revealed with regards to reproduction quality, and pragmatic limitations are also shown, where rises in computation and memory requirements preclude the use of these models for training with longer sound samples. 
The techniques presented here represent an initial exploration of the sound synthesis potential of conceptors, demonstrating possible creative applications in sound design; future possibilities and research questions are outlined.
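
The conceptor operations mentioned in this abstract can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the standard definition from the conceptor literature, C = R (R + aperture^-2 I)^-1 with R the correlation matrix of reservoir states, and the boolean NOT as I - C. The two-dimensional toy states and all function names are hypothetical and are not taken from the paper.

```python
# Toy 2-D conceptor, assuming the standard definition C = R (R + a^-2 I)^-1.
# All names and the tiny state set are illustrative, not from the paper.

def correlation(states):
    """R = average of x x^T over a list of 2-D reservoir states."""
    n = len(states)
    r = [[0.0, 0.0], [0.0, 0.0]]
    for x in states:
        for i in range(2):
            for j in range(2):
                r[i][j] += x[i] * x[j] / n
    return r

def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def conceptor(states, aperture):
    """Conceptor for the given states; larger aperture gives a 'wider' filter."""
    r = correlation(states)
    reg = [[r[i][j] + (aperture ** -2 if i == j else 0.0) for j in range(2)]
           for i in range(2)]
    return matmul(r, inverse_2x2(reg))

def negate(c):
    """Boolean NOT of a conceptor: I - C."""
    return [[(1.0 if i == j else 0.0) - c[i][j] for j in range(2)]
            for i in range(2)]

# States that only vary along the first axis.
c = conceptor([(1.0, 0.0), (-1.0, 0.0)], aperture=1.0)
not_c = negate(c)
```

With states confined to one axis, the conceptor passes part of that axis and blocks the other, while its negation does the opposite.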

4

Liu, Yunyi, and Craig Jin. "Impact on quality and diversity from integrating a reconstruction loss into neural audio synthesis." Journal of the Acoustical Society of America 154, no. 4_supplement (October 1, 2023): A99. http://dx.doi.org/10.1121/10.0022922.

Abstract:

In digital media or games, sound effects are typically recorded or synthesized. While there are a great many digital synthesis tools, the synthesized audio quality is generally not on par with sound recordings. Nonetheless, sound synthesis techniques provide a popular means to generate new sound variations. In this research, we study sound effects synthesis using generative models that are inspired by the models used for high-quality speech and music synthesis. In particular, we explore the trade-off between synthesis quality and variation. With regard to quality, we integrate a reconstruction loss into the original training objective to penalize imperfect audio reconstruction and compare it with neural vocoders and traditional spectrogram inversion methods. We use a Wasserstein GAN (WGAN) as an example model to explore the synthesis quality of generated sound effects, such as footsteps, birds, guns, rain, and engine sounds. In addition to synthesis quality, we also consider the range of sound variation that is possible with our generative model. We report on the trade-off that we obtain with our model regarding the quality and diversity of synthesized sound effects.
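
The idea of adding a reconstruction term to a WGAN training objective can be sketched as follows. This is a hedged illustration, not the authors' formulation: the abstract does not say which reconstruction loss or weighting was used, so the L1 distance, the `recon_weight` parameter, and the function name are all assumptions.

```python
# Sketch of a WGAN generator objective with an added reconstruction penalty.
# The L1 form and the weight are assumptions; the abstract does not specify them.

def generator_loss(critic_scores_fake, target, reconstruction, recon_weight=10.0):
    """Wasserstein generator loss plus a weighted L1 reconstruction term.

    critic_scores_fake: critic outputs for generated samples
    target, reconstruction: equal-length sequences of audio or spectrogram values
    """
    if len(target) != len(reconstruction):
        raise ValueError("target and reconstruction must have the same length")
    # Standard WGAN generator term: raise the critic score of generated samples.
    adversarial = -sum(critic_scores_fake) / len(critic_scores_fake)
    # Penalize imperfect reconstruction of the reference sound.
    l1 = sum(abs(t - r) for t, r in zip(target, reconstruction)) / len(target)
    return adversarial + recon_weight * l1
```

Raising `recon_weight` pulls outputs toward the reference sound, which is one plausible way the quality and diversity trade-off described above could arise.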

5

Khandelwal, Karan, Krishiv Pandita, Kshitij Priyankar, Kumar Parakram, and Tejaswini K. "Svara Rachana - Audio Driven Facial Expression Synthesis." International Journal for Research in Applied Science and Engineering Technology 12, no. 5 (May 31, 2024): 2024–29. http://dx.doi.org/10.22214/ijraset.2024.62019.

Abstract:

Svara Rachana is a fusion of artificial intelligence and facial animation which aims to revolutionize the field of digital communication. Harnessing the ever-evolving power of neural networks in the form of a Long Short-Term Memory (LSTM) model, Svara Rachana offers a cutting-edge, interactive web application designed to synchronize human speech with realistic 3D facial animation. Users can upload or record an audio file containing human speech through the web interface, with the core functionality being the generation of synchronized lip movements on a 3D avatar. The system places special emphasis on generating accurate, reliable facial animation movements. By providing an interactive, human-like 3D model, Svara Rachana aims to make machine-to-human interaction a more impactful experience by blurring the lines between humans and machines.

6

VOITKO, Viktoriia, Svitlana BEVZ, Sergii BURBELO, and Pavlo STAVYTSKYI. "AUDIO GENERATION TECHNOLOGY OF A SYSTEM OF SYNTHESIS AND ANALYSIS OF MUSIC COMPOSITIONS." Herald of Khmelnytskyi National University 305, no. 1 (February 23, 2022): 64–67. http://dx.doi.org/10.31891/2307-5732-2022-305-1-64-67.

Abstract:

A system for audio synthesis and analysis of music compositions is considered. It consists of two primary parts: the audio analysis component and the music synthesis component. The audio generation component implements various ways of creating audio sequences. One of them records melodies performed with the voice and transforms them into sequences played with selected musical instruments. In addition, an audio input created with a human voice can be used as a seed to generate similar music sequences using artificial intelligence. Finally, a manual approach for music generation and editing is available. After the automatic mechanisms for composition generation are used, their results are presented on a two-dimensional plane that represents the dependence of note pitch on time. With this approach it is possible to manually adjust the result of audio generation or create new music sequences. The creation process can be used iteratively to create multiple parallel music sequences that are played as a single audio composition. To implement seed-based audio synthesis, a deep learning architecture based on a variational autoencoder is used to train a neural network that can reproduce input-like data. This approach requires an important additional step: all the input data must be converted from a raw audio format to spectrograms, which are represented as grayscale images. Moreover, the result of sound generation is also represented as a spectrogram and therefore must be converted back to an output audio format that can be played through speakers. This is required because spectrograms discard redundant data that the raw audio format contains, which significantly reduces resource consumption and increases overall synthesis speed.
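
The conversion from raw audio to a grayscale spectrogram image described in this abstract might look roughly like the sketch below. It is an assumption-laden illustration using only the Python standard library: the frame length, hop size, Hann window, and the 80 dB dynamic range are arbitrary choices, and the function names are hypothetical. The reverse step (turning a generated spectrogram back into audio) needs phase reconstruction and is not shown.

```python
import cmath
import math

def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Short-time magnitude spectrum via a naive DFT (slow but dependency-free)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]  # Hann window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames

def to_grayscale(spec, floor_db=-80.0):
    """Map magnitudes to 0..255 pixel values on a dB scale relative to the peak."""
    peak = max(max(row) for row in spec) or 1.0
    image = []
    for row in spec:
        pixels = []
        for m in row:
            db = 20 * math.log10(max(m / peak, 1e-10))
            db = max(db, floor_db)
            pixels.append(round(255 * (db - floor_db) / -floor_db))
        image.append(pixels)
    return image

# A short test tone centred on DFT bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
image = to_grayscale(magnitude_spectrogram(tone))
```

Each row of `image` is one time frame and each column one frequency bin, so the result can be stored as a single-channel image for a model such as the variational autoencoder mentioned above.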

7

Li, Naihan, Yanqing Liu, Yu Wu, Shujie Liu, Sheng Zhao, and Ming Liu. "RobuTrans: A Robust Transformer-Based Text-to-Speech Model." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8228–35. http://dx.doi.org/10.1609/aaai.v34i05.6337.

Abstract:

Recently, neural network based speech synthesis has achieved outstanding results, producing synthesized audio of excellent quality and naturalness. However, current neural TTS models suffer from a robustness issue, which results in abnormal audio (bad cases), especially for unusual text (unseen context). To build a neural model that can synthesize both natural and stable audio, in this paper we analyze in depth why previous neural TTS models are not robust, and on that basis propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Compared with TransformerTTS, our model first converts input texts to linguistic features, including phonemic features and prosodic features, then feeds them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. In addition, the position embedding is replaced with a 1-D CNN, since position embedding constrains the maximum length of synthesized audio. With these modifications, our model not only fixes the robustness problem but also achieves a MOS (4.36) on par with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.
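
One common way to realise a duration-based hard alignment is to repeat each encoder state for as many output frames as its predicted duration, so every decoder frame attends to exactly one input unit. The sketch below illustrates that general idea only; it is an assumption about how such a mechanism can work, not RobuTrans's actual implementation, and the function name is hypothetical.

```python
# Sketch of duration-based hard alignment: each output frame maps to exactly
# one input unit. An illustration of the general idea, not the paper's code.

def expand_by_duration(encoder_states, durations):
    """Repeat each encoder state `durations[i]` times, in order."""
    if len(encoder_states) != len(durations):
        raise ValueError("need one duration per encoder state")
    frames = []
    for state, n_frames in zip(encoder_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([state] * n_frames)
    return frames

# Three phoneme-level states stretched to six decoder frames.
frames = expand_by_duration(["h", "e", "y"], [2, 1, 3])
```

Because the alignment is fixed by the durations, it cannot skip or repeat inputs the way a learned soft attention can, which is the usual robustness argument for this design.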

8

Hryhorenko, N., N. Larionov, and V. Bredikhin. "RESEARCH OF THE PROCESS OF VISUAL ART TRANSMISSION IN MUSIC AND THE CREATION OF COLLECTIONS FOR PEOPLE WITH VISUAL IMPAIRMENTS." Municipal economy of cities 6, no. 180 (December 4, 2023): 2–6. http://dx.doi.org/10.33042/2522-1809-2023-6-180-2-6.

Abstract:

This article explores the creation of music through the automated generation of sounds from images. The developed method for automatic sound generation from images is based on the joint use of neural networks and light-music theory. Translating visual art into music using machine learning models can make extensive museum collections accessible to the visually impaired by translating artworks from an inaccessible sensory modality (sight) to an accessible one (hearing). Studies of other audio-visual models have shown that previous research has focused on improving model performance with multimodal information, as well as improving the accessibility of visual information through audio presentation, so the work process consists of two parts. The first part of the algorithm determines the tonality of a piece; its result is a graphic annotation of the transformation of the image into a musical series using all colour characteristics, which is passed to the input of the neural network. While researching sound synthesis methods, we considered and analysed the most popular ones: additive synthesis, FM synthesis, phase modulation, sampling, wavetable synthesis, linear-arithmetic synthesis, subtractive synthesis, and vector synthesis. Sampling was chosen to implement the system because it gives the most realistic instrument sound, which is an important characteristic. The second part, generating music from an image, is performed by a recurrent neural network: a two-layer batch LSTM network with 512 hidden units in each LSTM cell, which assembles spectrograms from the input line of the image and converts them into an audio clip. Twenty-nine compositions of modern music were used to train the network. To test the network, we compiled a set of ten test images of different types (abstract images, landscapes, cities, and people), from which original musical compositions were generated and stored.
In conclusion, it should be noted that compositions generated from abstract images are more pleasant to the ear than those generated from landscapes. In general, the overall impression of the generated compositions is positive. Keywords: recurrent neural network, light music theory, spectrogram, generation of compositions.

9

Andreu, Sergi, and Monica Villanueva Aylagas. "Neural Synthesis of Sound Effects Using Flow-Based Deep Generative Models." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18, no. 1 (October 11, 2022): 2–9. http://dx.doi.org/10.1609/aiide.v18i1.21941.

Abstract:

Creating variations of sound effects for video games is a time-consuming task that grows with the size and complexity of the games themselves. The process usually comprises recording source material and mixing different layers of sound to create sound effects that are perceived as diverse during gameplay. In this work, we present a method to generate controllable variations of sound effects that can be used in the creative process of sound designers. We adopt WaveFlow, a generative flow model that works directly on raw audio and has proven to perform well for speech synthesis. Using a lower-dimensional mel spectrogram as the conditioner allows both user controllability and a way for the network to generate more diversity. Additionally, it gives the model style transfer capabilities. We evaluate several models in terms of the quality and variability of the generated sounds using both quantitative and subjective evaluations. The results suggest that there is a trade-off between quality and diversity. Nevertheless, our method achieves a quality level similar to that of the training set while generating perceivable variations according to a perceptual study that includes game audio experts.

10

Li, Naihan, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. "Neural Speech Synthesis with Transformer Network." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6706–13. http://dx.doi.org/10.1609/aaai.v33i01.33016706.

Abstract:

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different times are connected directly by a self-attention mechanism, which solves the long-range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results. Experiments are conducted to test the efficiency and performance of our new network. For efficiency, our Transformer TTS network speeds up training by a factor of about 4.25 compared with Tacotron2. For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforms Tacotron2 with a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).
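
The claim that self-attention connects any two time steps directly can be seen in a minimal single-head scaled dot-product attention. The sketch below is a generic textbook form written with the standard library only; it omits the learned projections and the multiple heads of a real Transformer, and the function names are illustrative rather than taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Every position receives a non-zero weight from every other position,
# regardless of how far apart they are in the sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Since each output row is computed independently of the others, all positions can be processed in parallel, which is the efficiency argument made in the abstract.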

11

Li, Yusen, Ying Shen, and Dongqing Wang. "DIFFBAS: An Advanced Binaural Audio Synthesis Model Focusing on Binaural Differences Recovery." Applied Sciences 14, no. 8 (April 17, 2024): 3385. http://dx.doi.org/10.3390/app14083385.

Abstract:

Binaural audio synthesis (BAS) aims to restore binaural audio from mono signals obtained from the environment to enhance users’ immersive experiences. It plays an essential role in building Augmented Reality and Virtual Reality environments. Existing deep neural network (DNN)-based BAS systems synthesize binaural audio by modeling the overall sound propagation processes from the source to the left and right ears, which encompass early decay, room reverberation, and head/ear-related filtering. However, this end-to-end modeling approach introduces an overfitting problem when BAS models are trained on a small and homogeneous data set. Moreover, existing losses cannot supervise the training process well. As a consequence, the accuracy of synthesized binaural audio is far from satisfactory with respect to binaural differences. In this work, we propose a novel DNN-based BAS method, namely DIFFBAS, to improve the accuracy of synthesized binaural audio from the perspective of the interaural phase difference. Specifically, DIFFBAS is trained using the average signals of the left and right channels. To make the model learn the binaural differences, we propose a new loss named Interaural Phase Difference (IPD) loss to supervise the model training. Extensive experiments have been performed and the results demonstrate the effectiveness of the DIFFBAS model and the IPD loss.
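
For readers unfamiliar with the quantity involved, the interaural phase difference of a single time-frequency bin is the phase of the left-channel coefficient relative to the right-channel one. The sketch below computes it and a simple phase-wrapped absolute-error loss. The abstract does not give the exact form of the proposed IPD loss, so the loss here is an assumed generic formulation, and all names are hypothetical.

```python
import cmath
import math

def ipd(left_bin, right_bin):
    """Interaural phase difference of one complex STFT bin, in (-pi, pi]."""
    return cmath.phase(left_bin * right_bin.conjugate())

def ipd_loss(pred_left, pred_right, true_left, true_right):
    """Mean absolute IPD error with wrap-around, over lists of complex bins.

    An assumed generic formulation; the paper's exact loss may differ.
    """
    total = 0.0
    for pl, pr, tl, tr in zip(pred_left, pred_right, true_left, true_right):
        diff = ipd(pl, pr) - ipd(tl, tr)
        # Wrap to (-pi, pi] so that errors near +/-pi are not overcounted.
        diff = math.atan2(math.sin(diff), math.cos(diff))
        total += abs(diff)
    return total / len(pred_left)
```

The wrap-around step matters because a predicted phase difference of 3.1 rad and a true one of -3.1 rad are physically close even though their raw difference is large.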

12

Roebel, Axel, and Frederik Bous. "Neural Vocoding for Singing and Speaking Voices with the Multi-Band Excited WaveNet." Information 13, no. 3 (February 23, 2022): 103. http://dx.doi.org/10.3390/info13030103.

Abstract:

The use of the mel spectrogram as a signal parameterization for voice generation is quite recent and linked to the development of neural vocoders. These are deep neural networks that allow reconstructing high-quality speech from a given mel spectrogram. While initially developed for speech synthesis, now neural vocoders have also been studied in the context of voice attribute manipulation, opening new means for voice processing in audio production. However, to be able to apply neural vocoders in real-world applications, two problems need to be addressed: (1) To support use in professional audio workstations, the computational complexity should be small, (2) the vocoder needs to support a large variety of speakers, differences in voice qualities, and a wide range of intensities potentially encountered during audio production. In this context, the present study will provide a detailed description of the Multi-band Excited WaveNet, a fully convolutional neural vocoder built around signal processing blocks. It will evaluate the performance of the vocoder when trained on a variety of multi-speaker and multi-singer databases, including an experimental evaluation of the neural vocoder trained on speech and singing voices. Addressing the problem of intensity variation, the study will introduce a new adaptive signal normalization scheme that allows for robust compensation for dynamic and static gain variations. Evaluations are performed using objective measures and a number of perceptual tests including different neural vocoder algorithms known from the literature. The results confirm that the proposed vocoder compares favorably to the state-of-the-art in its capacity to generalize to unseen voices and voice qualities. The remaining challenges will be discussed.
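
As background for the mel spectrogram parameterization mentioned in this abstract, the sketch below converts between hertz and mels and derives band edges that are equally spaced on the mel scale. The constants 2595 and 700 follow one widely used convention and are an assumption here; the paper may use a different variant, and the function names are illustrative.

```python
import math

def hz_to_mel(hz):
    """Hertz to mels, using the common 2595 * log10(1 + f / 700) convention."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Edge frequencies (Hz) of n_bands bands equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

# Ten bands up to 8 kHz: narrow at low frequencies, wide at high ones.
edges = mel_band_edges(0.0, 8000.0, 10)
```

The widening bands are what make a mel spectrogram a compact representation: fine resolution is kept where hearing is most sensitive and discarded elsewhere, which is why a vocoder must reconstruct the missing detail.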

13

García, Víctor, Inma Hernáez, and Eva Navas. "Evaluation of Tacotron Based Synthesizers for Spanish and Basque." Applied Sciences 12, no. 3 (February 7, 2022): 1686. http://dx.doi.org/10.3390/app12031686.

Abstract:

In this paper, we describe the implementation and evaluation of Text to Speech synthesizers based on neural networks for Spanish and Basque. Several voices were built, all of them using a limited amount of data. The system applies Tacotron 2 to compute mel-spectrograms from the input sequence, followed by WaveGlow as a neural vocoder to obtain the audio signals from the spectrograms. The limited amount of training data leads to synthesis errors in some sentences. To automatically detect those errors, we developed a new method that finds the sentences that have lost alignment during the inference process. To mitigate the problem, we implemented guided attention, providing the system with the explicit duration of the phonemes. The resulting system was evaluated to assess its robustness, quality and naturalness with both objective and subjective measures. The results show that the system can produce natural, good-quality audio.

14

Prihasto, Bima, and Nur Fajri Azhar. "Evaluation of Recurrent Neural Network Based on Indonesian Speech Synthesis for Small Datasets." Advances in Science and Technology 104 (February 2021): 17–25. http://dx.doi.org/10.4028/www.scientific.net/ast.104.17.

Abstract:

Recurrent neural networks (RNNs) are currently rarely applied to audio and speech processing of Indonesian-language voice data. This matters because Indonesian languages have different characteristics from foreign languages. We therefore evaluate a number of RNN methods for speech synthesis in Indonesian. Using objective measurements, we find that LSTM generally produces better sound quality than GRU, while MGU2, a derivative of GRU, achieves the best model training time.

15

Venkatesh, Satvik, David Moffat, and Eduardo Reck Miranda. "Investigating the Effects of Training Set Synthesis for Audio Segmentation of Radio Broadcast." Electronics 10, no. 7 (March 31, 2021): 827. http://dx.doi.org/10.3390/electronics10070827.

Abstract:

Music and speech detection provides valuable information regarding the nature of content in broadcast audio. It helps detect acoustic regions that contain speech, voice over music, only music, or silence. In recent years, there have been developments in machine learning algorithms to accomplish this task. However, broadcast audio is generally well-mixed and copyrighted, which makes it challenging to share across research groups. In this study, we address the challenges encountered in automatically synthesising data that resembles a radio broadcast. Firstly, we compare state-of-the-art neural network architectures such as CNN, GRU, LSTM, TCN, and CRNN. Secondly, we investigate how audio ducking of background music impacts the precision and recall of the machine learning algorithm. Thirdly, we examine how the quantity of synthetic training data impacts the results. Finally, we evaluate the effectiveness of synthesised, real-world, and combined approaches for training models, to understand whether the synthetic data adds any value. Amongst the network architectures, CRNN performed best. Results also show that the minimum level of audio ducking preferred by the machine learning algorithm was similar to that of human listeners. After testing our model on in-house and public datasets, we observe that our proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative.
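
Audio ducking, as used when synthesising "voice over music" training examples, amounts to attenuating the music by some number of decibels before mixing it with speech. The sketch below applies a constant attenuation for simplicity; a real broadcast chain would vary the gain with the speech envelope. The function name and parameter are hypothetical and not taken from the study.

```python
# Sketch of mixing speech over ducked (attenuated) background music.
# Constant ducking is a simplification; names are illustrative only.

def duck_and_mix(speech, music, ducking_db):
    """Attenuate `music` by `ducking_db` decibels and add it to `speech`."""
    if len(speech) != len(music):
        raise ValueError("speech and music must have the same length")
    gain = 10 ** (-ducking_db / 20.0)  # dB attenuation to linear amplitude
    return [s + gain * m for s, m in zip(speech, music)]

# 20 dB of ducking scales the music amplitude by a factor of 0.1.
mixed = duck_and_mix([0.0, 0.5], [1.0, 1.0], ducking_db=20.0)
```

Sweeping `ducking_db` over a range of values is one way to generate the varied mixes needed to study how ducking level affects detection precision and recall.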

16

Tao Chen. "Music Tone Synthesis based Anti-Interference Dynamic Integral Neural Network optimized with Artificial Hummingbird Optimization Algorithm." Journal of Electrical Systems 20, no. 3s (April 4, 2024): 2665–76. http://dx.doi.org/10.52783/jes.3162.

Abstract:

Music tone synthesis is a method for identifying and studying the specific orchestral tones present in a piece of music. It is especially useful in oral music training classes, where it can support the practice and development of musicians. It allows singers and composers to produce a large number of sounds and to imitate instruments and effects that would be impractical or unrealistic to create with standard classical instruments. In this manuscript, Music Tone Synthesis based on an Anti-Interference Dynamic Integral Neural Network optimized with the Artificial Hummingbird Optimization Algorithm (MTS-AIDINN-AHOA) is proposed. The input data are obtained from the audio signal. The data are then pre-processed using Stein Particle Filtering (SPF) to remove noise. The pre-processed data are passed to the Two-sided Offset Quaternion Linear Canonical Transform (TSOQLCT) to extract musical features such as melody, harmony, tempo, and dynamics. The extracted features are then provided to the Anti-Interference Dynamic Integral Neural Network (AIDINN), which performs the music tone synthesis and classifies the result as pitch, chronaxie, volume, or tone color. In general, AIDINN does not by itself include an adaptive optimization strategy for determining ideal parameters to ensure precise prediction. The Artificial Hummingbird Optimization Algorithm is therefore used to optimize AIDINN for music tone synthesis. The proposed MTS-AIDINN-AHOA method is implemented in MATLAB, and its performance is evaluated against other existing techniques.
The proposed technique attains 26.36%, 20.69% and 35.29% higher accuracy, 19.23%, 23.56%, and 33.96% higher precision, 26.28%, 31.26%, and 19.66% higher recall, and 28.96%, 33.21% and 23.89% higher specificity compared with existing methods, namely Research on Musical Tone Recognition Method Based on Improved RNN for Vocal Music Teaching Network Courses (MTS-RNN), Music Timbre Extracted from Audio Signal Features (MTS-BPNN), and Feature Extraction and Categorization of Music Content Based on Deep Learning (MTS-SMNN), respectively.

17

Serebryanaya, L. V., and I. E. Lasy. "Automatic recognition and representation of text in the form of audio stream." Doklady BGUIR 19, no. 6 (October 1, 2021): 51–58. http://dx.doi.org/10.35596/1729-7648-2021-19-6-51-58.

Abstract:

The problem of automatic speech generation from a text file is considered. An analytical review of software designed to recognize texts and convert them to an audio stream has been completed, and the advantages and disadvantages of these products are assessed. Based on this, a conclusion was drawn about the relevance of developing software for automatic generation of an audio stream from a text in Russian. Models based on artificial neural networks that are used for speech synthesis are analyzed. A mathematical model of the software is then built. It consists of three components: a convolutional encoder, a convolutional decoder, and a transformer. The architecture of the software is designed. It includes a graphical interface, an application server, and a speech synthesis system. A number of algorithms have been developed: preprocessing text before loading it into the software, converting audio files of a training sample and training a network, and generating speech from arbitrary text files. The resulting software is a single-page application with a web interface for interacting with the user. To assess the quality of the software, a metric was used that represents the average score of different opinions. After aggregating the opinions, the metric reached a sufficiently high value, on the basis of which it can be assumed that all the tasks have been solved.

18

Patnaik, W. Shivani. "Background Noise Suppression in Audio File using LSTM Network." International Journal for Research in Applied Science and Engineering Technology 10, no. 6 (June 30, 2022): 1310–16. http://dx.doi.org/10.22214/ijraset.2022.44109.

Abstract:

In the realm of speech enhancement, noise suppression is a crucial problem. It is especially important in work-from-home situations, where noise reduction may improve communication quality and reduce the cognitive effort of video conferencing. With the advent of deep neural networks, several novel audio processing methods based on deep models have been presented. The goal of the project is to use a stacked Dual-signal Transformation LSTM Network (DTLN) to combine both analysis and synthesis into one model. The proposed model consists of two separation cores, the first of which employs a Short-Time Fourier Transform (STFT) signal transformation and the second of which employs a learnt signal representation. This arrangement was designed to enable the second core to further improve the signal with phase information while the first core creates a strong magnitude estimation. Because traditional and learnt feature transformations are complementary, this combination may give good results while preserving a minimal computing footprint. In terms of computational complexity, the stacked network is far smaller than most previously suggested LSTM networks and assures real-time capabilities.
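As a rough illustration of the magnitude/phase split that the abstract above attributes to the STFT-based first core, the sketch below takes a naive DFT of a single windowed frame in pure Python and rebuilds the frame from its magnitude and phase. It is not the DTLN model: the frame length, the Hann window, the test signal, and all function names are assumptions made for this example only.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def magnitude_phase(spectrum):
    """Split a complex spectrum into magnitude and phase lists."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def from_magnitude_phase(mag, phase):
    """Recombine magnitude and phase into a complex spectrum."""
    return [cmath.rect(m, p) for m, p in zip(mag, phase)]


# Hypothetical test frame: 32 samples holding 4 cycles of a sinusoid.
N = 32
frame = [w * math.sin(2 * math.pi * 4 * t / N) for t, w in zip(range(N), hann(N))]

mag, phase = magnitude_phase(dft(frame))
rebuilt = idft(from_magnitude_phase(mag, phase))

max_error = max(abs(a - b) for a, b in zip(frame, rebuilt))
peak_bin = max(range(N // 2 + 1), key=lambda k: mag[k])
print(peak_bin)          # 4
print(max_error < 1e-9)  # True
```

A frame is recovered exactly only when both magnitude and phase are kept, which is one way to read the abstract's point that a magnitude estimate alone leaves room for a second stage that also accounts for phase.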

19

Mu, Jin. "Pose Estimation-Assisted Dance Tracking System Based on Convolutional Neural Network." Computational Intelligence and Neuroscience 2022 (June 3, 2022): 1–10. http://dx.doi.org/10.1155/2022/2301395.

Abstract:

In the field of music-driven, computer-assisted dance movement generation, traditional music movement adaptations and statistical mapping models have the following problems: Firstly, the dance sequences generated by the model are not powerful enough to fit the music itself. Secondly, the integrity of the dance movements produced is not sufficient. Thirdly, it is necessary to improve the suppleness and rationality of long-term dance sequences. Fourthly, traditional models cannot produce new dance movements. How to create smooth and complete dance gesture sequences after music is a problem that needs to be investigated in this paper. To address these problems, we design a deep learning dance generation algorithm to extract the association between sound and movement characteristics. During the feature extraction phase, rhythmic features extracted from music and audio beat features are used as musical features, and coordinates of the main points of human bones extracted from dance videos are used for training as movement characteristics. During the model building phase, the model’s generator module is used to achieve a basic mapping of music and dance movements and to generate gentle dance gestures. The identification module is used to achieve consistency between dance and music. The self-encoder module is used to make the audio function more representative. Experimental results on the DeepFashion dataset show that the generated model can synthesize the new view of the target person in any human posture of a given posture, complete the transformation of different postures of the same person, and retain the external features and clothing textures of the target person. Using a whole-to-detail generation strategy can improve the final video composition. 
For the problem of incoherent character movements in video synthesis, we propose to optimize the character movements by using a generative adversarial network, specifically by inserting generated motion compensation frames into the incoherent movement sequences to improve the smoothness of the synthesized video.

20

Shejole, Prof Sakshi, Piyush Jaiswal, Neha Karmal, Vivek Patil, and Samnan Shaikh. "Autotuned Voice Cloning Enabling Multilingualism." International Journal for Research in Applied Science and Engineering Technology 11, no. 5 (May 31, 2023): 5945–49. http://dx.doi.org/10.22214/ijraset.2023.52906.

Abstract:

This article describes a neural network-based text-to-speech (TTS) synthesis system that can generate spoken audio in a variety of speaker voices. We show that the proposed model can convert natural-language text-to-speech into a target language, and synthesize and translate natural text-to-speech. We quantify the importance of trained voice modules to obtain the best generalization performance. Finally, using randomly selected speaker embeddings, we show that speech can be synthesized with new speaker voices used in training and that the model learned high-quality speaker representations. We have also introduced a multilingual system and auto-tuner that allows you to translate regular text into another language, which makes multilingualization possible for various applications.

21

Rodríguez Fernández-Peña, Alfonso Carlos. "AI is great, isn’t it? Tone direction and illocutionary force delivery of tag questions in Amazon’s AI NTTS Polly." Journal of Experimental Phonetics 32 (November 28, 2023): 227–42. http://dx.doi.org/10.1344/efe-2023-32-227-242.

Abstract:

This work provides a descriptive analysis of the tone direction and its inherent illocutionary force in question tags delivered by Amazon’s neural text-to-speech system Polly. We included three types of tag questions (reverse-polarity tags — both positive and negative —, copy tags and command tags) for which 10 sentences were used as input in each case. The data included 600 utterances produced by British and American English voices currently available on Amazon’s NTTS. The audio files were examined with the speech analysis software Praat to identify the tone pattern for each utterance and confirm the intended illocutionary force. The results show that Amazon’s AI speech synthesis technology is not yet fully reliable and produces a high rate of utterances whose pragmatic load is undesired when using natural spontaneous speech traits as question tags.

22

Modi, Rohan. "Transcript Anatomization with Multi-Linguistic and Speech Synthesis Features." International Journal for Research in Applied Science and Engineering Technology 9, no. VI (June 20, 2021): 1755–58. http://dx.doi.org/10.22214/ijraset.2021.35371.

Abstract:

Handwriting Detection is the process by which, or the ability of, a computer program to collect and analyze comprehensible handwritten input from various types of media such as photographs, newspapers, and paper reports. Handwritten Text Recognition is a sub-discipline of Pattern Recognition, which refers to the classification of datasets or objects into various categories or classes. Handwriting Recognition is the process of transforming handwritten text in a specific language into its digitally expressible script, represented by a set of icons known as letters or characters. Speech synthesis is the artificial production of human speech using Machine Learning based software and audio-output computer hardware. While there are many systems that convert normal language text into speech, the aim of this paper is to study Optical Character Recognition with speech synthesis technology and to develop a cost-effective, user-friendly, image-based offline text-to-speech conversion system using a CRNN neural network model and a Hidden Markov Model. The automated interpretation of handwritten text can be very useful wherever large amounts of handwritten data must be processed, such as signature verification, analysis of various types of documents, and recognition of handwritten amounts on bank cheques.

23

Kazakova, M. A., and A. P. Sultanova. "Analysis of natural language processing technology: modern problems and approaches." Advanced Engineering Research 22, no. 2 (July 11, 2022): 169–76. http://dx.doi.org/10.23947/2687-1653-2022-22-2-169-176.

Abstract:

Introduction. The article presents an overview of modern neural network models for natural language processing. Research into natural language processing is of interest as the need to process large amounts of audio and text information accumulated in recent decades has increased. The most discussed in foreign literature are the features of the processing of spoken language. The aim of the work is to present modern models of neural networks in the field of oral speech processing. Materials and Methods. Applied research on understanding spoken language is an important and far-reaching topic in the natural language processing. Listening comprehension is central to practice and presents a challenge. This study meets a method of hearing detection based on deep learning. The article briefly outlines the substantive aspects of various neural networks for speech recognition, using the main terms associated with this theory. A brief description of the main points of the transformation of neural networks into a natural language is given. Results. A retrospective analysis of foreign and domestic literary sources was carried out alongside with a description of new methods for oral speech processing, in which neural networks were used. Information about neural networks, methods of speech recognition and synthesis is provided. The work includes the results of diverse experimental works of recent years. The article elucidates the main approaches to natural language processing and their changes over time, as well as the emergence of new technologies. The major problems currently existing in this area are considered. Discussion and Conclusions. The analysis of the main aspects of speech recognition systems has shown that there is currently no universal system that would be self-learning, noise-resistant, recognizing continuous speech, capable of working with large dictionaries and at the same time having a low error rate.

24

Mandeel, Ali Raheem, Mohammed Salah Al-Radhi, and Tamás Gábor Csapó. "Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2." Infocommunications journal 14, no. 3 (2022): 55–62. http://dx.doi.org/10.36244/icj.2022.3.7.

Abstract:

Speech synthesis has the aim of generating humanlike speech from text. Nowadays, with end-to-end systems, highly natural synthesized speech can be achieved if a large enough dataset is available from the target speaker. However, often it would be necessary to adapt to a target speaker for whom only a few training samples are available. Limited data speaker adaptation might be a difficult problem due to the overly few training samples. Issues might appear with a limited speaker dataset, such as the irregular allocation of linguistic tokens (i.e., some speech sounds are left out from the synthesized speech). To build lightweight systems, measuring the number of minimum data samples and training epochs is crucial to acquire a reasonable quality. We conducted detailed experiments with four target speakers for adaptive speaker text-to-speech (TTS) synthesis to show the performance of the end-to-end Tacotron2 model and the WaveGlow neural vocoder with an English dataset at several training data samples and training lengths. According to our investigation of objective and subjective evaluations, the Tacotron2 model exhibits good performance in terms of speech quality and similarity for unseen target speakers at 100 sentences of data (pair of text and audio) with a relatively low training time.

25

Thoidis, Iordanis, Lazaros Vrysis, Dimitrios Markou, and George Papanikolaou. "Temporal Auditory Coding Features for Causal Speech Enhancement." Electronics 9, no. 10 (October 16, 2020): 1698. http://dx.doi.org/10.3390/electronics9101698.

Abstract:

Perceptually motivated audio signal processing and feature extraction have played a key role in the determination of high-level semantic processes and the development of emerging systems and applications, such as mobile phone telecommunication and hearing aids. In the era of deep learning, speech enhancement methods based on neural networks have seen great success, mainly operating on the log-power spectra. Although these approaches surpass the need for exhaustive feature extraction and selection, it is still unclear whether they target the important sound characteristics related to speech perception. In this study, we propose a novel set of auditory-motivated features for single-channel speech enhancement by fusing temporal envelope and temporal fine structure information in the context of vocoder-like processing. A causal gated recurrent unit (GRU) neural network is employed to recover the low-frequency amplitude modulations of speech. Experimental results indicate that the exploited system achieves considerable gains for normal-hearing and hearing-impaired listeners, in terms of objective intelligibility and quality metrics. The proposed auditory-motivated feature set achieved better objective intelligibility results compared to the conventional log-magnitude spectrogram features, while mixed results were observed for simulated listeners with hearing loss. Finally, we demonstrate that the proposed analysis/synthesis framework provides satisfactory reconstruction accuracy of speech signals.

26

Vishwakama, Ramesh, Ramashish Yadav, Harsheet Sharma, and Dr Saurabh Suman. "Automated Leaf Disease Detection System with Machine Learning." International Journal for Research in Applied Science and Engineering Technology 12, no. 2 (February 29, 2024): 814–19. http://dx.doi.org/10.22214/ijraset.2024.58449.

Abstract:

This study provides a detailed review of the application of deep learning techniques in plant protection, with a particular emphasis on the detection of crop leaf diseases. Deep learning has received a lot of attention for its success in feature extraction and machine learning, and it has emerged as a major technique in a variety of disciplines such as image and video processing, audio processing, and natural language processing. When applied to the field of plant disease detection, deep learning allows for more objective and efficient extraction of disease traits, boosting research efficiency and technical improvements. Our study seeks to present a synthesis of recent advances in deep learning applied to agricultural leaf disease detection, highlighting current trends and addressing issues in the field. The paper is an invaluable resource for scholars working on plant pest identification. Specifically, our approach uses the Convolutional Neural Network (CNN) algorithm, attaining an outstanding accuracy rate of 97% in disease identification.

27

Kane, Joseph, Michael N. Johnstone, and Patryk Szewczyk. "Voice Synthesis Improvement by Machine Learning of Natural Prosody." Sensors 24, no. 5 (March 1, 2024): 1624. http://dx.doi.org/10.3390/s24051624.

Abstract:

Since the advent of modern computing, researchers have striven to make the human–computer interface (HCI) as seamless as possible. Progress has been made on various fronts, e.g., the desktop metaphor (interface design) and natural language processing (input). One area receiving attention recently is voice activation and its corollary, computer-generated speech. Despite decades of research and development, most computer-generated voices remain easily identifiable as non-human. Prosody in speech has two primary components—intonation and rhythm—both often lacking in computer-generated voices. This research aims to enhance computer-generated text-to-speech algorithms by incorporating melodic and prosodic elements of human speech. This study explores a novel approach to add prosody by using machine learning, specifically an LSTM neural network, to add paralinguistic elements to a recorded or generated voice. The aim is to increase the realism of computer-generated text-to-speech algorithms, to enhance electronic reading applications, and to improve artificial voices for those in need of artificial assistance to speak. A computer that is also able to convey meaning with a spoken audible announcement will improve human-to-computer interactions. Applications for the use of such an algorithm may include improving high-definition audio codecs for telephony, renewing old recordings, and lowering barriers to the utilization of computing. This research deployed a prototype modular platform for digital speech improvement by analyzing and generalizing algorithms into a modular system through laboratory experiments to optimize combinations and performance in edge cases. The results were encouraging, with the LSTM-based encoder able to produce realistic speech. Further work will involve optimizing the algorithm and comparing its performance against other approaches.

28

Ravikiran K, Neerav Nishant, M Sreedhar, N.Kavitha, Mathur N. Kathiravan, and Geetha A. "Deep learning methods and integrated digital image processing techniques for detecting and evaluating wheat stripe rust disease." Scientific Temper 14, no. 03 (September 30, 2023): 864–69. http://dx.doi.org/10.58414/scientifictemper.2023.14.3.47.

Abstract:

In recent years, signal processing and deep learning convergence has sparked transformative synergies across various domains, including image and speech recognition, natural language processing, autonomous systems, and healthcare diagnostics. This fusion capitalizes on the strength of signal processing in extracting meaningful features from raw data and the prowess of deep learning in unraveling intricate patterns, driving innovation and research into uncharted territories. This paper explores literature spanning the past three years to illuminate the dynamic landscape of scholarly endeavors that leverage the integration of signal processing techniques within deep learning architectures. The resulting paradigm shift magnifies the precision and efficiency of applications in computer vision, speech and audio processing, natural language comprehension, and interdisciplinary domains like healthcare. Notable advances include synergizing wavelet transformations with convolutional neural networks (CNNs) for enhanced image classification accuracy, integrating spectrogram-based features with deep learning architectures for improved speech-to-text accuracy, and pioneering the fusion of wavelet packet decomposition into recurrent architectures for sentiment analysis. Moreover, the paper delves into developing and evaluating a U-Net neural network model for image segmentation, investigating its performance under varying training conditions using metrics such as confusion matrices, heat maps, and precision-recall curves. The comprehensive survey identifies research gaps, notably within the context of wheat rust detection, and emphasizes the need for tailored innovations to enhance accuracy and efficiency. Overall, the synthesis of signal processing techniques with deep learning architectures propels innovation, poised to address complex challenges across diverse domains

29

Gromov, N. V., and T. A. Levanova. "WaveNet vocoder for prediction of time series with extreme events." Genes & Cells 18, no. 4 (December 15, 2023): 847–49. http://dx.doi.org/10.17816/gc623433.

Abstract:

Extreme events are typically defined as rare or unpredictable events that deviate significantly from typical behavior. Despite this, objective criteria for extreme events have yet to be established. Rareness may be characterized by certain scales or spatial and temporal boundaries, while intensity is an indication of an event’s potential to cause a significant change. One of the most prominent occurrences of extreme events in both neuroscience and medicine is in the case of epileptic seizures [1]. In speech synthesis, vocoder networks like WaveNet [2] generate audio. The model is a multi-layer convolutional neural network that functions as a causal filter and doesn’t predict the future. Due to this quality, the vocoder may have potential in time series prediction. Audio time series can be regarded as a dynamic system characterized by unpredictable switching regimes. For instance, transitioning from one letter to another can result in significant deviations in amplitude, similar to extreme events. This network receives r previous input counts known as a receptive field, and uses them to predict the next sample. The network is tree-like in structure, with exponentially increasing distances between subsequent layers of inputs. This is a necessary feature since the receptive field r is usually quite large, on the order of one or two thousand. Without this exponential increase in distance, the number of layers would depend linearly on r. Recurrent neural networks pose a challenge in optimizing the loss function when predicting time series sequences, as they tend to predict samples very similar to the previous one, causing the network to converge towards the mode. However, in a convolutional network, the output to the model will be longer due to the large receptive field. In the case of sound analysis, for instance, multiple oscillations occur within a given timeframe and the network does not elevate any specific sample. 
The study used artificial data generated from two coupled Hindmarsh–Rose neurons with chemical synaptic couplings. The observed variable was determined by the biological significance of the system, specifically the total membrane potential. The results exhibited extreme events across various coupling parameter values. Based on prior research [3], a numerical standard was selected for the events. The WaveNet vocoder model exhibits a 91% accuracy rate and 82% recall rate when forecasting extreme events of the same width as the prediction. It is noteworthy that recall is crucial in the forecast of extreme events since it identifies instances where the model predicted falsely that an extreme event would not occur.
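The abstract's remark that exponentially spaced layers keep the depth from growing linearly with the receptive field r can be checked with a little arithmetic. The sketch below assumes stacked causal convolutions with kernel size 2 and dilations 1, 2, 4, and so on, which is the usual WaveNet-style layout; the helper names are hypothetical, and the actual layer configuration used in the study is not given in the abstract.

```python
import math


def receptive_field(dilations, kernel_size=2):
    """Input samples seen by one output of stacked causal dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def layers_needed_dilated(target, kernel_size=2):
    """Fewest layers with dilations 1, 2, 4, ... whose receptive field reaches target."""
    layers = 0
    while receptive_field([2 ** i for i in range(layers)], kernel_size) < target:
        layers += 1
    return layers


def layers_needed_undilated(target, kernel_size=2):
    """Layers needed when every layer uses dilation 1 (depth grows linearly)."""
    return math.ceil((target - 1) / (kernel_size - 1))


doubling = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
print(receptive_field(doubling))         # 1024
print(layers_needed_dilated(1024))       # 10
print(layers_needed_undilated(1024))     # 1023
```

Under these assumptions, a receptive field of about one thousand samples needs roughly ten dilated layers, compared with about a thousand undilated ones.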

30

Hakim, Heba, and Ali Marhoon. "Indoor Low Cost Assistive Device using 2D SLAM Based on LiDAR for Visually Impaired People." Iraqi Journal for Electrical and Electronic Engineering 15, no. 2 (December 1, 2019): 115–21. http://dx.doi.org/10.37917/ijeee.15.2.12.

Abstract:

Many assistive devices have been developed in recent years for visually impaired (VI) persons to solve the problems they face in daily movement. Most research tries to solve the obstacle avoidance or navigation problem, while other work focuses on helping the VI person recognize objects in the surrounding environment. However, few systems integrate both navigation and recognition capabilities. To meet these needs, this paper presents an assistive device that provides both capabilities, helping the VI person to (1) navigate safely from his/her current location (pose) to a desired destination in an unknown environment, and (2) recognize surrounding objects. The proposed system consists of low-cost sensors, namely a Neato XV-11 LiDAR, an ultrasonic sensor, and a Raspberry Pi camera (CameraPi), which are mounted on a white cane. Hector SLAM based on 2D LiDAR is used to construct a 2D map of the unfamiliar environment, while the A* path planning algorithm generates an optimal path on the resulting 2D Hector map. Moreover, temporary obstacles in front of the VI person are detected by the ultrasonic sensor. A recognition system based on Convolutional Neural Networks (CNNs) is implemented to predict the object class and to enhance the navigation system. Interaction between the VI person and the assistive system is handled by an audio module (speech recognition and speech synthesis). The performance of the proposed system has been evaluated in various real-time experiments conducted in indoor scenarios, showing the efficiency of the proposed system.
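To make the path-planning step concrete, here is a minimal sketch of A* on a small occupancy grid. It is not the authors' implementation: the hand-written grid stands in for a SLAM-generated map, and the 4-connected moves, unit step cost, Manhattan heuristic, and function name are all assumptions chosen for brevity.

```python
import heapq


def a_star(grid, start, goal):
    """Return a shortest 4-connected path from start to goal, or None.

    grid is a list of equal-length strings; '#' marks an occupied cell
    and any other character a free one. Cells are (row, col) tuples.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates the cost of 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry; a cheaper route was already expanded
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    priority = new_cost + heuristic((nr, nc))
                    heapq.heappush(open_heap, (priority, new_cost, (nr, nc)))
    return None


# Hypothetical 3x4 map with a two-cell obstacle in the middle row.
grid = [
    "....",
    ".##.",
    "....",
]
path = a_star(grid, (0, 0), (2, 3))
print(len(path))  # 6

# A wall that fully separates start from goal yields no path.
blocked = a_star([".#.", ".#.", ".#."], (0, 0), (0, 2))
print(blocked)    # None
```

In a real system the grid would come from the SLAM map and the cost and connectivity would likely differ (for example, 8-connected moves or inflated obstacles), but the open-list and heuristic structure stays the same.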

31

Bai, Jinqiang, Zhaoxiang Liu, Yimin Lin, Ye Li, Shiguo Lian, and Dijun Liu. "Wearable Travel Aid for Environment Perception and Navigation of Visually Impaired People." Electronics 8, no. 6 (June 20, 2019): 697. http://dx.doi.org/10.3390/electronics8060697.

Abstract:

Assistive devices for visually impaired people (VIP) which support daily traveling and improve social inclusion are developing fast. Most of them try to solve the problem of navigation or obstacle avoidance, and other works focus on helping VIP to recognize their surrounding objects. However, very few of them couple both capabilities (i.e., navigation and recognition). Aiming at the above needs, this paper presents a wearable assistive device that allows VIP to (i) navigate safely and quickly in unfamiliar environment, and (ii) to recognize the objects in both indoor and outdoor environments. The device consists of a consumer Red, Green, Blue and Depth (RGB-D) camera and an Inertial Measurement Unit (IMU), which are mounted on a pair of eyeglasses, and a smartphone. The device leverages the ground height continuity among adjacent image frames to segment the ground accurately and rapidly, and then search the moving direction according to the ground. A lightweight Convolutional Neural Network (CNN)-based object recognition system is developed and deployed on the smartphone to increase the perception ability of VIP and promote the navigation system. It can provide the semantic information of surroundings, such as the categories, locations, and orientations of objects. Human–machine interaction is performed through audio module (a beeping sound for obstacle alert, speech recognition for understanding the user commands, and speech synthesis for expressing semantic information of surroundings). We evaluated the performance of the proposed system through many experiments conducted in both indoor and outdoor scenarios, demonstrating the efficiency and safety of the proposed assistive system.

32

Nicol, Rozenn, and Jean-Yves Monfort. "Acoustic research for telecoms: bridging the heritage to the future." Acta Acustica 7 (2023): 64. http://dx.doi.org/10.1051/aacus/2023056.

Abstract:

In its early age, telecommunication was focused on voice communications, and acoustics was at the heart of the work related to speech coding and transmission, automatic speech recognition or speech synthesis, aiming at offering better quality (Quality of Experience or QoE) and enhanced services to users. As technology has evolved, the research themes have diversified, but acoustics remains essential. This paper gives an overview of the evolution of acoustic research for telecommunication. Communication was initially (and for a long time) only audio with a monophonic narrow-band sound (i.e. [300–3400 Hz]). After the bandwidth extension (from the wide-band [100–7000 Hz] to the full-band [20 Hz–20 kHz] range), a new break was the introduction of 3D sound, either to provide telepresence in audioconferencing or videoconferencing, or to enhance the QoE of contents such as radio, television, VOD, or video games. Loudspeaker or microphone arrays have been deployed to implement “Holophonic” or “Ambisonic” systems. The interaction between spatialized sounds and 3D images was also investigated. At the end of the 2000s, smartphones invaded our lives. Binaural sound was immediately acknowledged as the most suitable technology for reproducing 3D audio on smartphones. However, to achieve a satisfactory QoE, binaural filters need to be customized in relation with the listener’s morphology. This question is the main obstacle to a mass-market distribution of binaural sound, and its solving has prompted a large amount of work. In parallel with the development of technologies, their perceptual evaluation was an equally important area of research. In addition to conventional methods, innovative approaches have been explored for the assessment of sound spatialization, such as physiological measurement, neuroscience tools or Virtual Reality (VR). The latest development is the use of acoustics as a universal sensor for the Internet of Things (IoT) and connected environments. 
Microphones can be deployed, preferably with parsimony, in order to monitor surrounding sounds, with the goal of detecting information or events thanks to models of automatic sound recognition based on neural networks. Applications range from security and personal assistance to acoustic measurement of biodiversity. As for the control of environments or objects, voice commands have become widespread in recent years thanks to the tremendous progress made in speech recognition, but an even more intuitive mode based on direct control by the mind is proposed by Brain Computer Interfaces (BCIs), which rely on sensory stimulation using different modalities, among which the auditory one offers some advantages.

33

Yu, Junxiao, Zhengyuan Xu, Xu He, Jian Wang, Bin Liu, Rui Feng, Songsheng Zhu, Wei Wang, and Jianqing Li. "DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer." Entropy 25, no. 1 (December 26, 2022): 41. http://dx.doi.org/10.3390/e25010041.

Abstract:

Text-to-speech (TTS) synthesizers have been widely used as a vital assistive tool in various fields. Traditional sequence-to-sequence (seq2seq) TTS such as Tacotron2 uses a single soft attention mechanism for encoder and decoder alignment tasks, which is the biggest shortcoming that incorrectly or repeatedly generates words when dealing with long sentences. It may also generate sentences with run-on and wrong breaks regardless of punctuation marks, which causes the synthesized waveform to lack emotion and sound unnatural. In this paper, we propose an end-to-end neural generative TTS model that is based on the deep-inherited attention (DIA) mechanism along with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA by sharing the same training parameter, which tightens the token–frame correlation, as well as fastens the alignment process. In addition, LSF is adopted to enhance the context connection by expanding the DIA concentration region. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation. Hidden-state information driven from the multi-RNN layers is utilized for attention alignment. The collaborative work of the DIA and multi-RNN layers contributes to outperformance in the high-quality prediction of the phrase breaks of the synthesized speech. We used WaveGlow as a vocoder for real-time, human-like audio synthesis. Human subjective experiments show that the DIA-TTS achieved a mean opinion score (MOS) of 4.48 in terms of naturalness. Ablation studies further prove the superiority of the DIA mechanism for the enhancement of phrase breaks and attention robustness.

34

Wang, Tianmeng. "Research and Application Analysis of Correlative Optimization Algorithms for GAN." Highlights in Science, Engineering and Technology 57 (July 11, 2023): 141–47. http://dx.doi.org/10.54097/hset.v57i.9992.

Abstract:

Generative Adversarial Networks (GANs) have been one of the most successful deep learning architectures in recent years, providing a powerful way to model high-dimensional data such as images, audio, and text data. GANs use two neural networks, generator and discriminator, to generate samples that resemble real data. The generator tries to create realistic looking samples while the discriminator tries to differentiate the generated samples from real ones. Through this adversarial training process, the generator learns to produce high-quality samples indistinguishable from the real ones. Different optimization algorithms have been utilized in GAN research, including different types of loss functions and regularization techniques, to improve the performance of GANs. Some of the most significant recent developments in GANs include M-DCGAN, which stands for multi-scale deep convolutional generative adversarial network, designed for image dataset augmentation; StackGAN, which is a text-to-image generation technique designed to produce high-resolution images with fine details and BigGAN, a scaled-up version of GAN that has shown improved performance in generating high-fidelity images. Moreover, the potential applications of GANs are vast and cross-disciplinary. They have been applied in various fields such as image and video synthesis, data augmentation, image translation, and style transfer. GANs also show promise in extending their use to healthcare, finance, and creative art fields. Despite their significant advancements and promising applications, GANs face several challenges such as mode collapse, vanishing gradients, and instability, which need to be addressed to achieve better performance and broader applicability. In conclusion, this review gives insights into the current state-of-the-art in GAN research, discussing its core ideas, structure, optimization techniques, applications, and challenges faced. 
This knowledge aims to help researchers and practitioners alike to understand the current GAN models' strengths and weaknesses and guide future GAN developments. As GANs continue to evolve, they have the potential to transform the way we understand and generate complex datasets across various fields.
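
As a rough illustration of the adversarial objective this abstract describes, the sketch below computes the commonly used binary cross-entropy discriminator loss and the non-saturating generator loss from discriminator output probabilities. The function names and the toy probabilities are assumptions made for illustration; the reviewed article does not specify a particular loss formulation.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: real samples should score near 1, generated ones near 0.

    d_real, d_fake: discriminator output probabilities in (0, 1).
    """
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: the generator wants its samples scored as real."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs, for illustration only.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
undecided = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

When the discriminator cannot tell real from generated samples (all outputs 0.5), its loss equals 2 ln 2, which is the value associated with the equilibrium of this objective.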

35

Dorofeeva, S. V. "Neuroplasicity and the developmental dyslexia intervention." Genes & Cells 18, no. 4 (December 15, 2023): 706–9. http://dx.doi.org/10.17816/gc623418.

Abstract:

A growing body of literature suggests that timing plays a critical role in neuroplasticity processes and the molecular mechanisms necessary for learning and memory [1, 2]. Of particular significance to remedial education is identifying the time parameters for primary stimulation that are necessary and sufficient, the time frame relevant for transitioning to long-term memory, and the appropriate periods for restimulation. The translation of short-term stimulation into long-term memory is regulated by diverse processes that are mechanistically distinct and activated by synaptic activity [1], while also relying on protein and glycoprotein synthesis [3] and myelination processes. The current research explores the potential benefits of incorporating neuroscience research on the timing of neuroplasticity mechanisms in designing intervention programs for individuals with developmental dyslexia, specifically focusing on enhancing cognitive functions and skills crucial for reading. Based on available evidence, we have determined optimal training and break time periods for a 10-year-old child with developmental dyslexia resulting from multiple deficits. During the 21-day intervention program, 12 training sessions were conducted each day, commencing at 8 or 9 am and held hourly thereafter. Each session comprised a brief, targeted training exercise ranging from 3 to 7 minutes, depending on the child’s aptitude, followed by a playing session or computer game lasting up to 15 minutes. A 40-minute break followed each session. Brief training sessions were required due to the swift exhaustion of the subject child. The sessions were selected based on the evidence that brief stimulation can still result in a high level of CREB phosphorylation, even if it lasts only a few minutes [4]. 
Playing sessions were necessary for supporting the child’s motivation throughout the lengthy and intensive intervention program, and the activities performed during these sessions were pertinent to developing specific skills. The duration of the breaks was determined by evidence indicating the time required for primary memory consolidation processes and protein synthesis necessary for long-lasting synaptic plasticity. We have made an effort to eliminate any potential sources of emotional engagement or significant new information during breaks, allowing the initial stage of memory consolidation to occur without any unnecessary disruption. During the training sessions, various tasks were used to target specific types of processing such as phonological, visual, speech, and multimodal processing (e.g., visual-motor, audio-visual, or reading). Each session exclusively focused on one type of exercise. In our prior study [5], we discussed the linguistic aspects of the program and the exercises employed. Significant progress was achieved as a result of the 21-day intervention, surpassing the progress achieved in three years of schooling and during traditional remediation programs with speech therapists, which were held 1–3 times per week for 40–120 minutes per session. Following the intensive intervention, supportive training was continued for one year while considering the crucial timing for neuroplasticity. Afterward, the child reached a normative level of reading, and the effect was maintained throughout their entire period of school education. To our knowledge, this is the first intensive intervention program for dyslexia that is based on the timing of neuroplasticity processes. Intensive remediation programs, based on relevant findings regarding the mechanisms of memory consolidation, may enhance neural memory trace reinforcement. 
However, further research is necessary to optimize the timing and length of sessions and identify the most effective combination of linguistic and neurophysiological aspects for intervention.

36

Hood, Graeme, Kieran Hand, Emma Cramp, Philip Howard, Susan Hopkins, and Diane Ashiru-Oredope. "Measuring Appropriate Antibiotic Prescribing in Acute Hospitals: Development of a National Audit Tool Through a Delphi Consensus." Antibiotics 8, no. 2 (April 29, 2019): 49. http://dx.doi.org/10.3390/antibiotics8020049.

Abstract:

This study developed a patient-level audit tool to assess the appropriateness of antibiotic prescribing in acute National Health Service (NHS) hospitals in the UK. A modified Delphi process was used to evaluate variables identified from published literature that could be used to support an assessment of appropriateness of antibiotic use. At a national workshop, 22 infection experts reached a consensus to define appropriate prescribing and agree upon an initial draft audit tool. Following this, a national multidisciplinary panel of 19 infection experts, of whom only one was part of the workshop, was convened to evaluate and validate variables using questionnaires to confirm the relevance of each variable in assessing appropriate prescribing. The initial evidence synthesis of published literature identified 25 variables that could be used to support an assessment of appropriateness of antibiotic use. All the panel members reviewed the variables for the first round of the Delphi; the panel accepted 23 out of 25 variables. Following review by the project team, one of the two rejected variables was rephrased, and the second neutral variable was re-scored. The panel accepted both these variables in round two with a 68% response rate. Accepted variables were used to develop an audit tool to determine the extent of appropriateness of antibiotic prescribing at the individual patient level in acute NHS hospitals through infection expert consensus based on the results of a Delphi process.

37

Li, Wanting, Yiting Chen, and Buzhou Tang. "Improving Generative Adversarial Network based Vocoding Through Multi-Scale Convolution." ACM Transactions on Asian and Low-Resource Language Information Processing, August 16, 2023. http://dx.doi.org/10.1145/3610532.

Abstract:

Vocoding is a sub-process of the text-to-speech task, which aims at generating audio from intermediate representations between text and audio. Several recent works have shown that generative adversarial network (GAN) based vocoders can generate high-quality audio. While GAN-based neural vocoders have shown higher generation speed than autoregressive vocoders, their audio fidelity still cannot compete with ground truth samples. One major cause of the degradation in audio quality and of spectrogram blurring is the average pooling layers in the discriminator. As the multi-scale discriminator (MSD) commonly used by recent GAN-based vocoders applies several average pooling layers to capture different frequency bands, we believe it is crucial to prevent high-frequency information from being lost in the average pooling process. This paper proposes MSCGAN, which solves the above-mentioned problem and achieves higher-fidelity speech synthesis. We demonstrate that substituting the average pooling process with a multi-scale convolution architecture effectively retains high-frequency features and thus forces the generator to recover audio details in the time and frequency domains. Compared with other state-of-the-art GAN-based vocoders, MSCGAN can produce competitive audio with higher spectrogram clarity and a higher MOS score in subjective human evaluation.
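
The claim that average pooling discards high-frequency information can be illustrated with a toy signal. The sketch below is not the MSCGAN architecture: the window size of 4, the stride of 2, and the hand-picked alternating kernel are assumptions chosen only to show that a strided convolution with suitable weights can keep a component that average pooling cancels.

```python
def strided_conv1d(signal, kernel, stride):
    """Valid (no padding) 1-D cross-correlation with the given stride."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(0, len(signal) - k + 1, stride)
    ]

def avg_pool1d(signal, size, stride):
    """Average pooling is a strided convolution with uniform weights 1/size."""
    return strided_conv1d(signal, [1.0 / size] * size, stride)

# A tone at the Nyquist frequency: samples alternate between +1 and -1.
nyquist_tone = [(-1.0) ** n for n in range(16)]

# Average pooling sums equal numbers of +1 and -1, so the tone vanishes.
pooled = avg_pool1d(nyquist_tone, size=4, stride=2)

# A kernel with alternating signs responds strongly to the same tone.
filtered = strided_conv1d(nyquist_tone, [0.25, -0.25, 0.25, -0.25], stride=2)
```

Here every entry of `pooled` is zero while every entry of `filtered` is one. In a trained discriminator the kernel weights would be learned rather than fixed by hand.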

38

Lluís, Francesc, Vasileios Chatziioannou, and Alex Hofmann. "Points2Sound: from mono to binaural audio using 3D point cloud scenes." EURASIP Journal on Audio, Speech, and Music Processing 2022, no. 1 (December 29, 2022). http://dx.doi.org/10.1186/s13636-022-00265-4.

Abstract:

For immersive applications, the generation of binaural sound that matches its visual counterpart is crucial to bring meaningful experiences to people in a virtual environment. Recent studies have shown the possibility of using neural networks for synthesizing binaural audio from mono audio by using 2D visual information as guidance. Extending this approach by guiding the audio with 3D visual information and operating in the waveform domain may allow for a more accurate auralization of a virtual audio scene. We propose Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network and an audio network. The vision network uses 3D sparse convolutions to extract a visual feature from the point cloud scene. Then, the visual feature conditions the audio network, which operates in the waveform domain, to synthesize the binaural version. Results show that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis. We also investigate how 3D point cloud attributes, learning objectives, different reverberant conditions, and several types of mono mixture signals affect the binaural audio synthesis performance of Points2Sound for the different numbers of sound sources present in the scene.

39

Khanjani, Zahra, Gabrielle Watson, and Vandana P. Janeja. "Audio deepfakes: A survey." Frontiers in Big Data 5 (January 9, 2023). http://dx.doi.org/10.3389/fdata.2022.1001063.

Abstract:

A deepfake is content or material that is synthetically generated or manipulated using artificial intelligence (AI) methods, to be passed off as real and can include audio, video, image, and text synthesis. The key difference between manual editing and deepfakes is that deepfakes are AI generated or AI manipulated and closely resemble authentic artifacts. In some cases, deepfakes can be fabricated using AI-generated content in its entirety. Deepfakes have started to have a major impact on society with more generation mechanisms emerging every day. This article contributes to understanding the landscape of deepfakes, and their detection and generation methods. We evaluate various categories of deepfakes, especially in audio. The purpose of this survey is to provide readers with a deeper understanding of (1) different deepfake categories; (2) how they could be created and detected; (3) more specifically, how audio deepfakes are created and detected in more detail, which is the main focus of this paper. We found that generative adversarial networks (GANs), convolutional neural networks (CNNs), and deep neural networks (DNNs) are common ways of creating and detecting deepfakes. In our evaluation of over 150 methods, we found that the majority of the focus is on video deepfakes, and, in particular, the generation of video deepfakes. We found that for text deepfakes, there are more generation methods but very few robust methods for detection, including fake news detection, which has become a controversial area of research because of the potential heavy overlaps with human generation of fake content. Our study reveals a clear need to research audio deepfakes and particularly detection of audio deepfakes. This survey has been conducted with a different perspective, compared to existing survey papers that mostly focus on just video and image deepfakes. This survey mainly focuses on audio deepfakes that are overlooked in most of the existing surveys. 
This article's most important contribution is to critically analyze and provide a unique source of audio deepfake research, mostly ranging from 2016 to 2021. To the best of our knowledge, this is the first survey focusing on audio deepfakes generation and detection in English.

40

Dyer, Mark. "Neural Synthesis as a Methodology for Art-Anthropology in Contemporary Music." Organised Sound, September 16, 2022, 1–8. http://dx.doi.org/10.1017/s1355771822000371.

Abstract:

This article investigates the use of machine learning within contemporary experimental music as a methodology for anthropology, as a transformational engagement that might shape knowing and feeling. In Midlands (2019), Sam Salem presents an (auto)ethnographical account of his relationship to the city of Derby, UK. By deriving musical materials from audio generated by the deep neural network WaveNet, Salem creates an uncanny, not-quite-right representation of his childhood hometown. Similarly, in her album A Late Anthology of Early Music Vol. 1: Ancient to Renaissance (2020), Jennifer Walshe uses the neural network SampleRNN to create a simulated narrative of Western art music. By mapping her own voice onto selected canonical works, Walshe presents both an autoethnographic and anthropological reimagining of a musical past and questions practices of historiography. These works are contextualised within the practice and theory of filmmaker-ethnographer Trinh T. Minh-ha and her notion of ‘speaking nearby’. In extension of Tim Ingold’s conception of anthropology, it is shown that both works make collaborative human and non-human inquiries into the possibilities of human (and non-human) life.

41

Comanducci, Luca, Fabio Antonacci, and Augusto Sarti. "Synthesis of soundfields through irregular loudspeaker arrays based on convolutional neural networks." EURASIP Journal on Audio, Speech, and Music Processing 2024, no. 1 (March 28, 2024). http://dx.doi.org/10.1186/s13636-024-00337-7.

Abstract:

Most soundfield synthesis approaches deal with extensive and regular loudspeaker arrays, which are often not suitable for home audio systems, due to physical space constraints. In this article, we propose a deep-learning-based technique for soundfield synthesis through more easily deployable irregular loudspeaker arrays, i.e., arrays where the spacing between loudspeakers is not constant. The inputs are the driving signals obtained through a plane wave decomposition-based technique. While the considered driving signals are able to correctly reproduce the soundfield with a regular array, they show degraded performance when using irregular setups. Through a complex-valued convolutional neural network (CNN), we modify the driving signals in order to compensate for the errors in the reproduction of the desired soundfield. Since no ground truth driving signals are available for the compensated ones, we train the model by calculating the loss between the desired soundfield at a number of control points and the one obtained through the driving signals estimated by the network. The proposed model must be retrained for each irregular loudspeaker array configuration. Numerical results show better reproduction accuracy with respect to the plane wave decomposition-based technique, pressure-matching approach, and linear optimizers for driving signal compensation.

42

Patole, Prof Mrunalinee, Akhilesh Pandey, Kaustubh Bhagwat, Mukesh Vaishnav, and Salikram Chadar. "A Survey on “Text-to-Speech Systems for Real-Time Audio Synthesis”." International Journal of Advanced Research in Science, Communication and Technology, June 10, 2021, 375–79. http://dx.doi.org/10.48175/ijarsct-1400.

Abstract:

Text to Speech (TTS) is a form of speech synthesis in which text is converted into spoken, human-like voice output. State-of-the-art TTS techniques employ neural-network-based methods. This work aims to examine some of the problems and limitations present in current works, especially Tacotron-2, and attempts to further improve its performance by modifying its architecture. Many papers have been published on these topics that present various TTS systems by developing new TTS products. The aim is to survey different text-to-speech systems. Compared to other text-to-speech systems, Tacotron2 has multiple advantages. In alternative algorithms such as CNN and Fast-CNN, the algorithm may not examine the image fully, whereas in YOLO the algorithm examines the whole image, using a convolutional network to predict the bounding boxes and the probabilities for those boxes, and detects the image faster than the alternative algorithms.

43

Angrick, Miguel, Maarten C. Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Jeremy Saal, et al. "Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity." Communications Biology 4, no. 1 (September 23, 2021). http://dx.doi.org/10.1038/s42003-021-02578-0.

Abstract:

Speech neuroprosthetics aim to provide a natural communication channel to individuals who are unable to speak due to physical or neurological impairments. Real-time synthesis of acoustic speech directly from measured neural activity could enable natural conversations and notably improve quality of life, particularly for individuals who have severely limited means of communication. Recent advances in decoding approaches have led to high quality reconstructions of acoustic speech from invasively measured neural activity. However, most prior research utilizes data collected during open-loop experiments of articulated speech, which might not directly translate to imagined speech processes. Here, we present an approach that synthesizes audible speech in real-time for both imagined and whispered speech conditions. Using a participant implanted with stereotactic depth electrodes, we were able to reliably generate audible speech in real-time. The decoding models rely predominantly on frontal activity, suggesting that speech processes have similar representations when vocalized, whispered, or imagined. While reconstructed audio is not yet intelligible, our real-time synthesis approach represents an essential step towards investigating how patients will learn to operate a closed-loop speech neuroprosthesis based on imagined speech.

44

Zhang, Ni. "Informatization Integration Strategy of Modern Vocal Music Teaching and Traditional Music Culture in Colleges and Universities in the Era of Artificial Intelligence." Applied Mathematics and Nonlinear Sciences, December 2, 2023. http://dx.doi.org/10.2478/amns.2023.2.01333.

Abstract:

This paper utilizes deep learning algorithms to informally integrate modern vocal music teaching with traditional music culture and extracts audio time-domain and frequency-domain features through neural network self-learning. Secondly, a large number of music tracks are decomposed into music patterns, which constitute a music pattern library; a music training model is generated through an automatic music audio synthesis algorithm based on a recurrent neural network, and the GRU model is used for music training and model prediction. The strategy of integrating artificial intelligence and the modern vocal music teaching mode through traditional music culture is informatized, and a controlled experiment is carried out with H Music Academy as an example. The results show that the average degree of completion of the learning objectives of the students in the two experimental classes is 89.32 and 87.16, respectively, which is 14.15 and 11.99 higher than the average degree of completion of the control class. This study demonstrates that integrating traditional music culture into modern vocal music teaching can enhance students’ vocal music skills and improve their artistic literacy, which can raise the degree of completion of students’ learning objectives and, in turn, improve the overall level of vocal music teaching.

45

Hayes, Ben, Jordie Shier, György Fazekas, Andrew McPherson, and Charalampos Saitis. "A review of differentiable digital signal processing for music and speech synthesis." Frontiers in Signal Processing 3 (January 11, 2024). http://dx.doi.org/10.3389/frsip.2023.1284100.

Abstract:

The term “differentiable digital signal processing” describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music and speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably, which is further supported by a web book containing practical advice on differentiable synthesiser programming (https://intro2ddsp.github.io/). Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research.
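
To make the idea of backpropagating a loss gradient through a signal processor concrete, the sketch below fits the amplitude of a sinusoidal oscillator to a target signal by gradient descent. It is a minimal illustration under stated assumptions, not code from the reviewed article or its web book: the gradient is derived by hand with the chain rule, whereas automatic differentiation frameworks would compute it automatically, and all names and parameter values are hypothetical. Amplitude is chosen as the learnable parameter because the squared-error loss is quadratic in it.

```python
import math

def oscillator(amplitude, freq_hz, sample_rate, n_samples):
    """Render a sinusoid; the 'signal processor' the gradient flows through."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        for n in range(n_samples)
    ]

def loss_and_grad(amplitude, target, freq_hz, sample_rate):
    """Mean squared error and its derivative with respect to the amplitude."""
    n = len(target)
    basis = oscillator(1.0, freq_hz, sample_rate, n)  # d(output)/d(amplitude)
    residual = [amplitude * s - t for s, t in zip(basis, target)]
    loss = sum(r * r for r in residual) / n
    grad = sum(2.0 * r * s for r, s in zip(residual, basis)) / n  # chain rule
    return loss, grad

def fit_amplitude(target, freq_hz, sample_rate, steps=200, lr=0.5, init=0.0):
    """Plain gradient descent on the amplitude parameter."""
    amplitude = init
    for _ in range(steps):
        _, grad = loss_and_grad(amplitude, target, freq_hz, sample_rate)
        amplitude -= lr * grad
    return amplitude

# Hypothetical target: a 440 Hz tone with amplitude 0.7, 0.1 s at 16 kHz.
target = oscillator(0.7, 440.0, 16000, 1600)
estimated = fit_amplitude(target, 440.0, 16000)
```

With these settings `estimated` converges to the target amplitude of 0.7. Fitting the oscillator frequency in the same way is a harder, non-convex problem.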

46

Kohler, Jonas, Maarten C. Ottenhoff, Sophocles Goulis, Miguel Angrick, Albert J. Colon, Louis Wagner, Simon Tousseyn, Pieter L. Kubben, and Christian Herff. "Synthesizing Speech from Intracranial Depth Electrodes using an Encoder-Decoder Framework." Neurons, Behavior, Data analysis, and Theory, December 9, 2022. http://dx.doi.org/10.51628/001c.57524.

Abstract:

Speech Neuroprostheses have the potential to enable communication for people with dysarthria or anarthria. Recent advances have demonstrated high-quality text decoding and speech synthesis from electrocorticographic grids placed on the cortical surface. Here, we investigate a less invasive measurement modality in three participants, namely stereotactic EEG (sEEG) that provides sparse sampling from multiple brain regions, including subcortical regions. To evaluate whether sEEG can also be used to synthesize high-quality audio from neural recordings, we employ a recurrent encoder-decoder model based on modern deep learning methods. We find that speech can indeed be reconstructed with correlations up to 0.8 from these minimally invasive recordings, despite limited amounts of training data.

47

Simionato, Riccardo, Stefano Fasciani, and Sverre Holm. "Physics-informed differentiable method for piano modeling." Frontiers in Signal Processing 3 (February 13, 2024). http://dx.doi.org/10.3389/frsip.2023.1276748.

Abstract:

Numerical emulations of the piano have been a subject of study since the early days of sound synthesis. High-accuracy sound synthesis of acoustic instruments employs physical modeling techniques which aim to describe the system’s internal mechanism using mathematical formulations. Such physical approaches are system-specific and present significant challenges for tuning the system’s parameters. In addition, acoustic instruments such as the piano present nonlinear mechanisms that present significant computational challenges for solving associated partial differential equations required to generate synthetic sound. In a nonlinear context, the stability and efficiency of the numerical schemes when performing numerical simulations are not trivial, and models generally adopt simplifying assumptions and linearizations. Artificial neural networks can learn a complex system’s behaviors from data, and their application can be beneficial for modeling acoustic instruments. Artificial neural networks typically offer less flexibility regarding the variation of internal parameters for interactive applications, such as real-time sound synthesis. However, their integration with traditional signal processing frameworks can overcome this limitation. This article presents a method for piano sound synthesis informed by the physics of the instrument, combining deep learning with traditional digital signal processing techniques. The proposed model learns to synthesize the quasi-harmonic content of individual piano notes using physics-based formulas whose parameters are automatically estimated from real audio recordings. The model thus emulates the inharmonicity of the piano and the amplitude envelopes of the partials. It is capable of generalizing with good accuracy across different keys and velocities. Challenges persist in the high-frequency part of the spectrum, where the generation of partials is less accurate, especially at high-velocity values. 
The architecture of the proposed model permits low-latency implementation and has low computational complexity, paving the way for a novel approach to sound synthesis in interactive digital pianos that emulates specific acoustic instruments.
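
The abstract does not state which physics-based formulas the model uses. As one hedged illustration of what piano inharmonicity means, a commonly cited stiff-string approximation places the n-th partial at f_n = n · f_0 · sqrt(1 + B · n²), where B is the inharmonicity coefficient. The sketch below evaluates that approximation; the function name and the example values of f_0 and B are assumptions chosen for illustration, not parameters from the article.

```python
import math

def partial_frequencies(f0_hz, inharmonicity, n_partials):
    """Partial frequencies of a stiff string: f_n = n * f0 * sqrt(1 + B * n^2).

    With inharmonicity B = 0 this reduces to the ideal harmonic series n * f0.
    """
    return [
        n * f0_hz * math.sqrt(1.0 + inharmonicity * n * n)
        for n in range(1, n_partials + 1)
    ]

# Hypothetical example values: f0 = 220 Hz and B = 4e-4.
partials = partial_frequencies(f0_hz=220.0, inharmonicity=4e-4, n_partials=8)
harmonic = partial_frequencies(f0_hz=220.0, inharmonicity=0.0, n_partials=8)

# Deviation of each partial from the ideal harmonic series, in Hz.
stretch_hz = [p - h for p, h in zip(partials, harmonic)]
```

The deviation grows with the partial index, so higher partials sit progressively sharper than exact integer multiples of the fundamental.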

48

Кожирбаев, Ж. М. "ҚАЗАҚ ТІЛІ ҮШІН ИНТЕГРАЛДЫҚ (END-TO-END) СӨЙЛЕУ СИНТЕЗІ." BULLETIN Series Physical and Mathematical Sciences 79, no. 3(2022) (September 25, 2023). http://dx.doi.org/10.51889/9340.2022.21.68.023.

Abstract:

Speech synthesis, also called text-to-speech (TTS), is considered one of the important tasks of speech processing along with speech recognition. It is a way of converting given text to speech. There are several approaches to speech synthesis. In the 20th century, the first computer voice synthesis system was developed. Some of the early computer speech synthesis methods are articulatory synthesis, formant synthesis, and concatenative synthesis. Statistical parametric speech synthesis was later proposed as machine learning developed. Since the 2010s, neural network-based speech synthesis has gradually become more popular and has improved voice quality. The purpose of this work is to review statistical parametric and end-to-end methods, which can be considered as a line of evolutionary development of TTS. In addition, we experiment with an end-to-end method based on Tacotron2 and Parallel WaveGAN. For the experiments, textual materials from the works of Akhmet Baitursynuly were collected. In total, 50 hours of audio were recorded from the collected materials. From Baitursynuly's works, six books were selected, from which the most common works were chosen and compiled into audio-text materials. One professional male announcer voiced the collected text data. Keywords: speech synthesis, formant speech synthesis, concatenative speech synthesis, statistical parametric speech synthesis, integral speech synthesis.

49

Alsaadawı, Hussein Farooq Tayeb, and Resul Daş. "Multimodal Emotion Recognition Using Bi-LG-GCN for MELD Dataset." Balkan Journal of Electrical and Computer Engineering, October 16, 2023. http://dx.doi.org/10.17694/bajece.1372107.

Abstract:

Emotion recognition using multimodal data is a widely adopted approach due to its potential to enhance human interactions and various applications. By leveraging multimodal data for emotion recognition, the quality of human interactions can be significantly improved. We present the Multimodal Emotion Lines Dataset (MELD) and a novel method for multimodal emotion recognition using a bi-lateral gradient graph neural network (Bi-LG-GCN) and feature extraction and pre-processing. The multimodal dataset uses fine-grained emotion labeling for textual, audio, and visual modalities. This work aims to identify affective computing states successfully concealed in the textual and audio data for emotion recognition and sentiment analysis. We use pre-processing techniques to improve the quality and consistency of the data to increase the dataset’s usefulness. The process also includes noise removal, normalization, and linguistic processing to deal with linguistic variances and background noise in the discourse. The Kernel Principal Component Analysis (K-PCA) is employed for feature extraction, aiming to derive valuable attributes from each modality and encode labels for array values. We propose a Bi-LG-GCN-based architecture explicitly tailored for multimodal emotion recognition, effectively fusing data from various modalities. The Bi-LG-GCN system takes each modality's feature-extracted and pre-processed representation as input to the generator network, generating realistic synthetic data samples that capture multimodal relationships. These generated synthetic data samples, reflecting multimodal relationships, serve as inputs to the discriminator network, which has been trained to distinguish genuine from synthetic data. With this approach, the model can learn discriminative features for emotion recognition and make accurate predictions regarding subsequent emotional states. 
Our method was evaluated on the MELD dataset, yielding notable results in terms of accuracy (80%), F1-score (81%), precision (81%), and recall (81%). The pre-processing and feature extraction steps enhance input representation quality and discrimination. Our Bi-LG-GCN-based approach, featuring multimodal data synthesis, outperforms contemporary techniques, thus demonstrating its practical utility.

50

Mithoowani, Siraj, Andrew Mulloy, Augustin Toma, and Ameen Patel. "To err is human: A case-based review of cognitive bias and its role in clinical decision making." Canadian Journal of General Internal Medicine 12, no. 2 (August 30, 2017). http://dx.doi.org/10.22374/cjgim.v12i2.166.

Abstract:

Cognitive biases, or systematic errors in cognition, are important contributors to diagnostic error in medicine. In our review, we explore the psychological underpinnings of cognitive bias and highlight several common biases using clinical cases. We conclude by reviewing strategies to improve diagnostic accuracy and by discussing controversies and future research directions.

Research in the field of behavioural psychology and its application to medicine has been ongoing for several decades in an effort to better understand clinical decision making [1]. Cognitive biases (systematic errors in cognition) are increasingly recognized in behavioural economics [2] and more recently have been shown to affect medical decision making [3]. Over 100 such cognitive biases have been identified and several dozen are postulated to play a major role in diagnostic error [4]. Cognitive errors can take many forms and in one study contributed to as many as 74% of diagnostic errors by internists [5]. Most of these errors were due to “faulty synthesis” of information, including premature diagnostic closure and failed use of heuristics [5]. Inadequate medical knowledge, on the other hand, was rare and mostly identified in cases concerning rare conditions [5]. Professional organizations such as the Royal College of Physicians and Surgeons of Canada and the Canadian Medical Protective Association have since been working to raise awareness of cognitive bias in clinical practice [6].

In our review, we explore the role of cognitive bias in diagnostic error through the use of clinical cases. We also review the literature on de-biasing strategies and comment on limitations and future directions of research.

The Dual Process Theory

A prevailing theory to explain the existence of cognitive bias is the dual process theory, which asserts that two cognitive systems are used in decision making, herein called System 1 and System 2 (Table 1) [2,7].

System 1 can be thought of as our intuitive mode of thinking. It generates hypotheses rapidly, operates beneath our perceptible consciousness and makes judgments that are highly dependent on contextual clues. System 1 is characterized by heuristics (short cuts, or “rules of thumb”) and is an important component of clinical judgment or expertise. In contrast, System 2 is slow, deliberate, analytical and more demanding on cognition. It applies rules that are acquired through learning and it can play a “monitoring role” over System 1, and thus overrides heuristics when their use is inappropriate. The dual process theory implies that errors result when inappropriate judgments generated by System 1 fail to be recognized and corrected by System 2. Maintaining constant vigilance over System 1 would be both impractical and time consuming for routine decisions and would diminish the value of intuition. It follows that a more practical way of improving reasoning is to identify the most common biases of System 1 and to recognize situations when mistakes are most likely to occur [2].

Alternative Theories of Cognition

Variations of dual process theory have further refined our understanding of medical decision making.
Fuzzy trace theory, for example, proposes that individuals process information through parallel gist and verbatim representations [8]. The “gist” is analogous to System 1 and represents the bottom-line “meaning” of information. This representation is subject to an individual’s worldview, emotions and experiences. In contrast, verbatim representations are precise, literal and analogous to System 2. Fuzzy trace theory is particularly useful in explaining how patients might interpret health information. Proponents of this theory contend that in order for information to lead to meaningful behavioural change, physicians must appeal to both gist and verbatim representations when communicating with patients [8]. Other models, such as dynamic graded continuum theory, do away with the dichotomy of System 1 and System 2 and instead represent implicit, automatic and explicit cognitive processes on a continuous scale [9]. These single system models are useful to compare against dual process theory but have not replaced it as a well-established framework for understanding and mitigating cognitive bias in clinical decision making [7].

Case 1: A 55-Year-Old Male with Retrosternal Chest Pain

A 55-year-old non-smoking male was assessed in a busy Emergency Department (ED) for retrosternal chest pain. Past medical history is significant for osteoarthritis for which he takes naproxen. On review of his history, the patient has had multiple visits for retrosternal chest pain in the previous two months. At each encounter, he was discharged home after a negative cardiac workup.

Vital signs in the ED were within normal limits except for sinus tachycardia at 112 beats per minute. On exam, the patient was visibly distressed. Cardiac and respiratory exams were normal. There was mild tenderness in the epigastrium. Basic blood-work revealed leukocytosis (16.0 × 10^9/L), a mildly elevated high sensitivity cardiac troponin, and no other abnormalities. An ECG revealed T wave flattening in leads V3-V4.

The patient was referred to the internal medicine service with a diagnosis of non-ST-elevation myocardial infarction and treated with aspirin, clopidogrel, and fondaparinux. Several hours later, the patient became more agitated and complained of worsening retrosternal and epigastric pain. On re-examination, heart rate had increased to 139 beats per minute, blood pressure dropped to 77/60 and he had a rigid abdomen. Abdominal radiography revealed free air under the right hemi-diaphragm and the patient was rushed to the operating room where a perforated gastric ulcer was detected and repaired.

The case above illustrates numerous cognitive biases, including:

1. Premature diagnostic closure: the tendency to accept a diagnosis before it is fully verified [4].
2. Anchoring: the tendency to over-emphasize features in the patient’s initial presentation and failing to adjust the clinical impression after learning new information [4].
3. Confirmation bias: the tendency to look for confirming evidence to support a diagnosis, rather than to look for (or explain) evidence which puts the diagnosis in question [4].

In this case, the physician based the diagnosis of myocardial infarction primarily on symptoms of chest pain and an elevated cardiac troponin. However, several other objective findings were present and when taken together, suggested a diagnosis other than myocardial infarction. These included a tender epigastrium, leukocytosis, and resting sinus tachycardia. These symptoms/signs were not explicitly explained or investigated before a treatment decision was made.
Premature diagnostic closure is one of the most common cognitive biases underlying medical errors [5] and it affects clinicians at all levels of training [10]. It is multifactorial in origin [5] and is especially common in the face of other cognitive biases such as anchoring and confirmation bias.

The physician in this case “anchored” to a diagnosis of cardiac chest pain given the patient’s previous ED visit history and his/her best intentions of ruling out a “worst case scenario.” Anchoring can be especially powerful in the face of abnormal screening investigations that have been reviewed even before the physician has acquired a history or performed a physical examination. If the physician had reviewed the screening investigations before seeing the patient, he/she might have narrowed the differential diagnosis prematurely, failed to gather all the relevant information and failed to adjust the clinical impression based on new information.

The physician demonstrated confirmation bias by failing to explain the abnormalities that put the diagnosis of myocardial infarction in question (e.g. tender epigastrium, leukocytosis). Confirmation bias arises from an attempt to avoid cognitive dissonance, a distressing psychological conflict which occurs when inconsistent beliefs or theories are held simultaneously [11]. In one study evaluating clinical decision making amongst 75 psychiatrists and 75 medical students [12], 13% of psychiatrists and 25% of medical students demonstrated confirmation bias when searching for information after having made a preliminary diagnosis. In this study, confirmation bias resulted in more frequent diagnostic errors and predictably impacted subsequent treatment decisions.

An appropriate consideration of all diagnostic possibilities is the first step in avoiding diagnostic error. While acquiring information, physicians should step back and consolidate new data with the working diagnosis, as failure to do so can result in confirmation bias [13]. All abnormal findings and tests, especially if considered clinically relevant, should be explained by the most probable diagnosis. An alternate diagnosis or the possibility of more than one diagnosis should be considered when an abnormal finding or test cannot reasonably be explained by the working diagnosis.

Tschan et al. observed a team of physicians working through a simulated scenario which had diagnostic ambiguity [14]. Two approaches were found to be effective in reducing the effect of confirmation bias: explicit reasoning and talking to the room. Explicit reasoning involves making causal inferences when interpreting and communicating information. Talking to the room is a process whereby diagnostic reasoning is explained in an unstructured way to a team member or colleague in the room. This allows the clinician the opportunity to elaborate on their thoughts and observers to point out errors or suggest alternate diagnoses in a shared mental model.

Case 2: A 30-Year-Old Male with Confusion and Seizures

A 30-year-old homeless male is found confused on the street by paramedics and brought to the ED for assessment. Empty bottles of alcohol were noted at the scene. The CIWA (Clinical Institute Withdrawal Assessment for Alcohol) protocol is initiated and he is given several doses of lorazepam to minimal effect. Several hours after the patient is admitted, a resident on-call is paged for elevated CIWA scores on the basis of diaphoresis and agitation. Several additional doses of lorazepam are ordered which fail to completely resolve the symptoms. Gradually, the patient becomes more obtunded. The on-call resident orders a capillary blood glucose and it measures 1.1 mmol/L.
Intravenous D50W is promptly administered, the blood glucose normalizes and the patient’s level of consciousness improves.

The case above illustrates the following biases:

1. Availability bias: the tendency to weigh a diagnosis as being more likely if it comes to mind more readily [4].
2. Diagnostic momentum: the tendency for labels to “stick” to patients and become more definite with time [4].

Although the symptoms of diaphoresis and agitation are not specific to alcohol withdrawal, this diagnosis was deemed most likely based on how readily it came to mind, the empty alcohol bottles at the scene, and potentially on the patient’s demographics. The unproven diagnosis of alcohol withdrawal “stuck” with the patient despite minimal improvement after a therapeutic trial of benzodiazepines.

Availability bias has been shown to affect internal medicine residents. In one single-centre study [15], 18 first-year and 18 second-year residents were exposed to case descriptions with associated diagnoses as part of an exercise. They were then asked to diagnose a series of new cases, some of which appeared similar to those they had previously encountered but with pertinent differences that made an alternate diagnosis more likely. Second-year residents had lower diagnostic accuracy on these similar-appearing cases, a result consistent with availability bias. First-year residents were less prone to this bias because of their limited clinical experience. Most importantly, subsequent reflective diagnostic reasoning countered the bias and improved accuracy.

General Strategies to Avoid Cognitive Bias

Interventions aimed at mitigating diagnostic error due to cognitive bias take several approaches:

1. Improving clinical reasoning
2. Reducing cognitive burden
3. Improving knowledge and experience

Despite a large number of proposed interventions, there is a lack of empirical evidence supporting the efficacy of many de-biasing strategies [16]. What follows is a brief review of the current evidence.

Improving Clinical Reasoning

Several “de-biasing” strategies have been proposed to improve clinical reasoning. De-biasing strategies assume that System 1 processes are more prone to bias due to their heavy reliance on heuristics and therefore the solution is to activate System 2 at critical points in decision making. De-biasing occurs in several stages: at first an individual is educated about the presence of a cognitive bias, they then employ strategies to eliminate that bias and finally they maintain those strategies in the long term [17].

Metacognition, or “thinking about thinking,” involves reflecting on one’s own diagnostic reasoning. Internal reflection along with awareness of potential biases should allow the clinician to identify faulty reasoning. However, the evidence underlying reflective practice is mixed [16]. Several studies have tried to encourage reflective practice and System 2 processes by instructing participants to proceed slowly through their reasoning [18] or by giving participants the opportunity to review their diagnoses [19]. These studies have found minimal or no impact on reducing the rate of diagnostic error. On the other hand, some studies have shown improved diagnostic accuracy when physicians are asked to explicitly state their differential diagnoses along with features that are consistent or inconsistent with each diagnosis [20]. These results suggest that if reflective practice is to be effective, it must involve a thorough review of the differential diagnosis as opposed to simply taking additional time.

Reducing Cognitive Burden

Tools that reduce the cognitive burden placed on physicians may reduce the frequency of diagnostic errors. One suggestion has been to incorporate the use of checklists in the diagnostic process. These checklists would be matched to common presenting symptoms and include a list of possible diagnoses.
One randomized controlled trial failed to show a statistically significant reduction in the diagnostic error rate with the use of checklists, except in a small subgroup of patients treated in the ED [21]. These findings challenge the results of two other studies that found checklists to be effective in improving scrutiny [22] and diagnostic accuracy [23] when interpreting electrocardiograms. More advanced forms of clinician decision support systems have also been studied [24]. Software programs such as DXplain generate a list of potential diagnoses based on a patient’s chief complaint. In one study, when the software provided physicians a list of possible diagnoses before evaluating patients, diagnoses were 1.31 times more likely to be correct [25]. The use of diagnostic support tools may grow in the future as they are integrated into electronic medical record systems.

Improving Knowledge and Experience

A combination of experience, knowledge and feedback is integral in developing a clinician’s intuition to produce the best hypotheses. Experience without feedback can lead to overconfidence, which itself is a cognitive bias. The evidence supporting feedback is strong. Fridriksson et al. showed a significant reduction in diagnostic error when referring doctors were provided feedback on the identification of subarachnoid hemorrhage [26]. A systematic review of 118 randomized trials concluded that feedback was effective in improving professional practice [27]. The specific characteristics of the best feedback were elusive. In general, however, feedback was thought to be most effective when it was explicit and delivered close to the time of decision making.

Conclusions

In our review, we explore clinical decision making through the lens of dual-process theory. However, multiple different dual-processing models are still being explored and fundamental questions are still under debate.
For example, some experts believe that instead of focusing on de-biasing strategies, the key to improving intuitive (System 1) processes is simply to acquire more formal and experiential knowledge [19]. Other unanswered questions include: the impact and magnitude of cognitive bias in actual clinical practice, which biases are most prevalent in each medical specialty and which strategies are the most effective in mitigating bias. Further study is also needed to assess the impact of novel educational methods, such as case-based and simulation-based learning, which are promising venues where trainees may identify and correct cognitive biases in a directly observed setting.

References

1. Norman G. Research in clinical reasoning: past history and current trends. Med Educ 2005;39:418–27.
2. Kahneman D. Thinking, fast and slow. Farrar, Straus and Giroux; 2011.
3. Croskerry P. From mindless to mindful practice - cognitive bias and clinical decision making. N Engl J Med 2013;368:2445–8.
4. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med 2003;78:775–80.
5. Graber ML, Franklin N, Gordon R. Diagnostic error in internal medicine. Arch Intern Med 2005;165:1493–9.
6. Parush A, Campbell C, Hunter A, et al. Situational awareness and patient safety - a short primer. Ottawa ON: The Royal College of Physicians and Surgeons of Canada; 2011.
7. Pelaccia T, Tardif J, Triby E, Charlin B. An analysis of clinical reasoning through a recent and comprehensive approach: the dual-process theory. Med Educ Online 2011;16.
8. Reyna VF. A theory of medical decision making and health: fuzzy trace theory. Med Decis Making 2008;28:850–65.
9. Osman M. An evaluation of dual-process theories of reasoning. Psychon Bull Rev 2004;11:988–1010.
10. Dubeau CE, Voytovich AE, Rippey RM. Premature conclusions in the diagnosis of iron-deficiency anemia: cause and effect. Med Decis Making 1986;6:169–73.
11. Nickerson RS. Confirmation bias: A ubiquitous phenomenon in many guises. Rev Gen Psychol 1998;2:175.
12. Mendel R, Traut-Mattausch E, Jonas E, et al. Confirmation bias: why psychiatrists stick to wrong preliminary diagnoses. Psychol Med 2011;41:2651–9.
13. Pines JM. Profiles in patient safety: confirmation bias in emergency medicine. Acad Emerg Med 2006;13:90–4.
14. Tschan F, Semmer NK, Gurtner A, et al. Explicit reasoning, confirmation bias, and illusory transactive memory: a simulation study of group medical decision making. Small Group Res 2009;40:271–300.
15. Mamede S, van Gog T, van den Berge K, et al. Effect of availability bias and reflective reasoning on diagnostic accuracy among internal medicine residents. JAMA 2010;304:1198–203.
16. Graber ML, Kissam S, Payne VL, et al. Cognitive interventions to reduce diagnostic error: a narrative review. BMJ Qual Saf 2012;21:535–57.
17. Croskerry P, Singhal G, Mamede S. Cognitive debiasing 2: impediments to and strategies for change. BMJ Qual Saf 2013.
18. Norman G, Sherbino J, Dore K, et al. The etiology of diagnostic errors: a controlled trial of system 1 versus system 2 reasoning. Acad Med 2014;89:277–84.
19. Monteiro SD, Sherbino J, Patel A, Mazzetti I, Norman GR, Howey E. Reflecting on diagnostic errors: taking a second look is not enough. J Gen Intern Med 2015;30:1270–4.
20. Bass A, Geddes C, Wright B, Coderre S, Rikers R, McLaughlin K. Experienced physicians benefit from analyzing initial diagnostic hypotheses. Can Med Educ J 2013;4:e7–e15.
21. Ely JW, Graber MA. Checklists to prevent diagnostic errors: a pilot randomized controlled trial. Diagnosis 2015;2.
22. Sibbald M, de Bruin ABH, Yu E, van Merrienboer JJG. Why verifying diagnostic decisions with a checklist can help: insights from eye tracking. Adv Health Sci Educ Theory Pract 2015;20:1053–60.
23. Sibbald M, de Bruin ABH, van Merrienboer JJG. Checklists improve experts' diagnostic decisions. Med Educ 2013;47:301–8.
24. Garg AX, Adhikari NKJ, McDonald H, et al. Effects of computerized clinical decision support systems on practitioner performance and patient outcomes: a systematic review. JAMA 2005;293:1223–38.
25. Kostopoulou O, Rosen A, Round T, et al. Early diagnostic suggestions improve accuracy of GPs: a randomised controlled trial using computer-simulated patients. Br J Gen Pract 2015;65:e49–54.
26. Fridriksson S, Hillman J, Landtblom AM, Boive J. Education of referring doctors about sudden onset headache in subarachnoid hemorrhage. A prospective study. Acta Neurol Scand 2001;103:238–42.
27. Jamtvedt G, Young JM, Kristoffersen DT, O'Brien MA, Oxman AD. Does telling people what they have been doing change what they do? A systematic review of the effects of audit and feedback. Qual Saf Health Care 2006;15:433–6.