Academic literature on the topic 'Mel spectrogram analysis'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Mel spectrogram analysis.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Mel spectrogram analysis"

1

Lambamo, Wondimu, Ramasamy Srinivasagan, and Worku Jifara. "Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition." Applied Sciences 13, no. 1 (December 31, 2022): 569. http://dx.doi.org/10.3390/app13010569.

Abstract:
The performance of speaker recognition systems is very good on datasets without noise or mismatch. However, performance degrades with environmental noise, channel variation, and physical and behavioral changes in the speaker. The type of speaker-related feature plays a crucial role in improving the performance of speaker recognition systems. Gammatone Frequency Cepstral Coefficient (GFCC) features have been widely used to develop robust speaker recognition systems with conventional machine learning, and they achieved better performance than Mel Frequency Cepstral Coefficient (MFCC) features in noisy conditions. Recently, deep learning models have shown better performance in speaker recognition than conventional machine learning. Most previous deep learning-based speaker recognition models have used the Mel Spectrogram and similar inputs rather than handcrafted features such as MFCC and GFCC. However, the performance of Mel Spectrogram features degrades at high noise ratios and under mismatch in the utterances. Similar to the Mel Spectrogram, the Cochleogram is another important feature for deep learning speaker recognition models. Like GFCC features, the Cochleogram represents utterances on the Equivalent Rectangular Bandwidth (ERB) scale, which is important in noisy conditions. However, no study has analyzed the noise robustness of Cochleogram and Mel Spectrogram features in speaker recognition, and only a limited number of studies have used the Cochleogram to develop speech-based models in noisy and mismatched conditions using deep learning. In this study, the noise robustness of Cochleogram and Mel Spectrogram features in speaker recognition using deep learning models is analyzed at Signal to Noise Ratio (SNR) levels from −5 dB to 20 dB. Experiments are conducted on the VoxCeleb1 and noise-added VoxCeleb1 datasets using basic 2D CNN, ResNet-50, VGG-16, ECAPA-TDNN and TitaNet model architectures. The speaker identification and verification performance of both the Cochleogram and the Mel Spectrogram is evaluated. The results show that the Cochleogram performs better than the Mel Spectrogram in both speaker identification and verification under noisy and mismatched conditions.
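To make the SNR sweep described above concrete, the sketch below mixes a clean utterance with noise at a target SNR and extracts a log-mel spectrogram with librosa. It is a minimal illustration, not the authors' pipeline: the file names, frame parameters, and 16 kHz sampling rate are assumptions, and the cochleogram/ERB front end is not shown.

```python
import numpy as np
import librosa

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db."""
    noise = np.resize(noise, speech.shape)                 # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

def log_mel(y, sr, n_mels=64):
    """Log-mel spectrogram, a common CNN input for speaker recognition."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

# Hypothetical files; sweep the SNR range used in the study (-5 dB to 20 dB).
speech, sr = librosa.load("speaker_utterance.wav", sr=16000)
noise, _ = librosa.load("babble_noise.wav", sr=16000)
features = {snr: log_mel(mix_at_snr(speech, noise, snr), sr)
            for snr in range(-5, 25, 5)}
```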
2

Liao, Ying. "Analysis of Rehabilitation Occupational Therapy Techniques Based on Instrumental Music Chinese Tonal Language Spectrogram Analysis." Occupational Therapy International 2022 (October 3, 2022): 1–12. http://dx.doi.org/10.1155/2022/1064441.

Abstract:
This paper provides an in-depth analysis of timbre-speech spectrograms in instrumental music, designs a model analysis of rehabilitation occupational therapy techniques based on the analysis of timbre-speech spectrograms in instrumental music, and tests the models for comparison. Starting from the mechanism of human articulation, this paper models the process of human expression as a time-varying linear system consisting of excitation, vocal tract, and radiation models. The system’s overall architecture is designed according to the characteristics of Chinese speech and everyday speech rehabilitation theory (HSL theory). Phonetic length training was realized through the dual judgment of a temporal threshold and short-time average energy. Tone and clear-tone training were achieved using the linear predictive coding (LPC) technique and the autocorrelation function. Using the DTW technique, isolated-word speech recognition was achieved by extracting Mel-Frequency Cepstral Coefficient (MFCC) parameters from the speech signals. The system designs corresponding training scenes for each training module according to the extracted speech parameters, combines the multimedia speech spectrogram motion situation with the speech parameters, finally presents the training content as a speech spectrogram, and evaluates the training results through human-machine interaction to stimulate interest in rehabilitation therapy and realize the speech rehabilitation training of patients. After analyzing the pre- and post-test data, it was found that the p-values of all three groups were <0.05, which was judged to be significantly different. All subjects also changed their behavioral data during the treatment. Therefore, after summarizing the data, it was concluded that the music therapy technique could improve the patients’ active gaze communication ability, verbal command ability, and active question-answering ability, i.e., the hypothesis of this experiment is valid. Therefore, it is believed that the technique of timbre-speech spectrogram analysis in instrumental music can achieve the effect of rehabilitation therapy to a certain extent.
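The DTW-based isolated-word recognition mentioned in the abstract can be sketched with librosa as below. The template words, file names, and sampling rate are hypothetical, and the LPC-based tone training and the spectrogram-driven training scenes of the actual system are not reproduced here.

```python
import numpy as np
import librosa

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load an utterance and return its MFCC sequence, shape (n_mfcc, frames)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def dtw_distance(mfcc_a, mfcc_b):
    """Accumulated DTW cost between two MFCC sequences (lower = more similar)."""
    D, wp = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    return D[-1, -1] / len(wp)                             # path-normalised cost

# Recognise an isolated word by nearest template (file names are placeholders).
templates = {"water": mfcc_features("template_water.wav"),
             "eat":   mfcc_features("template_eat.wav")}
test = mfcc_features("patient_utterance.wav")
predicted = min(templates, key=lambda w: dtw_distance(templates[w], test))
```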
3

Byeon, Yeong-Hyeon, and Keun-Chang Kwak. "Pre-Configured Deep Convolutional Neural Networks with Various Time-Frequency Representations for Biometrics from ECG Signals." Applied Sciences 9, no. 22 (November 10, 2019): 4810. http://dx.doi.org/10.3390/app9224810.

Abstract:
We evaluated electrocardiogram (ECG) biometrics using pre-configured models of convolutional neural networks (CNNs) with various time-frequency representations. Biometrics technology records a person’s physical or behavioral characteristics in a digital signal via a sensor and analyzes it to identify the person. An ECG signal is obtained by detecting and amplifying a minute electrical signal flowing on the skin using a noninvasive electrode when the heart muscle depolarizes at each heartbeat. In biometrics, the ECG is especially advantageous in security applications because the heart is located within the body and moves while the subject is alive. However, a few body states generate noisy biometrics. Analyzing signals in the frequency domain makes the analysis robust to this noise. As the ECG is noise-sensitive, various studies have applied time-frequency transformations that are robust to noise, and CNNs have achieved good performance in image classification. Studies have applied time-frequency representations of 1D ECG signals to 2D CNNs using transforms such as the MFCC (mel frequency cepstrum coefficient), spectrogram, log spectrogram, mel spectrogram, and scalogram. CNNs have various pre-configured models such as VGGNet, GoogLeNet, ResNet, and DenseNet. Combinations of these time-frequency representations and pre-configured CNN models have not been investigated. In this study, we employed the PTB (Physikalisch-Technische Bundesanstalt)-ECG and CU (Chosun University)-ECG databases. The MFCC accuracies were 0.45%, 2.60%, 3.90%, and 0.25% higher than the spectrogram, log spectrogram, mel spectrogram, and scalogram accuracies, respectively. The Xception accuracies were 3.91%, 0.84%, and 1.14% higher than the VGGNet-19, ResNet-101, and DenseNet-201 accuracies, respectively.
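As a rough illustration of feeding time-frequency representations of a 1-D biosignal to a 2-D CNN, the sketch below derives several of the maps named in the abstract (spectrogram, log spectrogram, mel spectrogram, MFCC) from one segment using librosa. The synthetic ECG segment, sampling rate, and frame sizes are assumptions, and the scalogram (wavelet) branch is omitted.

```python
import numpy as np
import librosa

def time_frequency_maps(ecg, fs, n_fft=256, hop=32, n_mels=32):
    """Return several 2-D representations of a 1-D signal, as candidate CNN inputs."""
    stft = np.abs(librosa.stft(ecg, n_fft=n_fft, hop_length=hop)) ** 2
    return {
        "spectrogram":     stft,
        "log_spectrogram": librosa.power_to_db(stft, ref=np.max),
        "mel_spectrogram": librosa.power_to_db(
            librosa.feature.melspectrogram(y=ecg, sr=fs, n_fft=n_fft,
                                           hop_length=hop, n_mels=n_mels),
            ref=np.max),
        "mfcc":            librosa.feature.mfcc(y=ecg, sr=fs, n_mfcc=13,
                                                n_fft=n_fft, hop_length=hop),
    }

# e.g. a 10-second ECG segment sampled at 500 Hz (synthetic placeholder)
ecg = np.random.randn(5000).astype(np.float32)
maps = time_frequency_maps(ecg, fs=500)
```

Each map would then be resized to the input resolution expected by the chosen pre-configured CNN (for example, 224 x 224 for VGGNet).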
4

Reddy, A. Pramod, and Vijayarajan V. "Fusion Based AER System Using Deep Learning Approach for Amplitude and Frequency Analysis." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 3 (May 31, 2022): 1–19. http://dx.doi.org/10.1145/3488369.

Abstract:
Automatic emotion recognition from speech (AERS) systems based on acoustic analysis reveal that some emotional classes remain ambiguous. This study employed an alternative method aimed at providing a deeper understanding of the amplitude-frequency impact of various emotions, in order to support the development of more effective AER classification approaches in the near term. The study was undertaken by converting narrow 20 ms frames of speech into RGB or grey-scale spectrogram images. These features were used to fine-tune a feature selection system that had previously been trained to recognise emotions. Two different spectral scales, linear and Mel, are used to render the spectrogram, providing an inductive approach to gaining insight into the amplitude and frequency features of the various emotional classes. We propose a two-channel deep fusion network model for the efficient categorization of these images. Linear and Mel spectrograms are computed from the speech signal and prepared in the frequency domain as input to a deep neural network. The proposed model, AlexNet, with five convolutional layers and two fully connected layers, acquires the most vital features from spectrogram images plotted on the amplitude-frequency scale. The approach is compared with the state of the art on a benchmark dataset (EMO-DB). RGB and saliency images fed to the pre-trained AlexNet, tested on both the EMO-DB and a Telugu dataset, achieve an accuracy of 72.18%, while fused image features require less computation and reach an accuracy of 75.12%. The proposed model shows that transfer learning predicts more efficiently than a fine-tuned network. When tested on the EMO-DB dataset, the proposed system adequately learns discriminant features from speech spectrograms and outperforms many state-of-the-art techniques.
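A minimal sketch of how a frame sequence can be rendered as either a grey-scale or an RGB spectrogram image for a pre-trained CNN such as AlexNet is given below, assuming librosa and matplotlib. The STFT/mel parameters and the colormap are illustrative choices rather than the authors' settings, and the saliency-image and fusion branches are not shown.

```python
import numpy as np
import librosa
import matplotlib.cm as cm

def spectrogram_image(y, sr, mel=True, rgb=True):
    """Turn a short frame sequence into a grey-scale or RGB 'image' for a CNN."""
    if mel:
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160)
    else:
        S = np.abs(librosa.stft(y, n_fft=512, hop_length=160)) ** 2   # linear scale
    S_db = librosa.power_to_db(S, ref=np.max)
    grey = (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-9)     # scale to [0, 1]
    if not rgb:
        return grey                                                    # single channel
    return cm.viridis(grey)[..., :3]                                   # H x W x 3 image

# y, sr = librosa.load("emotion_clip.wav", sr=16000)   # hypothetical input file
```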
5

Yu, Yeonguk, and Yoon-Joong Kim. "Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database." Electronics 9, no. 5 (April 26, 2020): 713. http://dx.doi.org/10.3390/electronics9050713.

Abstract:
We propose a speech-emotion recognition (SER) model with an “attention-Long Short-Term Memory (LSTM)-attention” component to combine IS09, a commonly used feature set for SER, with the mel spectrogram, and we analyze the reliability problem of the interactive emotional dyadic motion capture (IEMOCAP) database. The attention mechanism of the model focuses on the emotion-related elements of the IS09 and mel spectrogram features and on the emotion-related durations within the features over time. Thus, the model extracts emotion information from a given speech signal. The proposed model for the baseline study achieved a weighted accuracy (WA) of 68% for the improvised dataset of IEMOCAP. However, the WA of the proposed model of the main study and of the modified models could not exceed 68% on the improvised dataset. This is because of the reliability limit of the IEMOCAP dataset. A more reliable dataset is required for a more accurate evaluation of the model’s performance. Therefore, in this study, we reconstructed a more reliable dataset based on the labeling results provided by IEMOCAP. The experimental results of the model on the more reliable dataset confirmed a WA of 73%.
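The "attention-LSTM-attention" idea can be sketched in PyTorch as a feature-wise attention gate, a recurrent layer, and a temporal attention pooling before the classifier, as below. This is a hedged reconstruction from the abstract only: the layer sizes, the bidirectional LSTM, and the sigmoid gating are assumptions, and the IS09 feature extraction is left out.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Score each time step and return an attention-weighted sum over time."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                           # h: (batch, time, dim)
        w = torch.softmax(self.score(h), dim=1)     # weights over time steps
        return (w * h).sum(dim=1)                   # (batch, dim)

class AttnLstmAttn(nn.Module):
    """Hypothetical sketch: feature attention, LSTM, then temporal attention."""
    def __init__(self, n_feats=40, hidden=128, n_classes=4):
        super().__init__()
        self.feat_attn = nn.Sequential(nn.Linear(n_feats, n_feats), nn.Sigmoid())
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.time_attn = AttentionPooling(2 * hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                           # x: (batch, time, n_feats)
        x = x * self.feat_attn(x)                   # gate emotion-related features
        h, _ = self.lstm(x)                         # (batch, time, 2 * hidden)
        return self.classifier(self.time_attn(h))   # (batch, n_classes)

model = AttnLstmAttn(n_feats=40, hidden=128, n_classes=4)
logits = model(torch.randn(8, 300, 40))             # batch of 300-frame sequences
```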
6

Bous, Frederik, and Axel Roebel. "A Bottleneck Auto-Encoder for F0 Transformations on Speech and Singing Voice." Information 13, no. 3 (February 23, 2022): 102. http://dx.doi.org/10.3390/info13030102.

Abstract:
In this publication, we present a deep learning-based method to transform the f0 in speech and singing voice recordings. f0 transformation is performed by training an auto-encoder on the voice signal’s mel-spectrogram and conditioning the auto-encoder on the f0. Inspired by AutoVC/F0, we apply an information bottleneck to it to disentangle the f0 from its latent code. The resulting model successfully applies the desired f0 to the input mel-spectrograms and adapts the speaker identity when necessary, e.g., if the requested f0 falls out of the range of the source speaker/singer. Using the mean f0 error in the transformed mel-spectrograms, we define a disentanglement measure and perform a study over the required bottleneck size. The study reveals that to remove the f0 from the auto-encoder’s latent code, the bottleneck size should be smaller than four for singing and smaller than nine for speech. Through a perceptive test, we compare the audio quality of the proposed auto-encoder to f0 transformations obtained with a classical vocoder. The perceptive test confirms that the audio quality is better for the auto-encoder than for the classical vocoder. Finally, a visual analysis of the latent code for the two-dimensional case is carried out. We observe that the auto-encoder encodes phonemes as repeated discontinuous temporal gestures within the latent code.
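The conditioning idea described here, a small bottleneck plus an external f0 input to the decoder, can be sketched frame-wise in PyTorch as below. The layer sizes, the frame-wise (rather than convolutional or recurrent) processing, and the bottleneck of four are illustrative assumptions; the published model and its training objective are more elaborate.

```python
import torch
import torch.nn as nn

class F0BottleneckAE(nn.Module):
    """Sketch: mel-frame auto-encoder with a small bottleneck; the decoder is
    conditioned on f0, so a sufficiently small bottleneck cannot carry f0."""
    def __init__(self, n_mels=80, bottleneck=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck + 1, 256), nn.ReLU(),
                                     nn.Linear(256, n_mels))

    def forward(self, mel, f0):          # mel: (B, T, n_mels), f0: (B, T, 1)
        z = self.encoder(mel)            # latent code, ideally free of f0
        return self.decoder(torch.cat([z, f0], dim=-1))

# At transformation time, encode the source mel frames once and decode them
# with a different (target) f0 contour.
```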
7

Rajan, Rajeev, and Sreejith Sivan. "Raga Recognition in Indian Carnatic Music Using Convolutional Neural Networks." WSEAS TRANSACTIONS ON ACOUSTICS AND MUSIC 9 (May 7, 2022): 5–10. http://dx.doi.org/10.37394/232019.2022.9.2.

Abstract:
A vital aspect of Indian Classical music (ICM) is raga, which serves as a melodic framework for compositions and improvisations in both traditions of classical music. In this work, we propose a CNN-based sliding window analysis on the mel-spectrogram and modgdgram for raga recognition in Carnatic music. The important contribution of the work is that the proposed method requires neither pitch extraction nor metadata for the estimation of raga. The CNN learns the representation of raga from the patterns in the mel-spectrogram/modgdgram during training through a sliding-window analysis. We train and test the network on the sliced mel-spectrogram/modgdgram of the original audio, while the final inference is performed on the audio as a whole. The performance is evaluated on 15 ragas from the CompMusic dataset. Multi-stream fusion has also been implemented to identify the potential of the two feature representations. The multi-stream architecture shows promise in the proposed scheme for raga recognition.
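The sliding-window analysis referred to above amounts to cutting each mel-spectrogram (or modgdgram) into fixed-width slices for training and combining slice-level predictions at inference. A small NumPy sketch, with window and hop sizes chosen arbitrarily, is shown below.

```python
import numpy as np

def sliding_windows(mel_db, win=128, hop=64):
    """Cut a (n_mels, frames) mel spectrogram into fixed-width slices for a CNN;
    at inference, predictions over all slices of a recording can be averaged."""
    slices = [mel_db[:, s:s + win]
              for s in range(0, mel_db.shape[1] - win + 1, hop)]
    return np.stack(slices) if slices else np.empty((0, mel_db.shape[0], win))

# mel_db could come from librosa.power_to_db(librosa.feature.melspectrogram(...))
```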
8

Papadimitriou, Ioannis, Anastasios Vafeiadis, Antonios Lalas, Konstantinos Votis, and Dimitrios Tzovaras. "Audio-Based Event Detection at Different SNR Settings Using Two-Dimensional Spectrogram Magnitude Representations." Electronics 9, no. 10 (September 29, 2020): 1593. http://dx.doi.org/10.3390/electronics9101593.

Abstract:
Audio-based event detection poses a number of challenges that are not encountered in other fields, such as image detection. Challenges such as ambient noise, low Signal-to-Noise Ratio (SNR) and microphone distance are not yet fully understood. If multimodal approaches are to improve across a range of fields of interest, audio analysis will have to play an integral part. Event recognition in autonomous vehicles (AVs) is such a field at a nascent stage, one that can rely solely on audio or use audio as part of a multimodal approach. In this manuscript, an extensive analysis focused on the comparison of different magnitude representations of the raw audio is presented. The data on which the analysis is carried out is part of the publicly available MIVIA Audio Events dataset. Single-channel Short-Time Fourier Transform (STFT), mel-scale and Mel-Frequency Cepstral Coefficient (MFCC) spectrogram representations are used. Furthermore, aggregation methods for the aforementioned spectrogram representations are examined: feature concatenation is compared to stacking the features as separate channels. The effect of the SNR on recognition accuracy and the generalization of the proposed methods to datasets that were both seen and not seen during training are studied and reported.
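The two aggregation strategies compared in the paper, feature concatenation versus stacking as separate channels, can be sketched with NumPy as below. The simple row-subsampling used to equalize the heights of the three maps is an assumption made for illustration; the paper's exact preprocessing may differ.

```python
import numpy as np

def aggregate(stft_db, mel_db, mfcc, mode="channels"):
    """Combine three time-frequency maps that share the same number of frames.
    'concat' stacks them along the feature axis (one tall single-channel image);
    'channels' subsamples each to a common height and stacks them as channels."""
    if mode == "concat":
        return np.concatenate([stft_db, mel_db, mfcc], axis=0)[np.newaxis, ...]
    height = min(m.shape[0] for m in (stft_db, mel_db, mfcc))
    resized = [m[np.linspace(0, m.shape[0] - 1, height).astype(int), :]
               for m in (stft_db, mel_db, mfcc)]
    return np.stack(resized, axis=0)          # shape (3, height, frames)
```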
9

Yazgaç, Bilgi Görkem, and Mürvet Kırcı. "Fractional-Order Calculus-Based Data Augmentation Methods for Environmental Sound Classification with Deep Learning." Fractal and Fractional 6, no. 10 (September 29, 2022): 555. http://dx.doi.org/10.3390/fractalfract6100555.

Abstract:
In this paper, we propose two fractional-order calculus-based data augmentation methods for audio signals. The first approach is based on fractional differentiation of the Mel scale: by using a randomly selected fractional derivative order, we warp the Mel scale and thereby aim to augment Mel-scale-based time-frequency representations of audio data. The second approach is based on previous fractional-order image edge enhancement methods. Since multiple deep learning approaches treat Mel spectrogram representations like images, a fractional-order differential-based mask is employed. The mask parameters are produced with respect to randomly selected fractional-order derivative parameters. The proposed data augmentation methods are applied to the UrbanSound8k environmental sound dataset. For classification of the dataset and testing of the methods, an arbitrary convolutional neural network is implemented. Our results show that fractional-order calculus-based methods can be employed for data augmentation. When the dataset size was increased to six times the original size, the classification accuracy increased by around 8.5%. Additional tests on more complex networks also produced better accuracy results compared to the non-augmented dataset. To our knowledge, this paper is the first example of employing fractional-order calculus as an audio data augmentation tool.
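As a rough illustration of fractional-order differencing applied to a mel-scale representation, the sketch below builds Grünwald-Letnikov fractional-difference coefficients for a randomly drawn order and applies them as a short kernel along the mel-frequency axis. This is only a stand-in for the paper's two augmentation methods; the kernel length, order range, and the additive enhancement step are assumptions.

```python
import numpy as np

def gl_coefficients(alpha, n_taps):
    """Grünwald-Letnikov coefficients (-1)^k * C(alpha, k), computed recursively."""
    c = np.empty(n_taps)
    c[0] = 1.0
    for k in range(1, n_taps):
        c[k] = c[k - 1] * (k - 1 - alpha) / k
    return c

def fractional_mask_augment(mel_db, n_taps=5, rng=None):
    """Apply a random-order fractional-difference kernel along the mel axis
    and add the result back as an edge-like enhancement (illustrative only)."""
    if rng is None:
        rng = np.random.default_rng()
    alpha = rng.uniform(0.1, 0.9)                       # random derivative order
    kernel = gl_coefficients(alpha, n_taps)
    enhanced = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, mel_db)
    return mel_db + enhanced

# mel_db: a (n_mels, frames) log-mel spectrogram, e.g. from librosa.power_to_db(...)
```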
10

Barile, C., C. Casavola, G. Pappalettera, and P. K. Vimalathithan. "Sound of a Composite Failure: An Acoustic Emission Investigation." IOP Conference Series: Materials Science and Engineering 1214, no. 1 (January 1, 2022): 012006. http://dx.doi.org/10.1088/1757-899x/1214/1/012006.

Abstract:
The failure progression characteristics of adhesively bonded Carbon Fiber Reinforced Polymer (CFRP) composites are investigated using the Acoustic Emission (AE) technique. Different failure progression modes such as matrix cracking, fiber breakage, delamination and through-thickness crack growth release AE waveforms in different frequency domains. The characteristic features of these different AE waveforms are studied on the Mel scale, a perceptual frequency scale based on average human hearing. The recurring noise in the recorded waveforms is identified more efficiently when the waveforms are analysed on the Mel scale. The recorded AE signals from the adhesively bonded CFRP under static tensile loading are stretched to match the Mel filter banks. The sampling rate of the recorded signal is adjusted from 1 MHz to 20 kHz. Following that, the Mel spectrogram and its cepstral coefficients are used for identifying the different failure modes from which the AE signals are generated. A comprehensive comparison of the AE analysis on the Mel scale with conventional waveform processing techniques such as the Fast Fourier Transform (FFT), Continuous Wavelet Transform (CWT), Wavelet Packet Transform (WPT) and Hilbert-Huang Transform (HHT) has been made. The advantages and further applications of the Mel scale over traditional waveform processing techniques in defining the failure modes in the composites are also discussed.
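A minimal sketch of the signal-conditioning step described above, bringing a 1 MHz acoustic-emission record down to 20 kHz and computing a mel spectrogram and cepstral coefficients, is given below using SciPy and librosa. The plain polyphase resampling and the frame parameters are assumptions; the paper's time-stretching to match the mel filter banks is not reproduced.

```python
import numpy as np
from scipy.signal import resample_poly
import librosa

def ae_mel_features(waveform, fs_in=1_000_000, fs_out=20_000, n_mels=64):
    """Resample an AE hit from fs_in to fs_out, then compute a log-mel
    spectrogram and mel-frequency cepstral coefficients from it."""
    y = resample_poly(waveform.astype(np.float64), up=1, down=fs_in // fs_out)
    mel = librosa.feature.melspectrogram(y=y.astype(np.float32), sr=fs_out,
                                         n_fft=512, hop_length=128, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    mfcc = librosa.feature.mfcc(S=mel_db, sr=fs_out, n_mfcc=13)
    return mel_db, mfcc

# waveform: 1-D NumPy array holding one AE hit recorded at 1 MHz (placeholder name)
```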

Dissertations / Theses on the topic "Mel spectrogram analysis"

1

Semela, René. "Automatické tagování hudebních děl pomocí metod strojového učení" [Automatic tagging of musical works using machine learning methods]. Master's thesis, Vysoké učení technické v Brně, Fakulta elektrotechniky a komunikačních technologií, 2020. http://www.nusl.cz/ntk/nusl-413253.

Abstract:
Systems for the automatic tagging of music are one of the many challenges of machine learning, in particular because of the complexity of the task. These systems can be used in practice for the content analysis of music or for sorting music libraries. This thesis deals with the design, training, testing, and evaluation of artificial neural network architectures for automatic tagging of music. In the beginning, attention is paid to laying the theoretical foundation of this field. In the practical part of the thesis, 8 neural network architectures are designed (4 fully convolutional and 4 convolutional recurrent). These architectures are then trained using the MagnaTagATune dataset and mel spectrograms. After training, the architectures are tested and evaluated. The best results are achieved by the four-layer convolutional recurrent neural network (CRNN4) with ROC-AUC = 0.9046 ± 0.0016. As the next step of the practical part, a completely new Last.fm Dataset 2020 is created. This dataset uses the Last.fm and Spotify APIs for data acquisition and contains 100 tags and 122,877 tracks. The most successful architectures are then trained, tested, and evaluated on this new dataset. The best results on this dataset are achieved by the six-layer fully convolutional neural network (FCNN6) with ROC-AUC = 0.8590 ± 0.0011. Finally, a simple application is introduced as a concluding point of the thesis. This application is designed for testing individual neural network architectures on a user-supplied audio file. The overall results of this thesis are similar to other papers on the same topic, but the thesis brings several new findings and innovations. In terms of innovations, a significant reduction in the complexity of the individual neural network architectures is achieved while maintaining similar results.
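The ROC-AUC figures quoted above are the standard multi-label tagging metric: per-tag areas under the ROC curve, averaged over tags. A small scikit-learn sketch with made-up labels and network scores is shown below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: binary tag matrix (tracks x tags); y_score: sigmoid outputs of the network
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3],
                    [0.6, 0.7, 0.2], [0.3, 0.1, 0.9]])
print(roc_auc_score(y_true, y_score, average="macro"))   # mean AUC over tags
```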

Conference papers on the topic "Mel spectrogram analysis"

1

Mittel, Dominik, Sebastian Proll, Florian Kerber, and Thorsten Scholer. "Mel Spectrogram Analysis for Punching Machine Operating State Classification with CNNs." In 2021 IEEE 26th International Conference on Emerging Technologies and Factory Automation (ETFA). IEEE, 2021. http://dx.doi.org/10.1109/etfa45728.2021.9613330.

