Journal articles on the topic 'Mel spectrogram analysis'

Consult the top 50 journal articles for your research on the topic 'Mel spectrogram analysis.'

Next to every source in the list of references there is an 'Add to bibliography' button. Press it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Lambamo, Wondimu, Ramasamy Srinivasagan, and Worku Jifara. "Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition." Applied Sciences 13, no. 1 (December 31, 2022): 569. http://dx.doi.org/10.3390/app13010569.

Full text
Abstract:
Speaker recognition systems perform very well on datasets without noise and mismatch. However, performance degrades with environmental noise, channel variation, and physical and behavioral changes in the speaker. The type of speaker-related feature plays a crucial role in improving the performance of speaker recognition systems. Gammatone Frequency Cepstral Coefficient (GFCC) features have been widely used to develop robust speaker recognition systems with conventional machine learning, and they achieved better performance than Mel Frequency Cepstral Coefficient (MFCC) features in noisy conditions. Recently, deep learning models have shown better performance in speaker recognition than conventional machine learning. Most previous deep learning-based speaker recognition models have used the Mel spectrogram and similar inputs rather than handcrafted features such as MFCC and GFCC. However, the performance of Mel spectrogram features degrades at high noise ratios and under mismatch in the utterances. Similar to the Mel spectrogram, the cochleogram is another important feature for deep learning speaker recognition models. Like GFCC features, the cochleogram represents utterances on the Equivalent Rectangular Bandwidth (ERB) scale, which is important in noisy conditions. However, no study has analyzed the noise robustness of the cochleogram and Mel spectrogram in speaker recognition, and only a limited number of studies have used the cochleogram to develop speech-based models in noisy and mismatched conditions using deep learning. In this study, the noise robustness of cochleogram and Mel spectrogram features in speaker recognition using deep learning models is analyzed at Signal to Noise Ratio (SNR) levels from −5 dB to 20 dB. Experiments are conducted on the VoxCeleb1 and noise-added VoxCeleb1 datasets using basic 2D CNN, ResNet-50, VGG-16, ECAPA-TDNN and TitaNet architectures. The speaker identification and verification performance of both the cochleogram and the Mel spectrogram is evaluated. The results show that the cochleogram performs better than the Mel spectrogram in both speaker identification and verification under noisy and mismatched conditions.
APA, Harvard, Vancouver, ISO, and other styles
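A noise-robustness experiment of this kind can be outlined with standard audio tooling. The sketch below is a minimal illustration, not the authors' pipeline: it assumes librosa and NumPy, uses hypothetical file names, mixes a noise recording into a clean utterance at a chosen SNR, and computes the log-mel spectrogram that would feed a CNN front end; a cochleogram would additionally require an ERB-spaced gammatone filterbank, which librosa does not ship.

```python
import numpy as np
import librosa

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, clean.shape)            # loop/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise

# Hypothetical paths; VoxCeleb1 utterances are 16 kHz mono.
speech, sr = librosa.load("utterance.wav", sr=16000)
noise, _ = librosa.load("babble_noise.wav", sr=16000)

for snr_db in (-5, 0, 5, 10, 15, 20):
    noisy = mix_at_snr(speech, noise, snr_db)
    mel = librosa.feature.melspectrogram(y=noisy, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=80)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # input "image" for the CNN
    print(snr_db, log_mel.shape)
```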
2

Liao, Ying. "Analysis of Rehabilitation Occupational Therapy Techniques Based on Instrumental Music Chinese Tonal Language Spectrogram Analysis." Occupational Therapy International 2022 (October 3, 2022): 1–12. http://dx.doi.org/10.1155/2022/1064441.

Full text
Abstract:
This paper provides an in-depth analysis of timbre-speech spectrograms in instrumental music, designs a model analysis of rehabilitation occupational therapy techniques based on the analysis of timbre-speech spectrograms in instrumental music, and tests the models for comparison. Starting from the mechanism of human articulation, this paper models the process of human expression as a time-varying linear system consisting of excitation, vocal tract, and radiation models. The system’s overall architecture is designed according to the characteristics of Chinese speech and everyday speech rehabilitation theory (HSL theory). The dual judgment of temporal threshold and short-time average energy realized the phonetic length training. Tone and clear tone training were achieved by linear predictive coding technique (LPC) and autocorrelation function. Using the DTW technique, isolated word speech recognition was achieved by extracting Mel-scale Frequency Cepstral Coefficients (MFCC) parameters of speech signals. The system designs corresponding training scenes for each training module according to the extracted speech parameters, combines the multimedia speech spectrogram motion situation with the speech parameters, and finally presents the training content as a speech spectrogram, and evaluates the training results through human-machine interaction to stimulate the interest of rehabilitation therapy and realize the speech rehabilitation training of patients. After analyzing the pre- and post-test data, it was found that the p -values of all three groups were <0.05, which was judged to be significantly different. Also, all subjects changed their behavioral data during the treatment. Therefore, it was concluded that the music therapy technique could improve the patients’ active gaze communication ability, verbal command ability, and active question-answering ability after summarizing the data, i.e., the hypothesis of this experiment is valid. Therefore, it is believed that the technique of timbre-speech spectrogram analysis in instrumental music can achieve the effect of rehabilitation therapy to a certain extent.
APA, Harvard, Vancouver, ISO, and other styles
3

Byeon, Yeong-Hyeon, and Keun-Chang Kwak. "Pre-Configured Deep Convolutional Neural Networks with Various Time-Frequency Representations for Biometrics from ECG Signals." Applied Sciences 9, no. 22 (November 10, 2019): 4810. http://dx.doi.org/10.3390/app9224810.

Full text
Abstract:
We evaluated electrocardiogram (ECG) biometrics using pre-configured models of convolutional neural networks (CNNs) with various time-frequency representations. Biometrics technology records a person’s physical or behavioral characteristics in a digital signal via a sensor and analyzes it to identify the person. An ECG signal is obtained by detecting and amplifying a minute electrical signal flowing on the skin using a noninvasive electrode when the heart muscle depolarizes at each heartbeat. In biometrics, the ECG is especially advantageous in security applications because the heart is located within the body and moves while the subject is alive. However, a few body states generate noisy biometrics. The analysis of signals in the frequency domain has a robust effect on the noise. As the ECG is noise-sensitive, various studies have applied time-frequency transformations that are robust to noise, with CNNs achieving a good performance in image classification. Studies have applied time-frequency representations of the 1D ECG signals to 2D CNNs using transforms like MFCC (mel frequency cepstrum coefficient), spectrogram, log spectrogram, mel spectrogram, and scalogram. CNNs have various pre-configured models such as VGGNet, GoogLeNet, ResNet, and DenseNet. Combinations of the time-frequency representations and pre-configured CNN models have not been investigated. In this study, we employed the PTB (Physikalisch-Technische Bundesanstalt)-ECG and CU (Chosun University)-ECG databases. The MFCC accuracies were 0.45%, 2.60%, 3.90%, and 0.25% higher than the spectrogram, log spectrogram, mel spectrogram, and scalogram accuracies, respectively. The Xception accuracies were 3.91%, 0.84%, and 1.14% higher than the VGGNet-19, ResNet-101, and DenseNet-201 accuracies, respectively.
APA, Harvard, Vancouver, ISO, and other styles
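To make the competing time-frequency inputs concrete, here is a minimal sketch assuming librosa, with a random placeholder standing in for an ECG segment; it produces the magnitude spectrogram, log spectrogram, mel spectrogram, and MFCC matrices that would each be rendered as a 2D image for the pre-configured CNNs, while a scalogram would additionally need a wavelet package.

```python
import numpy as np
import librosa

fs = 500                        # assumed ECG sampling rate (Hz)
x = np.random.randn(10 * fs)    # placeholder signal standing in for one ECG segment

stft = np.abs(librosa.stft(x, n_fft=256, hop_length=32))          # spectrogram
log_spec = librosa.amplitude_to_db(stft, ref=np.max)              # log spectrogram
mel = librosa.feature.melspectrogram(y=x, sr=fs, n_fft=256,
                                     hop_length=32, n_mels=32)    # mel spectrogram
log_mel = librosa.power_to_db(mel, ref=np.max)
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)                 # MFCCs from the mel bands

for name, rep in [("spectrogram", stft), ("log spectrogram", log_spec),
                  ("mel spectrogram", log_mel), ("mfcc", mfcc)]:
    print(name, rep.shape)      # each matrix becomes a 2D input image
```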
4

Reddy, A. Pramod, and Vijayarajan V. "Fusion Based AER System Using Deep Learning Approach for Amplitude and Frequency Analysis." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 3 (May 31, 2022): 1–19. http://dx.doi.org/10.1145/3488369.

Full text
Abstract:
Automatic emotion recognition from speech (AERS) systems based on acoustical analysis reveal that some emotional classes remain ambiguous. This study employed an alternative method aimed at providing a deep understanding of the amplitude–frequency impact of various emotions, in order to aid the development of more effective AER classification approaches. The study was undertaken by converting narrow 20 ms frames of speech into RGB or grey-scale spectrogram images. These features were used to fine-tune a feature selection system that had previously been trained to recognise emotions. Two different spectral scales, linear and Mel, are used to compute the spectrograms, giving an inductive approach for gaining insight into the amplitude and frequency features of the various emotional classes. We propose a two-channel deep fusion network model for the efficient categorization of the images. Linear and Mel spectrograms are computed from the speech signal, which is prepared in the frequency domain as input to the deep neural network. The proposed AlexNet model, with five convolutional layers and two fully connected layers, acquires the most vital features from spectrogram images plotted on the amplitude-frequency scale. The approach is compared with the state of the art on a benchmark dataset (EMO-DB). RGB and saliency images fed to the pre-trained AlexNet, tested on both the EMO-DB and a Telugu dataset, reach an accuracy of 72.18%, while fused image features require fewer computations and reach an accuracy of 75.12%. The proposed model shows that transfer learning predicts more efficiently than a fine-tuned network. When tested on the EMO-DB dataset, the proposed system adequately learns discriminant features from speech spectrograms and outperforms many state-of-the-art techniques.
APA, Harvard, Vancouver, ISO, and other styles
5

Yu, Yeonguk, and Yoon-Joong Kim. "Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database." Electronics 9, no. 5 (April 26, 2020): 713. http://dx.doi.org/10.3390/electronics9050713.

Full text
Abstract:
We propose a speech-emotion recognition (SER) model with an “attention-Long Short-Term Memory (LSTM)-attention” component to combine IS09, a commonly used feature set for SER, with the mel spectrogram, and we analyze the reliability problem of the interactive emotional dyadic motion capture (IEMOCAP) database. The attention mechanism of the model focuses on the emotion-related elements of the IS09 and mel spectrogram features and on the emotion-related durations within them. Thus, the model extracts emotion information from a given speech signal. The proposed model in the baseline study achieved a weighted accuracy (WA) of 68% on the improvised dataset of IEMOCAP. However, the WA of the proposed model in the main study and of the modified models could not exceed 68% on the improvised dataset. This is because of the reliability limit of the IEMOCAP dataset; a more reliable dataset is required for a more accurate evaluation of the model’s performance. Therefore, in this study, we reconstructed a more reliable dataset based on the labeling results provided by IEMOCAP. The experimental results of the model on the more reliable dataset confirmed a WA of 73%.
APA, Harvard, Vancouver, ISO, and other styles
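The "attention-LSTM-attention" idea can be sketched in PyTorch as below; the dimensions and the exact placement of the two attention blocks are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class AttnLSTMAttn(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, n_classes=4):
        super().__init__()
        self.feat_attn = nn.Linear(feat_dim, feat_dim)   # attention over feature elements
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.time_attn = nn.Linear(hidden, 1)            # attention over time steps
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, time, feat_dim) frame-level features
        x = x * torch.sigmoid(self.feat_attn(x))         # element-wise feature gating
        h, _ = self.lstm(x)                              # (batch, time, hidden)
        w = torch.softmax(self.time_attn(h), dim=1)      # temporal attention weights
        ctx = (w * h).sum(dim=1)                         # attention-pooled context vector
        return self.out(ctx)

model = AttnLSTMAttn()
logits = model(torch.randn(8, 300, 128))   # 8 utterances, 300 frames each
print(logits.shape)                        # torch.Size([8, 4])
```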
6

Bous, Frederik, and Axel Roebel. "A Bottleneck Auto-Encoder for F0 Transformations on Speech and Singing Voice." Information 13, no. 3 (February 23, 2022): 102. http://dx.doi.org/10.3390/info13030102.

Full text
Abstract:
In this publication, we present a deep learning-based method to transform the f0 in speech and singing voice recordings. f0 transformation is performed by training an auto-encoder on the voice signal’s mel-spectrogram and conditioning the auto-encoder on the f0. Inspired by AutoVC/F0, we apply an information bottleneck to it to disentangle the f0 from its latent code. The resulting model successfully applies the desired f0 to the input mel-spectrograms and adapts the speaker identity when necessary, e.g., if the requested f0 falls out of the range of the source speaker/singer. Using the mean f0 error in the transformed mel-spectrograms, we define a disentanglement measure and perform a study over the required bottleneck size. The study reveals that to remove the f0 from the auto-encoder’s latent code, the bottleneck size should be smaller than four for singing and smaller than nine for speech. Through a perceptive test, we compare the audio quality of the proposed auto-encoder to f0 transformations obtained with a classical vocoder. The perceptive test confirms that the audio quality is better for the auto-encoder than for the classical vocoder. Finally, a visual analysis of the latent code for the two-dimensional case is carried out. We observe that the auto-encoder encodes phonemes as repeated discontinuous temporal gestures within the latent code.
APA, Harvard, Vancouver, ISO, and other styles
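The bottleneck-plus-condition idea can be illustrated with a toy PyTorch auto-encoder (sizes are illustrative, not the published architecture): the encoder squeezes each mel frame into a very small latent code, and the decoder receives that code concatenated with the target f0, so pitch information has to flow through the condition rather than the bottleneck.

```python
import torch
import torch.nn as nn

class F0ConditionedAE(nn.Module):
    def __init__(self, n_mels=80, bottleneck=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck + 1, 256), nn.ReLU(),
                                     nn.Linear(256, n_mels))

    def forward(self, mel, f0):        # mel: (batch, time, n_mels), f0: (batch, time, 1)
        z = self.encoder(mel)          # latent code, too small to carry f0
        return self.decoder(torch.cat([z, f0], dim=-1))

model = F0ConditionedAE(bottleneck=4)
mel = torch.randn(2, 200, 80)
f0 = torch.rand(2, 200, 1)                 # normalized target f0 contour
recon = model(mel, f0)
loss = nn.functional.mse_loss(recon, mel)  # reconstruction loss; f0 is swapped at inference
print(recon.shape, loss.item())
```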
7

Rajan, Rajeev, and Sreejith Sivan. "Raga Recognition in Indian Carnatic Music Using Convolutional Neural Networks." WSEAS TRANSACTIONS ON ACOUSTICS AND MUSIC 9 (May 7, 2022): 5–10. http://dx.doi.org/10.37394/232019.2022.9.2.

Full text
Abstract:
A vital aspect of Indian Classical music (ICM) is raga, which serves as a melodic framework for compositions and improvisations in both traditions of classical music. In this work, we propose a CNN-based sliding window analysis on mel-spectrogram and modgdgram for raga recognition in Carnatic music. The important contribution of the work is that the proposed method requires neither pitch extraction nor metadata for the estimation of raga. The CNN learns the representation of raga from the patterns in the mel-spectrogram/modgdgram during training through a sliding-window analysis. We train and test the network on sliced mel-spectrograms/modgdgrams of the original audio, while the final inference is performed on the audio as a whole. The performance is evaluated on 15 ragas from the CompMusic dataset. Multi-stream fusion has also been implemented to identify the potential of the two feature representations. The multi-stream architecture shows promise in the proposed scheme for raga recognition.
APA, Harvard, Vancouver, ISO, and other styles
8

Papadimitriou, Ioannis, Anastasios Vafeiadis, Antonios Lalas, Konstantinos Votis, and Dimitrios Tzovaras. "Audio-Based Event Detection at Different SNR Settings Using Two-Dimensional Spectrogram Magnitude Representations." Electronics 9, no. 10 (September 29, 2020): 1593. http://dx.doi.org/10.3390/electronics9101593.

Full text
Abstract:
Audio-based event detection poses a number of different challenges that are not encountered in other fields, such as image detection. Challenges such as ambient noise, low Signal-to-Noise Ratio (SNR) and microphone distance are not yet fully understood. If the multimodal approaches are to become better in a range of fields of interest, audio analysis will have to play an integral part. Event recognition in autonomous vehicles (AVs) is such a field at a nascent stage that can especially leverage solely on audio or can be part of the multimodal approach. In this manuscript, an extensive analysis focused on the comparison of different magnitude representations of the raw audio is presented. The data on which the analysis is carried out is part of the publicly available MIVIA Audio Events dataset. Single channel Short-Time Fourier Transform (STFT), mel-scale and Mel-Frequency Cepstral Coefficients (MFCCs) spectrogram representations are used. Furthermore, aggregation methods of the aforementioned spectrogram representations are examined; the feature concatenation compared to the stacking of features as separate channels. The effect of the SNR on recognition accuracy and the generalization of the proposed methods on datasets that were both seen and not seen during training are studied and reported.
APA, Harvard, Vancouver, ISO, and other styles
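The two aggregation strategies compared above differ only in how the per-representation matrices are combined. A short sketch (librosa assumed, a random clip standing in for a MIVIA recording) makes the distinction concrete.

```python
import numpy as np
import librosa

sr = 22050
y = np.random.randn(3 * sr)        # placeholder clip; the MIVIA audio is not bundled

stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                                         hop_length=512, n_mels=64))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=64, n_fft=1024, hop_length=512)

# Option 1: concatenate along the feature axis -> one tall single-channel image.
concat = np.concatenate([stft, mel, mfcc], axis=0)   # (513 + 64 + 64, frames)

# Option 2: stack as separate channels -> multi-channel image, like RGB planes.
# Channels must share a shape, so only the 64-row representations are stacked here.
stacked = np.stack([mel, mfcc], axis=0)              # (2, 64, frames)
print(concat.shape, stacked.shape)
```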
9

Yazgaç, Bilgi Görkem, and Mürvet Kırcı. "Fractional-Order Calculus-Based Data Augmentation Methods for Environmental Sound Classification with Deep Learning." Fractal and Fractional 6, no. 10 (September 29, 2022): 555. http://dx.doi.org/10.3390/fractalfract6100555.

Full text
Abstract:
In this paper, we propose two fractional-order calculus-based data augmentation methods for audio signals. The first approach is based on fractional differentiation of the Mel scale. By using a randomly selected fractional derivation order, we are warping the Mel scale, therefore, we aim to augment Mel-scale-based time-frequency representations of audio data. The second approach is based on previous fractional-order image edge enhancement methods. Since multiple deep learning approaches treat Mel spectrogram representations like images, a fractional-order differential-based mask is employed. The mask parameters are produced with respect to randomly selected fractional-order derivative parameters. The proposed data augmentation methods are applied to the UrbanSound8k environmental sound dataset. For the classification of the dataset and testing the methods, an arbitrary convolutional neural network is implemented. Our results show that fractional-order calculus-based methods can be employed as data augmentation methods. Increasing the dataset size to six times the original size, the classification accuracy result increased by around 8.5%. Additional tests on more complex networks also produced better accuracy results compared to a non-augmented dataset. To our knowledge, this paper is the first example of employing fractional-order calculus as an audio data augmentation tool.
APA, Harvard, Vancouver, ISO, and other styles
10

Barile, C., C. Casavola, G. Pappalettera, and P. K. Vimalathithan. "Sound of a Composite Failure: An Acoustic Emission Investigation." IOP Conference Series: Materials Science and Engineering 1214, no. 1 (January 1, 2022): 012006. http://dx.doi.org/10.1088/1757-899x/1214/1/012006.

Full text
Abstract:
The failure progression characteristics of adhesively bonded Carbon Fiber Reinforced Polymer (CFRP) composites are investigated using the Acoustic Emission (AE) technique. Different failure progression modes such as matrix cracking, fiber breakage, delamination and through-thickness crack growth release AE waveforms in different frequency domains. The characteristic features of these different AE waveforms are studied on the Mel scale, a perceptual frequency scale based on average human hearing. The recurring noise in the recorded waveforms is identified more efficiently when the waveforms are analysed on the Mel scale. The AE signals recorded from the adhesively bonded CFRP under static tensile loading are stretched to match the Mel filter banks, with the sampling rate of the recorded signal adjusted from 1 MHz to 20 kHz. Following that, the Mel spectrogram and its cepstral coefficients are used to identify the different failure modes from which the AE signals are generated. A comprehensive comparison of the AE analysis on the Mel scale with conventional waveform processing techniques such as the Fast Fourier Transform (FFT), Continuous Wavelet Transform (CWT), Wavelet Packet Transform (WPT) and Hilbert-Huang Transform (HHT) is made. The advantages and further applications of the Mel scale over traditional waveform processing techniques in defining the failure modes in the composites are also discussed.
APA, Harvard, Vancouver, ISO, and other styles
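One simple reading of the "stretching" step is to reinterpret the 1 MHz acoustic-emission samples at an audio-like rate, which shifts ultrasonic content down into the band covered by ordinary mel filter banks. The sketch below follows that reading and is an assumption, not the authors' exact procedure; librosa is assumed and the waveform is a placeholder.

```python
import numpy as np
import librosa

orig_sr, new_sr = 1_000_000, 20_000        # 1 MHz acquisition, audio-like analysis rate
ae = np.random.randn(orig_sr // 10)        # placeholder: 0.1 s of an AE waveform

# Reinterpreting the samples at 20 kHz is a 50x slow-down that moves ultrasonic AE
# content into the range spanned by standard mel filter banks.
mel = librosa.feature.melspectrogram(y=ae, sr=new_sr, n_fft=1024,
                                     hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)   # cepstral coefficients per failure mode
print(log_mel.shape, mfcc.shape)
```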
11

Chen, Wei, and Guobin Wu. "A Multimodal Convolutional Neural Network Model for the Analysis of Music Genre on Children’s Emotions Influence Intelligence." Computational Intelligence and Neuroscience 2022 (August 29, 2022): 1–11. http://dx.doi.org/10.1155/2022/5611456.

Full text
Abstract:
This paper designs a multimodal convolutional neural network model for the intelligent analysis of the influence of music genres on children’s emotions, constructing the model and analyzing in depth the impact of music genres on children’s feelings. Considering the diversity of music genre features in the audio power spectrogram, the Mel filtering method is used in the feature extraction stage: dimensionality reduction of the Mel-filtered signal effectively retains the genre-related attributes of the audio signal and deepens the differences between the features extracted from different genres. To reduce the input size and expand the training scale, the audio power spectrogram obtained by feature extraction is cut at the model input stage. The MSCNN-LSTM consists of two modules: a multiscale convolutional kernel convolutional neural network and a long short-term memory network. The MSCNN network is used to extract EEG signal features, the LSTM network is used to extract the temporal characteristics of the eye-movement signal, and feature fusion is performed at the feature level. The multimodal signal yields a higher emotion classification accuracy than the unimodal signal, and the average accuracy of four-class emotion classification based on the 6-channel EEG signal and the children’s multimodal signal reaches 97.94%. After pretraining with the MSD (Million Song Dataset), the model was further improved significantly: the accuracy of the Dense Inception network improved to 91.0% and 89.91% on the GTZAN and ISMIR2004 datasets, respectively, demonstrating the network’s effectiveness and advancement.
APA, Harvard, Vancouver, ISO, and other styles
12

Hong, Joonki, Hai Tran, Jinhwan Jeong, Hyeryung Jang, In-Young Yoon, Jung Kyung Hong, and Jeong-Whun Kim. "0348 Sleep Staging Using End-to-End Deep Learning Model Based on Nocturnal Sound for Smartphones." Sleep 45, Supplement_1 (May 25, 2022): A156—A157. http://dx.doi.org/10.1093/sleep/zsac079.345.

Full text
Abstract:
Introduction: Convenient sleep tracking with mobile devices such as smartphones is desirable for people who want to easily objectify their sleep. The objective of this study was to introduce a deep learning model for sound-based sleep staging using audio data recorded with smartphones during sleep. Methods: Two different audio datasets were used. One (N = 1,154) was extracted from polysomnography (PSG) data and the other (N = 327) was recorded using a smartphone during PSG from independent subjects. The performance of sound-based sleep staging always depends on the quality of the audio. In practical conditions (non-contact and smartphone microphones), breathing and body movement sounds during the night are so weak that the energy of such signals is sometimes smaller than that of ambient noise. The audio was converted into a Mel spectrogram to detect latent temporal frequency patterns of breathing and body movement sounds against ambient noise. The proposed neural network model consisted of two sub-models: the first extracted features from each 30-second-epoch Mel spectrogram, and the second classified sleep stages through inter-epoch analysis of the extracted features. Results: Our model achieved 70% epoch-by-epoch agreement for 4-class (wake, light, deep, rapid eye movement) stage classification and robust performance across various signal-to-noise conditions. More precisely, the model was correct in 77% of wake, 73% of light, 46% of deep, and 66% of REM epochs. The model performance was not considerably affected by the existence of sleep apnea, but degradation was observed with severe periodic limb movement. External validation with the smartphone dataset also showed 68% epoch-by-epoch agreement. Compared with some commercially available sleep trackers such as Fitbit Alta HR (0.6325 mean per-class sensitivity) and SleepScore Max (0.565 mean per-class sensitivity), our model showed superior performance on both PSG audio (0.655 mean per-class sensitivity) and smartphone audio (0.6525 mean per-class sensitivity). Conclusion: To the best of our knowledge, this is the first end (Mel-spectrogram-based feature extraction)-to-end (sleep staging) deep learning model that can work with audio data in practical conditions. Our proposed deep learning model for sound-based sleep staging has the potential to be integrated into a smartphone application for reliable at-home sleep tracking.
APA, Harvard, Vancouver, ISO, and other styles
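The two-sub-model design (a per-epoch feature extractor followed by an inter-epoch sequence classifier) can be outlined in PyTorch roughly as below; all layer sizes are assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

class EpochEncoder(nn.Module):
    """Encodes one 30-s mel-spectrogram epoch into a feature vector."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, out_dim)

    def forward(self, x):                      # x: (batch, 1, n_mels, frames)
        return self.fc(self.conv(x).flatten(1))

class SleepStager(nn.Module):
    """Classifies each epoch using context from neighbouring epochs."""
    def __init__(self, n_stages=4):
        super().__init__()
        self.encoder = EpochEncoder()
        self.seq = nn.GRU(128, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_stages)    # wake / light / deep / REM

    def forward(self, mels):                   # mels: (batch, n_epochs, 1, n_mels, frames)
        b, e = mels.shape[:2]
        feats = self.encoder(mels.flatten(0, 1)).view(b, e, -1)
        ctx, _ = self.seq(feats)
        return self.out(ctx)                   # per-epoch stage logits

logits = SleepStager()(torch.randn(2, 20, 1, 64, 120))
print(logits.shape)                            # torch.Size([2, 20, 4])
```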
13

Hajarolasvadi, Noushin, and Hasan Demirel. "3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms." Entropy 21, no. 5 (May 8, 2019): 479. http://dx.doi.org/10.3390/e21050479.

Full text
Abstract:
Detecting human intentions and emotions helps improve human–robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on analysis of speech signals. Firstly, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity for each of the respective frames. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying k-means clustering on the extracted features of all frames of each audio signal, we select k most discriminant frames, namely keyframes, to summarize the speech signal. Then, the sequence of the corresponding spectrograms of keyframes is encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural network using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to the state-of-the-art methods reported in the literature.
APA, Harvard, Vancouver, ISO, and other styles
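The keyframe-selection step can be reproduced in a few lines with scikit-learn. In the sketch below the frame features and spectrograms are random placeholders: the per-frame feature vectors are clustered, the frame nearest each centroid is kept, and the corresponding spectrograms are stacked into the 3D tensor fed to the 3D CNN.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_frames, k = 200, 9
features = rng.normal(size=(n_frames, 88))           # 88-dim vector per frame (MFCC, pitch, ...)
spectrograms = rng.normal(size=(n_frames, 64, 64))    # one spectrogram image per frame

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
keyframes = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
    keyframes.append(members[np.argmin(d)])           # frame closest to this centroid
keyframes = sorted(keyframes)                         # keep temporal order

tensor_3d = spectrograms[keyframes]                   # (k, 64, 64) input to the 3D CNN
print(keyframes, tensor_3d.shape)
```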
14

Kim, Daeyeol, Tegg Taekyong Sung, Soo Young Cho, Gyunghak Lee, and Chae Bong Sohn. "A Single Predominant Instrument Recognition of Polyphonic Music Using CNN-based Timbre Analysis." International Journal of Engineering & Technology 7, no. 3.34 (September 1, 2018): 590. http://dx.doi.org/10.14419/ijet.v7i3.34.19388.

Full text
Abstract:
Classifying musical instruments from polyphonic music is a challenging but important task in music information retrieval. This work enables automatic tagging of music information, such as genre classification. Previously, almost every work on spectrogram analysis has used the Short Time Fourier Transform (STFT) and Mel Frequency Cepstral Coefficients (MFCC). Recently, the sparkgram has been researched and used in audio source analysis. Moreover, for deep learning approaches, modified convolutional neural networks (CNNs) have been widely researched, but many results have not improved drastically. Instead of improving backbone networks, we have focused on the preprocessing process. In this paper, we use a CNN and Hilbert Spectral Analysis (HSA) to solve the polyphonic music problem. The HSA is performed on fixed-length segments of polyphonic music, and a predominant instrument is labeled from its result. As a result, we have achieved a state-of-the-art result on the IRMAS dataset and a 3% performance improvement for individual instruments.
APA, Harvard, Vancouver, ISO, and other styles
15

Kim, Heejung, Youngshin Cho, Sunghee Lee, and Chaehyeon Kang. "MULTIMODAL AFFECTIVE ANALYSIS OF FACIAL AND VOCAL EXPRESSIVITY USING SMARTPHONE AND DEEP LEARNING ANALYSIS." Innovation in Aging 6, Supplement_1 (November 1, 2022): 593–94. http://dx.doi.org/10.1093/geroni/igac059.2221.

Full text
Abstract:
Limited expressivity of emotion is one of the most common symptoms of major depression, particularly in older adults. Although assessing facial and vocal expressivity is very important for accurate clinical evaluation of geriatric depression, research has rarely examined older adults via telehealth technology. This study aims to quantify facial and vocal expressivity via a multimodal affective system with deep learning. A total of 19 Korean adults aged over 65 years with severe depressive symptoms participated in this research. Using smartphone video recording, 1,429 facial and vocal data samples were collected between July and December 2020. Recorded videos were transmitted automatically to the cloud system. Basic facial movements were extracted using combined video frames and mel spectrogram images. By comparison against the AI Hub collection of Korean images from big data, mood status was classified into seven categories (anger, disgust, fear, happiness, neutrality, sadness, and surprise). Frequencies of each mood were coded into continuous variables for each participant in each recording. When comparing video and text prediction to determine “true labels,” the overall accuracy was 0.69, with F1 scores ranging from 0.57 to 0.79. In addition, the most common emotions were angry, happy, neutral, sad, and surprised. This study suggests that smartphone-recorded video could function as a useful tool for quantifying mood expressivity. This study established a preliminary method of affective assessment for older adults for telecare use, based on socially assistive technology at a distance from the clinic.
APA, Harvard, Vancouver, ISO, and other styles
16

Maskeliūnas, Rytis, Audrius Kulikajevas, Robertas Damaševičius, Kipras Pribuišis, Nora Ulozaitė-Stanienė, and Virgilijus Uloza. "Lightweight Deep Learning Model for Assessment of Substitution Voicing and Speech after Laryngeal Carcinoma Surgery." Cancers 14, no. 10 (May 11, 2022): 2366. http://dx.doi.org/10.3390/cancers14102366.

Full text
Abstract:
Laryngeal carcinoma is the most common malignant tumor of the upper respiratory tract. Total laryngectomy provides complete and permanent detachment of the upper and lower airways, which causes the loss of voice and leads to a patient’s inability to communicate verbally in the postoperative period. This paper aims to exploit modern areas of deep learning research to objectively classify, extract and measure the substitution voicing after laryngeal oncosurgery from the audio signal. We propose using well-known convolutional neural networks (CNNs), applied for image classification, for the analysis of the voice audio signal. Our approach takes a Mel-frequency spectrogram as the input to the deep neural network architecture. A database of digital speech recordings of 367 male subjects (279 normal speech samples and 88 pathological speech samples) was used. Our approach has shown the best true-positive rate of any of the compared state-of-the-art approaches, achieving an overall accuracy of 89.47%.
APA, Harvard, Vancouver, ISO, and other styles
17

Dzulfikar, Helmy, Sisdarmanto Adinandra, and Erika Ramadhani. "The Comparison of Audio Analysis Using Audio Forensic Technique and Mel Frequency Cepstral Coefficient Method (MFCC) as the Requirement of Digital Evidence." Jurnal Online Informatika 6, no. 2 (December 26, 2021): 145. http://dx.doi.org/10.15575/join.v6i2.702.

Full text
Abstract:
Audio forensics is the application of science and scientific methods to handling digital evidence in the form of audio. In this regard, audio supports the disclosure of various criminal cases and reveals information needed in the trial process. So far, research related to audio forensics has focused mainly on human voices recorded directly, either with a voice recorder or with voice-recording apps on smartphones available through Google Play or the iOS App Store. This study compares the analysis of live voices (human voices) with artificial voices from Google Voice and other sources. It applies audio forensic analysis with pitch, formant, and spectrogram as parameters, and also analyses the data using feature extraction with the Mel Frequency Cepstral Coefficient (MFCC) method, the Dynamic Time Warping (DTW) method, and the K-Nearest Neighbor (KNN) algorithm. The previously made live voice recordings and artificial voices are cut into words, and the resulting chunks are tested. Testing the audio forensic techniques with the Praat application identified similar words between live and artificial voices with an accuracy of 40.74%, while testing with the MFCC, DTW, and KNN methods in a system built with Matlab identified similar words between live and artificial voices with an accuracy of 33.33%.
APA, Harvard, Vancouver, ISO, and other styles
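A minimal version of the MFCC/DTW/nearest-neighbour comparison is sketched below (librosa and NumPy assumed; file names are hypothetical): MFCC sequences are compared with a plain dynamic-programming DTW, and an unknown word chunk is assigned the label of its closest reference.

```python
import numpy as np
import librosa

def mfcc_seq(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T      # (frames, 13)

def dtw_distance(a, b):
    """Classic DTW on Euclidean frame-to-frame distances."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

references = {"hello_live": mfcc_seq("live_hello.wav"),
              "hello_tts": mfcc_seq("google_voice_hello.wav")}
query = mfcc_seq("unknown_chunk.wav")

label = min(references, key=lambda k: dtw_distance(query, references[k]))  # 1-NN by DTW
print("closest reference:", label)
```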
18

Kumari, Neha. "Music Genre Classification for Indian Music Genres." International Journal for Research in Applied Science and Engineering Technology 9, no. 8 (August 31, 2021): 1756–62. http://dx.doi.org/10.22214/ijraset.2021.37669.

Full text
Abstract:
Abstract: Due to the enormous expansion in the accessibility of music data, music genre classification has taken on new significance in recent years. In order to have better access to these data, we need to index them correctly. Automatic music genre classification is essential when working with a large collection of music. For the majority of contemporary music genre classification methodologies, researchers have favoured machine learning techniques. In this study, we employed two datasets with different genres. A deep learning approach is utilised to train and classify the system, with a convolutional neural network used for training and classification. In speech analysis, the most crucial task is feature extraction. The Mel Frequency Cepstral Coefficient (MFCC) is utilised as the main audio feature extraction technique. By extracting the feature vector, the suggested method classifies music into several genres. Our findings suggest that our system has an 80% accuracy level, which will substantially improve with further training and facilitate music genre classification. Keywords: Music Genre Classification, CNN, KNN, Music information retrieval, feature extraction, spectrogram, GTZAN dataset, Indian music genre dataset.
APA, Harvard, Vancouver, ISO, and other styles
19

Kim, Jeonghyeon, Jonghoek Kim, and Hyuntai Kim. "A Study on Gear Defect Detection via Frequency Analysis Based on DNN." Machines 10, no. 8 (August 5, 2022): 659. http://dx.doi.org/10.3390/machines10080659.

Full text
Abstract:
In this paper, we introduce a gear defect detection system using frequency analysis based on deep learning. Existing defect diagnosis systems using acoustic analysis feed spectrogram, scalogram, and MFCC (Mel-Frequency Cepstral Coefficient) images into a convolutional neural network (CNN) model to diagnose defects. However, using visualized acoustic data as input to CNN models requires a lot of computation time. Although computing power has improved, low-performance processors are still used in some settings for reasons such as cost-effectiveness. In this paper, only the sums of frequency bands are used as input to a deep neural network (DNN) model to diagnose gear faults. The system diagnoses defects using only a few specific frequency bands, so it ignores unnecessary data and does not require high performance, because it uses a relatively simple deep learning model for classification. We evaluate the performance of the proposed system through experiments and verify that real-time diagnosis of gears is possible compared to the CNN model. The result showed 95.5% accuracy for 1000 test samples with a processing time of 18.48 ms, which verified the capability of real-time diagnosis in a low-spec environment. The proposed system is expected to be effectively used to diagnose defects in various sound-based facilities at a low cost.
APA, Harvard, Vancouver, ISO, and other styles
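The band-sum idea can be sketched as follows; the band edges, layer sizes, and the signal itself are placeholders rather than the paper's configuration.

```python
import numpy as np
import torch
import torch.nn as nn

def band_sums(signal, fs, bands):
    """Sum FFT magnitudes inside each (low, high) frequency band."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands])

fs = 44100
x = np.random.randn(fs)                                   # placeholder 1-s gear recording
bands = [(500, 1000), (1000, 2000), (2000, 4000), (4000, 8000)]   # assumed bands of interest
features = torch.tensor(band_sums(x, fs, bands), dtype=torch.float32)

dnn = nn.Sequential(nn.Linear(len(bands), 16), nn.ReLU(),
                    nn.Linear(16, 2))                     # normal vs. defective gear
print(dnn(features))                                      # logits for one sample
```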
20

He, Jinzheng, Zhou Zhao, Yi Ren, Jinglin Liu, Baoxing Huai, and Nicholas Yuan. "Flow-Based Unconstrained Lip to Speech Generation." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 1 (June 28, 2022): 843–51. http://dx.doi.org/10.1609/aaai.v36i1.19966.

Full text
Abstract:
Unconstrained lip-to-speech aims to generate corresponding speeches based on silent facial videos with no restriction to head pose or vocabulary. It is desirable to generate intelligible and natural speech with a fast speed in unconstrained settings. Currently, to handle the more complicated scenarios, most existing methods adopt the autoregressive architecture, which is optimized with the MSE loss. Although these methods have achieved promising performance, they are prone to bring issues including high inference latency and mel-spectrogram over-smoothness. To tackle these problems, we propose a novel flow-based non-autoregressive lip-to-speech model (GlowLTS) to break autoregressive constraints and achieve faster inference. Concretely, we adopt a flow-based decoder which is optimized by maximizing the likelihood of the training data and is capable of more natural and fast speech generation. Moreover, we devise a condition module to improve the intelligibility of generated speech. We demonstrate the superiority of our proposed method through objective and subjective evaluation on Lip2Wav-Chemistry-Lectures and Lip2Wav-Chess-Analysis datasets. Our demo video can be found at https://glowlts.github.io/.
APA, Harvard, Vancouver, ISO, and other styles
21

Utebayeva, Dana, Lyazzat Ilipbayeva, and Eric T. Matson. "Practical Study of Recurrent Neural Networks for Efficient Real-Time Drone Sound Detection: A Review." Drones 7, no. 1 (December 30, 2022): 26. http://dx.doi.org/10.3390/drones7010026.

Full text
Abstract:
The detection and classification of engine-based moving objects in restricted scenes from acoustic signals allow better Unmanned Aerial System (UAS)-specific intelligent systems and audio-based surveillance systems. Recurrent Neural Networks (RNNs) provide wide coverage in the field of acoustic analysis due to their effectiveness in widespread practical applications. In this work, we propose to study SimpleRNN, LSTM, BiLSTM, and GRU recurrent network models for real-time UAV sound recognition systems based on Mel-spectrogram using Kapre layers. The main goal of the work is to study the types of RNN networks in a practical sense for a reliable drone sound recognition system. According to the results of an experimental study, the GRU (Gated Recurrent Units) network model demonstrated a higher prediction ability than other RNN architectures for detecting differences and the state of objects from acoustic signals. That is, RNNs gave higher recognition than CNNs for loaded and unloaded audio states of various UAV models, while the GRU model showed about 98% accuracy for determining the UAV load states and 99% accuracy for background noise, which consisted of more other data.
APA, Harvard, Vancouver, ISO, and other styles
22

de Benito-Gorrón, Diego, Daniel Ramos, and Doroteo T. Toledano. "An Analysis of Sound Event Detection under Acoustic Degradation Using Multi-Resolution Systems." Applied Sciences 11, no. 23 (December 6, 2021): 11561. http://dx.doi.org/10.3390/app112311561.

Full text
Abstract:
The Sound Event Detection task aims to determine the temporal locations of acoustic events in audio clips. In recent years, the relevance of this field is rising due to the introduction of datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). In this paper, we analyze the performance of Sound Event Detection systems under diverse artificial acoustic conditions such as high- or low-pass filtering and clipping or dynamic range compression, as well as under an scenario of high overlap between events. For this purpose, the audio was obtained from the Evaluation subset of the DESED dataset, whereas the systems were trained in the context of the DCASE Challenge 2020 Task 4. Our systems are based upon the challenge baseline, which consists of a Convolutional-Recurrent Neural Network trained using the Mean Teacher method, and they employ a multiresolution approach which is able to improve the Sound Event Detection performance through the use of several resolutions during the extraction of Mel-spectrogram features. We provide insights on the benefits of this multiresolution approach in different acoustic settings, and compare the performance of the single-resolution systems in the aforementioned scenarios when using different resolutions. Furthermore, we complement the analysis of the performance in the high-overlap scenario by assessing the degree of overlap of each event category in sound event detection datasets.
APA, Harvard, Vancouver, ISO, and other styles
23

Kim, Jaehoon, Jeongkyu Oh, and Tae-Young Heo. "Acoustic Scene Classification and Visualization of Beehive Sounds Using Machine Learning Algorithms and Grad-CAM." Mathematical Problems in Engineering 2021 (May 24, 2021): 1–13. http://dx.doi.org/10.1155/2021/5594498.

Full text
Abstract:
Honeybees play a crucial role in the agriculture industry because they pollinate approximately 75% of all flowering crops. However, every year, the number of honeybees continues to decrease. Consequently, numerous researchers in various fields have persistently attempted to solve this problem. Acoustic scene classification, using sounds recorded from beehives, is an approach that can be applied to detect changes inside beehives. This method can be used to determine intervals that threaten a beehive. Currently, studies on sound analysis, using deep learning algorithms integrated with various data preprocessing methods that extract features from sound signals, continue to be conducted. However, there is little insight into how deep learning algorithms recognize audio scenes, as demonstrated by studies on image recognition. Therefore, in this study, we used a mel spectrogram, mel-frequency cepstral coefficients (MFCCs), and a constant-Q transform to compare the performance of conventional machine learning models to that of convolutional neural network (CNN) models. We used the support vector machine, random forest, extreme gradient boosting, shallow CNN, and VGG-13 models. Using gradient-weighted class activation mapping (Grad-CAM), we conducted an analysis to determine how the best-performing CNN model recognized audio scenes. The results showed that the VGG-13 model, using MFCCs as input data, demonstrated the best accuracy (91.93%). Additionally, based on the precision, recall, and F1-score for each class, we established that sounds other than those from bees were effectively recognized. Further, we conducted an analysis to determine the MFCCs that are important for classification through the visualizations obtained by applying Grad-CAM to the VGG-13 model. We believe that our findings can be used to develop a monitoring system that can consistently detect abnormal conditions in beehives early by classifying the sounds inside beehives.
APA, Harvard, Vancouver, ISO, and other styles
24

Zakariah, Mohammed, Reshma B, Yousef Ajmi Alothaibi, Yanhui Guo, Kiet Tran-Trung, and Mohammad Mamun Elahi. "An Analytical Study of Speech Pathology Detection Based on MFCC and Deep Neural Networks." Computational and Mathematical Methods in Medicine 2022 (April 4, 2022): 1–15. http://dx.doi.org/10.1155/2022/7814952.

Full text
Abstract:
Diseases of internal organs other than the vocal folds can also affect a person’s voice. As a result, voice problems are on the rise, even though they are frequently overlooked. According to a recent study, voice pathology detection systems can successfully help the assessment of voice abnormalities and enable the early diagnosis of voice pathology. For instance, in the early identification and diagnosis of voice problems, the automatic system for distinguishing healthy and diseased voices has gotten much attention. As a result, artificial intelligence-assisted voice analysis brings up new possibilities in healthcare. The work was aimed at assessing the utility of several automatic speech signal analysis methods for diagnosing voice disorders and suggesting a strategy for classifying healthy and diseased voices. The proposed framework integrates the efficacy of three voice characteristics: chroma, mel spectrogram, and mel frequency cepstral coefficient (MFCC). We also designed a deep neural network (DNN) capable of learning from the retrieved data and producing a highly accurate voice-based disease prediction model. The study describes a series of studies using the Saarbruecken Voice Database (SVD) to detect abnormal voices. The model was developed and tested using the vowels /a/, /i/, and /u/ pronounced in high, low, and average pitches. We also maintained the “continuous sentence” audio files collected from SVD to select how well the developed model generalizes to completely new data. The highest accuracy achieved was 77.49%, superior to prior attempts in the same domain. Additionally, the model attains an accuracy of 88.01% by integrating speaker gender information. The designed model trained on selected diseases can also obtain a maximum accuracy of 96.77% ( cordectomy × healthy ). As a result, the suggested framework is the best fit for the healthcare industry.
APA, Harvard, Vancouver, ISO, and other styles
25

Srivastava, Arpan, Sonakshi Jain, Ryan Miranda, Shruti Patil, Sharnil Pandya, and Ketan Kotecha. "Deep learning based respiratory sound analysis for detection of chronic obstructive pulmonary disease." PeerJ Computer Science 7 (February 11, 2021): e369. http://dx.doi.org/10.7717/peerj-cs.369.

Full text
Abstract:
In recent times, technologies such as machine learning and deep learning have played a vital role in providing assistive solutions to the challenges of the medical domain. They also improve predictive accuracy for early and timely disease detection using medical imaging and audio analysis. Due to the scarcity of trained human resources, medical practitioners are welcoming such technology assistance as it provides a helping hand in coping with more patients. Apart from critical health diseases such as cancer and diabetes, the impact of respiratory diseases is also gradually on the rise and is becoming life-threatening for society. Early diagnosis and immediate treatment are crucial in respiratory diseases, and hence the audio of respiratory sounds is proving very beneficial along with chest X-rays. The presented research work aims to apply Convolutional Neural Network based deep learning methodologies to assist medical experts by providing a detailed and rigorous analysis of the medical respiratory audio data for Chronic Obstructive Pulmonary Disease detection. In the conducted experiments, we have used features from the Librosa audio analysis library, such as MFCC, Mel-Spectrogram, Chroma, Chroma (Constant-Q) and Chroma CENS. The presented system could also interpret the severity of the identified disease, such as mild, moderate, or acute. The investigation results validate the success of the proposed deep learning approach. The system classification accuracy has been enhanced to an ICBHI score of 93%. Furthermore, in the conducted experiments, we have applied K-fold Cross-Validation with ten splits to optimize the performance of the presented deep learning approach.
APA, Harvard, Vancouver, ISO, and other styles
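The Librosa features listed above can be extracted and pooled into a fixed-length vector roughly as follows; the file name and the mean-pooling choice are assumptions, not the authors' exact preprocessing.

```python
import numpy as np
import librosa

y, sr = librosa.load("breathing_cycle.wav", sr=22050)   # hypothetical respiratory clip

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
chroma_cq = librosa.feature.chroma_cqt(y=y, sr=sr)
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr)

# Mean-pool each feature over time and concatenate into one vector per recording.
vector = np.concatenate([f.mean(axis=1) for f in
                         (mfcc, mel, chroma, chroma_cq, chroma_cens)])
print(vector.shape)   # fixed-length input for the classifier stage
```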
26

Aggarwal, Apeksha, Akshat Srivastava, Ajay Agarwal, Nidhi Chahal, Dilbag Singh, Abeer Ali Alnuaim, Aseel Alhadlaq, and Heung-No Lee. "Two-Way Feature Extraction for Speech Emotion Recognition Using Deep Learning." Sensors 22, no. 6 (March 19, 2022): 2378. http://dx.doi.org/10.3390/s22062378.

Full text
Abstract:
Recognizing human emotions by machines is a complex task. Deep learning models attempt to automate this process by rendering machines to exhibit learning capabilities. However, identifying human emotions from speech with good performance is still challenging. With the advent of deep learning algorithms, this problem has been addressed recently. However, most research work in the past focused on feature extraction as only one method for training. In this research, we have explored two different methods of extracting features to address effective speech emotion recognition. Initially, two-way feature extraction is proposed by utilizing super convergence to extract two sets of potential features from the speech data. For the first set of features, principal component analysis (PCA) is applied to obtain the first feature set. Thereafter, a deep neural network (DNN) with dense and dropout layers is implemented. In the second approach, mel-spectrogram images are extracted from audio files, and the 2D images are given as input to the pre-trained VGG-16 model. Extensive experiments and an in-depth comparative analysis over both the feature extraction methods with multiple algorithms and over two datasets are performed in this work. The RAVDESS dataset provided significantly better accuracy than using numeric features on a DNN.
APA, Harvard, Vancouver, ISO, and other styles
27

Uloza, Virgilijus, Rytis Maskeliunas, Kipras Pribuisis, Saulius Vaitkus, Audrius Kulikajevas, and Robertas Damasevicius. "An Artificial Intelligence-Based Algorithm for the Assessment of Substitution Voicing." Applied Sciences 12, no. 19 (September 28, 2022): 9748. http://dx.doi.org/10.3390/app12199748.

Full text
Abstract:
The purpose of this research was to develop an artificial intelligence-based method for evaluating substitution voicing (SV) and speech following laryngeal oncosurgery. Convolutional neural networks were used to analyze spoken audio sources. A Mel-frequency spectrogram was employed as input to the deep neural network architecture. The program was trained using a collection of 309 digitized speech recordings. The acoustic substitution voicing index (ASVI) model was elaborated using regression analysis. This model was then tested with speech samples that were unknown to the algorithm, and the results were compared to the auditory-perceptual SV evaluation provided by the medical professionals. A statistically significant, strong correlation with rs = 0.863 (p = 0.001) was observed between the ASVI and the SV evaluation performed by the trained laryngologists. The one-way ANOVA showed statistically significant ASVI differences in control, cordectomy, partial laryngectomy, and total laryngectomy patient groups (p < 0.001). The elaborated lightweight ASVI algorithm reached rapid response rates of 3.56 ms. The ASVI provides a fast and efficient option for SV and speech in patients after laryngeal oncosurgery. The ASVI results are comparable to the auditory-perceptual SV evaluation performed by medical professionals.
APA, Harvard, Vancouver, ISO, and other styles
28

Rao, Sunil, Vivek Narayanaswamy, Michael Esposito, Jayaraman J. Thiagarajan, and Andreas Spanias. "COVID-19 detection using cough sound analysis and deep learning algorithms." Intelligent Decision Technologies 15, no. 4 (January 10, 2022): 655–65. http://dx.doi.org/10.3233/idt-210206.

Full text
Abstract:
Reliable and rapid non-invasive testing has become essential for COVID-19 diagnosis and tracking statistics. Recent studies motivate the use of modern machine learning (ML) and deep learning (DL) tools that utilize features of coughing sounds for COVID-19 diagnosis. In this paper, we describe system designs that we developed for COVID-19 cough detection with the long-term objective of embedding them in a testing device. More specifically, we use log-mel spectrogram features extracted from the coughing audio signal and design a series of customized deep learning algorithms to develop fast and automated diagnosis tools for COVID-19 detection. We first explore the use of a deep neural network with fully connected layers. Additionally, we investigate prospects of efficient implementation by examining the impact on the detection performance by pruning the fully connected neural network based on the Lottery Ticket Hypothesis (LTH) optimization process. In general, pruned neural networks have been shown to provide similar performance gains to that of unpruned networks with reduced computational complexity in a variety of signal processing applications. Finally, we investigate the use of convolutional neural network architectures and in particular the VGG-13 architecture which we tune specifically for this application. Our results show that a unique ensembling of the VGG-13 architecture trained using a combination of binary cross entropy and focal losses with data augmentation significantly outperforms the fully connected networks and other recently proposed baselines on the DiCOVA 2021 COVID-19 cough audio dataset. Our customized VGG-13 model achieves an average validation AUROC of 82.23% and a test AUROC of 78.3% at a sensitivity of 80.49%.
APA, Harvard, Vancouver, ISO, and other styles
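A simplified version of the pruning experiment, one-shot global magnitude pruning of the dense layers rather than the full Lottery Ticket rewinding procedure, can be written with PyTorch's pruning utilities; the layer sizes below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Fully connected classifier over flattened log-mel spectrogram patches (sizes assumed).
model = nn.Sequential(
    nn.Linear(64 * 96, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 2))                       # COVID-positive vs. negative cough

# One-shot global magnitude pruning of 80% of the dense weights.
params = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.8)

kept = sum(int(m.weight_mask.sum()) for m, _ in params)
total = sum(m.weight_mask.numel() for m, _ in params)
print(f"weights kept: {kept}/{total}")
x = torch.randn(4, 64 * 96)
print(model(x).shape)                       # pruned network still produces logits
```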
29

An, Ji-Hee, Na-Kyoung Koo, Ju-Hye Son, Hye-Min Joo, and Seungdo Jeong. "Development on Deaf Support Application Based on Daily Sound Classification Using Image-based Deep Learning." JOIV : International Journal on Informatics Visualization 6, no. 1-2 (May 31, 2022): 250. http://dx.doi.org/10.30630/joiv.6.1-2.936.

Full text
Abstract:
According to statistics, hearing-impaired persons account for 27% of all persons with disabilities in Korea. However, the support available to the deaf and hard of hearing in the form of protective devices and daily-life aids is insufficient compared with this large number. In particular, the hearing impaired miss much of the information conveyed through sound, which causes inconvenience in daily life. Therefore, in this paper, we propose a method to relieve this discomfort. The method analyzes sounds that occur frequently and must be recognized in daily life and delivers them to the hearing impaired through an application and a vibration bracelet. For sound analysis, sounds that often occur in daily life were converted into Mel spectrograms and learned using deep learning. Sounds that actually occur are recorded through the application and then identified based on the learning result. According to the identification result, predefined alarms and vibrations are provided so that the hearing impaired can easily recognize the sound. In the experiment, recognition of the four major sounds occurring in real life showed an average performance of 85%, with an average classification rate of 80% for mixed sounds. The experiments confirmed that the proposed method can be applied in real life. Through the proposed method, quality of life can be improved by allowing the hearing impaired to recognize and respond to sounds that are essential in daily life.
APA, Harvard, Vancouver, ISO, and other styles
30

Ilarionov, Oleg, Anton Astakhov, Anna Krasovska, and Iryna Domanetska. "Intelligent module for recognizing emotions by voice." Advanced Information Technology, no. 1 (1) (2021): 46–52. http://dx.doi.org/10.17721/ait.2021.1.06.

Full text
Abstract:
Speech is the main way people communicate, and from speech people can receive not only semantic but also emotional information. Recognition of emotions by voice is relevant to areas such as psychological care, the development of security systems, lie detection, customer relationship analysis, and video game development. Because the recognition of emotions by a person is subjective, and therefore inexact and time consuming, there is a need to create software that can solve this problem. The article considers the state of the problem of recognizing human emotions by voice. Modern publications and the approaches used in them, namely emotion models, data sets, feature extraction methods, and classifiers, are analyzed. It is determined that existing developments have an average accuracy of about 0.75. The general structure of a system for recognizing human emotions by voice is analyzed, and the corresponding intelligent module is designed and developed. The Unified Modeling Language (UML) is used to create a component diagram and a class diagram. The RAVDESS and TESS datasets were selected to diversify the training sample. A discrete model of emotions (joy, sadness, anger, disgust, fear, surprise, calm, and a neutral emotion), the MFCC (Mel Frequency Cepstral Coefficients) method for feature extraction, and a convolutional neural network for classification were used. The neural network was developed using the TensorFlow and Keras machine learning libraries. The spectrogram and graphs of the audio signal, as well as graphs of accuracy and recognition errors, are constructed. As a result of the software implementation of the intelligent module for recognizing emotions by voice, the validation accuracy has been increased to 0.8.
APA, Harvard, Vancouver, ISO, and other styles
31

Akinpelu, Samson, and Serestina Viriri. "Robust Feature Selection-Based Speech Emotion Classification Using Deep Transfer Learning." Applied Sciences 12, no. 16 (August 18, 2022): 8265. http://dx.doi.org/10.3390/app12168265.

Full text
Abstract:
Speech Emotion Classification (SEC) relies heavily on the quality of feature extraction and selection from the speech signal. Improvements that enhance the classification of emotion have attracted significant attention from researchers. Many primitives and algorithmic solutions for efficient SEC with minimum cost have been proposed; however, the accuracy and performance of these methods have not yet reached a satisfactory point. In this work, we propose a novel deep transfer learning approach with distinctive, emotionally rich feature selection techniques for speech emotion classification. We adopt the mel-spectrogram extracted from the speech signal as the input to our deep convolutional neural network for efficient feature extraction. We froze 19 layers of our pretrained convolutional neural network to keep them from re-training, increasing efficiency and minimizing computational cost. One flattened layer and two dense layers were used. A ReLU activation function was used at the last layer of our feature extraction segment. To prevent misclassification and reduce feature dimensionality, we employed the Neighborhood Component Analysis (NCA) feature selection algorithm to pick out the most relevant features before the actual classification of emotion. Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP) classifiers were utilized at the topmost layer of our model. Two popular datasets for speech emotion classification tasks were used, the Berlin Emotional Speech Database (EMO-DB) and the Toronto Emotional Speech Set (TESS), and a combination of EMO-DB with TESS was also used in our experiment. We obtained state-of-the-art results with an accuracy of 94.3% and 100% specificity on EMO-DB, and an accuracy of 97.2% and 99.80% specificity on TESS. The performance of our proposed method outperformed some recent work in SEC after assessment on the three datasets.
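The pipeline outlined above, a pretrained CNN used as a partly frozen feature extractor on mel-spectrogram images, NCA for feature selection, and an SVM on top, can be sketched as follows. The choice of VGG16 as backbone, the embedding sizes, and the NCA dimensionality are illustrative assumptions.

```python
# Sketch: frozen pretrained CNN -> embedding -> NCA feature selection -> SVM.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Pretrained backbone; freeze the first 19 layers, as in the abstract.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:19]:
    layer.trainable = False

# Flatten + two dense layers used to produce an embedding per spectrogram image.
extractor = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
])

def embed(spectrogram_images):
    """spectrogram_images: float array of shape (N, 224, 224, 3) scaled to [0, 1]."""
    return extractor.predict(spectrogram_images, verbose=0)

def train_classifier(X_emb, y):
    """NCA learns a discriminative projection; the SVM makes the final decision."""
    clf = make_pipeline(
        NeighborhoodComponentsAnalysis(n_components=32, random_state=0),
        SVC(kernel="rbf", C=10.0),
    )
    return clf.fit(X_emb, y)
```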
APA, Harvard, Vancouver, ISO, and other styles
32

Lee, Seungwoo, Iksu Seo, Jongwon Seok, Yunsu Kim, and Dong Seog Han. "Active Sonar Target Classification with Power-Normalized Cepstral Coefficients and Convolutional Neural Network." Applied Sciences 10, no. 23 (November 26, 2020): 8450. http://dx.doi.org/10.3390/app10238450.

Full text
Abstract:
Detection and classification of unidentified underwater targets maneuvering in complex underwater environments are critical for active sonar systems. In previous studies, many detection methods were applied to separate targets from clutter using signals that exceed a preset threshold determined by the sonar console operator. This works because a high signal-to-noise ratio target has enough feature vector components to be separated. However, in a real environment, the signal-to-noise ratio of the received target does not always exceed the threshold. Therefore, a target detection algorithm for various target signal-to-noise ratio environments is required; strong clutter energy can lead to false detections, while weak target signals reduce the probability of detection. Long-range detection also uses long pulse repetition intervals under high ambient noise, which requires classification of each ping without accumulating pings. In this study, a target classification algorithm is proposed that can be applied to signals above the noise level in real underwater environments without a threshold set by the sonar console operator, and the classification performance of the algorithm is verified. Active sonar for long-range target detection yields low-resolution data; thus, feature vector extraction algorithms are required. Feature vectors are extracted from the experimental data using Power-Normalized Cepstral Coefficients (PNCC) for target classification. Feature vectors are also extracted with Mel-Frequency Cepstral Coefficients and compared with the proposed algorithm. A convolutional neural network was employed as the classifier. In addition, the proposed algorithm is compared with the result of target classification using a spectrogram and a convolutional neural network. Experimental data were obtained using a hull-mounted active sonar system operating on a Korean naval ship in the East Sea of South Korea and a real maneuvering underwater target. From the experimental data with 29 pings, we extracted 361 target and 3351 clutter samples. Because it is difficult to collect real underwater target data from the sea environment, the number of target samples was increased using a data augmentation technique. Eighty percent of the data was used for training and the rest for testing. Accuracy curves and classification rate tables are presented for performance analysis and discussion. Results showed that the proposed algorithm achieves a higher classification rate than Mel-Frequency Cepstral Coefficients, without the classification being affected by the signal level. Additionally, the results showed that target classification is possible within one ping of data without any ping accumulation.
APA, Harvard, Vancouver, ISO, and other styles
33

Ćirić, Dejan G., Zoran H. Perić, Nikola J. Vučić, and Miljan P. Miletić. "Analysis of Industrial Product Sound by Applying Image Similarity Measures." Mathematics 11, no. 3 (January 17, 2023): 498. http://dx.doi.org/10.3390/math11030498.

Full text
Abstract:
The sounds of certain industrial products (machines) carry important information about these products. Product classification or malfunction detection can be performed utilizing a product’s sound. In this regard, sound can be used as it is, or it can be mapped to either features or images. The latter enables the implementation of recently achieved performance improvements in image processing. In this paper, the sounds of seven industrial products are mapped into mel-spectrograms. The similarities of these images within the same class (machine type) and between classes, representing the intraclass and interclass similarities, respectively, are investigated. Three often-used image similarity measures are applied: Euclidean distance (ED), the Pearson correlation coefficient (PCC), and the structural similarity index (SSIM). These measures are mutually compared to analyze their behavior in this particular use case. According to the obtained results, the mel-spectrograms of five classes are similar, while two classes have unique properties manifested in considerably larger intraclass than interclass similarity. The applied image similarity measures lead to similar general results showing the same main trends, but they differ, for example, in the mutual relationships of similarity among classes. The differences between the images are more blurred when the SSIM is applied than when using ED and the PCC.
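The three similarity measures named in the abstract can be computed directly on mel-spectrogram "images" of machine sounds, as in the sketch below. The spectrogram size and the min-max normalization are assumptions; the paper's exact preprocessing may differ.

```python
# Sketch: Euclidean distance (ED), Pearson correlation (PCC) and SSIM between
# two mel-spectrogram images of industrial product sounds.
import numpy as np
import librosa
from scipy.stats import pearsonr
from skimage.metrics import structural_similarity as ssim

def mel_image(path, sr=16000, n_mels=64):
    """Log-mel spectrogram scaled to [0, 1] so it can be compared like an image."""
    y, _ = librosa.load(path, sr=sr, duration=2.0)
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    return (S - S.min()) / (S.max() - S.min() + 1e-12)

def similarities(img_a, img_b):
    """Return (ED, PCC, SSIM) for two equally sized spectrogram images."""
    ed = np.linalg.norm(img_a - img_b)                # lower = more similar
    pcc, _ = pearsonr(img_a.ravel(), img_b.ravel())   # higher = more similar
    s = ssim(img_a, img_b, data_range=1.0)            # higher = more similar
    return ed, pcc, s

# Intraclass vs. interclass analysis averages these measures over all image pairs
# drawn from the same machine type and from different machine types, respectively.
```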
APA, Harvard, Vancouver, ISO, and other styles
34

SHIRAISHI, Toshihiko, and Tomoki DOURA. "Blind source separation by multilayer neural network classifiers for spectrogram analysis." Mechanical Engineering Journal 6, no. 6 (2019): 18–00527. http://dx.doi.org/10.1299/mej.18-00527.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Dumitrescu, Cătălin, Marius Minea, Ilona Mădălina Costea, Ionut Cosmin Chiva, and Augustin Semenescu. "Development of an Acoustic System for UAV Detection." Sensors 20, no. 17 (August 28, 2020): 4870. http://dx.doi.org/10.3390/s20174870.

Full text
Abstract:
The purpose of this paper is to investigate the possibility of developing and using an intelligent, flexible, and reliable acoustic system designed to discover, locate, and transmit the position of unmanned aerial vehicles (UAVs). Such an application is very useful for monitoring sensitive areas and land territories subject to privacy. The software functional components of the proposed detection and location algorithm were developed employing acoustic signal analysis and concurrent neural networks (CoNNs). An analysis of the detection and tracking performance for remotely piloted aircraft systems (RPASs), measured with a dedicated spiral microphone array with MEMS microphones, was also performed. The detection and tracking algorithms were implemented based on spectrogram decomposition and adaptive filters. In this research, spectrograms with Cohen-class decomposition, log-Mel spectrograms, harmonic-percussive source separation, and raw audio waveforms of the audio samples collected from the spiral microphone array were used as inputs to the concurrent neural networks in order to determine and classify the number of detected drones in the perimeter of interest.
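Two of the representations listed above, harmonic-percussive source separation (HPSS) and the log-Mel spectrogram, are readily computed with librosa, as sketched below for one microphone-array channel. The sample rate and mel settings are assumptions, and the CoNN classifier that consumes these features is not reproduced here.

```python
# Sketch: HPSS and log-Mel spectrogram features for one audio channel.
import numpy as np
import librosa

def uav_features(y, sr=44100, n_mels=128):
    """Return (log_mel, harmonic, percussive) representations of one channel."""
    harmonic, percussive = librosa.effects.hpss(y)        # time-domain HPSS
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel, harmonic, percussive

# Each representation (and the raw waveform itself) can then be presented to a
# separate network branch, matching the multi-input scheme described above.
```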
APA, Harvard, Vancouver, ISO, and other styles
36

Bayram, Barış, and Gökhan İnce. "An Incremental Class-Learning Approach with Acoustic Novelty Detection for Acoustic Event Recognition." Sensors 21, no. 19 (October 5, 2021): 6622. http://dx.doi.org/10.3390/s21196622.

Full text
Abstract:
Acoustic scene analysis (ASA) relies on the dynamic sensing and understanding of stationary and non-stationary sounds from various events, background noises, and human actions with objects. However, the spatio-temporal nature of the sound signals may not be stationary, and novel events may exist that eventually deteriorate the performance of the analysis. In this study, a self-learning-based ASA for acoustic event recognition (AER) is presented to detect and incrementally learn novel acoustic events while tackling catastrophic forgetting. The proposed ASA framework comprises six elements: (1) raw acoustic signal pre-processing, (2) low-level and deep audio feature extraction, (3) acoustic novelty detection (AND), (4) acoustic signal augmentation, (5) incremental class-learning (ICL) of the audio features of the novel events, and (6) AER. The self-learning on different types of audio features extracted from the acoustic signals of various events occurs without human supervision. For the extraction of deep audio representations, in addition to visual geometry group (VGG) and residual neural network (ResNet) models, time-delay neural network (TDNN) and TDNN-based long short-term memory (TDNN-LSTM) networks are pre-trained using a large-scale audio dataset, Google AudioSet. The performance of ICL with AND is validated using Mel-spectrograms, as well as deep features extracted from the Mel-spectrograms with TDNNs, VGG, and ResNet, on benchmark audio datasets such as ESC-10, ESC-50, UrbanSound8K (US8K), and an audio dataset collected by the authors in a real domestic environment.
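Element (2) of the framework, turning mel-spectrograms into deep audio embeddings with a pretrained image backbone, can be sketched as below using ResNet50; these embeddings would then feed the novelty detection and incremental class-learning stages. The 3-channel replication and pooling choice are assumptions, and the paper additionally uses VGG, TDNN and TDNN-LSTM extractors not shown here.

```python
# Sketch: mel-spectrogram -> resized 3-channel image -> 2048-d ResNet50 embedding.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def deep_embedding(y, sr=16000):
    """Return a fixed-length deep feature vector for one audio clip."""
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    S = (S - S.min()) / (S.max() - S.min() + 1e-12) * 255.0   # scale like an image
    img = tf.image.resize(S[..., None], (224, 224))           # (224, 224, 1)
    img = tf.repeat(img, 3, axis=-1)                           # fake RGB channels
    x = preprocess_input(tf.expand_dims(img, 0))
    return backbone(x, training=False).numpy().squeeze()       # shape (2048,)
```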
APA, Harvard, Vancouver, ISO, and other styles
37

Dalal, Sarang S., Johanna M. Zumer, Adrian G. Guggisberg, Michael Trumpis, Daniel D. E. Wong, Kensuke Sekihara, and Srikantan S. Nagarajan. "MEG/EEG Source Reconstruction, Statistical Evaluation, and Visualization with NUTMEG." Computational Intelligence and Neuroscience 2011 (2011): 1–17. http://dx.doi.org/10.1155/2011/758973.

Full text
Abstract:
NUTMEG is a source analysis toolbox geared towards cognitive neuroscience researchers using MEG and EEG, including intracranial recordings. Evoked and unaveraged data can be imported to the toolbox for source analysis in either the time or time-frequency domains. NUTMEG offers several variants of adaptive beamformers, probabilistic reconstruction algorithms, as well as minimum-norm techniques to generate functional maps of spatiotemporal neural source activity. Lead fields can be calculated from single and overlapping sphere head models or imported from other software. Group averages and statistics can be calculated as well. In addition to data analysis tools, NUTMEG provides a unique and intuitive graphical interface for visualization of results. Source analyses can be superimposed onto a structural MRI or headshape to provide a convenient visual correspondence to anatomy. These results can also be navigated interactively, with the spatial maps and source time series or spectrogram linked accordingly. Animations can be generated to view the evolution of neural activity over time. NUTMEG can also display brain renderings and perform spatial normalization of functional maps using SPM's engine. As a MATLAB package, the end user may easily link with other toolboxes or add customized functions.
APA, Harvard, Vancouver, ISO, and other styles
38

Ciborowski, Tomasz, Szymon Reginis, Dawid Weber, Adam Kurowski, and Bozena Kostek. "Classifying Emotions in Film Music—A Deep Learning Approach." Electronics 10, no. 23 (November 27, 2021): 2955. http://dx.doi.org/10.3390/electronics10232955.

Full text
Abstract:
The paper presents an application for automatically classifying emotions in film music. A model of emotions is proposed, which is also associated with colors. The model created has nine emotional states, to which colors are assigned according to the color theory in film. Subjective tests are carried out to check the correctness of the assumptions behind the adopted emotion model. For that purpose, a statistical analysis of the subjective test results is performed. The application employs a deep convolutional neural network (CNN), which classifies emotions based on 30 s excerpts of music works presented to the CNN input using mel-spectrograms. Examples of classification results of the selected neural networks used to create the system are shown.
APA, Harvard, Vancouver, ISO, and other styles
39

Salian, Beenaa, Omkar Narvade, Rujuta Tambewagh, and Smita Bharne. "Speech Emotion Recognition using Time Distributed CNN and LSTM." ITM Web of Conferences 40 (2021): 03006. http://dx.doi.org/10.1051/itmconf/20214003006.

Full text
Abstract:
Speech has several distinguishing characteristic features and remains a state-of-the-art source for extracting valuable information from audio samples. Our aim is to develop an emotion recognition system using these speech features that can accurately and efficiently recognize emotions through audio analysis. In this article, we have employed a hybrid neural network comprising four blocks of time-distributed convolutional layers followed by a Long Short-Term Memory layer to achieve this. The audio samples for the speech dataset are collectively assembled from the RAVDESS, TESS, and SAVEE audio datasets and are further augmented by injecting noise. Mel spectrograms are computed from the audio samples and used to train the neural network. We have achieved a testing accuracy of about 89.26%.
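A minimal sketch of this hybrid, noise-injection augmentation plus time-distributed convolution over mel-spectrogram windows followed by an LSTM, is given below. Only one convolutional block is shown for brevity, and the window counts, filter sizes, and noise level are assumptions rather than the authors' values.

```python
# Sketch: noise augmentation + TimeDistributed CNN + LSTM over mel-spectrogram windows.
import numpy as np
from tensorflow.keras import layers, models

def add_noise(y, noise_factor=0.005):
    """Simple augmentation: add white noise scaled by noise_factor."""
    return y + noise_factor * np.random.randn(len(y))

def build_model(n_windows=8, mel_bins=64, frames=32, n_classes=8):
    """Input: (n_windows, mel_bins, frames, 1) mel-spectrogram chunks per utterance."""
    return models.Sequential([
        layers.Input(shape=(n_windows, mel_bins, frames, 1)),
        layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu", padding="same")),
        layers.TimeDistributed(layers.MaxPooling2D(2)),
        layers.TimeDistributed(layers.Flatten()),   # one feature vector per window
        layers.LSTM(64),                            # temporal modelling across windows
        layers.Dense(n_classes, activation="softmax"),
    ])
```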
APA, Harvard, Vancouver, ISO, and other styles
40

Kostek, Bozena. "Analysis-by-synthesis paradigm evolved into a new concept." Journal of the Acoustical Society of America 152, no. 4 (October 2022): A178. http://dx.doi.org/10.1121/10.0015955.

Full text
Abstract:
This work aims at showing how the well-known analysis-by-synthesis paradigm has recently evolved into a new concept. However, in contrast to the original idea, which states that the created sound should not fail to pass the foolproof synthesis test, the recent development is a consequence of the need to create new data. Deep learning models are greedy algorithms requiring a vast amount of data that, in addition, should be correctly annotated. Annotation is a bottleneck to obtaining quality-reliable data, as the process relies on the annotator's experience and, in many cases, on personality-related issues. So, the new approach is to create synthesized data based on a thorough analytical examination of a musical/speech signal, resulting in cues that tell a deep model how to populate data to overcome this problem. Typically, a 2D feature space is employed, e.g., mel spectrograms, cepstrograms, chromagrams, etc., or a wave-based representation with its counterpart on the algorithmic side called WaveNet. In this paper, examples of 2D musical/speech signal representations are presented, along with the deep models applied. Creating new data in the context of applications is also shown. In conclusion, further possible directions for the development of this paradigm, which is now beyond the conceptual phase, are presented.
APA, Harvard, Vancouver, ISO, and other styles
41

Xu, Xiaona, Li Yang, Yue Zhao, and Hui Wang. "End-to-End Speech Synthesis for Tibetan Multidialect." Complexity 2021 (January 25, 2021): 1–8. http://dx.doi.org/10.1155/2021/6682871.

Full text
Abstract:
The research on Tibetan speech synthesis technology has mainly focused on a single dialect, and thus there is a lack of research on Tibetan multidialect speech synthesis technology. This paper presents an end-to-end Tibetan multidialect speech synthesis model to realize a speech synthesis system which can be used to synthesize different Tibetan dialects. Firstly, the Wylie transliteration scheme is used to convert the Tibetan text into the corresponding Latin letters, which effectively reduces the size of the training corpus and the workload of front-end text processing. Secondly, a shared feature prediction network with a cyclic sequence-to-sequence structure is built, which maps the Latin transliteration vectors of Tibetan characters to Mel spectrograms and learns the relevant features of multidialect speech data. Thirdly, two dialect-specific WaveNet vocoders are combined with the feature prediction network, which synthesize the Mel spectrograms of the Lhasa-Ü-Tsang and Amdo pastoral dialects into time-domain waveforms, respectively. The model avoids using a large amount of Tibetan dialect expertise for time-consuming tasks such as phonetic analysis and phonological annotation. Additionally, it can directly synthesize Lhasa-Ü-Tsang and Amdo pastoral speech from the existing text annotation. The experimental results show that the synthesized speech of the Lhasa-Ü-Tsang and Amdo pastoral dialects based on our proposed method has better clarity and naturalness than the Tibetan monolingual model.
APA, Harvard, Vancouver, ISO, and other styles
42

Zhang, Lilun, Dezhi Wang, Changchun Bao, Yongxian Wang, and Kele Xu. "Large-Scale Whale-Call Classification by Transfer Learning on Multi-Scale Waveforms and Time-Frequency Features." Applied Sciences 9, no. 5 (March 12, 2019): 1020. http://dx.doi.org/10.3390/app9051020.

Full text
Abstract:
Whale vocal calls contain valuable information and abundant characteristics that are important for the classification of whale sub-populations and related biological research. In this study, an effective data-driven approach based on pre-trained Convolutional Neural Networks (CNN) using multi-scale waveforms and time-frequency feature representations is developed in order to perform the classification of whale calls from a large open-source dataset recorded by sensors carried by whales. Specifically, the classification is carried out through a transfer learning approach using pre-trained state-of-the-art CNN models from the field of computer vision. 1D raw waveforms and 2D log-mel features of the whale-call data are respectively used as the input of the CNN models. For the raw waveform input, windows are applied to capture multiple sketches of a whale-call clip at different time scales, and the features from the different sketches are stacked for classification. When using the log-mel features, the delta and delta-delta features are also calculated to produce a 3-channel feature representation for analysis. In training, a 4-fold cross-validation technique is employed to reduce overfitting, while the Mix-up technique is applied for data augmentation in order to further improve the system performance. The results show that the proposed method improves the accuracies by more than 20 percentage points for the classification into 16 whale pods compared with the baseline method, which uses groups of 2D shape descriptors of spectrograms and Fisher discriminant scores on the same dataset. Moreover, it is shown that classifications based on log-mel features have higher accuracies than those based directly on raw waveforms. A phylogeny graph is also produced to illustrate the relationships among the whale sub-populations.
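The 3-channel time-frequency input described above, log-mel features stacked with their delta and delta-delta coefficients, and the Mix-up augmentation can be sketched as follows. The sample rate and mel settings are assumptions; the whale-call dataset itself is not reproduced here.

```python
# Sketch: log-mel + delta + delta-delta stacked as a 3-channel input, plus Mix-up.
import numpy as np
import librosa

def three_channel_logmel(y, sr=32000, n_mels=96):
    """Return an array of shape (n_mels, frames, 3): [log-mel, delta, delta-delta]."""
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels), ref=np.max)
    d1 = librosa.feature.delta(log_mel, order=1)
    d2 = librosa.feature.delta(log_mel, order=2)
    return np.stack([log_mel, d1, d2], axis=-1)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mix-up augmentation: blend two examples and their one-hot label vectors."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```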
APA, Harvard, Vancouver, ISO, and other styles
43

Gourishetti, Saichand, David Johnson, Sara Werner, András Kátai, and Peter Holstein. "Partial discharge monitoring using deep neural networks with acoustic emission." INTER-NOISE and NOISE-CON Congress and Conference Proceedings 263, no. 3 (August 1, 2021): 3312–23. http://dx.doi.org/10.3397/in-2021-2373.

Full text
Abstract:
The occurrence of partial discharge (PD) indicates failures in electrical equipment. Depending on the equipment and operating conditions, each type of PD has its own acoustic characteristics and a wide frequency spectrum. To detect PD, electrical equipment is often monitored using various sensors, such as microphones, ultrasonic sensors, and transient-earth voltage sensors, whose signals are then analyzed manually by experts using signal processing techniques. This process requires significant expertise and time, both of which are costly. Advancements in machine learning aim to address this issue by automatically learning a representation of the signal, minimizing the need for expert analysis. To this end, we propose a deep learning-based solution for the automatic detection of PD using airborne sound emission in the audible to ultrasonic range. As input to our proposed model, we evaluate common time-frequency representations of the acoustic signal, such as the short-time Fourier transform, the continuous wavelet transform, and Mel spectrograms. The spectra extracted from the PD signal pulses are used to train and evaluate the proposed deep neural network models for the detection of different types of PD. Compared to the manual process, the automatic solution is beneficial for maintenance processes and measurement technology.
APA, Harvard, Vancouver, ISO, and other styles
44

Witte, H., and M. Wacker. "Time-frequency Techniques in Biomedical Signal Analysis." Methods of Information in Medicine 52, no. 04 (2013): 279–96. http://dx.doi.org/10.3414/me12-01-0083.

Full text
Abstract:
Summary. Objectives: This review outlines the methodological fundamentals of the most frequently used non-parametric time-frequency analysis techniques in biomedicine and their main properties, as well as providing decision aids concerning their applications. Methods: The short-term Fourier transform (STFT), the Gabor transform (GT), the S-transform (ST), the continuous Morlet wavelet transform (CMWT), and the Hilbert transform (HT) are introduced as linear transforms by using a unified concept of the time-frequency representation which is based on a standardized analytic signal. The Wigner-Ville distribution (WVD) serves as an example of the ‘quadratic transforms’ class. The combination of WVD and GT with the matching pursuit (MP) decomposition and that of the HT with the empirical mode decomposition (EMD) are explained; these belong to the class of signal-adaptive approaches. Results: Similarities between linear transforms are demonstrated and differences with regard to the time-frequency resolution and interference (cross) terms are presented in detail. By means of simulated signals the effects of different time-frequency resolutions of the GT, CMWT, and WVD as well as the resolution-related properties of the interference (cross) terms are shown. The method-inherent drawbacks and their consequences for the application of the time-frequency techniques are demonstrated by instantaneous amplitude, frequency and phase measures and related time-frequency representations (spectrogram, scalogram, time-frequency distribution, phase-locking maps) of measured magnetoencephalographic (MEG) signals. Conclusions: The appropriate selection of a method and its parameter settings will ensure readability of the time-frequency representations and reliability of results. When the time-frequency characteristics of a signal strongly correspond with the time-frequency resolution of the analysis then a method may be considered ‘optimal’. The MP-based signal-adaptive approaches are preferred as these provide an appropriate time-frequency resolution for all frequencies while simultaneously reducing interference (cross) terms.
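A short worked example for one of the linear techniques reviewed above: the Hilbert transform builds the analytic signal, from which instantaneous amplitude, phase and frequency follow directly. The chirp test signal and its parameters are illustrative assumptions, not taken from the review.

```python
# Sketch: analytic signal via the Hilbert transform and instantaneous measures.
import numpy as np
from scipy.signal import hilbert, chirp

fs = 1000.0                                   # sampling rate in Hz
t = np.arange(0, 2.0, 1 / fs)
x = chirp(t, f0=5.0, f1=50.0, t1=2.0)         # frequency sweeps from 5 Hz to 50 Hz

analytic = hilbert(x)                         # x + j * H{x}
inst_amplitude = np.abs(analytic)             # envelope
inst_phase = np.unwrap(np.angle(analytic))    # unwrapped phase in radians
inst_frequency = np.diff(inst_phase) / (2.0 * np.pi) * fs   # Hz, one sample shorter

# For a clean chirp, inst_frequency rises approximately linearly from 5 Hz to 50 Hz,
# the kind of instantaneous measure the review relates to spectrograms, scalograms
# and phase-locking maps.
```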
APA, Harvard, Vancouver, ISO, and other styles
45

Ostler, Daniel, Matthias Seibold, Jonas Fuchtmann, Nicole Samm, Hubertus Feussner, Dirk Wilhelm, and Nassir Navab. "Acoustic signal analysis of instrument–tissue interaction for minimally invasive interventions." International Journal of Computer Assisted Radiology and Surgery 15, no. 5 (April 22, 2020): 771–79. http://dx.doi.org/10.1007/s11548-020-02146-7.

Full text
Abstract:
Purpose: Minimally invasive surgery (MIS) has become the standard for many surgical procedures as it minimizes trauma, reduces infection rates and shortens hospitalization. However, the manipulation of objects in the surgical workspace can be difficult due to the unintuitive handling of instruments and limited range of motion. Apart from the advantages of robot-assisted systems such as augmented view or improved dexterity, both robotic and MIS techniques introduce drawbacks such as limited haptic perception and their major reliance on visual perception. Methods: In order to address the above-mentioned limitations, a perception study was conducted to investigate whether the transmission of intra-abdominal acoustic signals can potentially improve the perception during MIS. To investigate whether these acoustic signals can be used as a basis for further automated analysis, a large audio data set capturing the application of electrosurgery on different types of porcine tissue was acquired. A sliding window technique was applied to compute log-mel-spectrograms, which were fed to a pre-trained convolutional neural network for feature extraction. A fully connected layer was trained on the intermediate feature representation to classify instrument–tissue interaction. Results: The perception study revealed that acoustic feedback has potential to improve the perception during MIS and to serve as a basis for further automated analysis. The proposed classification pipeline yielded excellent performance for four types of instrument–tissue interaction (muscle, fascia, liver and fatty tissue) and achieved top-1 accuracies of up to 89.9%. Moreover, our model is able to distinguish electrosurgical operation modes with an overall classification accuracy of 86.40%. Conclusion: Our proof-of-principle indicates great application potential for guidance systems in MIS, such as controlled tissue resection. Supported by a pilot perception study with surgeons, we believe that utilizing audio signals as an additional information channel has great potential to improve the surgical performance and to partly compensate the loss of haptic feedback.
APA, Harvard, Vancouver, ISO, and other styles
46

"Spoken Language Identification using CNN with Log Mel Spectrogram Features in Indian Context." International Journal of Advanced Trends in Computer Science and Engineering 11, no. 6 (December 9, 2022): 273–79. http://dx.doi.org/10.30534/ijatcse/2022/071162022.

Full text
Abstract:
This study demonstrates a novel application of log Mel spectrogram coefficients, treated as images and classified with Convolutional Neural Networks (CNN), to spoken language identification. The acoustic features, obtained as log mel spectrogram images, are used throughout this article. This image-based representation helps make the system noise-resistant and robust to channel mismatch. The majority of Indian languages from our own dataset were used. With these auditory features integrated into a CNN, we aim to detect a language quickly and accurately. The InceptionV3 and ResNet50 models are also used in this study for performance analysis. Compared to the existing system, these approaches achieved significant improvements in language identification accuracy.
APA, Harvard, Vancouver, ISO, and other styles
47

"Music Genre Classification using Spectral Analysis Techniques With Hybrid Convolution-Recurrent Neural Network." International Journal of Innovative Technology and Exploring Engineering 9, no. 1 (November 10, 2019): 149–54. http://dx.doi.org/10.35940/ijitee.a3956.119119.

Full text
Abstract:
In this work, the objective is to classify audio data from the GTZAN dataset, which contains 10 genres, into specific genres. First, each audio track is split into clips containing homogeneous content. The Short-time Fourier Transform (STFT), the Mel-spectrogram, and Mel-frequency cepstral coefficients (MFCC) are the most common feature extraction techniques, and each has been successful in its own range of audio applications. These features are then fed to a Convolutional Neural Network (CNN) model and a VGG16 model, which consists of a 16-layer convolutional network. We evaluate the different feature extraction techniques with the CNN and VGG16 models, with and without different Recurrent Neural Network (RNN) layers, and report the performance measures. The best model achieves an overall accuracy of 95.5% for this task.
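The three feature extraction techniques compared above can be computed with librosa as in the sketch below; any of the resulting 2D arrays can be fed to the CNN / VGG16 (+ RNN) models as an image-like input. The frame and FFT sizes are assumptions.

```python
# Sketch: STFT magnitude, log-Mel spectrogram and MFCCs for one 30-second clip.
import numpy as np
import librosa

def gtzan_features(path, sr=22050, n_fft=2048, hop_length=512):
    y, _ = librosa.load(path, sr=sr, duration=30.0)
    stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))          # |STFT|
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=128))      # log-mel
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop_length)   # MFCCs
    return stft, mel, mfcc
```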
APA, Harvard, Vancouver, ISO, and other styles
48

Saishu, Yuki, Amir Hossein Poorjam, and Mads Græsbøll Christensen. "A CNN-based approach to identification of degradations in speech signals." EURASIP Journal on Audio, Speech, and Music Processing 2021, no. 1 (February 5, 2021). http://dx.doi.org/10.1186/s13636-021-00198-4.

Full text
Abstract:
The presence of degradations in speech signals, which causes acoustic mismatch between training and operating conditions, deteriorates the performance of many speech-based systems. A variety of enhancement techniques have been developed to compensate for the acoustic mismatch in speech-based applications. To apply these signal enhancement techniques, however, it is necessary to have prior information about the presence and the type of degradations in the speech signals. In this paper, we propose a new convolutional neural network (CNN)-based approach to automatically identify the major types of degradations commonly encountered in speech-based applications, namely additive noise, nonlinear distortion, and reverberation. In this approach, a set of parallel CNNs, each detecting a certain degradation type, is applied to the log-mel spectrogram of the audio signals. Experimental results using two different speech types, namely pathological voice and normal running speech, show the effectiveness of the proposed method in detecting the presence and the type of degradations in speech signals, outperforming the state-of-the-art method. Using score-weighted class activation mapping, we provide a visual analysis of how the network makes its decision when identifying different types of degradation, by highlighting the regions of the log-mel spectrogram that are most influential for the target degradation.
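The parallel-detector idea described above, one binary CNN per degradation type run on the same log-mel spectrogram, can be sketched as follows. The network sizes are assumptions; the paper's exact architecture and training details are not reproduced.

```python
# Sketch: one small binary CNN per degradation type, all sharing the log-mel input.
import numpy as np
import librosa
from tensorflow.keras import layers, models

DEGRADATIONS = ["additive_noise", "nonlinear_distortion", "reverberation"]

def log_mel(y, sr=16000, n_mels=80):
    return librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))

def binary_detector(input_shape):
    """One CNN with a sigmoid output: probability that its degradation is present."""
    return models.Sequential([
        layers.Input(shape=input_shape + (1,)),
        layers.Conv2D(16, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"), layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),
    ])

# One independently trained detector per degradation type:
# detectors = {name: binary_detector((80, n_frames)) for name in DEGRADATIONS}
# present = {name: det.predict(spec[None, ..., None])[0, 0] > 0.5
#            for name, det in detectors.items()}
```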
APA, Harvard, Vancouver, ISO, and other styles
49

Sukumaran, Poornima, and Kousalya Govardhanan. "Towards voice based prediction and analysis of emotions in ASD children." Journal of Intelligent & Fuzzy Systems, March 22, 2021, 1–10. http://dx.doi.org/10.3233/jifs-189854.

Full text
Abstract:
Voice processing has proven to be an effective way of recognizing the various emotions of people. The objective of this research is to identify the presence of Autism Spectrum Disorder (ASD) and to analyze the emotions of autistic children through their voices. The presented automated voice-based system can detect and classify seven basic emotions (anger, disgust, neutral, happiness, calmness, fear, and sadness) expressed by children through source parameters associated with their voices. Prime voice features such as Mel-frequency Cepstral Coefficients (MFCC) and the spectrogram are extracted and utilized to train a Multi-Layer Perceptron (MLP) classifier to identify the possible emotions exhibited by the children, thereby assessing their behavioral state. The proposed work therefore helps in the examination of emotions in autistic children, which can be used to assess the kind of training and care required to enhance their lifestyle.
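The classification stage described above, summary MFCC and spectrogram features per voice sample fed to a Multi-Layer Perceptron, can be sketched as below. The seven-emotion label set comes from the abstract; the feature dimensions and MLP sizes are assumptions.

```python
# Sketch: MFCC + spectrogram summary features classified with an MLP (scikit-learn).
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

EMOTIONS = ["anger", "disgust", "neutral", "happiness", "calmness", "fear", "sadness"]

def voice_features(path, sr=22050, n_mfcc=40):
    """Mean MFCC vector plus mean log-spectrogram energy per frequency band."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    spec = librosa.amplitude_to_db(np.abs(librosa.stft(y))).mean(axis=1)
    return np.concatenate([mfcc, spec])

# Usage with hypothetical lists `paths` and integer `labels`:
# X = np.stack([voice_features(p) for p in paths])
# X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
# clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500).fit(X_tr, y_tr)
# print("accuracy:", clf.score(X_te, y_te))
```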
APA, Harvard, Vancouver, ISO, and other styles
50

Reghunath, Lekshmi Chandrika, and Rajeev Rajan. "Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music." EURASIP Journal on Audio, Speech, and Music Processing 2022, no. 1 (May 16, 2022). http://dx.doi.org/10.1186/s13636-022-00245-8.

Full text
Abstract:
Multiple predominant instrument recognition in polyphonic music is addressed using decision-level fusion of three transformer-based architectures on an ensemble of visual representations. The ensemble consists of the Mel-spectrogram, modgdgram, and tempogram. Predominant instrument recognition refers to the problem where the prominent instrument is identified from a mixture of instruments being played together. We experimented with two transformer architectures, namely the Vision transformer (Vi-T) and the Shifted window transformer (Swin-T), for the proposed task. The performance of the proposed system is compared with that of the state-of-the-art Han’s model, convolutional neural networks (CNN), and deep neural networks (DNN). The transformer networks learn distinctive local characteristics from the visual representations and classify the instrument to the group where it belongs. The proposed system is systematically evaluated using the IRMAS dataset with eleven classes. A wave generative adversarial network (WaveGAN) architecture is also employed to generate audio files for data augmentation. We train our networks on fixed-length music excerpts with a single-labeled predominant instrument and estimate an arbitrary number of predominant instruments from variable-length test audio files without any sliding window analysis and aggregation strategy as in existing algorithms. The ensemble voting scheme using Swin-T reports micro and macro F1 scores of 0.66 and 0.62, respectively. These metrics are 3.12% and 12.72% relatively higher than those obtained by the state-of-the-art Han’s model. The architectural choice of transformers with ensemble voting on Mel-spectro-/modgd-/tempogram has merit in recognizing the predominant instruments in polyphonic music.
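Two ingredients named above can be sketched briefly: the tempogram as one of the three visual representations, and decision-level fusion by voting across per-representation classifiers. The modgdgram is not standard in librosa and is omitted; the classifier outputs here are placeholders, and majority voting is only one plausible instance of the ensemble voting scheme mentioned in the abstract.

```python
# Sketch: tempogram computation (librosa) and a simple majority-vote fusion rule.
import numpy as np
import librosa

def tempogram_image(y, sr=44100, hop_length=512):
    """Onset-strength-based tempogram, one of the three visual inputs."""
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    return librosa.feature.tempogram(onset_envelope=onset_env, sr=sr,
                                     hop_length=hop_length)

def majority_vote(prob_mel, prob_modgd, prob_tempo, threshold=0.5):
    """Decision-level fusion: each branch votes for the classes it deems present;
    a class is predicted when at least two of the three branches agree."""
    votes = (np.stack([prob_mel, prob_modgd, prob_tempo]) >= threshold).sum(axis=0)
    return votes >= 2

# Example: probability vectors shaped (n_instrument_classes,) from three branches
# predicted = majority_vote(p1, p2, p3)   # boolean mask of predominant instruments
```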
APA, Harvard, Vancouver, ISO, and other styles