
Journal articles on the topic 'Speech and audio signals'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Speech and audio signals.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Rao, G. Manmadha, Raidu Babu D.N, Krishna Kanth P.S.L, Vinay B., and Nikhil V. "Reduction of Impulsive Noise from Speech and Audio Signals by using Sd-Rom Algorithm." International Journal of Recent Technology and Engineering 10, no. 1 (May 30, 2021): 265–68. http://dx.doi.org/10.35940/ijrte.a5943.0510121.

Full text
Abstract:
Noise removal is at the heart of speech and audio signal processing, and impulse noise is one of the most significant noise types corrupting different parts of speech and audio signals. To remove this type of noise, the technique proposed in this work is the signal-dependent rank-ordered mean (SD-ROM) method in its recursive version. The technique replaces impulse-noise samples based on neighbouring samples: corrupted samples are detected from rank-ordered differences compared against threshold values, without changing the features or tonal quality of the signal. Once a sample is detected as corrupted, it is replaced with the rank-ordered mean value, which depends on the sliding-window size and the neighbouring samples. The technique shows good results in terms of signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR) when compared with other techniques, and is mainly used for removing impulse noise from speech and audio signals.
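As a rough illustration of the idea summarized above (a simplified sketch, not the authors' exact SD-ROM implementation), the snippet below tests each sample of a sliding window against rank-ordered differences and, if it looks impulsive, replaces it with the rank-ordered mean of its neighbours. The window length and threshold values are illustrative assumptions (thresholds here assume integer-scaled samples).

import numpy as np

def sd_rom_filter(x, window=5, thresholds=(8.0, 15.0)):
    """Simplified recursive signal-dependent rank-ordered mean (SD-ROM) filter.

    x          : 1-D numpy array of audio samples
    window     : odd sliding-window length; the centre sample is tested
    thresholds : assumed detection thresholds for the rank-ordered differences
    """
    y = x.astype(float).copy()
    half = window // 2
    for n in range(half, len(y) - half):
        neighbours = np.concatenate((y[n - half:n], y[n + 1:n + half + 1]))
        ranked = np.sort(neighbours)
        rom = 0.5 * (ranked[half - 1] + ranked[half])   # rank-ordered mean
        # rank-ordered differences: how far the centre sample lies outside the
        # range of its ranked neighbours, and how far it is from the ROM
        d1 = max(ranked[0] - y[n], y[n] - ranked[-1])
        d2 = abs(y[n] - rom)
        if d1 > thresholds[0] or d2 > thresholds[1]:
            y[n] = rom                                   # replace the corrupted sample
    return y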
APA, Harvard, Vancouver, ISO, and other styles
2

S. Ashwin, J., and N. Manoharan. "Audio Denoising Based on Short Time Fourier Transform." Indonesian Journal of Electrical Engineering and Computer Science 9, no. 1 (January 1, 2018): 89. http://dx.doi.org/10.11591/ijeecs.v9.i1.pp89-92.

Full text
Abstract:
This paper presents a novel audio de-noising scheme for a given speech signal. Recovering the original signal from the communication channel without any noise is a difficult task, and many de-noising techniques have been proposed for removing noise from a digital signal. In this paper, an audio de-noising technique based on the Short Time Fourier Transform (STFT) is implemented. The proposed architecture uses a novel approach to estimate environmental noise from speech adaptively. Original speech signals are given as the input, noise is added to the signal using AWGN, and the noisy signals are then de-noised using STFT techniques. Finally, signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR) values are obtained for the noisy and de-noised signals.
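The abstract ends with SNR and PSNR figures for the noisy and de-noised signals; a minimal sketch of how such metrics are commonly computed against a clean reference is shown below (standard formulas, not code from the paper).

import numpy as np

def snr_db(clean, processed):
    noise = clean - processed
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def psnr_db(clean, processed):
    mse = np.mean((clean - processed) ** 2)
    return 10 * np.log10(np.max(np.abs(clean)) ** 2 / mse)

# usage: compare the noisy and STFT-denoised versions against the clean reference
# print(snr_db(clean, noisy), snr_db(clean, denoised))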
APA, Harvard, Vancouver, ISO, and other styles
3

Kacur, Juraj, Boris Puterka, Jarmila Pavlovicova, and Milos Oravec. "Frequency, Time, Representation and Modeling Aspects for Major Speech and Audio Processing Applications." Sensors 22, no. 16 (August 22, 2022): 6304. http://dx.doi.org/10.3390/s22166304.

Full text
Abstract:
There are many speech and audio processing applications and their number is growing. They may cover a wide range of tasks, each having different requirements on the processed speech or audio signals and, therefore, indirectly, on the audio sensors as well. This article reports on tests and evaluation of the effect of basic physical properties of speech and audio signals on the recognition accuracy of major speech/audio processing applications, i.e., speech recognition, speaker recognition, speech emotion recognition, and audio event recognition. A particular focus is on frequency ranges, time intervals, a precision of representation (quantization), and complexities of models suitable for each class of applications. Using domain-specific datasets, eligible feature extraction methods and complex neural network models, it was possible to test and evaluate the effect of basic speech and audio signal properties on the achieved accuracies for each group of applications. The tests confirmed that the basic parameters do affect the overall performance and, moreover, this effect is domain-dependent. Therefore, accurate knowledge of the extent of these effects can be valuable for system designers when selecting appropriate hardware, sensors, architecture, and software for a particular application, especially in the case of limited resources.
APA, Harvard, Vancouver, ISO, and other styles
4

Nittrouer, Susan, and Joanna H. Lowenstein. "Beyond Recognition: Visual Contributions to Verbal Working Memory." Journal of Speech, Language, and Hearing Research 65, no. 1 (January 12, 2022): 253–73. http://dx.doi.org/10.1044/2021_jslhr-21-00177.

Full text
Abstract:
Purpose: It is well recognized that adding the visual to the acoustic speech signal improves recognition when the acoustic signal is degraded, but how that visual signal affects postrecognition processes is not so well understood. This study was designed to further elucidate the relationships among auditory and visual codes in working memory, a postrecognition process. Design: In a main experiment, 80 young adults with normal hearing were tested using an immediate serial recall paradigm. Three types of signals were presented (unprocessed speech, vocoded speech, and environmental sounds) in three conditions (audio-only, audio–video with dynamic visual signals, and audio–picture with static visual signals). Three dependent measures were analyzed: (a) magnitude of the recency effect, (b) overall recall accuracy, and (c) response times, to assess cognitive effort. In a follow-up experiment, 30 young adults with normal hearing were tested largely using the same procedures, but with a slight change in order of stimulus presentation. Results: The main experiment produced three major findings: (a) unprocessed speech evoked a recency effect of consistent magnitude across conditions; vocoded speech evoked a recency effect of similar magnitude to unprocessed speech only with dynamic visual (lipread) signals; environmental sounds never showed a recency effect. (b) Dynamic and static visual signals enhanced overall recall accuracy to a similar extent, and this enhancement was greater for vocoded speech and environmental sounds than for unprocessed speech. (c) All visual signals reduced cognitive load, except for dynamic visual signals with environmental sounds. The follow-up experiment revealed that dynamic visual (lipread) signals exerted their effect on the vocoded stimuli by enhancing phonological quality. Conclusions: Acoustic and visual signals can combine to enhance working memory operations, but the source of these effects differs for phonological and nonphonological signals. Nonetheless, visual information can support better postrecognition processes for patients with hearing loss.
APA, Harvard, Vancouver, ISO, and other styles
5

B, Nagesh, and Dr M. Uttara Kumari. "A Review on Machine Learning for Audio Applications." Journal of University of Shanghai for Science and Technology 23, no. 07 (June 30, 2021): 62–70. http://dx.doi.org/10.51201/jusst/21/06508.

Full text
Abstract:
Audio processing is an important branch of the signal processing domain. It deals with the manipulation of audio signals to achieve tasks such as filtering, data compression, speech processing, and noise suppression, which improve the quality of the audio signal. For applications such as natural language processing, speech generation, and automatic speech recognition, conventional algorithms are not sufficient; machine learning or deep learning algorithms are needed so that audio signal processing can be achieved with good results and accuracy. In this paper, a review of the various algorithms used by researchers in the past is presented, and the appropriate algorithms for the respective applications are identified.
APA, Harvard, Vancouver, ISO, and other styles
6

Kubanek, M., J. Bobulski, and L. Adrjanowicz. "Characteristics of the use of coupled hidden Markov models for audio-visual polish speech recognition." Bulletin of the Polish Academy of Sciences: Technical Sciences 60, no. 2 (October 1, 2012): 307–16. http://dx.doi.org/10.2478/v10175-012-0041-6.

Full text
Abstract:
This paper focuses on combining audio-visual signals for Polish speech recognition under conditions of a highly disturbed audio speech signal. Recognition of audio-visual speech was based on combined hidden Markov models (CHMM). The described methods were developed for single isolated commands; nevertheless, their effectiveness indicated that they would also work similarly in continuous audio-visual speech recognition. Visual speech analysis is very difficult and computationally demanding, mostly because of the extreme amount of data that needs to be processed. Therefore, audio-video speech recognition is used only when the audio speech signal is exposed to a considerable level of distortion. The authors' own methods for lip-edge detection and visual feature extraction are proposed in this paper. Moreover, a method for fusing the speech characteristics of an audio-video signal was proposed and tested. A significant increase in recognition effectiveness and processing speed was noted during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of audio-visual characteristics. The experimental results were very promising and close to those achieved by leading scientists in the field of audio-visual speech recognition.
APA, Harvard, Vancouver, ISO, and other styles
7

Timmermann, Johannes, Florian Ernst, and Delf Sachau. "Speech enhancement for helicopter headsets with an integrated ANC-system for FPGA-platforms." INTER-NOISE and NOISE-CON Congress and Conference Proceedings 265, no. 5 (February 1, 2023): 2720–30. http://dx.doi.org/10.3397/in_2022_0382.

Full text
Abstract:
During flights, helicopter pilots are exposed to high noise levels caused by the rotor, engine and wind. To protect the health of passengers and crew, noise-dampening headsets are used. Modern active noise control (ANC) headsets can further reduce the noise exposure for humans in helicopters. Internal or external voice transmission in the helicopter must be adapted to the noisy environment, and speech signals are therefore heavily amplified. To improve the quality of communication in helicopters, speech and background noise in the transmitted audio signals should be separated, and the noise components of the signal then eliminated. One established method for this type of speech enhancement is spectral subtraction. In this study, audio files recorded with an artificial head during a helicopter flight are used to evaluate a speech enhancement system with additional ANC capabilities on a rapid prototyping platform. Since both spectral subtraction and the ANC algorithm are computationally intensive, an FPGA is used. The results show a significant enhancement in the quality of the speech signals, which thus leads to improved communication. Furthermore, the enhanced audio signals can be used for voice recognition algorithms.
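Spectral subtraction, the enhancement method named above, can be sketched as follows (a simplified floating-point illustration, not the FPGA implementation): the noise magnitude spectrum is estimated from frames assumed to be speech-free and subtracted from each frame's magnitude before resynthesis. The frame length, noise-frame count, and spectral floor are assumptions.

import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, floor=0.02):
    # analysis
    f, t, X = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    # noise estimate from the first few frames (assumed to contain no speech)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # subtract and apply a spectral floor to limit musical noise
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)
    # synthesis with the noisy phase
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced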
APA, Harvard, Vancouver, ISO, and other styles
8

Abdallah, Hanaa A., and Souham Meshoul. "A Multilayered Audio Signal Encryption Approach for Secure Voice Communication." Electronics 12, no. 1 (December 20, 2022): 2. http://dx.doi.org/10.3390/electronics12010002.

Full text
Abstract:
In this paper, multilayer cryptosystems for encrypting audio communications are proposed. These cryptosystems combine audio signals with other active concealing signals, such as speech signals, by continuously fusing the audio signal with a speech signal without silent periods. The goal of these cryptosystems is to prevent unauthorized parties from listening to encrypted audio communications. Preprocessing is performed on both the speech signal and the audio signal before they are combined, as this is necessary to get the signals ready for fusion. Instead of encoding and decoding methods, the cryptosystems rely on the values of audio samples, which allows for saving time while increasing their resistance to hackers and environments with a noisy background. The main feature of the proposed approach is to consider three levels of encryption namely fusion, substitution, and permutation where various combinations are considered. The resulting cryptosystems are compared to the one-dimensional logistic map-based encryption techniques and other state-of-the-art methods. The performance of the suggested cryptosystems is evaluated by the use of the histogram, structural similarity index, signal-to-noise ratio (SNR), log-likelihood ratio, spectrum distortion, and correlation coefficient in simulated testing. A comparative analysis in relation to the encryption of logistic maps is given. This research demonstrates that increasing the level of encryption results in increased security. It is obvious that the proposed salting-based encryption method and the multilayer DCT/DST cryptosystem offer better levels of security as they attain the lowest SNR values, −25 dB and −2.5 dB, respectively. In terms of the used evaluation metrics, the proposed multilayer cryptosystem achieved the best results in discrete cosine transform and discrete sine transform, demonstrating a very promising performance.
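A minimal sketch of the three encryption layers mentioned above (fusion with a masking speech signal, sample-value substitution, and permutation), assuming a logistic-map key stream; this illustrates the general idea only and is not the paper's cryptosystem.

import numpy as np

def logistic_keystream(n, x0=0.61, r=3.99):
    """Chaotic key stream from the one-dimensional logistic map (assumed key)."""
    x, out = x0, np.empty(n)
    for i in range(n):
        x = r * x * (1 - x)
        out[i] = x
    return out

def encrypt(audio, mask_speech, x0=0.61):
    key = logistic_keystream(len(audio), x0)
    fused = audio + 0.5 * mask_speech[:len(audio)]   # layer 1: fusion with masking speech
    substituted = fused + (key - 0.5)                # layer 2: value substitution
    perm = np.argsort(key)                           # layer 3: key-driven permutation
    return substituted[perm], perm

def decrypt(cipher, perm, mask_speech, x0=0.61):
    key = logistic_keystream(len(cipher), x0)
    substituted = np.empty_like(cipher)
    substituted[perm] = cipher                       # undo permutation
    fused = substituted - (key - 0.5)                # undo substitution
    return fused - 0.5 * mask_speech[:len(cipher)]   # undo fusion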
APA, Harvard, Vancouver, ISO, and other styles
9

Yin, Shu Hua. "Design of the Auxiliary Speech Recognition System of Super-Short-Range Reconnaissance Radar." Applied Mechanics and Materials 556-562 (May 2014): 4830–34. http://dx.doi.org/10.4028/www.scientific.net/amm.556-562.4830.

Full text
Abstract:
To improve the usability and operability of the hybrid-identification reconnaissance radar for individual use, a voice identification System was designed. By using SPCE061A audio signal microprocessor as the core, a digital signal processing technology was used to obtain Doppler radar signals of audio segments by audio cable. Afterwards, the A/D acquisition was conducted to acquire digital signals, and then the data obtained were preprocessed and adaptively filtered to eliminate background noises. Moreover, segmented FFT transforming was used to identify the types of the signals. The overall design of radar voice recognition for an individual soldier was thereby fulfilled. The actual measurements showed that the design of the circuit improved radar resolution and the accuracy of the radar identification.
APA, Harvard, Vancouver, ISO, and other styles
10

Moore, Brian C. J. "Binaural sharing of audio signals." Hearing Journal 60, no. 11 (November 2007): 46–48. http://dx.doi.org/10.1097/01.hj.0000299172.13153.6f.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Gnanamanickam, Jenifa, Yuvaraj Natarajan, and Sri Preethaa K. R. "A Hybrid Speech Enhancement Algorithm for Voice Assistance Application." Sensors 21, no. 21 (October 23, 2021): 7025. http://dx.doi.org/10.3390/s21217025.

Full text
Abstract:
In recent years, speech recognition technology has become a more common notion. Speech quality and intelligibility are critical for the convenience and accuracy of information transmission in speech recognition. The speech processing systems used to converse or store speech are usually designed for an environment without any background noise. However, in a real-world atmosphere, background intervention in the form of background noise and channel noise drastically reduces the performance of speech recognition systems, resulting in imprecise information transfer and exhausting the listener. When communication systems’ input or output signals are affected by noise, speech enhancement techniques try to improve their performance. To ensure the correctness of the text produced from speech, it is necessary to reduce the external noises involved in the speech audio. Reducing the external noise in audio is difficult as the speech can be of single, continuous or spontaneous words. In automatic speech recognition, there are various typical speech enhancement algorithms available that have gained considerable attention. However, these enhancement algorithms work well in simple and continuous audio signals only. Thus, in this study, a hybridized speech recognition algorithm to enhance the speech recognition accuracy is proposed. Non-linear spectral subtraction, a well-known speech enhancement algorithm, is optimized with the Hidden Markov Model and tested with 6660 medical speech transcription audio files and 1440 Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio files. The performance of the proposed model is compared with those of various typical speech enhancement algorithms, such as iterative signal enhancement algorithm, subspace-based speech enhancement, and non-linear spectral subtraction. The proposed cascaded hybrid algorithm was found to achieve a minimum word error rate of 9.5% and 7.6% for medical speech and RAVDESS speech, respectively. The cascading of the speech enhancement and speech-to-text conversion architectures results in higher accuracy for enhanced speech recognition. The evaluation results confirm the incorporation of the proposed method with real-time automatic speech recognition medical applications where the complexity of terms involved is high.
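Word error rate, the figure of merit quoted above (9.5% and 7.6%), is normally computed as the word-level edit distance between the reference and recognized transcripts divided by the reference length; a small sketch of that standard formula follows (not the authors' code).

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("take two tablets daily", "take to tablets daily") -> 0.25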
APA, Harvard, Vancouver, ISO, and other styles
12

Haas, Ellen C. "Can 3-D Auditory Warnings Enhance Helicopter Cockpit Safety?" Proceedings of the Human Factors and Ergonomics Society Annual Meeting 42, no. 15 (October 1998): 1117–21. http://dx.doi.org/10.1177/154193129804201513.

Full text
Abstract:
The design and use of 3-D auditory warning signals can potentially enhance helicopter cockpit safety. A study was conducted to determine how quickly helicopter pilots could respond to helicopter malfunction warning signals in a simulated cockpit environment when four different signal functions (fire in left engine, fire in right engine, chips in transmission, shaft-driven compressor failure) were presented in three different presentation modes (visual only, visual plus 3-D auditory speech signals, visual plus 3-D auditory icons). The dependent variable was pilot response time to the warning signal, from the time of signal onset to the time that the pilot manipulated the collective control in the correct manner. Subjects were 12 U.S. Army pilots between the ages of 18 and 35 who possessed hearing and visual acuity within thresholds acceptable to the U.S. Army. Results indicated that signal presentation was the only significant effect. Signal function and the signal presentation x signal function interaction were not significant. Post hoc test results indicated that pilot response time to the visual signals supplemented with 3-D audio speech or auditory icon signals was significantly shorter than that to visual signals only. The data imply that 3-D audio speech and auditory icon signals provide a safe and effective mode of warning presentation in the helicopter cockpit.
APA, Harvard, Vancouver, ISO, and other styles
13

Rashid, Rakan Saadallah, and Jafar Ramadhan Mohammed. "Securing speech signals by watermarking binary images in the wavelet domain." Indonesian Journal of Electrical Engineering and Computer Science 18, no. 2 (May 1, 2020): 1096. http://dx.doi.org/10.11591/ijeecs.v18.i2.pp1096-1103.

Full text
Abstract:
Digital watermarking is the process of embedding particular information into other signal data in such a way that the quality of the original data is maintained and secured. Watermarking can be performed on images, videos, texts, or audio to protect them from copyright violation. Among all of these types of watermarking, audio watermarking techniques are gaining more interest and becoming more challenging because the quality of such signals is highly affected by the embedded watermark code. This paper introduces some efficient approaches that are capable of maintaining signal quality and preserving the important features of the audio signals. Moreover, the proposed digital audio watermarking approaches are performed in the transform domain; these approaches are gaining more attention due to their robustness, or resistance to attackers. The transform domains include the discrete cosine transform (DCT), the short-term Fourier transform (STFT), and the discrete wavelet transform (DWT). Furthermore, the wavelet transforms found to be most applicable for speech watermarking are the Haar and the Daubechies-4.
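A minimal sketch of wavelet-domain watermarking of the kind described, using PyWavelets with the Haar wavelet: binary watermark bits are embedded into selected detail coefficients by quantization. The decomposition level, embedding band, and step size delta are illustrative assumptions, not the paper's exact scheme.

import numpy as np
import pywt

def embed_watermark(speech, bits, wavelet="haar", level=2, delta=0.01):
    coeffs = pywt.wavedec(speech, wavelet, level=level)
    detail = coeffs[1].copy()                  # coarsest detail band chosen for embedding
    for k, bit in enumerate(bits[:len(detail)]):
        # quantization-index embedding: even multiples of delta encode 0, odd encode 1
        q = np.round(detail[k] / delta)
        if int(q) % 2 != int(bit):
            q += 1
        detail[k] = q * delta
    coeffs[1] = detail
    return pywt.waverec(coeffs, wavelet)

def extract_watermark(marked, n_bits, wavelet="haar", level=2, delta=0.01):
    detail = pywt.wavedec(marked, wavelet, level=level)[1]
    return [int(np.round(detail[k] / delta)) % 2 for k in range(n_bits)]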
APA, Harvard, Vancouver, ISO, and other styles
14

Sinha, Ria. "Digital Assistant for Sound Classification Using Spectral Fingerprinting." International Journal for Research in Applied Science and Engineering Technology 9, no. 8 (August 31, 2021): 2045–52. http://dx.doi.org/10.22214/ijraset.2021.37714.

Full text
Abstract:
This paper describes a digital assistant designed to help hearing-impaired people sense ambient sounds. The assistant relies on obtaining audio signals from the ambient environment of a hearing-impaired person. The audio signals are analysed by a machine learning model that uses spectral signatures as features to classify audio signals into audio categories (e.g., emergency, animal sounds, etc.) and specific audio types within the categories (e.g., ambulance siren, dog barking, etc.) and notify the user leveraging a mobile or wearable device. The user can configure active notification preferences and view historical logs. The machine learning classifier is periodically trained externally based on labeled audio sound samples. Additional system features include an audio amplification option and a speech to text option for transcribing human speech to text output. Keywords: assistive technology, sound classification, machine learning, audio processing, spectral fingerprinting
APA, Harvard, Vancouver, ISO, and other styles
15

Thanki, Rohit, and Komal Borisagar. "Watermarking Scheme with CS Encryption for Security and Piracy of Digital Audio Signals." International Journal of Information System Modeling and Design 8, no. 4 (October 2017): 38–60. http://dx.doi.org/10.4018/ijismd.2017100103.

Full text
Abstract:
In this article, a watermarking scheme using the curvelet transform in combination with compressive sensing (CS) theory is proposed for the protection of digital audio signals. The curvelet coefficients of the host audio signal are modified according to compressive sensing (CS) measurements of the watermark data. The CS measurements of the watermark data are generated using CS theory processes and sparse coefficients (wavelet coefficients or DCT coefficients). The proposed scheme can be employed for both audio and speech watermarking: a grayscale watermark image is inserted into the host digital audio signal when the scheme is used for audio watermarking, and a speech signal is inserted when it is employed for speech watermarking. The experimental results show that the proposed scheme performs better than existing watermarking schemes in terms of perceptual transparency.
APA, Harvard, Vancouver, ISO, and other styles
16

Maryn, Youri, and Andrzej Zarowski. "Calibration of Clinical Audio Recording and Analysis Systems for Sound Intensity Measurement." American Journal of Speech-Language Pathology 24, no. 4 (November 2015): 608–18. http://dx.doi.org/10.1044/2015_ajslp-14-0082.

Full text
Abstract:
Purpose Sound intensity is an important acoustic feature of voice/speech signals. Yet recordings are performed with different microphone, amplifier, and computer configurations, and it is therefore crucial to calibrate sound intensity measures of clinical audio recording and analysis systems on the basis of output of a sound-level meter. This study was designed to evaluate feasibility, validity, and accuracy of calibration methods, including audiometric speech noise signals and human voice signals under typical speech conditions. Method Calibration consisted of 3 comparisons between data from 29 measurement microphone-and-computer systems and data from the sound-level meter: signal-specific comparison with audiometric speech noise at 5 levels, signal-specific comparison with natural voice at 3 levels, and cross-signal comparison with natural voice at 3 levels. Intensity measures from recording systems were then linearly converted into calibrated data on the basis of these comparisons, and validity and accuracy of calibrated sound intensity were investigated. Results Very strong correlations and quasisimilarity were found between calibrated data and sound-level meter data across calibration methods and recording systems. Conclusions Calibration of clinical sound intensity measures according to this method is feasible, valid, accurate, and representative for a heterogeneous set of microphones and data acquisition systems in real-life circumstances with distinct noise contexts.
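The calibration described above amounts to a linear conversion fitted between a recording system's uncalibrated intensity readings and the sound-level meter's values; a sketch of that fit is given below, with the numeric readings shown being purely illustrative assumptions.

import numpy as np

# uncalibrated intensity estimates (dB re. arbitrary full scale) from one
# microphone-and-computer system, and the sound-level meter readings (dB SPL)
# for the same speech-noise presentations -- illustrative values only
system_db = np.array([-38.2, -32.4, -26.5, -20.7, -14.9])
slm_db    = np.array([ 55.0,  60.0,  65.0,  70.0,  75.0])

slope, intercept = np.polyfit(system_db, slm_db, 1)   # linear calibration
calibrated = slope * system_db + intercept
print(np.corrcoef(calibrated, slm_db)[0, 1])          # validity check (correlation)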
APA, Harvard, Vancouver, ISO, and other styles
17

Menezes, João Vítor Possamai de, Maria Mendes Cantoni, Denis Burnham, and Adriano Vilela Barbosa. "A method for lexical tone classification in audio-visual speech." Journal of Speech Sciences 9 (September 9, 2020): 93–104. http://dx.doi.org/10.20396/joss.v9i00.14960.

Full text
Abstract:
This work presents a method for lexical tone classification in audio-visual speech. The method is applied to a speech data set consisting of syllables and words produced by a female native speaker of Cantonese. The data were recorded in an audio-visual speech production experiment. The visual component of speech was measured by tracking the positions of active markers placed on the speaker's face, whereas the acoustic component was measured with an ordinary microphone. A pitch tracking algorithm is used to estimate F0 from the acoustic signal. A procedure for head motion compensation is applied to the tracked marker positions in order to separate the head and face motion components. The data are then organized into four signal groups: F0, Face, Head, Face+Head. The signals in each of these groups are parameterized by means of a polynomial approximation and then used to train an LDA (Linear Discriminant Analysis) classifier that maps the input signals into one of the output classes (the lexical tones of the language). One classifier is trained for each signal group. The ability of each signal group to predict the correct lexical tones was assessed by the accuracy of the corresponding LDA classifier. The accuracy of the classifiers was obtained by means of a k-fold cross validation method. The classifiers for all signal groups performed above chance, with F0 achieving the highest accuracy, followed by Face+Head, Face, and Head, respectively. The differences in performance between all signal groups were statistically significant.
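A compact sketch of the classification pipeline described (polynomial parameterization of each time-varying signal, an LDA classifier per signal group, and k-fold cross-validation), using scikit-learn; the array layout and polynomial order are assumptions rather than the paper's settings.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def polynomial_features(contours, order=3):
    """Fit a polynomial to each contour (e.g. F0 or marker trajectory) and use
    its coefficients as the feature vector."""
    t = np.linspace(0, 1, contours.shape[1])
    return np.array([np.polyfit(t, c, order) for c in contours])

# contours: (n_items, n_samples) array for one signal group; tones: (n_items,) labels
def tone_classification_accuracy(contours, tones, k=10):
    X = polynomial_features(contours)
    clf = LinearDiscriminantAnalysis()
    return cross_val_score(clf, X, tones, cv=k).mean()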
APA, Harvard, Vancouver, ISO, and other styles
18

Gunawan, T. S., O. O. Khalifa, and E. Ambikairajah. "Forward Masking Threshold Estimation Using Neural Networks and Its Application to Parallel Speech Enhancement." IIUM Engineering Journal 11, no. 1 (May 26, 2010): 15–26. http://dx.doi.org/10.31436/iiumej.v11i1.41.

Full text
Abstract:
Forward masking models have been used successfully in speech enhancement and audio coding. Presently, forward masking thresholds are estimated using simplified masking models developed for audio coding and speech enhancement applications. In this paper, an accurate approximation of forward masking threshold estimation using neural networks is proposed, and a performance comparison with other existing masking models in a speech enhancement application is presented. Objective measures using PESQ demonstrate that the proposed forward masking model provides significant improvements (5-15%) over four existing models when tested with speech signals corrupted by various noises at very low signal-to-noise ratios. Moreover, a parallel implementation of the speech enhancement algorithm was developed using the Matlab parallel computing toolbox.
APA, Harvard, Vancouver, ISO, and other styles
19

Mehrotra, Tushar, Neha Shukla, Tarunika Chaudhary, Gaurav Kumar Rajput, Majid Altuwairiqi, and Mohd Asif Shah. "Improved Frame-Wise Segmentation of Audio Signals for Smart Hearing Aid Using Particle Swarm Optimization-Based Clustering." Mathematical Problems in Engineering 2022 (May 5, 2022): 1–9. http://dx.doi.org/10.1155/2022/1182608.

Full text
Abstract:
Labeling speech signals is a critical activity that cannot be overlooked in any of the early phases of designing a system based on speech technology. To this end, an efficient particle swarm optimization (PSO)-based clustering algorithm is proposed to classify the speech classes, i.e., voiced, unvoiced, and silence. A sample of 10 signal waves is selected, and their audio features are extracted. The audio signals are then partitioned into frames, and each frame is classified using the proposed PSO-based clustering algorithm. The performance of the proposed algorithm is evaluated using performance metrics such as accuracy, sensitivity, and specificity. Extensive experiments reveal that the proposed algorithm outperforms competitive algorithms: its average accuracy is 97%, sensitivity is 98%, and specificity is 96%, which shows that the proposed approach is efficient in detecting and classifying the speech classes.
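A simplified sketch of PSO-based clustering of frame features into three classes (voiced, unvoiced, silence): each particle encodes a set of three centroids and is moved toward its personal and global best positions, scored by total within-cluster distance. The swarm parameters are illustrative assumptions, not the paper's values.

import numpy as np

def pso_cluster(frames, k=3, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """frames: (n_frames, n_features) array of per-frame audio features."""
    rng = np.random.default_rng(0)
    dim = frames.shape[1]
    # each particle is a candidate set of k centroids
    pos = rng.uniform(frames.min(0), frames.max(0), size=(n_particles, k, dim))
    vel = np.zeros_like(pos)

    def cost(centroids):
        d = np.linalg.norm(frames[:, None, :] - centroids[None], axis=2)
        return d.min(axis=1).sum()          # total distance to nearest centroid

    pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
    g = pbest[pbest_cost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos += vel
        costs = np.array([cost(p) for p in pos])
        improved = costs < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], costs[improved]
        g = pbest[pbest_cost.argmin()].copy()
    # label each frame by its nearest global-best centroid
    labels = np.linalg.norm(frames[:, None, :] - g[None], axis=2).argmin(axis=1)
    return g, labels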
APA, Harvard, Vancouver, ISO, and other styles
20

Usina, E. E., A. R. Shabanova, and I. V. Lebedev. "Models and a Technique for Determining the Speech Activity of a User of a Socio-Cyberphysical System." Proceedings of the Southwest State University 23, no. 6 (February 23, 2020): 225–40. http://dx.doi.org/10.21869/2223-1560-2019-23-6-225-240.

Full text
Abstract:
Purpose of research. The article presents the development of model-algorithmic support for determining the speech activity of a user of a socio-cyberphysical system. A topological model of a distributed audio-recording subsystem deployed in limited physical spaces (rooms) is proposed; the model makes it possible to assess the quality of perceived audio signals for the case of microphones distributed in such a room. Based on this model, a technique for determining the speech activity of a user of a socio-cyberphysical system has been developed; it maximizes the quality of perceived audio signals as the user moves around a room by determining the installation coordinates of the microphones. Methods. The mathematical tools of graph theory and set theory were used for the most complete analysis and formal description of the distributed audio-recording subsystem. To determine the placement coordinates of microphones in a room, a corresponding technique was developed; it involves emitting a speech signal in the room using acoustic equipment and measuring signal levels with a noise meter at the places intended for installing microphones. Results. The dependence of the correlation coefficient between the combined signal and the initial test signal on the distance to the signal source was calculated for different numbers of microphones. The obtained dependences make it possible to determine the minimum required number of spaced microphones to ensure high-quality recording of the user's speech. The results of testing the developed technique in a particular room indicate the feasibility and high efficiency of determining the speech activity of a user of a socio-cyberphysical system. Conclusion. Application of the proposed technique will improve the recording quality of the audio signal and, as a consequence, its subsequent processing, taking into account the possible movement of the user.
APA, Harvard, Vancouver, ISO, and other styles
21

Mowlaee, Pejman, Abolghasem Sayadiuan, and Hamid Sheikhzadeh. "FDMSM robust signal representation for speech mixtures and noise corrupted audio signals." IEICE Electronics Express 6, no. 15 (2009): 1077–83. http://dx.doi.org/10.1587/elex.6.1077.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Naithani, Deeksha. "Development of a Real-Time Audio Signal Processing System for Speech Enhancement." Mathematical Statistician and Engineering Applications 70, no. 2 (February 26, 2021): 1041–52. http://dx.doi.org/10.17762/msea.v70i2.2157.

Full text
Abstract:
Real-time signal processing requires that each audio segment be fully processed before the subsequent audio segment is received. This highlights how important it is to create signal processing methods that are not just fast but also accurate. This thesis describes several approaches for processing audio signals in real time; the publications presented cover a wide range of issues, including noise dosimetry, speech analysis, and network echo cancellation. In this article, the process of constructing a system that uses audio signal processing to improve speech in real time is broken down and examined. Speech enhancement makes speech more audible and comprehensible in noisy surroundings, which benefits individuals with hearing loss as well as speech recognition and communication systems. Metrics such as the signal-to-noise ratio, the power spectral efficiency ratio, the spectrum transfer entropy index, and the short-time objective intelligibility (STOI) score are used to evaluate speech quality enhancement systems. Because studies have shown that prolonged exposure to loud noise can have severe health effects, precise methods for measuring noise levels are essential. The findings of a study that measured noise exposure while also accounting for the contribution of the speaker's own voice are presented in this article.
APA, Harvard, Vancouver, ISO, and other styles
23

Sreelekha, Pallepati, Aedabaina Devi, Kurva Pooja, and S. T. Ramya. "Audio to Sign Language Translator." International Journal for Research in Applied Science and Engineering Technology 11, no. 4 (April 30, 2023): 3382–84. http://dx.doi.org/10.22214/ijraset.2023.50873.

Full text
Abstract:
This project is based on converting received audio signals to text using a speech-to-text API. Speech-to-text conversion comprises small, medium and large vocabulary conversions. Such systems process or accept the voice, which is then converted to the corresponding text. This paper gives a comparative analysis of the technologies used in small, medium, and large vocabulary speech recognition systems. The comparative study determines the benefits and liabilities of the approaches so far. The experiment shows the role of language models in improving the accuracy of speech-to-text conversion systems. We experiment with speech data containing noisy sentences and incomplete words. The results show prominent results for randomly chosen sentences compared to a sequential set of sentences.
APA, Harvard, Vancouver, ISO, and other styles
24

Mowlaee, Pejman, and Abolghasem Sayadiyan. "Audio Classification of Music/Speech Mixed Signals Using Sinusoidal Modeling with SVM and Neural Network Approach." Journal of Circuits, Systems and Computers 22, no. 02 (February 2013): 1250083. http://dx.doi.org/10.1142/s0218126612500831.

Full text
Abstract:
A preprocessing stage is inevitable in every speech/music application, including audio/speech separation, speech/speaker recognition, and audio/genre transcription. The importance of such a preprocessing stage stems from the need to determine whether each frame of the given signal belongs to the class of speech only, music only, or a speech/music mixture. Such classification can significantly decrease the computational burden of the exhaustive search commonly cited as a problem in model-based speech recognition or separation as well as music transcription scenarios. In this paper, we present a new method to classify mixed-type audio frames based on support vector machines (SVM) and neural networks. We present a feature-type selection algorithm that seeks the most appropriate features to discriminate the possible classes (hypotheses) of the mixed signal, and we also propose features based on eigen-decomposition of the mixed frame. Experimental results demonstrate that the proposed features, together with the selected audio classifiers, achieve acceptable classification results. From the experimental results, it is observed that the proposed system outperforms other classification systems, including k-nearest neighbor (k-NN) and multi-layer perceptron (MLP).
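A minimal sketch of the classification stage described (frame-level features fed to an SVM that assigns speech-only, music-only, or mixture labels), using scikit-learn. The features shown here are generic assumptions (log energy, zero-crossing rate, spectral centroid), not the paper's eigen-decomposition-based features.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def frame_features(frame, fs):
    """Generic per-frame features: log energy, zero-crossing rate, spectral centroid."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    centroid = (freqs * spectrum).sum() / (spectrum.sum() + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return [np.log(np.sum(frame ** 2) + 1e-12), zcr, centroid]

# X: stacked frame features, y: labels in {"speech", "music", "mixture"}
def evaluate(X, y):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return cross_val_score(clf, X, y, cv=5).mean()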
APA, Harvard, Vancouver, ISO, and other styles
25

Hanani, Abualsoud, Yanal Abusara, Bisan Maher, and Inas Musleh. "English speaking proficiency assessment using speech and electroencephalography signals." International Journal of Electrical and Computer Engineering (IJECE) 12, no. 3 (June 1, 2022): 2501. http://dx.doi.org/10.11591/ijece.v12i3.pp2501-2508.

Full text
Abstract:
In this paper, the English speaking proficiency level of non-native English speakers was automatically estimated as high, medium, or low performance. For this purpose, the speech of 142 non-native English speakers was recorded and electroencephalography (EEG) signals of 58 of them were recorded while speaking in English. Two systems were proposed for estimating the English proficiency level of the speaker; one used 72 audio features, extracted from speech signals, and the other used 112 features extracted from EEG signals. Multi-class support vector machines (SVM) were used for training and testing both systems using a cross-validation strategy. The speech-based system outperformed the EEG system with 68% accuracy on 60 testing audio recordings, compared with 56% accuracy on 30 testing EEG recordings.
APA, Harvard, Vancouver, ISO, and other styles
26

Lu, Yuanxun, Jinxiang Chai, and Xun Cao. "Live speech portraits." ACM Transactions on Graphics 40, no. 6 (December 2021): 1–17. http://dx.doi.org/10.1145/3478513.3480484.

Full text
Abstract:
To the best of our knowledge, we first present a live system that generates personalized photorealistic talking-head animation only driven by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features along with a manifold projection to project the features to the target person's speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include head poses and upper body motions, where the former is generated by an autoregressive probabilistic model which models the head pose distribution of the target person. Upper body motions are deduced from head poses. In the final stage, we generate conditional feature maps from previous predictions and send them with a candidate image set to an image-to-image translation network to synthesize photorealistic renderings. Our method generalizes well to wild audio and successfully synthesizes high-fidelity personalized facial details, e.g., wrinkles, teeth. Our method also allows explicit control of head poses. Extensive qualitative and quantitative evaluations, along with user studies, demonstrate the superiority of our method over state-of-the-art techniques.
APA, Harvard, Vancouver, ISO, and other styles
27

Rani, Shalu. "Review: Audio Noise Reduction Using Filters and Discrete Wavelet Transformation." Journal of Advance Research in Electrical & Electronics Engineering (ISSN: 2208-2395) 2, no. 6 (June 30, 2015): 17–21. http://dx.doi.org/10.53555/nneee.v2i6.192.

Full text
Abstract:
In "Audio noise reduction using filters and Discrete Wavelet Transformation", the applications include the noise propagation problem in industrial air-handling systems, noise in aircraft, tonal noise from electric power, and isolation of vibration, where noise is any sound that is unexpected or undesired. Noise-related problems can be divided into non-additive and additive noise. Non-additive noise includes multiplier noise and convolution noise, which can be transformed into additive noise through a homomorphic transform. Additive noise includes periodic noise, pulse noise, and broadband noise. There are many kinds of broadband noise, including thermal noise, wind noise, quantization noise, and all kinds of random noise such as white noise and pink noise. In acoustics applications, noise from the surrounding environment severely reduces the quality of speech and audio signals; therefore, basic linear filters are used to denoise the audio signals and enhance speech and audio quality. The main objective is to reduce noise, which is heavily dependent on the specific context and application, and to increase intelligibility and improve overall speech perception quality, evaluated in terms of SNR, PSNR, MSE, and the time required to denoise noisy signals.
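A brief sketch of DWT-based denoising of the kind reviewed above, using PyWavelets: the signal is decomposed, detail coefficients are soft-thresholded, and the signal is reconstructed. The wavelet choice and the universal-threshold rule are standard assumptions, not prescriptions from the review.

import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=4):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # noise level estimated from the finest detail band (median absolute deviation)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    threshold = sigma * np.sqrt(2 * np.log(len(x)))      # universal threshold
    coeffs[1:] = [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)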
APA, Harvard, Vancouver, ISO, and other styles
28

Kropotov, Y. A., A. A. Belov, and A. Y. Prockuryakov. "Increasing signal/acoustic interference ratio in telecommunications audio exchange by adaptive filtering methods." Information Technology and Nanotechnology, no. 2416 (2019): 271–76. http://dx.doi.org/10.18287/1613-0073-2019-2416-271-276.

Full text
Abstract:
The paper deals with the issues of increasing the signal/noise ratio in telecommunication audio exchange systems. A study of the characteristics of speech signals and acoustic noise, such as mathematical expectation, dispersion, and the relative intensity of acoustic speech signals and various types of acoustic noise and interference, is carried out. It is shown that in the design of telecommunications systems, in particular loudspeaker systems operating under the influence of external acoustic noise of high intensity, it is necessary to develop algorithms that effectively suppress such interference to ensure the necessary signal/noise ratio in communication systems. A mathematical model of the autocorrelation function of the speech signal is built using a Lagrange interpolation polynomial of order 10, and the creation of adaptive algorithms to suppress acoustic noise by linear filtering methods is considered. Suppression of acoustic noise and interference is thus possible through controlled adjustment of the cut-off region in the interval from 0 Hz to 300-1000 Hz, depending on the interference conditions.
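The final point, suppressing interference by adjusting a cut-off region between 0 Hz and 300-1000 Hz, can be illustrated with a simple adjustable high-pass filter (a generic sketch, not the paper's adaptive algorithm; the filter order and example cut-off are assumptions).

from scipy.signal import butter, sosfilt

def suppress_low_frequency_noise(x, fs, cutoff_hz=300.0, order=6):
    """Attenuate the 0 Hz .. cutoff_hz band, where intense low-frequency
    acoustic interference is assumed to dominate the speech signal."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos, x)

# enhanced = suppress_low_frequency_noise(noisy_speech, fs=8000, cutoff_hz=600.0)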
APA, Harvard, Vancouver, ISO, and other styles
29

Kholkina, Natalya. "Information Transfer Effectiveness of Warning and Telecommunication Systems of Audio-Exchange under Noise Conditions." Bulletin of Bryansk State Technical University 2020, no. 5 (May 13, 2020): 49–55. http://dx.doi.org/10.30987/1999-8775-2020-5-49-55.

Full text
Abstract:
The paper presents an approach to the problem of assessing effectiveness parameters in telecommunication systems for operational and command communication, warning speakerphone systems, and audio exchange. It examines how the acoustic speech signal-to-noise ratio relates to the assurance of the required syllabic legibility, with the aim of increasing the effectiveness of telecommunication and information exchange systems operating under complex noise conditions. The dependence of formant legibility on the geometric mean frequency of each i-th band of the frequency spectrum of acoustic speech signals is shown, as is the degree to which the acoustic speech signal-to-noise ratio affects syllabic legibility. It is shown that to obtain speech information with syllabic legibility above 93%, as required for complete perception by a subscriber, the acoustic signal-to-noise ratio must be at least 20 dB. Problems of approximating the probability density of acoustic signals using generalized polynomials on basis function systems are also presented.
APA, Harvard, Vancouver, ISO, and other styles
30

Putta, Venkata Subbaiah, A. Selwin Mich Priyadharson, and Venkatesa Prabhu Sundramurthy. "Regional Language Speech Recognition from Bone-Conducted Speech Signals through Different Deep Learning Architectures." Computational Intelligence and Neuroscience 2022 (August 25, 2022): 1–10. http://dx.doi.org/10.1155/2022/4473952.

Full text
Abstract:
A bone-conduction microphone (BCM) senses vibrations from bones in the skull during speech and converts them into an electrical audio signal. When transmitting speech signals, bone-conduction microphones (BCMs) capture speech signals based on the vibrations of the speaker's skull and have better noise-resistance capabilities than standard air-conduction microphones (ACMs). BCMs have a different frequency response than ACMs because they only capture the low-frequency portion of speech signals. When we replace an ACM with a BCM, we may get satisfactory noise suppression results, but the speech quality and intelligibility may suffer due to the nature of the solid vibration. Mismatched BCM and ACM characteristics can also have an impact on ASR performance, and it is impossible to recreate a new ASR system using voice data from BCMs. The speech intelligibility of a BCM-conducted speech signal is determined by the location of the bone used to acquire the signal and by how accurately the phonemes of words are modeled. Deep learning techniques such as neural networks have traditionally been used for speech recognition; however, neural networks have a high computational cost and are unable to model phonemes in signals. In this paper, the intelligibility of BCM speech signals was evaluated for different bone locations, namely the right ramus, larynx, and right mastoid. Listeners and deep learning architectures such as CapsuleNet, UNet, and S-Net were used to acquire the BCM signal for Tamil words and evaluate speech intelligibility. As validated by the listeners and the deep learning architectures, the larynx bone location improves speech intelligibility.
APA, Harvard, Vancouver, ISO, and other styles
31

Rashkevych, Yu, D. Peleshko, I. Pelekh, and I. Izonin. "Speech signal marking on the base of local magnitude and invariant segmentation." Mathematical Modeling and Computing 1, no. 2 (2014): 234–44. http://dx.doi.org/10.23939/mmc2014.02.234.

Full text
Abstract:
The paper suggests a new watermarking scheme for marking speech signals based on an invariant segmentation method and the use of local magnitude. The watermark is embedded, in the chosen form, at the peaks of the spectrum magnitude of each non-overlapping frame of the audio signal.
APA, Harvard, Vancouver, ISO, and other styles
32

Cox, Trevor, Michael Akeroyd, Jon Barker, John Culling, Jennifer Firth, Simone Graetzer, Holly Griffiths, et al. "Predicting Speech Intelligibility for People with a Hearing Loss: The Clarity Challenges." INTER-NOISE and NOISE-CON Congress and Conference Proceedings 265, no. 3 (February 1, 2023): 4599–606. http://dx.doi.org/10.3397/in_2022_0662.

Full text
Abstract:
Objective speech intelligibility metrics are used to reduce the need for time-consuming listening tests. They are used in the design of audio systems, room acoustics, and signal processing algorithms. Most published speech intelligibility metrics have been developed using young adults with so-called 'normal hearing', and therefore do not work well for those with different hearing characteristics. One of the most common causes of aural diversity is sensorineural hearing loss. While partially restoring perception through hearing aids is possible, results are mixed. This has led to the Clarity Project, which is running an open series of Enhancement Challenges to improve the processing of speech-in-noise for hearing aids. To enable this, objective metrics of speech intelligibility are needed that work from signals produced by hearing aids for diverse listeners. For this reason, Clarity is also running Prediction Challenges to improve speech intelligibility models. Competitors are given a set of audio signals produced by hearing aid algorithms and challenged to predict how many words a listener with a particular hearing characteristic will achieve. Drawing on the learning from the challenges, we will outline what has been learnt about improving intelligibility metrics for those with a hearing impairment.
APA, Harvard, Vancouver, ISO, and other styles
33

Yashwanth, A. "Audio Enhancement and Denoising using Online Non-Negative Matrix Factorization and Deep Learning." International Journal for Research in Applied Science and Engineering Technology 10, no. 6 (June 30, 2022): 1703–9. http://dx.doi.org/10.22214/ijraset.2022.44061.

Full text
Abstract:
For many years, reducing noise in a noisy speech recording has been a difficult task with numerous applications. This gives scope to use better techniques to enhance audio and speech and to reduce the noise in the audio. One such technique is online non-negative matrix factorization (ONMF). The ONMF noise reduction approach primarily generates a noiseless audio signal from an audio sample that has been contaminated by additive noise. Previously, many approaches applied non-negative matrix factorization to spectrogram measurements; non-negative matrix factorization (NMF) is a standard tool for audio source separation. One major disadvantage of applying NMF to large datasets is its time complexity. In this work, we propose using online non-negative matrix factorization, where the data can be any speech or music. This method uses less memory than regular non-negative matrix factorization, and it can be used for real-time denoising; the ONMF algorithm is more efficient in memory and time complexity for dictionary updates. We show on audio simulations that the ONMF method is faster and more efficient for small audio signals. We also implemented a deep learning approach for a comparative study with online non-negative matrix factorization.
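A compact sketch of NMF-based spectrogram denoising of the general kind discussed (batch NMF via scikit-learn rather than the online variant): spectral bases are learned on speech-only and noise-only magnitude spectrograms, activations are solved for with the bases fixed, and only the speech part is kept via a soft mask. Ranks, iteration counts, and the masking step are illustrative assumptions.

import numpy as np
from sklearn.decomposition import NMF

def learn_basis(spectrogram, n_components):
    """Learn a nonnegative spectral basis (freq x components) from |STFT|."""
    model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    model.fit(spectrogram.T)               # rows = frames, columns = frequency bins
    return model.components_.T             # freq x components

def denoise(noisy_mag, W_speech, W_noise, n_iter=100, eps=1e-9):
    """Fix the stacked basis W = [W_speech | W_noise], solve for activations H
    with multiplicative updates, then keep only the speech part (soft mask)."""
    W = np.hstack([W_speech, W_noise])                  # freq x (ks + kn)
    H = np.abs(np.random.default_rng(0).random((W.shape[1], noisy_mag.shape[1])))
    for _ in range(n_iter):
        # multiplicative update (Frobenius norm) for H with W held fixed
        H *= (W.T @ noisy_mag) / (W.T @ (W @ H) + eps)
    ks = W_speech.shape[1]
    speech_mag = W_speech @ H[:ks]
    noise_mag = W_noise @ H[ks:]
    mask = speech_mag / (speech_mag + noise_mag + eps)  # Wiener-like soft mask
    return mask * noisy_mag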
APA, Harvard, Vancouver, ISO, and other styles
34

Saitoh, Takeshi. "Research on multi-modal silent speech recognition technology." Impact 2018, no. 3 (June 15, 2018): 47–49. http://dx.doi.org/10.21820/23987073.2018.3.47.

Full text
Abstract:
We are all familiar with audio speech recognition technology for interfacing with smartphones and in-car computers. However, technology that can interpret our speech signals without audio is a far greater challenge for scientists. Audio speech recognition (ASR) can only work in situations where there is little or no background noise and where speech is clearly enunciated. Other technologies that use visual signals to lip-read, or that use lip-reading in conjunction with degraded audio input are under development. However, in the situations where a person cannot speak or where the person's face may not be fully visible, silent speech recognition, which uses muscle movements or brain signals to decode speech, is also under development. Associate Professor Takeshi Saitoh's laboratory at the Kyushu Institute of Technology is at the forefront of visual speech recognition (VSR) and is collaborating with researchers worldwide to develop a range of silent speech recognition technologies. Saitoh, whose small team of researchers and students are being supported by the Japan Society for the Promotion of Science (JSPS), says: 'The aim of our work is to achieve smooth and free communication in real time, without the need for audible speech.' The laboratory's VSR prototype is already performing at a high level. There are many reasons why scientists are working on speech technology that does not rely on audio. Saitoh points out that: 'With an ageing population, more people will suffer from speech or hearing disabilities and would benefit from a means to communicate freely. This would vastly improve their quality of life and create employment opportunities.' Also, intelligent machines, controlled by human-machine interfaces, are expected to become increasingly common in our lives. Non-audio speech recognition technology will be useful for interacting with smartphones, driverless cars, surveillance systems and smart appliances. VSR uses a modified camera, combined with image processing and pattern recognition to convert moving shapes made by the mouth, into meaningful language. Earlier VSR technologies matched the shape of a still mouth with vowel sounds, and others have correlated mouth shapes with a key input. However, these do not provide audio output in real-time, so cannot facilitate a smooth conversation. Also, it is vital that VSR is both easy to use and applicable to a range of situations, such as people bedridden in a supine position, where there is a degree of camera movement or where a face is being viewed in profile rather than full-frontal. Any reliable system should also be user-dependent, such that it will work on any skin colour and any shape of face and in spite of head movement.
APA, Harvard, Vancouver, ISO, and other styles
35

Hao, Cailing. "Application of Neural Network Algorithm Based on Principal Component Image Analysis in Band Expansion of College English Listening." Computational Intelligence and Neuroscience 2021 (November 12, 2021): 1–12. http://dx.doi.org/10.1155/2021/9732156.

Full text
Abstract:
With the development of information technology, band expansion technology is gradually applied to college English listening teaching. This technology aims to recover broadband speech signals from narrowband speech signals with a limited frequency band. However, due to the limitations of current voice equipment and channel conditions, the existing voice band expansion technology often ignores the high-frequency and low-frequency correlation of the audio, resulting in excessive smoothing of the recovered high-frequency spectrum, too dull subjective hearing, and insufficient expression ability. In order to solve this problem, a neural network model PCA-NN (principal components analysis-neural network) based on principal component image analysis is proposed. Based on the nonlinear characteristics of the audio image signal, the model reduces the dimension of high-dimensional data and realizes the effective recovery of the high-frequency detailed spectrum of audio signal in phase space. The results show that the PCA-NN, i.e., neural network based on principal component analysis, is superior to other audio expansion algorithms in subjective and objective evaluation; in log spectrum distortion evaluation, PCA-NN algorithm obtains smaller LSD. Compared with EHBE, Le, and La, the average LSD decreased by 2.286 dB, 0.51 dB, and 0.15 dB, respectively. The above results show that in the image frequency band expansion of college English listening, the neural network algorithm based on principal component analysis (PCA-NN) can obtain better high-frequency reconstruction accuracy and effectively improve the audio quality.
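Log-spectral distortion (LSD), the evaluation measure quoted in dB above, is typically computed framewise as the root-mean-square difference of log power spectra; a small sketch under that standard definition follows (not the paper's code).

import numpy as np
from scipy.signal import stft

def log_spectral_distortion(reference, estimate, fs, nperseg=512):
    _, _, R = stft(reference, fs=fs, nperseg=nperseg)
    _, _, E = stft(estimate, fs=fs, nperseg=nperseg)
    n = min(R.shape[1], E.shape[1])
    log_r = 10 * np.log10(np.abs(R[:, :n]) ** 2 + 1e-12)
    log_e = 10 * np.log10(np.abs(E[:, :n]) ** 2 + 1e-12)
    # RMS difference per frame, averaged over frames (result in dB)
    return np.mean(np.sqrt(np.mean((log_r - log_e) ** 2, axis=0)))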
APA, Harvard, Vancouver, ISO, and other styles
36

Lin, Ruei-Shiang, and Ling-Hwei Chen. "A New Approach for Classification of Generic Audio Data." International Journal of Pattern Recognition and Artificial Intelligence 19, no. 01 (February 2005): 63–78. http://dx.doi.org/10.1142/s0218001405003958.

Full text
Abstract:
Existing audio retrieval systems fall into one of two categories: single-domain systems that can accept data of only a single type (e.g. speech) or multiple-domain systems that offer content-based retrieval for multiple types of audio data. Since a single-domain system has limited applications, a multiple-domain system is more useful. However, different types of audio data have different properties, which makes a multiple-domain system harder to develop; if we can classify audio information in advance, this problem can be solved. In this paper, we propose a real-time classification method to classify audio signals into several basic audio types such as pure speech, music, song, speech with music background, and speech with environmental noise background. In order to make the proposed method robust for a variety of audio sources, we use a Bayesian decision function for the multivariate Gaussian distribution instead of manually adjusting a threshold for each discriminator. The proposed approach can be applied to content-based audio/video retrieval. In the experiment, the efficiency and effectiveness of this method are shown by an accuracy rate of more than 96% for general audio data classification.
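The Bayesian decision function for multivariate Gaussian class models mentioned above can be sketched as follows: each audio class is modelled by its feature mean and covariance, and a frame is assigned to the class with the highest log posterior. This is a generic illustration, not the paper's feature set or exact decision rule.

import numpy as np
from scipy.stats import multivariate_normal

class GaussianAudioClassifier:
    """One multivariate Gaussian per audio class (pure speech, music, song, ...)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_, self.priors_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.models_[c] = multivariate_normal(
                mean=Xc.mean(axis=0), cov=np.cov(Xc, rowvar=False), allow_singular=True)
            self.priors_[c] = len(Xc) / len(X)
        return self

    def predict(self, X):
        # Bayesian decision: class with maximum log likelihood + log prior
        scores = np.column_stack(
            [self.models_[c].logpdf(X) + np.log(self.priors_[c]) for c in self.classes_])
        return self.classes_[scores.argmax(axis=1)]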
APA, Harvard, Vancouver, ISO, and other styles
37

Alexandrou, Anna Maria, Timo Saarinen, Jan Kujala, and Riitta Salmelin. "Cortical Tracking of Global and Local Variations of Speech Rhythm during Connected Natural Speech Perception." Journal of Cognitive Neuroscience 30, no. 11 (November 2018): 1704–19. http://dx.doi.org/10.1162/jocn_a_01295.

Full text
Abstract:
During natural speech perception, listeners must track the global speaking rate, that is, the overall rate of incoming linguistic information, as well as transient, local speaking-rate variations occurring within the global speaking rate. Here, we address the hypothesis that this tracking mechanism is achieved through coupling of cortical signals to the amplitude envelope of the perceived acoustic speech signals. Cortical signals were recorded with magnetoencephalography (MEG) while participants perceived spontaneously produced speech stimuli at three global speaking rates (slow, normal/habitual, and fast). As is inherent to spontaneously produced speech, these stimuli also featured local variations in speaking rate. The coupling between cortical and acoustic speech signals was evaluated using audio–MEG coherence. Modulations in audio–MEG coherence spatially differentiated between tracking of global speaking rate, highlighting the temporal cortex bilaterally and the right parietal cortex, and sensitivity to local speaking-rate variations, emphasizing the left parietal cortex. Cortical tuning to the temporal structure of natural connected speech thus seems to require the joint contribution of both auditory and parietal regions. These findings suggest that cortical tuning to speech rhythm operates on two functionally distinct levels: one encoding the global rhythmic structure of speech and the other associated with online, rapidly evolving temporal predictions. Thus, it may be proposed that speech perception is shaped by evolutionary tuning, a preference for certain speaking rates, and by predictive tuning, associated with cortical tracking of the constantly changing rate of linguistic information in a speech stream.
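For readers unfamiliar with coherence analysis, the sketch below computes magnitude-squared coherence between a speech amplitude envelope and a simulated cortical channel using SciPy. The sampling rate, window length, and simulated MEG data are assumptions for illustration only; the study's actual analysis pipeline is far more involved.

# Hedged sketch: coherence between a speech envelope and a simulated
# cortical signal; everything here is synthetic stand-in data.
import numpy as np
from scipy.signal import hilbert, coherence

fs = 200.0                                   # assumed common sampling rate (Hz)
t = np.arange(0, 60, 1 / fs)
speech_audio = np.random.randn(t.size)       # stand-in for the acoustic waveform
envelope = np.abs(hilbert(speech_audio))     # amplitude envelope of the speech

# Simulated cortical channel that partly follows the envelope plus noise.
meg_channel = 0.5 * envelope + np.random.randn(t.size)

f, cxy = coherence(envelope, meg_channel, fs=fs, nperseg=512)
low = (f >= 0.5) & (f <= 10)                 # speech-rhythm range of interest
print("peak coherence below 10 Hz:", cxy[low].max())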
APA, Harvard, Vancouver, ISO, and other styles
38

Faghani, Maral, Hamidreza Rezaee-Dehsorkh, Nassim Ravanshad, and Hamed Aminzadeh. "Ultra-Low-Power Voice Activity Detection System Using Level-Crossing Sampling." Electronics 12, no. 4 (February 5, 2023): 795. http://dx.doi.org/10.3390/electronics12040795.

Full text
Abstract:
This paper presents an ultra-low-power voice activity detection (VAD) system to discriminate speech from non-speech parts of audio signals. The proposed VAD system uses level-crossing sampling for voice activity detection. The useless samples in the non-speech parts of the signal are eliminated due to the activity-dependent nature of this sampling scheme. A 40 ms moving window with a 30 ms overlap is exploited as a feature extraction block, within which the output samples of the level-crossing analog-to-digital converter (LC-ADC) are counted as the feature. The only variable used to distinguish speech and non-speech segments in the audio input signal is the number of LC-ADC output samples within a time window. The proposed system achieves an average of 91.02% speech hit rate and 82.64% non-speech hit rate over 12 noise types at −5, 0, 5, and 10 dB signal-to-noise ratios (SNR) over the TIMIT database. The proposed system, including the LC-ADC, feature extraction, and classification circuits, was designed in 0.18 µm CMOS technology. Post-layout simulation results show a power consumption of 394.6 nW with a silicon area of 0.044 mm², which makes it suitable as an always-on device in an automatic speech recognition system.
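As a hedged, simplified illustration of the sampling scheme, the sketch below keeps a sample whenever the waveform crosses into a new quantization level and then counts retained samples in 40 ms windows with 30 ms overlap, flagging windows whose count exceeds a threshold as speech. The level spacing, threshold, and test signal are assumptions, not the paper's LC-ADC design values.

# Hedged sketch of level-crossing sampling plus count-based VAD.
import numpy as np

def level_crossing_count(x, delta):
    """Indices where the signal moves to a new quantization level of size delta."""
    kept = []
    last_level = np.round(x[0] / delta)
    for i, v in enumerate(x):
        level = np.round(v / delta)
        if level != last_level:
            kept.append(i)
            last_level = level
    return np.array(kept)

def vad_by_counts(x, fs, delta=0.05, threshold=20):
    crossings = level_crossing_count(x, delta)
    win, hop = int(0.040 * fs), int(0.010 * fs)   # 40 ms window, 30 ms overlap
    decisions = []
    for start in range(0, len(x) - win + 1, hop):
        count = np.sum((crossings >= start) & (crossings < start + win))
        decisions.append(count > threshold)       # speech if "active" enough
    return np.array(decisions)

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
signal = np.concatenate([0.01 * np.random.randn(fs // 2),               # near-silence
                         0.5 * np.sin(2 * np.pi * 220 * t[:fs // 2])])  # voiced-like tone
print(vad_by_counts(signal, fs).astype(int))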
APA, Harvard, Vancouver, ISO, and other styles
39

Hajarolasvadi, Noushin, and Hasan Demirel. "3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms." Entropy 21, no. 5 (May 8, 2019): 479. http://dx.doi.org/10.3390/e21050479.

Full text
Abstract:
Detecting human intentions and emotions helps improve human–robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on the analysis of speech signals. First, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features, including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity, for each of the respective frames. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying k-means clustering to the extracted features of all frames of each audio signal, we select the k most discriminant frames, namely keyframes, to summarize the speech signal. Then, the sequence of the corresponding spectrograms of the keyframes is encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural Network using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to the state-of-the-art methods reported in the literature.
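The keyframe-selection step can be sketched as below: cluster the per-frame feature vectors with k-means and keep, for each cluster, the frame closest to its centroid. The number of frames, the value of k, and the random features are illustrative assumptions.

# Hedged sketch of k-means keyframe selection over per-frame features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
frame_features = rng.standard_normal((400, 88))   # 88-d vector per frame, as in the abstract

k = 9
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_features)

keyframe_ids = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(frame_features[members] - km.cluster_centers_[c], axis=1)
    keyframe_ids.append(members[np.argmin(dists)])   # frame nearest the centroid

keyframe_ids = sorted(keyframe_ids)                  # the k most discriminant frames
print(keyframe_ids)
# The spectrograms of these keyframes would then be stacked into a 3D tensor
# and fed to the 3D CNN.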
APA, Harvard, Vancouver, ISO, and other styles
40

Kane, Joji, and Akira Nohara. "Speech processing apparatus for separating voice and non‐voice audio signals contained in a same mixed audio signal." Journal of the Acoustical Society of America 95, no. 3 (March 1994): 1704. http://dx.doi.org/10.1121/1.408490.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Mohd Hanifa, Rafizah, Khalid Isa, Shamsul Mohamad, Shaharil Mohd Shah, Shelena Soosay Nathan, Rosni Ramle, and Mazniha Berahim. "Voiced and unvoiced separation in malay speech using zero crossing rate and energy." Indonesian Journal of Electrical Engineering and Computer Science 16, no. 2 (November 1, 2019): 775. http://dx.doi.org/10.11591/ijeecs.v16.i2.pp775-780.

Full text
Abstract:
This paper contributes to the literature on voice recognition in the context of a non-English language. Specifically, it aims to validate the techniques used to capture the basic characteristics of speech, namely voiced and unvoiced parts, that need to be evaluated when analysing speech signals. Zero Crossing Rate (ZCR) and Short Time Energy (STE) are used in this paper to perform signal pre-processing of continuous Malay speech and separate the voiced and unvoiced parts. The study is based on non-real-time data developed from a collection of recorded speech. The signal is assessed using ZCR and STE for comparison purposes. The results revealed that ZCR is low for voiced parts and high for unvoiced parts, whereas STE is high for voiced parts and low for unvoiced parts. Thus, these two techniques can be used effectively for separating voiced and unvoiced segments in continuous Malay speech.
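A minimal frame-wise implementation of the two measures might look like the sketch below; the frame length, hop size, and synthetic test signals are assumptions chosen only to show the expected contrast between voiced-like and unvoiced-like segments.

# Hedged sketch of frame-wise zero crossing rate (ZCR) and short-time energy (STE).
# Low ZCR with high STE suggests a voiced frame; the opposite suggests unvoiced.
import numpy as np

def zcr_ste(x, frame_len=400, hop=200):
    zcr, ste = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        signs = np.sign(frame)
        zcr.append(np.mean(np.abs(np.diff(signs)) > 0))   # fraction of sign changes
        ste.append(np.sum(frame ** 2))                    # short-time energy
    return np.array(zcr), np.array(ste)

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
voiced_like = 0.8 * np.sin(2 * np.pi * 150 * t)           # low ZCR, high STE
unvoiced_like = 0.05 * np.random.randn(t.size)            # high ZCR, low STE
zcr, ste = zcr_ste(np.concatenate([voiced_like, unvoiced_like]))
print(zcr.round(2), ste.round(2))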
APA, Harvard, Vancouver, ISO, and other styles
42

Wolfe, Jace, and Erin C. Schafer. "Optimizing The Benefit of Sound Processors Coupled to Personal FM Systems." Journal of the American Academy of Audiology 19, no. 08 (September 2008): 585–94. http://dx.doi.org/10.3766/jaaa.19.8.2.

Full text
Abstract:
Background: Use of personal frequency modulated (FM) systems significantly improves speech recognition in noise for users of cochlear implants (CI). There are, however, a number of adjustable parameters of the cochlear implant and FM receiver that may affect performance and benefit, and there is limited evidence to guide audiologists in optimizing these parameters. Purpose: This study examined the effect of two sound processor audio-mixing ratios (30/70 and 50/50) on speech recognition and functional benefit for adults with CIs using the Advanced Bionics Auria® sound processors. Research Design: Fully repeated-measures experimental design; each subject participated in every speech-recognition condition in the study, and qualitative data were collected with subject questionnaires. Study Sample: Twelve adults using Advanced Bionics Auria sound processors. Participants had greater than 20% correct speech recognition on consonant-nucleus-consonant (CNC) monosyllabic words in quiet and had used their CIs for at least six months. Intervention: Performance was assessed at two audio-mixing ratios (30/70 and 50/50). For the 50/50 mixing ratio, equal emphasis is placed on the signals from the sound processor and the FM system. For the 30/70 mixing ratio, the signal from the microphone of the sound processor is attenuated by 10 dB. Data Collection and Analysis: Speech recognition was assessed at two audio-mixing ratios (30/70 and 50/50) in quiet (35 and 50 dB HL) and in noise (+5 signal-to-noise ratio) with and without the personal FM system. After two weeks of using each audio-mixing ratio, the participants completed subjective questionnaires. Results: Study results suggested that use of a personal FM system resulted in significant improvements in speech recognition in quiet at low presentation levels, speech recognition in noise, and perceived benefit in noise. Use of the 30/70 mixing ratio resulted in significantly poorer speech recognition for low-level speech that was not directed to the FM transmitter. There was no significant difference in speech recognition in noise or functional benefit between the two audio-mixing ratios. Conclusions: Use of a 50/50 audio-mixing ratio is recommended for optimal performance with an FM system in quiet and noisy listening situations.
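To make the mixing ratios concrete, the short sketch below treats the 30/70 setting as attenuating the processor-microphone signal by roughly 10 dB before summing it with the FM signal, and the 50/50 setting as an equal-weight sum. This is only an illustrative signal-level reading of the abstract, not how a clinical sound processor is actually programmed.

# Hedged sketch of audio mixing with an optional microphone attenuation in dB.
import numpy as np

def mix(mic, fm, mic_attenuation_db=0.0):
    gain = 10 ** (-mic_attenuation_db / 20)   # dB attenuation -> linear gain
    return gain * mic + fm

fs = 16000
mic_signal = np.random.randn(fs)              # stand-in for processor microphone audio
fm_signal = np.random.randn(fs)               # stand-in for FM receiver audio

mix_50_50 = mix(mic_signal, fm_signal, mic_attenuation_db=0.0)    # equal emphasis
mix_30_70 = mix(mic_signal, fm_signal, mic_attenuation_db=10.0)   # mic down ~10 dB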
APA, Harvard, Vancouver, ISO, and other styles
43

Zhao, Huan, Shaofang He, Zuo Chen, and Xixiang Zhang. "Dual Key Speech Encryption Algorithm Based Underdetermined BSS." Scientific World Journal 2014 (2014): 1–7. http://dx.doi.org/10.1155/2014/974735.

Full text
Abstract:
When the number of mixed signals is less than the number of source signals, underdetermined blind source separation (BSS) is a significantly difficult problem. Because speech communication involves large amounts of data and requires real-time processing, we exploit the intractability of the underdetermined BSS problem to present a dual-key speech encryption method. The original speech is mixed with dual key signals, which consist of random key signals (a one-time pad) generated from a secret seed and chaotic signals generated by a chaotic system. In the decryption process, approximate calculation is used to recover the original speech signals. The proposed speech encryption algorithm can resist traditional attacks against the encryption system, and owing to the approximate calculation, decryption becomes faster and more accurate. It is demonstrated that the proposed method has a high level of security and can recover the original signals quickly and efficiently while maintaining excellent audio quality.
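A hedged sketch of the encryption side of the scheme appears below: the speech is stacked with a seeded one-time-pad signal and a logistic-map chaotic signal, then multiplied by a mixing matrix with fewer rows than sources, producing an underdetermined mixture. The map parameters, seed handling, and mixing matrix are illustrative assumptions, and the approximate-calculation decryption is not shown.

# Hedged sketch: dual-key mixing that yields an underdetermined BSS problem.
import numpy as np

def logistic_map(n, x0=0.37, r=3.99):
    """Chaotic key signal from the logistic map, rescaled to roughly [-1, 1]."""
    x = np.empty(n)
    x[0] = x0
    for i in range(1, n):
        x[i] = r * x[i - 1] * (1 - x[i - 1])
    return 2 * x - 1

n = 16000
speech = np.sin(2 * np.pi * 200 * np.arange(n) / 16000)    # stand-in for speech
pad = np.random.default_rng(seed=1234).uniform(-1, 1, n)   # one-time pad from a secret seed
chaos = logistic_map(n)                                    # chaotic key signal

sources = np.vstack([speech, pad, chaos])                  # 3 source signals
A = np.random.default_rng(5).standard_normal((2, 3))       # 2 mixtures < 3 sources
ciphertext = A @ sources                                   # transmitted underdetermined mixture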
APA, Harvard, Vancouver, ISO, and other styles
44

Jahan, Ayesha, Sanobar Shadan, Yasmeen Fatima, and Naheed Sultana. "Image Orator - Image to Speech Using CNN, LSTM and GTTS." International Journal for Research in Applied Science and Engineering Technology 11, no. 6 (June 30, 2023): 4473–81. http://dx.doi.org/10.22214/ijraset.2023.54470.

Full text
Abstract:
This report presents an image-to-audio system that utilizes a combination of Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) for image captioning and Google Text-to-Speech (GTTS) for generating audio output. The aim of the project is to create an accessible system that converts images into descriptive audio signals for visually impaired individuals. The proposed system has the potential to provide meaningful context and information about the image through descriptive audio output, making it easier for visually impaired individuals to engage with visual content. In conclusion, the proposed image-to-audio system, which combines LSTM and CNN for image captioning and GTTS for audio generation, is a promising approach to making visual content more accessible to individuals with visual impairments. Future work may involve exploring different neural network architectures, optimising the system for real-time performance, and incorporating additional audio features to enhance the overall user experience.
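Only the final text-to-speech step is easy to show compactly; the sketch below feeds a placeholder caption string (standing in for the CNN+LSTM captioner's output) to the gTTS library and saves the spoken result.

# Hedged sketch of the caption-to-audio step with gTTS; the caption text is
# a hard-coded placeholder, not actual model output.
from gtts import gTTS

caption = "A brown dog is running across a grassy field."  # assumed captioner output
tts = gTTS(text=caption, lang="en")
tts.save("caption.mp3")   # descriptive audio for the image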
APA, Harvard, Vancouver, ISO, and other styles
45

Wang, Li, Weiguang Zheng, Xiaojun Ma, and Shiming Lin. "Denoising Speech Based on Deep Learning and Wavelet Decomposition." Scientific Programming 2021 (July 16, 2021): 1–10. http://dx.doi.org/10.1155/2021/8677043.

Full text
Abstract:
This work proposes a speech denoising method based on deep learning. The predictor and target signals of the network were the amplitude spectra of the wavelet-decomposition vectors of the noisy audio signal and the clean audio signal, respectively. The output of the network was the amplitude spectrum of the denoised signal. The regression network was trained on the predictor input to minimize the mean squared error between its output and the target. The denoised wavelet-decomposition vector was transformed back to the time domain using the output amplitude spectrum and the phase of the wavelet-decomposition vector, and the denoised speech was then obtained by the inverse wavelet transform. This method overcomes the limitation that the time and frequency resolution of the short-time Fourier transform cannot be adjusted. The noise reduction in each frequency band was improved because the noise energy is gradually reduced during the wavelet-decomposition process. The experimental results showed that the method has a good denoising effect across the whole frequency band.
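The wavelet side of the pipeline can be sketched as follows: decompose the noisy signal, process each wavelet-decomposition vector, and invert the transform. In this illustration a simple soft threshold stands in for the paper's learned regression network, and the wavelet family, level, and threshold are assumptions.

# Hedged sketch: wavelet decomposition, per-band processing, and reconstruction.
import numpy as np
import pywt

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
clean = np.sin(2 * np.pi * 300 * t)
noisy = clean + 0.3 * np.random.randn(t.size)

coeffs = pywt.wavedec(noisy, "db8", level=4)          # wavelet-decomposition vectors

def soft_threshold(c, thr):
    return np.sign(c) * np.maximum(np.abs(c) - thr, 0.0)

# A learned regression network would go here; soft thresholding is a stand-in.
denoised_coeffs = [coeffs[0]] + [soft_threshold(c, 0.2) for c in coeffs[1:]]
denoised = pywt.waverec(denoised_coeffs, "db8")       # back to the time domain
print(np.mean((denoised[:clean.size] - clean) ** 2))  # rough quality check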
APA, Harvard, Vancouver, ISO, and other styles
46

V, Sethuram, Ande Prasad, and R. Rajeswara Rao. "Metaheuristic adapted convolutional neural network for Telugu speaker diarization." Intelligent Decision Technologies 15, no. 4 (January 10, 2022): 561–77. http://dx.doi.org/10.3233/idt-211005.

Full text
Abstract:
Speaker diarization plays a pivotal role in speech technology. In general, speaker diarization is the mechanism of partitioning an input audio stream into homogeneous segments according to speaker identity. Speaker diarization can improve the readability of automatic transcription by organizing the audio stream into speaker turns and, often, by providing the true speaker identity. In this research work, a novel speaker diarization approach is introduced with three major phases: feature extraction, Speech Activity Detection (SAD), and speaker segmentation and clustering. Initially, Mel Frequency Cepstral Coefficient (MFCC) features are extracted from the collected input audio stream (Telugu language). Subsequently, in Speech Activity Detection (SAD), the music and silence signals are removed. The acquired speech signals are then segmented for each individual speaker. Finally, the segmented signals are subjected to the speaker clustering process, where an optimized Convolutional Neural Network (CNN) is used. To make the clustering more appropriate, the weights and activation function of the CNN are fine-tuned by a new Self Adaptive Sea Lion Algorithm (SA-SLnO). A comparative analysis is then made to exhibit the superiority of the proposed speaker diarization work. The accuracy of the proposed method is 0.8073, which is 5.255, 2.45%, and 0.075 higher than the existing works, respectively.
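As a simplified, hedged stand-in for the segmentation-and-clustering stage, the sketch below represents each fixed-length segment by its mean MFCC vector and groups the segments with agglomerative clustering. The segment length, number of speakers, synthetic audio, and the use of agglomerative clustering instead of the paper's SA-SLnO-optimized CNN are all assumptions made to keep the example short.

# Hedged sketch: MFCC segment embeddings followed by simple speaker clustering.
import numpy as np
import librosa
from sklearn.cluster import AgglomerativeClustering

sr = 16000
audio = np.random.randn(10 * sr)               # stand-in for a Telugu audio stream

seg_len = sr                                   # 1-second segments (assumed)
segments = [audio[i:i + seg_len] for i in range(0, len(audio) - seg_len, seg_len)]

# One embedding per segment: the mean MFCC vector over its frames.
embeddings = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=20).mean(axis=1)
                       for s in segments])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
print(labels)                                  # hypothesized speaker label per segment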
APA, Harvard, Vancouver, ISO, and other styles
47

Lee, Dongheon, and Jung-Woo Choi. "Inter-channel Conv-TasNet for source-agnostic multichannel audio enhancement." INTER-NOISE and NOISE-CON Congress and Conference Proceedings 265, no. 5 (February 1, 2023): 2068–75. http://dx.doi.org/10.3397/in_2022_0297.

Full text
Abstract:
Deep neural network (DNN) models for the audio enhancement task have been developed in various ways. Most of them rely on source-dependent characteristics, such as the temporal or spectral characteristics of speech, to suppress noise embedded in measured signals. Only a few studies have attempted to exploit the spatial information embedded in multichannel data. In this work, we propose a DNN architecture that fully exploits inter-channel relations to realize source-agnostic audio enhancement. The proposed model is based on the fully convolutional time-domain audio separation network (Conv-TasNet) but is extended to extract and learn spatial features from multichannel input signals. The use of spatial information is facilitated by separating each convolutional layer into dedicated inter-channel 1x1 Conv blocks and 2D spectro-temporal Conv blocks. The performance of the proposed model is verified through training and testing with heterogeneous datasets, including speech and other audio datasets, which demonstrates that the enriched spatial information from the proposed architecture enables versatile audio enhancement in a source-agnostic way.
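The separation into inter-channel and spectro-temporal convolutions might be sketched in PyTorch as below; the channel counts, kernel sizes, and activation are illustrative assumptions rather than the paper's architecture.

# Hedged sketch: a block with a 1x1 inter-channel convolution followed by a
# 2D spectro-temporal convolution, applied to multichannel feature maps.
import torch
import torch.nn as nn

class InterChannelBlock(nn.Module):
    def __init__(self, mics=4):
        super().__init__()
        # Mixes information across microphone channels only (1x1 kernel).
        self.inter_channel = nn.Conv2d(mics, mics, kernel_size=1)
        # Learns spectro-temporal patterns within each feature map.
        self.spectro_temporal = nn.Conv2d(mics, mics, kernel_size=3, padding=1)
        self.act = nn.PReLU()

    def forward(self, x):                 # x: (batch, mics, features, time)
        x = self.act(self.inter_channel(x))
        return self.act(self.spectro_temporal(x))

x = torch.randn(1, 4, 64, 100)            # 4-mic input, 64 features x 100 frames (assumed)
print(InterChannelBlock()(x).shape)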
APA, Harvard, Vancouver, ISO, and other styles
48

Ong, Kah Liang, Chin Poo Lee, Heng Siong Lim, and Kian Ming Lim. "Speech emotion recognition with light gradient boosting decision trees machine." International Journal of Electrical and Computer Engineering (IJECE) 13, no. 4 (August 1, 2023): 4020. http://dx.doi.org/10.11591/ijece.v13i4.pp4020-4028.

Full text
Abstract:
Speech emotion recognition aims to identify the emotion expressed in speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as frequency- and temporal-domain features. For classification, a light gradient boosting machine is leveraged. Hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal settings. As speech emotion recognition datasets are imbalanced, the class weights are set inversely proportional to the class sample counts, so that minority classes are assigned higher weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Emo-DB dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
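The class-weighting idea can be illustrated with the short sketch below, which computes balanced weights (inversely proportional to class frequency) and passes them to a LightGBM classifier. The features, labels, and hyperparameters are synthetic placeholders, not the paper's tuned configuration.

# Hedged sketch: inverse-frequency class weights with a LightGBM classifier.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from lightgbm import LGBMClassifier

rng = np.random.default_rng(3)
X = rng.standard_normal((600, 40))                       # assumed audio feature vectors
y = rng.choice(["neutral", "happy", "angry"], size=600, p=[0.6, 0.3, 0.1])  # imbalanced labels

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))               # minority classes get larger weights

clf = LGBMClassifier(class_weight=class_weight, n_estimators=200)
clf.fit(X, y)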
APA, Harvard, Vancouver, ISO, and other styles
49

Alluhaidan, Ala Saleh, Oumaima Saidani, Rashid Jahangir, Muhammad Asif Nauman, and Omnia Saidani Neffati. "Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network." Applied Sciences 13, no. 8 (April 10, 2023): 4750. http://dx.doi.org/10.3390/app13084750.

Full text
Abstract:
Speech emotion recognition (SER) is the process of predicting human emotions from audio signals using artificial intelligence (AI) techniques. SER technologies have a wide range of applications in areas such as psychology, medicine, education, and entertainment. Extracting relevant features from audio signals is a crucial task in the SER process for correctly identifying emotions. Several studies on SER have employed short-time features such as Mel frequency cepstral coefficients (MFCCs), due to their efficiency in capturing the periodic nature of audio signals. However, these features are limited in their ability to correctly identify emotion representations. To address this issue, this research combined MFCCs with time-domain features (MFCCT) to enhance the performance of SER systems. The proposed hybrid features were given to a convolutional neural network (CNN) to build the SER model. The hybrid MFCCT features together with the CNN outperformed both MFCCs and time-domain (t-domain) features on the Emo-DB, SAVEE, and RAVDESS datasets, achieving accuracies of 97%, 93%, and 92%, respectively. Additionally, the CNN achieved better performance than the machine learning (ML) classifiers recently used in SER. The proposed features have the potential to be widely applied to several types of SER datasets for identifying emotions.
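A minimal sketch of assembling hybrid frame-level features is shown below: librosa MFCCs stacked with two simple time-domain descriptors (RMS energy and zero crossing rate). The specific time-domain features and frame settings are assumptions; the paper's exact MFCCT construction may differ.

# Hedged sketch: concatenating MFCCs with time-domain frame descriptors.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(2 * sr).astype(np.float32)          # stand-in for an utterance

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # (13, frames)
rms = librosa.feature.rms(y=y)                          # (1, frames)
zcr = librosa.feature.zero_crossing_rate(y=y)           # (1, frames)

frames = min(mfcc.shape[1], rms.shape[1], zcr.shape[1])
hybrid = np.vstack([mfcc[:, :frames], rms[:, :frames], zcr[:, :frames]])
print(hybrid.shape)                                     # one hybrid vector per frame, fed to the CNN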
APA, Harvard, Vancouver, ISO, and other styles
50

CAO, JIANGTAO, NAOYUKI KUBOTA, PING LI, and HONGHAI LIU. "THE VISUAL-AUDIO INTEGRATED RECOGNITION METHOD FOR USER AUTHENTICATION SYSTEM OF PARTNER ROBOTS." International Journal of Humanoid Robotics 08, no. 04 (December 2011): 691–705. http://dx.doi.org/10.1142/s0219843611002678.

Full text
Abstract:
Several noncontact biometric approaches, such as vision-based recognition methods and speech recognition, have been used in user authentication systems for partner robots. However, vision-based recognition methods are sensitive to lighting noise, and speech recognition systems are disturbed by the acoustic environment and sound noise. Inspired by the human capability of compensating visual information (looking) with audio information (hearing), a visual-audio integration method is proposed to deal with the disturbance of lighting noise and to improve recognition accuracy. In combination with PCA-based and 2DPCA-based face recognition, a two-stage speaker recognition algorithm is used to extract useful personal identity information from speech signals. Using the statistical properties of the visual background noise, the visual-audio integration method is applied to draw the final decision. The proposed method is evaluated on a public visual-audio dataset, VidTIMIT, and on a partner robot authentication system. The results verify that the visual-audio integration method can obtain satisfactory recognition results with strong robustness.
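Decision-level integration of the two modalities could be sketched as below, where the weight on the visual score shrinks as the estimated visual (lighting) noise grows. The weighting rule and the example scores are illustrative assumptions, not the paper's fusion formula.

# Hedged sketch: noise-aware fusion of face and speaker recognition scores.
import numpy as np

def fuse(visual_scores, audio_scores, visual_noise_level):
    """Per-user scores from each modality; noise level assumed in [0, 1]."""
    w_visual = 1.0 - visual_noise_level          # trust vision less in bad light
    w_audio = 1.0 + visual_noise_level
    combined = w_visual * visual_scores + w_audio * audio_scores
    return int(np.argmax(combined))              # index of the accepted user

visual = np.array([0.7, 0.2, 0.1])               # e.g. 2DPCA face-matching scores
audio = np.array([0.3, 0.6, 0.1])                # e.g. two-stage speaker scores
print(fuse(visual, audio, visual_noise_level=0.8))  # noisy lighting -> audio dominates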
APA, Harvard, Vancouver, ISO, and other styles
