Academic literature on the topic 'Synthesized speech detection'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Synthesized speech detection.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Journal articles on the topic "Synthesized speech detection"

1. Yan, Diqun, Li Xiang, Zhifeng Wang, and Rangding Wang. "Detection of HMM Synthesized Speech by Wavelet Logarithmic Spectrum." Automatic Control and Computer Sciences 53, no. 1 (January 2019): 72–79. http://dx.doi.org/10.3103/s014641161901005x.

2. Nautsch, Andreas, Xin Wang, Nicholas Evans, Tomi H. Kinnunen, Ville Vestman, Massimiliano Todisco, Hector Delgado, Md Sahidullah, Junichi Yamagishi, and Kong Aik Lee. "ASVspoof 2019: Spoofing Countermeasures for the Detection of Synthesized, Converted and Replayed Speech." IEEE Transactions on Biometrics, Behavior, and Identity Science 3, no. 2 (April 2021): 252–65. http://dx.doi.org/10.1109/tbiom.2021.3059479.

3. Přibil, Jiří, Anna Přibilová, and Jindřich Matoušek. "GMM-Based Evaluation of Synthetic Speech Quality Using 2D Classification in Pleasure-Arousal Scale." Applied Sciences 11, no. 1 (December 22, 2020): 2. http://dx.doi.org/10.3390/app11010002.

Abstract:
The paper focuses on the description of a system for the automatic evaluation of synthetic speech quality based on the Gaussian mixture model (GMM) classifier. The speech material originating from a real speaker is compared with synthesized material to determine similarities or differences between them. The final evaluation order is determined by distances in the Pleasure-Arousal (P-A) space between the original and synthetic speech using different synthesis and/or prosody manipulation methods implemented in the Czech text-to-speech system. The GMM models for continual 2D detection of P-A classes are trained using the sound/speech material from databases with no relation to the original speech or the synthesized sentences. Preliminary and auxiliary analyses show a substantial influence of the number of mixtures, the number and type of the speech features used, and the size of the processed speech material, as well as the type of the database used for the creation of the GMMs, on the P-A classification process and on the final evaluation result. The main evaluation experiments confirm the functionality of the developed system. The objective evaluation results obtained are principally correlated with the subjective ratings of human evaluators; however, partial differences were indicated, so a subsequent detailed investigation must be performed.
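The GMM-based comparison this abstract describes can be sketched in miniature. The sketch below is a generic two-class likelihood classifier over invented 2D "Pleasure-Arousal" coordinates; it uses a single full-covariance Gaussian per class rather than a true mixture, and it is not the authors' Czech TTS evaluation system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 2D "Pleasure-Arousal" feature points for two speech sources.
original = rng.normal(loc=[0.8, 0.6], scale=0.05, size=(200, 2))
synthetic = rng.normal(loc=[0.3, 0.4], scale=0.05, size=(200, 2))

def fit_gaussian(points):
    """Fit a single full-covariance Gaussian (a one-component 'GMM')."""
    mu = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    return mu, cov

def log_likelihood(x, mu, cov):
    """Log density of a 2D Gaussian evaluated at point x."""
    diff = x - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (2 * np.log(2 * np.pi) + logdet + diff @ inv @ diff)

models = {"original": fit_gaussian(original),
          "synthetic": fit_gaussian(synthetic)}

def classify(x):
    """Assign x to the class whose Gaussian gives it higher likelihood."""
    return max(models, key=lambda c: log_likelihood(x, *models[c]))
```

A point near the real-speaker cluster, e.g. `classify(np.array([0.78, 0.61]))`, is labeled `'original'`; the paper's evaluation ordering would then come from distances between class regions in the P-A plane.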
4. Tian, Hui, Jun Sun, Yongfeng Huang, Tian Wang, Yonghong Chen, and Yiqiao Cai. "Detecting Steganography of Adaptive Multirate Speech with Unknown Embedding Rate." Mobile Information Systems 2017 (2017): 1–18. http://dx.doi.org/10.1155/2017/5418978.

Abstract:
Steganalysis of adaptive multirate (AMR) speech is a significant research topic for preventing cybercrimes based on steganography in mobile speech services. Differing from the state-of-the-art works, this paper focuses on steganalysis of AMR speech with unknown embedding rate, where we present three schemes based on support-vector-machine to address the concern. The first two schemes evolve from the existing image steganalysis schemes, which adopt different global classifiers. One is trained on a comprehensive speech sample set including original samples and steganographic samples with various embedding rates, while the other is trained on a particular speech sample set containing original samples and steganographic samples with uniform distributions of embedded information. Further, we present a hybrid steganalysis scheme, which employs Dempster–Shafer theory (DST) to fuse all the evidence from multiple specific classifiers and provide a synthesized detection result. All the steganalysis schemes are evaluated using the well-selected feature set based on statistical characteristics of pulse pairs and compared with the optimal steganalysis that adopts specialized classifiers for corresponding embedding rates. The experimental results demonstrate that all the three steganalysis schemes are feasible and effective for detecting the existing steganographic methods with unknown embedding rates in AMR speech streams, while the DST-based scheme outperforms the others overall.
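The Dempster–Shafer fusion step mentioned in this abstract can be sketched with the textbook two-hypothesis form of Dempster's rule. The mass values below are invented, and the frame {cover, stego} plus an ignorance set stands in for the paper's fusion of multiple rate-specific classifiers.

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination over the frame {cover, stego}.
    Each mass function assigns belief to 'cover', 'stego', and the
    ignorance set 'theta' (= {cover, stego})."""
    hyps = ("cover", "stego", "theta")
    combined = {h: 0.0 for h in hyps}
    conflict = 0.0
    for a in hyps:
        for b in hyps:
            mass = m1[a] * m2[b]
            if a == b or a == "theta" or b == "theta":
                # Non-empty intersection: keep the more specific hypothesis.
                combined[a if a != "theta" else b] += mass
            else:
                conflict += mass          # cover ∩ stego is empty
    k = 1.0 - conflict                    # normalisation constant
    return {h: v / k for h, v in combined.items()}

# Two hypothetical classifier outputs expressed as mass functions.
m_low_rate = {"cover": 0.6, "stego": 0.3, "theta": 0.1}
m_high_rate = {"cover": 0.7, "stego": 0.2, "theta": 0.1}
fused = dempster_combine(m_low_rate, m_high_rate)
```

Because both classifiers lean toward 'cover', the fused mass concentrates there even more strongly than either input alone.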
5. Xie, Hong En, Qiang Li, and Qin Jun Shu. "A Discontinuous Transmission Method for LPC Speech Codec." Applied Mechanics and Materials 644-650 (September 2014): 4346–50. http://dx.doi.org/10.4028/www.scientific.net/amm.644-650.4346.

Abstract:
In order to improve the utilization of transmission bandwidth in voice communication, this paper proposes a discontinuous transmission method for an LPC speech codec. First, using a voice activity detection (VAD) algorithm, the received signal is divided into voice frames and mute frames, with a transitional frame introduced when a voice frame changes to a mute frame. Voice frames and transitional frames are then encoded at the normal rate, while mute frames are encoded into silence description (SID) frames at a lower rate and sent in discontinuous transmission mode. The transmission frequency of SID frames is adjusted automatically according to the fluctuation of the characteristic parameters of the background noise in mute frames. Finally, the method is applied in a simulation of the MELP vocoder, and the results show that it adapts well to the transmission of mute signals, while the synthesized background noise is comfortable and continuous in auditory perception.
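The VAD step that drives such a scheme can be illustrated with a toy energy detector. The frame length and threshold below are illustrative assumptions, not values from the paper (which targets an LPC/MELP codec with its own VAD).

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold=0.01):
    """Toy frame-level voice activity detector: a frame is 'voice' when
    its mean-square energy exceeds a threshold, else 'mute'.
    frame_len=160 corresponds to 20 ms at 8 kHz (an assumption)."""
    n_frames = len(signal) // frame_len
    labels = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))
        labels.append("voice" if energy > threshold else "mute")
    return labels
```

A DTX scheme in the spirit of the paper would then encode only the "mute" frames into low-rate SID frames and transmit them intermittently.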
6. Makhmudov, Fazliddin, Mukhriddin Mukhiddinov, Akmalbek Abdusalomov, Kuldoshbay Avazov, Utkir Khamdamov, and Young Im Cho. "Improvement of the end-to-end scene text recognition method for “text-to-speech” conversion." International Journal of Wavelets, Multiresolution and Information Processing 18, no. 06 (September 15, 2020): 2050052. http://dx.doi.org/10.1142/s0219691320500526.

Abstract:
Methods for text detection and recognition in images of natural scenes have become an active research topic in computer vision and have obtained encouraging achievements over several benchmarks. In this paper, we introduce a robust yet simple pipeline that produces accurate and fast text detection and recognition for the Uzbek language in natural scene images using a fully convolutional network and the Tesseract OCR engine. First, the text detection step quickly predicts text in random orientations in full-color images with a single fully convolutional neural network, discarding redundant intermediate stages. Then, the text recognition step recognizes the Uzbek language, including both the Latin and Cyrillic alphabets, using a trained Tesseract OCR engine. Finally, the recognized text can be pronounced using the Uzbek language text-to-speech synthesizer. The proposed method was tested on the ICDAR 2013, ICDAR 2015 and MSRA-TD500 datasets, and it showed an advantage in efficiently detecting and recognizing text from natural scene images for assisting the visually impaired.
7. Hossain, Prommy Sultana, Amitabha Chakrabarty, Kyuheon Kim, and Md Jalil Piran. "Multi-Label Extreme Learning Machine (MLELMs) for Bangla Regional Speech Recognition." Applied Sciences 12, no. 11 (May 27, 2022): 5463. http://dx.doi.org/10.3390/app12115463.

Abstract:
Extensive research has been conducted in the past to determine age, gender, and words spoken in Bangla speech, but no work has identified the regional dialect spoken by the speaker. Hence, in this study, we create a dataset containing 30 h of Bangla speech in seven regional dialects, with the goal of detecting and categorizing synthesized Bangla speech. To categorize the regional dialect spoken in the Bangla speech and determine its authenticity, we propose a model consisting of a Stacked Convolutional Autoencoder (SCAE) and a sequence of Multi-Label Extreme Learning Machines (MLELM). The SCAE creates a detailed feature map by identifying the spatially and temporally salient qualities of the MFEC input data. The feature map is then passed to the MLELM networks to generate soft labels and then hard labels. As aging produces physiological changes in the brain that alter the processing of aural information, the model takes age class into account when generating dialect class labels, which increases classification accuracy from 85% (without age class) to 95% (with it). The classification accuracy for synthesized Bangla speech labels is 95%. The proposed methodology also works well with English-language audio sets.
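The ELM stage of such a pipeline is simple enough to sketch: random, fixed input weights and a closed-form least-squares solve for the output weights. Everything below (hidden size, data shapes, one-hot targets) is an assumption for illustration, and the SCAE feature extractor is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, Y, hidden=120):
    """Single-hidden-layer Extreme Learning Machine: input weights are
    drawn at random and never trained; only the output weights beta are
    solved, by least squares. A simplified stand-in for the MLELM stage."""
    W = rng.standard_normal((X.shape[1], hidden))
    b = rng.standard_normal(hidden)
    H = np.tanh(X @ W + b)           # hidden-layer activations
    beta = np.linalg.pinv(H) @ Y     # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Forward pass with the frozen random layer plus learned beta."""
    return np.tanh(X @ W + b) @ beta
```

With more hidden units than training samples, the least-squares solve interpolates the training targets exactly, which is why ELM training is a single matrix decomposition rather than an iterative descent.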
8. Sarmah, Elina, and Philip Kennedy. "Detecting Silent Vocalizations in a Locked-In Subject." Neuroscience Journal 2013 (November 7, 2013): 1–12. http://dx.doi.org/10.1155/2013/594624.

Abstract:
Problem Addressed. Decoding of silent vocalization would be enhanced by detecting vocalization onset. This is necessary in order to improve decoding of neural firings and thus synthesize near conversational speech in locked-in subjects implanted with brain computer interfacing devices. Methodology. Cortical recordings were obtained during attempts at inner speech in a mute and paralyzed subject (ER) implanted with a recording electrode to detect and analyze lower beta band peaks meeting the criterion of a minimum 0.2% increase in the power spectrum density (PSD). To provide supporting data, three speaking subjects were used in a similar testing paradigm using EEG signals recorded over the speech area. Results. Conspicuous lower beta band peaks were identified around the time of assumed speech onset. The correlations between single unit firings, recorded at the same time as the continuous neural signals, were found to increase after the lower beta band peaks as compared to before the peaks. Studies in the nonparalyzed control individuals suggested that the lower beta band peaks were related to the movement of the articulators of speech (tongue, jaw, and lips), not to higher order speech processes. Significance and Potential Impact. The results indicate that the onset of silent and overt speech is associated with a sharp peak in lower beta band activity—an important step in the development of a speech prosthesis. This raises the possibility of using these peaks in online applications to assist decoding paradigms being developed to decode speech from neural signal recordings in mute humans.
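The lower-beta-band peak detection described above can be illustrated with a toy periodogram computation. The sampling rate, band edges, and simulated 15 Hz burst below are assumptions for illustration, not parameters from the study.

```python
import numpy as np

fs = 256                                  # Hz, assumed sampling rate
t = np.arange(0, 4, 1 / fs)
rng = np.random.default_rng(1)
baseline = 0.1 * rng.standard_normal(t.size)
# Simulated recording: a 15 Hz (lower-beta) burst riding on noise.
signal = baseline + 0.5 * np.sin(2 * np.pi * 15 * t)

def band_power(x, lo=12.0, hi=20.0):
    """Power summed over a frequency band of a plain FFT periodogram."""
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / x.size
    mask = (freqs >= lo) & (freqs <= hi)
    return float(psd[mask].sum())

# Ratio > 1 indicates a band-limited peak relative to baseline activity.
increase = band_power(signal) / band_power(baseline)
```

An online detector in the spirit of the paper would compute this band power in sliding windows and flag windows whose power rises by more than a small threshold over the running baseline.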
9. Wan, Yuzhi, and Nadine Sarter. "The Effects of Masking on the Detection of Alarms in Close Temporal Proximity." Proceedings of the Human Factors and Ergonomics Society Annual Meeting 62, no. 1 (September 2018): 1545–46. http://dx.doi.org/10.1177/1541931218621349.

Abstract:
In many complex data-rich domains, safety is highly dependent on the timely and reliable detection and identification of alarms. However, due to the coupling and complexity of systems in these environments, large numbers of alarms can occur within a short period of time – a problem called an alarm flood (Perrow, 2011). Alarm floods have been defined as more than 10 alarms in a 10-minute period (EEMUA, 1999); however, this rate is often exceeded, which can lead to operators missing or misinterpreting critical alarms and, as a result, system failures and accidents. Various types of masking effects may account for observed failures to detect and identify alarms during an alarm flood. Masking occurs when one stimulus is obscured by the presence of another stimulus that appears either simultaneously or in close temporal proximity (Enns & Di Lollo, 2000). One example of masking is an attentional blink, where the second of two stimuli is missed when presented in close temporal proximity to a preceding stimulus (Raymond, Shapiro, & Arnell, 1992). To date, attentional blinks have been studied almost exclusively in the context of two target stimuli of very short duration (less than 100ms) and in simple single-task conditions. These experiments suggest that the phenomenon occurs when two stimuli are separated by 200-600ms. However, there is limited empirical evidence (e.g., Ferris et al., 2006) that, in more complex and demanding task environments, detection performance suffers even with a longer stimulus onset asynchrony (SOA). To better predict and prevent the occurrence of attentional blinks in alarm floods, the current study aimed to establish the SOA range that results in missed signals in the context of multiple visual and auditory alarms in a multi-task environment. The participants in this study were 26 students from the University of Michigan (aged 20-30 years). The experiment was conducted using a simulation of an automated package delivery system.
Participants were required to monitor the performance of eight delivery drones and perform two tasks: (1) search and confirm that a delivery pad was present before agreeing to package delivery; (2) detect and respond to visual alarms and auditory alarms associated with the various drones. Visual alarms took the form of a number presented in the center of the screen that identified the affected drone; auditory alarms used synthesized speech to present the drone number. Participants had to acknowledge the alarm as quickly as possible by pressing a button adjacent to the drone window. Both visual and auditory alarms lasted 200ms. Crossmodal matching was performed to ensure that the perceived intensity of signals in the two modalities was the same for each individual (see Pitts, Riggs, & Sarter, 2016). Alarms appeared either by themselves (single alarms) or in close temporal proximity of another alarm (alarm pairs). Each experiment scenario was 30 minutes long and included 40 single alarms and 40 alarm pairs. In addition, a 3-minute alarm flood was included in each scenario, consisting of 30 single alarms and 30 alarm pairs. The experiment employed a 5×4 full factorial design. The two independent variables, both varied within subjects, were SOA (200, 600, 800, 1000, 1200ms) and modality pairs (all four combinations of visual and auditory alarms). The dependent measures in this study were detection rate, accuracy of identification, and response time. The detection rate for visual alarms was lower when the alarm was the second in an alarm pair, compared to single visual alarms (89.9% vs. 93.9%; χ2(2, N = 22) = 6.874, p < .01). This effect was independent of the modality of the first alarm and strongest with an SOA of 1000ms. No difference was observed for the detection of single versus paired auditory alarms. Identification accuracy for visual alarms was also significantly lower when the alarm appeared second in a pair, compared to single visual alarms (86.0% vs. 94.0%; χ2(2, N = 22) = 6.007, p = .05). This effect was also independent of the modality of the first alarm, but found only with SOAs of 600, 1000, or 1200ms. Also, no significant difference in accuracy was found for single versus paired auditory alarms. Finally, response times were significantly faster during alarm floods, compared to single alarms or alarm pairs (2160ms vs. 2318ms; F (1, 21) = 6.284, p = .001). Response times to visual and auditory alarms did not differ significantly during alarm floods. In summary, in this experiment, alarm detection and identification suffered when a visual (but not an auditory) alarm was preceded by another visual or auditory alarm. This performance decrement was observed at longer SOAs than reported in earlier single-task studies. This finding may be explained, in part, by the competing visual (but not auditory) demands imposed by the required response to the alarms. Performance during alarm floods was comparable, and even improved in terms of response times, compared to single alarms and alarm pairs. This finding may be explained by the Yerkes-Dodson Law (1908) which describes that performance improves with physiological or mental arousal, up to a point, and then decreases again when arousal increases further. Another possible explanation is that participants invested more effort during alarm floods. The findings from this study add to the knowledge base in attention and alarm design. They highlight the importance of examining attentional phenomena in applied settings to be able to predict and counter performance breakdowns that may be experienced by operators engaged in multitasking in complex data-rich environments.
10. Jain, Mahek, Bhavya Bhagerathi, and Sowmyarani C. N. "Real-Time Driver Drowsiness Detection using Computer Vision." International Journal of Engineering and Advanced Technology 11, no. 1 (October 30, 2021): 109–13. http://dx.doi.org/10.35940/ijeat.a3159.1011121.

Abstract:
The proposed system aims to reduce the number of accidents caused by drivers' drowsiness and fatigue, which has become a common cause of accidents in recent times, and thereby improve transportation safety. Several facial and body gestures are treated as signs of drowsiness and fatigue in drivers, including tired eyes and yawning; these features indicate that the driver's condition is impaired. The EAR (Eye Aspect Ratio) computes the ratio of distances between the vertical and horizontal eye landmarks, which is required for drowsiness detection. For yawn detection, a YAWN value is calculated from the distance between the lower lip and the upper lip and compared against a threshold. We deploy an eSpeak module (a text-to-speech synthesizer) to give appropriate voice alerts when the driver is drowsy or yawning. The system is designed to decrease the accident rate and help prevent fatalities caused by road accidents.
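The EAR computation mentioned in this abstract has a well-known closed form. The sketch below uses the common six-landmark eye contour ordering (as in the dlib 68-point layout); the coordinates are invented for illustration.

```python
import math

def ear(eye):
    """Eye Aspect Ratio from six (x, y) eye landmarks ordered around
    the contour (p1..p6): the two vertical landmark distances divided
    by twice the horizontal distance. Large when the eye is open,
    near zero when it closes."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    p1, p2, p3, p4, p5, p6 = eye
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

# Invented landmark coordinates: a tall (open) and a flat (closing) eye.
open_eye = [(0, 0), (1, 2), (2, 2), (3, 0), (2, -2), (1, -2)]
closing_eye = [(0, 0), (1, 0.2), (2, 0.2), (3, 0), (2, -0.2), (1, -0.2)]
```

In practice, drowsiness is flagged when the EAR stays below a tuned threshold for several consecutive video frames, which filters out ordinary blinks.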

Book chapters on the topic "Synthesized speech detection"

1. Wu, QingE, and Weidong Yang. "A Local Approach and Comparison with Other Data Mining Approaches in Software Application." In Examining Information Retrieval and Image Processing Paradigms in Multidisciplinary Contexts, 1–26. IGI Global, 2017. http://dx.doi.org/10.4018/978-1-5225-1884-6.ch001.

Abstract:
In order to perform online, real-time, and effective aging detection for software, this chapter studies a local approach, also called the fuzzy-incomplete approach, together with a statistical data mining approach, and gives their algorithmic implementation in software system fault diagnosis. The two data mining approaches are compared with four classical data mining approaches in software system fault diagnosis. The performance of each approach is evaluated in terms of sensitivity, specificity, accuracy rate, misclassification rate, missed-classification rate, and runtime, and an optimum approach is chosen for comparative study. On a dataset of 1020 samples, the results show that the fuzzy-incomplete approach has the highest sensitivity and forecast accuracy, 96.13% and 94.71% respectively, higher than those of the other approaches. It also has a relatively low misclassification rate of about 4.12%, the lowest missed-classification rate of about 1.18%, and the shortest runtime, 0.35 s, all less than those of the other approaches. After all performance indices are evaluated and synthesized, the results indicate that the fuzzy-incomplete approach performs best. Moreover, the test analysis shows that the fuzzy-incomplete approach has further advantages: faster detection speed, lower storage requirements, and no need for prior information beyond the data processing itself. These results indicate that this mining approach is more effective and feasible than older data mining approaches for software aging detection.

Conference papers on the topic "Synthesized speech detection"

1. Bartusiak, Emily R., and Edward J. Delp. "Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis." In 2021 55th Asilomar Conference on Signals, Systems, and Computers. IEEE, 2021. http://dx.doi.org/10.1109/ieeeconf53345.2021.9723142.

2. Krishnathasan, Mathangi, and C. R. J. Amalraj. "Speaker Change Detection for Conversational Speech using Synthesized Voice Embedding." In 2019 4th International Conference on Information Technology Research (ICITR). IEEE, 2019. http://dx.doi.org/10.1109/icitr49409.2019.9407791.

3. Singh, Arun Kumar, and Priyanka Singh. "Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics." In 2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 2021. http://dx.doi.org/10.1109/mipr51284.2021.00076.

4. Nosek, Tijana, Sinisa Suzic, Boris Papic, and Niksa Jakovljevic. "Synthesized Speech Detection Based on Spectrogram and Convolutional Neural Networks." In 2019 27th Telecommunications Forum (TELFOR). IEEE, 2019. http://dx.doi.org/10.1109/telfor48224.2019.8971215.

5. Choi, Yeunju, Youngmoon Jung, and Hoirin Kim. "Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning with Spoofing Detection and Spoofing Type Classification." In 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021. http://dx.doi.org/10.1109/slt48900.2021.9383533.

6. Liu, Xiaohui, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, and Jianwu Dang. "Deep Spectro-temporal Artifacts for Detecting Synthesized Speech." In MM '22: The 30th ACM International Conference on Multimedia. New York, NY, USA: ACM, 2022. http://dx.doi.org/10.1145/3552466.3556527.

7. Banerjee, Sandipan, Ajjen Joshi, Ahmed Ghoneim, Survi Kyal, and Taniya Mishra. "Synthesize & Learn: Jointly Optimizing Generative and Classifier Networks for Improved Drowsiness Detection." In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. http://dx.doi.org/10.1109/icassp39728.2021.9413822.

8. Chang, Su-Yu, Kai-Cheng Wu, and Chia-Ping Chen. "Transfer-Representation Learning for Detecting Spoofing Attacks with Converted and Synthesized Speech in Automatic Speaker Verification System." In Interspeech 2019. ISCA, 2019. http://dx.doi.org/10.21437/interspeech.2019-2014.

9. Bu, Huanxian, Wenjun Yu, and Xun Huang. "Compressive Sensing Approach for Aeroengine Fan Noise Mode Detection." In ASME Turbo Expo 2018: Turbomachinery Technical Conference and Exposition. American Society of Mechanical Engineers, 2018. http://dx.doi.org/10.1115/gt2018-75141.

Abstract:
To further simplify the sensor array set-ups and improve the mode detection capability for the aeroengine fan noise test, a new compressive sensing based methodology has been proposed. This paper reports the details of the validated aeroengine fan noise test method and the wind tunnel test results for the validation. The experimental set-up consists of a transition duct to the open jet, a mode synthesizer to generate different modes of characteristic fan noise, and a sensor array to conduct mode detection in the presence of background flow speeds and background noise interference. The main attention is primarily focused on the examination of the associated reconstruction accuracy and probability of success for spinning mode detection. The testing results clearly show the potential capability of the proposed new testing method for aeroengine tests in a practical testing facility.
10. Prasad, Varsha, and S. Sandya. "Single Event Transient Tolerant High Speed Phase Frequency Detector for PLL Based Frequency Synthesizer." In 2014 International Conference on Circuits, Communication, Control and Computing (I4C). IEEE, 2014. http://dx.doi.org/10.1109/cimca.2014.7057761.

