Dissertations / Theses on the topic 'Audio speaker'

Consult the top 50 dissertations / theses for your research on the topic 'Audio speaker.'

1

Khan, Faheem. "Audio-visual speaker separation." Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/59679/.

Abstract:
Communication using speech is often an audio-visual experience. Listeners hear what is being uttered by speakers and also see the corresponding facial movements and other gestures. This thesis is an attempt to exploit this bimodal (audio-visual) nature of speech for speaker separation. In addition to the audio speech features, visual speech features are used to achieve the task of speaker separation. An analysis of the correlation between audio and visual speech features is carried out first. This correlation between audio and visual features is then used in the estimation of clean audio features from visual features using Gaussian Mixture Models (GMMs) and Maximum a Posteriori (MAP) estimation. For speaker separation, three methods are proposed that use the estimated clean audio features. Firstly, the estimated clean audio features are used to construct a Wiener filter to separate the mixed speech at various signal-to-noise ratios (SNRs) into target and competing speakers. The Wiener filter gains are modified in several ways in search of improvements in quality and intelligibility of the extracted speech. Secondly, the estimated clean audio features are used to develop a visually-derived binary masking method for speaker separation. The estimated audio features are used to compute time-frequency binary masks that identify the regions where the target speaker dominates. These regions are retained and form the estimate of the target speaker's speech. Experimental results compare the visually-derived binary masks with ideal binary masks and show a useful level of accuracy. The effectiveness of the visually-derived binary mask for speaker separation is then evaluated through estimates of speech quality and speech intelligibility, showing substantial gains over the original mixture. Thirdly, the estimated clean audio features and the visually-derived Wiener filtering are used to modify the operation of an effective audio-only method of speaker separation, namely the soft mask method, to allow visual speech information to improve the separation task. Experimental results are presented that compare the proposed audio-visual speaker separation with the audio-only method using both speech quality and intelligibility metrics. Finally, a detailed comparison is made of the proposed and existing methods of speaker separation using objective and subjective measures.
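(Illustrative sketch, not from the thesis: once the target and competing speakers' short-time power spectra have been estimated, here assumed given in place of the GMM/MAP visual-to-audio estimation step, the Wiener gain and the binary mask described above are per-bin one-liners.)

    import numpy as np

    def wiener_gain(target_psd, interferer_psd, eps=1e-12):
        # Soft Wiener gain per time-frequency bin: target / (target + interferer).
        return target_psd / (target_psd + interferer_psd + eps)

    def binary_mask(target_psd, interferer_psd, threshold_db=0.0):
        # Hard mask: keep bins where the target dominates by threshold_db.
        ratio_db = 10.0 * np.log10((target_psd + 1e-12) / (interferer_psd + 1e-12))
        return (ratio_db > threshold_db).astype(float)

    # Toy example: 3 frequency bins x 2 frames of estimated power spectra.
    S_target = np.array([[4.0, 0.1], [1.0, 2.0], [0.2, 0.3]])
    S_interf = np.array([[1.0, 0.4], [1.0, 0.1], [0.8, 0.1]])
    print(wiener_gain(S_target, S_interf))  # soft gains in [0, 1]
    print(binary_mask(S_target, S_interf))  # 0/1 mask where target dominates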
2

Kwon, Patrick (Patrick Ryan) 1975. "Speaker spotting : automatic annotation of audio data with speaker identity." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/47608.

3

Seymour, R. "Audio-visual speech and speaker recognition." Thesis, Queen's University Belfast, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492489.

Abstract:
In this thesis, a number of important issues relating to the use of both audio and video information for speech and speaker recognition are investigated. A comprehensive comparison of different visual feature types is given, including both geometric and image transformation based features. A new geometric based method for feature extraction is described, as well as the novel use of curvelet based features. Different methods for constructing the feature vectors are compared, as well as feature vector sizes and the use of dynamic features. Each feature type is tested against three types of visual noise: compression, blurring and jitter. A novel method of integrating the audio and video information streams called the maximum stream posterior (MSP) is described. This method is tested in both speaker dependent and speaker independent audio-visual speech recognition (AVSR) systems, and is shown to be robust to noise in either the audio or video streams, given no prior knowledge of the noise. This method is then extended to form the maximum weighted stream posterior (MWSP) method. Finally, both the MSP and MWSP are tested in an audio-visual speaker recognition system (AVSpR). Experiments using the XM2VTS database show that both of these methods can outperform standard methods in terms of recognition accuracy in situations where either stream is corrupted.
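(Illustrative sketch, not the thesis's exact formulation: one plausible reading of maximum-stream-posterior fusion is to score the classes under several candidate audio/video stream weightings and keep the weighting whose top-class posterior is largest. The weight grid and the toy log-likelihoods below are assumptions.)

    import numpy as np

    def stream_posterior(log_p_audio, log_p_video, w):
        # Weighted per-class combination of the two streams' log-likelihoods.
        log_p = w * log_p_audio + (1.0 - w) * log_p_video
        p = np.exp(log_p - log_p.max())
        return p / p.sum()

    def msp_decision(log_p_audio, log_p_video, weights=(0.0, 0.25, 0.5, 0.75, 1.0)):
        # Keep the stream weighting that yields the most confident posterior.
        best = max((stream_posterior(log_p_audio, log_p_video, w).max(),
                    stream_posterior(log_p_audio, log_p_video, w).argmax(), w)
                   for w in weights)
        return best  # (top posterior, winning class index, chosen weight)

    la = np.array([-2.0, -1.0, -3.0])  # audio stream log-likelihoods (noisy)
    lv = np.array([-0.5, -2.5, -2.0])  # video stream log-likelihoods (clean)
    print(msp_decision(la, lv))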
4

Malegaonkar, Amit. "Speaker-based indexation of conversational audio." Thesis, University of Hertfordshire, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.440175.

5

D'Arca, Eleonora. "Speaker tracking in a joint audio-video network." Thesis, Heriot-Watt University, 2015. http://hdl.handle.net/10399/2972.

Abstract:
Situational awareness is achieved naturally by the human senses of sight and hearing in combination. System-level automatic scene understanding aims at replicating this human ability using cooperative microphones and cameras. In this thesis, we integrate and fuse audio and video signals at different levels of abstraction to detect and track a speaker in a scenario where people are free to move indoors. Despite the low complexity of the system, which consists of just 4 microphone pairs and 1 camera, results show that the overall multimodal tracker is more reliable than single-modality systems, tolerating large occlusions and cross-talking. The system evaluation is performed on both single-modality and multimodality tracking. The performance improvement given by the audio-video integration and fusion is quantified in terms of tracking precision and accuracy as well as speaker diarisation error rate and precision-recall recognition metrics. We evaluate our results against the closest works: a 56% improvement in the computational cost of audio-only sound source localisation and an 18% improvement in the speaker diarisation error rate over a speaker-only unit are achieved.
6

Lathe, Andrew. "Speaker Prototyping Design." Digital Commons @ East Tennessee State University, 2020. https://dc.etsu.edu/honors/584.

Abstract:
Audio design is a pertinent industry in today's world, with an extremely large market including leaders such as Bose, Harman International, and Sennheiser. This project is designed to explore the processes that are necessary to create a new type of product in this market. The end goal is to have a functioning, high-quality set of speakers to prove various concepts of design and prototyping. The steps involved in this project go through the entire design process, from the initial choice of product to a finished prototype. Processes include the selection of outsourced components such as drivers and necessary connectors. The design stage will include any design processes necessary to create the enclosure or any electronics. Production will be controlled by shipping dates and any potential issues that lie within the methods chosen for production. The final product will be tested for response. The prototyping process is usually carried out by various departments with deep expertise in their respective fields.
7

Martí, Guerola Amparo. "Multichannel audio processing for speaker localization, separation and enhancement." Doctoral thesis, Universitat Politècnica de València, 2013. http://hdl.handle.net/10251/33101.

Abstract:
This thesis is related to the field of acoustic signal processing and its applications to emerging communication environments. Acoustic signal processing is a very wide research area covering the design of signal processing algorithms involving one or several acoustic signals to perform a given task, such as locating the sound source that originated the acquired signals, improving their signal-to-noise ratio, separating signals of interest from a set of interfering sources or recognizing the type of source and the content of the message. Among the above tasks, Sound Source Localization (SSL) and Automatic Speech Recognition (ASR) have been specially addressed in this thesis. In fact, the localization of sound sources in a room has received a lot of attention in the last decades. Most real-world microphone array applications require the localization of one or more active sound sources in adverse environments (low signal-to-noise ratio and high reverberation). Some of these applications are teleconferencing systems, video-gaming, autonomous robots, remote surveillance, hands-free speech acquisition, etc. Indeed, performing robust sound source localization under high noise and reverberation is a very challenging task. One of the most well-known algorithms for source localization in noisy and reverberant environments is the Steered Response Power - Phase Transform (SRP-PHAT) algorithm, which constitutes the baseline framework for the contributions proposed in this thesis. Another challenge in the design of SSL algorithms is to achieve real-time performance and high localization accuracy with a reasonable number of microphones and limited computational resources. Although the SRP-PHAT algorithm has been shown to be an effective localization algorithm for real-world environments, its practical implementation is usually based on a costly fine grid-search procedure, making the computational cost of the method a real issue. In this context, several modifications and optimizations have been proposed to improve its performance and applicability. An effective strategy that extends the conventional SRP-PHAT functional is presented in this thesis. This approach performs a full exploration of the sampled space rather than computing the SRP at discrete spatial positions, increasing its robustness and allowing for a coarser spatial grid that reduces the computational cost required in a practical implementation with a small hardware cost (reduced number of microphones). This strategy makes it possible to implement real-time applications based on location information, such as automatic camera steering or the detection of speech/non-speech fragments in advanced videoconferencing systems. As stated before, besides the contributions related to SSL, this thesis is also related to the field of ASR. This technology allows a computer or electronic device to identify the words spoken by a person so that the message can be stored or processed in a useful way. ASR is used on a day-to-day basis in a number of applications and services such as natural human-machine interfaces, dictation systems, electronic translators and automatic information desks. However, there are still some challenges to be solved. A major problem in ASR is to recognize people speaking in a room by using distant microphones. In distant-speech recognition, the microphone does not only receive the direct-path signal, but also delayed replicas as a result of multi-path propagation.
Moreover, there are multiple situations in teleconferencing meetings when multiple speakers talk simultaneously. In this context, when multiple speaker signals are present, Sound Source Separation (SSS) methods can be successfully employed to improve ASR performance in multi-source scenarios. This is the motivation behind the training method for multiple-talk situations proposed in this thesis. This training, which is based on a robust transformed model constructed from separated speech in diverse acoustic environments, makes use of an SSS method as a speech enhancement stage that suppresses the unwanted interferences. The combination of source separation and this specific training has been explored and evaluated under different acoustical conditions, leading to improvements of up to 35% in ASR performance.
Martí Guerola, A. (2013). Multichannel audio processing for speaker localization, separation and enhancement [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/33101
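(Illustrative sketch of the baseline SRP-PHAT localiser named above, not the thesis's optimised variant; the geometry, candidate grid, and signals are toy assumptions.)

    import numpy as np

    def gcc_phat(x1, x2):
        # PHAT-weighted generalised cross-correlation, lag-centred.
        n = len(x1) + len(x2)
        X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
        cross = X1 * np.conj(X2)
        cross /= np.abs(cross) + 1e-12  # phase transform: keep phase only
        cc = np.fft.irfft(cross, n)
        return np.concatenate((cc[-n // 2:], cc[:n // 2]))

    def srp_phat(signals, mic_pos, grid, fs, c=343.0):
        # Steered response power: for every candidate position, sum each
        # microphone pair's GCC-PHAT value at the lag that position implies.
        scores = np.zeros(len(grid))
        for i in range(len(signals)):
            for j in range(i + 1, len(signals)):
                cc = gcc_phat(signals[i], signals[j])
                centre = len(cc) // 2
                for g, p in enumerate(grid):
                    tdoa = (np.linalg.norm(p - mic_pos[i])
                            - np.linalg.norm(p - mic_pos[j])) / c
                    scores[g] += cc[centre + int(round(tdoa * fs))]
        return grid[np.argmax(scores)]

    # Toy usage: two mics 20 cm apart; mic 1 hears the source 5 samples later.
    fs = 16000
    s = np.random.default_rng(0).standard_normal(4096)
    mics = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]])
    grid = np.array([[x, 1.0, 0.0] for x in np.linspace(-1.0, 1.0, 41)])
    print(srp_phat([s, np.roll(s, 5)], mics, grid, fs))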
8

Lucey, Simon. "Audio-visual speech processing." Thesis, Queensland University of Technology, 2002. https://eprints.qut.edu.au/36172/7/SimonLuceyPhDThesis.pdf.

Abstract:
Speech is inherently bimodal, relying on cues from the acoustic and visual speech modalities for perception. The McGurk effect demonstrates that when humans are presented with conflicting acoustic and visual stimuli, the perceived sound may not exist in either modality. This effect has formed the basis for modelling the complementary nature of acoustic and visual speech by encapsulating them into the relatively new research field of audio-visual speech processing (AVSP). Traditional acoustic-based speech processing systems have attained a high level of performance in recent years, but the performance of these systems is heavily dependent on a match between training and testing conditions. In the presence of mismatched conditions (e.g. acoustic noise) the performance of acoustic speech processing applications can degrade markedly. AVSP aims to increase the robustness and performance of conventional speech processing applications through the integration of the acoustic and visual modalities of speech, in particular for the tasks of isolated-word speech recognition and text-dependent speaker recognition. Two major problems in AVSP are addressed in this thesis, the first of which concerns the extraction of pertinent visual features for effective speech reading and visual speaker recognition. Appropriate representations of the mouth are explored for improved classification performance for speech and speaker recognition. Secondly, there is the question of how to effectively integrate the acoustic and visual speech modalities for robust and improved performance. This question is explored in depth using hidden Markov model (HMM) classifiers. The development and investigation of integration strategies for AVSP required research into a new branch of pattern recognition known as classifier combination theory. A novel framework is presented for optimally combining classifiers so that their combined performance is greater than that of any of those classifiers individually. The benefits of this framework are not restricted to AVSP, as they can be applied to any task where there is a need for combining independent classifiers.
9

Abdelraheem, Mahmoud Fakhry Mahmoud. "Exploiting spatial and spectral information for audio source separation and speaker diarization." Doctoral thesis, University of Trento, 2016. http://eprints-phd.biblio.unitn.it/1876/1/PhD_Thesis.pdf.

Abstract:
The goal of multichannel audio source separation is to produce high-quality separated audio signals from observed mixtures of these signals. The difficulty of the problem comes not only from source propagation through noisy and echoing environments, but also from overlapping source signals. Among the different research directions pursued around this problem, the adoption of probabilistic and advanced modeling aims at exploiting the diversity of multichannel propagation and the redundancy of source signals. Moreover, prior information about the environments or the signals is helpful to improve the quality and to accelerate the separation. In this thesis, we propose methods to increase the effectiveness of model-based audio source separation methods by exploiting prior information, applying spectral and sparse modeling theories. The work is divided into two main parts. In the first part, spectral modeling based on Nonnegative Matrix Factorization is adopted to represent the source signals. The parameters of Gaussian model-based source separation are estimated in the Maximum-Likelihood sense using a Generalized Expectation-Maximization algorithm by applying supervised Nonnegative Matrix and Tensor Factorization, given spectral descriptions of the source signals. Three modalities of making the descriptions available are addressed, i.e. the descriptions are trained on-line during the separation, pre-trained and made directly available, or pre-trained and made indirectly available. In the latter, a detection method is proposed in order to identify the descriptions best representing the signals in the mixtures. In the second part, sparse modeling is adopted to represent the propagation environments. Spatial descriptions of the environments, either deterministic or probabilistic, are pre-trained and made indirectly available. A detection method is proposed in order to identify the deterministic descriptions best representing the environments. The detected descriptions are then used to perform source separation by minimizing a non-convex l0-norm function. For speaker diarization, where the task is to determine "who spoke when" in real meetings, a Watson mixture model is optimized using an Expectation-Maximization algorithm in order to detect the probabilistic descriptions best representing the environments and to estimate the temporal activity of each source. The performance of the proposed methods is experimentally evaluated using different datasets, both simulated and live-recorded. The elaborated results show the superiority of the proposed methods over recently developed methods used as baselines.
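(Illustrative sketch of the supervised-NMF idea described above, with pre-trained per-speaker spectral dictionaries made "directly available". The Euclidean multiplicative updates and the random toy data are assumptions, not the thesis's Generalized EM algorithm.)

    import numpy as np

    def nmf_activations(V, W, n_iter=200):
        # Fit activations H to the mixture V with the dictionary W held fixed
        # (multiplicative updates for the Euclidean cost).
        H = np.full((W.shape[1], V.shape[1]), 0.5)
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        return H

    def separate(V_mix, W1, W2):
        # Stack both speakers' dictionaries, fit the mixture, then split it
        # with Wiener-style masks built from each speaker's reconstruction.
        H = nmf_activations(V_mix, np.hstack((W1, W2)))
        V1, V2 = W1 @ H[:W1.shape[1]], W2 @ H[W1.shape[1]:]
        total = V1 + V2 + 1e-12
        return V_mix * V1 / total, V_mix * V2 / total

    # Toy usage: random 64-bin dictionaries with 8 atoms per speaker.
    rng = np.random.default_rng(1)
    W1, W2 = rng.random((64, 8)), rng.random((64, 8))
    V = W1 @ rng.random((8, 20)) + W2 @ rng.random((8, 20))
    S1, S2 = separate(V, W1, W2)
    print(np.allclose(S1 + S2, V))  # the two masks partition the mixture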
10

Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing." Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/17689/3/David_Dean_Thesis.pdf.

Abstract:
Both human perceptual studies and automatic machine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer interfaces involve the recognition of the speech events themselves, and the recognition of speakers' identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs). The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide for an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone. This thesis investigates a number of novel techniques for training and decoding of SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments are conducted to demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques of normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing well-performing acoustic hidden Markov models (HMMs). This technique is demonstrated to provide improved audio-visual modelling over the jointly-trained SHMM approach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach is shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.
11

Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing." Queensland University of Technology, 2008. http://eprints.qut.edu.au/17689/.

Abstract:
Both human perceptual studies and automatic machine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer interfaces involve the recognition of the speech events themselves, and the recognition of speakers' identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs). The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide for an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone. This thesis investigates a number of novel techniques for training and decoding of SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments are conducted to demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques of normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing well-performing acoustic hidden Markov models (HMMs). This technique is demonstrated to provide improved audio-visual modelling over the jointly-trained SHMM approach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach is shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.
12

Almaadeed, Noor. "Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification." Thesis, Brunel University, 2014. http://bura.brunel.ac.uk/handle/2438/8760.

Abstract:
The rapid momentum of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from his or her voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice in producing a robust authentication system. A novel approach towards speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN) and Radial Basis Function Neural Network (RBF-NN) with an AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% as compared to the classical Mel-frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to the Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA). Another novel approach using vowel formant analysis is implemented using Linear Discriminant Analysis (LDA). Vowel formant based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage and time efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme does not require any training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, but the proposed score-based methodology stays almost linear. Finally, a novel audio-visual fusion based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform the feature-level fusion in terms of accuracy and error resilience. This result is in line with the distinct nature of the two modalities, which lose themselves when combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy.
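(Illustrative sketch of the score- and decision-level fusion rules mentioned above; the threshold, weight, and toy scores are assumptions.)

    def score_fusion(voice_score, face_score, w=0.5):
        # Score-level fusion: weighted sum of the two modalities' scores.
        return w * voice_score + (1.0 - w) * face_score

    def decision_fusion_or(decisions):
        # OR voting: accept the identity claim if any classifier accepts.
        return any(decisions)

    def decision_fusion_and(decisions):
        # AND voting (as used with the neural-network ensemble): accept only
        # if every classifier accepts, trading false accepts for rejects.
        return all(decisions)

    accept = score_fusion(0.82, 0.64) > 0.7  # True: fused score is 0.73
    print(accept, decision_fusion_or([True, False]), decision_fusion_and([True, False]))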
13

Raghunathan, Anusha. "EVALUATION OF INTELLIGIBILITY AND SPEAKER SIMILARITY OF VOICE TRANSFORMATION." UKnowledge, 2011. http://uknowledge.uky.edu/gradschool_theses/101.

Abstract:
Voice transformation refers to a class of techniques that modify the voice characteristics either to conceal the identity or to mimic the voice characteristics of another speaker. Its applications include automatic dialogue replacement and voice generation for people with voice disorders. The diversity in applications makes evaluation of voice transformation a challenging task. The objective of this research is to propose a framework to evaluate intentional voice transformation techniques. Our proposed framework is based on two fundamental qualities: intelligibility and speaker similarity. Intelligibility refers to the clarity of the speech content after voice transformation, and speaker similarity measures how well the modified output disguises the source speaker. We measure intelligibility with word error rates and speaker similarity with the likelihood of identifying the correct speaker. The novelty of our approach is that we consider whether similarly transformed training data are available to the recognizer. We have demonstrated that this factor plays a significant role in intelligibility and speaker similarity for both human testers and automated recognizers. We thoroughly test two classes of voice transformation techniques, pitch distortion and voice conversion, using our proposed framework. We apply our results to patients with voice hypertension using video self-modeling, and preliminary results are presented.
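(Illustrative sketch: the word error rate used above as the intelligibility measure is the word-level Levenshtein distance normalised by the reference length. The toy sentences are assumptions.)

    def word_error_rate(reference, hypothesis):
        # (substitutions + insertions + deletions) / reference word count.
        r, h = reference.split(), hypothesis.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),
                              d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(r)][len(h)] / max(len(r), 1)

    # One substitution and one deletion against six reference words: 2/6.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))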
14

Krishnan, Ravikiran. "Detecting Group Turns of Speaker Groups in Meeting Room Conversations Using Audio-Video Change Scale-Space." Scholar Commons, 2010. http://scholarcommons.usf.edu/etd/3644.

Abstract:
Automatic analysis of conversations is important for extracting high-level descriptions of meetings. In this work, as an alternative to linguistic approaches, we develop a novel, purely bottom-up representation, constructed from both audio and video signals, that helps us characterize and build a rich description of the content at multiple temporal scales. Nonverbal communication plays an important role in conveying information about the communication and the nature of the conversation. We consider simple audio and video features to extract these changes in conversation. In order to detect these changes, we consider the evolution of the detected change, using the Bayesian Information Criterion (BIC) at multiple temporal scales to build an audio-visual change scale-space. Peaks detected in this representation yield group-turn-based conversational changes at different temporal scales. We use the NIST Meeting Room corpus to test our approach. Four clips of eight minutes are extracted from this corpus at random, and the other ten are extracted after 90 seconds from the start of the entire video in the corpus. A single microphone and a single camera are used from the dataset. The group turns detected in this test gave an overall detection result, when compared over different thresholds with a fixed group-turn scale range, of 82%, and a best result of 91% for a single video. Conversation overlaps, changes and their inferred models offer an intermediate-level description of meeting videos that is useful in summarization and indexing of meetings. Since the proposed solutions are computationally efficient, require no training and use little domain knowledge, they can easily be added as a feature to other multimedia analysis techniques.
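(Illustrative sketch of the BIC change test underlying the scale-space above: a change point is favoured when two Gaussians explain adjacent segments enough better than one to pay the model-size penalty. The segment sizes and penalty weight are toy assumptions.)

    import numpy as np

    def delta_bic(X, Y, lam=1.0):
        # BIC gain for modelling adjacent segments X and Y with two
        # full-covariance Gaussians instead of one; positive values
        # favour placing a change point between them.
        Z = np.vstack((X, Y))
        n, d = Z.shape

        def logdet(A):
            return np.linalg.slogdet(np.cov(A, rowvar=False))[1]

        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
        return (0.5 * n * logdet(Z) - 0.5 * len(X) * logdet(X)
                - 0.5 * len(Y) * logdet(Y) - lam * penalty)

    rng = np.random.default_rng(0)
    same = delta_bic(rng.normal(0, 1, (200, 4)), rng.normal(0, 1, (200, 4)))
    diff = delta_bic(rng.normal(0, 1, (200, 4)), rng.normal(3, 1, (200, 4)))
    print(same < 0 < diff)  # change detected only for the shifted segment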
15

Soldi, Giovanni. "Diarisation du locuteur en temps réel pour les objets intelligents." Electronic Thesis or Diss., Paris, ENST, 2016. http://www.theses.fr/2016ENST0061.

Abstract:
On-line speaker diarization aims to detect "who is speaking now" in a given audio stream. The majority of proposed on-line speaker diarization systems have focused on less challenging domains, such as broadcast news and plenary speeches, characterised by long speaker turns and low spontaneity. The first contribution of this thesis is the development of a completely unsupervised, adaptive on-line diarization system for challenging and highly spontaneous meeting data. Because of the high diarization error rates obtained, a semi-supervised approach to on-line diarization is proposed, whereby speaker models are seeded with a modest amount of manually labelled data and adapted by an efficient incremental maximum a posteriori (MAP) adaptation procedure. The resulting error rates may be low enough to support practical applications. The second part of the thesis addresses the problem of phone normalisation when dealing with short-duration speaker modelling. First, Phone Adaptive Training (PAT), a recently proposed technique, is assessed and optimised at the speaker modelling level and in the context of automatic speaker verification (ASV), and is then further developed towards a completely unsupervised system using automatically generated acoustic class transcriptions, whose number is controlled by regression tree analysis. PAT delivers significant improvements in the performance of a state-of-the-art iVector ASV system even when accurate phonetic transcriptions are not available.
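(Illustrative sketch of the incremental MAP mean adaptation named above: posterior-weighted statistics from new frames pull each Gaussian mean away from its prior, governed by a relevance factor. Diagonal covariances, tau = 16, and the toy data are assumptions.)

    import numpy as np

    def map_adapt_means(means, weights, covs_diag, X, tau=16.0):
        # One incremental MAP pass over new frames X for a diagonal-covariance GMM.
        n_comp = len(means)
        log_g = np.stack([
            -0.5 * np.sum((X - means[c]) ** 2 / covs_diag[c]
                          + np.log(2 * np.pi * covs_diag[c]), axis=1)
            + np.log(weights[c]) for c in range(n_comp)])
        gamma = np.exp(log_g - log_g.max(axis=0))  # frame responsibilities
        gamma /= gamma.sum(axis=0)
        new_means = means.copy()
        for c in range(n_comp):
            n_c = gamma[c].sum()
            if n_c > 0:
                x_bar = gamma[c] @ X / n_c          # posterior-weighted mean
                alpha = n_c / (n_c + tau)           # relevance-factor blend
                new_means[c] = alpha * x_bar + (1 - alpha) * means[c]
        return new_means

    # Toy usage: 50 new frames near component 1 pull its mean from 4.0 upward.
    rng = np.random.default_rng(0)
    means = np.zeros((2, 3)); means[1] += 4.0
    X = rng.normal(4.5, 1.0, (50, 3))
    print(map_adapt_means(means, np.array([0.5, 0.5]), np.ones((2, 3)), X)[1])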
16

Unnikrishnan, Harikrishnan. "AUDIO SCENE SEGMENTATION USING A MICROPHONE ARRAY AND AUDITORY FEATURES." UKnowledge, 2010. http://uknowledge.uky.edu/gradschool_theses/622.

Abstract:
An auditory stream denotes the abstract effect a source creates in the mind of the listener. An auditory scene consists of many streams, which the listener uses to analyze and understand the environment. Computer analyses that attempt to mimic human analysis of a scene must first perform Audio Scene Segmentation (ASS). ASS finds applications in surveillance, automatic speech recognition and human-computer interfaces. Microphone arrays can be employed for extracting streams corresponding to spatially separated sources. However, when a source moves to a new location during a period of silence, such a system loses track of the source. This results in multiple spatially localized streams for the same source. This thesis proposes to identify local streams associated with the same source using auditory features extracted from the beamformed signal. ASS using the spatial cues is first performed. Then auditory features are extracted and segments are linked together based on the similarity of the feature vector. An experiment was carried out with two simultaneous speakers. A classifier is used to classify the localized streams as belonging to one speaker or the other. The best performance was achieved when pitch appended with Gammatone Frequency Cepstral Coefficients (GFCC) was used as the feature vector. An accuracy of 96.2% was achieved.
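(Illustrative sketch of the linking step described above: spatially localized segments are merged into one stream when their auditory feature vectors, e.g. mean pitch appended with GFCCs, are similar enough. The greedy rule, cosine measure, and threshold are assumptions.)

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    def link_streams(segments, threshold=0.9):
        # Greedy linking: each localized segment joins the first existing
        # stream whose seed feature vector it resembles, else starts one.
        streams = []  # list of (seed feature vector, [segment indices])
        for idx, feat in enumerate(segments):
            for seed, members in streams:
                if cosine(seed, feat) > threshold:
                    members.append(idx)
                    break
            else:
                streams.append((feat, [idx]))
        return [members for _, members in streams]

    # Toy usage: four segments, two underlying speakers.
    segs = [np.array([1.0, 0.1]), np.array([0.1, 1.0]),
            np.array([0.9, 0.2]), np.array([0.2, 1.1])]
    print(link_streams(segs))  # [[0, 2], [1, 3]]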
17

Miller, William H. "Analog Implementation of DVM and Farrow Filter Based Beamforming Algorithms for Audio Frequencies." University of Akron / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=akron1531951902410037.

18

Leis, John W. "Spectral coding methods for speech compression and speaker identification." Thesis, Queensland University of Technology, 1998. https://eprints.qut.edu.au/36062/7/36062_Digitised_Thesis.pdf.

Abstract:
This thesis investigates aspects of encoding the speech spectrum at low bit rates, with extensions to the effect of such coding on automatic speaker identification. Vector quantization (VQ) is a technique for jointly quantizing a block of samples at once, in order to reduce the bit rate of a coding system. The major drawback in using VQ is the complexity of the encoder. Recent research has indicated the potential applicability of the VQ method to speech when product code vector quantization (PCVQ) techniques are utilized. The focus of this research is the efficient representation, calculation and utilization of the speech model as stored in the PCVQ codebook. In this thesis, several VQ approaches are evaluated, and the efficacy of two training algorithms is compared experimentally. It is then shown that these product-code vector quantization algorithms may be augmented with lossless compression algorithms, thus yielding an improved overall compression rate. An approach using a statistical model for the vector codebook indices for subsequent lossless compression is introduced. This coupling of lossy and lossless compression enables further compression gain. It is demonstrated that this approach is able to reduce the bit rate requirement from the current 24 bits per 20 millisecond frame to below 20, using a standard spectral distortion metric for comparison. Several fast-search VQ methods for use in speech spectrum coding have been evaluated. The usefulness of fast-search algorithms is highly dependent upon the source characteristics and, although previous research has been undertaken for coding of images using VQ codebooks trained with the source samples directly, the product-code structured codebooks for speech spectrum quantization place new constraints on the search methodology. The second major focus of the research is an investigation of the effect of low-rate spectral compression methods on the task of automatic speaker identification. The motivation for this aspect of the research arose from a need to simultaneously preserve the speech quality and intelligibility and to provide for machine-based automatic speaker recognition using the compressed speech. This is important because there are several emerging applications of speaker identification where compressed speech is involved. Examples include mobile communications, where the speech has been highly compressed, or where a database of speech material has been assembled and stored in compressed form. Although these two application areas have the same objective - that of maximizing the identification rate - the starting points are quite different. On the one hand, the speech material used for training the identification algorithm may or may not be available in compressed form. On the other hand, the new test material on which identification is to be based may only be available in compressed form. Using the spectral parameters which have been stored in compressed form, two main classes of speaker identification algorithm are examined. Some studies have been conducted in the past on bandwidth-limited speaker identification, but the use of short-term spectral compression deserves separate investigation. Combining the major aspects of the research, some important design guidelines for the construction of an identification model based on the use of compressed speech are put forward.
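(Illustrative sketch: full-search vector quantization plus the classic LBG split-and-refine training recipe, a standard baseline for the codebook training the abstract discusses; the split factor and toy data are assumptions.)

    import numpy as np

    def vq_encode(vectors, codebook):
        # Full-search VQ: index of the nearest codeword (squared Euclidean
        # distortion) for every input vector.
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    def lbg_train(data, n_codewords=4, n_iter=10):
        # LBG: start from the global mean, then repeatedly split and refine.
        codebook = data.mean(axis=0, keepdims=True)
        while len(codebook) < n_codewords:
            codebook = np.vstack((codebook * 0.999, codebook * 1.001))  # split
            for _ in range(n_iter):                                     # refine
                idx = vq_encode(data, codebook)
                for c in range(len(codebook)):
                    if np.any(idx == c):
                        codebook[c] = data[idx == c].mean(axis=0)
        return codebook

    # Toy usage: four 2-D clusters; codewords settle near -1, 0, 1 and 2.
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(m, 0.1, (100, 2)) for m in (-1, 0, 1, 2)])
    print(np.round(np.sort(lbg_train(data)[:, 0]), 1))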
19

Vajaria, Himanshu. "Diarization, localization and indexing of meeting archives." [Tampa, Fla] : University of South Florida, 2008. http://purl.fcla.edu/usf/dc/et/SFE0002581.

20

Zhang, Xianxian. "Robust speech processing based on microphone array, audio-visual, and frame selection for in-vehicle speech recognition and in-set speaker recognition." Diss., Connect to online resource, 2005. http://wwwlib.umi.com/cr/colorado/fullcit?p3190350.

21

Brangers, Kirstin M. "Perceptual Ruler for Quantifying Speech Intelligibility in Cocktail Party Scenarios." UKnowledge, 2013. http://uknowledge.uky.edu/ece_etds/31.

Abstract:
Systems designed to enhance the intelligibility of speech in noise are difficult to evaluate quantitatively because intelligibility is subjective and often requires feedback from large populations for consistent evaluations. Attempts to quantify the evaluation have included related measures such as the Speech Intelligibility Index (SII). These require separating speech and noise signals, which precludes their use on experimental recordings. This thesis develops a procedure using an Intelligibility Ruler (IR) for efficiently quantifying intelligibility. A calibrated Mean Opinion Score (MOS) method is also implemented in order to compare repeatability over a population of 24 subjective listeners. Results showed that subjects using the IR consistently estimated SII values of the test samples with an average standard deviation of 0.0867 between subjects on a scale from zero to one, and R2 = 0.9421. After a calibration procedure with a subset of subjects, the MOS method yielded similar results, with an average standard deviation of 0.07620 and R2 = 0.9181. While the results suggest good repeatability of the IR method over a broad range of subjects, the calibrated MOS method is capable of producing results more closely related to actual SII values and is a simpler procedure for human subjects.
22

Barkmeier, Julie Marie. "Intelligibility of dysarthric speakers: audio-only and audio-visual presentations." Thesis, University of Iowa, 1988. https://ir.uiowa.edu/etd/5698.

23

Larcher, Anthony. "Modèles acoustiques à structure temporelle renforcée pour la vérification du locuteur embarquée." Phd thesis, Université d'Avignon, 2009. http://tel.archives-ouvertes.fr/tel-00453645.

Abstract:
Automatic speaker verification is a classification task that aims to confirm or reject a claimed identity from the characteristics specific to a person's voice. Embedding speaker verification systems on portable devices imposes two kinds of constraints: hardware constraints, which strongly limit the available storage memory and computing power, and ergonomic constraints, which limit the duration and number of enrolment sessions as well as the duration of test sessions. State-of-the-art speaker recognition approaches do not exploit the temporal structure of the speech signal. We propose to use this information, through user-chosen personal passwords, to compensate for the scarcity of training and test data. A first study allowed us to evaluate the influence of text dependency on the state-of-the-art GMM/UBM (Gaussian Mixture Model / Universal Background Model) approach. We showed that a lexical constraint imposed on this approach, normally used for text-independent speaker recognition, reduces the error rate by about 30% (relative) when impostors do not know the clients' passwords. In this document, we present a specific acoustic architecture that exploits, at low cost, the temporal structure of client-chosen passwords. This three-level hierarchical architecture allows a progressive specialisation of the acoustic models. A generic model represents the whole acoustic space. Each speaker is represented by a Gaussian mixture derived from the generic world model of the first level. The third level of our architecture consists of semi-continuous hidden Markov models (SCHMMs), which model the temporal structure of the passwords while integrating the speaker-specific information modelled by the second-level GMM. Each state of a password's SCHMM is estimated, relative to the speaker's text-independent model, by adapting the weight parameters of the Gaussian distributions of that GMM. Taking the temporal structure of the passwords into account reduces the equal error rate by 60% when impostors pronounce an utterance different from the clients' passwords. To reinforce the modelling of the temporal structure of the passwords, we propose to integrate information from an external process within our hierarchical acoustic architecture. Strong synchronisation points, extracted from the speech signal, are used to constrain the training of the password models during enrolment. Synchronisation points obtained by the same process during the test phase constrain the Viterbi decoding, so that the structure of the test sequence matches that of the model under test. This approach was evaluated on the MyIdea audio-video database using information derived from a phonetic alignment. We showed that adding a synchronisation constraint within our acoustic approach degrades impostor scores and thereby reduces the equal error rate by 20% (relative) when impostors do not know the clients' passwords, while maintaining performance equivalent to state-of-the-art approaches when impostors do know the passwords. Using the video modality seems difficult to reconcile with the resource limitations imposed by the embedded context. We proposed a simple processing of the video stream that respects these constraints but did not succeed in extracting relevant information. A supplementary modality would nevertheless make it possible to use the various structural cues to defeat potential play-back impostures. This work thus opens many perspectives on the use of structural information in speaker verification and on video-assisted speaker recognition approaches.
24

Kilic, V. "Audio-visual tracking of multiple moving speakers." Thesis, University of Surrey, 2016. http://epubs.surrey.ac.uk/809761/.

Abstract:
In this thesis, a novel approach is proposed for multi-speaker tracking by integrating audio and visual data in a particle filtering (PF) framework. This approach is further improved for adaptive estimation of two critical parameters of the PF, namely the number of particles and the noise variance, based on the tracking error and the area occupied by the particles in the image. Here, it is assumed that the number of speakers is known and constant during tracking. To relax this assumption, random finite set (RFS) theory is used due to its ability to deal with the problem of tracking a variable number of speakers. However, the computational complexity increases exponentially with the number of speakers, so the probability hypothesis density (PHD) filter, which is a first-order approximation of the RFS, is applied with a sequential Monte Carlo (SMC), i.e. particle filter, implementation, since the computational complexity then increases only linearly with the number of speakers. The SMC-PHD filter in visual tracking uses three types of particles (i.e. surviving, spawned and born particles) to model the state of the speakers and to estimate the number of speakers. We propose to use audio data in the distribution of these particles to improve the visual SMC-PHD filter in terms of estimation accuracy and computational efficiency. The tracking accuracy of the proposed algorithm is further improved by using a modified mean-shift algorithm, and the extra computational complexity introduced by mean-shift is controlled with a sparse sampling technique. For quantitative evaluation, both audio and video sequences are required, together with the calibration information of the cameras and microphone arrays (circular arrays). To this end, the AV16.3 dataset is used to demonstrate the performance of the proposed methods in a variety of scenarios such as occlusion and rapid movements of the speakers.
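(Illustrative sketch of one bootstrap particle-filter iteration, the SMC machinery that the SMC-PHD filter above builds on; the PHD filter's surviving/spawned/born particle sets and weight update are more involved and are not shown. The random-walk dynamics, Gaussian likelihood, and static target are toy assumptions.)

    import numpy as np

    rng = np.random.default_rng(0)

    def pf_step(particles, weights, observation, motion_std=0.05, obs_std=0.1):
        # Predict with random-walk dynamics, reweight by the observation
        # likelihood, then resample to equal weights.
        particles = particles + rng.normal(0, motion_std, particles.shape)
        lik = np.exp(-0.5 * np.sum((particles - observation) ** 2, axis=1)
                     / obs_std ** 2)
        weights = weights * lik
        weights /= weights.sum()
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        return particles[idx], np.full(len(particles), 1.0 / len(particles))

    # Toy usage: 500 particles converge on a static 2-D position (0.3, 0.7).
    particles = rng.uniform(0, 1, (500, 2))
    weights = np.full(500, 1.0 / 500)
    for _ in range(20):
        particles, weights = pf_step(particles, weights, np.array([0.3, 0.7]))
    print(particles.mean(axis=0))  # close to (0.3, 0.7)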
25

Sturtzer, Eric. "Modélisation en vue de l'intégration d'un système audio de micro puissance comprenant un haut-parleur MEMS et son amplificateur." Phd thesis, INSA de Lyon, 2013. http://tel.archives-ouvertes.fr/tel-00940463.

Abstract:
This thesis proposes the optimisation of the complete sound reproduction chain in an embedded system. The first line of research introduces the general notions of embedded audio systems needed to understand the research context. The conversion principle of the whole chain is presented in order to understand the different stages that make up an audio system. A state of the art reviews the loudspeaker types and the associated electronics most commonly used in embedded systems. The second line of research proposes a global approach: an electrical model of the loudspeaker (taking an optimal number of parameters into account) allows an electronics designer to better understand the loudspeaker's nonlinear phenomena, which are the main cause of audio quality degradation. The result is a viable model that makes it possible to evaluate the intrinsic nonlinearity of the loudspeaker and to identify its cause. Simulation results show that the total harmonic distortion intrinsic to the loudspeaker is higher than that generated by an amplifier. The third line of research highlights the impact of transducer control. The objective is to determine whether, from an audio quality point of view, there is a difference between voltage-mode and current-mode control of an electrodynamic micro-loudspeaker. For this type of transducer and at this level of modelling, voltage control is equivalent to directly controlling the loudspeaker in current. Nevertheless, an alternative solution (which does not further degrade the audio quality of the signal) could be to control the micro-loudspeaker in current. The fourth line of research proposes adapting the specifications of audio amplifiers to the performance of micro-loudspeakers. A global (energy) study shows that one of the key factors for improving energy efficiency on the amplifier side is minimising the static current consumption while maximising the efficiency at nominal power. For the other specifications, the global approach is based on studying the impact of each amplifier specification on the acoustic side; this allowed us, for example, to relax the noise constraint by 300%. The last line of research focuses on a new type of transducer: a micro-loudspeaker in MEMS technology. Its electroacoustic characterisation shows improved audio quality (less than 0.016% total harmonic distortion) and a useful frequency range from 200 Hz to 20 kHz, at an average sound level of 80 dB (at 10 cm). The combination of all these efforts represents a real technological leap. Finally, the global optimisation approach for the electrical part was applied to the MEMS performance in the last section, which in particular relaxed the noise constraint by 500%.
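(Illustrative sketch: the total-harmonic-distortion figure quoted above can be estimated from an FFT as the RMS of the harmonic peaks over the fundamental peak. The window, peak-search band, and harmonic count are assumptions.)

    import numpy as np

    def thd_percent(signal, fs, f0, n_harmonics=5):
        # THD: RMS of harmonics 2..N over the fundamental amplitude, in percent.
        spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
        freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)

        def peak(f):
            band = (freqs > f - 20) & (freqs < f + 20)
            return spec[band].max()

        fund = peak(f0)
        harm = np.sqrt(sum(peak(k * f0) ** 2 for k in range(2, n_harmonics + 2)))
        return 100.0 * harm / fund

    # Toy usage: 1 kHz tone with a -80 dB second harmonic -> THD about 0.01 %.
    fs, f0 = 48000, 1000.0
    t = np.arange(48000) / fs
    tone = np.sin(2 * np.pi * f0 * t) + 1e-4 * np.sin(2 * np.pi * 2 * f0 * t)
    print(round(thd_percent(tone, fs, f0), 3))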
26

Collins, Christopher Michael. "Development of a Virtual Acoustic Showroom for Simulating Listening Environments and Audio Speakers." Thesis, Virginia Tech, 2004. http://hdl.handle.net/10919/9965.

Abstract:
Virtual acoustic techniques can be used to create virtual listening environments for multiple purposes. Using multi-speaker reproduction, a physical environment can take on the acoustical appearance of another environment. Implementation of this environment auralization could change the way customers evaluate speakers in a retail store. The objective of this research is to develop a virtual acoustic showroom using a multi-speaker system. The two main components of the virtual acoustic showroom are simulating living environments using the image source method, and simulating speaker responses using inverse filtering. The image source method is used to simulate realistic living environments by filtering the environment impulse response with frequency-dependent absorption coefficients of typical building materials. Psychoacoustic tests show that listeners can match virtual acoustic cues with appropriate virtual visual cues. Inverse filtering is used to "replace" the frequency response function of one speaker with another, allowing a single set of speakers to represent any number of other speakers. Psychoacoustic tests show that listeners could not distinguish between the original speaker and the reference speaker that was mimicking it. The two components of this system are shown to be accurate both empirically and psychologically.
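(Illustrative sketch of the inverse-filtering idea described above: replace one speaker's measured response with a reference response by dividing spectra, with regularisation so deep nulls don't explode. The impulse responses and regularisation constant are toy assumptions.)

    import numpy as np

    def mimic_filter(h_measured, h_target, n_fft=1024, reg=1e-3):
        # Frequency-domain inverse filter: H_target / H_measured with
        # Tikhonov regularisation.
        Hm = np.fft.rfft(h_measured, n_fft)
        Ht = np.fft.rfft(h_target, n_fft)
        Hc = Ht * np.conj(Hm) / (np.abs(Hm) ** 2 + reg)
        return np.fft.irfft(Hc, n_fft)

    # Toy usage: make an 'echoy' speaker response mimic a flat reference one.
    h_meas = np.zeros(64); h_meas[0], h_meas[10] = 1.0, 0.5
    h_ref = np.zeros(64); h_ref[0] = 1.0
    g = mimic_filter(h_meas, h_ref)
    equalised = np.convolve(h_meas, g)[:64]
    print(np.round(equalised[:3], 2))  # close to the reference impulse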
27

Syncox, David. "The effects of audio-taped feedback on ESL graduate student writing." Thesis, McGill University, 2003. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=19391.

Abstract:
This thesis examined the effects of audio-taped feedback on ESL graduate student writing. Thirty-two students participated in the study over one semester. A triangulated approach to data collection was used by gathering and analyzing information from three principal sources: (a) students' written texts, (b) audio-taped feedback from the instructor, and (c) interviews with the participants. The research revealed that single and multiple feedback moves, in the form of models and prompts, were used by the instructor with similar frequency. Results also indicated that students benefited in all cases from audio-taped feedback. Overall, findings suggest that audio-taped feedback is very effective at helping students to produce an improved draft. The study includes discussion of the pedagogical implications of audio-taped feedback. Limitations to the study are discussed and conclusions are drawn based on the findings.
28

Eiderbo, Ian. "How does binaural audio mixed for headphones translate to loudspeaker setups in terms of listener preferences?" Thesis, Luleå tekniska universitet, Institutionen för ekonomi, teknik, konst och samhälle, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-85732.

Abstract:
While most of today's music listening is done through headphones, mixing techniques using binaural audio are still not widely implemented in modern music production. This study aims to help inform mixing engineers about the applicability of binaural processing for music production, with a specific focus on how binaurally processed audio translates to loudspeakers in terms of listener preference. In this study, a listening test was performed in which binaurally processed mixes were given preference ratings relative to a reference mix. Each listener completed the test twice, once using headphones and once using loudspeakers. The test results for the two playback systems were then compared. Only one of 12 mixes showed a significant difference in preference ratings with playback system as the factor, but the reported ratings showed large disagreement among the 13 test subjects. The results of the study are inconclusive; however, they do not suggest that the binaural processing used for the stimuli suffers in terms of listener preference when played back over loudspeakers.
29

Scaini, Davide. "Wavelet-based spatial audio framework : from ambisonics to wavelets: a novel approach to spatial audio." Doctoral thesis, Universitat Pompeu Fabra, 2019. http://hdl.handle.net/10803/668214.

Abstract:
Ambisonics is a complete theory for spatial audio whose building blocks are the spherical harmonics. Some of the drawbacks of low order Ambisonics, like poor source directivity and small sweet-spot, are directly related to the properties of spherical harmonics. In this thesis we illustrate a novel spatial audio framework similar in spirit to Ambisonics that replaces the spherical harmonics by an alternative set of functions with compact support: the spherical wavelets. We develop a complete audio chain from encoding to decoding, using discrete spherical wavelets built on a multiresolution mesh. We show how the wavelet family and the decoding matrices to loudspeakers can be generated via numerical optimization. In particular, we present a decoding algorithm optimizing acoustic and psychoacoustic parameters that can generate decoding matrices to irregular layouts for both Ambisonics and the new wavelet format. This audio workflow is directly compared with Ambisonics.
APA, Harvard, Vancouver, ISO, and other styles
30

Li, Ying. "Audio-visual training effect on L2 perception and production of English /θ/-/s/ and /ð/-/z/ by Mandarin speakers." Thesis, University of Newcastle upon Tyne, 2015. http://hdl.handle.net/10443/3052.

Full text
Abstract:
Research on L2 speech perception and production indicates that adult language learners are able to acquire L2 speech sounds that they initially have difficulty with (Best, 1994). Moreover, use of the audiovisual modality, which provides language learners with articulatory information for speech sounds, has been shown to be effective in L2 speech perception training (Hazan et al., 2005). Since auditory and visual skills are integrated with each other in speech perception, audiovisual perception training may enhance language learners’ auditory perception of L2 speech sounds (Bernstein, Auer Jr, Ebehardt, and Jiang, 2013). However, little research has been conducted on L1 Mandarin learners of English. Based on these hypotheses, this study investigated whether audiovisual perception training can improve learners’ auditory perception and production of L2 speech sounds. A pilot study was performed on 42 L1-Mandarin learners of English (L1 dialect: Chongqing Mandarin (CQd)) in which their perception and production of English consonants were tested. According to the results, 29 of the subjects had difficulty in the perception and production of /θ/-/s/ and /ð/-/z/. These 29 subjects were therefore selected as the experimental group to attend a 9-session audiovisual perception training programme, in which identification tasks for the minimal pairs /θ/-/s/ and /ð/-/z/ were conducted. The subjects’ perception and production performance was tested before, during and at the end of the training with an AXB task and a “read aloud” task. In view of the threat to internal validity arising from a repeated testing effect, a control group was tested with the same AXB task and intervals as the experimental group. The results show that the experimental group’s perception and production accuracy improved substantially during and by the end of the training programme. Indeed, whilst the control group also showed perception improvement across the pre-test and post-test, their degree of improvement was significantly lower than that of the experimental group. These results therefore confirm the value of the audiovisual modality in L2 speech perception training.
APA, Harvard, Vancouver, ISO, and other styles
31

Bern, Charlotte, and Linda Liljeström. "“Request to speak, button” : Accessibility for visually impaired VoiceOver users on social live audio chat platforms." Thesis, Linnéuniversitetet, Institutionen för informatik (IK), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105457.

Full text
Abstract:
Social media has become an inevitable part of everyday life. With the focus to a large extent being on image and video sharing, accessibility for the visually impaired is not always granted. However, new types of social platforms with live audio chats at their core have shown potential to stand out as particularly inclusive of visually impaired users. Taking its standpoint in the Technology-to-Performance Chain model, the study aimed to create a better understanding of the subjective user experience of visually impaired users of social live audio chat platforms by identifying what influences accessibility, especially when it comes to taking part in and creating audio content. The topic was approached in the form of a case study. Qualitative data collection was conducted with a combination of observations and product assessment, expert interviews as well as user interviews. The results suggest audio-based platforms have the potential to fit visually impaired users well. A limited scope of the platform, having voice-based communication at the core and a limited number of visual elements all influence accessibility. Sufficient, adjustable VoiceOver support adds to the accessibility of the platform. Furthermore, the results indicate that being aware of user behaviour and the inaccessibility it might lead to is important. In conclusion, applying a UX perspective is deemed important, as the results indicate it is often the intangible, subjective user perspective that can highlight what influences accessibility.
APA, Harvard, Vancouver, ISO, and other styles
32

Zhang, Xiangmei. "Authentic materials in English as a Second Language conversation instruction." CSUSB ScholarWorks, 2004. https://scholarworks.lib.csusb.edu/etd-project/2526.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Hussin, Nora Anniesha Binte. "Interaction from an activity theoretical perspective: comparing learner discourse of language face-to-face, in chat and in audio conferencing in second language learning." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2009. http://hub.hku.hk/bib/B41758146.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Aoyama, Kazumasa. "Using A Diglot Reader to Teach Kanji: The Effects of Audio and Romaji on the Acquisition of Kanji Vocabulary." Diss., CLICK HERE for online access, 2005. http://contentdm.lib.byu.edu/ETD/image/etd888.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Bodenstein, Eckhard W. "Lernervoraussetzungen von Deutschstudenten an der Universitat Zululand : eine Untersuchung auf der Grundlage von Bildtexten." Thesis, Stellenbosch : Stellenbosch University, 1998. http://hdl.handle.net/10019.1/50985.

Full text
Abstract:
Thesis (MA) -- Stellenbosch University, 1998.
During my work as a lecturer in "German as a foreign language" at the University of Zululand I have experienced that African students often understand German texts in a different way than I, coming from a European background, would have expected. According to the research on text reception, differences in understanding texts are the result of different reader characteristics, of which the socio-cultural background forms an important component. This thesis examines the socio-cultural background of Zulu students and aims to show how it influences their understanding of German texts. The necessary data is obtained by way of a comparative empirical investigation which is enhanced by personal observations made while teaching German to African learners. The investigation is based on a German advertisement. The control groups consist of South African students at the Universities of Natal/Durban and Stellenbosch as well as students in Germany at the University of Kassel. The investigation is concluded by a discussion of the implications that the socio-cultural background of Zulu students can have on the teaching of "German as a foreign language" and on intercultural communication.
APA, Harvard, Vancouver, ISO, and other styles
36

Murray, Garold Linwood. "Bodies in cyberspace : language learning in a simulated environment." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp02/NQ27209.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Thompson, Scott Alan. "A Comparison of the Effects of Different Video Imagery Upon Adult ESL Students' Comprehension of a Video Narrative." PDXScholar, 1994. https://pdxscholar.library.pdx.edu/open_access_etds/4845.

Full text
Abstract:
This study was meant to provide empirical evidence to support or challenge the assumption that a nonfiction video narrative will be better comprehended by students of ESL if it includes a variety of relevant visual information compared to only seeing a single speaker or "talking head" reciting a narration. The overarching goal of this study was to give teachers of ESL greater knowledge and confidence in using video materials to develop the listening skills of their students. It compared two video tapes which contained the identical soundtrack but different visual information. The first tape (also called the "lecture tape") showed a single speaker, standing behind a lectern, giving a speech about Costa Rica. The second video (also called the "documentary tape") contained the identical soundtrack of tape one, but included documentary video footage actually filmed in Costa Rica which complemented the narration. A questionnaire of 45 true/false questions was created based on facts given in the narration. Thirty-nine advanced and fifty-five intermediate university ESL students took part in the study. Approximately half of each group viewed the lecture tape while the other half watched the documentary tape. All students answered the 45-item questionnaire while viewing their respective video tapes. A thorough item analysis was then conducted with the initial raw scores of all 94 students, resulting in fifteen questions being omitted from the final analysis. Based on a revised 30-item questionnaire, the scores of the video and documentary groups were compared within each proficiency level. The hypothesis of the study was that the documentary tape would significantly improve listening comprehension at the intermediate level but that no significant difference would be found between the advanced lecture and documentary groups. In other words, it was predicted that the documentary video would have an interaction effect depending upon proficiency level. However, the results of a 2-way ANOVA did not support the hypothesis. In addition to the ANOVA, a series of t-tests also found no significant difference between the mean scores of the documentary and lecture groups at either the intermediate or the advanced levels. This study was intended to be a beginning to research which may eventually reveal a "taxonomy" of video images, from those which enhance listening comprehension the most to those that aid it the least. It contained limitations in the testing procedures which caused the results to be inconclusive. A variety of testing methods was suggested in order to continue research which may reveal such a "video" taxonomy. Given the plethora of video materials that ESL teachers can purchase, record, or create themselves, empirical research is needed to help guide the choices that educators make in choosing video material for their students which will provide meaningful linguistic input.
APA, Harvard, Vancouver, ISO, and other styles
38

Zappen-Thomson, Marianne 1956. "Liedertexte im fremdkulturellen Literaturunterricht : eine textwissenschaftliche und -didaktische Untersuchung." Thesis, Stellenbosch : Stellenbosch University, 1985. http://hdl.handle.net/10019.1/64968.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Tyson, Marian. "The effect of media on the listening comprehension scores of intermediate ESL students." PDXScholar, 1989. https://pdxscholar.library.pdx.edu/open_access_etds/3961.

Full text
Abstract:
The use of videotapes has become widespread in ESL classes in recent years. The decline in cost of tapes and VCR equipment has assisted in the spread of this technology. These tapes are often used in listening comprehension classes and may replace or supplement the use of audiotapes. However, research has not established that the addition of the visual element, especially in the movie- or TV-type context of many videos, is an advantage to the language learner. A total of seventy-six students participated in a listening comprehension recall exercise. Thirty-nine students viewed a videotape segment, and the remaining thirty-seven students listened to the audio portion of the same segment. Each group viewed or listened to the tape two times. Then the groups were given twenty minutes to write a recall. Each paper was scored for total idea units recalled, macropropositions, elaborations, and distortions.
APA, Harvard, Vancouver, ISO, and other styles
40

Sundberg, Daniel. "HANDSFREE-ENHET FÖR MOBIL TRYGGHETSTELEFON." Thesis, Örebro University, Örebro University, Department of Technology, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-7411.

Full text
Abstract:

Cnior Mobile AB in Lindesberg is developing a mobile safety telephone for the elderly. This thesis project concerns the design of a hands-free unit for it. The hands-free unit is to be integrated into the alarm button, which the user wears around the wrist, and communicates with the telephone via Bluetooth radio. The project includes selecting a suitable loudspeaker and microphone, finding solutions for dirt and water resistance, and solving problems with echo and background noise.

A loudspeaker was found that met the requirements for dirt and water resistance while also having an excellent frequency response for reproducing clear speech. Water drainage from the loudspeaker was solved by emitting a sine sweep from the loudspeaker each time a call is set up; the sound pressure thereby pushes the water out of the wrist button's cavity. Different designs of the sound holes in the wrist button's casing were tested. The best solution for water drainage was to use seven round holes of 1.3 mm diameter. A sound pressure measurement confirmed that the sound pressure did not suffer from this design of the sound holes.

Echo cancellation and background noise suppression are handled by the GSM module in the safety telephone. The echo canceller's manual describes how its 24 parameters can be adjusted to suit different applications. Only a minor change to the recommended parameter values was needed for echo cancellation and background noise suppression to work satisfactorily.

Since the microphones' data sheets showed very similar characteristics, the choice of microphone was left to the company, as it may be wise to let price decide.
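The report does not state the sweep parameters; the short Python sketch below generates such a drainage sweep with an assumed frequency range, duration and sample rate, purely as an illustration of the technique.

import numpy as np
from scipy.signal import chirp

fs = 16000                                   # sample rate (Hz), assumed
t = np.linspace(0, 1.0, fs, endpoint=False)  # 1-second sweep, assumed
sweep = chirp(t, f0=200, t1=1.0, f1=4000, method='logarithmic')
# Played through the wrist unit's speaker before each call, the resulting
# sound pressure pushes water out of the enclosure cavity.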

APA, Harvard, Vancouver, ISO, and other styles
41

Kůst, Martin. "Konstrukční návrh moderního těla reproduktoru s využitím nových technologií." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2020. http://www.nusl.cz/ntk/nusl-432589.

Full text
Abstract:
This diploma thesis deals with the analysis and design optimization of a speaker cabinet (enclosure) produced by the additive technology of 3D sand printing. The introductory part is devoted to theory, both in the field of noise and vibration and the methods of their calculation, and in the theory of speakers and speaker systems. Numerical and experimental modal analyses are performed and compared to determine the mechanical properties of the new material, including material damping. This is followed by experimental and numerical harmonic analysis, with output in the form of a numerical model describing the behaviour of the structure under excitation. The data are compared with a modal analysis of the internal acoustic space, and the critical mode shapes and their frequencies are determined. At the end of the work, design modifications are proposed to increase the rigidity of the enclosure; their influence is evaluated on the numerical models created.
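The thesis's modal analyses run on a full finite-element model of the enclosure; as a reminder of the underlying computation only, here is a generic sketch of the undamped modal eigenproblem K·φ = ω²·M·φ on a toy three-degree-of-freedom system. The matrices are illustrative, not the thesis's model.

import numpy as np
from scipy.linalg import eigh

K = np.array([[ 2., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]]) * 1e6   # stiffness matrix (N/m), illustrative
M = np.diag([1.0, 1.0, 0.5])            # mass matrix (kg), illustrative

eigvals, modes = eigh(K, M)             # generalized symmetric eigenproblem
freqs_hz = np.sqrt(eigvals) / (2 * np.pi)
print(freqs_hz.round(1))                # undamped natural frequencies in Hz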
APA, Harvard, Vancouver, ISO, and other styles
42

Anguiano, Arcelia. "Visual literacy in kindergarten: How can visual literacy be used as a tool to promote student learning in the kindergarten classroom?" CSUSB ScholarWorks, 2004. https://scholarworks.lib.csusb.edu/etd-project/2559.

Full text
Abstract:
The purpose of this project is to create a guide for planning effective use of visuals. Recent studies demonstrate the effectiveness of using visuals in classroom instruction, including the fact that English language learners benefit from using this tool.
APA, Harvard, Vancouver, ISO, and other styles
43

Shintani, Emi. "Teaching film to enhance brain compatible-learning in English-as-a-foreign language instruction." CSUSB ScholarWorks, 2003. https://scholarworks.lib.csusb.edu/etd-project/2403.

Full text
Abstract:
This project presents learning strategies within a theoretical framework for applying brain-based learning to EFL teaching. The model is based on the holistic principles of brain-based learning rather than the memorization of skills and knowledge previously employed in EFL instruction.
APA, Harvard, Vancouver, ISO, and other styles
44

Lin, Yi-Chun, and 林怡君. "Performance Improvement of Speaker Recognition for Clipped Audio Signals." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/ysqxnq.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
Graduate Institute of Computer and Communication
Academic year 100 (ROC calendar)
This thesis investigates the problem of speaker verification under the condition that the recorded speech signals are clipped due to quantization saturation. The clipping of audio signals is not only unpleasant for human listening but also detrimental to speaker verification systems. Although there are a number of restoration techniques for improving the auditory quality of clipped speech signals, it is found that the speaker characteristics of the restored clipped speech signals can be significantly changed; hence, the restoration techniques are of little help for speaker verification. To solve this problem, this study proposes improving speaker verification by pruning the clipped signals instead of restoring them. However, to avoid the length of a testing speech signal being shortened severely by the pruning, we develop methods for detecting and discarding the speech frames that contain harmful clipped signals while keeping the speech frames that contain acceptable clipped signals. Our experiments conducted on the NIST 2001 SRE database show that the proposed methods can reduce the equal error rate of speaker verification by around 10%.
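The thesis's tuned detection rules are not given in the abstract; the sketch below illustrates the pruning idea under assumed thresholds: frames whose proportion of near-full-scale samples exceeds a limit are treated as harmfully clipped and discarded before verification.

import numpy as np

def prune_clipped_frames(signal, frame_len=400, hop=160,
                         clip_level=0.99, max_clip_ratio=0.05):
    # Thresholds are illustrative assumptions, not the thesis's settings.
    kept = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        clip_ratio = np.mean(np.abs(frame) >= clip_level)
        if clip_ratio <= max_clip_ratio:   # acceptably clipped frame: keep it
            kept.append(frame)
    return np.array(kept)                  # frames passed on to the verifier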
APA, Harvard, Vancouver, ISO, and other styles
45

Chen, Wayne Long, and 陳偉恩. "The impact of smart speaker to the audio industry." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/779da4.

Full text
Abstract:
Master's thesis
National Chengchi University
International MBA Program (IMBA)
Academic year 107 (ROC calendar)
Smart speaker technology was introduced to the market in recent years. It has changed not only the behavior of consumers but also impacted the entire electronics world. This thesis specifically discusses how the smart speaker impacts the audio industry. From an introduction to the current audio industry to the rise of voice assistant technology, a thorough historical background is covered in order to give a holistic view of the industry. The thesis also introduces the players, including technology providers, audio brands, and system manufacturers. The relationships between these players, the problems they face, and the strategies they take to grow their businesses with each other form the main analysis of this thesis. As the evolution of smart voice technology continues, smart speakers are becoming the trend of the future. On the market side, the thesis covers current and future demand for smart speakers. How to fulfill such demand from the supplier's side, and what each player can continue to bring to the table, are also discussed. At the end, strategic recommendations are provided to all players in the smart speaker supply chain. The goal is to adapt to this abrupt change and to continue growing with this technological evolution.
APA, Harvard, Vancouver, ISO, and other styles
46

郭志梃. "DSP implementation of an audio/video system using panel speaker array." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/75484798920950449088.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Department of Mechanical Engineering
Academic year 90 (ROC calendar)
Applying array signal processing to make the sound radiate omnidirectionally is the main purpose of this thesis. A method of designing array coefficients to form an omnidirectional pattern was therefore employed. Further, the efficiency of the omnidirectional response was greatly improved through optimization, which seeks a set of array coefficients with optimal efficiency at the desired flatness of the sound pattern. Owing to the nonlinear relation between the array coefficients and the spectral flatness function, a genetic algorithm was employed because it searches effectively for the global maximum in a nonlinear space. A special case, the modified optimal omnidirectional case, occurs at low frequency; providing more efficiency at low frequency is the main purpose of that case.
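The combination of a directivity model and a genetic algorithm described above can be illustrated with a toy sketch: a GA searching for line-array weights whose far-field magnitude response is as flat as possible. The geometry, frequency and GA settings are assumptions for illustration, not the thesis's values, and the fitness below covers flatness only — the thesis additionally trades flatness against efficiency, which this toy omits.

import numpy as np

rng = np.random.default_rng(0)
N, SPACING, FREQ, C = 8, 0.05, 2000.0, 343.0   # elements, spacing (m), Hz, m/s
k = 2 * np.pi * FREQ / C
angles = np.linspace(-np.pi / 2, np.pi / 2, 181)
# Steering matrix: phase of each element toward each look angle
A = np.exp(1j * k * SPACING * np.outer(np.sin(angles), np.arange(N)))

def fitness(w):
    # Negative normalized variance of the pattern magnitude: flatter is fitter.
    pattern = np.abs(A @ w)
    return -np.var(pattern / pattern.mean())

pop = rng.uniform(-1, 1, (60, N))              # initial random population
for _ in range(200):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)][-30:]    # keep the flattest half
    cuts = rng.integers(1, N, size=30)
    children = np.array([np.r_[parents[i][:c], parents[(i + 1) % 30][c:]]
                         for i, c in enumerate(cuts)])  # one-point crossover
    children += rng.normal(0, 0.05, children.shape)     # Gaussian mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(w) for w in pop])]
print(best.round(3))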
APA, Harvard, Vancouver, ISO, and other styles
47

Tsao, Yan-cheng, and 曹晏誠. "A Speech Indexing System Using the Audio Segmentation and Speaker Clustering Schemes." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/03049870471314472707.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Chen, Chun-chi, and 陳俊吉. "A Design of Multi-session Text-independent Digital Camcorder Audio-Video Database for Speaker Recognition." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/vqrmxu.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Department of Electrical Engineering
Academic year 96 (ROC calendar)
In this thesis, an audio-video database for speaker recognition is constructed using a digital camcorder. Motion pictures of fifteen hundred speakers are recorded in three different sessions in the database. For each speaker, 20 still images per session are also derived from the video data. It is hoped that this database can provide an appropriate training and testing mechanism for person identification using both voice and face features.
APA, Harvard, Vancouver, ISO, and other styles
49

Wang, Long-Cheng, and 王龍政. "A Design of Multi-Session, Text Independent, TV-Recorded Audio-Video Database for Speaker Recognition." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/55168776720675963268.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Department of Electrical Engineering
Academic year 94 (ROC calendar)
A four-session, text-independent, TV-recorded audio-video database for speaker recognition is collected in this thesis. The speaker data is used to verify the applicability of a design methodology based on Mel-frequency cepstrum coefficients and Gaussian mixture models. Both single-session and multi-session problems are discussed. Experimental results indicate that a 90% correct rate can be achieved for a single-session 3000-speaker corpus, while only a 67% correct rate can be obtained for a two-session 800-speaker dataset. The performance of a multi-session speaker recognition system is greatly reduced due to variability in the recording environment, the speakers’ recording mood and other unknown factors. How to increase system performance under multi-session conditions remains a challenging task, and the establishment of such a large-scale multi-session speaker database plays an indispensable role in that task.
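The methodology named above (MFCC features with per-speaker Gaussian mixture models) is a standard recipe; here is a minimal sketch of it, assuming librosa and scikit-learn are available. File names, sample rate and model size are placeholders, not details from the thesis.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T   # (frames, 20)

# Enrollment: train one GMM per speaker (hypothetical file lists)
models = {}
for spk, files in {"spk1": ["spk1_a.wav"], "spk2": ["spk2_a.wav"]}.items():
    feats = np.vstack([mfcc_features(f) for f in files])
    models[spk] = GaussianMixture(n_components=32,
                                  covariance_type='diag').fit(feats)

# Identification: pick the model with the highest average log-likelihood
test = mfcc_features("unknown.wav")
scores = {spk: gmm.score(test) for spk, gmm in models.items()}
print(max(scores, key=scores.get))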
APA, Harvard, Vancouver, ISO, and other styles
50

Garud, Meera. "Cricket Inspired micro Speakers." Thesis, 2019. https://etd.iisc.ac.in/handle/2005/4585.

Full text
Abstract:
MEMS technology has ushered in a new era of miniaturized sensors and actuators. Many smart devices and systems are being developed using these sensors. Home automation is now a widespread reality owing to the development of affordable miniature devices. Wearables like smart watches and point-of-care medical devices have brought positive changes to the healthcare industry. At global scale, these sensors and actuators also find their place in tracking weather changes and in remote sensing applications. Many of these micro and nano systems communicate with humans using electroacoustic devices: they can take voice input, process it and give out voice instructions or suggestions using a system of microphones and audio speakers. However, when we compare the sizes of the various sensors and actuators with the size of an audio speaker, we see that audio speakers have not really achieved miniaturization. For example, in a standard smartphone a mini audio speaker is still 8 times larger in volume than a MEMS microphone. The audio speaker is still struggling to reach the micron size range, which limits the extent to which a smart device can shrink; a size reduction of the audio speaker, if possible, would lead to an overall size reduction of smart devices. We inspect the intricacies involved in miniaturizing an audio speaker and explore a possible solution that combines silicon MEMS technologies with nature-inspired design. In this work, we present two unconventional approaches to building electrostatically actuated thin audio speakers. First, we present a bio-mimetic micro-speaker inspired by the sound production mechanism of field crickets. This design uses peripheral actuation, unlike the full-area actuation of conventional electrostatic speaker designs or the electrodynamic designs in which the diaphragm is directly actuated by a magnet-coil partially covering its central area. Also, as in the cricket's sound production mechanism, the design takes advantage of resonance. Our speaker uses a silicon diaphragm created by etching patterned cavities into the handle layer of an SOI wafer, with a controlled lateral etch of the buried oxide creating closely spaced top and bottom annular electrodes for peripheral actuation. These electrodes drive the diaphragm with an audio signal close to its resonance. The open cavity provides an incredible advantage in raising the pull-in voltage enormously. While we demonstrate the working of these micro-speakers with several audio signals, development must continue with an array of such speakers to attain a flat response over the audible frequency range in order to make them commercially viable. The second novel design for wafer-thin loudspeakers is based on an accidental discovery made during testing of the cricket-inspired speakers. We demonstrate how two simple pieces of silicon, stacked loosely together and actuated with an appropriate electrical signal, produce sound. A theoretical explanation is given for this new design idea, whose foundation is electrostatic actuation. A few initial results for the thin speakers developed with this design are also presented.
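The abstract's remark that the open cavity raises the pull-in voltage can be read against the textbook parallel-plate result, given here for reference only (it is the standard electrostatic-actuator formula, not a derivation from the thesis). For a linear spring of stiffness k restoring a plate of electrode area A across an initial gap g_0, with free-space permittivity \varepsilon_0, pull-in occurs at one third of the gap:

\[
x_{\mathrm{pull\text{-}in}} = \frac{g_0}{3}, \qquad
V_{\mathrm{pull\text{-}in}} = \sqrt{\frac{8\,k\,g_0^{3}}{27\,\varepsilon_0 A}}
\]

Because V_pull-in grows with the cube of the gap under the square root, a design that effectively enlarges or opens the gap (as the open cavity does) pushes the pull-in voltage up sharply.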
APA, Harvard, Vancouver, ISO, and other styles