Dissertations / Theses on the topic 'Visual speech model'
Consult the top dissertations / theses for your research on the topic 'Visual speech model.'
Somasundaram, Arunachalam. "A facial animation model for expressive audio-visual speech." Columbus, Ohio: Ohio State University, 2006. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1148973645.
Van Wassenhove, Virginie. "Cortical dynamics of auditory-visual speech: a forward model of multisensory integration." College Park, Md.: University of Maryland, 2004. http://hdl.handle.net/1903/1871.
Thesis research directed by Neuroscience and Cognitive Science. Title from the title page of the PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
Cosker, Darren. "Animation of a hierarchical image based facial model and perceptual analysis of visual speech." Thesis, Cardiff University, 2005. http://orca.cf.ac.uk/56003/.
Theobald, Barry-John. "Visual speech synthesis using shape and appearance models." Thesis, University of East Anglia, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.396720.
Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing." Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/17689/3/David_Dean_Thesis.pdf.
Mukherjee, Niloy. "Spontaneous speech recognition using visual context-aware language models." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/62380.
Full textIncludes bibliographical references (p. 83-88).
The thesis presents a novel situationally aware multimodal spoken language system called Fuse that performs speech understanding for visual object selection. An experimental task was created in which people were asked to refer, using speech alone, to objects arranged on a table top. During training, Fuse acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions of individual objects. Fuse determines a set of visually salient words and phrases and associates them with a set of visual features. Given a new scene, Fuse uses the acquired knowledge to generate class-based language models conditioned on the objects present in the scene, as well as a spatial language model that predicts the occurrence of spatial terms conditioned on target and landmark objects. The speech recognizer in Fuse uses a weighted mixture of these language models to search for more likely interpretations of user speech in the context of the current scene. During decoding, the weights are updated using a visual attention model that redistributes attention over objects based on partially decoded utterances. The dynamic situationally aware language models enable Fuse to jointly infer the spoken utterances underlying speech signals and the identities of the target objects they refer to. In an evaluation of the system, visual situationally aware language modeling yields a significant decrease, more than 30%, in speech recognition and understanding error rates. The underlying ideas of situation-aware speech understanding developed in Fuse may be applied in numerous areas, including assistive and mobile human-machine interfaces.
by Niloy Mukherjee.
S.M.
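The Fuse abstract above describes a concrete mechanism: a weighted mixture of class-based language models, one per visible object, whose weights track a visual attention model that responds to partially decoded utterances. The Python sketch below is a hypothetical illustration of that idea, not the thesis code; the per-object unigram models, the 1e-6 smoothing floor, and the attention update rule are illustrative assumptions.

```python
# Hypothetical sketch of Fuse-style situationally aware language modeling.
# The object vocabularies, smoothing floor, and attention update below are
# illustrative assumptions, not the thesis implementation.

def mixture_prob(word, object_lms, attention):
    """P(word | scene): attention-weighted mixture of per-object language models."""
    return sum(attention[obj] * lm.get(word, 1e-6)
               for obj, lm in object_lms.items())

def update_attention(partial_utterance, object_lms, attention):
    """Redistribute attention over objects given a partially decoded utterance."""
    scores = {}
    for obj, lm in object_lms.items():
        score = attention[obj]  # start from the current attention weight
        for word in partial_utterance:
            score *= lm.get(word, 1e-6)  # how well this object's LM explains the words
        scores[obj] = score
    total = sum(scores.values())
    return {obj: s / total for obj, s in scores.items()}

# Toy scene: two objects, each with a small class-based unigram model.
object_lms = {
    "red_ball": {"red": 0.4, "ball": 0.4, "the": 0.2},
    "blue_cup": {"blue": 0.4, "cup": 0.4, "the": 0.2},
}
attention = {"red_ball": 0.5, "blue_cup": 0.5}  # uniform prior over objects

attention = update_attention(["the", "red"], object_lms, attention)
print(attention)                                    # attention shifts to red_ball
print(mixture_prob("ball", object_lms, attention))  # "ball" is now the likely noun
```

In the toy run, hearing "the red" shifts attention toward red_ball, which in turn raises the mixture probability of "ball": the scene-conditioned rescoring effect the abstract describes.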
Kalantari, Shahram. "Improving spoken term detection using complementary information." Thesis, Queensland University of Technology, 2015. https://eprints.qut.edu.au/90074/1/Shahram_Kalantari_Thesis.pdf.
Deena, Salil Prashant. "Visual speech synthesis by learning joint probabilistic models of audio and video." Thesis, University of Manchester, 2012. https://www.research.manchester.ac.uk/portal/en/theses/visual-speech-synthesis-by-learning-joint-probabilistic-models-of-audio-and-video(bdd1a78b-4957-469e-8be4-34e83e676c79).html.
Ahmad, Nasir. "A motion based approach for audio-visual automatic speech recognition." Thesis, Loughborough University, 2011. https://dspace.lboro.ac.uk/2134/8564.
Full textRoxburgh, Zoe. "Visualising articulation : real-time ultrasound visual biofeedback and visual articulatory models and their use in treating speech sound disorders associated with submucous cleft palate." Thesis, Queen Margaret University, 2018. https://eresearch.qmu.ac.uk/handle/20.500.12289/8899.
Navarathna, Rajitha Dharshana Bandara. "Robust recognition of human behaviour in challenging environments." Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/66235/1/Rajitha%20Dharshana%20Bandara_Navarathna_Thesis.pdf.
Fernández López, Adriana. "Learning of meaningful visual representations for continuous lip-reading." Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/671206.
Full textEn les darreres dècades, hi ha hagut un interès creixent en la descodificació de la parla utilitzant exclusivament senyals visuals, es a dir, imitant la capacitat humana de llegir els llavis, donant lloc a sistemes de lectura automàtica de llavis (ALR). No obstant això, se sap que l’accès a la parla a través del canal visual està subjecte a moltes limitacions en comparació amb el senyal acústic, es a dir, s’ha argumentat que els humans poden llegir al voltant del 30% de la informació dels llavis, i la resta es completa fent servir el context. Així, un dels principals reptes de l’ALR resideix en les ambigüitats visuals que sorgeixen a escala de paraula, destacant que no tots els sons que escoltem es poden distingir fàcilment observant els llavis. A la literatura, els primers sistemes ALR van abordar tasques de reconeixement senzilles, com ara el reconeixement de l’alfabet o els dígits, però progressivament van passar a entorns mes complexos i realistes que han conduït a diversos sistemes recents dirigits a la lectura continua dels llavis. En gran manera, aquests avenços han estat possibles gracies a la construcció de sistemes potents basats en arquitectures d’aprenentatge profund que han començat a substituir ràpidament els sistemes tradicionals. Tot i que les taxes de reconeixement de la lectura continua dels llavis poden semblar modestes en comparació amb les assolides pels sistemes basats en audio, és evident que el camp ha fet un pas endavant. Curiosament, es pot observar un efecte anàleg quan els humans intenten descodificar la parla: donats senyals sense soroll, la majoria de la gent pot descodificar el canal d’àudio sense esforç¸, però tindria dificultats per llegir els llavis, ja que l’ambigüitat dels senyals visuals fa necessari l’ús de context addicional per descodificar el missatge. En aquesta tesi explorem el modelatge adequat de representacions visuals amb l’objectiu de millorar la lectura contínua dels llavis. Amb aquest objectiu, presentem diferents mecanismes basats en dades per fer front als principals reptes de la lectura de llavis relacionats amb les ambigüitats o la dependència dels parlants dels senyals visuals. Els nostres resultats destaquen els avantatges d’una correcta codificació del canal visual, per a la qual les característiques més útils són aquelles que codifiquen les posicions corresponents dels llavis d’una manera similar, independentment de l’orador. Aquest fet obre la porta a i) la lectura de llavis en molts idiomes diferents sense necessitat de conjunts de dades a gran escala, i ii) a l’augment de la contribució del canal visual en sistemes de parla audiovisuals.´ D’altra banda, els nostres experiments identifiquen una tendència a centrar-se en iii la modelització del context temporal com la clau per avançar en el camp, on hi ha la necessitat de models d’ALR que s’entrenin en conjunts de dades que incloguin una gran variabilitat de la parla a diversos nivells de context. En aquesta tesi, demostrem que tant el modelatge adequat de les representacions visuals com la capacitat de retenir el context a diversos nivells són condicions necessàries per construir sistemes de lectura de llavis amb èxit.
Chilakapati, Praveen. "Driving Simulator Validation and Rear-End Crash Risk Analysis at a Signalised Intersection." Master's thesis, University of Central Florida, 2006. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2925.
M.S.
Department of Civil and Environmental Engineering
Engineering and Computer Science
Civil Engineering
Yau, Wai Chee. "Video Analysis of Mouth Movement Using Motion Templates for Computer-based Lip-Reading." RMIT University, Electrical and Computer Engineering, 2008. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20081209.162504.
Full textJalkebo, Charlotte. "Placement of Controls in Construction Equipment Using Operators´Sitting Postures : Process and Recommendations." Thesis, Linköpings universitet, Maskinkonstruktion, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-108980.
Full textLEONE, GIUSEPPE RICCARDO. "Comunicazione bimodale nel web per mezzo di facce parlanti 3D." Doctoral thesis, 2014. http://hdl.handle.net/2158/874631.
Rajaram, Siddharth. "Selective attention and speech processing in the cortex." Thesis, 2014. https://hdl.handle.net/2144/13312.