Theses on the topic "Automatic Text Recognition (ATR)"
Create an accurate citation in APA, MLA, Chicago, Harvard and several other styles
Consult the 47 best theses for your research on the topic "Automatic Text Recognition (ATR)".
Next to every source in the list of references there is an "Add to bibliography" button. Click it, and we will automatically generate the bibliographic reference for the chosen work in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.
You can also download the full text of the scholarly publication as a PDF and read its abstract online whenever this information is included in the metadata.
Browse theses on a wide variety of disciplines and organise your list of references correctly.
Chiffoleau, Floriane. « Understanding the automatic text recognition process : model training, ground truth and prediction errors ». Electronic Thesis or Diss., Le Mans, 2024. http://www.theses.fr/2024LEMA3002.
This thesis works on identifying what a text recognition model can learn during its training, through the examination of the content of its ground truth and of its prediction errors. The main intent is to improve our knowledge of how such a neural network operates, with experiments focused on typewritten documents. The methods used mostly concentrate on a thorough exploration of the training data, the observation of the model's prediction errors, and the correlation between the two. A first hypothesis, based on the influence of the lexicon, was inconclusive. However, it steered the observation towards a new level of study, relying on an infralexical unit: the n-gram. The distribution of n-grams in the training data was analysed and subsequently compared to that of the n-grams retrieved from the prediction errors. Promising results led to further exploration, while upgrading from a single-language to a multilingual model. Conclusive results enabled me to infer that n-grams might indeed be a valid explanation of recognition performance.
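To illustrate the kind of n-gram comparison this abstract describes, a minimal sketch might contrast the character n-gram distribution of the training ground truth with the n-grams found in error spans. This is an illustrative reconstruction, not the author's code; the corpus strings and the n-gram order are invented.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return the character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_distribution(lines, n=3):
    """Relative frequency of character n-grams over a list of lines."""
    counts = Counter(g for line in lines for g in char_ngrams(line, n))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

# Placeholder data: ground-truth lines and substrings where the model erred.
ground_truth = ["the committee approved the motion", "typewritten minutes of the meeting"]
error_spans = ["comittee", "minuts"]

train_dist = ngram_distribution(ground_truth)
error_ngrams = Counter(g for s in error_spans for g in char_ngrams(s))

# N-grams frequent in errors but rare (or absent) in the training data are
# candidates for explaining recognition failures.
for gram, cnt in error_ngrams.most_common():
    print(gram, cnt, train_dist.get(gram, 0.0))
```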
Gregori, Alessandro <1975>. « Automatic Speech Recognition (ASR) and NMT for Interlingual and Intralingual Communication : Speech to Text Technology for Live Subtitling and Accessibility ». Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amsdottorato.unibo.it/9931/1/Gregori_Alessandro_tesi.pdf.
Jansson, Annika. « Tal till text för relevant metadatataggning av ljudarkiv hos Sveriges Radio ». Thesis, KTH, Medieteknik och interaktionsdesign, MID, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-169464.
Speech to text for relevant metadata tagging of the audio archive at Sveriges Radio. Abstract: In the years 2009-2013, Sveriges Radio digitized its program archive. Sveriges Radio's ambition is that more material from the 175,000 hours of radio it broadcasts every year should be archived. Making all material searchable is a relatively time-consuming process, and it is far from certain that the quality of the data is equally high for all items. The question addressed in this thesis is: what opportunities exist to develop a Swedish speech-to-text system for Sveriges Radio? Speech-to-text systems have been analyzed and examined to give Sveriges Radio a current overview of the subject. Interviews with similar organizations working in the field were carried out to see how far they have come in their own development. A literature study of recent research in speech recognition was conducted to determine which system would best match Sveriges Radio's needs and requirements. To build an ASR (Automatic Speech Recognition) system, Sveriges Radio should first concentrate on transcribing its audio material. There are three alternatives: transcribe the material in-house, selecting a number of programs with different orientations to cover as wide a range of content as possible, preferably with different speakers so that speaker recognition can be developed later, the easiest way being to let the professionals who make the features and programs do the transcription in the system; start a project similar to what the BBC has done and enlist the help of the public; or buy a transcription service. My advice is to continue evaluating the Kaldi system, because it has evolved significantly in recent years and seems relatively easy to extend. The open-source system that Lingsoft uses is also interesting to study further.
Gong, XiangQi. « Election markup language (EML) based tele-voting system ». Thesis, University of the Western Cape, 2009. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_5841_1350999620.
voting machines, voting via the Internet, telephone, SMS and digital interactive television. This thesis concerns voting by telephone, or televoting; it starts by giving a brief overview and evaluation of various models and technologies that are implemented within such systems. The aspects of televoting that have been investigated are the technologies that provide a voice interface to the voter and conduct the voting process, namely the Election Markup Language (EML), Automated Speech Recognition (ASR) and Text-to-Speech (TTS).
Wager, Nicholas. « Automatic Target Recognition (ATR) : background statistics and the detection of targets in clutter ». Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1994. http://handle.dtic.mil/100.2/ADA293062.
Thesis advisor(s): David L. Fried, David Scott Davis. "December 1994." Includes bibliographical references. Also available online.
Horvath, Matthew Steven. « Performance Prediction of Quantization Based Automatic Target Recognition Algorithms ». Wright State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=wright1452086412.
Jobbins, Amanda Caryn. « The contribution of semantics to automatic text processing ». Thesis, Nottingham Trent University, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.302405.
Texte intégralNamane, Abderrahmane. « Degraded printed text and handwritten recognition methods : Application to automatic bank check recognition ». Université Louis Pasteur (Strasbourg) (1971-2008), 2007. http://www.theses.fr/2007STR13048.
Character recognition is a significant stage in all document recognition systems. Character recognition is considered an assignment and decision problem for a given character, and is an active research subject in many disciplines. This thesis is mainly related to the recognition of degraded printed and handwritten characters. New solutions were brought to the field of document image analysis (DIA). The first solution concerns the development of two recognition methods for handwritten numeral characters, namely a method based on the use of the Fourier-Mellin transform (FMT) and the self-organizing map (SOM), and a parallel combination of HMM-based classifiers using a new projection technique for parameter extraction. The second solution is a new holistic recognition method for handwritten words applied to French legal amounts. The third solution presents two recognition methods based on neural networks for degraded printed characters, applied to Algerian postal checks. The first is based on sequential combination, and the second uses a serial combination based mainly on the introduction of a relative distance for measuring the quality of a degraded character. During the development of this thesis, preprocessing methods were also developed, in particular handwritten numeral slant correction and the detection of the handwritten word's central zone and slope.
Bayik, Tuba Makbule. « Automatic Target Recognition In Infrared Imagery ». Master's thesis, METU, 2004. http://etd.lib.metu.edu.tr/upload/2/12605388/index.pdf.
Bae, Junhyeong. « Adaptive Waveforms for Automatic Target Recognition and Range-Doppler Ambiguity Mitigation in Cognitive Sensor ». Diss., The University of Arizona, 2013. http://hdl.handle.net/10150/306942.
Texte intégralAbdel-Rahman, Tarek. « Mixture of Factor Analyzers (MoFA) Models for the Design and Analysis of SAR Automatic Target Recognition (ATR) Algorithms ». The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1500625807524146.
Shou-Chun, Yin 1980. « Speaker adaptation in joint factor analysis based text independent speaker verification ». Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=100735.
Texte intégralSequeira, José Francisco Rodrigues. « Automatic knowledge base construction from unstructured text ». Master's thesis, Universidade de Aveiro, 2016. http://hdl.handle.net/10773/17910.
Taking into account the overwhelming number of biomedical publications being produced, the effort required for a user to efficiently explore those publications in order to establish relationships between a wide range of concepts is staggering. This dissertation presents GRACE, a web-based platform that provides an advanced graphical exploration interface that allows users to traverse the biomedical domain in order to find explicit and latent associations between annotated biomedical concepts belonging to a variety of semantic types (e.g., Genes, Proteins, Disorders, Procedures and Anatomy). The knowledge base utilized is a collection of MEDLINE articles with English abstracts. These annotations are stored in an efficient data store that allows for complex queries and high-performance data delivery. Concept relationships are inferred through statistical analysis, applying association measures to annotated terms. These processes grant the graphical interface the ability to create, in real time, a data visualization in the form of a graph for the exploration of these biomedical concept relationships.
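As a rough illustration of inferring concept associations from co-annotated terms, the sketch below computes pointwise mutual information over document-level co-occurrences. The concepts are invented and the actual association measures used by GRACE may differ.

```python
import math
from itertools import combinations
from collections import Counter

# Placeholder annotations: one set of concept identifiers per abstract.
docs = [
    {"BRCA1", "breast neoplasm"},
    {"BRCA1", "breast neoplasm", "tamoxifen"},
    {"tamoxifen", "hot flashes"},
]

n_docs = len(docs)
single = Counter(c for d in docs for c in d)
pair = Counter(frozenset(p) for d in docs for p in combinations(sorted(d), 2))

def pmi(a, b):
    """Pointwise mutual information of two concepts over document co-occurrence."""
    p_ab = pair[frozenset((a, b))] / n_docs
    if p_ab == 0:
        return float("-inf")
    return math.log2(p_ab / ((single[a] / n_docs) * (single[b] / n_docs)))

print(pmi("BRCA1", "breast neoplasm"))   # strong association
print(pmi("BRCA1", "hot flashes"))       # no co-occurrence
```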
Alamri, Safi S. « Text-independent, automatic speaker recognition system evaluation with males speaking both Arabic and English ». Thesis, University of Colorado at Denver, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1605087.
Automatic speaker recognition is an important key to speaker identification in media forensics, and with the increasing mixing of cultures there is an increase in bilingual speakers all around the world. The purpose of this thesis is to compare text-independent samples of one person using two different languages, Arabic and English, against a single-language reference population. The hope is that a design can be started that may be useful in further developing software that can perform accurate text-independent ASR for bilingual speakers speaking either language against a single-language reference population. This thesis took an Arabic model sample and compared it against samples that were both Arabic and English, using an Arabic reference population, all collected from videos downloaded from the Internet. All of the samples were text-independent and enhanced for optimal performance. The data was run through a biometric software package called BATVOX 4.1, which utilizes the MFCC and GMM methods of speaker recognition and identification. The result of testing through BATVOX 4.1 was likelihood ratios for each sample, which were evaluated for similarities and differences, trends, and problems that had occurred.
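BATVOX itself is proprietary, so the sketch below only illustrates the generic MFCC-plus-GMM likelihood-ratio idea the abstract refers to, using scikit-learn and synthetic feature matrices in place of real MFCCs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder "MFCC" matrices (frames x coefficients); real features would come
# from an acoustic front end.
speaker_train = rng.normal(0.5, 1.0, size=(500, 13))
reference_pop = rng.normal(0.0, 1.0, size=(2000, 13))
test_utterance = rng.normal(0.5, 1.0, size=(300, 13))

# Speaker model and background (reference population) model.
spk_gmm = GaussianMixture(n_components=4, random_state=0).fit(speaker_train)
ubm_gmm = GaussianMixture(n_components=4, random_state=0).fit(reference_pop)

# Average per-frame log-likelihood ratio: positive values favour the speaker.
llr = spk_gmm.score(test_utterance) - ubm_gmm.score(test_utterance)
print(f"log-likelihood ratio: {llr:.3f}")
```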
Lee, Spencer Jaehoon Gilbert Juan E. « Post-speech-recognition processing in domain-specific text-corpus-based distributed listening system : analysis, interpretation and selection of speech recognition results ». Auburn, Ala., 2006. http://repo.lib.auburn.edu/2006%20Summer/Theses/LEE_SPENCER_7.pdf.
Reynolds, Douglas A. « A Gaussian mixture modeling approach to text-independent speaker identification ». Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/16903.
Texte intégralAlKhateeb, Jawad H. Y. « Word based off-line handwritten Arabic classification and recognition. Design of automatic recognition system for large vocabulary offline handwritten Arabic words using machine learning approaches ». Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/4440.
Pisane, Jonathan. « Automatic target recognition using passive bistatic radar signals ». Phd thesis, Supélec, 2013. http://tel.archives-ouvertes.fr/tel-00963601.
Texte intégralMillard, Benjamin J. « Oral Proficiency Assessment of French Using an Elicited Imitation Test and Automatic Speech Recognition ». BYU ScholarsArchive, 2011. https://scholarsarchive.byu.edu/etd/2690.
AlKhateeb, Jawad Hasan Yasin. « Word based off-line handwritten Arabic classification and recognition : design of automatic recognition system for large vocabulary offline handwritten Arabic words using machine learning approaches ». Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/4440.
Texte intégralOgun, Sewade. « Generating diverse synthetic data for ASR training data augmentation ». Electronic Thesis or Diss., Université de Lorraine, 2024. http://www.theses.fr/2024LORR0116.
In the last two decades, the error rate of automatic speech recognition (ASR) systems has drastically dropped, making them more useful in real-world applications. This improvement can be attributed to several factors including new architectures using deep learning techniques, new training algorithms, large and diverse training datasets, and data augmentation. In particular, the large-scale training datasets have been pivotal to learning robust speech representations for ASR. Their large size allows them to effectively cover the inherent diversity in speech, in terms of speaker voice, speaking rate, pitch, reverberation, and noise. However, the size and diversity of datasets typically found in high-resourced languages are not available in medium- and low-resourced languages and in domains with specialised vocabulary like the medical domain. Therefore, the popular method to increase dataset diversity is through data augmentation. With the recent increase in the naturalness and quality of synthetic data that can be generated by text-to-speech (TTS) and voice conversion (VC) systems, these systems have also become viable options for ASR data augmentation. However, several problems limit their application. First, TTS/VC systems require high-quality speech data for training. Hence, we develop a method of dataset curation from an ASR-designed corpus for training a TTS system. This method leverages the increasing accuracy of deep-learning-based, non-intrusive quality estimators to filter high-quality samples. We explore filtering the ASR dataset at different thresholds to balance the size of the dataset, number of speakers, and quality. With this method, we create a high-quality multi-speaker dataset which is comparable to LibriTTS in quality. Second, the data generation process needs to be controllable to generate diverse TTS/VC data with specific attributes. Previous TTS/VC systems either condition the system on the speaker embedding alone or use discriminative models to learn the speech variabilities. In our approach, we design an improved flow-based architecture that learns the distribution of different speech variables. We find that our modifications significantly increase the diversity and naturalness of the generated utterances over a GlowTTS baseline, while being controllable. Lastly, we evaluated the significance of generating diverse TTS and VC data for augmenting ASR training data. As opposed to naively generating the TTS/VC data, we independently examined different approaches such as sentence selection methods and increasing the diversity of speakers, phoneme duration, and pitch contours, in addition to systematically increasing the environmental conditions of the generated data. Our results show that TTS/VC augmentation holds promise in increasing ASR performance in low- and medium-data regimes. In conclusion, our experiments provide insight into the variabilities that are particularly important for ASR, and reveal a systematic approach to ASR data augmentation using synthetic data.
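A minimal sketch of the threshold-based corpus filtering described above follows. The quality scores stand in for a non-intrusive estimator's output (for example a predicted MOS); file names and values are invented, and this is not the thesis code.

```python
# Threshold-based corpus filtering for TTS training data curation.
corpus = [
    {"wav": "spk1_0001.wav", "speaker": "spk1", "mos": 4.2},
    {"wav": "spk1_0002.wav", "speaker": "spk1", "mos": 2.9},
    {"wav": "spk2_0001.wav", "speaker": "spk2", "mos": 3.8},
]

def filter_corpus(utterances, threshold):
    """Keep utterances whose estimated quality reaches the threshold."""
    return [u for u in utterances if u["mos"] >= threshold]

# Sweep thresholds to see the trade-off between dataset size, speaker
# coverage and estimated quality.
for threshold in (3.0, 3.5, 4.0):
    kept = filter_corpus(corpus, threshold)
    speakers = {u["speaker"] for u in kept}
    print(threshold, len(kept), "utterances,", len(speakers), "speakers")
```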
Hon, Wing-kai. « On the construction and application of compressed text indexes ». Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://sunzi.lib.hku.hk/hkuto/record/B31059739.
Hon, Wing-kai, et 韓永楷. « On the construction and application of compressed text indexes ». Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31059739.
Texte intégralZhu, Winstead Xingran. « Hotspot Detection for Automatic Podcast Trailer Generation ». Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-444887.
McMurtry, William F. « Information Retrieval for Call Center Quality Assurance ». The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587036885211228.
Texte intégralKullmann, Emelie. « Speech to Text for Swedish using KALDI ». Thesis, KTH, Optimeringslära och systemteori, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189890.
Texte intégralDe senaste åren har olika tillämpningar inom människa-dator interaktion och främst taligenkänning hittat sig ut på den allmänna marknaden. Många system och tekniska produkter stöder idag tjänsterna att transkribera tal och diktera text. Detta gäller dock främst de större språken och sällan finns samma stöd för mindre språk som exempelvis svenskan. I detta examensprojekt har en modell för taligenkänning på svenska ut- vecklas. Det är genomfört på uppdrag av Sveriges Radio som skulle ha stor nytta av en fungerande taligenkänningsmodell på svenska. Modellen är utvecklad i ramverket Kaldi. Två tillvägagångssätt för den akustiska träningen av modellen är implementerade och prestandan för dessa två är evaluerade och jämförda. Först tränas en modell med användningen av Hidden Markov Models och Gaussian Mixture Models och slutligen en modell där Hidden Markov Models och Deep Neural Networks an- vänds, det visar sig att den senare uppnår ett bättre resultat i form av måttet Word Error Rate.
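The two acoustic models in this abstract are compared by Word Error Rate. As a reference, a minimal WER implementation (standard word-level edit distance, not Kaldi's own scoring tool) might look like the sketch below; the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein word edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)

# Hypothetical outputs from two acoustic models for the same utterance.
ref = "det här är ett exempel på taligenkänning"
print(word_error_rate(ref, "det har är ett exempel på taligenkänning"))  # GMM-like system
print(word_error_rate(ref, "det här är ett exempel på taligenkänning"))  # DNN-like system
```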
Nguyen, Chu Duc. « Localization and quality enhancement for automatic recognition of vehicle license plates in video sequences ». Thesis, Ecully, Ecole centrale de Lyon, 2011. http://www.theses.fr/2011ECDL0018.
Automatic reading of vehicle license plates is considered an approach to mass surveillance. It allows, through detection/localization and optical recognition, the identification of a vehicle in images or video sequences. Many applications such as traffic monitoring, detection of stolen vehicles, tolling or the management of parking entrances and exits use this method. Yet in spite of the important progress made since the appearance of the first prototypes in 1979, with recognition rates that are sometimes impressive thanks to advances in science and sensor technology, the constraints imposed on the operation of such systems remain limiting. Indeed, the optimal use of techniques for localizing and recognizing license plates is restricted to operational scenarios with controlled lighting conditions and limitations on pose, velocity, or simply the type of plate. Automatic reading of vehicle license plates therefore remains an open research problem. The major contribution of this thesis is threefold. First, a new approach for robust license plate localization in images or image sequences is proposed. Then, improving the quality of the plates is treated with a localized adaptation of a super-resolution technique. Finally, a unified model of localization and super-resolution is proposed to reduce the time complexity of both approaches combined.
Granell, Romero Emilio. « Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing ». Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/86137.
Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics and Pattern Recognition that studies, among other things, the use of human natural language in Human-Computer Interaction. Most NLP research tasks can be applied to solve real-world problems. This is the case of natural language recognition and translation, which can be used to build automatic systems for document transcription and translation. For digitised handwritten documents, transcription is used to provide digital access to the contents, since simple image digitisation only provides, in most cases, search by image and not by linguistic content. Transcription is even more important in the case of historical manuscripts, since most of these documents are unique and the preservation of their content is crucial for cultural and historical reasons. The transcription of historical manuscripts is usually carried out by palaeographers, who are experts in ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) systems have become a common tool to assist palaeographers in their task, providing a draft transcription that palaeographers can correct with more or less sophisticated methods. This draft transcription is useful when its error rate is low enough for the correction process to be more comfortable than a complete transcription from scratch. Therefore, obtaining a draft transcription with a low error rate is crucial for this NLP technology to be incorporated into the transcription process. The work described in this thesis focuses on improving the draft transcription offered by an HTR system, with the aim of reducing the effort made by palaeographers to obtain the transcription of digitised historical manuscripts. This problem is addressed from three different but complementary scenarios: · Multimodality: The use of HTR systems allows palaeographers to speed up the manual transcription process, since they can correct a draft of the transcription. Another alternative is to obtain the draft transcription by dictating the content to an Automatic Speech Recognition system. When both sources are available, a multimodal combination of them is possible and an iterative process can be carried out to refine the final hypothesis. · Interactivity: The use of assistive technologies in the transcription process makes it possible to reduce the time and human effort required to obtain the correct transcription, thanks to the cooperation between the assistive system and the palaeographer to obtain the perfect transcription. Multimodal feedback can be used in the assistive system to provide additional sources of information with signals that represent the same word sequence to be transcribed (for example, a text image, or the speech signal of the dictation of the content of that text image), or signals that represent only a word or character to be corrected (for example, a word handwritten on a touchscreen).
· Crowdsourcing: Open, distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the supervision effort of the palaeographers can be drastically reduced. The multimodal combination makes it possible to use the dictation of the content of handwritten text lines on a multimodal crowdsourcing platform, where collaborators can provide the speech samples using their own mobile device instead of using computers,
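As a rough illustration of the multimodal combination scenario described above (combining an HTR draft with an ASR hypothesis of the dictated content), the toy sketch below keeps, for each position, the word with the higher system confidence. The real iterative combination in the thesis is considerably more sophisticated; words and confidences here are invented.

```python
# Toy combination of an HTR hypothesis and an ASR hypothesis of the same
# handwritten line: per position, keep the word whose system reports the
# higher confidence.
htr = [("quando", 0.61), ("el", 0.93), ("rey", 0.88), ("partio", 0.54)]
asr = [("cuando", 0.82), ("el", 0.95), ("rey", 0.90), ("partió", 0.87)]

def combine(hyp_a, hyp_b):
    assert len(hyp_a) == len(hyp_b), "toy example assumes aligned hypotheses"
    return [max(pair, key=lambda ws: ws[1])[0] for pair in zip(hyp_a, hyp_b)]

print(" ".join(combine(htr, asr)))  # -> "cuando el rey partió"
```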
Granell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137
Santos, André Jerónimo Martins dos. « Automatic and interactive annotation of PDF documents ». Master's thesis, Universidade de Aveiro, 2016. http://hdl.handle.net/10773/17886.
The accelerated increase of the biomedical literature has led to various efforts to extract and store, in a structured way, the information related to the concepts and relations presented in those texts, providing investigators and researchers with fast and easy access to knowledge. However, this process of "knowledge curation" is an extremely exhaustive task, and the use of automatic annotation tools that apply text mining techniques is becoming more and more common. Even though complete annotation systems already exist and produce high-performance results, they are not widely used by the biomedical community, mainly because of their complexity and also due to some limitations in usability. On the other hand, the PDF has become in recent years one of the most popular formats for publishing and sharing documents because it can be displayed exactly in the same way independently of the system or platform where it is accessed. The majority of annotation tools were mainly designed to extract information from raw text, although a big part of the biomedical literature is published and distributed in PDF, and thus information extraction from PDF documents should be a focus point for the biomedical text mining community. The objective of the work described in this document is the extension of the Neji framework, allowing the processing of documents in PDF format, and the integration of these features into the Egas platform, allowing a user to simultaneously visualize the original article in PDF format and its extracted text. The improved and developed systems present good performance results, both in terms of processing speed and representation of the information, also contributing to a better user experience. Besides that, they present several advantages for the biomedical community, allowing the direct annotation of PDF articles and simplifying the use and configuration of these annotation systems by researchers.
Catae, Fabricio Shigueru. « Classificação automática de texto por meio de similaridade de palavras : um algoritmo mais eficiente ». Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/3/3141/tde-06072014-225124/.
Latent semantic analysis is a technique in natural language processing which aims to simplify the task of finding similarity between words and sentences. Using a vector space model for text representation, it selects the most significant values for reconstructing the space in a smaller dimension. This simplification allows it to generalize models, moving words and texts towards a semantic representation. Thus, it identifies a set of underlying meanings or hidden concepts without prior knowledge of grammar. The goal of this study was to determine the optimal dimensionality of the semantic space in a text classification task. The proposed solution corresponds to a semi-supervised algorithm that applies the nearest-neighbor classification method to known examples and plots the estimated accuracy on a graph. Because this is a very time-consuming process, the vectors are projected onto a space in such a way that the calculation becomes incremental. Since the spaces are isometric, the similarity between documents remains equivalent. This proposal determines the optimal dimension of the semantic space with little effort, not much beyond the time required by traditional latent semantic analysis. The results showed significant gains in adopting the correct number of dimensions.
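The dimensionality search described above can be illustrated with scikit-learn, assuming truncated SVD as the LSA projection and 1-nearest-neighbour cross-validated accuracy as the selection criterion. The incremental projection trick from the thesis is not reproduced, and the tiny corpus is invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Tiny invented corpus; the real experiment would use a labelled text collection.
texts = [
    "payment received for the invoice", "invoice overdue please pay",
    "the football match ended in a draw", "great goal in the second half",
    "bank transfer completed today", "the striker scored twice",
]
labels = [0, 0, 1, 1, 0, 1]

tfidf = TfidfVectorizer().fit_transform(texts)

# Estimate classification accuracy for several semantic-space dimensionalities;
# the smallest dimension with the best accuracy would be retained.
for k in (2, 3, 4):
    vecs = TruncatedSVD(n_components=k, random_state=0).fit_transform(tfidf)
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=1), vecs, labels, cv=3).mean()
    print(k, round(acc, 2))
```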
Johansson, Elias. « Separation and Extraction of Valuable Information From Digital Receipts Using Google Cloud Vision OCR ». Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-88602.
Thompson, Carrie A. « The Development and Validation of a Spanish Elicited imitation Test of Oral Language Proficiency for the Missionary Training Center ». BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3602.
Texte intégralZamora, Martínez Francisco Julián. « Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática ». Doctoral thesis, Universitat Politècnica de València, 2012. http://hdl.handle.net/10251/18066.
Zamora Martínez, FJ. (2012). Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/18066
Alabau, Gonzalvo Vicente. « Multimodal interactive structured prediction ». Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35135.
Alabau Gonzalvo, V. (2014). Multimodal interactive structured prediction [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35135
Dang, Quoc Bao. « Information spotting in huge repositories of scanned document images ». Thesis, La Rochelle, 2018. http://www.theses.fr/2018LAROS024/document.
This work aims at developing a generic framework able to produce camera-based applications for information spotting in huge repositories of heterogeneous-content document images via local descriptors. The targeted systems take as input a portion of an image acquired as a query, and return the focused portions of database images that best match the query. We first propose a set of generic feature descriptors for camera-based document image retrieval and spotting systems. Our proposed descriptors comprise SRIF, PSRIF, DELTRIF and SSKSRIF, which are built from the spatial information of the nearest keypoints around a keypoint, the keypoints being extracted from the centroids of connected components. From these keypoints, invariant geometrical features are taken into account in the descriptor. SRIF and PSRIF are computed from a local set of m nearest keypoints around a keypoint, while DELTRIF and SSKSRIF combine local shape descriptions without parameters via a Delaunay triangulation formed from the set of keypoints extracted from a document image. Furthermore, we propose a framework to compute the descriptors from the spatial layout of dedicated keypoints (e.g. SURF, SIFT or ORB) so that they can deal with heterogeneous-content camera-based document image retrieval and spotting. In practice, a large-scale indexing system with an enormous number of descriptors places a heavy burden on memory when they are stored, and the high dimensionality of the descriptors can reduce indexing accuracy. We propose three robust indexing frameworks that can be employed without storing local descriptors in memory, saving memory and speeding up retrieval by discarding distance validation. The randomized clustering tree indexing inherits from kd-trees, k-means trees and random forests in the way it randomly selects K dimensions combined with the highest-variance dimension at each node of the tree. We also propose a weighted Euclidean distance between two data points that is computed and oriented along the highest-variance dimension. The second proposed method relies on an indexing system that employs one simple hash table for indexing and retrieval without storing database descriptors. Besides, we propose an extended hashing-based method for indexing multiple kinds of features coming from multiple layers of the image. Along with the proposed descriptors and indexing frameworks, we propose a simple and robust way to compute the shape orientation of MSER regions so that they can be combined with dedicated descriptors (e.g. SIFT, SURF, ORB) in a rotation-invariant manner. When descriptors are able to capture neighborhood information around MSER regions, we propose a way to extend MSER regions by increasing the radius of each region. This strategy can also be applied to other detected regions in order to make descriptors more distinctive. Moreover, we employ the extended hashing-based method for indexing multiple kinds of features from multiple layers of images; this system applies not only to a uniform feature type but also to multiple feature types from separate layers. Finally, in order to assess the performance of our contributions, and given that no public dataset exists for camera-based document image retrieval and spotting systems, we built a new dataset which has been made freely and publicly available to the scientific community. This dataset contains portions of document images acquired via a camera as queries.
It is composed of three kinds of information: textual content, graphical content and heterogeneous content.
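A minimal sketch of the single-hash-table idea mentioned above (indexing quantised descriptors so that the raw descriptors need not be kept in memory) is shown below. The quantisation step, voting scheme and data are invented and much simpler than the frameworks proposed in the thesis.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

def hash_descriptor(desc, step=0.5):
    """Quantise a real-valued descriptor into a hashable key."""
    return tuple(np.floor(desc / step).astype(int))

# Index: hash key -> list of document ids. Only keys and ids are stored,
# not the descriptors themselves, which keeps memory use low.
index = defaultdict(list)
database = {doc_id: rng.normal(size=(50, 8)) for doc_id in ("doc_a", "doc_b")}
for doc_id, descriptors in database.items():
    for d in descriptors:
        index[hash_descriptor(d)].append(doc_id)

# Query: vote for the documents whose descriptors fall in the same buckets.
query = database["doc_a"][:10] + rng.normal(scale=0.01, size=(10, 8))
votes = defaultdict(int)
for d in query:
    for doc_id in index.get(hash_descriptor(d), []):
        votes[doc_id] += 1
print(max(votes, key=votes.get) if votes else "no match")  # expected: "doc_a"
```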
Vythelingum, Kévin. « Construction rapide, performante et mutualisée de systèmes de reconnaissance et de synthèse de la parole pour de nouvelles langues ». Thesis, Le Mans, 2019. http://www.theses.fr/2019LEMA1035.
We study in this thesis the joint construction of speech recognition and synthesis systems for new languages, with the goals of accuracy and quick development. The rapid development of voice technologies for new languages is driving scientific ambitions and is now considered strategic by industrial players. However, language development research is led by a few research centers, each working on a limited number of languages, even though these technologies share many common points. Our study focuses on building and sharing tools between systems for creating lexicons, learning phonetic rules and taking advantage of imperfect data. Our contributions focus on the selection of relevant data for learning acoustic models, the joint development of phonetizers and pronunciation lexicons for speech recognition and synthesis, and the use of neural models for phonetic transcription from text and from the speech signal. In addition, we present an approach for the automatic detection of phonetic transcription errors in annotated speech databases. This study has shown that it is possible to significantly reduce the quantity of annotated data needed for the development of new text-to-speech systems, which naturally helps to reduce data collection time when creating new systems. Finally, we study an application case by jointly building a speech recognition and synthesis system for a new language.
Benammar, Riyadh. « Détection non-supervisée de motifs dans les partitions musicales manuscrites ». Thesis, Lyon, 2019. http://www.theses.fr/2019LYSEI112.
This thesis falls within the field of data mining applied to ancient handwritten music scores and aims at searching for frequent melodic or rhythmic motifs, defined as repetitive note sequences with characteristic properties. There are a large number of possible variations of motifs: transpositions, inversions and so-called "mirror" motifs. These motifs allow musicologists to carry out an in-depth analysis of the works of a composer or of a musical style. In a context of exploring large corpora where scores are merely digitized and not transcribed, an automated search for motifs that satisfy targeted constraints becomes an essential tool for their study. To achieve the objective of detecting frequent motifs without prior knowledge, we started from images of digitized scores. After pre-processing steps on the image, we exploited and adapted a model for detecting and recognizing musical primitives (note heads, stems...) from the Region Proposal Network (RPN) family of convolutional neural networks. We then developed a primitive encoding method to generate a sequence of notes without the complex task of transcribing the entire manuscript work. This sequence was then analyzed using the CSMA (Constraint String Mining Algorithm) approach, designed to detect the frequent motifs present in one or more sequences, taking into account constraints on their frequency and length, as well as the size and number of gaps allowed within the motifs. The gap constraint was then studied as a way to cope with recognition errors produced by the RPN network, thus avoiding the implementation of a post-correction system for transcription errors. The work was finally validated by the study of musical motifs for composer identification and classification.
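To illustrate the kind of frequent-motif search described above, here is a toy counter of contiguous motifs over an encoded note sequence. Unlike CSMA it ignores gaps, transpositions and mirror forms, and the sequence and thresholds are invented.

```python
from collections import Counter

# Encoded note sequence (pitch names stand in for the thesis's encoding of primitives).
sequence = ["C4", "E4", "G4", "C4", "E4", "G4", "A4", "C4", "E4", "G4"]

def frequent_motifs(seq, min_len=2, max_len=4, min_count=2):
    """Count every contiguous motif of bounded length and keep the frequent ones."""
    counts = Counter(
        tuple(seq[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(seq) - n + 1)
    )
    return {m: c for m, c in counts.items() if c >= min_count}

for motif, count in sorted(frequent_motifs(sequence).items(), key=lambda kv: -kv[1]):
    print(" ".join(motif), count)
```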
Pagano, Alice. « Testing quality in interlingual respeaking and other methods of interlingual live subtitling ». Doctoral thesis, Università degli studi di Genova, 2022. https://hdl.handle.net/11567/1091438.
Live subtitling (LS) finds its foundations in pre-recorded subtitling for the d/Deaf and hard of hearing (SDH) and aims to produce real-time subtitles for live events and programs. LS implies the transfer from oral into written content (intersemiotic translation) and can be carried out from and to the same language (intralingual) or from one language to another (interlingual) to provide full accessibility for all, therefore combining SDH with the need of guaranteeing multilingual access as well. Real-time Interlingual Live Subtitling (from now on referred to as ILS) is currently achieved using different methods: the focus here is placed on interlingual respeaking as one of the currently used methods of LS, also referred to in this work as speech-to-text interpreting (STTI), which has triggered growing interest in the Italian industry over the past years. The doctoral thesis presented here intends to provide a wider picture of the literature and research on intralingual and interlingual respeaking to date, emphasizing the current situation of this practice in Italy. The aim of the research was to explore different ILS methods through their strengths and weaknesses, in an attempt to inform the industry on the impact that both the potentialities and the risks of the different techniques can have on the final overall quality of the subtitles. To do so, five ILS workflows requiring human and machine interaction to different extents were tested in terms of overall quality, thus not only from a linguistic accuracy point of view but also considering another crucial factor, the delay in the broadcast of the subtitles. Two case studies were carried out with different language pairs: a first experiment (English to Italian) tested and assessed quality in interlingual respeaking on one hand, and in simultaneous interpreting (SI) combined with intralingual respeaking, as well as SI with Automatic Speech Recognition (ASR), on the other. A second experiment (Spanish to Italian) evaluated and compared all five methods: the first three again, plus two more machine-centered ones, intralingual respeaking combined with machine translation (MT), and ASR with MT. Two workshops in interlingual respeaking were offered within the master's degree in Translation and Interpreting at the University of Genova to prepare students for the experiments, aimed at testing different training modules on ILS and their effectiveness on students' learning outcomes. For the final experiments, students were assigned different roles for each tested method and performed the required tasks, producing ILS from the same source text: a video of a full original speech at a live event. The outputs obtained were analyzed using the NTR model (Romero-Fresco & Pöchhacker, 2017) and the delay was calculated for each method. Preliminary quantitative results deriving from the NTR analyses and the calculation of delay were compared with two other case studies conducted by the University of Vigo and the University of Surrey, showing that more automated and fully automated workflows are indeed faster than the others, while they still present several important issues in translation and punctuation. Albeit on a small scale, the research also shows how urgent and potentially easy it could be to educate translators and interpreters in respeaking during their training phase, given their keen interest in the subject matter.
It is hoped that the results obtained can better shed light on the repercussions of using different methods and prompt further reflection on the importance of human interaction with automatic machine systems in providing high-quality accessibility at live events. It is also hoped that the involved students' interest in this field, which was completely unknown to them prior to this research, can point to the urgency of raising students' awareness and competence acquisition in the field of live subtitling through respeaking.
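The NTR model referenced above scores subtitle accuracy by deducting severity-weighted translation (T) and recognition (R) errors from the number of words N, commonly summarised as (N - T - R) / N x 100. The sketch below assumes that reading of the model and invents all counts and timings; it is not the thesis's evaluation code.

```python
def ntr_accuracy(n_words, translation_errors, recognition_errors):
    """NTR-style accuracy rate with severity-weighted error totals."""
    return (n_words - translation_errors - recognition_errors) / n_words * 100

def average_delay(speech_times, subtitle_times):
    """Mean latency between when a unit is spoken and when its subtitle appears."""
    return sum(s - t for t, s in zip(speech_times, subtitle_times)) / len(speech_times)

# Invented figures for illustration only.
print(round(ntr_accuracy(1200, translation_errors=18.5, recognition_errors=9.0), 2))
print(round(average_delay([0.0, 5.2, 11.8], [4.1, 9.9, 16.0]), 2), "seconds")
```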
Wächter, Thomas. « Semi-automated Ontology Generation for Biocuration and Semantic Search ». Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-64838.
Texte intégral« Text-independent speaker recognition using discriminative subspace analysis ». 2012. http://library.cuhk.edu.hk/record=b5549636.
Speaker Recognition (SR), which uses the voice to determine the speaker's identity, is an important and challenging research topic for biometric authentication. Generally speaking, speaker recognition can be divided into text-dependent and text-independent methods according to the verbal content of the speech signal. There are two major applications of speaker recognition: the first is speaker verification, also referred to as speaker authentication, which is used to validate the identity of a speaker according to the voice and involves a binary decision. The second is speaker identification, which is used to determine an unknown speaker's identity.
In a state-of-the-art speaker recognition system, the speaker model is usually trained by generative methods, which estimate the feature distribution of each speaker from the given data. These generative methods need a frame-based metric (e.g. probability or likelihood) calculation for making the final decision, which consumes considerable computing resources and slows down real-time responses. Meanwhile, many redundant data frames are blindly selected for training without efficient subspace dimension reduction. In order to overcome the disadvantages of generative methods and obtain boundary information between individual speakers, we propose to apply a discriminative subspace technique for model training and to employ simple but efficient distance metrics for decision score calculation.
In this thesis, we shall present an overview of both conventional and state-of-the-art generative speaker recognition methods (e.g. the Gaussian Mixture Model and Joint Factor Analysis) and analyze their advantages and disadvantages. In addition, we have also investigated the application of subspace analysis techniques to reduce feature dimensions and computation time. After that, a novel speaker recognition framework based on nonparametric Fisher's discriminant analysis, which we name Fishervoice, is proposed. The objective of the proposed Fishervoice algorithm is to model the intrinsic vocal characteristics in a discriminant subspace, de-emphasizing unwanted noise variations and emphasizing classification boundary information. Using the proposed Fishervoice framework, speaker recognition can be easily realized by mapping a test utterance to the Fishervoice subspace and then calculating the score between the test utterance and its reference. Besides, we explore the proposed Fishervoice framework with several extensions for further dimensionality reduction and performance improvement. Furthermore, we investigate various subspace analysis techniques in a total variability-based low-dimensional space for fast computation. Extensive experiments on two large speaker recognition corpora (XM2VTS and NIST) demonstrate significant improvements of Fishervoice over standard, state-of-the-art approaches for both speaker identification and verification systems.
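As a simplified illustration of scoring in a discriminative subspace (projecting a test supervector and taking the Euclidean distance to speaker references), the sketch below uses scikit-learn's parametric LDA as a stand-in for the nonparametric Fishervoice projection, with synthetic data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic "supervectors" for three enrolled speakers (a real system derives
# them from acoustic features); LDA stands in for the Fishervoice projection.
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(20, 10)) for i in range(3)])
y = np.repeat([0, 1, 2], 20)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
references = {spk: lda.transform(X[y == spk]).mean(axis=0) for spk in (0, 1, 2)}

# Identification: project the test supervector and pick the nearest reference.
test = rng.normal(loc=1, scale=1.0, size=(1, 10))
projected = lda.transform(test)[0]
scores = {spk: np.linalg.norm(projected - ref) for spk, ref in references.items()}
print(min(scores, key=scores.get))  # expected: speaker 1
```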
Jiang, Weiwu.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2012.
Includes bibliographical references (leaves 127-135).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract also in Chinese.
Abstract --- p.i
Acknowledgements --- p.vi
Contents --- p.xiv
List of Figures --- p.xvii
List of Tables --- p.xxiii
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Overview of Speaker Recognition Systems --- p.1
Chapter 1.2 --- Motivation --- p.4
Chapter 1.3 --- Outline of Thesis --- p.6
Chapter 2 --- Background Study --- p.7
Chapter 2.1 --- Generative Gaussian Mixture Model (GMM) --- p.7
Chapter 2.1.1 --- Basic GMM --- p.7
Chapter 2.1.2 --- The Gaussian Mixture Model-Universal Background Model (GMM-UBM) System --- p.9
Chapter 2.2 --- Discriminative Subspace Analysis --- p.12
Chapter 2.2.1 --- Principal Component Analysis --- p.12
Chapter 2.2.2 --- Linear Discriminant Analysis --- p.16
Chapter 2.2.3 --- Heteroscedastic Linear Discriminant Analysis --- p.17
Chapter 2.2.4 --- Locality Preserving Projections --- p.18
Chapter 2.3 --- Noise Compensation --- p.20
Chapter 2.3.1 --- Eigenvoice --- p.20
Chapter 2.3.2 --- Joint Factor Analysis --- p.24
Chapter 2.3.3 --- Probabilistic Linear Discriminant Analysis --- p.26
Chapter 2.3.4 --- Nuisance Attribute Projection --- p.30
Chapter 2.3.5 --- Within-class Covariance Normalization --- p.32
Chapter 2.4 --- Support Vector Machine --- p.33
Chapter 2.5 --- Score Normalization --- p.35
Chapter 2.6 --- Summary --- p.39
Chapter 3 --- Corpora for Speaker Recognition Experiments --- p.41
Chapter 3.1 --- Corpora for Speaker Identification Experiments --- p.41
Chapter 3.1.1 --- XM2VTS Corpus --- p.41
Chapter 3.1.2 --- NIST Corpora --- p.42
Chapter 3.2 --- Corpora for Speaker Verification Experiments --- p.45
Chapter 3.3 --- Summary --- p.47
Chapter 4 --- Performance Measures for Speaker Recognition --- p.48
Chapter 4.1 --- Performance Measures for Identification --- p.48
Chapter 4.2 --- Performance Measures for Verification --- p.49
Chapter 4.2.1 --- Equal Error Rate --- p.49
Chapter 4.2.2 --- Detection Error Tradeoff Curves --- p.49
Chapter 4.2.3 --- Detection Cost Function --- p.50
Chapter 4.3 --- Summary --- p.51
Chapter 5 --- The Discriminant Fishervoice Framework --- p.52
Chapter 5.1 --- The Proposed Fishervoice Framework --- p.53
Chapter 5.1.1 --- Feature Representation --- p.53
Chapter 5.1.2 --- Nonparametric Fisher’s Discriminant Analysis --- p.55
Chapter 5.2 --- Speaker Identification Experiments --- p.60
Chapter 5.2.1 --- Experiments on the XM2VTS Corpus --- p.60
Chapter 5.2.2 --- Experiments on the NIST Corpus --- p.62
Chapter 5.3 --- Summary --- p.64
Chapter 6 --- Extension of the Fishervoice Framework --- p.66
Chapter 6.1 --- Two-level Fishervoice Framework --- p.66
Chapter 6.1.1 --- Proposed Algorithm --- p.66
Chapter 6.2 --- Performance Evaluation on the Two-level Fishervoice Framework --- p.70
Chapter 6.2.1 --- Experimental Setup --- p.70
Chapter 6.2.2 --- Performance Comparison of Different Types of Input Supervectors --- p.72
Chapter 6.2.3 --- Performance Comparison of Different Numbers of Slices --- p.73
Chapter 6.2.4 --- Performance Comparison of Different Dimensions of Fishervoice Projection Matrices --- p.75
Chapter 6.2.5 --- Performance Comparison with Other Systems --- p.77
Chapter 6.2.6 --- Fusion with Other Systems --- p.78
Chapter 6.2.7 --- Extension of the Two-level Subspace Analysis Framework --- p.80
Chapter 6.3 --- Random Subspace Sampling Framework --- p.81
Chapter 6.3.1 --- Supervector Extraction --- p.82
Chapter 6.3.2 --- Training Stage --- p.83
Chapter 6.3.3 --- Testing Procedures --- p.84
Chapter 6.3.4 --- Discussion --- p.84
Chapter 6.4 --- Performance Evaluation of the Random Subspace Sampling Framework --- p.85
Chapter 6.4.1 --- Experimental Setup --- p.85
Chapter 6.4.2 --- Random Subspace Sampling Analysis --- p.87
Chapter 6.4.3 --- Comparison with Other Systems --- p.90
Chapter 6.4.4 --- Fusion with the Other Systems --- p.90
Chapter 6.5 --- Summary --- p.92
Chapter 7 --- Discriminative Modeling in Low-dimensional Space --- p.94
Chapter 7.1 --- Discriminative Subspace Analysis in Low-dimensional Space --- p.95
Chapter 7.1.1 --- Experimental Setup --- p.96
Chapter 7.1.2 --- Performance Evaluation on Individual Subspace Analysis Techniques --- p.98
Chapter 7.1.3 --- Performance Evaluation on Multi-type of Subspace Analysis Techniques --- p.105
Chapter 7.2 --- Discriminative Subspace Analysis with Support Vector Machine --- p.115
Chapter 7.2.1 --- Experimental Setup --- p.116
Chapter 7.2.2 --- Performance Evaluation on LDA+WCCN+SVM --- p.117
Chapter 7.2.3 --- Performance Evaluation on Fishervoice+SVM --- p.118
Chapter 7.3 --- Summary --- p.118
Chapter 8 --- Conclusions and Future Work --- p.120
Chapter 8.1 --- Contributions --- p.120
Chapter 8.2 --- Future Directions --- p.121
Chapter A --- EM Training GMM --- p.123
Bibliography --- p.127
Henriques, Daniel Filipe Rodrigues. « Automatic Completion of Text-based Tasks ». Master's thesis, 2019. http://hdl.handle.net/10362/92296.
Texte intégralLai, Chun Han, et 賴俊翰. « A Python Implementation of Automatic Speech-text Synchronization Using Speech Recognition and Text-to-Speech Technology ». Thesis, 2015. http://ndltd.ncl.edu.tw/handle/53806441331969263004.
Texte intégralChang Gung University
Department of Computer Science and Information Engineering
103
With the advent of the global village, language learning has become an important issue, and proficiency in foreign languages is now an indicator of competitiveness, with listening and speaking skills regarded as especially important. In this study, we establish a method for creating speech-and-text synchronized audiobooks using speech recognition and cloud text-to-speech technology. With this method, users can turn their own arbitrary articles into learning materials for the shadowing technique. The materials are word-level speech-and-text synchronized audiobooks, built from timed-text files produced from the user's articles and the corresponding speech files. Our speech-text synchronization tool, named CGUAlign, lets users create these timed-text files easily: it uses Python to wrap the well-known speech recognition toolkit HTK (Hidden Markov Model Toolkit), and, given a text file and the corresponding speech file obtained from cloud text-to-speech technology, it produces the timed-text file that synchronizes speech and text. We also build a simple website in JavaScript that uses the timed-text files for CALL (Computer-Assisted Language Learning) purposes: users can browse the synchronized audiobooks to practice the shadowing technique, and the site also provides a dictionary function in support of CALL.
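As a minimal illustrative sketch (not the thesis's CGUAlign code, and assuming word-level timings are already available, e.g. from an HTK forced-alignment pass), the timed-text step can be reduced to writing those alignments into a WebVTT file that a web page can use to highlight each word in sync with the audio; the words, timings, and file name below are hypothetical.

# Minimal sketch: turn word-level alignments (word, start_sec, end_sec)
# into a WebVTT "timed-text" file for word-by-word highlighting on a web page.

def format_ts(seconds: float) -> str:
    # Format seconds as a WebVTT timestamp HH:MM:SS.mmm.
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def write_webvtt(word_alignments, path):
    # word_alignments: iterable of (word, start_sec, end_sec) tuples.
    with open(path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for i, (word, start, end) in enumerate(word_alignments, 1):
            f.write(f"{i}\n{format_ts(start)} --> {format_ts(end)}\n{word}\n\n")

if __name__ == "__main__":
    # Hypothetical alignment for a short TTS utterance.
    alignment = [("Language", 0.00, 0.42), ("learning", 0.42, 0.88), ("matters", 0.88, 1.35)]
    write_webvtt(alignment, "sample.vtt")

A browser player can then load the audio and the .vtt file together and highlight the cue whose time range contains the current playback position.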
« Text-independent bilingual speaker verification system ». 2003. http://library.cuhk.edu.hk/record=b5891732.
Texte intégralThesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 96-102).
Abstracts in English and Chinese.
Abstract --- p.i
Acknowledgement --- p.iv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Biometrics --- p.2
Chapter 1.2 --- Speaker Verification --- p.3
Chapter 1.3 --- Overview of Speaker Verification Systems --- p.4
Chapter 1.4 --- Text Dependency --- p.4
Chapter 1.4.1 --- Text-Dependent Speaker Verification --- p.5
Chapter 1.4.2 --- GMM-based Speaker Verification --- p.6
Chapter 1.5 --- Language Dependency --- p.6
Chapter 1.6 --- Normalization Techniques --- p.7
Chapter 1.7 --- Objectives of the Thesis --- p.8
Chapter 1.8 --- Thesis Organization --- p.8
Chapter 2 --- Background --- p.10
Chapter 2.1 --- Background Information --- p.11
Chapter 2.1.1 --- Speech Signal Acquisition --- p.11
Chapter 2.1.2 --- Speech Processing --- p.11
Chapter 2.1.3 --- Engineering Model of Speech Signal --- p.13
Chapter 2.1.4 --- Speaker Information in the Speech Signal --- p.14
Chapter 2.1.5 --- Feature Parameters --- p.15
Chapter 2.1.5.1 --- Mel-Frequency Cepstral Coefficients --- p.16
Chapter 2.1.5.2 --- Linear Predictive Coding Derived Cepstral Coefficients --- p.18
Chapter 2.1.5.3 --- Energy Measures --- p.20
Chapter 2.1.5.4 --- Derivatives of Cepstral Coefficients --- p.21
Chapter 2.1.6 --- Evaluating Speaker Verification Systems --- p.22
Chapter 2.2 --- Common Techniques --- p.24
Chapter 2.2.1 --- Template Model Matching Methods --- p.25
Chapter 2.2.2 --- Statistical Model Methods --- p.26
Chapter 2.2.2.1 --- HMM Modeling Technique --- p.27
Chapter 2.2.2.2 --- GMM Modeling Techniques --- p.30
Chapter 2.2.2.3 --- Gaussian Mixture Model --- p.31
Chapter 2.2.2.4 --- The Advantages of GMM --- p.32
Chapter 2.2.3 --- Likelihood Scoring --- p.32
Chapter 2.2.4 --- General Approach to Decision Making --- p.35
Chapter 2.2.5 --- Cohort Normalization --- p.35
Chapter 2.2.5.1 --- Probability Score Normalization --- p.36
Chapter 2.2.5.2 --- Cohort Selection --- p.37
Chapter 2.3 --- Chapter Summary --- p.38
Chapter 3 --- Experimental Corpora --- p.39
Chapter 3.1 --- The YOHO Corpus --- p.39
Chapter 3.1.1 --- Design of the YOHO Corpus --- p.39
Chapter 3.1.2 --- Data Collection Process of the YOHO Corpus --- p.40
Chapter 3.1.3 --- Experimentation with the YOHO Corpus --- p.41
Chapter 3.2 --- CUHK Bilingual Speaker Verification Corpus --- p.42
Chapter 3.2.1 --- Design of the CUBS Corpus --- p.42
Chapter 3.2.2 --- Data Collection Process for the CUBS Corpus --- p.44
Chapter 3.3 --- Chapter Summary --- p.46
Chapter 4 --- Text-Dependent Speaker Verification --- p.47
Chapter 4.1 --- Front-End Processing on the YOHO Corpus --- p.48
Chapter 4.2 --- Cohort Normalization Setup --- p.50
Chapter 4.3 --- HMM-based Speaker Verification Experiments --- p.53
Chapter 4.3.1 --- Subword HMM Models --- p.53
Chapter 4.3.2 --- Experimental Results --- p.55
Chapter 4.3.2.1 --- Comparison of Feature Representations --- p.55
Chapter 4.3.2.2 --- Effect of Cohort Normalization --- p.58
Chapter 4.4 --- Experiments on GMM-based Speaker Verification --- p.61
Chapter 4.4.1 --- Experimental Setup --- p.61
Chapter 4.4.2 --- The number of Gaussian Mixture Components --- p.62
Chapter 4.4.3 --- The Effect of Cohort Normalization --- p.64
Chapter 4.4.4 --- Comparison of HMM and GMM --- p.65
Chapter 4.5 --- Comparison with Previous Systems --- p.67
Chapter 4.6 --- Chapter Summary --- p.70
Chapter 5 --- Language- and Text-Independent Speaker Verification --- p.71
Chapter 5.1 --- Front-End Processing of the CUBS --- p.72
Chapter 5.2 --- Language- and Text-Independent Speaker Modeling --- p.73
Chapter 5.3 --- Cohort Normalization --- p.74
Chapter 5.4 --- Experimental Results and Analysis --- p.75
Chapter 5.4.1 --- Number of Gaussian Mixture Components --- p.78
Chapter 5.4.2 --- The Cohort Normalization Effect --- p.79
Chapter 5.4.3 --- Language Dependency --- p.80
Chapter 5.4.4 --- Language-Independency --- p.83
Chapter 5.5 --- Chapter Summary --- p.88
Chapter 6 --- Conclusions and Future Work --- p.90
Chapter 6.1 --- Summary --- p.90
Chapter 6.1.1 --- Feature Comparison --- p.91
Chapter 6.1.2 --- HMM Modeling --- p.91
Chapter 6.1.3 --- GMM Modeling --- p.91
Chapter 6.1.4 --- Cohort Normalization --- p.92
Chapter 6.1.5 --- Language Dependency --- p.92
Chapter 6.2 --- Future Work --- p.93
Chapter 6.2.1 --- Feature Parameters --- p.93
Chapter 6.2.2 --- Model Quality --- p.93
Chapter 6.2.2.1 --- Variance Flooring --- p.93
Chapter 6.2.2.2 --- Silence Detection --- p.94
Chapter 6.2.3 --- Conversational Speaker Verification --- p.95
Bibliography --- p.102
Williams, Kyle. « Learning to Read Bushman : Automatic Handwriting Recognition for Bushman Languages ». Thesis, 2012. http://pubs.cs.uct.ac.za/archive/00000791/.
Texte intégralWarren, Jolan, et 王杰龍. « The Effects of Automatic Speech Recognition and Text-to-speech Software on EFL Students' Pronunciation ». Thesis, 2012. http://ndltd.ncl.edu.tw/handle/04697441114645545894.
Texte intégralNational Kaohsiung Normal University
Department of English
100
The purpose of this study is to evaluate the effects of automatic speech recognition (ASR) and text-to-speech (TTS) software on EFL students' pronunciation ability. Participants were 48 first- and second-year non-English majors from National Kaohsiung Normal University. Their ability to produce segmental sounds (14 vowels) and their suprasegmental ability were measured with a pre-test and a post-test scored by two raters. Participants were assigned to a control group, a TTS group, or an ASR group, and used the ASR or TTS software over 6 weeks to self-correct their pronunciation. Their attitudes towards ASR and TTS software were also measured via a questionnaire and open-ended questions. The data analysis showed that using ASR software for pronunciation practice produced mixed improvements in participants' pronunciation ability, none of which reached significance. Using TTS software produced improvements in all areas of pronunciation ability, only one of which was significant. Despite the lack of a significant difference between groups, TTS software yielded a larger overall gain in pronunciation ability, and participants in the TTS group held a much more positive view of TTS software for pronunciation practice than participants in the ASR group did of ASR software. The study had several limitations: the participants were non-English majors from a public university in Taiwan, inter-rater reliability was low for one section of the pre-test and post-test, treatment was restricted to just 6 weeks, the focus was confined to English vowels, and there was a high participant drop-out rate. The results suggest that TTS software shows promise as a tool for creating custom practice material, and that pronunciation practice software may be best implemented when it supplements teacher-led pronunciation classes and can provide students with a pronunciation model to listen to before practicing. To investigate further the effects of ASR and TTS software on EFL students' pronunciation and possible applications of the software, it is recommended that future research involve a longer treatment period, compare non-English majors with English majors, and examine a smaller set of sounds or sounds verified as problematic.
Rato, João Pedro Cordeiro. « Conversação homem-máquina. Caracterização e avaliação do estado actual das soluções de speech recognition, speech synthesis e sistemas de conversação homem-máquina ». Master's thesis, 2016. http://hdl.handle.net/10400.8/2375.
Texte intégralWächter, Thomas. « Semi-automated Ontology Generation for Biocuration and Semantic Search ». Doctoral thesis, 2010. https://tud.qucosa.de/id/qucosa%3A25496.
Texte intégral