
Dissertations / Theses on the topic 'Automatic Text Recognition (ATR)'

Consult the top 47 dissertations / theses for your research on the topic 'Automatic Text Recognition (ATR).'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Chiffoleau, Floriane. "Understanding the automatic text recognition process : model training, ground truth and prediction errors." Electronic Thesis or Diss., Le Mans, 2024. http://www.theses.fr/2024LEMA3002.

Full text
Abstract:
This thesis aims to identify what a text recognition model learns during training, through an examination of the content of its ground truth and of its prediction errors. The main intent is to improve our knowledge of how a neural network operates, with experiments focused on typewritten documents. The methods used concentrate on a thorough exploration of the training data, the observation of the models' prediction errors, and the correlation between the two. A first hypothesis, based on the influence of the lexicon, was inconclusive. However, it steered the observations towards a new, infralexical level of study: the n-grams. The distribution of n-grams in the training data was analysed and subsequently compared to that of the n-grams retrieved from the prediction errors. Promising results led to further exploration, while upgrading from a single-language to a multilingual model. Conclusive results allowed me to infer that n-grams could indeed be a valid explanation of recognition performance.
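
The n-gram comparison described in this abstract can be illustrated with a small sketch (a hypothetical illustration, not the author's code): it builds character n-gram frequency distributions for the ground-truth training lines and for the erroneous predictions, then lists the error n-grams that are rare in training.

    from collections import Counter

    def char_ngrams(text, n=3):
        # All character n-grams of a string.
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def ngram_distribution(lines, n=3):
        # Relative frequency of character n-grams over a list of lines.
        counts = Counter()
        for line in lines:
            counts.update(char_ngrams(line, n))
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}

    # Toy stand-ins for ground-truth transcriptions and mispredicted spans.
    train_lines = ["the quick brown fox", "jumps over the lazy dog"]
    error_lines = ["the quikc brown"]

    train_dist = ngram_distribution(train_lines)
    error_dist = ngram_distribution(error_lines)

    # n-grams that occur in the errors but are rare or absent in training.
    suspects = {g: p for g, p in error_dist.items() if train_dist.get(g, 0.0) < 0.01}
    print(sorted(suspects, key=suspects.get, reverse=True))
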
2

Gregori, Alessandro <1975>. "Automatic Speech Recognition (ASR) and NMT for Interlingual and Intralingual Communication: Speech to Text Technology for Live Subtitling and Accessibility." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amsdottorato.unibo.it/9931/1/Gregori_Alessandro_tesi.pdf.

Full text
Abstract:
Considering the increasing demand for institutional translation and the multilingualism of international organizations, the application of Artificial Intelligence (AI) technologies in multilingual communication and for the purposes of accessibility has become an important element in the production of translation and interpreting services (Zetzsche, 2019). In particular, the widespread use of Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) technology represents a recent development in the attempt to satisfy the increasing demand for interinstitutional, multilingual communication at the inter-governmental level (Maslias, 2017). Recently, researchers have been calling for a universalistic view of media and conference accessibility (Greco, 2016). The application of ASR, combined with NMT, may allow for the breaking down of communication barriers at European institutional conferences where multilingualism represents a fundamental pillar (Jopek Bosiacka, 2013). In addition to representing a so-called disruptive technology (Accipio Consulting, 2006), ASR technology may facilitate communication with non-hearing users (Lewis, 2015). Thanks to ASR, it is possible to guarantee content accessibility for a non-hearing audience via subtitles at institutionally held conferences or speeches. Hence the need for analysing and evaluating ASR output: a quantitative approach is adopted to evaluate subtitles, with the objective of assessing their accuracy (Romero-Fresco, 2011). A database of F.A.O.'s and other international institutions' English-language speeches and conferences on climate change is taken into consideration. The statistical approach is based on the WER and NER models (Romero-Fresco, 2016) and on an adapted version of them. The ASR software solutions implemented in the study are VoxSigma by Vocapia Research and the Google Speech Recognition engine. After a taxonomic scheme has been defined, Native and Non-Native subtitles are compared to gold-standard transcriptions. The intralingual and interlingual output generated by NMT is specifically analysed and evaluated via the NTR model in terms of accuracy and accessibility.
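
The WER part of the accuracy assessment mentioned above boils down to a word-level edit distance; the NER and NTR models additionally weight errors by severity, which is not shown here. A minimal WER sketch with invented example strings:

    def wer(reference, hypothesis):
        # Word error rate: (substitutions + deletions + insertions) / reference length.
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Hypothetical gold transcription vs. ASR-generated subtitle.
    print(wer("global temperatures are rising fast", "global temperature are rising"))
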
3

Jansson, Annika. "Tal till text för relevant metadatataggning av ljudarkiv hos Sveriges Radio." Thesis, KTH, Medieteknik och interaktionsdesign, MID, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-169464.

Full text
Abstract:
Speech to text for relevant metadata tagging of the audio archive at Sveriges Radio. In the years 2009-2013, Sveriges Radio digitized its programme archive. Sveriges Radio's ambition is that more material from the 175 000 hours of radio it broadcasts every year should be archived. Making all material searchable is a relatively time-consuming process, and it is far from certain that the quality of the data is equally high for all items.         The question addressed in this thesis is: what technical solutions exist for developing a system for Sveriges Radio for automatic recognition of Swedish speech to text, based on its audio archive?         Systems for speech to text have been analyzed and examined to give Sveriges Radio a current overview of the field.         Interviews with other, similar organizations working in the field were conducted to see how far they have come in their development on the subject.         A literature study of recent research reports on speech recognition was carried out to compare which system would best match Sveriges Radio's needs and requirements.         What Sveriges Radio should concentrate on first, in order to build an ASR (Automatic Speech Recognition) system, is transcribing its audio material. There are three alternatives: transcribe in-house, by selecting a number of programmes with different orientations to cover as wide a range of content as possible, preferably with different speakers so that speaker recognition can be developed later (the easiest way is to let the professionals who enter the features/programmes into the system do it); start a project similar to what the BBC has done and enlist the help of the public; or buy transcription as a service.         My advice is to continue evaluating the Kaldi system, because it has evolved significantly in recent years and seems relatively easy to extend. The open-source solution that Lingsoft uses is also interesting to study further.
4

Gong, XiangQi. "Election markup language (EML) based tele-voting system." Thesis, University of the Western Cape, 2009. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_5841_1350999620.

Full text
Abstract:
Elections are one of the most fundamental activities of a democratic society. As in any other aspect of life, developments in technology have resulted in changes to the voting procedure, from traditional paper-based voting to voting by electronic means, or e-voting. E-voting involves different forms of electronic means such as voting machines, voting via the Internet, telephone, SMS and digital interactive television. This thesis concerns voting by telephone, or televoting. It starts by giving a brief overview and evaluation of various models and technologies that are implemented within such systems. The aspects of televoting that have been investigated are the technologies that provide a voice interface to the voter and conduct the voting process, namely the Election Markup Language (EML), Automated Speech Recognition (ASR) and Text-to-Speech (TTS).
5

Wager, Nicholas. "Automatic Target Recognition (ATR) : background statistics and the detection of targets in clutter /." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 1994. http://handle.dtic.mil/100.2/ADA293062.

Full text
Abstract:
Thesis (M.S. in Applied Physics), Naval Postgraduate School, December 1994. Thesis advisor(s): David L. Fried, David Scott Davis. Includes bibliographical references. Also available online.
6

Horvath, Matthew Steven. "Performance Prediction of Quantization Based Automatic Target Recognition Algorithms." Wright State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=wright1452086412.

Full text
7

Jobbins, Amanda Caryn. "The contribution of semantics to automatic text processing." Thesis, Nottingham Trent University, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.302405.

Full text
8

Namane, Abderrahmane. "Degraded printed text and handwritten recognition methods : Application to automatic bank check recognition." Université Louis Pasteur (Strasbourg) (1971-2008), 2007. http://www.theses.fr/2007STR13048.

Full text
Abstract:
Character recognition is a significant stage in all document recognition systems. It is considered an assignment and decision problem for a given character, and is an active research subject in many disciplines. This thesis is mainly related to the recognition of degraded printed and handwritten characters. New solutions were brought to the field of document image analysis (DIA). The first solution concerns the development of two recognition methods for handwritten numeral characters, namely a method based on the Fourier-Mellin transform (FMT) and the self-organizing map (SOM), and a parallel combination of HMM-based classifiers using a new projection technique for parameter extraction. The second solution is a new holistic recognition method for handwritten words applied to the French legal amount. The third solution presents two recognition methods based on neural networks for degraded printed characters, applied to the Algerian postal check. The first is based on sequential combination, and the second uses a serial combination based mainly on the introduction of a relative distance for measuring the quality of the degraded character. During the development of this work, preprocessing methods were also developed, in particular handwritten numeral slant correction and the detection of the handwritten word's central zone and its slope.
9

Bayik, Tuba Makbule. "Automatic Target Recognition In Infrared Imagery." Master's thesis, METU, 2004. http://etd.lib.metu.edu.tr/upload/2/12605388/index.pdf.

Full text
Abstract:
The task of automatically recognizing targets in IR imagery has a history of approximately 25 years of research and development. ATR is an application of pattern recognition and scene analysis in the defense industry, and it is still one of the challenging problems. This thesis may be viewed as an exploratory study of the ATR problem, with encouraging recognition algorithms implemented in the area. The examined algorithms are among the solutions to the ATR problem which are reported to have good performance in the literature. Throughout the study, PCA, subspace LDA, ICA, the nearest mean classifier, the K nearest neighbors classifier, the nearest neighbor classifier, and the LVQ classifier are implemented and their performances are compared in terms of recognition rate. According to the simulation results, the system which uses ICA as the feature extractor and LVQ as the classifier performs best. The good performance of this system is due to the higher-order statistics of the data and the success of LVQ in modifying the decision boundaries.
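
As a rough illustration of the kind of pipeline compared in this thesis (a feature extractor followed by a simple classifier, evaluated by recognition rate), here is a scikit-learn sketch on synthetic data standing in for the IR target chips; k-NN stands in for LVQ, which scikit-learn does not provide.

    import numpy as np
    from sklearn.decomposition import FastICA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic stand-in for flattened IR target chips (no real imagery here).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 64))      # 300 chips, 8x8 pixels flattened
    y = rng.integers(0, 3, size=300)    # 3 hypothetical target classes

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # ICA as the feature extractor, k-NN as the classifier.
    ica = FastICA(n_components=16, random_state=0).fit(X_tr)
    clf = KNeighborsClassifier(n_neighbors=5).fit(ica.transform(X_tr), y_tr)
    print("recognition rate:", clf.score(ica.transform(X_te), y_te))
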
10

Bae, Junhyeong. "Adaptive Waveforms for Automatic Target Recognition and Range-Doppler Ambiguity Mitigation in Cognitive Sensor." Diss., The University of Arizona, 2013. http://hdl.handle.net/10150/306942.

Full text
Abstract:
This dissertation shows the performance of adaptive waveforms when applied to two radar applications. One application is automatic target recognition (ATR) and the other is range-Doppler ambiguity mitigation. The adaptive waveforms are implemented via a feedback loop from receiver to transmitter, such that previous radar measurements affect how the adaptive waveforms proceed. For the ATR application, the adaptive transmitter can change the waveform's temporal structure to improve target recognition performance. For the range-Doppler ambiguity mitigation application, the adaptive transmitter can change the pulse repetition frequency (PRF) to mitigate range and Doppler ambiguity. In the ATR application, commercial electromagnetic software is used to create high-fidelity aircraft target signatures. Realistic waveform constraints are applied to show radar performance. The radar equation is incorporated into the waveform design technique and template-based classification is performed. A translation-invariant feature is used for the scenario where the range is known inaccurately. The performance of adaptive waveforms is evaluated not only with a monostatic radar but also with a widely separated MIMO radar. In MIMO radar, multiple transmit waveforms are used, but the spectral leakage caused by the constant-modulus constraint has a minimal interference effect. In the range-Doppler ambiguity mitigation application, particle-filter-based track-before-detect for a single target is extended to track and detect multiple low signal-to-noise ratio (SNR) targets simultaneously. To mitigate ambiguity, multiple PRFs are used and an improved PRF selection technique is implemented via predicted entropy computation when both blind and clutter zones are considered.
11

Abdel-Rahman, Tarek. "Mixture of Factor Analyzers (MoFA) Models for the Design and Analysis of SAR Automatic Target Recognition (ATR) Algorithms." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1500625807524146.

Full text
12

Shou-Chun, Yin 1980. "Speaker adaptation in joint factor analysis based text independent speaker verification." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=100735.

Full text
Abstract:
This thesis presents methods for supervised and unsupervised speaker adaptation of Gaussian mixture speaker models in text-independent speaker verification. The proposed methods are based on an approach which is able to separate speaker and channel variability so that progressive updating of speaker models can be performed while minimizing the influence of the channel variability associated with the adaptation recordings. This approach relies on a joint factor analysis model of intrinsic speaker variability and session variability where inter-session variation is assumed to result primarily from the effects of the transmission channel. These adaptation methods have been evaluated under the adaptation paradigm defined under the NIST 2005 speaker recognition evaluation plan which is based on conversational telephone speech.
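
A bare-bones sketch of the Gaussian mixture speaker models this work builds on: a test utterance is scored by the log-likelihood ratio between a speaker-specific model and a background model (the joint factor analysis channel compensation that is the thesis's actual contribution is omitted). All data below are synthetic placeholders.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    # Hypothetical MFCC-like feature frames for enrolment, background and test audio.
    enrol = rng.normal(0.5, 1.0, size=(500, 20))
    background = rng.normal(0.0, 1.0, size=(2000, 20))
    test = rng.normal(0.5, 1.0, size=(200, 20))

    ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(background)
    spk = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(enrol)

    # Average per-frame log-likelihood ratio; accept the claimed identity above a tuned threshold.
    llr = spk.score(test) - ubm.score(test)
    print("log-likelihood ratio per frame:", llr)
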
13

Sequeira, José Francisco Rodrigues. "Automatic knowledge base construction from unstructured text." Master's thesis, Universidade de Aveiro, 2016. http://hdl.handle.net/10773/17910.

Full text
Abstract:
Master's degree in Computer and Telematics Engineering
Taking into account the overwhelming number of biomedical publications being produced, the effort required for a user to efficiently explore those publications in order to establish relationships between a wide range of concepts is staggering. This dissertation presents GRACE, a web-based platform that provides an advanced graphical exploration interface allowing users to traverse the biomedical domain in order to find explicit and latent associations between annotated biomedical concepts belonging to a variety of semantic types (e.g., Genes, Proteins, Disorders, Procedures and Anatomy). The knowledge base utilized is a collection of MEDLINE articles with English abstracts, annotated with biomedical concepts; these annotations are stored in an efficient data store that allows for complex queries and high-performance data delivery. Concept relationships are inferred through statistical analysis, applying association measures to the annotated terms. These processes grant the graphical interface the ability to create, in real time, a data visualization in the form of a graph for exploring these biomedical concept relationships.
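
The abstract does not name the association measures beyond "statistical analysis"; pointwise mutual information is one common choice for co-occurring annotated concepts, sketched here with made-up counts:

    import math

    # Made-up counts: document frequency of each concept and of the pair.
    n_docs = 10000
    count = {"BRCA1": 120, "breast carcinoma": 300}
    co_count = {("BRCA1", "breast carcinoma"): 45}

    def pmi(a, b):
        # Pointwise mutual information of two annotated concepts.
        p_a = count[a] / n_docs
        p_b = count[b] / n_docs
        p_ab = co_count[(a, b)] / n_docs
        return math.log2(p_ab / (p_a * p_b))

    print(pmi("BRCA1", "breast carcinoma"))  # > 0 suggests an association
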
14

Alamri, Safi S. "Text-independent, automatic speaker recognition system evaluation with males speaking both Arabic and English." Thesis, University of Colorado at Denver, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1605087.

Full text
Abstract:

Automatic speaker recognition is an important key to speaker identification in media forensics, and with the increasing mixing of cultures there is an increase in bilingual speakers all around the world. The purpose of this thesis is to compare text-independent samples of one person using two different languages, Arabic and English, against a single-language reference population. The hope is to start a design that may be useful in further developing software that can perform accurate text-independent ASR for bilingual speakers speaking either language against a single-language reference population. This thesis took an Arabic model sample and compared it against samples that were both Arabic and English, using an Arabic reference population, all collected from videos downloaded from the Internet. All of the samples were text-independent and enhanced to optimal performance. The data was run through biometric software called BATVOX 4.1, which utilizes the MFCC and GMM methods of speaker recognition and identification. The result of testing through BATVOX 4.1 was likelihood ratios for each sample, which were evaluated for similarities and differences, trends, and problems that had occurred.
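
BATVOX itself is proprietary, but the MFCC front end that it and most GMM-based speaker recognisers rely on can be approximated with librosa; the file path below is a placeholder.

    import numpy as np
    import librosa

    # Placeholder path; any mono speech recording will do.
    signal, sr = librosa.load("speaker_sample.wav", sr=16000)

    # 13 MFCCs per frame, the classic front end for GMM speaker modelling.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    print(mfcc.shape)             # (13, number_of_frames)
    print(np.mean(mfcc, axis=1))  # per-coefficient means, a crude speaker summary
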

15

Lee, Spencer Jaehoon Gilbert Juan E. "Post-speech-recognition processing in domain-specific text-corpus-based distributed listening system analysis, interpretation and selection of speech recognition results /." Auburn, Ala., 2006. http://repo.lib.auburn.edu/2006%20Summer/Theses/LEE_SPENCER_7.pdf.

Full text
16

Reynolds, Douglas A. "A Gaussian mixture modeling approach to text-independent speaker identification." Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/16903.

Full text
17

AlKhateeb, Jawad H. Y. "Word based off-line handwritten Arabic classification and recognition. Design of automatic recognition system for large vocabulary offline handwritten Arabic words using machine learning approaches." Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/4440.

Full text
Abstract:
The design of a machine which reads unconstrained words still remains an unsolved problem. For example, automatic interpretation of handwritten documents by a computer is still under research. Most systems attempt to segment words into letters and read words one character at a time. However, segmenting handwritten words is very difficult, so, to avoid this, words are treated as a whole. This research investigates a number of features computed from whole words for the recognition of handwritten words in particular. Arabic text classification and recognition is a complicated process compared to Latin and Chinese text recognition systems, due to the cursive nature of Arabic text. The work presented in this thesis is proposed for word-based recognition of handwritten Arabic scripts, and is divided into three main stages to provide a recognition system. The first stage is pre-processing, which applies efficient pre-processing methods that are essential for automatic recognition of handwritten documents. In this stage, techniques for detecting the baseline and segmenting words in handwritten Arabic text are presented. Connected components are then extracted, and the distances between different components are analyzed. The statistical distribution of these distances is obtained to determine an optimal threshold for word segmentation. The second stage is feature extraction, which makes use of the normalized images to extract features that are essential in recognizing the images. Various methods of feature extraction are implemented and examined. The third and final stage is classification. Various classifiers are used, such as the K nearest neighbour classifier (k-NN), the neural network classifier (NN), Hidden Markov Models (HMMs), and the Dynamic Bayesian Network (DBN). To test this concept, the particular pattern recognition problem studied is the classification of 32492 words using the IFN/ENIT database. The results were promising and very encouraging in terms of improved baseline detection and word segmentation for further recognition. Moreover, several feature subsets were examined and a best recognition performance of 81.5% was achieved.
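
The word-segmentation step described here derives a threshold from the statistical distribution of inter-component distances; the abstract does not say how the threshold is chosen, so the sketch below uses Otsu's method on a toy gap histogram as one plausible choice.

    import numpy as np
    from skimage.filters import threshold_otsu

    # Toy horizontal gaps (pixels) between consecutive connected components on
    # a text line: small gaps separate letters, large gaps separate words.
    gaps = np.array([3, 4, 2, 18, 3, 5, 21, 4, 3, 2, 19, 5])

    threshold = threshold_otsu(gaps)   # one way to split the bimodal distribution
    word_breaks = gaps > threshold     # True where a new word starts
    print(threshold, word_breaks)
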
18

Pisane, Jonathan. "Automatic target recognition using passive bistatic radar signals." Phd thesis, Supélec, 2013. http://tel.archives-ouvertes.fr/tel-00963601.

Full text
Abstract:
We present the design, development, and test of three novel, distinct automatic target recognition (ATR) systems for the recognition of airplanes and, more specifically, non-cooperative airplanes, i.e. airplanes that do not provide information when interrogated, in the framework of passive bistatic radar systems. Passive bistatic radar systems use one or more illuminators of opportunity (already present in the field), with frequencies up to 1 GHz for the transmitter part of the systems considered here, and one or more receivers, deployed by the persons managing the system and not co-located with the transmitters. The sole sources of information are the signal scattered by the airplane and the direct-path signal collected by the receiver, some basic knowledge about the transmitter, and the geometrical bistatic radar configuration. The three distinct ATR systems that we built respectively use the radar images, the bistatic complex radar cross-section (BS-CRCS), and the bistatic radar cross-section (BS-RCS) of the targets. We use data acquired either on scale models of airplanes placed in an anechoic, electromagnetic chamber or on real-size airplanes using a bistatic testbed consisting of a VOR transmitter and a software-defined radio (SDR) receiver, located near Orly airport, France. We describe the radar phenomenology pertinent for the problem at hand, as well as the mathematical underpinnings of the derivation of the bistatic RCS values and of the construction of the radar images. For the classification of the observed targets into pre-defined classes, we use either extremely randomized trees or subspace methods. A key feature of our approach is that we break the recognition problem into a set of sub-problems by decomposing the parameter space, which consists of the frequency, the polarization, the aspect angle, and the bistatic angle, into regions. We build one recognizer for each region. We first validate the extra-trees method on the radar images of the MSTAR dataset, featuring ground vehicles. We then test the method on the images of the airplanes constructed from data acquired in the anechoic chamber, achieving a probability of correct recognition up to 0.99. We test the subspace methods on the BS-CRCS and on the BS-RCS of the airplanes extracted from the data acquired in the anechoic chamber, achieving a probability of correct recognition up to 0.98, with variations according to the frequency band, the polarization, the sector of aspect angle, the sector of bistatic angle, and the number of (Tx,Rx) pairs used. The ATR system deployed in the field gives a probability of correct recognition of 0.82, with variations according to the sector of aspect angle and the sector of bistatic angle.
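
A key design choice here is one recognizer per region of the decomposed parameter space; the sketch below illustrates that routing idea with scikit-learn's extremely randomized trees on synthetic stand-ins for the BS-RCS features and bistatic-angle sectors.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(600, 32))         # synthetic BS-RCS feature vectors
    y = rng.integers(0, 4, size=600)       # 4 hypothetical airplane classes
    sector = rng.integers(0, 3, size=600)  # bistatic-angle sector of each sample

    # Train one extremely randomized trees recognizer per sector.
    recognizers = {s: ExtraTreesClassifier(n_estimators=100, random_state=0)
                      .fit(X[sector == s], y[sector == s])
                   for s in np.unique(sector)}

    # At test time, a sample is routed to the recognizer of its sector.
    sample, s = X[0], sector[0]
    print(recognizers[s].predict(sample.reshape(1, -1)))
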
19

Millard, Benjamin J. "Oral Proficiency Assessment of French Using an Elicited Imitation Test and Automatic Speech Recognition." BYU ScholarsArchive, 2011. https://scholarsarchive.byu.edu/etd/2690.

Full text
Abstract:
Testing oral proficiency is an important but often neglected part of the foreign language classroom. Currently accepted methods of testing oral proficiency are time-consuming and expensive. Some work has been done to test and implement new assessment methods, but it has focused primarily on English or Spanish (Graham et al. 2008). In this thesis, I demonstrate that the processes established for English and Spanish elicited imitation (EI) testing are relevant to French EI testing. First, I document the development, implementation and evaluation of an EI test to assess French oral proficiency. I also detail the incorporation of automatic speech recognition to score French EI items. Last, I substantiate with statistical analyses that carefully engineered, automatically scored French EI items correlate to a high degree with French OPI scores.
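
The thesis's actual scoring of elicited imitation items is more sophisticated, but the core idea of automatically scoring an EI response against its prompt can be sketched as a simple word-overlap measure over the ASR transcript (the strings below are invented examples):

    def score_ei_item(prompt, asr_transcript):
        # Crude elicited-imitation score: fraction of prompt words the ASR heard.
        prompt_words = prompt.lower().split()
        heard = set(asr_transcript.lower().split())
        return sum(w in heard for w in prompt_words) / len(prompt_words)

    print(score_ei_item("je voudrais un café s'il vous plaît",
                        "je voudrais café il vous plaît"))
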
20

AlKhateeb, Jawad Hasan Yasin. "Word based off-line handwritten Arabic classification and recognition : design of automatic recognition system for large vocabulary offline handwritten Arabic words using machine learning approaches." Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/4440.

Full text
21

Ogun, Sewade. "Generating diverse synthetic data for ASR training data augmentation." Electronic Thesis or Diss., Université de Lorraine, 2024. http://www.theses.fr/2024LORR0116.

Full text
Abstract:
In the last two decades, the error rate of automatic speech recognition (ASR) systems has drastically dropped, making them more useful in real-world applications. This improvement can be attributed to several factors including new architectures using deep learning techniques, new training algorithms, large and diverse training datasets, and data augmentation. In particular, the large-scale training datasets have been pivotal to learning robust speech representations for ASR. Their large size allows them to effectively cover the inherent diversity in speech, in terms of speaker voice, speaking rate, pitch, reverberation, and noise. However, the size and diversity of datasets typically found in high-resourced languages are not available in medium- and low-resourced languages and in domains with specialised vocabulary like the medical domain. Therefore, the popular method to increase dataset diversity is through data augmentation. With the recent increase in the naturalness and quality of synthetic data that can be generated by text-to-speech (TTS) and voice conversion (VC) systems, these systems have also become viable options for ASR data augmentation. However, several problems limit their application. First, TTS/VC systems require high-quality speech data for training. Hence, we develop a method of dataset curation from an ASR-designed corpus for training a TTS system. This method leverages the increasing accuracy of deep-learning-based, non-intrusive quality estimators to filter high-quality samples. We explore filtering the ASR dataset at different thresholds to balance the size of the dataset, number of speakers, and quality. With this method, we create a high-quality multi-speaker dataset which is comparable to LibriTTS in quality. Second, the data generation process needs to be controllable to generate diverse TTS/VC data with specific attributes. Previous TTS/VC systems either condition the system on the speaker embedding alone or use discriminative models to learn the speech variabilities. In our approach, we design an improved flow-based architecture that learns the distribution of different speech variables. We find that our modifications significantly increase the diversity and naturalness of the generated utterances over a GlowTTS baseline, while being controllable. Lastly, we evaluated the significance of generating diverse TTS and VC data for augmenting ASR training data. As opposed to naively generating the TTS/VC data, we independently examined different approaches such as sentence selection methods and increasing the diversity of speakers, phoneme duration, and pitch contours, in addition to systematically increasing the environmental conditions of the generated data. Our results show that TTS/VC augmentation holds promise in increasing ASR performance in low- and medium-data regimes. In conclusion, our experiments provide insight into the variabilities that are particularly important for ASR, and reveal a systematic approach to ASR data augmentation using synthetic data
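
The dataset-curation step described above keeps only utterances whose predicted quality clears a threshold. A minimal sketch, with predict_mos as a stand-in for any deep-learning, non-intrusive quality estimator (it is not a real library call) and invented file names:

    def predict_mos(wav_path):
        # Placeholder for a non-intrusive quality estimator (e.g. a trained
        # MOS-prediction network); here it just returns a dummy score.
        return 4.0

    def curate(utterances, threshold=3.8):
        # Keep only recordings judged clean enough for TTS training.
        return [u for u in utterances if predict_mos(u["wav"]) >= threshold]

    corpus = [{"wav": "spk1/utt001.wav", "text": "hello"},
              {"wav": "spk2/utt042.wav", "text": "world"}]
    print(curate(corpus))
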
22

Hon, Wing-kai. "On the construction and application of compressed text indexes." Thesis, The University of Hong Kong, 2004. http://sunzi.lib.hku.hk/hkuto/record/B31059739.

Full text
23

Hon, Wing-kai, and 韓永楷. "On the construction and application of compressed text indexes." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31059739.

Full text
24

Zhu, Winstead Xingran. "Hotspot Detection for Automatic Podcast Trailer Generation." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-444887.

Full text
Abstract:
With podcasts being a fast-growing audio-only form of media, an effective way of promoting different podcast shows becomes more and more vital to all the stakeholders concerned, including the podcast creators, the podcast streaming platforms, and the podcast listeners. This thesis investigates the relatively little-studied topic of automatic podcast trailer generation, with the purpose of enhancing the overall visibility and publicity of different podcast contents and generating more user engagement in podcast listening. This thesis takes a hotspot-based approach, by specifically defining the vague concept of “hotspot” and designing appropriate methods for hotspot detection. Different methods are analyzed and compared, and the best methods are selected. The selected methods are then used to construct an automatic podcast trailer generation system, which consists of four major components and one schema to coordinate the components. The system can take a random podcast episode audio as input and generate a roughly one-minute-long trailer for it. This thesis also proposes two human-based podcast trailer evaluation approaches, and the evaluation results show that the proposed system outperforms the baseline by a large margin and achieves promising results in terms of both aesthetics and functionality.
25

McMurtry, William F. "Information Retrieval for Call Center Quality Assurance." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587036885211228.

Full text
26

Kullmann, Emelie. "Speech to Text for Swedish using KALDI." Thesis, KTH, Optimeringslära och systemteori, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189890.

Full text
Abstract:
The field of speech recognition has during the last decade left the research stage and found its way onto the public market. Most computers and mobile phones sold today support dictation and transcription in a number of chosen languages; Swedish is often not one of them. In this thesis, which was carried out on behalf of Sveriges Radio, an Automatic Speech Recognition model for Swedish is trained and its performance evaluated. The model is built using the open-source toolkit Kaldi. Two approaches to training the acoustic part of the model are investigated: first, using Hidden Markov Models and Gaussian Mixture Models, and second, using Hidden Markov Models and Deep Neural Networks. The latter approach, using deep neural networks, is found to achieve better performance in terms of Word Error Rate.
27

Nguyen, Chu Duc. "Localization and quality enhancement for automatic recognition of vehicle license plates in video sequences." Thesis, Ecully, Ecole centrale de Lyon, 2011. http://www.theses.fr/2011ECDL0018.

Full text
Abstract:
Automatic reading of vehicle license plates is considered an approach to mass surveillance. Through detection/localization and optical recognition, it allows a vehicle to be identified in images or video sequences. Many applications, such as traffic monitoring, detection of stolen vehicles, tolling, or the management of parking entrances and exits, use this method. Yet, in spite of the important progress made since the appearance of the first prototypes in 1979, with recognition rates that are sometimes impressive thanks to advances in research and sensor technology, the constraints imposed for the operation of such systems limit their scope. Indeed, the optimal use of techniques for localizing and recognizing license plates in operational scenarios requires controlled lighting conditions and limitations on pose, velocity, or simply plate type. Automatic reading of vehicle license plates therefore remains an open research problem. The major contribution of this thesis is threefold. First, a new approach to robust license plate localization in images or image sequences is proposed. Then, the quality of the localized plates is improved with an adaptation of a super-resolution technique. Finally, a unified model of localization and super-resolution is proposed to reduce the time complexity of both approaches combined.
28

Granell, Romero Emilio. "Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing." Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/86137.

Full text
Abstract:
Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among others, the use of human natural languages in Human-Computer Interaction (HCI). Most NLP research tasks can be applied to solving real-world problems. This is the case of natural language recognition and natural language translation, which can be used for building automatic systems for document transcription and document translation. Regarding digitalised handwritten text documents, transcription is used to obtain easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important in historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons. The transcription of historical manuscripts is usually done by paleographers, who are experts on ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task, by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when it presents an error rate low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptably low error rate is crucial to have this NLP technology incorporated into the transcription process. The work described in this thesis is focused on the improvement of the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers to obtain the actual transcription of digitalised historical manuscripts. This problem is faced from three different, but complementary, scenarios:
· Multimodality: The use of HTR systems allows paleographers to speed up the manual transcription process, since they are able to correct a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible and an iterative process can be used in order to refine the final hypothesis.
· Interactivity: The use of assistive technologies in the transcription process allows one to reduce the time and human effort required to obtain the actual transcription, given that the assistive system and the paleographer cooperate to generate a perfect transcription. Multimodal feedback can be used to provide the assistive system with additional sources of information, using signals that represent the same sequence of words to transcribe (e.g. a text image, and the speech of the dictation of the contents of this text image), or that represent just a word or character to correct (e.g. an on-line handwritten word).
· Crowdsourcing: Open distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the paleographer supervision effort may be dramatically reduced. Multimodal combination allows one to use the speech dictation of handwritten text lines in a multimodal crowdsourcing platform, where collaborators may provide their speech by using their own mobile device instead of desktop or laptop computers, which makes it possible to recruit more collaborators.
El Processament del Llenguatge Natural (PLN) és un camp de recerca interdisciplinar de les Ciències de la Computació, la Lingüística i el Reconeixement de Patrons que estudia, entre d'altres, l'ús del llenguatge natural humà en la interacció Home-Màquina. La majoria de les tasques de recerca del PLN es poden aplicar per resoldre problemes del món real. Aquest és el cas del reconeixement i la traducció del llenguatge natural, que es poden utilitzar per construir sistemes automàtics per a la transcripció i traducció de documents. Quant als documents manuscrits digitalitzats, la transcripció s'utilitza per facilitar l'accés digital als continguts, ja que la simple digitalització d'imatges només proporciona, en la majoria dels casos, la cerca per imatge i no per continguts lingüístics (paraules clau, expressions, categories sintàctiques o semàntiques). La transcripció és encara més important en el cas dels manuscrits històrics, ja que la majoria d'aquests documents són únics i la preservació del seu contingut és crucial per raons culturals i històriques. La transcripció de manuscrits històrics sol ser realitzada per paleògrafs, els quals són persones expertes en escriptura i vocabulari antics. Recentment, els sistemes de Reconeixement d'Escriptura (RES) s'han convertit en una eina comuna per ajudar els paleògrafs en la seua tasca, la qual proporciona un esborrany de la transcripció que els paleògrafs poden esmenar amb mètodes més o menys sofisticats. Aquest esborrany de transcripció és útil quan presenta una taxa d'error prou reduïda perquè el procés de correcció siga més còmode que una completa transcripció des de zero. Per tant, l'obtenció d'un esborrany de transcripció amb un baixa taxa d'error és crucial perquè aquesta tecnologia del PLN siga incorporada en el procés de transcripció. El treball descrit en aquesta tesi se centra en la millora de l'esborrany de la transcripció ofert per un sistema RES, amb l'objectiu de reduir l'esforç realitzat pels paleògrafs per obtenir la transcripció de manuscrits històrics digitalitzats. Aquest problema s'enfronta a partir de tres escenaris diferents, però complementaris: · Multimodalitat: L'ús de sistemes RES permet als paleògrafs accelerar el procés de transcripció manual, ja que són capaços de corregir un esborrany de la transcripció. Una altra alternativa és obtenir l'esborrany de la transcripció dictant el contingut a un sistema de Reconeixement Automàtic de la Parla. Quan les dues fonts (imatge i parla) estan disponibles, una combinació multimodal és possible i es pot realitzar un procés iteratiu per refinar la hipòtesi final. · Interactivitat: L'ús de tecnologies assistencials en el procés de transcripció permet reduir el temps i l'esforç humà requerits per obtenir la transcripció real, gràcies a la cooperació entre el sistema assistencial i el paleògraf per obtenir la transcripció perfecta. La realimentació multimodal es pot utilitzar en el sistema assistencial per proporcionar fonts d'informació addicionals amb senyals que representen la mateixa seqüencia de paraules a transcriure (per exemple, una imatge de text, o el senyal de parla del dictat del contingut d'aquesta imatge de text), o senyals que representen només una paraula o caràcter a corregir (per exemple, una paraula manuscrita mitjançant una pantalla tàctil). · Crowdsourcing: La col·laboració distribuïda i oberta sorgeix com una poderosa eina per a la transcripció massiva a un cost relativament baix, ja que l'esforç de supervisió dels paleògrafs pot ser reduït dràsticament. 
La combinació multimodal permet utilitzar el dictat del contingut de línies de text manuscrit en una plataforma de crowdsourcing multimodal, on els col·laboradors poden proporcionar les mostres de parla utilitzant el seu propi dispositiu mòbil en lloc d'utilitzar ordinadors d'escriptori o portàtils, la qual cosa permet ampliar el nombr
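As a rough illustration of the multimodal idea described above (not the thesis's actual algorithm), the following Python sketch combines an HTR hypothesis and an ASR hypothesis word by word, keeping the word whose source reports the higher confidence; the function name and the confidence format are assumptions made for illustration.

```python
# Minimal sketch: word-level combination of an HTR and an ASR hypothesis.
# Each hypothesis is a list of (word, confidence) pairs of equal length;
# real systems would first align hypotheses of different lengths.

def combine_hypotheses(htr_hyp, asr_hyp):
    """Pick, at each position, the word coming from the more confident source."""
    combined = []
    for (w_htr, c_htr), (w_asr, c_asr) in zip(htr_hyp, asr_hyp):
        if w_htr == w_asr:
            combined.append(w_htr)      # both modalities agree
        elif c_htr >= c_asr:
            combined.append(w_htr)      # trust the text image
        else:
            combined.append(w_asr)      # trust the dictated speech
    return combined

htr = [("histoire", 0.91), ("de", 0.97), ("la", 0.60), ("ville", 0.55)]
asr = [("histoire", 0.88), ("de", 0.95), ("la", 0.90), ("ville", 0.80)]
print(" ".join(combine_hypotheses(htr, asr)))  # histoire de la ville
```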
Granell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137
APA, Harvard, Vancouver, ISO, and other styles
29

Santos, André Jerónimo Martins dos. "Automatic and interactive annotation of PDF documents." Master's thesis, Universidade de Aveiro, 2016. http://hdl.handle.net/10773/17886.

Full text
Abstract:
Master's degree in Computer and Telematics Engineering
The accelerated growth of the biomedical literature has led to various efforts to extract and store, in a structured way, the information related to the concepts and relations presented in those texts, providing researchers and clinicians with fast and easy access to knowledge. However, this process of "knowledge curation" is an extremely exhaustive task, and the use of automatic annotation tools based on text mining techniques is becoming more and more common. Even though complete annotation systems already exist and produce high-performance results, they are not widely used by the biomedical community, mainly because of their complexity and also due to some limitations in usability. On the other hand, the PDF has become in recent years one of the most popular formats for publishing and sharing documents, since it can be displayed in exactly the same way regardless of the system or platform on which it is accessed. Most annotation tools were mainly designed to extract information from raw text, yet a large part of the biomedical literature is published and distributed in PDF, and thus information extraction from PDF documents should be a focus point for the biomedical text mining community. The objective of the work described in this document is the extension of the Neji framework to allow the processing of documents in PDF format, and the integration of these features in the Egas platform, allowing a user to simultaneously visualise the original article in PDF format and its extracted text. The improved and developed systems show good performance, both in terms of processing speed and representation of the information, which also contributes to a better user experience. Besides that, they present several advantages for the biomedical community, allowing the direct annotation of PDF articles and simplifying the use and configuration of these annotation systems by researchers.
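To make the annotation step concrete, here is a minimal, self-contained Python sketch of dictionary-based concept tagging over text already extracted from a PDF; it is not Neji's actual implementation, and the lexicon, identifiers and offset format are illustrative assumptions.

```python
import re

# Toy concept lexicon: surface form -> concept identifier (illustrative only).
LEXICON = {
    "BRCA1": "GENE:672",
    "breast cancer": "DISEASE:D001943",
    "tamoxifen": "CHEMICAL:D013629",
}

def annotate(text, lexicon):
    """Return (start, end, surface, concept_id) tuples for every lexicon hit."""
    annotations = []
    for surface, concept_id in lexicon.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), match.group(0), concept_id))
    return sorted(annotations)

extracted_text = "BRCA1 mutations increase breast cancer risk; tamoxifen is often prescribed."
for start, end, surface, cid in annotate(extracted_text, LEXICON):
    print(f"{start:3d}-{end:3d}  {surface:15s} {cid}")
```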
APA, Harvard, Vancouver, ISO, and other styles
30

Catae, Fabricio Shigueru. "Classificação automática de texto por meio de similaridade de palavras: um algoritmo mais eficiente." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/3/3141/tde-06072014-225124/.

Full text
Abstract:
Latent semantic analysis is a natural language processing technique that aims to simplify the task of finding similar words and sentences. Using a vector space model for text representation, it selects the most significant values to reconstruct the space in a smaller dimension. This simplification allows it to generalise models, moving words and texts towards a semantic representation. Thus, it identifies a set of underlying meanings or hidden concepts without prior knowledge of grammar. The goal of this study was to determine the optimal dimensionality of the semantic space in a text classification task. The proposed solution corresponds to a semi-supervised algorithm that applies the nearest-neighbour classification method to known examples and plots the estimated accuracy on a graph. Because this is a very time-consuming process, the vectors are projected onto a space in which the calculation becomes incremental. Since the spaces are isometric, the similarity between documents remains equivalent. This proposal determines the optimal dimension of the semantic space with little effort beyond the time required by traditional latent semantic analysis. The results showed significant gains from adopting the correct number of dimensions.
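The following numpy sketch illustrates the kind of procedure the abstract describes: truncating an SVD of a document-term matrix at several dimensionalities and estimating nearest-neighbour accuracy for each, without the incremental-projection optimisation proposed in the thesis. The toy matrix and labels are assumptions.

```python
import numpy as np

# Toy document-term count matrix (rows: documents, columns: terms) and class labels.
X = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 1, 0],
    [0, 0, 3, 1, 0],
    [0, 1, 2, 2, 0],
    [1, 0, 0, 0, 2],
    [2, 0, 0, 1, 1],
], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

def loo_1nn_accuracy(docs, labels):
    """Leave-one-out 1-nearest-neighbour accuracy with cosine similarity."""
    norms = np.linalg.norm(docs, axis=1) + 1e-12
    sims = (docs @ docs.T) / np.outer(norms, norms)
    np.fill_diagonal(sims, -np.inf)          # exclude the document itself
    return np.mean(labels[np.argmax(sims, axis=1)] == labels)

for k in range(1, len(s) + 1):
    docs_k = U[:, :k] * s[:k]                # documents in the k-dimensional LSA space
    print(f"k={k}: LOO 1-NN accuracy = {loo_1nn_accuracy(docs_k, y):.2f}")
```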
APA, Harvard, Vancouver, ISO, and other styles
31

Johansson, Elias. "Separation and Extraction of Valuable Information From Digital Receipts Using Google Cloud Vision OCR." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-88602.

Full text
Abstract:
Automation is a desirable feature in many business areas. Manually extracting information from a physical object such as a receipt is something that can be automated to save resources for a company or a private person. This paper describes the process of combining an existing OCR engine with a purpose-built Python script to extract valuable information from a digital image of a receipt. Values such as VAT, VAT%, date, and total, gross and net cost are considered valuable information. This feature has already been implemented in existing applications; however, the company for which this project was carried out is interested in creating its own version. The project is an experiment to see whether such an application can be implemented using restricted resources, by developing a program that can extract the information mentioned above. The paper walks through the development of the program, as well as the mindset, the findings and the steps taken to overcome the problems encountered along the way. The program achieved a success rate of 86.6% in extracting the most valuable information (total cost, VAT% and date) from a set of 53 receipts originating from 34 separate establishments.
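As an illustration of the post-OCR extraction step, here is a minimal Python sketch that pulls date, VAT percentage and total cost out of OCR output with regular expressions. The OCR text is assumed to have been produced elsewhere (the Google Cloud Vision call is omitted), and the patterns are illustrative assumptions rather than the thesis's actual rules.

```python
import re

# OCR output of a receipt, assumed already obtained from the OCR engine.
ocr_text = """ACME STORE
2019-03-14
Coffee          25.00
VAT 12%          3.00
TOTAL           28.00
"""

def extract_fields(text):
    """Pull date, VAT% and total cost out of raw OCR text with simple patterns."""
    fields = {}
    date = re.search(r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b", text)
    vat = re.search(r"VAT\s*(\d{1,2})\s*%", text, flags=re.IGNORECASE)
    total = re.search(r"TOTAL\s*:?\s*(\d+[.,]\d{2})", text, flags=re.IGNORECASE)
    if date:
        fields["date"] = date.group(1)
    if vat:
        fields["vat_percent"] = int(vat.group(1))
    if total:
        fields["total"] = float(total.group(1).replace(",", "."))
    return fields

print(extract_fields(ocr_text))  # {'date': '2019-03-14', 'vat_percent': 12, 'total': 28.0}
```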
APA, Harvard, Vancouver, ISO, and other styles
32

Thompson, Carrie A. "The Development and Validation of a Spanish Elicited imitation Test of Oral Language Proficiency for the Missionary Training Center." BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3602.

Full text
Abstract:
The Missionary Training Center (MTC), affiliated with the Church of Jesus Christ of Latter-day Saints, needs a reliable and cost-effective way to measure the oral language proficiency of missionaries learning Spanish. The MTC needed to measure incoming missionaries' Spanish language proficiency for training and classroom assignment, as well as to provide exit measures of institutional progress. Oral proficiency interviews and semi-direct assessments require highly trained raters, which is costly and time-consuming. The Elicited Imitation (EI) test is a computerized, automated test that measures oral language proficiency by having the participant hear and repeat utterances of varying syllable length in the target language. It is economical and simple to administer and rate. This dissertation outlines the process of creating and scoring an EI test for the MTC. Item Response Theory (IRT) was used to analyze a large bank of EI items, and the 43 best-performing items make up the final version of the MTC Spanish EI test. Questions about which linguistic features (syllable length, grammatical difficulty) contribute to item difficulty were addressed: regression analysis showed that syllable length predicted item difficulty, whereas grammar difficulty did not.
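A small numpy sketch of the kind of regression reported above, with item difficulty regressed on syllable length; the item data are invented for illustration and do not come from the dissertation.

```python
import numpy as np

# Invented EI items: syllable length and an estimated item difficulty for each.
syllables = np.array([6, 8, 10, 12, 14, 16, 18, 20], dtype=float)
difficulty = np.array([-1.8, -1.2, -0.7, -0.1, 0.3, 0.9, 1.4, 1.7])

# Ordinary least-squares fit: difficulty ~ slope * syllables + intercept.
slope, intercept = np.polyfit(syllables, difficulty, deg=1)
r = np.corrcoef(syllables, difficulty)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, Pearson r={r:.3f}")
# A strong positive slope and correlation would support syllable length as a
# predictor of item difficulty, in line with what the dissertation reports.
```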
APA, Harvard, Vancouver, ISO, and other styles
33

Zamora, Martínez Francisco Julián. "Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática." Doctoral thesis, Universitat Politècnica de València, 2012. http://hdl.handle.net/10251/18066.

Full text
Abstract:
Natural language processing is an application area of artificial intelligence, in particular of pattern recognition, that studies, among other things, how to incorporate syntactic information (a language model) about how the words of a given language should be put together, so that recognition/translation systems can decide which is the best "common-sense" hypothesis. It is a very broad area, and this work focuses only on the part related to language modelling and its application to several tasks: sequence recognition with hidden Markov models and statistical machine translation. Specifically, this thesis centres on so-called connectionist language models, that is, language models based on neural networks. The good results of these models in several areas of natural language processing motivated this study. Because of certain computational problems that connectionist language models suffer from, the systems that appear in the literature are built in two completely decoupled stages. In the first stage, a set of feasible hypotheses is found with a standard language model, assuming that this set is representative of the search space in which the best hypothesis lies. In the second stage, the connectionist language model is applied to that set and the hypothesis with the best score is extracted. This procedure is called "rescoring". This scenario motivates the main objectives of this thesis: to propose a technique that can drastically reduce this computational cost while degrading the quality of the solution found as little as possible; to study the effect of integrating connectionist language models into the search process of the proposed tasks; and to propose some modifications of the original model that improve its quality.
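A minimal Python sketch of the two-stage rescoring procedure described above: an n-best list produced by a baseline system is re-ranked by log-linearly combining the baseline score with a second language-model score. The `neural_lm_logprob` function is a stand-in placeholder, not a real model, and the weight is arbitrary.

```python
def neural_lm_logprob(sentence):
    """Placeholder for a connectionist language model score (illustrative only)."""
    words = sentence.split()
    return -0.5 * len(words) - 0.1 * len(set(words))   # dummy log-probability

def rescore_nbest(nbest, lm_weight=0.4):
    """Re-rank (hypothesis, baseline_score) pairs with a log-linear combination."""
    rescored = []
    for hyp, base_score in nbest:
        score = (1.0 - lm_weight) * base_score + lm_weight * neural_lm_logprob(hyp)
        rescored.append((score, hyp))
    return sorted(rescored, reverse=True)

nbest_list = [
    ("the cat sat on the mat", -7.2),
    ("the cat sat in the mat", -7.0),
    ("a cat sat on the mat", -7.5),
]
for score, hyp in rescore_nbest(nbest_list):
    print(f"{score:7.3f}  {hyp}")
```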
Zamora Martínez, FJ. (2012). Aportaciones al modelado conexionista de lenguaje y su aplicación al reconocimiento de secuencias y traducción automática [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/18066
APA, Harvard, Vancouver, ISO, and other styles
34

Alabau, Gonzalvo Vicente. "Multimodal interactive structured prediction." Doctoral thesis, Universitat Politècnica de València, 2014. http://hdl.handle.net/10251/35135.

Full text
Abstract:
This thesis presents scientific contributions to the field of multimodal interactive structured prediction (MISP). The aim of MISP is to reduce the human effort required to supervise an automatic output in an efficient and ergonomic way. Hence, this thesis focuses on the two aspects of MISP systems. The first aspect, which refers to the interactive part of MISP, is the study of strategies for efficient human-computer collaboration to produce error-free outputs. Multimodality, the second aspect, deals with modalities of communication with the computer that are more ergonomic than keyboard and mouse. To begin with, in sequential interaction the user is assumed to supervise the output from left to right, so that errors are corrected in sequential order. We study the problem under the decision theory framework and define an optimum decoding algorithm. The optimum algorithm is compared to the usually applied, standard approach. Experimental results on several tasks suggest that the optimum algorithm is slightly better than the standard algorithm. In contrast to sequential interaction, in active interaction it is the system that decides what should be given to the user for supervision. On the one hand, user supervision can be reduced if the user is required to supervise only the outputs that the system expects to be erroneous. In this respect, we define a strategy that retrieves the outputs with the highest expected error first. Moreover, we prove that this strategy is optimum under certain conditions, which is validated by experimental results. On the other hand, if the goal is to reduce the number of corrections, active interaction works by selecting elements one by one, e.g., words of a given output, to be supervised by the user. For this case, several strategies are compared. Unlike the previous case, the strategy that performs best is to choose the element with the highest confidence, which coincides with the findings of the optimum algorithm for sequential interaction. However, this also suggests that minimizing effort and supervision are contradictory goals. With respect to the multimodality aspect, this thesis delves into techniques to make multimodal systems more robust. To achieve that, multimodal systems are improved by providing contextual information of the application at hand. First, we study how to integrate e-pen interaction in a machine translation task. We contribute to the state of the art by leveraging the information from the source sentence. Several strategies are compared, basically grouped into two approaches: those inspired by word-based translation models, and n-grams generated from a phrase-based system. The experiments show that the former outperforms the latter for this task. Furthermore, the results present remarkable improvements over not using contextual information. Second, similar experiments are conducted on a speech-enabled interface for interactive machine translation. The improvements over the baseline are also noticeable. However, in this case, phrase-based models perform much better than word-based models. We attribute that to the fact that acoustic models are poorer estimations than morphologic models and, thus, they benefit more from the language model. Finally, similar techniques are proposed for the dictation of handwritten documents. The results show that speech and handwriting recognition can be combined in an effective way.
Finally, an evaluation with real users is carried out to compare an interactive machine translation prototype with a post-editing prototype. The results of the study reveal that users are very sensitive to the usability aspects of the user interface. Usability is therefore a crucial aspect to consider in a human evaluation, since it can hide the real benefits of the technology being evaluated. Once usability problems are fixed, the evaluation indicates that users are more inclined to work with the interactive machine translation system than with the post-editing system.
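To make the active-interaction strategies concrete, here is a small Python sketch comparing the two selection rules discussed above: asking the user about the least-confident element versus the most-confident one. The hypothesis and confidence values are invented; this is not the thesis's algorithm.

```python
# Toy output of a structured prediction system: (word, confidence) pairs.
hypothesis = [("the", 0.99), ("cat", 0.95), ("sat", 0.40), ("on", 0.97), ("mat", 0.62)]

def select_least_confident(hyp):
    """Ask the user about the word the system is least sure of."""
    return min(hyp, key=lambda wc: wc[1])

def select_most_confident(hyp, supervised=()):
    """Ask the user about the most confident not-yet-supervised word
    (the rule the thesis found to reduce the number of corrections)."""
    remaining = [wc for wc in hyp if wc[0] not in supervised]
    return max(remaining, key=lambda wc: wc[1])

print("least confident :", select_least_confident(hypothesis))   # ('sat', 0.40)
print("most confident  :", select_most_confident(hypothesis))    # ('the', 0.99)
```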
Alabau Gonzalvo, V. (2014). Multimodal interactive structured prediction [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35135
APA, Harvard, Vancouver, ISO, and other styles
35

Dang, Quoc Bao. "Information spotting in huge repositories of scanned document images." Thesis, La Rochelle, 2018. http://www.theses.fr/2018LAROS024/document.

Full text
Abstract:
This work aims at developing a generic framework able to produce camera-based applications for information spotting in huge repositories of heterogeneous-content document images via local descriptors. The targeted systems take as input a portion of an image acquired as a query, and return the focused portion of the database image that best matches the query. We first propose a set of generic feature descriptors for camera-based document image retrieval and spotting systems. Our proposed descriptors comprise SRIF, PSRIF, DELTRIF and SSKSRIF, which are built from the spatial organisation of the nearest keypoints around a pivot keypoint; all keypoints are extracted from the centroids of connected components. From these keypoints, geometrical features that are invariant to degradation are taken into account to build the descriptors. SRIF and PSRIF are computed from a local set of the m nearest keypoints around a keypoint, while DELTRIF and SSKSRIF combine local shape description without parameters via a Delaunay triangulation formed from a set of keypoints extracted from the document image. Furthermore, we propose a framework to compute the descriptors from the spatial organisation of dedicated keypoints (e.g. SURF, SIFT or ORB) so that they can handle heterogeneous-content camera-based document image retrieval and spotting. In practice, a large-scale indexing system with an enormous number of descriptors places a heavy burden on memory when the descriptors are stored; in addition, their high dimensionality can reduce indexing accuracy. We therefore propose three robust indexing frameworks that can be employed without storing local descriptors in memory, which saves memory and speeds up retrieval by discarding distance validation. The randomized clustering tree indexing inherits from kd-trees, k-means trees and random forests the way K dimensions are selected randomly and combined with the highest-variance dimension at each node of the tree. We also propose a weighted Euclidean distance between two data points, oriented along the highest-variance dimension. The second proposed method is a hashing-based indexing system that employs a single simple hash table for indexing and retrieval without storing database descriptors. Besides, we propose an extended hashing-based method for indexing multiple kinds of features coming from multiple layers of the image. Along with the proposed descriptors and indexing frameworks, we propose a simple, robust way to compute the shape orientation of MSER regions so that they can be combined with dedicated descriptors (e.g. SIFT, SURF, ORB) in a rotation-invariant way. When descriptors are able to capture neighbourhood information around MSER regions, we propose a way to extend MSER regions by increasing the radius of each region; this strategy can also be applied to other detected regions in order to make descriptors more distinctive. Moreover, we employ the extended hashing-based method for indexing multiple kinds of features from multiple layers of images; this system is applied not only to a uniform feature type but also to multiple feature types from separate layers. Finally, in order to assess the performance of our contributions, and given that no public dataset exists for camera-based document image retrieval and spotting systems, we built a new dataset which has been made freely and publicly available to the scientific community. This dataset contains portions of document images acquired via a camera as queries, and is composed of three kinds of information: textual content, graphical content and heterogeneous content.
APA, Harvard, Vancouver, ISO, and other styles
36

Vythelingum, Kévin. "Construction rapide, performante et mutualisée de systèmes de reconnaissance et de synthèse de la parole pour de nouvelles langues." Thesis, Le Mans, 2019. http://www.theses.fr/2019LEMA1035.

Full text
Abstract:
We study in this thesis the joint construction of speech recognition and speech synthesis systems for new languages, with the goals of accuracy and rapid development. The rapid development of voice technologies for new languages drives scientific ambitions and is now considered strategic by industrial players. However, language development is carried out piecemeal by a few research centres, each working on a limited number of languages, even though these technologies share many common points. Our study focuses on building and sharing tools between systems for creating lexicons, learning phonetisation rules and taking advantage of imperfect data. Our contributions concern the selection of relevant data for learning acoustic models, the joint development of phonetisers and pronunciation lexicons for speech recognition and synthesis, and the use of neural models for phonetic transcription from text and from the speech signal. In addition, we present an approach for the automatic detection of phonetic transcription errors in annotated speech databases. This study has shown that it is possible to significantly reduce the amount of data that must be manually annotated when developing new text-to-speech systems, which naturally helps to reduce data collection time when creating new systems. Finally, we study an application case by jointly building a system for recognising and synthesising speech for a new language.
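A minimal Python sketch of one way to flag suspicious phonetic transcriptions, in the spirit of the error-detection idea above: compare the phone sequence stored in the corpus with the output of an independent phonetiser and flag entries whose edit distance exceeds a threshold. The phonetiser here is a toy lookup table and the threshold is arbitrary; both are illustrative assumptions.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two phone sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (pa != pb))
    return dp[-1]

# Toy phonetiser: word -> expected phone sequence (illustrative assumption).
PHONETISER = {"chat": ["S", "a"], "chien": ["S", "j", "e~"]}

# Corpus entries: (word, annotated phone sequence) as found in the database.
corpus = [("chat", ["S", "a"]), ("chien", ["s", "i", "e", "n"])]

THRESHOLD = 2
for word, annotated in corpus:
    reference = PHONETISER.get(word, [])
    distance = edit_distance(annotated, reference)
    status = "SUSPECT" if distance >= THRESHOLD else "ok"
    print(f"{word:6s} distance={distance}  {status}")
```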
APA, Harvard, Vancouver, ISO, and other styles
37

Benammar, Riyadh. "Détection non-supervisée de motifs dans les partitions musicales manuscrites." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSEI112.

Full text
Abstract:
This thesis falls within the scope of data mining applied to early handwritten music scores, and aims at finding frequent melodic or rhythmic motifs, defined as repetitive note sequences with characteristic properties. There are a large number of possible variations of motifs: transpositions, inversions and so-called "mirror" motifs. These motifs allow musicologists to carry out an in-depth analysis of the works of a composer or of a musical style. In a context of exploring large corpora where scores are merely digitised and not transcribed, an automated search for motifs satisfying targeted constraints becomes an essential tool for their study. To achieve the objective of detecting frequent motifs without prior knowledge, we started from images of digitised scores. After pre-processing steps on the image, we exploited and adapted a model for detecting and recognising musical primitives (note heads, stems, and so on) from the Region-Proposal CNN (RPN) family of convolutional neural networks. We then developed a primitive-encoding method to generate a sequence of notes without the complex task of transcribing the entire manuscript work. This sequence was then analysed using the CSMA (Constraint String Mining Algorithm) approach, designed to detect the frequent motifs present in one or more sequences while taking into account constraints on their frequency and length, as well as on the size and number of gaps allowed within the motifs. Gap handling was then studied to work around recognition errors produced by the RPN network, thus avoiding the implementation of a post-correction system for transcription errors. The work was finally validated by the study of musical motifs for composer identification and classification applications.
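A simplified Python sketch of constrained frequent-motif mining over note sequences: it counts contiguous motifs whose length and frequency satisfy user-set constraints. Gap handling, which CSMA supports, is deliberately left out to keep the example short, and the note encoding is an assumption.

```python
from collections import Counter

def frequent_motifs(sequences, min_len=3, max_len=5, min_freq=2):
    """Count contiguous motifs of bounded length and keep the frequent ones."""
    counts = Counter()
    for seq in sequences:
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                counts[tuple(seq[start:start + length])] += 1
    return {motif: freq for motif, freq in counts.items() if freq >= min_freq}

# Note sequences encoded as pitch names (illustrative encoding).
voices = [
    ["C", "D", "E", "C", "D", "E", "F", "G"],
    ["G", "C", "D", "E", "A", "C", "D", "E"],
]

for motif, freq in sorted(frequent_motifs(voices).items(), key=lambda x: -x[1]):
    print(freq, " ".join(motif))
```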
APA, Harvard, Vancouver, ISO, and other styles
38

PAGANO, ALICE. "Testing quality in interlingual respeaking and other methods of interlingual live subtitling." Doctoral thesis, Università degli studi di Genova, 2022. https://hdl.handle.net/11567/1091438.

Full text
Abstract:
Live subtitling (LS) has its foundations in pre-recorded subtitling for the d/Deaf and hard of hearing (SDH) and produces real-time subtitles for live events and programmes. LS implies the transfer from oral into written content (intersemiotic translation) and can be carried out from and to the same language (intralingual) or from one language to another (interlingual), thereby providing full accessibility for all and combining SDH with the need to guarantee multilingual access as well. Interlingual live subtitling (from now on referred to as ILS) is currently achieved with different methods: the focus here is placed on interlingual respeaking as one of the currently used methods of LS, also referred to in this work as speech-to-text interpreting (STTI), which has triggered growing interest in the Italian industry over the past years. This doctoral thesis intends to provide a wider picture of the literature and research on intralingual and interlingual respeaking to date, with particular emphasis on the current situation of this practice in Italy. The aim of the research was to explore different ILS methods through their strengths and weaknesses, in an attempt to inform the industry of the impact that both potentialities and risks can have on the final overall quality of the subtitles when different techniques are used to produce ILS. To do so, five ILS workflows requiring human and machine interaction to different extents were tested in terms of quality, thus not only from a linguistic-accuracy point of view but also considering another crucial factor, namely the delay in the broadcast of the subtitles. Two case studies were carried out with different language pairs: a first experiment (English to Italian) tested and assessed quality in interlingual respeaking on one hand, then simultaneous interpreting (SI) combined with intralingual respeaking, and SI with automatic speech recognition (ASR) on the other. A second experiment (Spanish to Italian) evaluated and compared five methods: the first three again, and two more machine-centred ones, namely intralingual respeaking combined with machine translation (MT), and ASR with MT. Two workshops on interlingual respeaking were offered in the master's degree programme in Translation and Interpreting at the University of Genova to prepare students for the experiments, aimed at testing different training modules on ILS and their effectiveness on students' learning outcomes. For the final experiments, students were assigned different roles for each tested method and performed the required tasks, producing ILS from the same source text: a video of a full original speech at a live event. The outputs obtained were analysed using the NTR model (Romero-Fresco & Pöchhacker, 2017) and the delay was calculated for each method. Preliminary quantitative results deriving from the NTR analyses and the calculation of delay were compared to two other case studies conducted by the University of Vigo and the University of Surrey, showing that more automated and fully automated workflows are indeed faster than the others, while still presenting several important issues in translation and punctuation. Albeit on a small scale, the research also shows how urgent it is, and how potentially easy it could be, to train translators and interpreters in respeaking during their education, given their keen interest in the subject matter.
It is hoped that the results obtained can shed light on the repercussions of using the different methods and induce further reflection on the importance of human interaction with automatic machine systems in providing high-quality accessibility at live events. It is also hoped that the involved students' interest in this field, which was completely unknown to them prior to this research, can inform the urgency of raising students' awareness and competence acquisition in the field of live subtitling through respeaking.
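As a rough illustration of the quality metrics discussed above, the sketch below computes an NTR-style accuracy rate, assuming the usual NER-family formula in which translation and recognition error points are subtracted from the number of words, together with an average subtitle delay. The error counts, weights and timestamps are invented, and the real NTR model involves graded error severities and human judgement.

```python
def ntr_accuracy(n_words, translation_errors, recognition_errors):
    """NTR-style accuracy rate: (N - T - R) / N * 100 (assumed formula)."""
    return (n_words - translation_errors - recognition_errors) / n_words * 100.0

def average_delay(speech_times, subtitle_times):
    """Mean latency (seconds) between the spoken words and their subtitles."""
    gaps = [sub - spoken for spoken, sub in zip(speech_times, subtitle_times)]
    return sum(gaps) / len(gaps)

# Invented figures for one subtitled speech.
n_words = 1200
t_errors = 14.5          # weighted translation error points
r_errors = 6.0           # weighted recognition error points
speech = [1.0, 4.2, 9.8, 15.3]      # seconds at which utterances were spoken
subs = [5.5, 9.0, 14.6, 21.1]       # seconds at which subtitles appeared

print(f"NTR accuracy: {ntr_accuracy(n_words, t_errors, r_errors):.2f}%")
print(f"average delay: {average_delay(speech, subs):.1f} s")
```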
APA, Harvard, Vancouver, ISO, and other styles
39

Wächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-64838.

Full text
Abstract:
Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies (controlled, hierarchical vocabularies) are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods that can support ontology construction have been proposed in the past; however, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high-quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, and parent-child relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not, with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because they contained a specific term or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.
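A toy Python sketch of the term-generation idea: DOG4DAG ranks statistically significant noun phrases, and as a rough stand-in this ranks frequent, stopword-free word n-grams from a small text collection. The corpus and stoplist are illustrative assumptions.

```python
from collections import Counter
import re

STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "on", "are", "by", "be", "can", "such", "as"}

def candidate_terms(texts, max_ngram=3, min_freq=2):
    """Rank stopword-free n-grams by frequency as candidate ontology terms."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_ngram + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if not any(tok in STOPWORDS for tok in gram):
                    counts[" ".join(gram)] += 1
    return [(term, freq) for term, freq in counts.most_common() if freq >= min_freq]

abstracts = [
    "Embryonic stem cell culture is an alternative to animal testing.",
    "Alternative methods such as stem cell culture reduce animal testing.",
    "Animal testing can be replaced by in vitro cell culture assays.",
]
for term, freq in candidate_terms(abstracts)[:8]:
    print(freq, term)
```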
APA, Harvard, Vancouver, ISO, and other styles
40

"Text-independent speaker recognition using discriminative subspace analysis." 2012. http://library.cuhk.edu.hk/record=b5549636.

Full text
Abstract:
Speaker recognition (SR), which uses the voice to determine the speaker's identity, is an important and challenging research topic in biometric authentication. Generally speaking, speaker recognition can be divided into text-dependent and text-independent methods according to the verbal content of the speech signal. There are two major applications of speaker recognition: the first is speaker verification, also referred to as speaker authentication, which is used to validate the identity of a speaker according to the voice and involves a binary decision. The second is speaker identification, which is used to determine an unknown speaker's identity.
In a state-of-the-art speaker recognition system, the speaker model is usually trained by generative methods, which estimate the feature distribution of each speaker from the given data. These generative methods need a frame-based metric (e.g. probability or likelihood) calculation for making the final decision, which consumes many computing resources and slows down real-time responses. Meanwhile, many redundant data frames are blindly selected for training without efficient subspace dimension reduction. In order to overcome the disadvantages of generative methods and obtain boundary information between individual speakers, we propose to apply discriminative subspace techniques for model training and to employ simple but efficient distance metrics for decision score calculation.
In this thesis, we present an overview of both conventional and state-of-the-art generative speaker recognition methods (e.g. the Gaussian mixture model and joint factor analysis) and analyse their advantages and disadvantages. In addition, we investigate the application of subspace analysis techniques to reduce feature dimensions and computation time. After that, a novel speaker recognition framework based on nonparametric Fisher's discriminant analysis, which we name Fishervoice, is proposed. The objective of the proposed Fishervoice algorithm is to model the intrinsic vocal characteristics in a discriminant subspace, de-emphasising unwanted noise variations and emphasising classification boundary information. Using the proposed Fishervoice framework, speaker recognition can be easily realised by mapping a test utterance to the Fishervoice subspace and then calculating the score between the test utterance and its reference. Besides, we explore the proposed Fishervoice framework with several extensions for further dimensionality reduction and performance improvement. Furthermore, we investigate various subspace analysis techniques in a total-variability-based low-dimensional space for fast computation. Extensive experiments on two large speaker recognition corpora (XM2VTS and NIST) demonstrate significant improvements of Fishervoice over standard, state-of-the-art approaches for both speaker identification and verification systems.
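A compact numpy sketch of the general subspace-plus-simple-distance idea: speaker supervectors are projected onto a low-dimensional subspace learned with PCA (a simpler stand-in for the nonparametric Fisher analysis used in the thesis) and verification is scored with cosine similarity. The data and dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented "supervectors" for subspace training: 20 vectors of dimension 50.
train = rng.normal(size=(20, 50))

# Learn a PCA projection as a simple stand-in for the Fishervoice subspace.
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
k = 5
W = Vt[:k]                               # top-k basis vectors (k x 50)

def project(x):
    """Map a supervector into the low-dimensional subspace."""
    return W @ (x - mean)

def cosine_score(a, b):
    """Simple distance metric used for the verification decision."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

enrol_vec = rng.normal(size=50)                       # enrolled speaker
genuine_test = enrol_vec + 0.1 * rng.normal(size=50)  # same speaker, small variation
impostor_test = rng.normal(size=50)                   # different speaker

e, g, i = project(enrol_vec), project(genuine_test), project(impostor_test)
print("genuine score :", round(cosine_score(e, g), 3))
print("impostor score:", round(cosine_score(e, i), 3))
```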
Jiang, Weiwu.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2012.
Contents: Abstract; Acknowledgements; Contents; List of Figures; List of Tables
Chapter 1 --- Introduction: Overview of Speaker Recognition Systems; Motivation; Outline of Thesis
Chapter 2 --- Background Study: Generative Gaussian Mixture Model (GMM) (Basic GMM; The Gaussian Mixture Model-Universal Background Model (GMM-UBM) System); Discriminative Subspace Analysis (Principal Component Analysis; Linear Discriminant Analysis; Heteroscedastic Linear Discriminant Analysis; Locality Preserving Projections); Noise Compensation (Eigenvoice; Joint Factor Analysis; Probabilistic Linear Discriminant Analysis; Nuisance Attribute Projection; Within-class Covariance Normalization); Support Vector Machine; Score Normalization; Summary
Chapter 3 --- Corpora for Speaker Recognition Experiments: Corpora for Speaker Identification Experiments (XM2VTS Corpus; NIST Corpora); Corpora for Speaker Verification Experiments; Summary
Chapter 4 --- Performance Measures for Speaker Recognition: Performance Measures for Identification; Performance Measures for Verification (Equal Error Rate; Detection Error Tradeoff Curves; Detection Cost Function); Summary
Chapter 5 --- The Discriminant Fishervoice Framework: The Proposed Fishervoice Framework (Feature Representation; Nonparametric Fisher's Discriminant Analysis); Speaker Identification Experiments (Experiments on the XM2VTS Corpus; Experiments on the NIST Corpus); Summary
Chapter 6 --- Extension of the Fishervoice Framework: Two-level Fishervoice Framework (Proposed Algorithm); Performance Evaluation on the Two-level Fishervoice Framework (Experimental Setup; Performance Comparison of Different Types of Input Supervectors; Performance Comparison of Different Numbers of Slices; Performance Comparison of Different Dimensions of Fishervoice Projection Matrices; Performance Comparison with Other Systems; Fusion with Other Systems; Extension of the Two-level Subspace Analysis Framework); Random Subspace Sampling Framework (Supervector Extraction; Training Stage; Testing Procedures; Discussion); Performance Evaluation of the Random Subspace Sampling Framework (Experimental Setup; Random Subspace Sampling Analysis; Comparison with Other Systems; Fusion with the Other Systems); Summary
Chapter 7 --- Discriminative Modeling in Low-dimensional Space: Discriminative Subspace Analysis in Low-dimensional Space (Experimental Setup; Performance Evaluation on Individual Subspace Analysis Techniques; Performance Evaluation on Multi-type of Subspace Analysis Techniques); Discriminative Subspace Analysis with Support Vector Machine (Experimental Setup; Performance Evaluation on LDA+WCCN+SVM; Performance Evaluation on Fishervoice+SVM); Summary
Chapter 8 --- Conclusions and Future Work: Contributions; Future Directions
Chapter A --- EM Training GMM
Bibliography
APA, Harvard, Vancouver, ISO, and other styles
41

Henriques, Daniel Filipe Rodrigues. "Automatic Completion of Text-based Tasks." Master's thesis, 2019. http://hdl.handle.net/10362/92296.

Full text
Abstract:
Crowdsourcing is a widespread problem-solving model that consists in assigning tasks to an existing pool of workers, and it is a scalable alternative to hiring a group of experts for labeling high volumes of data. It can provide results of similar quality, with the advantage of reaching that standard faster and more efficiently. Modern approaches to crowdsourcing use Machine Learning models to label the data and ask the crowd to validate the results. Such approaches can only be applied if the data on which the model was trained (source data) and the data that needs labeling (target data) share some relation. Furthermore, since the model is not adapted to the target data, its predictions may contain a substantial number of errors, so validating them can be very time-consuming. In this thesis, we propose an approach that leverages in-domain data, a labeled portion of the target data, to adapt the model; the remainder of the data is labeled based on the model's predictions. The crowd is tasked with generating the in-domain data and validating the model's predictions. Under this approach, we train the model either with in-domain data only or with both in-domain data and data from an outer domain. We apply these learning settings with the intent of optimizing a crowdsourcing pipeline for Natural Language Processing, more concretely for the task of Named Entity Recognition (NER), where the optimization targets the effort required by the crowd to perform the NER task. The results of the experiments show that the use of in-domain data achieves effort savings ranging from 6% to 53%. Furthermore, we observe such savings on nine distinct datasets, which demonstrates the robustness and breadth of application of this approach. In conclusion, the in-domain data approach is capable of optimizing a crowdsourcing pipeline for NER, and it has a broader range of use cases than reusing a model to generate predictions on the target data.
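The adaptation idea described in this abstract can be pictured with a short sketch: fine-tune a simple NER tagger on a small, crowd-labeled in-domain subset (optionally combined with out-of-domain data), pre-label the remaining target data with it, and let the crowd only validate the predictions. The thesis code is not reproduced here; the toy datasets, the feature set, and the logistic-regression tagger below are illustrative assumptions, not the author's implementation.

```python
# Minimal sketch of the in-domain adaptation idea (illustrative only; the
# datasets, model, and features are assumptions, not the thesis code).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(sentence, i):
    """Very small per-token feature set for a token-level NER tagger."""
    tok = sentence[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "prev": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next": sentence[i + 1].lower() if i + 1 < len(sentence) else "<EOS>",
    }

def to_xy(labelled_sentences):
    X, y = [], []
    for tokens, tags in labelled_sentences:
        for i in range(len(tokens)):
            X.append(token_features(tokens, i))
            y.append(tags[i])
    return X, y

# Hypothetical data: a small crowd-labelled in-domain portion of the target
# data plus a larger out-of-domain (source) corpus.
in_domain = [(["Lisbon", "hosts", "NOVA", "University"], ["B-LOC", "O", "B-ORG", "I-ORG"])]
out_of_domain = [(["Paris", "is", "in", "France"], ["B-LOC", "O", "O", "B-LOC"])]
unlabelled_target = [["Porto", "hosts", "a", "workshop"]]

# The two learning settings compared in the abstract: in-domain only vs. combined.
for name, train_set in [("in-domain only", in_domain),
                        ("in-domain + out-of-domain", in_domain + out_of_domain)]:
    X, y = to_xy(train_set)
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    # The remaining target data is pre-labelled by the model; the crowd only
    # validates/corrects these predictions instead of labelling from scratch.
    for tokens in unlabelled_target:
        preds = model.predict([token_features(tokens, i) for i in range(len(tokens))])
        print(name, list(zip(tokens, preds)))
```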
APA, Harvard, Vancouver, ISO, and other styles
42

Lai, Chun Han, and 賴俊翰. "A Python Implementation of Automatic Speech-text Synchronization Using Speech Recognition and Text-to-Speech Technology." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/53806441331969263004.

Full text
Abstract:
Master's thesis
Chang Gung University
Department of Computer Science and Information Engineering
Academic year 103 (2014-2015)
With the advent of the global village, language learning has become an important issue, and a varied language ability is an indicator of competitiveness; listening and speaking skills are considered especially important. In this study, we establish a method for creating speech-and-text synchronized audiobooks using speech recognition and cloud text-to-speech technology. With this method, users can turn arbitrary articles of their own choosing into learning materials for the shadowing technique. The materials take the form of word-level speech-and-text synchronized audiobooks, which are built from timed-text files produced from the user's articles and the corresponding speech files. Our speech-text synchronization tool, named CGUAlign, uses Python to wrap the well-known speech recognition toolkit HTK (Hidden Markov Model Toolkit). Given a text file and the corresponding speech file obtained from cloud text-to-speech, CGUAlign creates the timed-text file that synchronizes speech and text. We also build a simple website with JavaScript that uses the timed-text files for CALL (Computer-Assisted Language Learning) purposes: users can browse the synchronized audiobooks to practise shadowing, and a dictionary function further supports the CALL goal.
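The word-level timed-text output described here can be illustrated with a small sketch. CGUAlign itself relies on HTK forced alignment, which is not shown in the abstract, so the sketch below mocks the aligner with evenly spaced timestamps and only demonstrates the conversion of word timings into a WebVTT-style cue list that a JavaScript player could consume; the function names and output format are assumptions, not the tool's actual interface.

```python
# Illustrative sketch of turning word-level alignments into a timed-text file.
# CGUAlign uses HTK forced alignment; here the alignment step is mocked.
from dataclasses import dataclass

@dataclass
class WordTiming:
    word: str
    start: float  # seconds
    end: float    # seconds

def fake_align(words, audio_duration):
    """Stand-in for the forced aligner: spread words evenly over the audio."""
    step = audio_duration / len(words)
    return [WordTiming(w, i * step, (i + 1) * step) for i, w in enumerate(words)]

def to_timestamp(seconds):
    m, s = divmod(seconds, 60)
    return f"{int(m):02d}:{s:06.3f}"

def to_webvtt(timings):
    """Serialize word timings as WebVTT cues usable by a web-based player."""
    lines = ["WEBVTT", ""]
    for t in timings:
        lines.append(f"{to_timestamp(t.start)} --> {to_timestamp(t.end)}")
        lines.append(t.word)
        lines.append("")
    return "\n".join(lines)

words = "Shadowing practice starts with listening".split()
print(to_webvtt(fake_align(words, audio_duration=3.0)))
```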
APA, Harvard, Vancouver, ISO, and other styles
43

"Text-independent bilingual speaker verification system." 2003. http://library.cuhk.edu.hk/record=b5891732.

Full text
Abstract:
Ma Bin.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 96-102).
Abstracts in English and Chinese.
Abstract --- p.i
Acknowledgement --- p.iv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Biometrics --- p.2
Chapter 1.2 --- Speaker Verification --- p.3
Chapter 1.3 --- Overview of Speaker Verification Systems --- p.4
Chapter 1.4 --- Text Dependency --- p.4
Chapter 1.4.1 --- Text-Dependent Speaker Verification --- p.5
Chapter 1.4.2 --- GMM-based Speaker Verification --- p.6
Chapter 1.5 --- Language Dependency --- p.6
Chapter 1.6 --- Normalization Techniques --- p.7
Chapter 1.7 --- Objectives of the Thesis --- p.8
Chapter 1.8 --- Thesis Organization --- p.8
Chapter 2 --- Background --- p.10
Chapter 2.1 --- Background Information --- p.11
Chapter 2.1.1 --- Speech Signal Acquisition --- p.11
Chapter 2.1.2 --- Speech Processing --- p.11
Chapter 2.1.3 --- Engineering Model of Speech Signal --- p.13
Chapter 2.1.4 --- Speaker Information in the Speech Signal --- p.14
Chapter 2.1.5 --- Feature Parameters --- p.15
Chapter 2.1.5.1 --- Mel-Frequency Cepstral Coefficients --- p.16
Chapter 2.1.5.2 --- Linear Predictive Coding Derived Cepstral Coefficients --- p.18
Chapter 2.1.5.3 --- Energy Measures --- p.20
Chapter 2.1.5.4 --- Derivatives of Cepstral Coefficients --- p.21
Chapter 2.1.6 --- Evaluating Speaker Verification Systems --- p.22
Chapter 2.2 --- Common Techniques --- p.24
Chapter 2.2.1 --- Template Model Matching Methods --- p.25
Chapter 2.2.2 --- Statistical Model Methods --- p.26
Chapter 2.2.2.1 --- HMM Modeling Technique --- p.27
Chapter 2.2.2.2 --- GMM Modeling Techniques --- p.30
Chapter 2.2.2.3 --- Gaussian Mixture Model --- p.31
Chapter 2.2.2.4 --- The Advantages of GMM --- p.32
Chapter 2.2.3 --- Likelihood Scoring --- p.32
Chapter 2.2.4 --- General Approach to Decision Making --- p.35
Chapter 2.2.5 --- Cohort Normalization --- p.35
Chapter 2.2.5.1 --- Probability Score Normalization --- p.36
Chapter 2.2.5.2 --- Cohort Selection --- p.37
Chapter 2.3 --- Chapter Summary --- p.38
Chapter 3 --- Experimental Corpora --- p.39
Chapter 3.1 --- The YOHO Corpus --- p.39
Chapter 3.1.1 --- Design of the YOHO Corpus --- p.39
Chapter 3.1.2 --- Data Collection Process of the YOHO Corpus --- p.40
Chapter 3.1.3 --- Experimentation with the YOHO Corpus --- p.41
Chapter 3.2 --- CUHK Bilingual Speaker Verification Corpus --- p.42
Chapter 3.2.1 --- Design of the CUBS Corpus --- p.42
Chapter 3.2.2 --- Data Collection Process for the CUBS Corpus --- p.44
Chapter 3.3 --- Chapter Summary --- p.46
Chapter 4 --- Text-Dependent Speaker Verification --- p.47
Chapter 4.1 --- Front-End Processing on the YOHO Corpus --- p.48
Chapter 4.2 --- Cohort Normalization Setup --- p.50
Chapter 4.3 --- HMM-based Speaker Verification Experiments --- p.53
Chapter 4.3.1 --- Subword HMM Models --- p.53
Chapter 4.3.2 --- Experimental Results --- p.55
Chapter 4.3.2.1 --- Comparison of Feature Representations --- p.55
Chapter 4.3.2.2 --- Effect of Cohort Normalization --- p.58
Chapter 4.4 --- Experiments on GMM-based Speaker Verification --- p.61
Chapter 4.4.1 --- Experimental Setup --- p.61
Chapter 4.4.2 --- The number of Gaussian Mixture Components --- p.62
Chapter 4.4.3 --- The Effect of Cohort Normalization --- p.64
Chapter 4.4.4 --- Comparison of HMM and GMM --- p.65
Chapter 4.5 --- Comparison with Previous Systems --- p.67
Chapter 4.6 --- Chapter Summary --- p.70
Chapter 5 --- Language- and Text-Independent Speaker Verification --- p.71
Chapter 5.1 --- Front-End Processing of the CUBS --- p.72
Chapter 5.2 --- Language- and Text-Independent Speaker Modeling --- p.73
Chapter 5.3 --- Cohort Normalization --- p.74
Chapter 5.4 --- Experimental Results and Analysis --- p.75
Chapter 5.4.1 --- Number of Gaussian Mixture Components --- p.78
Chapter 5.4.2 --- The Cohort Normalization Effect --- p.79
Chapter 5.4.3 --- Language Dependency --- p.80
Chapter 5.4.4 --- Language-Independency --- p.83
Chapter 5.5 --- Chapter Summary --- p.88
Chapter 6 --- Conclusions and Future Work --- p.90
Chapter 6.1 --- Summary --- p.90
Chapter 6.1.1 --- Feature Comparison --- p.91
Chapter 6.1.2 --- HMM Modeling --- p.91
Chapter 6.1.3 --- GMM Modeling --- p.91
Chapter 6.1.4 --- Cohort Normalization --- p.92
Chapter 6.1.5 --- Language Dependency --- p.92
Chapter 6.2 --- Future Work --- p.93
Chapter 6.2.1 --- Feature Parameters --- p.93
Chapter 6.2.2 --- Model Quality --- p.93
Chapter 6.2.2.1 --- Variance Flooring --- p.93
Chapter 6.2.2.2 --- Silence Detection --- p.94
Chapter 6.2.3 --- Conversational Speaker Verification --- p.95
Bibliography --- p.102
APA, Harvard, Vancouver, ISO, and other styles
44

Williams, Kyle. "Learning to Read Bushman: Automatic Handwriting Recognition for Bushman Languages." Thesis, 2012. http://pubs.cs.uct.ac.za/archive/00000791/.

Full text
Abstract:
The Bleek and Lloyd Collection contains notebooks that document the tradition, language and culture of the Bushman people who lived in South Africa in the late 19th century. Transcriptions of these notebooks would allow for the provision of services such as text-based search and text-to-speech. However, these notebooks are currently only available in the form of digital scans and the manual creation of transcriptions is a costly and time-consuming process. Thus, automatic methods could serve as an alternative approach to creating transcriptions of the text in the notebooks. In order to evaluate the use of automatic methods, a corpus of Bushman texts and their associated transcriptions was created. The creation of this corpus involved: the development of a custom method for encoding the Bushman script, which contains complex diacritics; the creation of a tool for creating and transcribing the texts in the notebooks; and the running of a series of workshops in which the tool was used to create the corpus. The corpus was then used to evaluate various techniques for automatically transcribing the texts, in order to determine which approaches were best suited to the complex Bushman script. These techniques included the use of Support Vector Machines, Artificial Neural Networks and Hidden Markov Models as machine learning algorithms, which were coupled with different descriptive features. The effect of the texts used for training the machine learning algorithms was also investigated, as was the use of a statistical language model. It was found that, for Bushman word recognition, the use of a Support Vector Machine with Histograms of Oriented Gradient features resulted in the best performance and, for Bushman text line recognition, Marti & Bunke features resulted in the best performance when used with Hidden Markov Models. The automatic transcription of the Bushman texts proved to be difficult and the performance of the different recognition systems was largely affected by the complexities of the Bushman script. It was also found that, besides having an influence on determining which techniques may be the most appropriate for automatic handwriting recognition, the texts used in an automatic handwriting recognition system also play a large role in determining whether or not automatic recognition should be attempted at all.
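The best-performing word recognition setup reported in this abstract (an SVM over Histogram of Oriented Gradients features) can be sketched in a few lines with scikit-image and scikit-learn. The Bushman corpus is not bundled with these libraries, so the sketch trains on random stand-in images; the image size, HOG parameters, and SVM kernel are assumptions for illustration, not the thesis configuration.

```python
# Sketch of SVM-over-HOG word classification (stand-in data, assumed parameters).
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def hog_features(image):
    # Descriptor over a fixed-size grayscale word image.
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# Stand-in "word images": 64x128 grayscale arrays forming two fake classes.
n_per_class = 20
images = np.concatenate([
    rng.random((n_per_class, 64, 128)),        # pretend class 0
    rng.random((n_per_class, 64, 128)) * 0.5,  # pretend class 1 (darker)
])
labels = np.array([0] * n_per_class + [1] * n_per_class)

X = np.array([hog_features(img) for img in images])
clf = SVC(kernel="rbf", C=10.0)  # kernel and C chosen for illustration only
clf.fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```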
APA, Harvard, Vancouver, ISO, and other styles
45

Warren, Jolan, and 王杰龍. "The Effects of Automatic Speech Recognition and Text-to-speech Software on EFL Students' Pronunciation." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/04697441114645545894.

Full text
Abstract:
Master's thesis
National Kaohsiung Normal University
Department of English
Academic year 100 (2011-2012)
The purpose of this study is to evaluate the effects of automatic speech recognition (ASR) and text-to-speech (TTS) software on EFL students' pronunciation ability. Participants were 48 first- and second-year non-English majors from National Kaohsiung Normal University. Participants' ability to produce segmental sounds (14 vowels) and their suprasegmental ability were measured using a pre-test and post-test scored by two raters. Participants were assigned to a control group, a TTS group, or an ASR group, and used ASR or TTS software over six weeks to self-correct their pronunciation. Their attitudes towards ASR and TTS software were also measured via a questionnaire and open-ended questions. Based on the data analysis, results showed that the use of ASR software for pronunciation practice resulted in mixed improvements in participants' pronunciation ability, none of which reached a level of significance. The use of TTS software for pronunciation practice resulted in improvements in all areas of pronunciation ability, of which only one was significant. Despite the lack of a significant difference, TTS software resulted in a larger overall gain in pronunciation ability, and participants in the TTS group held a much more positive view of TTS software for pronunciation practice than participants in the ASR group did for ASR software. The study had several limitations: the participants were non-English majors from a public university in Taiwan, there was low inter-rater reliability for one section of the pre-test and post-test, treatment was restricted to just six weeks, the focus of the study was confined to English vowels, and there was a high participant drop-out rate. Results from the study suggest that TTS software shows promise as a tool for creating custom practice material and that pronunciation practice software may best be integrated into pronunciation training when it supplements teacher-led pronunciation classes and provides students with a pronunciation model to listen to before practicing. To investigate further the effects of ASR and TTS software on EFL students' pronunciation and possible applications of the software, it is recommended that research be undertaken involving a longer period of treatment, that non-English majors and English majors be compared, and that the use of a smaller set of sounds, or of sounds verified as problematic, be investigated.
APA, Harvard, Vancouver, ISO, and other styles
46

Rato, João Pedro Cordeiro. "Conversação homem-máquina. Caracterização e avaliação do estado actual das soluções de speech recognition, speech synthesis e sistemas de conversação homem-máquina." Master's thesis, 2016. http://hdl.handle.net/10400.8/2375.

Full text
Abstract:
Human verbal communication is two-way: both parties understand each other and draw certain conclusions from the exchange. This kind of communication, also called dialogue, can involve not only human agents but also humans and machines. Interaction between humans and machines through natural language plays an important role in improving communication between them. In order to better understand human-machine communication, this document presents background on human-machine conversation systems, including their modules and operation, dialogue strategies, and the challenges to take into account when implementing them. In addition, several Speech Recognition and Speech Synthesis systems, as well as systems that use human-machine conversation, are presented. Finally, performance tests are carried out on some Speech Recognition systems and, to put some of the concepts presented in this work into practice, the implementation of a human-machine conversation system is presented. Several conclusions were drawn from this work, among them the high complexity of human-machine conversation systems, the low performance of speech recognition in noisy environments, and the barriers that can be encountered when implementing these systems.
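As a rough illustration of the modules this dissertation discusses (speech recognition, dialogue management, speech synthesis), the sketch below wires placeholder ASR and TTS functions around a trivial rule-based dialogue manager. The function names, intents, and rules are hypothetical; a real system would plug in an actual recognizer and synthesizer in place of the placeholders.

```python
# Toy human-machine conversation loop; ASR and TTS are placeholders (hypothetical).
def recognize_speech(audio: bytes) -> str:
    """Placeholder ASR: a real system would call a speech recognizer here."""
    return "what time is it"

def synthesize_speech(text: str) -> bytes:
    """Placeholder TTS: a real system would return synthesized audio."""
    print(f"[TTS] {text}")
    return b""

def dialogue_manager(utterance: str) -> str:
    """Trivial rule-based dialogue strategy mapping utterances to responses."""
    rules = {
        "hello": "Hello! How can I help you?",
        "what time is it": "Sorry, I cannot access a clock in this sketch.",
    }
    return rules.get(utterance.lower().strip(), "I did not understand that.")

def conversation_turn(audio_in: bytes) -> bytes:
    text = recognize_speech(audio_in)   # ASR module
    reply = dialogue_manager(text)      # dialogue strategy module
    return synthesize_speech(reply)     # TTS module

conversation_turn(b"\x00\x01")  # one simulated turn
```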
APA, Harvard, Vancouver, ISO, and other styles
47

Wächter, Thomas. "Semi-automated Ontology Generation for Biocuration and Semantic Search." Doctoral thesis, 2010. https://tud.qucosa.de/id/qucosa%3A25496.

Full text
Abstract:
Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child-ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing an ontology has been developed that contains 17,151 terms of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because they contained a specific term or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.
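The term generation step described above, identifying statistically significant noun phrases in text, can be approximated with a short sketch that scores candidate phrases against a background corpus. DOG4DAG's actual scoring and its OBO-Edit/Protégé integration are not reproduced here; the simple log-odds score, the crude n-gram candidates, and the toy corpora below are illustrative assumptions only.

```python
# Sketch of scoring candidate terms by over-representation in a domain corpus
# relative to a background corpus (illustrative, not the DOG4DAG implementation).
import math
from collections import Counter

def phrase_counts(documents, max_len=3):
    """Count contiguous word n-grams up to max_len as crude noun-phrase candidates."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

def significance(term, domain, background):
    """Simple smoothed log-odds of the term in the domain corpus vs. the background."""
    d, b = domain[term], background.get(term, 0)
    d_total, b_total = sum(domain.values()), sum(background.values())
    return math.log(((d + 1) / d_total) / ((b + 1) / b_total))

domain_docs = ["animal testing alternative methods reduce animal use",
               "in vitro alternative methods replace animal testing"]
background_docs = ["the weather was fine and the meeting was short",
                   "methods of payment vary between countries"]

domain = phrase_counts(domain_docs)
background = phrase_counts(background_docs)
ranked = sorted(domain, key=lambda t: significance(t, domain, background), reverse=True)
print(ranked[:5])  # top candidate terms offered to the curator
```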
APA, Harvard, Vancouver, ISO, and other styles