Dissertations / Theses on the topic 'NLO Computation'




Consult the top 50 dissertations / theses for your research on the topic 'NLO Computation.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Limatola, Giovanni. "Infrared Linear Renormalons in Collider Processes." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2023. https://hdl.handle.net/10281/402371.

Abstract:
Understanding leading non-perturbative corrections, which show up as linear power corrections, is crucial to properly describe observables at both lepton and hadron colliders. Using an abelian model, we examine these effects for the transverse momentum distribution of a $Z$ boson produced in association with a jet in hadronic collisions, one of the most precisely measured LHC observables. The presence of leading non-perturbative corrections there would spoil the chance to reach the current experimental accuracy, even when higher orders in the perturbative expansion are included. As we did not find any such corrections using semi-numerical techniques, we looked for a rigorous field-theoretical derivation and explain under which circumstances linear power corrections can arise. We apply this theoretical understanding to the study of event-shape observables in $e^+e^-$ annihilation, focusing in particular on the $C$-parameter and thrust, and obtaining for them, for the first time, an estimate of non-perturbative corrections in the three-jet region. We also derive a factorisation formula for non-perturbative corrections, with one term describing the change of the shape variable when a soft parton is emitted and a constant universal factor proportional to the so-called Milan factor. These observables are routinely used to extract the strong coupling constant $\alpha_s$ and constitute an ideal environment for testing perturbative QCD. It is therefore extremely important to obtain reliable estimates of non-perturbative corrections in the whole kinematic region relevant for the $\alpha_s$ fits.
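For orientation, the factorised structure described above can be sketched schematically; this is an illustrative form based on the description in the abstract, not the thesis's exact expression, and the symbols are chosen here purely for exposition.

```latex
% Schematic linear non-perturbative correction to the mean value of an
% event-shape observable V in e+e- annihilation (illustrative only):
%   zeta_V        - encodes how V changes when one soft parton is emitted,
%   M             - the universal Milan factor,
%   Lambda_NP / Q - ratio of a non-perturbative scale to the hard scale.
\[
  \langle V \rangle \;\simeq\; \langle V \rangle_{\mathrm{pert}}
  \;+\; \zeta_V \,\mathcal{M}\, \frac{\Lambda_{\mathrm{NP}}}{Q}
\]
```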
2

Filali, Karim. "Multi-dynamic Bayesian networks for machine translation and NLP." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/6857.

3

Westin, Emil. "Fine-grained sentiment analysis of product reviews in Swedish." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-424266.

Abstract:
In this study we gather customer reviews from Prisjakt, a Swedish price comparison site, with the goal of studying the relationship between review and rating, known as sentiment analysis. The purpose of the study is to evaluate three different supervised machine learning models on a fine-grained dependent variable representing the review rating. For classification, binary and multinomial models are used with the one-versus-one strategy implemented in a Support Vector Machine with a linear kernel, evaluated with F1, accuracy, precision and recall scores. We also use Support Vector Regression, approximating the fine-grained variable as continuous and evaluating with MSE. Furthermore, the three models are evaluated on a balanced and an unbalanced dataset in order to investigate the effects of class imbalance. The results show that the SVR performs better on unbalanced fine-grained data, with the best fine-grained model reaching an MSE of 4.12, compared to the balanced SVR (6.84). The binary SVM model reaches an accuracy of 86.37% and a weighted macro F1 of 86.36% on the unbalanced data, while the balanced binary SVM model reaches approximately 80% for both measures. The multinomial model shows the worst performance due to its inability to handle class imbalance, despite the use of class weights. Finally, results from feature engineering show that the SVR benefits marginally from certain regex conversions, and that tf-idf weighting performs better on the balanced sets than on the unbalanced sets.
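A minimal sketch of the modelling setup described above, using scikit-learn; the file name and column names are hypothetical, and the actual feature engineering in the thesis differs.

```python
# Illustrative sketch (not the thesis code): a linear SVM classifier with a
# one-vs-one strategy and an SVR treating the rating as continuous.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, SVR
from sklearn.metrics import f1_score, mean_squared_error

df = pd.read_csv("reviews_sv.csv")          # hypothetical: columns "text", "rating"
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["rating"], test_size=0.2, random_state=0)

vec = TfidfVectorizer()
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

clf = SVC(kernel="linear", decision_function_shape="ovo", class_weight="balanced")
clf.fit(Xtr, y_train)
print("weighted F1:", f1_score(y_test, clf.predict(Xte), average="weighted"))

reg = SVR(kernel="linear")                  # fine-grained rating approximated as continuous
reg.fit(Xtr, y_train)
print("MSE:", mean_squared_error(y_test, reg.predict(Xte)))
```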
4

Scheible, Silke. "Computational treatment of superlatives." Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/4153.

Abstract:
The use of gradable adjectives and adverbs represents an important means of expressing comparison in English. The grammatical forms of comparatives and superlatives are used to express explicit orderings between objects with respect to the degree to which they possess some gradable property. While comparatives are commonly used to compare two entities (e.g., “The blue whale is larger than an African elephant”), superlatives such as “The blue whale is the largest mammal” are used to express a comparison between a target entity (here, the blue whale) and its comparison set (the set of mammals), with the target ranked higher or lower on a scale of comparison than members of the comparison set. Superlatives thus highlight the uniqueness of the target with respect to its comparison set. Although superlatives are frequently found in natural language, with the exception of recent work by Bos and Nissim (2006) and Jindal and Liu (2006b), they have not yet been investigated within a computational framework. Within theoretical linguistics, studies of superlatives have mainly focused on semantic properties that may only rarely occur in natural language (Szabolcsi 1986; Heim 1999). My PhD research aims to pave the way for a comprehensive computational treatment of superlatives. The initial question I am addressing is that of automatically extracting useful information about the target entity, its comparison set and their relationship from superlative constructions. One of the central claims of the thesis is that no unified computational treatment of superlatives is possible because of their great semantic complexity and the variety of syntactic structures in which they occur. I propose a classification of superlative surface forms, and initially focus on so-called “ISA superlatives”, which make explicit the IS-A relation that holds between target and comparison set. They are suitable for a computational approach because both their target and comparison set are usually explicitly realised in the text. I also aim to show that the findings of this thesis are of potential benefit for NLP applications such as Question Answering, Natural Language Generation, Ontology Learning, and Sentiment Analysis/Opinion Mining. In particular, I investigate the use of the “Superlative Relation Extractor” implemented in this project in the area of Sentiment Analysis/Opinion Mining, and claim that a superlative analysis of the sort presented in this thesis, when applied to product evaluations and recommendations, can provide just the kind of information that Opinion Mining aims to identify.
5

Rontsch, Raoul Horst. "Higher order QCD corrections to diboson production at hadron colliders." Thesis, University of Oxford, 2012. http://ora.ox.ac.uk/objects/uuid:5c4c3e7e-5c2a-4878-9fad-d9e5e0535d30.

Abstract:
Hadronic collider experiments have played a major role in particle physics phenomenology over the last few decades. Data recorded at the Tevatron at Fermilab is still of interest, and its successor, the Large Hadron Collider (LHC) at CERN, has recently announced the discovery of a particle consistent with the Standard Model Higgs boson. Hadronic colliders look set to guide the field for the next fifteen years or more, with the discovery of more particles anticipated. The discovery and detailed study of new particles relies crucially on the availability of high-precision theoretical predictions for both the signal and background processes. This requires observables to be calculated to next-to-leading order (NLO) in perturbative quantum chromodynamics (QCD). Many hadroproduction processes of interest contain multiple particles in the final state. Until recently, this caused a bottleneck in NLO QCD calculations, due to the difficulty in calculating one-loop corrections to processes involving three or more final-state particles. Spectacular developments in on-shell methods over the last six years have made these calculations feasible, allowing highly accurate predictions for final-state observables at the Tevatron and LHC. A particular realisation of on-shell methods, generalised unitarity, is used to compute the NLO QCD cross-sections and distributions for two processes: the hadroproduction of W+W+jj and the hadroproduction of W+W-jj. The NLO corrections to both processes reduce the scale dependence of the results significantly, while having a moderate effect on the cross-sections at the central scale choice and leaving the shapes of the kinematic distributions mostly unchanged. Additionally, the gluon fusion contribution to the next-to-next-to-leading order (NNLO) QCD corrections to W+W-j production is studied. This contribution is found to be highly dependent on the kinematic cuts used. For cuts used in Higgs searches, the gluon fusion effect can be as large as the NLO scale uncertainty, and should not be neglected. All of the higher-order QCD corrections increase the accuracy and reliability of the theoretical predictions at hadronic colliders.
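The generalised-unitarity approach mentioned here relies on the standard decomposition of a one-loop amplitude into a basis of scalar master integrals plus a rational remainder; the textbook form is reproduced below for context and is not specific to this thesis.

```latex
% One-loop amplitude in terms of scalar box, triangle and bubble (and, for
% massive internal lines, tadpole) master integrals; the coefficients d_i,
% c_i, b_i and the rational part R are determined from products of tree
% amplitudes evaluated on unitarity cuts.
\[
  \mathcal{A}^{\text{1-loop}} \;=\;
  \sum_i d_i\, I_4^{(i)} \;+\; \sum_i c_i\, I_3^{(i)}
  \;+\; \sum_i b_i\, I_2^{(i)} \;+\; R
\]
```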
6

Eisenberg, Joshua Daniel. "Automatic Extraction of Narrative Structure from Long Form Text." FIU Digital Commons, 2018. https://digitalcommons.fiu.edu/etd/3912.

Abstract:
Automatic understanding of stories is a long-standing goal of the artificial intelligence and natural language processing research communities. Stories explain the human experience. Understanding our stories promotes the understanding of both individuals and groups of people: various cultures, societies, families, organizations, governments, and corporations, to name a few. People use stories to share information. Stories are told, by narrators, in linguistic bundles of words called narratives. My work has given computers awareness of narrative structure, specifically of where the boundaries of a narrative lie in a text. This is the task of determining where a narrative begins and ends, a non-trivial task, because people rarely tell one story at a time. People do not explicitly announce when they are starting or stopping their stories: we interrupt each other; we tell stories within stories. Before my work, computers had no awareness of narrative boundaries, essentially where stories begin and end. My programs can extract narrative boundaries from novels and short stories with an F1 of 0.65. Before this I worked on teaching computers to identify which paragraphs of text have story content, with an F1 of 0.75 (which is state of the art). Additionally, I have taught computers to identify the narrative point of view (POV; how the narrator identifies themselves) and diegesis (how involved in the story's action the narrator is), with an F1 of over 0.90 for both narrative characteristics. For the narrative POV, diegesis, and narrative level extractors I ran annotation studies, with high agreement, that allowed me to teach computational models to identify structural elements of narrative through supervised machine learning. My work has given computers the ability to find where stories begin and end in raw text. This allows for further automatic analysis, such as extraction of plot, intent, event causality, and event coreference. These tasks are impossible when the computer cannot distinguish which stories are told in which spans of text. There are two key contributions in my work: 1) the identification of features that accurately extract elements of narrative structure, and 2) the gold-standard data and reports generated from running annotation studies on identifying narrative structure.
7

Acosta, Andrew D. "Laff-O-Tron: Laugh Prediction in TED Talks." DigitalCommons@CalPoly, 2016. https://digitalcommons.calpoly.edu/theses/1667.

Abstract:
Did you hear where the thesis found its ancestors? They were in the "parent-thesis"! This joke, whether you laughed at it or not, contains a fascinating and mysterious quality: humor. Humor is something so incredibly human that if you squint, the two words can even look the same. As such, humor is not often considered something that computers can understand. But that doesn't mean we won't try to teach it to them. In this thesis, we propose the system Laff-O-Tron, which attempts to predict when the audience of a public speech will laugh by looking only at the text of the speech. To do this, we create a corpus of over 1700 TED Talks retrieved from the TED website. We then adapted various techniques used by researchers to identify humor in text. We also investigated features that were specific to our public speaking environment. Using supervised learning, we try to classify whether a chunk of text would cause the audience to laugh or not based on these features. We examine the effects of each feature, classifier, and size of the text chunk provided. On a balanced data set, we are able to predict laughter with up to 75% accuracy in our best conditions; medium-level conditions reach around 70% accuracy, while our worst conditions result in 66% accuracy. Computers with humor recognition capabilities would be useful in the fields of human-computer interaction and communications. Humor can make a computer easier to interact with and function as a tool to check if humor was properly used in an advertisement or speech.
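A minimal sketch of the kind of supervised setup described above, with bag-of-words features and a linear classifier standing in for the thesis's richer feature set; the example chunks and labels are toy data, not the TED corpus.

```python
# Minimal sketch, not Laff-O-Tron itself: classify whether a chunk of talk
# text is followed by audience laughter, using n-gram features and a linear
# classifier. In practice, chunks and labels would come from transcripts in
# which laughter markers have already been located.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

chunks = ["so i told the committee a joke about parentheses",
          "we then normalise the feature vectors",
          "and my cat reviewed the first draft",
          "table two summarises the results"]
labels = [1, 0, 1, 0]          # 1 = audience laughed after this chunk

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(chunks, labels)
print(model.predict(["here is one more joke about a thesis"]))
```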
8

Al-Liabi, Majda Majeed. "Computational support for learners of Arabic." Thesis, University of Manchester, 2012. https://www.research.manchester.ac.uk/portal/en/theses/computational-support-for-learners-of-arabic(abd20b76-3ba2-4e11-8aa5-459ec6d8d7d2).html.

Abstract:
This thesis documents the use of Natural Language Processing (NLP) in Computer Assisted Language Learning (CALL) and its contribution to the learning experience of students studying Arabic as a foreign language. The goal of this project is to build an Intelligent Computer Assisted Language Learning (ICALL) system that provides computational assistance to learners of Arabic by teaching grammar, producing homework and issuing students with immediate feedback. To produce this system we use the Parasite system, which produces morphological, syntactic and semantic analysis of textual input, and extend it to provide error detection and diagnosis. The methodology we adopt involves relaxing constraints on unification so that correct information contained in a badly formed sentence may still be used to obtain a coherent overall analysis. We look at a range of errors, drawn from experience with learners at various levels, covering word internal problems (addition of inappropriate affixes, failure to apply morphotactic rules properly) and problems with relations between words (local constraints on features, and word order problems). As feedback is an important factor in learning, we look into different types of feedback that can be used to evaluate which is the most appropriate for the aim of our system.
9

Re, Emanuele. "Next-to-leading order QCD corrections to shower Monte Carlo event generators: single vector-boson and single-top hadroproduction." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2009. http://hdl.handle.net/10281/7455.

10

Lagerkvist, Love. "Computation as Strange Material : Excursions into Critical Accidents." Thesis, Malmö universitet, Institutionen för konst, kultur och kommunikation (K3), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-43639.

Abstract:
Waking up in a world where everyone carries a miniature supercomputer, interaction designers find themselves in their forerunners' dreams. Faced with the reality of planetary-scale computation, we have to confront the task of articulating approaches responsive to this accidental ubiquity of computation. This thesis attempts such a formulation by defining computation as a strange material, a plasticity shaped equally by its technical properties and the mode of production by which it is continuously re-produced. The definition is applied through a methodology of excursions: participatory explorations into two seemingly disparate sites of computation, connected in the ways they manifest a labor of care. First, we visit the social infrastructures that constitute the Linux kernel, examining strange entanglements of programming and care in the world's largest design process. This is followed by a tour into the thorny lands of artificial intelligence, situated in the smart replies of LinkedIn. Here, we investigate the fluctuating border between the artificial and the human with participants performing AI, formulating new Turing tests in the process. These excursions afford an understanding of computation as fundamentally re-produced through interaction, a strange kind of affective work whose understanding is crucial if we aim to disarm the critical accidents of our present future.
11

Lager, Adam. "Improving Solr search with Natural Language Processing : An NLP implementation for information retrieval in Solr." Thesis, Linköpings universitet, Programvara och system, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177790.

Abstract:
The field of AI is emerging fast, and institutions and companies are pushing the limits of what seemed impossible. Natural Language Processing is a branch of AI whose goal is to understand human speech and/or text. In this work, the technology is used to improve an inverted index, the full-text search engine Solr. Solr is open source and has integrated OpenNLP, making it a suitable choice for these kinds of operations. NLP-enabled Solr showed great results compared to the Solr instance currently running on the systems: while NLP-Solr was slightly worse in terms of precision, it excelled at recall and at returning the correct documents.
12

Neme, Alexis. "An arabic language resource for computational morphology based on the semitic model." Thesis, Paris Est, 2020. http://www.theses.fr/2020PESC2013.

Abstract:
We developed an original approach to Arabic traditional morphology, involving new concepts in Semitic lexicology, morphology, and grammar for standard written Arabic. This new methodology for handling the rich and complex Semitic languages is based on good practices in finite-state technologies (FSA/FST), using Unitex, a lexicon-based corpus processing suite. For verbs (Neme, 2011), I proposed an inflectional taxonomy that increases the lexicon's readability and makes it easier for Arabic speakers and linguists to encode, correct, and update it. Traditional grammar defines inflectional verbal classes by using verbal pattern-classes and root-classes. In our taxonomy, traditional pattern-classes are reused, and root-classes are redefined into a simpler system. The lexicon of verbs covered more than 99% of an evaluation corpus. For nouns and adjectives (Neme, 2013), we went one step further in the adaptation of traditional morphology. First, while this tradition is based on derivational rules, we base our description on inflectional ones. Next, we keep the concepts of root and pattern, which are the backbone of the traditional Semitic model. Still, our breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into a pattern-and-root model, which keeps the set of pattern classes and root sub-classes small and orderly. I elaborated a taxonomy for the broken plural containing 160 inflectional classes, which simplifies the encoding of the broken plural tenfold. Since then, I have elaborated comprehensive resources for Arabic, described in Neme and Paumier (2019). To take into account all aspects of the rich morphology of Arabic, I completed our taxonomy with suffixal inflectional classes for regular plurals, adverbs, and other parts of speech (POS) to cover the whole lexicon. In all, I identified around 1000 Semitic and suffixal inflectional classes implemented with concatenative and non-concatenative FST devices. From scratch, I created 76,000 fully vowelized lemmas, each associated with an inflectional class. These lemmas are inflected using the 1000 FSTs, producing a fully inflected lexicon with more than 6 million forms. I extended this fully inflected resource using agglutination grammars to identify words composed of up to 5 segments, agglutinated around a core inflected verb, noun, adjective, or particle. The agglutination grammars extend the recognition to more than 500 million valid delimited word forms, partially or fully vowelized. The flat-file size of the 6 million forms is 340 megabytes (UTF-16); it is then compressed to 11 megabytes before being loaded into memory for fast retrieval. The generation, compression, and minimization of the full-form lexicon take less than one minute on a common Unix laptop. The lexical coverage rate is more than 99%. The tagger runs at 5,000 words/second, and at more than 200,000 words/second if the resources are preloaded in RAM. The accuracy and speed of our tools result from our systematic linguistic approach and from our choice to embrace best practices in mathematical and computational methods. The lookup procedure is fast because we use minimal acyclic deterministic finite automata (Revuz, 1992) to compress the full-form dictionary, and because it contains only constant strings and no embedded rules. The breakthrough of our linguistic approach rests principally on the reversal of the traditional root-and-pattern Semitic model into a pattern-and-root model. Our computational approach, in turn, is based on good practices in finite-state technologies (FSA/FST): all full forms are computed in advance for accurate identification and to get the best from FSA compression for fast and efficient lookups.
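A toy illustration of the pattern-and-root idea described above, interdigitating a triliteral root into a vowelled pattern; the transliteration and the pattern strings are simplified stand-ins, and the actual resources are implemented as Unitex FSTs rather than Python.

```python
# Toy illustration only: a vowelled pattern with root-consonant slots
# C1, C2, C3 is filled by a triliteral root to produce an inflected form.
def apply_pattern(root: str, pattern: str) -> str:
    """Interdigitate a 3-consonant root into a pattern containing C1/C2/C3."""
    assert len(root) == 3
    out = pattern
    for i, consonant in enumerate(root, start=1):
        out = out.replace(f"C{i}", consonant)
    return out

# Roman transliteration, hypothetical pattern classes:
print(apply_pattern("ktb", "C1aC2aC3a"))    # kataba  "he wrote"
print(apply_pattern("ktb", "maC1C2uwC3"))   # maktuwb "written"
print(apply_pattern("drs", "C1aC2aC3a"))    # darasa  "he studied"
```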
13

Ceylan, Yavuz Selim. "Exploration of Transition Metal-Containing Catalytic Cycles via Computational Methods." Thesis, University of North Texas, 2019. https://digital.library.unt.edu/ark:/67531/metadc1505287/.

Abstract:
Styrene production by a (FlDAB)PdII(TFA)(η2-C2H4) complex was modeled using density functional theory (DFT). Benzene C-H activation by this complex was studied via five mechanisms: oxidative addition/reductive elimination, sigma-bond metathesis, concerted metalation-deprotonation (CMD), CMD activation of ethylene, and benzene substitution of ethylene followed by CMD of the ligated benzene. Calculations provided evidence that conversion of benzene and ethylene to styrene is initiated by the fifth pathway, arylation via CMD of coordinated benzene, followed by ethylene insertion into the Pd-Ph bond and then β-hydrogen elimination. Monomer (active species)/dimer equilibrium concentrations were also analyzed. The results obtained from the present study were compared with those of a recently reported RhI complex to help identify more suitable catalysts for the direct production of styrene from ethylene and benzene. Second, theoretical studies of heterobimetallic {Ag–Fe(CO)5}+ fragments were performed in conjunction with experiments. The computational models suggested that, for this first example of a heterodinuclear, metal-only FeAg Lewis pair (MOLP), Fe(CO)5 acts as a Lewis base and AgI as a Lewis acid. The ν̄(CO) bands of the studied molecules showed a blue shift relative to those measured for free Fe(CO)5, which indicated a reduction in Fe→CO backbonding upon coordination to silver(I). Electrostatic interaction is predicted via DFT as the dominant mode of Fe-Ag bonding, augmented by a modest amount of charge transfer between Ag+ and Fe(CO)5. Third, computational analyses of hypothetical transition metal-terminal boride [MB(PNPR)] complexes were reported. DFT, natural bond orbital (NBO), and multiconfiguration self-consistent field (MCSCF) calculations were employed to investigate the structure and bonding of terminal boride complexes, in particular the extent of metal dπ-boron pπ bonding. Comparison of metal-boride, -borylene and -boryl bond lengths confirms the presence of metal-boron π bonds, although the modest shortening (~3%) of the metal-boron bond suggests that the π-bonding is weak. Their instabilities, as measured by the free energies of H2 addition to make the corresponding boryl complexes, indicate terminal boride complexes to be thermodynamically weak. It is concluded that, for the boride complexes studied, covering a range of 4d and 5d metals, the metal-boride bond consists of a reasonably covalent σ bond and two very polarized π metal-boron bonds. The high polarization of the boron-to-metal π-bonds indicates that a terminal boride is an acceptor, or Z-type, ligand. Finally, anti-Markovnikov addition of water to olefins has been a long-standing goal in catalysis. The [Rh(COD)(DPEphos)]+ complex was found to be a general and regioselective group 9 catalyst for intermolecular hydroamination of alkenes. The reaction mechanism was adapted for intermolecular hydration of alkenes catalyzed by a [Rh(DPEphos)]+ catalyst and studied by DFT calculations. Olefin hydration pathways were analyzed for anti-Markovnikov and Markovnikov regioselectivity. On the basis of the DFT results, the operating mechanism can be summarized as follows: styrene activation through nucleophilic attack by the OHδ− of water on the alkene with simultaneous Hδ+ transfer to the Rh, followed by formation of the primary alcohol via reductive elimination. The competitive formation of phenylethane was studied via a β-elimination pathway followed by hydrogenation. The origin of the regioselectivity (Markovnikov vs anti-Markovnikov) was analyzed by studying the molecular orbitals and natural atomic charges, and shown to be primarily orbital-driven rather than charge-driven.
14

Zidi, Mohamed Sadok. "Calcul à une boucle avec plusieurs pattes externes dans les théories de jauge : la bibliothèque Golem95." Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENY031/document.

Abstract:
Higher-order corrections in gauge theories play a crucial role in studying physics within the Standard Model and beyond at TeV colliders such as the LHC, Tevatron and ILC. Therefore, it is of extreme importance to provide tools for next-to-leading order amplitude computation which are fast, stable, efficient and highly automatized. This thesis aims at developing the library of integrals Golem95. This library is a program written in Fortran95; it contains all the necessary ingredients to calculate any one-loop scalar or tensorial integral with up to six external legs. Golem95 uses the traditional reduction method (Golem reduction) to reduce the form factors into redundant basic integrals, which can be scalar (without Feynman parameters in the numerator) or tensorial (with Feynman parameters in the numerator); this formalism allows us to avoid the problems of numerical instabilities generated by the spurious singularities induced by the vanishing of Gram determinants. In addition, this library can be interfaced with automatic programs for NLO calculation based on unitarity-inspired reduction methods, such as GoSam. Earlier versions of Golem95 were designed for the calculation of amplitudes without internal masses. The purpose of this thesis is to extend this library to more general configurations (complex masses are supported) and to provide numerically stable calculations in the problematic regions (det(G) → 0) by providing a stable one-dimensional integral representation for each Golem95 basic integral.
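For context, the Gram determinant whose vanishing causes the spurious singularities mentioned above is built from the independent external momenta of the loop integral; the definition below is the standard one, not code from the library.

```latex
% Gram matrix built from the independent external momenta r_i of an N-point
% one-loop integral; standard reduction formulae involve inverse powers of
% det(G), so kinematic regions with det(G) -> 0 produce spurious numerical
% instabilities, which the one-dimensional integral representations in
% Golem95 are designed to avoid.
\[
  G_{ij} \;=\; 2\, r_i \cdot r_j , \qquad i,j = 1,\dots,N-1 ,
  \qquad \det(G) \to 0 \ \text{(problematic region)}
\]
```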
15

Björkman, Desireé. "Machine Learning Evaluation of Natural Language to Computational Thinking : On the possibilities of coding without syntax." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-424269.

Abstract:
Voice commands are used in today's society to offer services such as putting events into a calendar, telling you about the weather and controlling the lights at home. This project tries to extend the possibilities of voice commands by improving an earlier proof-of-concept system that interprets intentions given in natural language into program code. This improvement was made by mixing linguistic methods and neural networks to increase the accuracy and flexibility of the interpretation of input. A user-testing phase was carried out to determine whether the improvement would attract users to the interface. The results showed potential for educational use in computational thinking, as well as the issues that must be overcome before the system can become a general programming tool.
16

Shokat, Imran. "Computational Analyses of Scientific Publications Using Raw and Manually Curated Data with Applications to Text Visualization." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-78995.

Abstract:
Text visualization is a field dedicated to the visual representation of textual data by using computer technology. A large number of visualization techniques are available, and it is becoming harder for researchers and practitioners to choose an optimal technique for a particular task among the existing techniques. To overcome this problem, the ISOVIS Group developed an interactive survey browser for text visualization techniques. ISOVIS researchers gathered papers which describe text visualization techniques or tools and categorized them according to a taxonomy. Several categories were manually assigned to each visualization technique. In this thesis, we aim to analyze the dataset of this browser. We carried out several analyses to find temporal trends and correlations of the categories present in the browser dataset. In addition, a comparison of these categories with a computational approach has been made. Our results show that some categories have become more popular than before whereas others have declined in popularity. Cases of positive and negative correlation between various categories have been found and analyzed. Comparisons between the manually labeled dataset and the results of computational text analyses were presented to the experts with an opportunity to refine the dataset. The data analyzed in this thesis project is specific to the text visualization field; however, the methods used in the analyses can be generalized for application to other datasets of scientific literature surveys or, more generally, other manually curated collections of textual documents.
17

Meechan-Maddon, Ailsa. "The effect of noise in the training of convolutional neural networks for text summarisation." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-384607.

Abstract:
In this thesis, we work towards bridging the gap between two distinct areas: noisy text handling and text summarisation. The overall goal is to examine the effects of noise in the training of convolutional neural networks for text summarisation, with a view to understanding how to effectively create a noise-robust text-summarisation system. We look specifically at the problem of abstractive text summarisation of noisy data in the context of summarising error-containing documents from automatic speech recognition (ASR) output. We experiment with adding varying levels of noise (errors) to the four-million-article Gigaword corpus and training an encoder-decoder CNN on it with the aim of producing a noise-robust text summarisation system. A total of six text summarisation models are trained, each with a different level of noise. We discover that the models trained with a high level of noise are indeed able to aptly summarise noisy data into clean summaries, despite a tendency for all models to overfit to the level of noise on which they were trained. Directions are given for future steps in order to create an even more noise-robust and flexible text summarisation system.
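A minimal sketch of how a controlled level of character noise might be injected into training text, in the spirit of the experiments described above; the specific noise model (deletions, substitutions, duplications) is an assumption for illustration, not the thesis's exact procedure.

```python
# Illustrative sketch: corrupt text with ASR-like character errors at a
# chosen rate, so that a summarisation model can be trained on noisy input.
import random

def add_noise(text: str, noise_level: float, seed: int = 0) -> str:
    """Randomly delete, substitute or duplicate characters, each with probability ~noise_level/3."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    out = []
    for ch in text:
        r = rng.random()
        if r < noise_level / 3:
            continue                          # deletion
        elif r < 2 * noise_level / 3:
            out.append(rng.choice(alphabet))  # substitution
        else:
            out.append(ch)
            if r < noise_level:
                out.append(ch)                # duplication
    return "".join(out)

print(add_noise("the minister announced new measures today", noise_level=0.1))
```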
18

Adewumi, Oluwatosin. "Word Vector Representations using Shallow Neural Networks." Licentiate thesis, Luleå tekniska universitet, EISLAB, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-83578.

Abstract:
This work highlights some important factors for consideration when developing word vector representations and data-driven conversational systems. The neural network methods for creating word embeddings have gained more prominence than their older, count-based counterparts. However, there are still challenges, such as prolonged training time and the need for more data, especially with deep neural networks. Shallow neural networks with lesser depth appear to have the advantage of less complexity; however, they also face challenges, such as sub-optimal combinations of hyper-parameters which produce sub-optimal models. This work, therefore, investigates the following research questions: "How importantly do hyper-parameters influence word embeddings' performance?" and "What factors are important for developing ethical and robust conversational systems?" In answering the questions, various experiments were conducted using different datasets in different studies. The first study investigates, empirically, various hyper-parameter combinations for creating word vectors and their impact on a few natural language processing (NLP) downstream tasks: named entity recognition (NER) and sentiment analysis (SA). The study shows that the optimal performance of embeddings for downstream NLP tasks depends on the task at hand. It also shows that certain combinations give strong performance across the tasks chosen for the study. Furthermore, it shows that reasonably smaller corpora are sufficient, or in some cases even produce better models, and take less time to train and load. This is important, especially now that environmental considerations play a prominent role in ethical research. Subsequent studies build on the findings of the first and explore the hyper-parameter combinations for Swedish and English embeddings for the downstream NER task. The second study presents a new Swedish analogy test set for the evaluation of Swedish embeddings. Furthermore, it shows that character n-grams are useful for Swedish, a morphologically rich language. The third study shows that broad coverage of topics in a corpus appears to be important for producing better embeddings, and that noise may be helpful in certain instances, though it is generally harmful. Hence, a relatively smaller corpus can show better performance than a larger one, as demonstrated in the work with the smaller Swedish Wikipedia corpus against the Swedish Gigaword. The argument is made in the final study (in answering the second question), from the point of view of the philosophy of science, that the near-elimination of unwanted bias in training data and the use of fora like peer review, conferences, and journals to provide the necessary avenues for criticism and feedback are instrumental for the development of ethical and robust conversational systems.
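A minimal sketch (assuming gensim 4.x) of the kind of hyper-parameter grid such a study sweeps: skip-gram vs CBOW, window size, and negative sampling vs hierarchical softmax; the toy corpus and the downstream evaluation step are placeholders.

```python
# Illustrative hyper-parameter sweep for word2vec-style embeddings.
from gensim.models import Word2Vec

sentences = [["kungen", "bor", "i", "stockholm"],
             ["drottningen", "bor", "i", "stockholm"]]   # toy corpus

for sg in (0, 1):                              # 0 = CBOW, 1 = skip-gram
    for window in (4, 8):
        for hs, negative in ((0, 5), (1, 0)):  # negative sampling vs hierarchical softmax
            model = Word2Vec(sentences, vector_size=100, sg=sg, window=window,
                             hs=hs, negative=negative, min_count=1, epochs=5)
            # downstream evaluation (e.g. NER, sentiment analysis) would go here
            print(sg, window, hs, negative, model.wv.vectors.shape)
```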
19

Roos, Daniel. "Evaluation of BERT-like models for small scale ad-hoc information retrieval." Thesis, Linköpings universitet, Artificiell intelligens och integrerade datorsystem, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177675.

Abstract:
Measuring semantic similarity between two sentences is an ongoing research field with big leaps being taken every year. This thesis looks at using modern methods of semantic similarity measurement for an ad-hoc information retrieval (IR) system. The main challenge tackled was answering the question "What happens when you don’t have situation-specific data?". Using encoder-based transformer architectures pioneered by Devlin et al., which excel at fine-tuning to situationally specific domains, this thesis shows just how well the presented methodology can work and makes recommendations for future attempts at similar domain-specific tasks. It also shows an example of how a web application can be created to make use of these fast-learning architectures.
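A minimal sketch of bi-encoder ranking for ad-hoc retrieval with sentence-transformers; the checkpoint name is an example of a publicly available model, not necessarily the one used in the thesis, and the documents are toy data.

```python
# Illustrative sketch: rank documents for a query by cosine similarity of
# sentence embeddings from a pre-trained bi-encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Invoices are archived for seven years.",
        "The cafeteria opens at eight o'clock.",
        "Passwords must be rotated every ninety days."]
query = "how long do we keep invoices"

doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```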
20

Karamolegkou, Antonia. "Argument Mining: Claim Annotation, Identification, Verification." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-448855.

Abstract:
Researchers writing scientific articles summarize their work in abstracts that mention the final outcome of their study. Argumentation mining can be used to extract the claim of the researchers as well as the evidence that could support their claim. The rapid growth of scientific articles demands automated tools that could help in the detection and evaluation of scientific claims' veracity. However, there are neither many studies focusing on claim identification and verification nor many annotated corpora available to effectively train deep learning models. For this reason, we annotated two argument mining corpora and performed several experiments with state-of-the-art BERT-based models aiming to identify and verify scientific claims. We find that using SciBERT provides optimal results regardless of the dataset. Furthermore, increasing the amount of training data can improve the performance of every model we used. These findings highlight the need for large-scale argument mining corpora, as well as domain-specific pre-trained models.
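A minimal sketch of loading SciBERT for sentence-level claim identification with Hugging Face transformers; the label set and example sentence are placeholders, and the classification head is randomly initialised until fine-tuned on the annotated corpora.

```python
# Illustrative sketch: SciBERT with a two-class head (claim / not-claim).
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

sentence = "Our method improves F1 by 4 points over the baseline."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))   # [P(not-claim), P(claim)] once fine-tuned
```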
21

Dinkar, Tanvi. "Computational models of disfluencies : fillers and discourse markers in spoken language understanding." Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAT001.

Abstract:
People rarely speak in the same manner that they write – they are generally disfluent. Disfluencies can be defined as interruptions in the regular flow of speech, such as pausing silently, repeating words, or interrupting oneself to correct something said previously. Despite being a natural characteristic of spontaneous speech, and despite the rich linguistic literature that discusses their informativeness, they are often removed as noise in post-processing from the output transcripts of speech recognisers. So far, their consideration in a Spoken Language Understanding (SLU) context has rarely been explored. The aim of this thesis is to develop computational models of disfluencies in SLU. To do so, we take inspiration from psycholinguistic models of disfluencies, which focus on the role that disfluencies play in the production (by the speaker) and comprehension (by the listener) of speech. Specifically, when we use the term "computational models of disfluencies", we mean to develop methodologies that automatically process disfluencies to empirically observe 1) their impact on the production and comprehension of speech, and 2) how they interact with the primary signal (the lexical, or what was said in essence). To do so, we focus on two discourse contexts: monologues and task-oriented dialogues. Our results contribute to broader tasks in SLU, as well as to research relevant to Spoken Dialogue Systems. When studying monologues, we use a combination of traditional and neural models to study the representations and impact of disfluencies on SLU performance. Additionally, we develop methodologies to study disfluencies as a cue for incoming information in the flow of the discourse. In studying task-oriented dialogues, we focus on developing computational models to study the roles of disfluencies in the listener-speaker dynamic. We specifically study disfluencies in the context of verbal alignment, i.e. the alignment of the interlocutors' lexical expressions, and their role in behavioural alignment, a new alignment context that we propose, meaning when instructions given by one interlocutor are followed by an action from another interlocutor. We also consider how these disfluencies in local alignment contexts can be associated with discourse-level phenomena, such as success in the task. We consider this thesis one of the many first steps that could be undertaken to integrate disfluencies in SLU contexts.
22

Danielsson, Benjamin. "A Study on Text Classification Methods and Text Features." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-159992.

Abstract:
When it comes to the task of classification, the data used for training is the most crucial part. It follows that how this data is processed and presented to the classifier plays an equally important role. This thesis investigates the performance of multiple classifiers depending on the features that are used, the type of classes to classify and the optimization of said classifiers. The classifiers of interest are support vector machines (SMO) and multilayer perceptrons (MLP); the features tested are word vector spaces and text complexity measures, along with principal component analysis (PCA) on the complexity measures. The features are created based on the Stockholm-Umeå Corpus (SUC) and DigInclude, a dataset containing standard and easy-to-read sentences. For the SUC dataset the classifiers attempted to classify texts into nine different text categories, while for the DigInclude dataset the sentences were classified as either standard or simplified. The classification tasks on the DigInclude dataset showed poor performance in all trials. The SUC dataset showed the best performance when using SMO in combination with word vector spaces. Comparing the SMO classifier on the text complexity measures with and without PCA showed that the performance was largely unchanged between the two, although not using PCA gave slightly better performance.
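A minimal sketch of the PCA comparison described above, with a linear SVM trained on raw text-complexity features versus the same features after PCA; the feature matrix and labels are synthetic placeholders for the SUC-derived data.

```python
# Illustrative comparison: linear SVM on raw complexity features vs PCA-reduced features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))          # 30 hypothetical complexity measures
y = rng.integers(0, 9, size=200)        # 9 text categories, as in SUC

raw = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pca = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="linear"))
print("raw features:", cross_val_score(raw, X, y, cv=5).mean())
print("PCA features:", cross_val_score(pca, X, y, cv=5).mean())
```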
23

Schneider, Michael J. "A Study on the Efficacy of Sentiment Analysis in Author Attribution." Digital Commons @ East Tennessee State University, 2015. https://dc.etsu.edu/etd/2538.

Abstract:
The field of authorship attribution seeks to characterize an author's writing style well enough to determine whether he or she has written a text of interest. One subfield of authorship attribution, stylometry, seeks to find the literary attributes necessary to quantify an author's writing style. The research presented here sought to determine the efficacy of sentiment analysis as a new stylometric feature by comparing its performance in attributing authorship against the performance of traditional stylometric features. Experimentation with a corpus of sci-fi texts found sentiment analysis to have much lower performance in assigning authorship than the traditional stylometric features.
24

Bridal, Olle. "Named-entity recognition with BERT for anonymization of medical records." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176547.

Abstract:
Sharing data is an important part of the progress of science in many fields. In the largely deep-learning-dominated field of natural language processing, textual resources are in high demand. In certain domains, such as that of medical records, the sharing of data is limited by ethical and legal restrictions and therefore requires anonymization. The process of manual anonymization is tedious and expensive, so automated anonymization is of great value. Since medical records consist of unstructured text, pieces of sensitive information have to be identified in order to be masked for anonymization. Named-entity recognition (NER) is the subtask of information extraction in which named entities, such as person names or locations, are identified and categorized. Recently, models that leverage unsupervised training on large quantities of unlabeled training data have performed impressively on the NER task, which shows promise for their use in anonymization. In this study, a small set of medical records was annotated with named-entity tags. Because of the lack of any training data, a BERT model already fine-tuned for NER was then evaluated on the evaluation set. The aim was to find out how well the model would perform NER on medical records, and to explore the possibility of using the model to anonymize medical records. The most positive result was that the model was able to identify all person names in the dataset. The average accuracy for identifying all entity types was, however, relatively low. It is discussed that the success in identifying person names shows promise for the model's application to anonymization. However, because the overall accuracy is significantly worse than that of models fine-tuned on domain-specific data, it is suggested that there might be better methods for anonymization in the absence of relevant training data.
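A minimal sketch of entity masking with an off-the-shelf NER model via the transformers pipeline; the checkpoint is a public English example rather than the model evaluated in the thesis, and real medical records would need domain-specific handling.

```python
# Illustrative sketch: mask person names, locations and organisations found
# by a pre-trained NER model, replacing spans from the end to keep offsets valid.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
record = "Patient John Smith was admitted to St. Mary's Hospital on 3 May."

masked = record
for ent in sorted(ner(record), key=lambda e: e["start"], reverse=True):
    if ent["entity_group"] in {"PER", "LOC", "ORG"}:
        masked = masked[:ent["start"]] + f"[{ent['entity_group']}]" + masked[ent["end"]:]
print(masked)
```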
25

Wang, Run Fen. "Semantic Text Matching Using Convolutional Neural Networks." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-362134.

Full text
Abstract:
Semantic text matching is a fundamental task for many applications in Natural Language Processing (NLP). Traditional methods using term frequency-inverse document frequency (TF-IDF) to match exact words in documents have one strong drawback: TF-IDF is unable to capture semantic relations between closely related words, which leads to disappointing matching results. Neural networks have recently been used for various applications in NLP and have achieved state-of-the-art performance on many tasks. Recurrent Neural Networks (RNN) have been tested on text classification and text matching, but did not yield any remarkable results, as RNNs work more effectively on short texts than on long documents. In this paper, Convolutional Neural Networks (CNN) are applied to match texts in a semantic aspect. Word embedding representations of two texts are used as inputs to the CNN construction to extract the semantic features between the two texts, and a score is given as output indicating how certain the CNN model is that they match. The results show that after some tuning of the parameters the CNN model could produce accuracy, precision, recall and F1-scores all over 80%. This is a great improvement over the previous TF-IDF results, and further improvements could be made by using dynamic word vectors, better pre-processing of the data, generating larger and more feature-rich data sets, and further tuning of the parameters.
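A compact sketch of such a matching architecture, assuming Keras and toy dimensions; the thesis' exact network layout, hyperparameters and data are not reproduced here:

```python
# Two texts share one CNN encoder; a sigmoid head scores how likely they match.
from tensorflow.keras import layers, Model

vocab_size, embed_dim, max_len = 10000, 100, 50

def encoder():
    inp = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inp)
    x = layers.Conv1D(128, 3, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return Model(inp, x)

enc = encoder()                                    # shared weights for both texts
a = layers.Input(shape=(max_len,), dtype="int32")
b = layers.Input(shape=(max_len,), dtype="int32")
merged = layers.concatenate([enc(a), enc(b)])
hidden = layers.Dense(64, activation="relu")(merged)
score = layers.Dense(1, activation="sigmoid")(hidden)

model = Model([a, b], score)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```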
APA, Harvard, Vancouver, ISO, and other styles
26

Navér, Norah. "The past, present or future? : A comparative NLP study of Naive Bayes, LSTM and BERT for classifying Swedish sentences based on their tense." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-446793.

Full text
Abstract:
Natural language processing is a field in computer science that is becoming increasingly important. One important part of NLP is the ability to sort texts into the past, present or future, depending on when the event described took place or will take place. The objective of this thesis was to use text classification to classify Swedish sentences based on their tense, either past, present or future. Furthermore, the objective was also to compare how lemmatisation would affect the performance of the models. The problem was tackled by implementing three machine learning models on both lemmatised and non-lemmatised data. The machine learning models were Naive Bayes, LSTM and BERT. The result showed that the overall performance was affected negatively when the data was lemmatised. The best performing model was BERT with an accuracy of 96.3%. The result was useful as the best performing model had very high accuracy and performed well on newly constructed sentences.
Språkteknologi är ett område inom datavetenskap som har blivit allt viktigare. En viktig del av språkteknologi är förmågan att sortera texter till det förflutna, nuet eller framtiden, beroende på när en händelse skedde eller kommer att ske. Syftet med denna avhandling var att använda textklassificering för att klassificera svenska meningar baserat på deras tempus, antingen dåtid, nutid eller framtid. Vidare var syftet även att jämföra hur lemmatisering skulle påverka modellernas prestanda. Problemet hanterades genom att implementera tre maskininlärningsmodeller på både lemmatiserade och icke lemmatiserade data. Maskininlärningsmodellerna var Naive Bayes, LSTM och BERT. Resultatet var att den övergripande prestandan påverkades negativt när datan lemmatiserades. Den bäst presterande modellen var BERT med en träffsäkerhet på 96,3 %. Resultatet var användbart eftersom den bäst presterande modellen hade mycket hög träffsäkerhet och fungerade bra på nybyggda meningar.
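The sketch below illustrates only the simplest of the three models, a Naive Bayes baseline trained on raw versus lemmatised input; the toy English lemma table and sentences are placeholders for the Swedish data, and the LSTM and BERT models are omitted:

```python
# Compare a bag-of-words Naive Bayes classifier on raw vs. lemmatised text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["she walked home", "she walks home", "she will walk home"] * 50
labels = ["past", "present", "future"] * 50

toy_lemmas = {"walked": "walk", "walks": "walk"}
def lemmatise(t):
    return " ".join(toy_lemmas.get(w, w) for w in t.split())

nb = make_pipeline(CountVectorizer(), MultinomialNB())
print("raw       :", cross_val_score(nb, texts, labels, cv=3).mean())
print("lemmatised:", cross_val_score(nb, [lemmatise(t) for t in texts], labels, cv=3).mean())
```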
APA, Harvard, Vancouver, ISO, and other styles
27

Woldemariam, Yonas Demeke. "Natural language processing in cross-media analysis." Licentiate thesis, Umeå universitet, Institutionen för datavetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-147640.

Full text
Abstract:
A cross-media analysis framework is an integrated multi-modal platform where a media resource containing different types of data such as text, images, audio and video is analyzed with metadata extractors, working jointly to contextualize the media resource. It generally provides cross-media analysis and automatic annotation, metadata publication and storage, and search and recommendation services. For on-line content providers, such services allow them to semantically enhance a media resource with the extracted metadata representing the hidden meanings and make it more efficiently searchable. Within the architecture of such frameworks, Natural Language Processing (NLP) infrastructures cover a substantial part. The NLP infrastructures include text analysis components such as a parser, named entity extraction and linking, sentiment analysis and automatic speech recognition. Since NLP tools and techniques are originally designed to operate in isolation, integrating them in cross-media frameworks and analyzing textual data extracted from multimedia sources is very challenging. In particular, the text extracted from audio-visual content lacks linguistic features that potentially provide important clues for text analysis components. Thus, there is a need to develop various techniques to meet the requirements and design principles of the frameworks. In our thesis, we explore developing various methods and models satisfying text and speech analysis requirements posed by cross-media analysis frameworks. The developed methods allow the frameworks to extract linguistic knowledge of various types and predict various information such as sentiment and competence. We also attempt to enhance the multilingualism of the frameworks by designing an analysis pipeline that includes speech recognition, transliteration and named entity recognition for Amharic, which also makes Amharic content on the web more efficiently accessible. The method can potentially be extended to support other under-resourced languages.
APA, Harvard, Vancouver, ISO, and other styles
28

Fang, Yimai. "Proposition-based summarization with a coherence-driven incremental model." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/287468.

Full text
Abstract:
Summarization models which operate on meaning representations of documents have been neglected in the past, although they are a very promising and interesting class of methods for summarization and text understanding. In this thesis, I present one such summarizer, which uses the proposition as its meaning representation. My summarizer is an implementation of Kintsch and van Dijk's model of comprehension, which uses a tree of propositions to represent the working memory. The input document is processed incrementally in iterations. In each iteration, new propositions are connected to the tree under the principle of local coherence, and then a forgetting mechanism is applied so that only a few important propositions are retained in the tree for the next iteration. A summary can be generated using the propositions which are frequently retained. Originally, this model was only played through by hand by its inventors using human-created propositions. In this work, I turned it into a fully automatic model using current NLP technologies. First, I create propositions by obtaining and then transforming a syntactic parse. Second, I have devised algorithms to numerically evaluate alternative ways of adding a new proposition, as well as to predict necessary changes in the tree. Third, I compared different methods of modelling local coherence, including coreference resolution, distributional similarity, and lexical chains. In the first group of experiments, my summarizer realizes summary propositions by sentence extraction. These experiments show that my summarizer outperforms several state-of-the-art summarizers. The second group of experiments concerns abstractive generation from propositions, which is a collaborative project. I have investigated the option of compressing extracted sentences, but generation from propositions has been shown to provide better information packaging.
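A deliberately simplified sketch of the retain-and-forget cycle described above; the real model builds a coherence-driven proposition tree, whereas here a toy word-overlap score and a flat memory list stand in for both:

```python
# Keep only the most "coherent" propositions in working memory each iteration,
# and summarize with the ones retained most often.
from collections import Counter

def summarise(propositions, coherence, memory_size=3):
    memory, retained = [], Counter()
    for prop in propositions:                  # one iteration per incoming proposition
        memory.append(prop)
        memory.sort(key=lambda p: coherence(p, memory), reverse=True)
        memory = memory[:memory_size]          # forgetting mechanism
        retained.update(memory)
    return [p for p, _ in retained.most_common(memory_size)]

def overlap(p, mem):                           # toy stand-in for local coherence
    return sum(len(set(p.split()) & set(q.split())) for q in mem if q != p)

props = ["the cat sat on the mat", "the cat slept", "dogs bark loudly", "the cat purred"]
print(summarise(props, overlap))
```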
APA, Harvard, Vancouver, ISO, and other styles
29

Erik, Cambria. "Application of common sense computing for the development of a novel knowledge-based opinion mining engine." Thesis, University of Stirling, 2011. http://hdl.handle.net/1893/6497.

Full text
Abstract:
The ways people express their opinions and sentiments have radically changed in the past few years thanks to the advent of social networks, web communities, blogs, wikis and other online collaborative media. The distillation of knowledge from this huge amount of unstructured information can be a key factor for marketers who want to create an image or identity in the minds of their customers for their product, brand, or organisation. These online social data, however, remain hardly accessible to computers, as they are specifically meant for human consumption. The automatic analysis of online opinions, in fact, involves a deep understanding of natural language text by machines, from which we are still very far. Hitherto, online information retrieval has been mainly based on algorithms relying on the textual representation of web-pages. Such algorithms are very good at retrieving texts, splitting them into parts, checking the spelling and counting their words. But when it comes to interpreting sentences and extracting meaningful information, their capabilities are known to be very limited. Existing approaches to opinion mining and sentiment analysis, in particular, can be grouped into three main categories: keyword spotting, in which text is classified into categories based on the presence of fairly unambiguous affect words; lexical affinity, which assigns arbitrary words a probabilistic affinity for a particular emotion; statistical methods, which calculate the valence of affective keywords and word co-occurrence frequencies on the basis of a large training corpus. Early works aimed to classify entire documents as containing overall positive or negative polarity, or rating scores of reviews. Such systems were mainly based on supervised approaches relying on manually labelled samples, such as movie or product reviews where the opinionist's overall positive or negative attitude was explicitly indicated. However, opinions and sentiments do not occur only at document level, nor are they limited to a single valence or target. Contrary or complementary attitudes toward the same topic or multiple topics can be present across the span of a document. In more recent works, text analysis granularity has been taken down to segment and sentence level, e.g., by using the presence of opinion-bearing lexical items (single words or n-grams) to detect subjective sentences, or by exploiting association rule mining for a feature-based analysis of product reviews. These approaches, however, are still far from being able to infer the cognitive and affective information associated with natural language as they mainly rely on knowledge bases that are still too limited to efficiently process text at sentence level. In this thesis, common sense computing techniques are further developed and applied to bridge the semantic gap between word-level natural language data and the concept-level opinions conveyed by these. In particular, the ensemble application of graph mining and multi-dimensionality reduction techniques on two common sense knowledge bases was exploited to develop a novel intelligent engine for open-domain opinion mining and sentiment analysis. The proposed approach, termed sentic computing, performs a clause-level semantic analysis of text, which allows the inference of both the conceptual and emotional information associated with natural language opinions and, hence, a more efficient passage from (unstructured) textual information to (structured) machine-processable data.
The engine was tested on three different resources, namely a Twitter hashtag repository, a LiveJournal database and a PatientOpinion dataset, and its performance compared both with results obtained using standard sentiment analysis techniques and using different state-of-the-art knowledge bases such as Princeton’s WordNet, MIT’s ConceptNet and Microsoft’s Probase. Differently from most currently available opinion mining services, the developed engine does not base its analysis on a limited set of affect words and their co-occurrence frequencies, but rather on common sense concepts and the cognitive and affective valence conveyed by these. This allows the engine to be domain-independent and, hence, to be embedded in any opinion mining system for the development of intelligent applications in multiple fields such as Social Web, HCI and e-health. Looking ahead, the combined novel use of different knowledge bases and of common sense reasoning techniques for opinion mining proposed in this work, will, eventually, pave the way for development of more bio-inspired approaches to the design of natural language processing systems capable of handling knowledge, retrieving it when necessary, making analogies and learning from experience.
APA, Harvard, Vancouver, ISO, and other styles
30

Johansson, Oskar. "Parafrasidentifiering med maskinklassificerad data : utvärdering av olika metoder." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167039.

Full text
Abstract:
This thesis investigates how well the language model BERT and a MaLSTM architecture identify paraphrases in the 'Microsoft Paraphrase Research Corpus' (MPRC) when trained on automatically identified paraphrases from the 'Paraphrase Database' (PPDB). The methods are compared against each other to determine which performs best, and the approach of training on machine-classified data for use on human-classified data is evaluated in relation to other classifications of the same dataset. The sentence pairs used to train the models are taken from the highest-ranked paraphrases in PPDB and from a generation method that creates non-paraphrases out of the same dataset. The results show that BERT is capable of identifying some paraphrases in MPRC, whereas the MaLSTM architecture is not, despite being able to distinguish paraphrases from non-paraphrases during training. Both BERT and MaLSTM performed worse at identifying paraphrases in MPRC than models such as StructBERT, which has been trained and evaluated on the same dataset. Reasons why MaLSTM does not manage the task are discussed; above all, the sentences in the non-paraphrase training pairs are too dissimilar from each other compared with how they appear in MPRC. Finally, the importance of further research into how machine-derived paraphrases can be used in paraphrase-related research is discussed.
APA, Harvard, Vancouver, ISO, and other styles
31

Kantzola, Evangelia. "Extractive Text Summarization of Greek News Articles Based on Sentence-Clusters." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-420291.

Full text
Abstract:
This thesis introduces an extractive summarization system for Greek news articles based on sentence clustering. The main purpose of the paper is to evaluate the impact of three different types of text representation, Word2Vec embeddings, TF-IDF and LASER embeddings, on the summarization task. By taking these techniques into account, we build three different versions of the initial summarizer. Moreover, we create a new corpus of gold standard summaries to evaluate them against the system summaries. The new collection of reference summaries is merged with a part of the MultiLing Pilot 2011 in order to constitute our main dataset. We perform both automatic and human evaluation. Our automatic ROUGE results suggest that System A which employs Average Word2Vec vectors to create sentence embeddings, outperforms the other two systems by yielding higher ROUGE-L F-scores. Contrary to our initial hypotheses, System C using LASER embeddings fails to surpass even the Word2Vec embeddings method, showing sometimes a weak sentence representation. With regard to the scores obtained by the manual evaluation task, we observe that System A using Average Word2Vec vectors and System C with LASER embeddings tend to produce more coherent and adequate summaries than System B employing TF-IDF. Furthermore, the majority of system summaries are rated very high with respect to non-redundancy. Overall, System A utilizing Average Word2Vec embeddings performs quite successfully according to both evaluations.
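A small sketch of the System A-style pipeline, assuming gensim and scikit-learn; the corpus, vector sizes and cluster count are toy values and no ROUGE evaluation is included:

```python
# Average Word2Vec vectors per sentence, cluster them, and extract one sentence per cluster.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = ["the economy grew last year", "growth continued this quarter",
             "the team won the final", "fans celebrated the victory"]
tokens = [s.split() for s in sentences]
w2v = Word2Vec(tokens, vector_size=50, min_count=1, seed=1)

def embed(ts):
    return np.mean([w2v.wv[t] for t in ts], axis=0)

X = np.vstack([embed(ts) for ts in tokens])
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

summary = []
for c in range(2):                     # pick the sentence closest to each centroid
    idx = np.where(km.labels_ == c)[0]
    best = idx[np.argmin(np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1))]
    summary.append(sentences[best])
print(summary)
```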
APA, Harvard, Vancouver, ISO, and other styles
32

Toska, Marsida. "A Rule-Based Normalization System for Greek Noisy User-Generated Text." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-424777.

Full text
Abstract:
The ever-growing usage of social media platforms generates vast amounts of textual data daily, which could potentially serve as a great source of information. Therefore, mining user-generated data for commercial, academic, or other purposes has already attracted the interest of the research community. However, the informal writing which often characterizes online user-generated texts poses a challenge for automatic text processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing method, which in short is the task of converting non-standard word forms into canonical ones. The present work aims to contribute to this field by developing a rule-based normalization system for Greek Tweets. We perform an analysis of the categories of the out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules, which we combine with edit distance (Levenshtein distance approach) to tackle noise in the cases under scope. To evaluate the performance of the system we perform both an intrinsic and an extrinsic evaluation, in order to explore the effect of normalization on part-of-speech tagging. The results of the intrinsic evaluation suggest that our system has an accuracy of approx. 95% compared to approx. 81% for the baseline. In the extrinsic evaluation, a boost of approx. 8% in tagging performance is observed when the text has been preprocessed through lexical normalization.
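A minimal sketch of the edit-distance fallback, with English toy tokens in place of Greek and without the hand-crafted rules that precede this step in the thesis:

```python
# Map out-of-vocabulary tokens to the closest lexicon entry by Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

lexicon = {"thanks", "tomorrow", "great"}

def normalise(token: str, max_dist: int = 2) -> str:
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_dist else token

print([normalise(t) for t in "thanx tmrrow great".split()])   # ['thanks', 'tomorrow', 'great']
```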
APA, Harvard, Vancouver, ISO, and other styles
33

Shockley, Darla Magdalene. "Email Thread Summarization with Conditional Random Fields." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1268159269.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Monsen, Julius. "Building high-quality datasets for abstractive text summarization : A filtering‐based method applied on Swedish news articles." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176352.

Full text
Abstract:
With an increasing amount of information on the internet, automatic text summarization could potentially make content more readily available for a larger variety of people. Training and evaluating text summarization models require datasets of sufficient size and quality. Today, most such datasets are in English, and for minor languages such as Swedish, it is not easy to obtain corresponding datasets with handwritten summaries. This thesis proposes methods for compiling high-quality datasets suitable for abstractive summarization from a large amount of noisy data through characterization and filtering. The data used consists of Swedish news articles and their preambles, which are here used as summaries. Different filtering techniques are applied, yielding five different datasets. Furthermore, summarization models are implemented by warm-starting an encoder-decoder model with BERT checkpoints and fine-tuning it on the different datasets. The fine-tuned models are evaluated with ROUGE metrics and BERTScore. All models achieve significantly better results when evaluated on filtered test data than when evaluated on unfiltered test data. Moreover, the models trained on the most filtered, and smallest, dataset achieve the best results on the filtered test data. The trade-off between dataset size and quality and other methodological implications of the data characterization, the filtering and the model implementation are discussed, leading to suggestions for future research.
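A sketch of the warm-starting step using the Hugging Face transformers library; the Swedish BERT checkpoint name is an example, the fine-tuning loop on the filtered datasets is omitted, and before fine-tuning the generated output is not meaningful:

```python
# Warm-start an encoder-decoder summarizer from BERT checkpoints.
from transformers import AutoTokenizer, EncoderDecoderModel

ckpt = "KB/bert-base-swedish-cased"          # example pre-trained Swedish BERT
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ckpt, ckpt)

# generation settings the decoder needs
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

article = "En lång nyhetsartikel om något aktuellt ämne ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs.input_ids, max_length=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```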
APA, Harvard, Vancouver, ISO, and other styles
35

Lameris, Harm. "Homograph Disambiguation and Diacritization for Arabic Text-to-Speech Using Neural Networks." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-446509.

Full text
Abstract:
Pre-processing Arabic text for Text-to-Speech (TTS) systems poses major challenges, as Arabic omits short vowels in writing. This omission leads to a large number of homographs, and means that Arabic text needs to be diacritized to disambiguate these homographs, in order to be matched up with the intended pronunciation. Diacritizing Arabic has generally been achieved by using rule-based, statistical, or hybrid methods that combine rule-based and statistical methods. Recently, diacritization methods involving deep learning have shown promise in reducing error rates. These deep-learning methods are not yet commonly used in TTS engines, however. To examine neural diacritization methods for use in TTS engines, we normalized and pre-processed a version of the Tashkeela corpus, a large diacritized corpus containing largely Classical Arabic texts, for TTS purposes. We then trained and tested three state-of-the-art Recurrent-Neural-Network-based models on this data set. Additionally we tested these models on the Wiki News corpus, a test set that contains Modern Standard Arabic (MSA) news articles and thus more closely resembles most TTS queries. The models were evaluated by comparing the Diacritic Error Rate (DER) and Word Error Rate (WER) achieved for each data set to one another and to the DER and WER reported in the original papers. Moreover, the per-diacritic accuracy was examined, and a manual evaluation was performed. For the Tashkeela corpus, all models achieved a lower DER and WER than reported in the original papers. This was largely the result of using more training data in addition to the TTS pre-processing steps that were performed on the data. For the Wiki News corpus, the error rates were higher, largely due to the domain gap between the data sets. We found that for both data sets the models overfit on common patterns and the most common diacritic. For the Wiki News corpus the models struggled with Named Entities and loanwords. Purely neural models generally outperformed the model that combined deep learning with rule-based and statistical corrections. These findings highlight the usability of deep learning methods for Arabic diacritization in TTS engines as well as the need for diacritized corpora that are more representative of Modern Standard Arabic.
APA, Harvard, Vancouver, ISO, and other styles
36

Sidås, Albin, and Simon Sandberg. "Conversational Engine for Transportation Systems." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-176810.

Full text
Abstract:
Today's communication between operators and professional drivers takes place through direct conversations between the parties. This thesis project explores the possibility of supporting the operators in classifying the topic of incoming communications, and identifying which entities are affected, through the use of named entity recognition and topic classification. Using a synthetic training dataset, a NER model and a topic classification model were developed and evaluated, achieving F1-scores of 71.4 and 61.8 respectively. These results were explained by a low variance in the synthetic dataset in comparison to a transcribed dataset from the real world, which included anomalies not represented in the synthetic dataset. The aforementioned models were integrated into the dialogue framework Emora to seamlessly handle the back-and-forth communication and to generate responses.
APA, Harvard, Vancouver, ISO, and other styles
37

Svedberg, Jonatan, and George Shmas. "Effekten av textaugmenteringsstrategier på träffsäkerhet, F1-värde och viktat F1-värde." Thesis, KTH, Hälsoinformatik och logistik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-296550.

Full text
Abstract:
Att utveckla en sofistikerad chatbotlösning kräver stora mängder textdata för att kunna anpassa lösningen till en specifik domän. Att manuellt skapa en komplett uppsättning textdata, specialanpassat för den givna domänen och innehållandes ett stort antal varierande meningar som en människa kan tänkas yttra, är ett enormt tidskrävande arbete. För att kringgå detta tillämpas dataaugmentering för att generera mer data utifrån en mindre uppsättning redan existerande textdata. Softronic AB vill undersöka alternativa strategier för dataaugmentering med målet att eventuellt ersätta den nuvarande lösningen med en mer vetenskapligt underbyggd sådan. I detta examensarbete har prototypmodeller utvecklats för att jämföra och utvärdera effekten av olika textaugmenteringsstrategier. Resultatet av genomförda experiment med prototypmodellerna visar att augmentering genom synonymutbyten med en domänanpassad synonymordlista, presenterade märkbart förbättrade effekter på förmågan hos en NLU-modell att korrekt klassificera data, gentemot övriga utvärderade strategier. Vidare indikerar resultatet att ett samband föreligger mellan den strukturella variationsgraden av det augmenterade datat och de tillämpade språkparens semantiska likhetsgrad under tillbakaöversättningar.
Developing a sophisticated chatbot solution requires large amounts of text data to be able to adapt the solution to a specific domain. Manually creating a complete set of text data, specially adapted for the given domain, and containing a large number of varying sentences that a human conceivably can express, is an exceptionally time-consuming task. To circumvent this, data augmentation is applied to generate more data based on a smaller set of already existing text data. Softronic AB wants to investigate alternative strategies for data augmentation with the aim of possibly replacing the current solution with a more scientifically substantiated one. In this thesis, prototype models have been developed to compare and evaluate the effect of different text augmentation strategies. The results of conducted experiments with the prototype models show that augmentation through synonym swaps with a domain-adapted thesaurus, presented noticeably improved effects on the ability of an NLU-model to correctly classify data, compared to other evaluated strategies. Furthermore, the result indicates that there is a relationship between the structural degree of variation of the augmented data and the applied language pair's semantic degree of similarity during back-translations.
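A small sketch of the synonym-swap strategy with a made-up domain thesaurus; it is not Softronic's implementation and leaves out the back-translation experiments:

```python
# Generate augmented variants by replacing random words with domain synonyms.
import random

thesaurus = {"invoice": ["bill"], "error": ["fault", "problem"], "send": ["submit"]}

def augment(sentence: str, n_variants: int = 3, seed: int = 0):
    rng = random.Random(seed)
    words = sentence.split()
    variants = set()
    for _ in range(n_variants * 5):            # oversample, keep unique variants
        idxs = [i for i, w in enumerate(words) if w in thesaurus]
        if not idxs:
            break
        i = rng.choice(idxs)
        new = words.copy()
        new[i] = rng.choice(thesaurus[words[i]])
        variants.add(" ".join(new))
        if len(variants) >= n_variants:
            break
    return sorted(variants)

print(augment("please send the invoice again there is an error"))
```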
APA, Harvard, Vancouver, ISO, and other styles
38

Cheung, Luciana Montera. "SimAffling um ambiente computacional para suporte e simulação do processo de DNA shuffling." Universidade Federal de São Carlos, 2008. https://repositorio.ufscar.br/handle/ufscar/249.

Full text
Abstract:
The molecular evolution of living organisms is a slow process that occurs over the years, producing mutations and recombinations in the genetic material, i.e. in the DNA. The mutations can occur as nucleotide removals, insertions and/or substitutions in the DNA chain. Directed molecular evolution is an in vitro process that tries to improve biological functions of specific molecules by producing mutations in the molecule's genetic material, mimicking the natural process of evolution. Many techniques that simulate in vitro molecular evolution, among them DNA shuffling, have been used aiming to improve specific properties of a variety of commercially important products such as pharmaceutical proteins, vaccines and enzymes used in industry. The original DNA shuffling methodology can be summarized by the following steps: 1) selection of the parental sequences; 2) random fragmentation of the parental sequences by an enzyme; 3) repeated cycles of PCR (Polymerase Chain Reaction) in order to reassemble the DNA fragments produced in the previous step; 4) PCR amplification of the reassembled sequences obtained in step 3). The success of the DNA shuffling technique can be measured by the number of recombinant molecules found in the resulting DNA shuffling library, since these recombinant molecules potentially have improved functionality in relation to their parents, as their sequences may accumulate beneficial mutations originating from distinct parental sequences. Nowadays only a few models can be found in the literature whose purpose is to suggest optimizations to this process, aiming at increasing the genetic diversity of the DNA shuffling library obtained. This research work presents a comparative study of four models used to predict/estimate DNA shuffling results. In addition, a computational tool for simulating the DNA shuffling process is proposed and implemented in an environment where other functionalities related to the analysis of the parental sequences and of the sequences resulting from the DNA shuffling library are also implemented.
A Evolução Molecular dos organismos vivos é um processo lento que ocorre ao longo dos anos e diz respeito às mutações e recombinações sofridas por um determinado organismo em seu material genético, ou seja, em seu DNA. As mutações ocorrem na forma de remoções, inserções e/ou substituições de nucleotídeos ao logo da cadeia de DNA. A Evolução Molecular Direta é um processo laboratorial, ou seja, in vitro, que visa melhorar funções biológicas específicas de moléculas por meio de mutações/recombinações em seu material genético, imitando o processo natural de evolução. Diversas técnicas que simulam a evolução molecular em laboratório, entre elas a técnica de DNA shuffling, têm sido amplamente utilizadas na tentativa de melhorar determinadas propriedades de uma variedade de produtos comercialmente importantes como vacinas, enzimas industriais e substâncias de interesse famacológico. A metodologia original de DNA shuffling pode ser sumarizada pelas seguintes etapas: 1) seleção dos genes de interesse, dito parentais; 2) fragmentação enzimática dos genes; 3) ciclos de PCR (Polymerase Chain Reaction), para que ocorra a remontagem dos fragmentos; 4) amplificação das seqüências remontadas cujo tamanho é igual a dos parentais. O sucesso ou não da técnica de DNA shuffling pode ser medido pelo número de moléculas recombinantes encontradas na biblioteca de DNA shuffling obtida, uma vez que estas podem apresentar melhorias funcionais em relação aos parentais pelo fato de, possivelmente, acumularem em sua seqüência mutações benéficas presentes em parentais distintos. Atualmente podem ser encontradas na literatura algumas poucas modelagens computacionais capazes de sugerir otimizações para o processo, com vistas em aumentar a diversidade genética da biblioteca resultante. O presente trabalho apresenta um estudo comparativo de quatros modelos para predição/estimativa de resultados de experimentos de DNA shuffling encontrados na literatura bem como a proposta e implementação de uma ferramenta computacional de simulação para o processo de DNA shuffling. A ferramenta de simulação foi implementada em um ambiente que disponibiliza outras funcionalidades referentes à análise das seqüências a serem submetidas ao shuffling bem como ferramentas para análise das seqüências resultantes do processo.
APA, Harvard, Vancouver, ISO, and other styles
39

Akrin, Christoffer, and Simon Tham. "A Natural Language Interface for Querying Linked Data." Thesis, Karlstads universitet, Institutionen för matematik och datavetenskap (from 2013), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-78921.

Full text
Abstract:
The thesis introduces a proof of concept idea that could spark great interest from many industries. The idea consists of a remote Natural Language Interface (NLI), for querying Knowledge Bases (KBs). The system applies natural language technology tools provided by the Stanford CoreNLP, and queries KBs with the use of the query language SPARQL. Natural Language Processing (NLP) is used to analyze the semantics of a question written in natural language, and generates relational information about the question. With correctly defined relations, the question can be queried on KBs containing relevant Linked Data. The Linked Data follows the Resource Description Framework (RDF) model by expressing relations in the form of semantic triples: subject-predicate-object. With our NLI, any KB can be understood semantically. By providing correct training data, the AI can learn to understand the semantics of the RDF data stored in the KB. The ability to understand the RDF data allows for the process of extracting relational information from questions about the KB. With the relational information, questions can be translated to SPARQL and be queried on the KB.
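A toy sketch of the final querying step with rdflib: a subject-predicate pair, assumed to come out of the NLP analysis, is turned into a SPARQL query over a few hand-made triples (the namespace and data are invented for illustration):

```python
# Query a tiny RDF graph with SPARQL built from an extracted relation.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Sweden, EX.capital, Literal("Stockholm")))
g.add((EX.Norway, EX.capital, Literal("Oslo")))

subject, predicate = "Sweden", "capital"     # assumed output of the NLP step
query = f"SELECT ?value WHERE {{ <{EX[subject]}> <{EX[predicate]}> ?value . }}"

for row in g.query(query):
    print(row.value)                         # -> Stockholm
```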
APA, Harvard, Vancouver, ISO, and other styles
40

Svensson, Henrik, and Kalle Lindqvist. "Rättssäker Textanalys." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20396.

Full text
Abstract:
Digital språkbehandling (natural language processing) är ett forskningsområde inom vilket det ständigt görs nya framsteg. En betydande del av den textanalys som sker inom detta fält har som mål att uppnå en fullgod tillämpning kring dialogen mellan människa och dator. I denna studie vill vi dock fokusera på den inverkan digital språkbehandling kan ha på den mänskliga inlärningsprocessen. Vårt praktiska testområde har också en framtida inverkan på en av de mest grundläggande förutsättningarna för ett rättssäkert samhälle, nämligen den polisiära rapportskrivningen. Genom att skapa en teoretisk idébas som förenar viktiga aspekter av digital språkbehandling och polisrapportskrivning samt därefter implementera dem i en pedagogisk webbplattform ämnad för polisstudenter är vi av uppfattningen att vår forskning tillför något nytt inom det datavetenskapliga respektive det samhällsvetenskapliga fälten. Syftet med arbetet är att verka som de första stegen mot en webbapplikation som understödjer svensk polisdokumentation.
Natural language processing is a research area in which new advances are constantly being made. A significant portion of the text analyses that take place in this field have the aim of achieving a satisfactory application in the dialogue between human and computer. In this study, we instead want to focus on what impact natural language processing can have on the human learning process. Simultaneously, the context for our research has a future impact on one of the most basic principles for a legally secure society, namely the writing of the police report. By creating a theoretical foundation of ideas that combines aspects of natural language processing as well as official police report writing, and then implementing them in an educational web platform intended for police students, we are of the opinion that our research adds something new to the computer science and sociological fields. The purpose of this work is to act as the first steps towards a web application that supports the Swedish police documentation.
APA, Harvard, Vancouver, ISO, and other styles
41

Holmer, Daniel. "Context matters : Classifying Swedish texts using BERT's deep bidirectional word embeddings." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166304.

Full text
Abstract:
When classifying texts using a linear classifier, the texts are commonly represented as feature vectors. Previous methods to represent features as vectors have been unable to capture the context of individual words in the texts, in theory leading to a poor representation of natural language. Bidirectional Encoder Representations from Transformers (BERT), uses a multi-headed self-attention mechanism to create deep bidirectional feature representations, able to model the whole context of all words in a sequence. A BERT model uses a transfer learning approach, where it is pre-trained on a large amount of data and can be further fine-tuned for several down-stream tasks. This thesis uses one multilingual, and two dedicated Swedish BERT models, for the task of classifying Swedish texts as of either easy-to-read or standard complexity in their respective domains. The performance on the text classification task using the different models is then compared both with feature representation methods used in earlier studies, as well as with the other BERT models. The results show that all models performed better on the classification task than the previous methods of feature representation. Furthermore, the dedicated Swedish models show better performance than the multilingual model, with the Swedish model pre-trained on more diverse data outperforming the other.
APA, Harvard, Vancouver, ISO, and other styles
42

Olsson, Fredrik. "Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora." Doctoral thesis, SICS, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:ri:diva-22935.

Full text
Abstract:
This thesis describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. The reason for working with documents, as opposed to for instance sentences or phrases, is that the BootMark method is concerned with the creation of corpora. The claim made in the thesis is that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance, than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The intention is then to use the created named entity recognizer as a pre-tagger and thus eventually turn the manual annotation process into one in which the annotator reviews system-suggested annotations rather than creating new ones from scratch. The BootMark method consists of three phases: (1) Manual annotation of a set of documents; (2) Bootstrapping – active machine learning for the purpose of selecting which document to annotate next; (3) The remaining unannotated documents of the original corpus are marked up using pre-tagging with revision. Five emerging issues are identified, described and empirically investigated in the thesis. Their common denominator is that they all depend on the realization of the named entity recognition task, and as such, require the context of a practical setting in order to be properly addressed. The emerging issues are related to: (1) the characteristics of the named entity recognition task and the base learners used in conjunction with it; (2) the constitution of the set of documents annotated by the human annotator in phase one in order to start the bootstrapping process; (3) the active selection of the documents to annotate in phase two; (4) the monitoring and termination of the active learning carried out in phase two, including a new intrinsic stopping criterion for committee-based active learning; and (5) the applicability of the named entity recognizer created during phase two as a pre-tagger in phase three. The outcomes of the empirical investigations concerning the emerging issues support the claim made in the thesis. The results also suggest that while the recognizer produced in phases one and two is as useful for pre-tagging as a recognizer created from randomly selected documents, the applicability of the recognizer as a pre-tagger is best investigated by conducting a user study involving real annotators working on a real named entity recognition task.
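A schematic, heavily simplified sketch of the phase-two selection step: a small committee of classifiers, diversified here by nothing more than noise, votes on the unannotated documents and the most disagreed-upon one is proposed for annotation next. Everything in it (features, committee construction, data) is a toy stand-in for BootMark:

```python
# Query-by-committee style selection of the next document to annotate.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["order confirmed by Anna", "meeting in Berlin", "invoice sent", "trip to Oslo"]
labels = {0: 1, 1: 0}                         # seed annotations from phase one
X = TfidfVectorizer().fit_transform(docs).toarray()

def committee_disagreement(idx):
    votes = []
    train = list(labels)
    for seed in range(3):                     # committee = noise-perturbed copies
        rng = np.random.default_rng(seed)
        Xt = X[train] + rng.normal(0, 0.01, X[train].shape)
        clf = LogisticRegression().fit(Xt, [labels[i] for i in train])
        votes.append(clf.predict(X[idx:idx + 1])[0])
    return len(set(votes))                    # more distinct votes = more disagreement

unlabelled = [i for i in range(len(docs)) if i not in labels]
next_doc = max(unlabelled, key=committee_disagreement)
print("annotate next:", docs[next_doc])
```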
APA, Harvard, Vancouver, ISO, and other styles
43

Trembczyk, Max. "Answer Triggering Mechanisms in Neural Reading Comprehension-based Question Answering Systems." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-390840.

Full text
Abstract:
We implement a state-of-the-art question answering system based on Convolutional Neural Networks and Attention Mechanisms and include four different variants of answer triggering that have been discussed in recent literature. The mechanisms are included at different places in the architecture and work with different information and mechanisms. We train, develop and test our models on the popular SQuAD data set for Question Answering based on Reading Comprehension, which has in its latest version been equipped with additional non-answerable questions that have to be detected by the systems. We test the models against baselines and against each other and provide an extensive evaluation both on the general question answering task and on the explicit performance of the answer triggering mechanisms. We show that the answer triggering mechanisms all clearly improve the model over the baseline without answer triggering, by as much as 19.6% to 31.3% depending on the model and the metric. The best performance in general question answering is achieved by a model that we call Candidate:No, which treats the possibility that no answer can be found in the document as just another answer candidate, instead of having an additional decision step at some place in the model's architecture as in the other three mechanisms. The performance on detecting the non-answerable questions is very similar in three of the four mechanisms, while one performs notably worse. We give suggestions on which approach to use when a more or less conservative approach is desired, and discuss possible future developments.
APA, Harvard, Vancouver, ISO, and other styles
44

Pettersson, Eva. "Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction." Doctoral thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-269753.

Full text
Abstract:
Historical text constitutes a rich source of information for historians and other researchers in humanities. Many texts are however not available in an electronic format, and even if they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user. An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial, considering the lack of linguistically annotated historical data combined with the high variability in historical text, making it hard to train NLP tools specifically aimed at analysing historical text. In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting.
APA, Harvard, Vancouver, ISO, and other styles
45

Capshaw, Riley. "Relation Classification using Semantically-Enhanced Syntactic Dependency Paths : Combining Semantic and Syntactic Dependencies for Relation Classification using Long Short-Term Memory Networks." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-153877.

Full text
Abstract:
Many approaches to solving tasks in the field of Natural Language Processing (NLP) use syntactic dependency trees (SDTs) as a feature to represent the latent nonlinear structure within sentences. Recently, work in parsing sentences to graph-based structures which encode semantic relationships between words—called semantic dependency graphs (SDGs)—has gained interest. This thesis seeks to explore the use of SDGs in place of and alongside SDTs within a relation classification system based on long short-term memory (LSTM) neural networks. Two methods for handling the information in these graphs are presented and compared between two SDG formalisms. Three new relation extraction system architectures have been created based on these methods and are compared to a recent state-of-the-art LSTM-based system, showing comparable results when semantic dependencies are used to enhance syntactic dependencies, but with significantly fewer training parameters.
APA, Harvard, Vancouver, ISO, and other styles
46

Öhrström, Fredrik. "Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-153873.

Full text
Abstract:
Textual duplicates can be hard to detect as they differ in words but have similar semantic meaning. At Etteplan, a technical documentation company, many writers accidentally re-write existing instructions explaining procedures. These "duplicates" clutter the database. This is not desired because it is duplicate work. The condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst and to estimate how many duplicates there are. The corpus is small, but written in a controlled natural language called Simplified Technical English. The method uses document embeddings from doc2vec, clustering with HDBSCAN*, and validation using the Density-Based Clustering Validation (DBCV) index to chart the problems. A survey was sent out to try to determine a threshold value for when documents stop being duplicates, and using this value a theoretical duplicate count was calculated.
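A sketch of the embedding-and-clustering idea with gensim and the hdbscan package; the four toy instructions stand in for the Etteplan corpus, and on such a small sample the clusters are mostly illustrative:

```python
# Infer doc2vec vectors and group near-duplicate texts with HDBSCAN (-1 = noise).
import numpy as np
import hdbscan
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["remove the cover and check the filter",
         "check the filter after removing the cover",
         "tighten all bolts to the given torque",
         "torque all bolts to specification"]
tagged = [TaggedDocument(t.split(), [i]) for i, t in enumerate(texts)]
model = Doc2Vec(tagged, vector_size=32, min_count=1, epochs=100, seed=1)

X = np.vstack([model.dv[i] for i in range(len(texts))])
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(X)
for text, label in zip(texts, labels):
    print(label, text)
```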
APA, Harvard, Vancouver, ISO, and other styles
47

Mann, Jasleen Kaur. "Semantic Topic Modeling and Trend Analysis." Thesis, Linköpings universitet, Statistik och maskininlärning, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-173924.

Full text
Abstract:
This thesis focuses on finding an end-to-end unsupervised solution to solve a two-step problem of extracting semantically meaningful topics and trend analysis of these topics from a large temporal text corpus. To achieve this, the focus is on using the latest developments in Natural Language Processing (NLP) related to pre-trained language models like Google's Bidirectional Encoder Representations from Transformers (BERT) and other BERT-based models. These transformer-based pre-trained language models provide word and sentence embeddings based on the context of the words. The results are then compared with traditional machine learning techniques for topic modeling. This is done to evaluate if the quality of topic models has improved and how dependent the techniques are on manually defined model hyperparameters and data preprocessing. These topic models provide a good mechanism for summarizing and organizing a large text corpus and give an overview of how the topics evolve with time. In the context of research publications or scientific journals, such analysis of the corpus can give an overview of research/scientific interest areas and how these interests have evolved over the years. The dataset used for this thesis is research articles and papers from a journal, namely 'Journal of Cleaner Productions'. This journal has more than 24000 research articles at the time of working on this project. We started with implementing Latent Dirichlet Allocation (LDA) topic modeling. In the next step, we implemented LDA along with document clustering to get topics within these clusters. This gave us an idea of the dataset and also gave us a benchmark. After having some base results, we explored transformer-based contextual word and sentence embeddings to evaluate if this leads to more meaningful, contextual, and semantic topics. For document clustering, we have used K-means clustering. In this thesis, we also discuss methods to optimally visualize the topics and the trend changes of these topics over the years. Finally, we conclude with a method for leveraging contextual embeddings using BERT and Sentence-BERT to solve this problem and achieve semantically meaningful topics. We also discuss the results from traditional machine learning techniques and their limitations.
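The sketch below shows the embedding-plus-clustering variant in miniature, with Sentence-BERT vectors, K-means, and top TF-IDF terms as crude topic labels; the checkpoint name is a common public model rather than necessarily the one used in the thesis, and LDA and the trend analysis are left out:

```python
# Cluster Sentence-BERT embeddings and label each cluster with its top TF-IDF terms.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["recycling of industrial waste water", "solar panel efficiency gains",
        "waste water treatment plants", "improving photovoltaic cell output"]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)

tfidf = TfidfVectorizer().fit(docs)
terms = np.array(tfidf.get_feature_names_out())
for c in range(2):
    members = [d for d, l in zip(docs, labels) if l == c]
    scores = tfidf.transform(members).toarray().sum(axis=0)
    print("topic", c, ":", list(terms[np.argsort(scores)[-3:]][::-1]))
```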
APA, Harvard, Vancouver, ISO, and other styles
48

Grant, Harald. "Extractive Multi-document Summarization of News Articles." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-158275.

Full text
Abstract:
Publicly available data grows exponentially through web services and technological advancements. To comprehend large data streams, multi-document summarization (MDS) can be used. In this research, the area of multi-document summarization is investigated. Multiple systems for extractive multi-document summarization are implemented using modern techniques, in the form of the pre-trained BERT language model for word embeddings and sentence classification. This is combined with well-proven techniques, in the form of the TextRank ranking algorithm, the Waterfall architecture and anti-redundancy filtering. The systems are evaluated on the DUC-2002, 2006 and 2007 datasets using the ROUGE metric. The results show that the BM25 sentence representation implemented in the TextRank model using the Waterfall architecture and an anti-redundancy technique outperforms the other implementations, providing results competitive with other state-of-the-art systems. A cohesive model is derived from the leading system and tried in a user study using a real-world application. The user study is conducted using a real-time news detection application with users from the news domain. The study shows a clear preference for cohesive summaries in the case of extractive multi-document summarization, with the cohesive summary preferred in the majority of cases.
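A compact sketch of the TextRank component on its own, with TF-IDF cosine similarity in place of the BERT and BM25 representations used in the thesis and without the Waterfall architecture or redundancy filtering:

```python
# Rank sentences with PageRank over a cosine-similarity graph and keep the top ones.
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The storm closed several roads overnight.",
             "Authorities reopened the roads by noon.",
             "A concert was held downtown on Saturday.",
             "Road closures caused long delays for commuters."]
sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
np.fill_diagonal(sim, 0.0)

scores = nx.pagerank(nx.from_numpy_array(sim))
top = sorted(scores, key=scores.get, reverse=True)[:2]
print([sentences[i] for i in sorted(top)])
```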
APA, Harvard, Vancouver, ISO, and other styles
49

Lund, Max. "Duplicate Detection and Text Classification on Simplified Technical English." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-158714.

Full text
Abstract:
This thesis investigates the most effective way of performing classification of text labels and clustering of duplicate texts in technical documentation written in Simplified Technical English. Pre-trained language models from transformers (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
APA, Harvard, Vancouver, ISO, and other styles
50

Garabato, Brady D. "Synthesis and Computational Studies of a New Class of Lanthanide Niobate Cluster : [Ln4(H2O)8(SO4)5(NbO3)2]+3H2O; Ln= Dy, Tb." TopSCHOLAR®, 2013. http://digitalcommons.wku.edu/theses/1294.

Full text
Abstract:
Polyoxoniobates (PONbs) are a small family of highly electron-rich clusters. The development of new solids composed of these clusters has applications in green energy and electronics. However, the high charge environment of PONbs typically requires alkaline synthetic conditions that are unsuitable for introducing other metals and organic molecules, making synthesis of new systems difficult. To date, very few transition metals and organic ligands have been incorporated into these PONb solids, and lanthanide metal inclusion, which generally improves photoconductivity due to long-lived f-orbital excitations, has not yet been fully realized. Here, the synthesis of a new class of lanthanide niobate cluster, [Ln4(H2O)8(SO4)5(NbO3)2]·3H2O; Ln = Dy, Tb, under acidic conditions is reported. Structures were determined by crystallography, and time-dependent density functional theory (TD-DFT) was used to provide insight into photo-induced electronic transitions. Supporting computational methods that are currently being developed for modeling these emerging cluster systems are described.
APA, Harvard, Vancouver, ISO, and other styles
