Dissertations / Theses on the topic 'Summarization'

To see the other types of publications on this topic, follow the link: Summarization.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Summarization.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Bosma, Wauter Eduard. "Discourse oriented summarization." Enschede : Centre for Telematics and Information Technology (CTIT), 2008. http://doc.utwente.nl/58836.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Moon, Brandon B. "Interactive football summarization /." Diss., Brigham Young University, 2010. http://contentdm.lib.byu.edu/ETD/image/etd3337.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Moon, Brandon B. "Interactive Football Summarization." BYU ScholarsArchive, 2009. https://scholarsarchive.byu.edu/etd/1999.

Full text
Abstract:
Football fans do not have the time to watch every game in its entirety and need an effective solution that summarizes the story of the game for them. Human-generated summaries are often too short and require time and resources to create. We utilize the advantages of Interactive TV to create an automatic football summarization service that is cohesive, provides context, covers the necessary plays, and is concise. First, we construct a degree of interest function that ranks each play based on detailed, play-by-play game events as well as viewing statistics collected from an interactive viewing environment. This allows us to select the plays that are important to the game as well as those that are interesting to the viewer. Second, we create a visual transition that shows the progress of the ball whenever plays are skipped, allowing the viewer to understand the context of each play within the summary. Third, we enable interactive controls that allow viewers to manipulate the summary and delve deeper into the actual game whenever they wish. We validate our solution through two user studies: one to ensure that our degree of interest function selects the plays that are most interesting to the viewer, and the other to show that our transitions and interactive controls provide a better understanding of the game. We conclude that our summary solution is effective at conveying the story of a football game.
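As a rough illustration of the degree-of-interest ranking this abstract describes, the sketch below combines per-play event importance with viewing statistics. The event weights, field names, and mixing factor are illustrative assumptions, not values from the thesis.

```python
# Illustrative sketch of a degree-of-interest (DOI) function for ranking plays.
# The event weights, field names, and the 0.5 mixing factor are assumptions,
# not values from the thesis.

EVENT_WEIGHTS = {"touchdown": 10.0, "turnover": 8.0, "long_gain": 5.0, "punt": 1.0}

def degree_of_interest(play, viewing_stats, alpha=0.5):
    """Combine play-by-play importance with crowd interest for one play."""
    importance = sum(EVENT_WEIGHTS.get(e, 0.0) for e in play["events"])
    # viewing_stats maps play ids to, e.g., replay counts from interactive viewers
    interest = viewing_stats.get(play["id"], 0.0)
    return alpha * importance + (1 - alpha) * interest

def summarize(plays, viewing_stats, k=10):
    """Select the k highest-scoring plays, kept in game order."""
    ranked = sorted(plays, key=lambda p: degree_of_interest(p, viewing_stats), reverse=True)
    return sorted(ranked[:k], key=lambda p: p["start_time"])
```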
APA, Harvard, Vancouver, ISO, and other styles
4

Sizov, Gleb. "Extraction-Based Automatic Summarization : Theoretical and Empirical Investigation of Summarization Techniques." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2010. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-10861.

Full text
Abstract:

A summary is a shortened version of a text that contains the main points of the original content. Automatic summarization is the task of generating a summary by a computer. For example, given a collection of news articles from the last week, an automatic summarizer is able to create a concise overview of the important events. This summary can be used as a replacement for the original content or can help to identify the events that a person is particularly interested in. Potentially, automatic summarization can save a lot of time for people who deal with a large amount of textual information. The straightforward way to generate a summary is to select several sentences from the original text and organize them in a way that creates a coherent text. This approach is called extraction-based summarization and is the topic of this thesis. Extraction-based summarization is a complex task that consists of several challenging subtasks. The essential part of the extraction-based approach is the identification of sentences that contain important information. It can be done using graph-based representations and centrality measures that exploit similarities between sentences to identify the most central sentences. This thesis provides a comprehensive overview of methods used in extraction-based automatic summarization. In addition, several general natural language processing issues such as feature selection and text representation models are discussed with regard to automatic summarization. Part of the thesis is dedicated to graph-based representations and centrality measures used in extraction-based summarization. Theoretical analysis is reinforced with experiments using the summarization framework implemented for this thesis. The task for the experiments is query-focused multi-document extraction-based summarization, that is, summarization of several documents according to a user query. The experiments investigate several approaches to this task as well as the use of different representation models, similarity and centrality measures. The obtained results indicate that the use of graph centrality measures significantly improves the quality of generated summaries. Among the variety of centrality measures, the degree-based ones perform better than path-based measures. The best performance is achieved when centralities are combined with redundancy removal techniques that prevent the inclusion of similar sentences in a summary. Experiments with representation models reveal that a simple local term count representation performs better than the distributed representation based on latent semantic analysis, which indicates that further investigation of distributed representations with regard to automatic summarization is necessary. The implemented system performs quite well compared with the systems that participated in the DUC 2007 summarization competition. Nevertheless, manual inspection of the generated summaries demonstrates some of the flaws of the implemented summarization mechanism that can be addressed by introducing advanced algorithms for sentence simplification and sentence ordering.
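As a sketch of the graph-based approach this abstract describes, the snippet below builds a sentence similarity graph, ranks sentences by degree centrality, and applies a simple redundancy filter. The TF-IDF representation, thresholds, and summary length are illustrative choices, not the thesis's settings.

```python
# Sketch of graph-based extractive summarization: sentences become nodes, edges
# connect sufficiently similar sentences, and degree centrality ranks them.
# The similarity threshold and redundancy cutoff are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, k=3, edge_threshold=0.1, redundancy_cutoff=0.5):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)
    adjacency = sim >= edge_threshold          # unweighted sentence graph
    centrality = adjacency.sum(axis=1)         # degree centrality per sentence

    summary = []
    for idx in np.argsort(-centrality):
        # redundancy removal: skip sentences too similar to already chosen ones
        if any(sim[idx, j] > redundancy_cutoff for j in summary):
            continue
        summary.append(idx)
        if len(summary) == k:
            break
    return [sentences[i] for i in sorted(summary)]
```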

APA, Harvard, Vancouver, ISO, and other styles
5

Chellal, Abdelhamid. "Event summarization on social media stream : retrospective and prospective tweet summarization." Thesis, Toulouse 3, 2018. http://www.theses.fr/2018TOU30118/document.

Full text
Abstract:
User-generated content on social media, such as Twitter, provides, in many cases, the latest news before traditional media, which allows having a retrospective summary of events and being updated in a timely fashion whenever a new development occurs. However, social media, while being a valuable source of information, can also be overwhelming given the volume and the velocity of published information. To shield users from being overwhelmed by irrelevant and redundant posts, retrospective summarization and prospective notification (real-time summarization) were introduced as two complementary tasks of information seeking on document streams. The former aims to select a list of relevant and non-redundant tweets that capture "what happened". In the latter, systems monitor the live post stream and push relevant and novel notifications as soon as possible. Our work falls within these frameworks and focuses on developing tweet summarization approaches for the two aforementioned scenarios. It aims at providing summaries that capture the key aspects of the event of interest to help users efficiently acquire information and follow the development of long ongoing events on social media. Nevertheless, the tweet summarization task faces many challenges that stem from, on one hand, the high volume, velocity and variety of the published information and, on the other hand, the quality of tweets, which can vary significantly. In prospective notification, the core task is relevance and novelty detection in real time. For timeliness, a system may choose to push new updates in real time or may choose to trade timeliness for higher notification quality. Our contributions address these levels: First, we introduce the Word Similarity Extended Boolean Model (WSEBM), a relevance model that does not rely on stream statistics and takes advantage of a word embedding model. We use word similarity instead of traditional weighting techniques. By doing this, we overcome the shortness and word mismatch issues in tweets. The intuition behind our proposition is that the context-aware similarity measure in word2vec is able to consider different words with the same semantic meaning and hence allows offsetting the word mismatch issue when calculating the similarity between a tweet and a topic. Second, we propose to compute the novelty score of an incoming tweet with respect to all words of the tweets already pushed to the user instead of using pairwise comparison. The proposed novelty detection method scales better and reduces the execution time, which fits real-time tweet filtering. Third, we propose an adaptive learning-to-filter approach that leverages social signals as well as query-dependent features. To overcome the issue of relevance threshold setting, we use a binary classifier that predicts the relevance of the incoming tweet. In addition, we show the gain that can be achieved by taking advantage of ongoing relevance feedback. Finally, we adopt a real-time push strategy and we show that the proposed approach achieves promising performance in terms of quality (relevance and novelty) with low latency, whereas the state-of-the-art approaches tend to trade latency for higher quality. This thesis also explores a novel approach to generate a retrospective summary that follows a different paradigm than the majority of state-of-the-art methods. We consider summary generation as an optimization problem that takes into account topical and temporal diversity. Tweets are filtered and incrementally clustered into two cluster types, namely topical clusters based on content similarity and temporal clusters that depend on publication time. Summary generation is formulated as an integer linear program in which the unknown variables are binary, the objective function is to be maximized, and constraints ensure that at most one post per cluster is selected, subject to the predefined summary length limit.
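A minimal sketch of the integer linear program described above, using PuLP as an illustrative solver choice: binary variables select tweets, the objective maximizes total relevance, and constraints enforce the length limit and at most one tweet per cluster. The scores and cluster assignments are placeholder inputs, not the thesis's formulation in detail.

```python
# Sketch of the ILP: binary selection variables, relevance-maximizing objective,
# a summary length limit, and an at-most-one-tweet-per-cluster constraint.
import pulp

def select_tweets(scores, clusters, max_len):
    """scores[i]: relevance of tweet i; clusters[c]: list of tweet indices in cluster c."""
    n = len(scores)
    x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(n)]
    prob = pulp.LpProblem("tweet_summary", pulp.LpMaximize)
    prob += pulp.lpSum(scores[i] * x[i] for i in range(n))   # objective: total relevance
    prob += pulp.lpSum(x) <= max_len                         # summary length limit
    for members in clusters.values():                        # at most one tweet per cluster
        prob += pulp.lpSum(x[i] for i in members) <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=0))
    return [i for i in range(n) if x[i].value() and x[i].value() > 0.5]
```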
APA, Harvard, Vancouver, ISO, and other styles
6

Nahnsen, Thade. "Automation of summarization evaluation methods and their application to the summarization process." Thesis, University of Edinburgh, 2011. http://hdl.handle.net/1842/5278.

Full text
Abstract:
Summarization is the process of creating a more compact textual representation of a document or a collection of documents. In view of the vast increase in electronically available information sources in the last decade, filters such as automatically generated summaries are becoming ever more important to facilitate the efficient acquisition and use of required information. Different methods using natural language processing (NLP) techniques are being used to this end. One of the shallowest approaches is the clustering of available documents and the representation of the resulting clusters by one of the documents; an example of this approach is the Google News website. It is also possible to augment the clustering of documents with a summarization process, which would result in a more balanced representation of the information in the cluster, NewsBlaster being an example. However, while some systems are already available on the web, summarization is still considered a difficult problem in the NLP community. One of the major problems hampering the development of proficient summarization systems is the evaluation of the (true) quality of system-generated summaries. This is exemplified by the fact that the current state-of-the-art evaluation method to assess the information content of summaries, the Pyramid evaluation scheme, is a manual procedure. In this light, this thesis has three main objectives. 1. The development of a fully automated evaluation method. The proposed scheme is rooted in the ideas underlying the Pyramid evaluation scheme and makes use of deep syntactic information and lexical semantics. Its performance improves notably on previous automated evaluation methods. 2. The development of an automatic summarization system which draws on the conceptual idea of the Pyramid evaluation scheme and the techniques developed for the proposed evaluation system. The approach features the algorithm for determining the pyramid and bases importance on the number of occurrences of the variable-sized contributors of the pyramid as opposed to word-based methods exploited elsewhere. 3. The development of a text coherence component that can be used for obtaining the best ordering of the sentences in a summary.
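Since the proposed evaluation method is rooted in the Pyramid scheme, a small sketch of the original (manual) Pyramid content score may help: each summary content unit (SCU) is weighted by how many model summaries express it, and a peer summary is scored against the best weight achievable with the same number of SCUs. This is the standard definition, not the thesis's automated variant.

```python
# Sketch of the Pyramid content score: SCU weight = number of model summaries
# expressing it; the peer score is the covered weight divided by the ideal
# weight for a summary containing the same number of SCUs.
from collections import Counter

def pyramid_score(model_scus, peer_scus):
    """model_scus: list of SCU sets, one per model summary; peer_scus: set of SCUs."""
    weights = Counter(scu for summary in model_scus for scu in summary)
    observed = sum(weights[scu] for scu in peer_scus)
    # ideal: the same number of SCUs, drawn from the top of the pyramid
    ideal = sum(sorted(weights.values(), reverse=True)[:len(peer_scus)])
    return observed / ideal if ideal else 0.0
```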
APA, Harvard, Vancouver, ISO, and other styles
7

Smith, Christian. "Automatic summarization and readability." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-68332.

Full text
Abstract:
The enormous amount of information available today within different media gives rise to the notion of ways to reduce the inevitable complexity and to distribute text material to different channels or media. In an effort to investigate the possibilities of a tool to help alleviate the problem, an automatic summarizer called COGSUM has been developed and evaluated with regard to the informational quality of the summaries and with regard to readability. COGSUM is based on word space methodology, including virtues such as problematic computational complexity and possibilities of inferring semantic relations. The results from the evaluations show how to set some parameters in order to get as good a summary as possible, and that the resulting summaries have a higher readability score than the full text across different genres.
APA, Harvard, Vancouver, ISO, and other styles
8

Seidlhofer, Barbara. "Discourse analysis for summarization." Thesis, University College London (University of London), 1991. http://discovery.ucl.ac.uk/10018780/.

Full text
Abstract:
Summarization is an activity which language students are frequently called upon to perform, often without any explicit guidance. In a wider sense, it might be said that all learning, whether of language or anything else, involves the ability to distinguish what is important from what is not, and to incorporate it into existing schematic knowledge. In this respect, summarization can be seen as central to education in general as well as language education in particular. This thesis is an attempt to gain insights into the essential criteria for summarization. After the first chapter has outlined the scope and methodology of the enquiry, chapters 2 to 5 review a number of models of text analysis and discourse processing which, on the face of it, promise to provide a systematic basis for the identification of "main ideas" in written texts. These include the analysis of thematic structure associated with the work of Halliday and the Prague School, the Macrostructures proposed by van Dijk and Kintsch, and Meyer's studies of rhetorical structure. A critical investigation of these models leads to a consideration of a very different approach which focuses not on the text itself as product but on the reader's reaction to it in the process of interpretation. This emerges from the empirical analysis of student summaries and accounts in chapter 6, and is further discussed in the last chapter. In general, the thesis considers the theoretical validity of these different approaches to text description and their practical utility as points of reference for summarization. It surveys applied work based on them, relates them empirically to the analysis of summaries and accounts elicited from advanced Austrian students of English at university level, and works its way towards a set of principles and procedures which might be made operational in language pedagogy.
APA, Harvard, Vancouver, ISO, and other styles
9

Ceylan, Hakan. "Investigating the Extractive Summarization of Literary Novels." Thesis, University of North Texas, 2011. https://digital.library.unt.edu/ark:/67531/metadc103298/.

Full text
Abstract:
Due to the vast amount of information we are faced with, summarization has become a critical necessity of everyday human life. Given that a large fraction of the electronic documents available online and elsewhere consist of short texts such as Web pages, news articles, scientific reports, and others, the focus of natural language processing techniques to date has been on the automation of methods targeting short documents. We are witnessing, however, a change: an increasingly larger number of books become available in electronic format. This means that the need for language processing techniques able to handle very large documents such as books is becoming increasingly important. This thesis addresses the problem of summarization of novels, which are long and complex literary narratives. While there is a significant body of research that has been carried out on the task of automatic text summarization, most of this work has been concerned with the summarization of short documents, with a particular focus on news stories. However, novels are different in both length and genre, and consequently different summarization techniques are required. This thesis attempts to close this gap by analyzing a new domain for summarization, and by building unsupervised and supervised systems that effectively take into account the properties of long documents, and outperform the traditional extractive summarization systems typically addressing the news genre.
APA, Harvard, Vancouver, ISO, and other styles
10

Demirtas, Kezban. "Automatic Video Categorization And Summarization." Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/3/12611113/index.pdf.

Full text
Abstract:
In this thesis, we perform automatic video categorization and summarization using the subtitles of videos. We propose two methods for video categorization. The first method performs unsupervised categorization by applying natural language processing techniques to video subtitles and uses the WordNet lexical database and WordNet Domains. The method starts with text preprocessing. Then a keyword extraction algorithm and a word sense disambiguation method are applied. The WordNet domains that correspond to the correct senses of keywords are extracted. The video is assigned a category label based on the extracted domains. The second method has the same steps for extracting WordNet domains of a video but performs categorization by using a learning module. Experiments with documentary videos give promising results in discovering the correct categories of videos. Video summarization algorithms present condensed versions of a full-length video by identifying the most significant parts of the video. We propose a video summarization method using the subtitles of videos and text summarization techniques. We identify significant sentences in the subtitles of a video by using text summarization techniques and then we compose a video summary by finding the video parts corresponding to these summary sentences.
APA, Harvard, Vancouver, ISO, and other styles
11

Kazantseva, Anna. "Automatic summarization of short fiction." Thesis, University of Ottawa (Canada), 2007. http://hdl.handle.net/10393/27861.

Full text
Abstract:
This work is an inquiry into automatic summarization of short fiction. In this dissertation, I present a system that composes summaries of literary short stories employing two types of information: information about entities central to a story and information about the grammatical aspect of clauses. The summaries are tailored to a specific purpose: helping a reader decide whether she would be interested in reading a particular story. They contain just enough information to enable a reader to form adequate expectations about the story, but they do not reveal the plot. According to these criteria, a target summary provides a reader with an idea of whom the story is about, where and when it happens (in a way that goes beyond simply listing names and places) but does not re-tell the events of the story. In order to build such summaries, the system attempts to identify sentences that meet two criteria: they focus on main entities in the story and they relate the background of the story rather than events. Discussing the criteria for the sentence selection process comprises a large part of this dissertation. These criteria can be roughly divided into two categories: (1) information about main entities (e.g., main characters and locations) and (2) information related to the grammatical aspect of clauses. By relying on this information the system selects sentences that contain important information pertinent to the setting of the story. Six human judges evaluated the produced summaries in two different ways. Initially, the machine-made summaries were compared against man-made ones. On this account, the summaries rated better than those produced using two naive lead-based baselines. Subsequently, the judges answered a number of questions using the summaries as the only source of information. These answers were compared with the answers made using the complete stories. The summaries appeared to be useful for helping the judges decide whether they would like to read the stories. The judges could also answer simple questions about the setting of the story using the summaries only. The results suggest that aspectual information and information about important entities can be effectively used to build summaries of literary short fiction, even though this information alone is not sufficient for producing high-quality indicative summaries.
APA, Harvard, Vancouver, ISO, and other styles
12

Wen, Chung-Lin S. M. Massachusetts Institute of Technology. "Event-centric Twitter photo summarization." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/91417.

Full text
Abstract:
Thesis: S.M., Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2014.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 71-74).
We develop a novel algorithm based on spectral geometry that summarizes a photo collection into a small subset that represents the collection well. While the definition of a good summarization might not be unique, we focus on two metrics in this thesis: representativeness and diversity. By representativeness we mean that the sampled photo should be similar to other photos in the data set. The intuition behind this is that by regarding each photo as a "vote" towards the scene it depicts, we want to include the photos that have high "votes". Diversity is also desirable because repeating the same information is an inefficient use of the few spaces we have for summarization. We achieve these seemingly contradictory properties by applying diversified sampling on the denser part of the feature space. The proposed method uses diffusion distance to measure the distance between any given pair in the dataset. By emphasizing the connectivity of the local neighborhood, we achieve better accuracy compared to previous methods that used the global distance. The Heat Kernel Signature (HKS) is then used to separate the denser part and the sparser part of the data. By intersecting the denser parts generated by different features, we are able to remove most of the outliers, i.e., photos that have few similar photos in the dataset. Farthest Point Sampling (FPS) is then applied to give a diversified sampling, which produces our final summarization. The method can be applied to any image collection that has a specific topic but also a fair proportion of outliers. One scenario especially motivating us to develop this technique is the Twitter photos of a specific event. Microblogging services have become a major way that people share new information. However, the huge amount of data, the lack of structure, and the highly noisy nature prevent users from effectively mining useful information from it. There are methods based on textual data, but the absence of visual information makes them less valuable. To the best of our knowledge, this study is the first to address visual data in Twitter event summarization. Our method's output can produce a kind of "crowd-sourced news", useful for journalists as well as the general public. We illustrate our results by summarizing recent Twitter events and comparing them with those generated by metadata such as retweet numbers. Our results are of at least the same quality although produced by a fully automatic mechanism. In some cases, because metadata can be biased by factors such as the number of followers, our results are even better in comparison. We also note that in our initial pilot study, the high-quality photos we found have little overlap with highly-tweeted photos. That suggests the signal we found is orthogonal to the retweet signal and the two signals can potentially be combined to achieve even better results.
by Chung-Lin Wen.
S.M.
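A minimal sketch of the farthest point sampling step this abstract describes: photos are added greedily, each time picking the one farthest from everything already selected. Euclidean distance over generic feature vectors stands in for the diffusion-distance geometry used in the thesis.

```python
# Sketch of farthest point sampling (FPS) for diversified selection.
import numpy as np

def farthest_point_sampling(features, k, seed=0):
    """Pick k mutually distant rows of `features` (n_photos x n_dims)."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]                 # arbitrary starting photo
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dist))                    # photo farthest from the chosen set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```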
APA, Harvard, Vancouver, ISO, and other styles
13

Branavan, Satchuthananthavale Rasiah Kuhan. "High compression rate text summarization." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/44368.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
Includes bibliographical references (p. 95-97).
This thesis focuses on methods for condensing large documents into highly concise summaries, achieving compression rates on par with human writers. While the need for such summaries in the current age of information overload is increasing, the desired compression rate has thus far been beyond the reach of automatic summarization systems. The potency of our summarization methods is due to their in-depth modelling of document content in a probabilistic framework. We explore two types of document representation that capture orthogonal aspects of text content. The first represents the semantic properties mentioned in a document in a hierarchical Bayesian model. This method is used to summarize thousands of consumer reviews by identifying the product properties mentioned by multiple reviewers. The second representation captures discourse properties, modelling the connections between different segments of a document. This discriminatively trained model is employed to generate tables of contents for books and lecture transcripts. The summarization methods presented here have been incorporated into large-scale practical systems that help users effectively access information online.
by Satchuthananthavale Rasiah Kuhan Branavan.
S.M.
APA, Harvard, Vancouver, ISO, and other styles
14

LI, WEI. "HIERARCHICAL SUMMARIZATION OF VIDEO DATA." University of Cincinnati / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1186941444.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Subramanian, Hema. "Summarization Of Real Valued Biclusters." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1307442728.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Karlbom, Hannes. "Abstractive Summarization of Podcast Transcriptions." Thesis, Uppsala universitet, Artificiell intelligens, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-443377.

Full text
Abstract:
In the rapidly growing medium of podcasts, episodes are automatically transcribed, and the need has increased for good natural language summarization models that can handle the variety of obstacles presented by the transcriptions and the format. This thesis investigates transformer-based sequence-to-sequence models, where an attention mechanism keeps track of which words in the context are most important to the next word prediction in the sequence. Different summarization models are investigated on a large-scale open-domain podcast dataset which presents challenges such as transcription errors, multiple speakers, different genres and structures, as well as long texts. The results show that a sparse attention mechanism using a sliding window has an increased average ROUGE-2 F-measure of 21.6% over transformer models using a short input length with fully connected attention layers.
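To make the sparse attention idea concrete, here is a small sketch of a sliding-window attention mask, the pattern that lets such models accept much longer transcripts than fully connected attention; the window size is an illustrative value, not the thesis's configuration.

```python
# Sketch of a sliding-window attention mask: token i may only attend to tokens
# within `window` positions, rather than to the whole sequence.
import numpy as np

def sliding_window_mask(seq_len, window=256):
    """Boolean mask where mask[i, j] is True if position i may attend to j."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```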
APA, Harvard, Vancouver, ISO, and other styles
17

Linhares, Pontes Elvys. "Compressive Cross-Language Text Summarization." Thesis, Avignon, 2018. http://www.theses.fr/2018AVIG0232/document.

Full text
Abstract:
The popularization of social networks and digital documents increased quickly the information available on the Internet. However, this huge amount of data cannot be analyzed manually. Natural Language Processing (NLP) analyzes the interactions between computers and human languages in order to process and to analyze natural language data. NLP techniques incorporate a variety of methods, including linguistics, semantics and statistics, to extract entities and relationships and understand a document. Among several NLP applications, we are interested, in this thesis, in cross-language text summarization, which produces a summary in a language different from the language of the source documents. We also analyzed other NLP tasks (word encoding representation, semantic similarity, sentence and multi-sentence compression) to generate more stable and informative cross-lingual summaries. Most NLP applications (including all types of text summarization) use a kind of similarity measure to analyze and to compare the meaning of words, chunks, sentences and texts in their approaches. A way to analyze this similarity is to generate a representation for these sentences that contains their meaning. The meaning of sentences is defined by several elements, such as the context of words and expressions, the order of words and the previous information. Simple metrics, such as the cosine metric and Euclidean distance, provide a measure of similarity between two sentences; however, they do not analyze the order of words or multi-words. Analyzing these problems, we propose a neural network model that combines recurrent and convolutional neural networks to estimate the semantic similarity of a pair of sentences (or texts) based on the local and general contexts of words. Our model predicted better similarity scores than baselines by better analyzing the local and the general meanings of words and multi-word expressions. In order to remove redundancies and non-relevant information from similar sentences, we propose a multi-sentence compression method that compresses similar sentences by fusing them into correct and short compressions that contain the main information of these similar sentences. We model clusters of similar sentences as word graphs. Then, we apply an integer linear programming model that guides the compression of these clusters based on a list of keywords. We look for a path in the word graph that has good cohesion and contains the maximum of keywords. Our approach outperformed baselines by generating more informative and correct compressions for the French, Portuguese and Spanish languages. Finally, we combine these previous methods to build a cross-language text summarization system. Our system is an {English, French, Portuguese, Spanish}-to-{English, French} cross-language text summarization framework that analyzes the information in both languages to identify the most relevant sentences. Inspired by the compressive text summarization methods in monolingual analysis, we adapt our multi-sentence compression method for this problem to keep just the main information. Our system proves to be a good alternative to compress redundant information and to preserve relevant information. Our system improves informativeness scores without losing grammatical quality for French-to-English cross-lingual summaries. Analyzing {English, French, Portuguese, Spanish}-to-{English, French} cross-lingual summaries, our system significantly outperforms extractive baselines in the state of the art for all these languages. In addition, we analyze the cross-language text summarization of transcript documents. Our approach achieved better and more stable scores even for these documents that have grammatical errors and missing information.
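A greatly simplified sketch of the multi-sentence compression idea: similar sentences are merged into a word graph and a short path through it yields the compression. The thesis guides compression with an integer linear program over keywords; here a keyword-discounted shortest path stands in for it, so treat this as an illustration only.

```python
# Simplified word-graph compression: each sentence adds a path of word nodes
# between START and END, repeated words reinforce edges, and a cheap path
# through the graph yields the fused compression. Keyword edges are discounted
# so the path is nudged toward covering them; the 0.5 factor is arbitrary.
import networkx as nx

def compress(sentences, keywords):
    """sentences: list of strings; keywords: set of lowercase keyword tokens."""
    g = nx.DiGraph()
    for sent in sentences:
        tokens = ["<START>"] + sent.lower().split() + ["<END>"]
        for u, v in zip(tokens, tokens[1:]):
            count = g[u][v]["count"] + 1 if g.has_edge(u, v) else 1
            g.add_edge(u, v, count=count)
    for u, v, data in g.edges(data=True):
        # frequent edges and keyword targets become cheaper to traverse
        bonus = 0.5 if v in keywords else 1.0
        data["weight"] = bonus / data["count"]
    path = nx.shortest_path(g, "<START>", "<END>", weight="weight")
    return " ".join(path[1:-1])
```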
APA, Harvard, Vancouver, ISO, and other styles
18

Wu, Jiewen. "WHISK: Web Hosted Information into Summarized Knowledge." DigitalCommons@CalPoly, 2016. https://digitalcommons.calpoly.edu/theses/1633.

Full text
Abstract:
Today's online content increases at an alarming rate which exceeds users' ability to consume such content. Modern search techniques allow users to enter keyword queries to find content they wish to see. However, such techniques break down when users freely browse the internet without knowing exactly what they want. Users may have to invest an unnecessarily long time reading content to see if they are interested in it. Automatic text summarization helps relieve this problem by creating synopses that significantly reduce the text while preserving the key points. Steffen Lyngbaek created the SPORK summarization pipeline to solve the content overload in Reddit comment threads. Lyngbaek adapted the Opinosis graph model for extractive summarization and combined it with agglomerative hierarchical clustering and the Smith-Waterman algorithm to perform multi-document summarization on Reddit comments. This thesis presents WHISK as a pipeline for general multi-document text summarization based on SPORK. A generic data model in WHISK allows creating new drivers for different platforms to work with the pipeline. In addition to the existing Opinosis graph model adapted in SPORK, WHISK introduces two simplified graph models for the pipeline. The simplified models remove unnecessary restrictions inherited from the Opinosis graph's abstractive summarization origins. Performance measurements and a study with Digital Democracy compare the two new graph models against the Opinosis graph model. Additionally, the study evaluates WHISK's ability to generate pull quotes from political discussions as summaries.
APA, Harvard, Vancouver, ISO, and other styles
19

Ozsoy, Makbule Gulcin. "Text Summarization Using Latent Semantic Analysis." Master's thesis, METU, 2011. http://etd.lib.metu.edu.tr/upload/12612988/index.pdf.

Full text
Abstract:
Text summarization solves the problem of presenting the information needed by a user in a compact form. There are different approaches to creating well-formed summaries in the literature. One of the newest methods in text summarization is the Latent Semantic Analysis (LSA) method. In this thesis, different LSA-based summarization algorithms are explained and two new LSA-based summarization algorithms are proposed. The algorithms are evaluated on Turkish and English documents, and their performances are compared using their ROUGE scores.
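A minimal sketch of the classic LSA selection step that such summarizers build on: the term-sentence matrix is decomposed with an SVD and, for each of the strongest latent topics, the highest-weighted sentence is picked. The counts-based representation and summary length are illustrative choices, not the algorithms proposed in the thesis.

```python
# Sketch of LSA-based sentence selection: SVD of the term-sentence matrix
# exposes latent topics, and each strong topic contributes its top sentence.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_summarize(sentences, k=3):
    a = CountVectorizer().fit_transform(sentences).T.toarray()   # terms x sentences
    _, _, vt = np.linalg.svd(a, full_matrices=False)             # rows of vt = topics, strongest first
    chosen = []
    for topic in vt:
        idx = int(np.argmax(np.abs(topic)))                      # best sentence for this topic
        if idx not in chosen:
            chosen.append(idx)
        if len(chosen) == k:
            break
    return [sentences[i] for i in sorted(chosen)]
```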
APA, Harvard, Vancouver, ISO, and other styles
20

Oya, Tatsuro. "Automatic abstractive summarization of meeting conversations." Thesis, University of British Columbia, 2014. http://hdl.handle.net/2429/49946.

Full text
Abstract:
Nowadays, there are various ways for people to share and exchange information. Phone calls, e-mails, and social networking applications are tools which have made it much easier for us to communicate. Despite the existence of these convenient methods for exchanging ideas, meetings are still one of the most important ways for people to collaborate, share information, discuss their plans, and make decisions for their organizations. However, they have some drawbacks as well. Generally, meetings are time-consuming and require the participation of all members. Taking meeting minutes for the benefit of those who miss meetings also requires considerable time and effort. To this end, there has been increasing demand for the creation of systems to automatically summarize meetings. So far, most summarization systems have applied extractive approaches whereby summaries are simply created by extracting important phrases or sentences and concatenating them in sequence. However, considering that meeting transcripts consist of spontaneous utterances containing speech disfluencies such as repetitions and filled pauses, traditional extractive summarization approaches do not work effectively in this domain. To address these issues, we present a novel template-based abstractive meeting summarization system requiring less annotated data than that needed for previous abstractive summarization approaches. In order to generate abstract and robust templates that can guide the summarization process, our system extends a novel multi-sentence fusion algorithm and utilizes lexico-semantic information. It also leverages the relationship between human-authored summaries and their source meeting transcripts to select the best templates for generating abstractive summaries of meetings. In our experiment, we use the AMI corpus to instantiate our framework and compare it with state-of-the-art extractive and abstractive systems as well as human extractive and abstractive summaries. Our comprehensive evaluations, based on both automatic and manual approaches, have demonstrated that our system outperforms all baseline systems and human extractive summaries in terms of both readability and informativeness. Furthermore, it has achieved a level of quality nearly equal to that of human abstracts based on a crowd-sourced manual evaluation.
APA, Harvard, Vancouver, ISO, and other styles
21

Liu, Qing, Computer Science & Engineering, Faculty of Engineering, UNSW. "Summarization of very large spatial dataset." Awarded by: University of New South Wales, School of Computer Science and Engineering, 2006. http://handle.unsw.edu.au/1959.4/25489.

Full text
Abstract:
Nowadays there are a large number of applications, such as digital library information retrieval, business data analysis, CAD/CAM, multimedia applications with images and sound, real-time process control and scientific computation, with data sets of gigabytes, terabytes or even petabytes. Because data distributions are too large to be stored accurately, maintaining compact and accurate summarized information about the underlying data is of crucial importance. The summarizing problem for Level 1 (disjoint and non-disjoint) topological relationships has been well studied for the past few years. However, spatial database users are often interested in a much richer set of spatial relations such as contains. Little work has been done on summarization for Level 2 topological relationships, which include the contains, contained, overlap, equal and disjoint relations. We study the problem of effective summarization to represent the underlying data distribution to answer window queries for Level 2 topological relationships. The cell-density based approach has been demonstrated to be an effective way to address this problem. But the challenges are the accuracy of the results and the storage space required, which should be linearly proportional to the number of cells to be practical. In this thesis, we present several novel techniques to effectively construct cell-density based spatial histograms. Based on the framework proposed, exact results can be obtained in constant time for aligned window queries. To minimize the storage space of the framework, an approximate algorithm with approximation ratio 19/12 is presented, while the problem is shown to be NP-hard in general. Because the framework requires only a storage space linearly proportional to the number of cells, it is practical for many popular real datasets. To conform to a limited storage space, effective histogram construction and query algorithms are proposed which can provide approximate results but with high accuracy. The problem of non-aligned window queries is also investigated and techniques of unevenly partitioned space are developed to support non-aligned window queries. Finally, we extend our techniques to 3D space. Our extensive experiments against both synthetic and real-world datasets demonstrate the efficiency of the algorithms developed in this thesis.
APA, Harvard, Vancouver, ISO, and other styles
22

Mlynarski, Angela, and University of Lethbridge Faculty of Arts and Science. "Automatic text summarization in digital libraries." Thesis, Lethbridge, Alta. : University of Lethbridge, Faculty of Arts and Science, 2006, 2006. http://hdl.handle.net/10133/270.

Full text
Abstract:
A digital library is a collection of services and information objects for storing, accessing, and retrieving digital objects. Automatic text summarization presents salient information in a condensed form suitable for user needs. This thesis amalgamates digital libraries and automatic text summarization by extending the Greenstone Digital Library software suite to include the University of Lethbridge Summarizer. The tool generates summaries, nouns, and noun phrases for use as metadata for searching and browsing digital collections. Digital collections of newspapers, PDFs, and eBooks were created with summary metadata. PDF documents were processed the fastest at 1.8 MB/hr, followed by the newspapers at 1.3 MB/hr, with eBooks being the slowest at 0.9 MB/hr. Qualitative analysis on four genres (newspaper, M.Sc. thesis, novel, and poetry) revealed that narrative newspapers were the most suitable for automatically generated summarization. The other genres suffered from incoherence and information loss. Overall, summaries for digital collections are suitable when used with newspaper documents and unsuitable for other genres.
xiii, 142 leaves ; 28 cm.
APA, Harvard, Vancouver, ISO, and other styles
23

Sanchan, Nattapong. "Domain-focused summarization of polarized debates." Thesis, University of Sheffield, 2018. http://etheses.whiterose.ac.uk/20878/.

Full text
Abstract:
Due to the exponential growth of Internet use, textual content is increasingly published in online media. Every day, more and more news content, blog posts, and scientific articles are published online, opening doors for the text summarization research community to conduct research in those areas. Whilst there are freely accessible repositories for such content, online debates, which have recently become popular, have remained largely unexplored. This thesis addresses the challenge of applying text summarization to summarize online debates. We take the view that the task of summarizing online debates should not only focus on summarization techniques but also look further into presenting the summaries in formats favored by users.

 In this thesis, we present how a summarization system is developed to generate online debate summaries in accordance with a designed output, called the Combination 2. It is the combination of two summaries. The primary objective of the first summary, the Chart Summary, is to visualize the debate summary as a bar chart in a high-level view. The chart consists of bars conveying clusters of the salient sentences, labels showing short descriptions of the bars, and the numbers of salient sentences expressed by the two opposing sides. The other part, the Side-By-Side Summary, linked to the Chart Summary, shows a more detailed summary of an online debate related to a bar clicked by a user. The development of the summarization system is divided into three processes. 

In the first process, we create a gold standard dataset of online debates. The dataset contains a collection of debate comments that have been subjectively annotated with 5 judgments. We develop a summarization system with key features to help identify salient sentences in the comments. The sentences selected by the system are evaluated against the annotation results. We found that the system outperforms the baseline.

 The second process begins with the generation of the Chart Summary from the salient sentences selected by the system. We propose a framework with two branches, where one branch uses term-based clustering with a term-based labeling method and the other uses X-means based clustering with the MI labeling strategy. Our evaluation results indicate that the X-means clustering approach is a better alternative for clustering. 
 In the last process, we view the generation of the Side-By-Side Summary as a contradiction detection task. We create two debate entailment datasets derived from the two clustering approaches and annotate them with the Contradiction and Non-Contradiction relations. We develop a classifier and investigate combinations of features that maximize the F1 scores. Based on the proposed features, we discovered that combinations of two to eight features yield good results.
APA, Harvard, Vancouver, ISO, and other styles
24

Singi, Reddy Dinesh Reddy. "Comparative text summarization of product reviews." Thesis, Kansas State University, 2010. http://hdl.handle.net/2097/7031.

Full text
Abstract:
Master of Science
Department of Computing and Information Sciences
William H. Hsu
This thesis presents an approach towards summarizing product reviews using comparative sentences by sentiment analysis. Specifically, we consider the problem of extracting and scoring features from natural language text for qualitative reviews in a particular domain. When shopping for a product, customers do not find sufficient time to learn about all products on the market. Similarly, manufacturers do not have proper written sources from which to learn about customer opinions. The only available techniques involve gathering customer opinions, often in text form, from e-commerce and social networking web sites and analyzing them, which is a costly and time-consuming process. In this work I address these issues by applying sentiment analysis, an automated method of finding the opinion stated by an author about some entity in a text document. Here I first gather information about smart phones from many e-commerce web sites. I then present a method to differentiate comparative sentences from normal sentences, form feature sets for each domain, and assign to each feature of a product a numerical score and a weight coefficient obtained by statistical machine learning, to be used as a weight for that feature in ranking various products by linear combinations of their weighted feature scores. In this thesis I also explain what role comparative sentences play in summarizing the product. In order to find the polarity of each feature, a statistical algorithm is defined using a small-to-medium sized data set. Then I present my experimental environment and results, and conclude with a review of claims and hypotheses stated at the outset. The approach specified in this thesis is evaluated using manually annotated training data and also using data from domain experts. I also demonstrate empirically how different algorithms for this summarization can be derived from the technique provided by an annotator. Finally, I review diversified options for customers such as providing alternate products for each feature, top features of a product, and overall rankings for products.
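As a toy illustration of the final ranking step described here (a linear combination of weighted feature scores), the sketch below ranks products from per-feature sentiment scores; the feature names, scores, and weights are made up, not the values learned in the thesis.

```python
# Sketch of ranking products by a linear combination of weighted feature scores.

def rank_products(feature_scores, weights):
    """feature_scores: {product: {feature: score}}; weights: {feature: coefficient}."""
    totals = {
        product: sum(weights.get(f, 0.0) * s for f, s in scores.items())
        for product, scores in feature_scores.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Usage with made-up numbers:
scores = {"phone_a": {"battery": 0.7, "camera": 0.4}, "phone_b": {"battery": 0.2, "camera": 0.9}}
weights = {"battery": 0.6, "camera": 0.4}
print(rank_products(scores, weights))
```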
APA, Harvard, Vancouver, ISO, and other styles
25

Tohalino, Jorge Andoni Valverde. "Extractive document summarization using complex networks." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-24102018-155954/.

Full text
Abstract:
Due to the large amount of textual information available on the Internet, the task of automatic document summarization has gained significant importance. Document summarization became important because its focus is the development of techniques aimed at finding relevant and concise content in large volumes of information without changing its original meaning. The purpose of this Master's work is to use network theory concepts for extractive document summarization for both Single Document Summarization (SDS) and Multi-Document Summarization (MDS). In this work, the documents are modeled as networks, where sentences are represented as nodes, with the aim of extracting the most relevant sentences through the use of ranking algorithms. The edges between nodes are established in different ways. The first approach for edge calculation is based on the number of common nouns between two sentences (network nodes). Another approach to creating an edge is through the similarity between two sentences. In order to calculate the similarity of such sentences, we used the vector space model based on Tf-Idf weighting and word embeddings for the vector representation of the sentences. Also, we make a distinction between edges linking sentences from different documents (inter-layer) and those connecting sentences from the same document (intra-layer) by using multilayer network models for the Multi-Document Summarization task. In this approach, each network layer represents a document of the document set that will be summarized. In addition to the measurements typically used in complex networks, such as node degree, clustering coefficient and shortest paths, the network characterization is also guided by dynamical measurements of complex networks, including symmetry, accessibility and absorption time. The generated summaries were evaluated using different corpora for both Portuguese and English. The ROUGE-1 metric was used for the validation of generated summaries. The results suggest that simpler models like noun-based and Tf-Idf based networks achieved a better performance in comparison to those models based on word embeddings. Also, excellent results were achieved by using the multilayered representation of documents for MDS. Finally, we concluded that several measurements could be used to improve the characterization of networks for the summarization task.
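A small sketch of the noun-overlap network variant this abstract describes: sentences become nodes, edge weights count shared nouns, and a ranking algorithm picks the central sentences. PageRank and the summary length are illustrative choices, and the noun sets are assumed to come from a separate POS tagger.

```python
# Sketch of a noun-overlap sentence network ranked with PageRank.
import networkx as nx

def summarize(sentences, nouns_per_sentence, k=3):
    """nouns_per_sentence[i]: set of nouns in sentence i (e.g., from a POS tagger)."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            shared = len(nouns_per_sentence[i] & nouns_per_sentence[j])
            if shared:
                g.add_edge(i, j, weight=shared)   # edge weight = number of shared nouns
    ranks = nx.pagerank(g, weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```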
APA, Harvard, Vancouver, ISO, and other styles
26

AGUIAR, C. Z. "Concept Maps Mining for Text Summarization." Universidade Federal do Espírito Santo, 2017. http://repositorio.ufes.br/handle/10/9846.

Full text
Abstract:
Concept maps are graphical tools for the representation and construction of knowledge. Concepts and relations form the basis for learning, and concept maps have therefore been widely used in different situations and for different purposes in education, one of them being the representation of written text. Even a grammatically complex text can be represented by a concept map containing only concepts and relations that express the same content in a simpler form. However, the manual construction of a concept map requires considerable time and effort to identify and structure knowledge, especially when the map should represent not the concepts of the author's cognitive structure but the concepts expressed in a text. Several technological approaches have thus been proposed to facilitate the construction of concept maps from texts. This dissertation proposes a new approach for the automatic construction of concept maps as summaries of scientific texts. The summarization aims to produce a concept map as a condensed representation of the text while preserving its diverse and most important characteristics. Summarization can make texts easier to understand, since students are trying to cope with the cognitive overload caused by the growing amount of textual information available today, a growth that can also hinder knowledge construction. We therefore consider the hypothesis that summarizing a text as a concept map can highlight the characteristics that are important for assimilating the knowledge in the text, as well as reduce its complexity and the time needed to process it. In this context, we carried out a literature review covering the years 1994 to 2016 on approaches for the automatic construction of concept maps from texts, and from it built a categorization to better identify and analyze the resources and characteristics of these technological approaches. We also sought to identify limitations and to gather the best characteristics of related work in order to propose our own approach.
Furthermore, we present a Concept Map Mining process organized along four dimensions: Data Source Description, Domain Definition, Element Identification, and Map Visualization. With the aim of developing a computational architecture to automatically build concept maps as summaries of academic texts, this research produced the public tool CMBuilder, an online tool for the automatic construction of concept maps from texts, as well as a Java API called ExtroutNLP, which contains libraries for information extraction and public services. To achieve the proposed goal, efforts were directed to the areas of natural language processing and information retrieval. The main task in reaching our goal is to extract propositions of the form (concept, relation, concept) from the text. Under this premise, the research introduces a pipeline comprising: grammar rules and depth-first search for extracting concepts and relations from the text; preposition mapping, anaphora resolution, and named-entity exploitation for concept labeling; concept ranking based on element frequency analysis and map topology; and proposition summarization based on graph topology. The approach also proposes supervised clustering and classification techniques, combined with a thesaurus, for defining the domain of the text and building a conceptual vocabulary of domains. Finally, an objective analysis validating the accuracy of the ExtroutNLP library yields 0.65 precision over the corpus, and a subjective analysis validating the quality of the concept maps built by CMBuilder yields 0.75/0.45 precision/recall for concepts and 0.57/0.23 precision/recall for relations in English, and 0.68/0.38 precision/recall for concepts and 0.41/0.19 precision/recall for relations in Portuguese. In addition, an experiment on whether the concept maps summarized by CMBuilder help comprehension of the subject of a text reached 60% correct answers for maps extracted from short texts with multiple-choice questions and 77% correct answers for maps extracted from long texts with open-ended questions.
APA, Harvard, Vancouver, ISO, and other styles
27

O'Brien, Shayne S. M. Massachusetts Institute of Technology. "Unsupervised summarization of public talk radio." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/123648.

Full text
Abstract:
Thesis: S.M., Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2019
Includes bibliographical references (pages 111-118).
Talk radio exerts significant influence on the political and social dynamics of the United States, but labor-intensive data collection and curation processes have prevented previous works from analyzing its content at scale. Over the past year, the Laboratory for Social Machines and Cortico have created an ingest system to record and automatically transcribe audio from more than 150 public talk radio stations across the country. Using the outputs from this ingest, I introduce "hierarchical compression" for neural unsupervised summarization of spoken opinion in conversational dialogue. By relying on an unsupervised framework that obviates the need for labeled data, the summarization task becomes largely agnostic to human input beyond necessary decisions regarding model architecture, input data, and output length. Trained models are thus able to automatically identify and summarize opinion in a dynamic fashion, which the relevant literature notes as one of the most significant obstacles to fully unlocking talk radio as a data source for linguistic, ethnographic, and political analysis. To evaluate model performance, I create a novel spoken opinion summarization dataset consisting of compressed versions of "representative," opinion-containing utterances extracted from a hand-curated and crowd-source-annotated dataset of 275 snippets. I use this evaluation dataset to show that my model quantitatively outperforms strong rule- and graph-based unsupervised baselines on ROUGE and METEOR while qualitatively demonstrating fluency and information retention according to human judges. Additional analyses of model outputs show that many improvements are still to be made to this model, thus laying the groundwork for its use in important future work such as characterizing the linguistic structure of spoken opinion "in the wild."
APA, Harvard, Vancouver, ISO, and other styles
28

Pham, Quang-Khai. "Time Sequence Summarization: Theory and Applications." Phd thesis, Université de Nantes, 2010. http://tel.archives-ouvertes.fr/tel-00538512.

Full text
Abstract:
The fields of medicine, the web, commerce, and finance generate and store large volumes of information in the form of event sequences. These archives are very rich sources of information for analysts eager to discover nuggets of knowledge in them. For example, biologists seek to discover the risk factors of a disease by analyzing patient histories, web content producers and marketing departments examine customer consumption habits, and stock market operators follow market movements in order to better anticipate them. However, these applications require the exploration of very large event sequences; finance, for example, generates millions of events every day, and events may be described by terms extracted from rich textual content, so the variability of the descriptors can be very high. Consequently, discovering non-trivial knowledge in these verbose information sources with classical data mining approaches is a difficult problem. A recent study shows that classical data mining approaches can benefit from condensed forms of the data, such as aggregation results or summaries; the knowledge extracted in this way is termed higher-order knowledge. Starting from this observation, this work introduces the concept of event sequence summarization, whose goal is to let time-dependent applications scale up to large data volumes. A summary is obtained by transforming an event sequence in which events are ordered chronologically and each event is precisely described by a finite set of symbolic descriptors; the resulting summary is an event sequence that is more concise than the original and can be substituted for it in applications. We first propose a user-guided construction method called TSaR. It is a three-phase process: i) generalization, ii) grouping, and iii) concept formation. TSaR uses domain knowledge expressed as taxonomies to generalize event descriptors, and a temporal window controls the grouping process according to the temporal proximity of events. Second, to make the summarization process autonomous, that is, parameter-free, we reformulate the summarization problem as a new classification problem. The originality of this classification problem is that the objective function to be optimized depends simultaneously on the content of the events and on their proximity in time. We propose two greedy algorithms, G-BUSS and GRASS, to address this problem. Finally, we explore and analyze the ability of event sequence summaries to contribute to the extraction of higher-order sequential patterns. We analyze the characteristics of the frequent patterns extracted from the summaries and propose a methodology that relies on these patterns to discover others at a finer granularity. We evaluate and validate our summarization approaches and our methodology through a set of experiments on a real dataset extracted from the financial news archives produced by Reuters.
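The following toy Python sketch illustrates the TSaR-style idea summarized above: event descriptors are generalized with a small hypothetical taxonomy, and temporally close events with identical generalized descriptors are merged. It is only an illustration under these assumptions, not the thesis's algorithm; the taxonomy and events are invented.

    # Illustrative sketch (not the thesis's code): taxonomy-based generalization
    # followed by grouping of events within a temporal window w.
    taxonomy = {            # hypothetical domain taxonomy: descriptor -> parent
        "aspirin": "drug", "ibuprofen": "drug",
        "headache": "symptom", "fever": "symptom",
    }

    events = [              # (timestamp, set of symbolic descriptors)
        (1, {"aspirin", "headache"}),
        (2, {"ibuprofen", "headache"}),
        (9, {"fever"}),
    ]

    def generalize(descriptors):
        return frozenset(taxonomy.get(d, d) for d in descriptors)

    def summarize(events, w):
        summary = []
        for t, descs in events:
            g = generalize(descs)
            # merge with the last group if descriptors match and it is close in time
            if summary and summary[-1][1] == g and t - summary[-1][0] <= w:
                continue
            summary.append((t, g))
        return summary

    print(summarize(events, w=3))   # -> [(1, frozenset({'drug', 'symptom'})), (9, ...)]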
APA, Harvard, Vancouver, ISO, and other styles
29

Pham, Quang-Khai. "Time sequence summarization : theory and applications." Phd thesis, Nantes, 2010. http://www.theses.fr/2010NANT2102.

Full text
Abstract:
The fields of medicine, the web, commerce, and finance generate and store large volumes of information in the form of event sequences. These archives are very rich sources of information for analysts eager to discover nuggets of knowledge in them. For example, biologists seek to discover the risk factors of a disease by analyzing patient histories, web content producers and marketing departments examine customer consumption habits, and stock market operators follow market movements in order to better anticipate them. However, these applications require the exploration of very large event sequences; finance, for example, generates millions of events every day, and events may be described by terms extracted from rich textual content, so the variability of the descriptors can be very high. Consequently, discovering non-trivial knowledge in these verbose information sources with classical data mining approaches is a difficult problem. A recent study shows that classical data mining approaches can benefit from condensed forms of the data, such as aggregation results or summaries; the knowledge extracted in this way is termed higher-order knowledge. Starting from this observation, this work introduces the concept of event sequence summarization, whose goal is to let time-dependent applications scale up to large data volumes. A summary is obtained by transforming an event sequence in which events are ordered chronologically and each event is precisely described by a finite set of symbolic descriptors; the resulting summary is an event sequence that is more concise than the original and can be substituted for it in applications. We first propose a user-guided construction method called TSaR. It is a three-phase process: i) generalization, ii) grouping, and iii) concept formation. TSaR uses domain knowledge expressed as taxonomies to generalize event descriptors, and a temporal window controls the grouping process according to the temporal proximity of events. Second, to make the summarization process autonomous, that is, parameter-free, we reformulate the summarization problem as a new classification problem. The originality of this classification problem is that the objective function to be optimized depends simultaneously on the content of the events and on their proximity in time. We propose two greedy algorithms, G-BUSS and GRASS, to address this problem. Finally, we explore and analyze the ability of event sequence summaries to contribute to the extraction of higher-order sequential patterns. We analyze the characteristics of the frequent patterns extracted from the summaries and propose a methodology that relies on these patterns to discover others at a finer granularity. We evaluate and validate our summarization approaches and our methodology through a set of experiments on a real dataset extracted from the financial news archives produced by Reuters.
APA, Harvard, Vancouver, ISO, and other styles
30

Niccolai, Lorenzo. "Distillation Knowledge applied on Pegasus for Summarization." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/22202/.

Full text
Abstract:
Within Natural Language Processing, one of the most intricate tasks is text summarization; in human terms, writing an essay. Something we learn in primary school is still very difficult for a machine to reproduce, and it was nearly impossible before the advent of deep learning. The dominant technology for summarization, and for every task that involves generating text, is the Transformer. This thesis investigates what it takes to reduce the complexity of Pegasus, a large state-of-the-art Transformer-based model. Through a technique called knowledge distillation, the original model can be compressed into a smaller one by transferring its knowledge, without losing much effectiveness. In the experiments, distilled replicas of different sizes were built and their performance assessed with suitable metrics. Reducing the computational power needed by such models is crucial for deploying them on devices with limited capabilities and internet connections too unreliable for cloud computing, such as mobile devices.
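For readers unfamiliar with the technique, the sketch below shows a generic soft-label knowledge distillation loss in PyTorch, the kind of objective used to make a smaller student imitate a larger teacher; it is a minimal illustration under that assumption, not the Pegasus-specific training code used in the thesis.

    # Minimal sketch of soft-label knowledge distillation (generic, not Pegasus code):
    # the student is trained to match the teacher's softened output distribution.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """KL divergence between softened teacher and student distributions."""
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        # scale by T^2 so gradients keep a comparable magnitude across temperatures
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    # toy example: a batch of 4 positions over a vocabulary of 10 tokens
    student = torch.randn(4, 10, requires_grad=True)
    teacher = torch.randn(4, 10)
    loss = distillation_loss(student, teacher)
    loss.backward()
    print(float(loss))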
APA, Harvard, Vancouver, ISO, and other styles
31

Casamayor, Gerard. "Semantically-oriented text planning for automatic summarization." Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/671530.

Full text
Abstract:
Text summarization deals with the automatic creation of summaries from one or more documents, either by extracting fragments from the input text or by generating an abstract de novo. Research in recent years has become dominated by a new paradigm in which summarization is addressed as a mapping from a sequence of tokens in an input document to a new sequence of tokens summarizing the input. Works following this paradigm apply supervised deep learning methods to learn sequence-to-sequence models from a large corpus of documents paired with human-crafted summaries. Despite impressive results in automatic quantitative evaluations, this approach to summarization also suffers from a number of drawbacks. One concern is that learned models tend to operate in a black-box fashion that prevents obtaining insights or intermediate results that could be applied to other tasks, an important consideration in many real-world scenarios where summaries are not the only desired output of a natural language processing system. Another significant drawback is that deep learning methods are largely constrained to languages and types of summary for which abundant corpora containing human-authored summaries are available. Although researchers are experimenting with transfer learning methods to overcome this problem, it is far from clear how effective these methods are and how to apply them to scenarios where summaries need to adapt to a query or to user preferences. In cases where it is not practical to learn a sequence-to-sequence model, it is convenient to fall back on a more traditional formulation of summarization in which the input documents are first analyzed, a summary is then planned by selecting and organizing contents, and the final summary is generated either extractively or abstractively, using natural language generation methods in the latter case. By separating linguistic analysis, planning, and generation, it becomes possible to apply different approaches to each task. This thesis focuses on the text planning step. Drawing on past research in word sense disambiguation, text summarization, and natural language generation, this thesis presents an unsupervised approach to planning the production of summaries. Following the observation that a common strategy for both disambiguation and summarization tasks is to rank candidate items (meanings, text fragments), we propose a strategy, at the core of our approach, that ranks candidate lexical meanings and individual words in a text. These ranks contribute to the creation of a graph-based semantic representation from which we select non-redundant contents and organize them for inclusion in the summary. The overall approach is supported by lexicographic databases that provide cross-lingual and cross-domain knowledge, and by textual similarity methods used to compare meanings with each other and with the text. The methods presented in this thesis are tested on two separate tasks: disambiguation of word senses and named entities, and single-document extractive summarization of English texts. The evaluation of the disambiguation task shows that our approach produces useful results for tasks other than summarization, while evaluating in an extractive summarization setting allows us to compare our approach to existing summarization systems. While the results are inconclusive with respect to the state of the art in disambiguation and summarization, they hint at a large potential for our approach.
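As one concrete example of the "rank candidates, then avoid redundancy" planning strategy the abstract describes, the sketch below implements standard maximal marginal relevance (MMR) selection over Tf-Idf vectors; MMR is a generic technique used here purely for illustration and is not claimed to be the author's algorithm, and the query and candidate texts are invented.

    # Standard MMR selection loop (generic technique, shown for illustration only).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def mmr_select(candidates, query, k=2, lambda_=0.7):
        texts = [query] + candidates
        vecs = TfidfVectorizer().fit_transform(texts)
        rel = cosine_similarity(vecs[0], vecs[1:]).ravel()     # relevance to the query
        sim = cosine_similarity(vecs[1:])                      # candidate-candidate similarity
        selected = []
        while len(selected) < min(k, len(candidates)):
            best, best_score = None, float("-inf")
            for i in range(len(candidates)):
                if i in selected:
                    continue
                redundancy = max((sim[i][j] for j in selected), default=0.0)
                score = lambda_ * rel[i] - (1 - lambda_) * redundancy
                if score > best_score:
                    best, best_score = i, score
            selected.append(best)
        return [candidates[i] for i in selected]

    print(mmr_select(
        ["Graphs represent meanings and words.",
         "A graph of meanings and words supports planning.",
         "Lexicographic databases give cross-lingual knowledge."],
        query="graph-based planning of summary contents"))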
APA, Harvard, Vancouver, ISO, and other styles
32

Reeve, Lawrence H. Han Hyoil. "Semantic annotation and summarization of biomedical text /." Philadelphia, Pa. : Drexel University, 2007. http://hdl.handle.net/1860/1779.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Hassel, Martin. "Resource Lean and Portable Automatic Text Summarization." Doctoral thesis, Stockholm : Numerisk analys och datalogi Numerical Analysis and Computer Science, Kungliga Tekniska högskolan, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4414.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Ulrich, Jan. "Supervised machine learning for email thread summarization." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/2363.

Full text
Abstract:
Email has become a part of most people's lives, and the ever-increasing number of messages people receive can lead to email overload. We attempt to mitigate this problem using email thread summarization. Summaries can be used for more than just replacing an incoming email message: in the business world they can serve as a form of corporate memory, or give a new team member an easy way to catch up on an ongoing conversation. Email threads are of particular interest for summarization because their conversational nature introduces much structural redundancy. Our email thread summarization approach uses machine learning to pick which sentences from the thread to use in the summary. A machine learning summarizer must be trained on previously labeled data, i.e., manually created summaries. After being trained, our summarization algorithm can generate summaries that on average contain over 70% of the sentences chosen by human annotators. We show that labeling some key features such as speech acts, meta sentences, and subjectivity can improve performance to over 80% weighted recall. To create such email summarization software, an email dataset is needed for training and evaluation. Since email communication is a private matter, it is hard to get access to real emails for research, and these emails must also be annotated with human-generated summaries. As such annotated datasets are rare, we have created one and made it publicly available. The BC3 corpus contains annotations for 40 email threads, including extractive summaries, abstractive summaries with links, and labeled speech acts, meta sentences, and subjective sentences. While previous research has shown that machine learning algorithms are a promising approach to email summarization, there has not been a study of the impact of the choice of algorithm. We explore new techniques in email thread summarization using several different kinds of regression, and the results show that the choice of classifier is critical. We also present a novel feature set for email summarization and perform analysis on two email corpora: the BC3 corpus and the Enron corpus.
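A minimal sketch of the supervised setup described above is shown below: each sentence is represented by a feature vector, a label indicates whether annotators kept the sentence, and a classifier scores sentences for inclusion in the summary. The features, labels, and data are invented for illustration; this is not the BC3 training code.

    # Sketch of sentence-level supervised summarization with invented features.
    from sklearn.linear_model import LogisticRegression

    # hypothetical per-sentence features: [length, position_in_thread, is_question,
    # is_meta_sentence, subjectivity]; labels: 1 = included in the human summary
    X = [
        [12, 0.05, 0, 0, 0.2],
        [ 7, 0.10, 1, 0, 0.1],
        [25, 0.40, 0, 1, 0.7],
        [ 5, 0.90, 0, 0, 0.0],
        [18, 0.20, 0, 0, 0.6],
        [ 9, 0.75, 1, 0, 0.3],
    ]
    y = [1, 0, 1, 0, 1, 0]

    clf = LogisticRegression().fit(X, y)
    scores = clf.predict_proba(X)[:, 1]          # probability of being summary-worthy
    ranked = sorted(range(len(X)), key=lambda i: -scores[i])
    print("sentences picked for the summary:", ranked[:3])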
APA, Harvard, Vancouver, ISO, and other styles
35

Di, Fabrizzio Giuseppe. "Automatic summarization of opinions in service reviews." Thesis, University of Sheffield, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.632550.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

"Fractal summarization." 2003. http://library.cuhk.edu.hk/record=b6073565.

Full text
Abstract:
Wang Fu Lee.
"August 2003."
Thesis (Ph.D.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (p. 256-281).
Abstracts in English and Chinese.
APA, Harvard, Vancouver, ISO, and other styles
37

Costa, Vítor Manuel da. "Update Summarization." Master's thesis, 2014. https://repositorio-aberto.up.pt/handle/10216/77587.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Costa, Vítor Manuel da. "Update Summarization." Dissertation, 2014. https://repositorio-aberto.up.pt/handle/10216/77587.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Hassanlou, Nasrin. "Probabilistic graph summarization." Thesis, 2012. http://hdl.handle.net/1828/4403.

Full text
Abstract:
We study group-summarization of probabilistic graphs that naturally arise in social networks, semistructured data, and other applications. Our proposed framework groups the nodes and edges of the graph based on a user-selected set of node attributes. We present methods to compute useful graph aggregates without the need to create all of the possible graph instances of the original probabilistic graph. We also present an algorithm for graph summarization based on pure relational (SQL) technology. We analyze our algorithm and evaluate its efficiency in practice using an extended Epinions dataset as well as synthetic datasets. The experimental results show the scalability of our algorithm and its efficiency in producing highly compressed summary graphs in reasonable time.
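The toy example below illustrates attribute-based summarization of a probabilistic graph in pure SQL, run through Python's sqlite3: nodes are grouped by an attribute, and expected edge counts between groups are obtained by summing edge probabilities, without enumerating possible graph instances. The schema and data are invented; this is not the thesis's algorithm.

    # Toy SQL-based probabilistic graph summarization (invented schema and data).
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE nodes(id INTEGER PRIMARY KEY, grp TEXT);       -- grouping attribute
    CREATE TABLE edges(src INTEGER, dst INTEGER, prob REAL);     -- existence probability
    INSERT INTO nodes VALUES (1,'student'),(2,'student'),(3,'teacher'),(4,'teacher');
    INSERT INTO edges VALUES (1,2,0.9),(1,3,0.5),(2,3,0.4),(3,4,0.8);
    """)

    # expected number of edges between two groups = sum of edge probabilities
    rows = con.execute("""
    SELECT a.grp AS src_group, b.grp AS dst_group, SUM(e.prob) AS expected_edges
    FROM edges e
    JOIN nodes a ON e.src = a.id
    JOIN nodes b ON e.dst = b.id
    GROUP BY a.grp, b.grp
    ORDER BY src_group, dst_group;
    """).fetchall()

    for src_group, dst_group, expected in rows:
        print(src_group, "->", dst_group, round(expected, 2))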
APA, Harvard, Vancouver, ISO, and other styles
40

Ding, Wei-Ming, and 丁偉民. "Summarization Scoring System." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/41800171399131023093.

Full text
Abstract:
Master's thesis
National Taiwan Normal University
Department of Information and Computer Education
Academic year 93 (2004-2005)
The main purpose of this research is to develop a summarization scoring system for elementary school teachers. The system applies Latent Semantic Analysis (LSA), using Singular Value Decomposition (SVD) to build semantic spaces. We build several semantic spaces that differ in corpus size and writing style, and score student summaries by comparing their keywords with the teacher's summary in these spaces. In addition to scoring, we analyze other scoring indexes to identify approaches better suited to summarizing Chinese text. The participants were students of Xi-men Elementary School. After the summarization experiments, we analyzed the correlation between the various scoring indexes computed by the system and the teacher's scores across the different semantic spaces. The results of the research are: (1) good results are obtained when the teacher's and students' summaries are compared in semantic spaces built through the SVD transformation; (2) scoring student summaries by comparing their sentences with the sentences of the teacher's summary appears to be a direction worth further research; and (3) differences in corpus size and writing style affect the analysis of the scoring indexes.
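A rough sketch of LSA-based scoring in the spirit of this abstract (not the thesis's system) is given below: an SVD semantic space is built over Tf-Idf vectors, and a student summary is scored by its cosine similarity to the teacher's summary in that space. The corpus and summaries are invented for illustration.

    # Sketch of LSA scoring: SVD semantic space + cosine similarity of summaries.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [                      # documents used to build the semantic space
        "The summary keeps the most important sentences of the article.",
        "Students write summaries and teachers score them.",
        "Latent semantic analysis builds a space with singular value decomposition.",
        "Semantic similarity can be measured with cosine distance.",
    ]
    teacher_summary = "The summary keeps the important sentences."
    student_summary = "The student kept the most important sentences of the article."

    vectorizer = TfidfVectorizer().fit(corpus)
    svd = TruncatedSVD(n_components=2).fit(vectorizer.transform(corpus))

    def project(text):
        # map a text into the low-dimensional LSA space
        return svd.transform(vectorizer.transform([text]))

    score = cosine_similarity(project(teacher_summary), project(student_summary))[0, 0]
    print("LSA similarity score:", round(float(score), 3))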
APA, Harvard, Vancouver, ISO, and other styles
41

Chiu, Chung-Ren, and 邱中人. "Chinese News Summarization." Thesis, 2000. http://ndltd.ncl.edu.tw/handle/12254466664744727912.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Kumar, Trun. "Automatic Text Summarization." Thesis, 2014. http://ethesis.nitrkl.ac.in/5619/1/110CS0127.pdf.

Full text
Abstract:
Automatic summarization is the procedure of reducing the content of a document with a computer program so as to produce a summary that retains the most important sentences of the original text. Summarizing a document is difficult even for human beings, so generating summaries automatically poses several challenges: the system can only extract the required information from the original document. As the problem of information overload has grown, and the amount of available data has increased, so has the interest in condensing it automatically; it is extremely difficult for people to manually condense large documents. Automatic summarization systems may be classified into extractive and abstractive approaches. An extractive method selects important sentences from the document and joins them into a shorter text, where the importance of the selected sentences is based on statistical and semantic features; in other words, extractive methods produce the summary from a subset of the words or sentences already present in the text file. Searching for important information in a large document is a difficult job for the user, which motivates automatically extracting the important information or a summary of the document. Such a summary saves the user time compared with reading the whole document and provides quick insight into a large text file. Extractive summarization is commonly based on sentence-extraction techniques that cover the set of sentences most important for the overall understanding of a given text. With the frequency-based technique, the obtained summary is more meaningful; with k-means clustering, the summary may not read coherently because sentences are extracted out of order.
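The frequency-based extractive technique mentioned in the abstract can be illustrated with the short, generic Python sketch below, which scores each sentence by the frequency of its content words and keeps the top-scoring sentences in their original order; it is not the thesis's exact program, and the stopword list and sample text are invented.

    # Generic frequency-based extractive summarizer (illustration only).
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "that", "it"}

    def summarize(text, n_sentences=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
        freq = Counter(words)
        def score(sentence):
            tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                      if w not in STOPWORDS]
            return sum(freq[t] for t in tokens) / (len(tokens) or 1)
        top = sorted(sentences, key=score, reverse=True)[:n_sentences]
        return " ".join(s for s in sentences if s in top)   # preserve original order

    text = ("Automatic summarization reduces a document to its most important "
            "sentences. Summarization saves reading time. Manual summarization of "
            "large documents is difficult. Frequency of words helps to score "
            "sentences for summarization.")
    print(summarize(text))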
APA, Harvard, Vancouver, ISO, and other styles
43

Kumar, T. "Automatic text summarization." Thesis, 2014. http://ethesis.nitrkl.ac.in/5617/1/E-65.pdf.

Full text
Abstract:
Automatic summarization is the procedure of reducing the content of a document with a computer program so as to produce a summary that retains the most important sentences of the original text. Summarizing a document is difficult even for human beings, so generating summaries automatically poses several challenges: the system can only extract the required information from the original document. As the problem of information overload has grown, and the amount of available data has increased, so has the interest in condensing it automatically; it is extremely difficult for people to manually condense large documents. Automatic summarization systems may be classified into extractive and abstractive approaches. An extractive method selects important sentences from the document and joins them into a shorter text, where the importance of the selected sentences is based on statistical and semantic features; in other words, extractive methods produce the summary from a subset of the words or sentences already present in the text file. Searching for important information in a large document is a difficult job for the user, which motivates automatically extracting the important information or a summary of the document. Such a summary saves the user time compared with reading the whole document and provides quick insight into a large text file. Extractive summarization is commonly based on sentence-extraction techniques that cover the set of sentences most important for the overall understanding of a given text. With the frequency-based technique, the obtained summary is more meaningful; with k-means clustering, the summary may not read coherently because sentences are extracted out of order.
APA, Harvard, Vancouver, ISO, and other styles
44

Lee, Hsiang-Pin, and 李祥賓. "Text Summarization on News." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/54130654842944705212.

Full text
Abstract:
Master's thesis
Soochow University
Department of Computer Science
Academic year 89 (2000-2001)
The swift development of information technology and the Internet has resulted in information overload, so it is imperative to help users browse documents efficiently and effectively. Text summarization can be a remedy for this problem. Traditional summarization is usually performed manually, which costs considerable human effort and cannot satisfy real-time demands, so the process needs to be automated. This thesis presents three text summarization methods applied to the Reuters news corpus. First, we use information retrieval techniques to collect the important vocabulary of the document (the Important Vocabulary Extract Policy). Second, we determine the significance of a sentence from its position in the document (the Optimal Position Policy). Third, we expand the vocabulary of the title (the Title Expand Policy). To express the concept of the document, we extract its important vocabulary and analyze its structure to find which positions the document subject occupies. Moreover, since the title is highly significant, we expand the title's related vocabulary using WordNet and use the expanded set of words to find appropriate sentences for the summary. We design separate experiments for the three methods and evaluate the resulting summaries through text categorization. Experimental results indicate that all of the methods achieve acceptable performance. Finally, the thesis proposes combining two policies, Optimal Position and Title Expand: compared with a baseline precision of 65.6%, the combined method achieves a precision of 71.9%, a 9.6% relative improvement.
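To illustrate the title-expansion idea, the hedged sketch below expands a title with WordNet synonyms and prefers sentences containing more title-related words; it is a simplified stand-in for the Title Expand Policy, not the thesis's code, the title and sentences are invented, and it assumes the NLTK WordNet data has been downloaded (nltk.download('wordnet')).

    # Simplified WordNet-based title expansion and sentence scoring (illustration only).
    from nltk.corpus import wordnet as wn

    def expand_title(title):
        words = {w.lower() for w in title.split()}
        expanded = set(words)
        for w in words:
            for syn in wn.synsets(w):
                expanded.update(l.lower().replace("_", " ") for l in syn.lemma_names())
        return expanded

    def score_sentence(sentence, expanded_title):
        tokens = {t.lower().strip(".,") for t in sentence.split()}
        return len(tokens & expanded_title)

    title = "Bank merger approved"
    sentences = [
        "The central bank announced new interest rates yesterday.",
        "Regulators approved the merger of the two largest banking groups.",
        "Weather conditions delayed several flights on Monday.",
    ]
    expanded = expand_title(title)
    best = max(sentences, key=lambda s: score_sentence(s, expanded))
    print(best)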
APA, Harvard, Vancouver, ISO, and other styles
45

Yang, Jeng-Yuan, and 楊政遠. "Statistical Chinese News Summarization." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/tu9p36.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
Graduate Institute of Computer Science and Information Engineering
Academic year 98 (2009-2010)
With the growing number of news articles published around the world every day, it would help users if the time needed to read them could be reduced. There are two general ways to summarize documents: multi-document summarization and single-document summarization. Multi-document news summarization resembles a 'hot topics of the week' list that keeps only the most important reports, while single-document news summarization is closer to a short abstract that helps readers quickly grasp the overall idea of an article. The focus of single-document news summarization is to remove as many unimportant words as possible and preserve only the major keywords. In this thesis, we focus on single-document summarization of Chinese news articles using statistical methods. The proposed architecture is as follows. First, auxiliary vocabularies are collected from news articles and included in the system dictionary, and the original articles are kept along with the vocabularies. The vocabularies are stored as word bigrams together with their document and term frequencies. These statistics are then used to calculate sentence importance and to select the most representative sentences as the summary. In our experiments, we adopted only news articles in the 'science and technology' category, since new terms can be obtained there easily. The experimental results showed that news summaries generated by our system can be effectively clustered with the original news articles. The summaries also greatly reduce the time needed to read each article, and hence the total time needed to read all articles, which achieves the major goal of the proposed system: reducing news reading time.
APA, Harvard, Vancouver, ISO, and other styles
46

Tsai, Jean Ya-Chin, and 蔡雅晴. "Comic-Styled Movie Summarization." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/88014021431825574473.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Information Management
Academic year 95 (2006-2007)
This thesis adopts comics as the form of presentation best suited to movie content summarization. Movies, with their powerful ability to convey stories and evoke emotions through moving frames, have attracted a significant body of research and application in the video content analysis field. However, while this art form has been widely investigated with existing video analysis technology, none of the previous work has produced story summaries with pleasant or satisfactory results. The comic form's naturally rich visual storytelling vocabulary and vivid imagery are ideal for movie summarization. By re-examining the translation rules between movies and comics, we built an effective system that produces comic-styled movie summaries within a reasonable time frame. In our system, a heuristic pictorial layout and balloon placement algorithm is applied after image processing of keyframes selected by one-pass video processing, and comic-style rendering greatly enhances the appearance of the generated summary. The system is easy to implement, fast, and flexible; it can be adapted to a variety of movie genres and comic styles, and extended to incorporate specific video processing techniques.
APA, Harvard, Vancouver, ISO, and other styles
47

Cheng, Kun-You, and 成崑佑. "Content-oriented Video Summarization." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/09382153335369623907.

Full text
Abstract:
Master's thesis
National Dong Hwa University
Department of Computer Science and Information Engineering
Academic year 94 (2005-2006)
This thesis presents a new video summarization framework that can extract and summarize the shots a user is interested in from a long and complicated video, according to their similarity in motion type and scene. First, shot detection uses color and edge information to locate shot boundaries accurately. The clustering process then groups the shots according to their similarity in scene and motion type. Finally, we select the important shots of each cluster by estimating a priority value, which measures the importance of each shot through its motion energy and color variation. The proposed method produces a classified video summary that allows users to review and search the video more easily. Experimental results show that the method can successfully classify a video into several clusters of different motion types and scenes, and extract specific shots according to their importance.
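A simplified illustration of the color-based part of shot detection is sketched below: consecutive frames are compared through their normalized color histograms, and a boundary is declared when the difference exceeds a threshold. The frames here are synthetic arrays, and the thesis additionally uses edge information, so this is only a toy version under those assumptions.

    # Toy color-histogram shot boundary detection over synthetic grayscale frames.
    import numpy as np

    rng = np.random.default_rng(0)
    # two synthetic "shots": 5 dark frames followed by 5 bright frames (8x8 grayscale)
    frames = [rng.integers(0, 60, (8, 8)) for _ in range(5)] + \
             [rng.integers(180, 255, (8, 8)) for _ in range(5)]

    def histogram(frame, bins=16):
        h, _ = np.histogram(frame, bins=bins, range=(0, 256))
        return h / h.sum()

    def shot_boundaries(frames, threshold=1.0):
        boundaries = []
        for i in range(1, len(frames)):
            # L1 distance between consecutive normalized histograms
            d = np.abs(histogram(frames[i]) - histogram(frames[i - 1])).sum()
            if d > threshold:
                boundaries.append(i)
        return boundaries

    print(shot_boundaries(frames))   # expected: [5], the cut between the two shots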
APA, Harvard, Vancouver, ISO, and other styles
48

沈健誠. "Multi-Document Summarization System." Thesis, 2001. http://ndltd.ncl.edu.tw/handle/67547214470615254060.

Full text
Abstract:
Master's thesis
National Tsing Hua University
Department of Computer Science
Academic year 89 (2000-2001)
At present, most summarization systems are designed for a single document. Such systems convey the essence of an individual document, but they do not merge similar documents into a single summary. Can we develop a multi-document summarization system that turns related documents about the same event into one summary? If so, the main points of the documents can be displayed clearly and simply in two or three sentences, users can tell at a glance whether the documents are what they want, and the time spent collecting documents is reduced, making information gathering on the Internet more efficient. Developing such a multi-document summarization system is the goal of this thesis. The summaries produced by the system must satisfy two conditions: they should be indicative and topic-related, tailored to the user's query. To achieve this, we study the indicativeness and topic relevance of sentences, and the selection of sentences that are important and independent of each other. Finally, unimportant small clauses are deleted to make the final summary more concise. The system generates summaries for 248 documents and fifty topics from NTCIR, with a reduction rate over 95%. Overall, the quality of the summaries produced was satisfactory.
APA, Harvard, Vancouver, ISO, and other styles
49

Tsai, Jean Ya-Chin. "Comic-Styled Movie Summarization." 2007. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0001-2307200715130800.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Chou, Yu-Yu, and 周宥宇. "Spoken Document Summarization: with Structural Support Vector Machine, Domain Adaptation and Abstractive Summarization." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/48591868287184216440.

Full text
APA, Harvard, Vancouver, ISO, and other styles
