Theses on the topic "Nearest neighbor analysis (Statistics)"

Follow this link to see other types of publications on the topic: Nearest neighbor analysis (Statistics).

Cite a source in APA, MLA, Chicago, Harvard, and many other styles

Choose the source type:

Listed below are the top 50 dissertations and theses for research on the topic "Nearest neighbor analysis (Statistics)".

Next to every source in the list of references there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic citation of the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication as a .pdf and read its abstract online, if it is included in the record's metadata.

Browse theses from a wide range of research areas and compile an accurate bibliography.

1

Shen, Qiong Mao. "Group nearest neighbor queries /". View abstract or full-text, 2003. http://library.ust.hk/cgi/db/thesis.pl?COMP%202003%20SHEN.

Full text
2

Hui, Michael Chun Kit. "Aggregate nearest neighbor queries /". View abstract or full-text, 2004. http://library.ust.hk/cgi/db/thesis.pl?COMP%202004%20HUI.

Full text
Abstract:
Thesis (M. Phil.)--Hong Kong University of Science and Technology, 2004.
Includes bibliographical references (leaves 91-95). Also available in electronic version. Access restricted to campus users.
3

Xie, Xike, and 谢希科. "Evaluating nearest neighbor queries over uncertain databases". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hub.hku.hk/bib/B4784954X.

Full text
Abstract:
Nearest Neighbor (NN in short) queries are important in emerging applications, such as wireless networks, location-based services, and data stream applications, where the data obtained are often imprecise. The imprecision or imperfection of the data sources is modeled by uncertain data in recent research works. Handling uncertainty is important because this issue affects the quality of query answers. Although queries on uncertain data are useful, evaluating the queries on them can be costly, in terms of I/O or computational efficiency. In this thesis, we study how to efficiently evaluate NN queries on uncertain data. Given a query point q and a set of uncertain objects O, the possible nearest neighbor query returns a set of candidates which have non-zero probabilities to be the query answer. It is also interesting to ask "which region has the same set of possible nearest neighbors", and "which region has one specific object as its possible nearest neighbor". To reveal the relationship between the query space and nearest neighbor answers, we propose the UV-diagram, where the query space is split into disjoint partitions, such that each partition is associated with a set of objects. If a query point is located inside the partition, its possible nearest neighbors could be directly retrieved. However, the number of such partitions is exponential and the construction effort can be expensive. To tackle this problem, we propose an alternative concept, called UV-cell, and efficient algorithms for constructing it. The UV-cell has an irregular shape, which incurs difficulties in storage, maintenance, and query evaluation. We design an index structure, called UV-index, which is an approximated version of the UV-diagram. Extensive experiments show that the UV-index could efficiently answer different variants of NN queries, such as Probabilistic Nearest Neighbor Queries and Continuous Probabilistic Nearest Neighbor Queries. Another problem studied in this thesis is the trajectory nearest neighbor query. Here the query point is restricted to a pre-known trajectory. In applications (e.g. monitoring potential threats along a flight/vessel's trajectory), it is useful to derive nearest neighbors for all points on the query trajectory. Simple solutions, such as sampling or approximating the locations of uncertain objects as points, fail to achieve a good query quality. To handle this problem, we design efficient algorithms and optimization methods for this query. Experiments show that our solution can efficiently and accurately answer this query. Our solution is also scalable to large datasets and long trajectories.
published_or_final_version
Computer Science
Doctoral
Doctor of Philosophy
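
The abstract above revolves around identifying which uncertain objects can possibly be the nearest neighbor of a query point. As a rough Python illustration of that idea only (a standard min-max distance pruning rule, not the UV-diagram, UV-cell, or UV-index proposed in the thesis), the sketch below assumes each uncertain object is summarized by an axis-parallel bounding box; all data and names are hypothetical.

```python
import numpy as np

def possible_nn_candidates(query, boxes):
    """Objects whose minimum distance to the query does not exceed the
    smallest maximum distance of any object; only these can possibly be
    the nearest neighbor when each object lies somewhere inside its box."""
    mins, maxs = [], []
    for lo, hi in boxes:
        closest = np.clip(query, lo, hi)                 # nearest point of the box
        farthest = np.where(np.abs(query - lo) > np.abs(query - hi), lo, hi)
        mins.append(np.linalg.norm(query - closest))
        maxs.append(np.linalg.norm(query - farthest))
    threshold = min(maxs)
    return [i for i, d in enumerate(mins) if d <= threshold]

q = np.array([0.0, 0.0])
objects = [(np.array([1.0, 1.0]), np.array([2.0, 2.0])),
           (np.array([3.0, 3.0]), np.array([4.0, 4.0])),
           (np.array([-1.5, -1.5]), np.array([-0.5, -0.5]))]
print(possible_nn_candidates(q, objects))   # -> [0, 2]; object 1 can never be closest
```
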
4

Zhang, Jun. "Nearest neighbor queries in spatial and spatio-temporal databases /". View abstract or full-text, 2003. http://library.ust.hk/cgi/db/thesis.pl?COMP%202003%20ZHANG.

Full text
5

Ram, Parikshit. "New paradigms for approximate nearest-neighbor search". Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/49112.

Full text
Abstract:
Nearest-neighbor search is a very natural and universal problem in computer science. Often times, the problem size necessitates approximation. In this thesis, I present new paradigms for nearest-neighbor search (along with new algorithms and theory in these paradigms) that make nearest-neighbor search more usable and accurate. First, I consider a new notion of search error, the rank error, for an approximate neighbor candidate. Rank error corresponds to the number of possible candidates which are better than the approximate neighbor candidate. I motivate this notion of error and present new efficient algorithms that return approximate neighbors with rank error no more than a user specified amount. Then I focus on approximate search in a scenario where the user does not specify the tolerable search error (error constraint); instead the user specifies the amount of time available for search (time constraint). After differentiating between these two scenarios, I present some simple algorithms for time constrained search with provable performance guarantees. I use this theory to motivate a new space-partitioning data structure, the max-margin tree, for improved search performance in the time constrained setting. Finally, I consider the scenario where we do not require our objects to have an explicit fixed-length representation (vector data). This allows us to search with a large class of objects which include images, documents, graphs, strings, time series and natural language. For nearest-neighbor search in this general setting, I present a provably fast novel exact search algorithm. I also discuss the empirical performance of all the presented algorithms on real data.
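
To make the rank-error notion above concrete, here is a minimal brute-force sketch (an illustration of the definition only, not one of the thesis algorithms): the rank error of a returned candidate is the number of database points strictly closer to the query.

```python
import numpy as np

def rank_error(query, data, candidate_idx):
    """Number of database points strictly closer to the query than the
    candidate returned by some (approximate) nearest-neighbor search."""
    dists = np.linalg.norm(data - query, axis=1)
    return int(np.sum(dists < dists[candidate_idx]))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
q = rng.normal(size=5)
candidate = int(rng.integers(len(X)))     # stand-in for any approximate search result
print(rank_error(q, X, candidate))        # 0 means the exact nearest neighbor was found
```
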
6

Zhang, Peiwu, and 张培武. "Voronoi-based nearest neighbor search for multi-dimensional uncertain databases". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hub.hku.hk/bib/B49618179.

Full text
Abstract:
In Voronoi-based nearest neighbor search, the Voronoi cell of every point p in a database can be used to check whether p is the closest to some query point q. We extend the notion of Voronoi cells to support uncertain objects, whose attribute values are inexact. Particularly, we propose the Possible Voronoi cell (or PV-cell). A PV-cell of a multi-dimensional uncertain object o is a region R, such that for any point p ∈ R, o may be the nearest neighbor of p. If the PV-cells of all objects in a database S are known, they can be used to identify objects that have a chance to be the nearest neighbor of q. However, there is no efficient algorithm for computing an exact PV-cell. We hence study how to derive an axis-parallel hyper-rectangle (called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell. We further develop the PV-index, a structure that stores UBRs, to evaluate probabilistic nearest neighbor queries over uncertain data. An advantage of the PV-index is that upon updates on S, it can be incrementally updated. Extensive experiments on both synthetic and real datasets are carried out to validate the performance of the PV-index.
published_or_final_version
Computer Science
Master
Master of Philosophy
7

Wong, Wing Sing. "K-nearest-neighbor queries with non-spatial predicates on range attributes /". View abstract or full-text, 2005. http://library.ust.hk/cgi/db/thesis.pl?COMP%202005%20WONGW.

Full text
8

Yiu, Man-lung. "Advanced query processing on spatial networks". Click to view the E-thesis via HKUTO, 2006. http://sunzi.lib.hku.hk/hkuto/record/B36279365.

Full text
9

Yiu, Man-lung, and 姚文龍. "Advanced query processing on spatial networks". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2006. http://hub.hku.hk/bib/B36279365.

Full text
10

Dastile, Xolani Collen. "Improved tree species discrimination at leaf level with hyperspectral data combining binary classifiers". Thesis, Rhodes University, 2011. http://hdl.handle.net/10962/d1002807.

Full text
Abstract:
The purpose of the present thesis is to show that hyperspectral data can be used for discrimination between different tree species. The data set used in this study contains the hyperspectral measurements of leaves of seven savannah tree species. The data is high-dimensional and shows large within-class variability combined with small between-class variability, which makes discrimination between the classes challenging. We employ two classification methods: G-nearest neighbour and feed-forward neural networks. For both methods, direct 7-class prediction results in high misclassification rates. However, binary classification works better. We constructed binary classifiers for all possible binary classification problems and combined them with Error Correcting Output Codes. In particular, we show that the use of 1-nearest neighbour binary classifiers results in no improvement compared to a direct 1-nearest neighbour 7-class predictor. In contrast to this negative result, the use of neural network binary classifiers improves accuracy by 10% compared to a direct neural network 7-class predictor, and error rates become acceptable. This can be further improved by choosing only suitable binary classifiers for combination.
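
The comparison described above, a direct multi-class 1-NN predictor versus binary classifiers combined with Error Correcting Output Codes, can be sketched with scikit-learn. The data below are synthetic stand-ins for the hyperspectral leaf measurements, so this only illustrates the experimental setup, not the thesis results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 7-class data standing in for the leaf spectra.
X, y = make_classification(n_samples=700, n_features=50, n_informative=20,
                           n_classes=7, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

direct = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
ecoc = OutputCodeClassifier(KNeighborsClassifier(n_neighbors=1),
                            code_size=2.0, random_state=0).fit(X_tr, y_tr)

print("direct 1-NN accuracy:           ", direct.score(X_te, y_te))
print("ECOC of 1-NN binary classifiers:", ecoc.score(X_te, y_te))
```
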
11

Bengtsson, Thomas. "Time series discrimination, signal comparison testing, and model selection in the state-space framework /". free to MU campus, to others for purchase, 2000. http://wwwlib.umi.com/cr/mo/fullcit?p9974611.

Full text
12

Ali, Khan Syed Irteza. "Classification using residual vector quantization". Diss., Georgia Institute of Technology, 2013. http://hdl.handle.net/1853/50300.

Full text
Abstract:
Residual vector quantization (RVQ) is a 1-nearest neighbor (1-NN) type of technique. RVQ is a multi-stage implementation of regular vector quantization. An input is successively quantized to the nearest codevector in each stage codebook. In classification, nearest neighbor techniques are very attractive since they accurately model the ideal Bayes class boundaries. However, nearest neighbor classification techniques require a large, representative dataset. Since in such techniques a test input is assigned a class membership after an exhaustive search of the entire training set, a reasonably large training set can make the implementation of the nearest neighbor classifier unfeasibly costly. Although the k-d tree structure offers a far more efficient implementation of 1-NN search, the cost of storing the data points can become prohibitive, especially in higher dimensionality. RVQ also offers a cost-effective implementation of 1-NN-based classification. Because of the direct-sum structure of the RVQ codebook, the memory and computational cost of a 1-NN-based system is greatly reduced. Although, compared to an equivalent 1-NN system, the multi-stage implementation of the RVQ codebook compromises the accuracy of the class boundaries, the classification error has been empirically shown to be within 3% to 4% of the performance of an equivalent 1-NN-based classifier.
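
As a toy illustration of the direct-sum codebook structure mentioned above (a generic residual vector quantizer fitted with k-means on random data, not the classifier studied in the dissertation), the sketch below quantizes each stage's residual and reconstructs a vector as the sum of one codevector per stage.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rvq(data, stages=2, codewords=16, seed=0):
    """Fit one k-means codebook per stage on the residual left by earlier stages."""
    codebooks, residual = [], data.copy()
    for s in range(stages):
        km = KMeans(n_clusters=codewords, n_init=10, random_state=seed + s).fit(residual)
        codebooks.append(km.cluster_centers_)
        residual = residual - km.cluster_centers_[km.labels_]
    return codebooks

def encode(x, codebooks):
    """Greedy stage-by-stage 1-NN encoding of a single vector."""
    indices, residual = [], x.copy()
    for cb in codebooks:
        j = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(j)
        residual = residual - cb[j]
    return indices

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
cbs = fit_rvq(X)
codes = encode(X[0], cbs)
reconstruction = sum(cb[j] for cb, j in zip(cbs, codes))   # direct-sum reconstruction
print(codes, float(np.linalg.norm(X[0] - reconstruction)))
```
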
13

Hawash, Maher Mofeid. "Methods for Efficient Synthesis of Large Reversible Binary and Ternary Quantum Circuits and Applications of Linear Nearest Neighbor Model". PDXScholar, 2013. https://pdxscholar.library.pdx.edu/open_access_etds/1090.

Full text
Abstract:
This dissertation describes the development of automated synthesis algorithms that construct reversible quantum circuits for reversible functions with large number of variables. Specifically, the research area is focused on reversible, permutative and fully specified binary and ternary specifications and the applicability of the resulting circuit to the physical limitations of existing quantum technologies. Automated synthesis of arbitrary reversible specifications is an NP hard, multiobjective optimization problem, where 1) the amount of time and computational resources required to synthesize the specification, 2) the number of primitive quantum gates in the resulting circuit (quantum cost), and 3) the number of ancillary qubits (variables added to hold intermediate calculations) are all minimized while 4) the number of variables is maximized. Some of the existing algorithms in the literature ignored objective 2 by focusing on the synthesis of a single solution without the addition of any ancillary qubits while others attempted to explore every possible solution in the search space in an effort to discover the optimal solution (i.e., sacrificed objective 1 and 4). Other algorithms resorted to adding a huge number of ancillary qubits (counter to objective 3) in an effort minimize the number of primitive gates (objective 2). In this dissertation, I first introduce the MMDSN algorithm that is capable of synthesizing binary specifications up to 30 variables, does not add any ancillary variables, produces better quantum cost (8-50% improvement) than algorithms which limit their search to a single solution and within a minimal amount of time compared to algorithms which perform exhaustive search (seconds vs. hours). The MMDSN algorithm introduces an innovative method of using the Hasse diagram to construct candidate solutions that are guaranteed to be valid and then selects the solution with the minimal quantum cost out of this subset. I then introduce the Covered Set Partitions (CSP) algorithm that expands the search space of valid candidate solutions and allows for exploring solutions outside the range of MMDSN. I show a method of subdividing the expansive search landscape into smaller partitions and demonstrate the benefit of focusing on partition sizes that are around half of the number of variables (15% to 25% improvements, over MMDSN, for functions less than 12 variables, and more than 1000% improvement for functions with 12 and 13 variables). For a function of n variables, the CSP algorithm, theoretically, requires n times more to synthesize; however, by focusing on the middle k (k by MMDSN which typically yields lower quantum cost. I also show that using a Tabu search for selecting the next set of candidate from the CSP subset results in discovering solutions with even lower quantum costs (up to 10% improvement over CSP with random selection). In Chapters 9 and 10 I question the predominant methods of measuring quantum cost and its applicability to physical implementation of quantum gates and circuits. I counter the prevailing literature by introducing a new standard for measuring the performance of quantum synthesis algorithms by enforcing the Linear Nearest Neighbor Model (LNNM) constraint, which is imposed by the today's leading implementations of quantum technology. 
In addition to enforcing physical constraints, the new LNNM quantum cost (LNNQC) allows for a level comparison amongst all methods of synthesis; specifically, methods which add a large number of ancillary variables to ones that add no additional variables. I show that, when LNNM is enforced, the quantum cost for methods that add a large number of ancillary qubits increases significantly (up to 1200%). I also extend the Hasse-based method to the ternary case and demonstrate synthesis of specifications of up to 9 ternary variables (compared to 3 ternary variables that existed in the literature). I introduce the concept of ternary precedence order and its implication on the construction of the Hasse diagram and the construction of valid candidate solutions. I also provide a case study comparing the performance of ternary logic synthesis of large functions using both a CUDA graphic processor with 1024 cores and an Intel i7 processor with 8 cores. In the process of exploring large ternary functions I introduce, to the literature, eight families of ternary benchmark functions along with a multiple-valued file specification (the Extended Quantum Specification, XQS). I also introduce a new composite quantum gate, the multiple-valued Swivel gate, which swaps the information of qubits around a centrally located pivot point. In summary, my research objectives are as follows:
* Explore and create automated synthesis algorithms for reversible circuits, in both binary and ternary logic, for a large number of variables.
* Study the impact of enforcing the Linear Nearest Neighbor Model (LNNM) constraint for every interaction between qubits for reversible binary specifications.
* Advocate for a revised metric for measuring the cost of a quantum circuit in concordance with LNNM, where, on one hand, such a metric would provide a way for balanced comparison between the various flavors of algorithms, and on the other hand, represents a realistic cost of a quantum circuit with respect to an ion trap implementation.
* Establish an open source repository for sharing the results, software code and publications with the scientific community.
With the dwindling expectations for a new lifeline on silicon-based technologies, quantum computations have the potential of becoming the future workhorse of computations. Similar to the automated CAD tools of classical logic, my work lays the foundation for creating automated tools for constructing quantum circuits from reversible specifications.
14

Lee, Jong-Seok. "Preserving nearest neighbor consistency in cluster analysis". [Ames, Iowa : Iowa State University], 2009. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3369852.

Full text
15

Zhong, Xiao. "A study of several statistical methods for classification with application to microbial source tracking". Link to electronic thesis, 2004. http://www.wpi.edu/Pubs/ETD/Available/etd-0430104-155106/.

Full text
Abstract:
Thesis (M.S.)--Worcester Polytechnic Institute.
Keywords: classification; k-nearest-neighbor (k-n-n); neural networks; linear discriminant analysis (LDA); support vector machines; microbial source tracking (MST); quadratic discriminant analysis (QDA); logistic regression. Includes bibliographical references (p. 59-61).
16

Cheng, Si. "Hierarchical Nearest Neighbor Co-kriging Gaussian Process For Large And Multi-Fidelity Spatial Dataset". University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1613750570927821.

Full text
17

Ogden, Mitchell S. "Observing Clusters and Point Densities in Johnson City, TN Crime Using Nearest Neighbor Hierarchical Clustering and Kernel Density Estimation". Digital Commons @ East Tennessee State University, 2019. https://dc.etsu.edu/asrf/2019/schedule/138.

Full text
Abstract:
Utilizing statistical methods as a risk assessment tool can lead to potentially effective solutions and policies that address various social issues. One usage for such methods is in observation of crime trends within a municipality. Cluster and hotspot analysis is often practiced in criminal statistics to delineate potential areas at-risk of recurring criminal activity. Two approaches to this analytical method are Nearest Neighbor Hierarchical Clustering (NNHC) and Kernel Density Estimation (KDE). Kernel Density Estimation fits incidence points on a grid based on a kernel and bandwidth determined by the analyst. Nearest Neighbor Hierarchical Clustering, a less common and less quantitative method, derives clusters based on the distance between observed points and the expected distance for points of a random distribution. Crime data originated from a public web map and database service that acquires data from the Johnson City Police Department, where each incident is organized into one of many broad categories such as assault, theft, etc. Preliminary analysis of raw volume data shows trends of high crime volume in expected locales; highly trafficked areas such as downtown, the Mall, both Walmarts, as well as low-income residential areas of town. The two methods, KDE and NNHC, dispute the size and location of many clusters. A more in-depth analysis of normalized data with refined parameters may provide further insight on crime in Johnson City.
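
A minimal sketch of the two techniques named above, kernel density estimation and a Clark-Evans-style nearest-neighbor index of the kind that underlies nearest-neighbor clustering methods, on synthetic incident coordinates (the Johnson City data and parameters are not reproduced; the 10 km x 10 km study area is an assumption of the example).

```python
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors

rng = np.random.default_rng(1)
# Synthetic incident coordinates in km: one dense cluster plus background noise.
pts = np.vstack([rng.normal([2.0, 2.0], 0.3, size=(150, 2)),
                 rng.uniform(0.0, 10.0, size=(150, 2))])

# Kernel density estimate on a grid; kernel and bandwidth are the analyst's choice.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(pts)
grid = np.stack(np.meshgrid(np.linspace(0, 10, 50),
                            np.linspace(0, 10, 50)), axis=-1).reshape(-1, 2)
density = np.exp(kde.score_samples(grid))
print("densest grid cell:", grid[np.argmax(density)])

# Observed mean nearest-neighbor distance vs. the value expected under
# complete spatial randomness (the comparison NNHC-style methods build on).
nn = NearestNeighbors(n_neighbors=2).fit(pts)
d_obs = nn.kneighbors(pts)[0][:, 1].mean()
d_exp = 0.5 / np.sqrt(len(pts) / (10.0 * 10.0))   # assumed 10 km x 10 km study area
print(f"nearest-neighbor index = {d_obs / d_exp:.2f} (values < 1 indicate clustering)")
```
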
18

Gard, Rikard. "Design-based and Model-assisted estimators using Machine learning methods : Exploring the k-Nearest Neighbor metod applied to data from the Recreational Fishing Survey". Thesis, Örebro universitet, Handelshögskolan vid Örebro Universitet, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-72488.

Full text
19

Funai, Tomohiko. "Extensions of Nearest Shrunken Centroid Method for Classification". BYU ScholarsArchive, 2010. https://scholarsarchive.byu.edu/etd/2402.

Full text
Abstract:
Stylometry assumes that the essence of the individual style of an author can be captured using a number of quantitative criteria, such as the relative frequencies of noncontextual words (e.g., or, the, and, etc.). Several statistical methodologies have been developed for authorship analysis. Jockers et al. (2009) utilize Nearest Shrunken Centroid (NSC) classification, a promising classification methodology in DNA microarray analysis for authorship analysis of the Book of Mormon. Schaalje et al. (2010) develop an extended NSC classification to remedy the problem of a missing author. Dabney (2005) and Koppel et al. (2009) suggest other modifications of NSC. This paper develops a full Bayesian classifier and compares its performance to five versions of the NSC classifier using the Federalist Papers, the Book of Mormon text blocks, and the texts of seven other authors. The full Bayesian classifier was superior to all other methods.
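
Scikit-learn's NearestCentroid classifier exposes the shrunken-centroid idea behind NSC through its shrink_threshold parameter. The sketch below compares a few threshold values on synthetic high-dimensional data standing in for word-frequency features; it does not reproduce the thesis texts or the full Bayesian classifier developed there.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestCentroid

# High-dimensional toy data standing in for noncontextual word frequencies.
X, y = make_classification(n_samples=300, n_features=500, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

for threshold in (None, 0.5, 2.0):       # None = plain nearest centroid, no shrinkage
    clf = NearestCentroid(shrink_threshold=threshold)
    accuracy = cross_val_score(clf, X, y, cv=5).mean()
    print(f"shrink_threshold={threshold}: cross-validated accuracy {accuracy:.2f}")
```
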
20

Ma, Tao. "Statistics of Quantum Energy Levels of Integrable Systems and a Stochastic Network Model with Applications to Natural and Social Sciences". University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1378196433.

Full text
21

Zhang, Xianjie, and Sebastian Bogic. "Datautvinning av klickdata : Kombination av klustring och klassifikation". Thesis, KTH, Hälsoinformatik och logistik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-230630.

Full text
Abstract:
Owners of websites and applications usually profit from users who click on their links, which can be advertisements or items for sale, among other things. There are many data-analysis studies that predict whether a link will be clicked, but only a few that focus on what needs to be adjusted to get the link clicked. The problem that Flygresor.se has is that it lacks a tool for its customers, travel agencies, to analyse their tickets and then adjust the attributes of those trips. The requested solution was an application that gave suggestions on how to change the tickets so that they would be clicked more often and in that way generate more sales. A prototype was constructed which makes use of two different data mining methods: clustering with the DBSCAN algorithm and classification with the k-nearest neighbor algorithm. These algorithms were used together with an evaluation process, called DNNA, which analyzes the results from the algorithms and gives suggestions about changes that could be made to the attributes of the links. The combination of the algorithms and DNNA was tested and evaluated as the solution to the problem. The program was able to predict which attributes of the tickets needed to be adjusted to get the tickets more clicks. The recommendations of adjustments were reasonable, but the result could not be compared to similar tools since none had been published.
22

Sammon, Ryan. "Data Collection, Analysis, and Classification for the Development of a Sailing Performance Evaluation System". Thèse, Université d'Ottawa / University of Ottawa, 2013. http://hdl.handle.net/10393/25481.

Full text
Abstract:
The work described in this thesis contributes to the development of a system to evaluate sailing performance. This work was motivated by the lack of tools available to evaluate sailing performance. The goal of the work presented is to detect and classify the turns of a sailing yacht. Data was collected using a BlackBerry PlayBook affixed to a J/24 sailing yacht. This data was manually annotated with three types of turn: tack, gybe, and mark rounding. This manually annotated data was used to train classification methods. Classification methods tested were multi-layer perceptrons (MLPs) of two sizes in various committees and nearest-neighbour search. Pre-processing algorithms tested were Kalman filtering, categorization using quantiles, and residual normalization. The best solution was found to be an averaged answer committee of small MLPs, with Kalman filtering and residual normalization performed on the input as pre-processing.
23

Nathan, Andrew Prashant. "Single Chain Statistics of a Polymer in a Crystallizable Solvent". University of Akron / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=akron1216146248.

Full text
24

Shi, Hongxiang. "Hierarchical Statistical Models for Large Spatial Data in Uncertainty Quantification and Data Fusion". University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1504802515691938.

Full text
25

Tsipenyuk, Gregory. "Evaluation of decentralized email architecture and social network analysis based on email attachment sharing". Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/273963.

Full text
Abstract:
Present-day email is provided by centralized services running in the cloud. The services transparently connect users behind middleboxes and provide backup, redundancy, and high availability at the expense of user privacy. In present-day mobile environments, users can access and modify email from multiple devices, with updates reconciled on the central server. Prioritizing updates is difficult and may be undesirable. Moreover, legacy email protocols do not provide optimal email synchronization and access. The recent phenomenon of the Internet of Things (IoT) will see the number of interconnected devices grow to 27 billion by 2021. In the first part of my dissertation I propose a decentralized email architecture which takes advantage of a user's IoT devices to maintain a complete email history. This addresses the email reconciliation issue and places data under user control. I replace legacy email protocols with a synchronization protocol to achieve eventual consistency of email and optimize bandwidth and energy usage. The architecture is evaluated on a Raspberry Pi computer. There is an extensive body of research on Social Network Analysis (SNA) based on email archives. Typically, the analyzed network reflects either communication between users or a relationship between the email and the information found in the email's header and body. This approach discards either all or some email attachments that cannot be converted to text; for instance, images. Yet attachments may use up to 90% of an email archive's size. In the second part of my dissertation I suggest extracting the network from email attachments shared between users. I hypothesize that the network extracted from shared email attachments might provide more insight into the social structure of the email archive. I evaluate communication and shared email attachment networks by analyzing common centrality measures and classification and clustering algorithms. I further demonstrate how the analysis of the shared attachments network can be used to optimize the proposed decentralized email architecture.
26

Prabhu, Chitra. "COMPARISON OF THE UTILITY OF REGRESSION ANALYSIS AND K-NEAREST NEIGHBOR TECHNIQUE TO ESTIMATE ABOVE-GROUND BIOMASS IN PINE FORESTS USING LANDSAT ETM+ IMAGERY". MSSTATE, 2006. http://sun.library.msstate.edu/ETD-db/theses/available/etd-08092006-091449/.

Full text
Abstract:
There is a lack of a precise and universally accepted approach to the quantification of carbon sequestered in above-ground woody biomass using remotely sensed data. The drafting of the Kyoto Protocol has made the subject of carbon sequestration more important, making the development of accurate and cost-effective remote sensing models a necessity. Much work has been done on estimating above-ground woody biomass from spectral data using the traditional multiple linear regression approach and the Finnish k-nearest neighbor approach, but the accuracy of these methods in estimating biomass has not been compared. The purpose of this study is to compare the ability of these two methods to estimate above-ground biomass (AGB) using spectral data derived from Landsat ETM+ imagery.
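
A minimal sketch of the two estimation approaches being compared above, fitted to made-up "spectral band" predictors and a made-up biomass response (the band count, relationship, and noise level are invented for illustration only).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Six synthetic "spectral band" predictors and an invented biomass response.
bands = rng.uniform(0.0, 1.0, size=(200, 6))
biomass = 40 * bands[:, 3] - 25 * bands[:, 2] + 10 + rng.normal(0, 2, size=200)

for name, model in [("multiple linear regression", LinearRegression()),
                    ("k-nearest neighbor (k=5)", KNeighborsRegressor(n_neighbors=5))]:
    r2 = cross_val_score(model, bands, biomass, cv=5, scoring="r2").mean()
    print(f"{name}: cross-validated R^2 = {r2:.2f}")
```
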
27

Sahtout, Mohammad Omar. "Improving the performance of the prediction analysis of microarrays algorithm via different thresholding methods and heteroscedastic modeling". Diss., Kansas State University, 2014. http://hdl.handle.net/2097/17914.

Full text
Abstract:
Doctor of Philosophy
Department of Statistics
Haiyan Wang
This dissertation considers different methods to improve the performance of the Prediction Analysis of Microarrays (PAM). PAM is a popular algorithm for high-dimensional classification. However, it has a drawback of retaining too many features even after multiple runs of the algorithm to perform further feature selection. The average number of selected features is 2611 from the application of PAM to 10 multi-class microarray human cancer datasets. Such a large number of features make it difficult to perform follow up study. This drawback is the result of the soft thresholding method used in the PAM algorithm and the thresholding parameter estimate of PAM. In this dissertation, we extend the PAM algorithm with two other thresholding methods (hard and order thresholding) and a deep search algorithm to achieve better thresholding parameter estimate. In addition to the new proposed algorithms, we derived an approximation for the probability of misclassification for the hard thresholded algorithm under the binary case. Beyond the aforementioned work, this dissertation considers the heteroscedastic case in which the variances for each feature are different for different classes. In the PAM algorithm the variance of the values for each predictor was assumed to be constant across different classes. We found that this homogeneity assumption is invalid for many features in most data sets, which motivates us to develop the new heteroscedastic version algorithms. The different thresholding methods were considered in these algorithms. All new algorithms proposed in this dissertation are extensively tested and compared based on real data or Monte Carlo simulation studies. The new proposed algorithms, in general, not only achieved better cancer status prediction accuracy, but also resulted in more parsimonious models with significantly smaller number of genes.
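
The contrast between the soft thresholding used by PAM and the hard thresholding explored in the dissertation comes down to the rule applied to the standardized centroid contrasts (the d_ik of the PAM literature). A minimal numpy sketch of the two rules, on toy values:

```python
import numpy as np

def soft_threshold(d, delta):
    """PAM-style soft thresholding: shrink toward zero, then truncate."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def hard_threshold(d, delta):
    """Hard thresholding: keep a contrast unchanged or drop it entirely."""
    return np.where(np.abs(d) > delta, d, 0.0)

d = np.array([-3.0, -1.2, -0.4, 0.1, 0.8, 2.5])   # toy centroid contrasts
for delta in (0.5, 1.0):
    print(delta, soft_threshold(d, delta), hard_threshold(d, delta))
```
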
28

Favaro, Martha Maria Andreotti 1981. "Exploração de dados multivariados de fontes e extratos de antocianinas ultilizando análise de componentes princiaipais e método do vizinho mais proximo". [s.n.], 2012. http://repositorio.unicamp.br/jspui/handle/REPOSIP/250159.

Full text
Abstract:
Advisor: Adriana Vitorino Rossi
Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Química
Anthocyanins (ACYS) are natural dyes responsible for color in fruits, vegetables, flowers and grains. New perspectives for the use of anthocyanins in various industries stimulate analytical studies to systematize the identification and classification of sources and extracts of these dyes. In this work, typical Brazilian fruits: mulberry (Morus nigra), blackberry (Rubus sp.), jaboticaba (Myrciaria cauliflora), jambolan (Syzygium cumini), jussara fruit (Euterpe edulis Mart.), strawberry (Fragaria x ananassa Duch) and grapes (Vitis vinifera and Vitis vinifera L. 'Brazil'); vegetables: red lettuce (Lactuca sativa), eggplant (Solanum melongena), purple onion (Allium cepa), radish (Raphanus sativus) and red cabbage (Brassica oleracea); and flowers: Busy Lizzie (Impatiens walleriana), geranium (Pelargonium hortorum and Pelargonium peltatum L.), hibiscus (Hibiscus sinensis and Hibiscus syriacus) and hydrangea (Hydrangea macrophylla) were used as sources of ACYS. The literature describes several techniques for analyzing ACYS in plant materials and their extracts, with emphasis on high performance liquid chromatography (HPLC), mass spectrometry (MS) and spectrophotometry (UV-Vis). All of these techniques were applied in this work, including reflectance spectrophotometry and micellar electrokinetic chromatography (MEKC), one of the capillary electromigration techniques. The chemometric tools used in data handling were principal component analysis (PCA) and the k-nearest neighbor method (KNN). The chemometric classification models obtained are robust, with prediction errors below 30%, and make it possible to identify the source of the ACYS, the extraction solvent, the age of the extracts, and their stability and storage conditions. The results show that data obtained from simple analytical techniques such as absorption spectrophotometry, and from techniques requiring no sample preparation such as diffuse reflectance in the visible region, are comparable to results obtained from sophisticated and expensive techniques such as HPLC and MEKC, and even surpass some of the information obtained by MS.
Doctorate
Analytical Chemistry
Doctor of Sciences
29

Aygar, Alper. "Doppler Radar Data Processing And Classification". Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/12609890/index.pdf.

Full text
Abstract:
In this thesis, improving the performance of the automatic recognition of Doppler radar targets is studied. The radar used in this study is a ground-surveillance Doppler radar. Target types are car, truck, bus, tank, helicopter, moving man and running man. The input of this thesis is the output of the real Doppler radar signals which are normalized and preprocessed (TRP vectors: Target Recognition Pattern vectors) in the doctoral thesis by Erdogan (2002). TRP vectors are normalized and homogenized Doppler radar target signals with respect to target speed, target aspect angle and target range. Some target classes have repetitions in time in their TRPs. By the use of these repetitions, improvement of the target type classification performance is studied. K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) algorithms are used for Doppler radar target classification and the results are evaluated. Before classification, PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), NMF (Nonnegative Matrix Factorization) and ICA (Independent Component Analysis) are implemented and applied to the normalized Doppler radar signals for feature extraction and dimension reduction in an efficient way. These techniques transform the input vectors, which are the normalized Doppler radar signals, to another space. The effects of the implementation of these feature extraction algorithms and the use of the repetitions in Doppler radar target signals on the Doppler radar target classification performance are studied.
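
A generic scikit-learn sketch of the kind of pipeline described above, dimension reduction followed by KNN or SVM classification, on synthetic stand-ins for the TRP vectors (the real radar data and the other reductions such as LDA, NMF, and ICA are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic 7-class stand-ins for the normalized TRP vectors.
X, y = make_classification(n_samples=700, n_features=128, n_informative=30,
                           n_classes=7, n_clusters_per_class=1, random_state=0)

for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)), ("SVM", SVC())]:
    pipe = make_pipeline(PCA(n_components=20), clf)   # reduce dimension, then classify
    print(name, cross_val_score(pipe, X, y, cv=5).mean())
```
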
30

Kucuktunc, Onur. "Result Diversification on Spatial, Multidimensional, Opinion, and Bibliographic Data". The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1374148621.

Full text
31

Ahmed, Mohamed Salem. "Contribution à la statistique spatiale et l'analyse de données fonctionnelles". Thesis, Lille 3, 2017. http://www.theses.fr/2017LIL30047/document.

Full text
Abstract:
This thesis is about statistical inference for spatial and/or functional data. Indeed, we are interested in the estimation of unknown parameters of some models from random or non-random (stratified) samples composed of independent or spatially dependent variables. The specificity of the proposed methods lies in the fact that they take into consideration the nature of the sample considered (stratified or spatial sample). We begin by studying data valued in a space of infinite dimension, so-called "functional data". First, we study a functional binary choice model explored in a case-control or choice-based sample design context. The specificity of this study is that the proposed method takes into account the sampling scheme. We describe a conditional likelihood function under the sampling distribution and a dimension-reduction strategy to define a feasible conditional maximum likelihood estimator of the model. Asymptotic properties of the proposed estimates as well as their application to simulated and real data are given. Secondly, we explore a functional linear autoregressive spatial model whose particularity lies in the functional nature of the explanatory variable and the structure of the spatial dependence. The estimation procedure consists of reducing the infinite dimension of the functional variable and maximizing a quasi-likelihood function. We establish the consistency and asymptotic normality of the estimator. The usefulness of the methodology is illustrated via simulations and an application to some real data. In the second part of the thesis, we address some estimation and prediction problems of real random spatial variables. We start by generalizing the k-nearest neighbors method, namely k-NN, to predict a spatial process at non-observed locations using some covariates. The specificity of the proposed k-NN predictor lies in the fact that it is flexible and allows for heterogeneity in the covariate. We establish the almost complete convergence, with rates, of the spatial predictor, whose performance is illustrated by an application over simulated and environmental data. In addition, we generalize the partially linear probit model of independent data to the spatial case. We use a linear process for disturbances allowing various spatial dependencies and propose a semiparametric estimation approach based on weighted likelihood and generalized method of moments. We establish the consistency and asymptotic distribution of the proposed estimators and investigate the finite sample performance of the estimators on simulated data. We end with an application of spatial binary choice models to identify UADT (upper aerodigestive tract) cancer risk factors in the north region of France, which displays the highest rates of such cancer incidence and mortality in the country.
32

Dikkaya, Fahri. "Settlement Patterns Of Altinova In The Early Bronze Age". Master's thesis, METU, 2003. http://etd.lib.metu.edu.tr/upload/1254614/index.pdf.

Full text
Abstract:
This study aims to investigate the settlement patterns of Altinova in the Early Bronze Age and their reflection in social and cultural phenomena. Altinova, which is the most arable plain in Eastern Anatolia, is situated within the borders of Elazig province. In the Early Bronze Age the region was a conjunction and interaction area for two main cultural complexes of the Near East, Syro-Mesopotamia and Transcaucasia, combined with a strong local character. The effect of these foreign and local cultural interactions on the settlement patterns of Altinova in the Early Bronze Age, and its reflection in socio-economic structures, is discussed from a social perspective. In addition, the settlement distribution and its system were analyzed through quantitative methods, namely the gravity model, rank-size analysis, and nearest neighbor analysis. The results of these quantitative analyses, together with the archaeological data, are discussed in a social and theoretical context.
33

Tandan, Isabelle, and Erika Goteman. "Bank Customer Churn Prediction : A comparison between classification and evaluation methods". Thesis, Uppsala universitet, Statistiska institutionen, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-411918.

Full text
Abstract:
This study aims to assess which supervised statistical learning method (random forest, logistic regression, or K-nearest neighbor) is best at predicting bank customer churn. Additionally, the study evaluates which cross-validation approach, k-fold cross-validation or leave-one-out cross-validation, yields the most reliable results. Predicting customer churn has increased in popularity since new technology, regulation and changed demand have led to an increase in competition for banks. Thus, with greater reason, banks acknowledge the importance of maintaining their customer base. The findings of this study are that an unrestricted random forest model estimated using k-fold cross-validation is to be preferred in terms of performance measurements, computational efficiency and from a theoretical point of view. Although k-fold cross-validation and leave-one-out cross-validation yield similar results, k-fold cross-validation is preferable due to computational advantages. For future research, methods that generate models with both good interpretability and high predictability would be beneficial, in order to combine knowledge of which customers end their engagement with an understanding of why. Moreover, interesting future research would be to analyze at which dataset size leave-one-out cross-validation and k-fold cross-validation yield the same results.
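
The model and resampling comparison described above can be sketched with scikit-learn on a small synthetic dataset (the bank's churn data are not public, no tuning of the "unrestricted" random forest is attempted, and leave-one-out refits each model once per observation, so the example is kept small):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

models = {"random forest": RandomForestClassifier(random_state=1),
          "logistic regression": LogisticRegression(max_iter=1000),
          "k-nearest neighbor": KNeighborsClassifier()}

for name, model in models.items():
    kfold = cross_val_score(model, X, y, cv=KFold(10, shuffle=True, random_state=1)).mean()
    loo = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
    print(f"{name}: 10-fold accuracy = {kfold:.3f}, leave-one-out accuracy = {loo:.3f}")
```
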
34

Servien, Rémi. "Estimation de régularité locale". Phd thesis, Université Montpellier II - Sciences et Techniques du Languedoc, 2010. http://tel.archives-ouvertes.fr/tel-00730491.

Full text
Abstract:
The objective of this thesis is to study the local behavior of a probability measure, in particular through a local regularity index. In the first part, we establish the asymptotic normality of the kn-nearest-neighbor estimator of the density and of the histogram. In the second, we define an estimator of the mode under weakened assumptions. We show that the regularity index plays a role in both problems. Finally, in a third part, we construct several estimators of the regularity index from estimators of the distribution function, of which we provide a bibliographical review.
35

Ambrožová, Monika. "Detekce fibrilace síní v krátkodobých EKG záznamech". Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-400984.

Full text
Abstract:
Atrial fibrillation is diagnosed in 1-2% of the population, and in the coming decades a significant increase in the number of patients with this arrhythmia is expected, in connection with the aging of the population and the higher incidence of some diseases that are considered risk factors for atrial fibrillation. The aim of this work is to describe the problem of atrial fibrillation and the methods that allow its detection in the ECG record. The first part of the work covers the theory of cardiac physiology and atrial fibrillation, together with a basic description of atrial fibrillation detection. The practical part describes software for the detection of atrial fibrillation provided by the BTL company. Furthermore, an atrial fibrillation detector is designed. Several parameters were selected to capture the variation of RR intervals: the standard deviation, the coefficients of skewness and kurtosis, the coefficient of variation, the root mean square of successive differences, the normalized absolute deviation, the normalized absolute difference, the median absolute deviation and the entropy. Three different classification models were used: support vector machine (SVM), k-nearest neighbor (KNN) and discriminant analysis classification. The SVM classification model achieves the best results (sensitivity: 67.1%; specificity: 97.0%; F-measure: 66.8%; accuracy: 92.9%).
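
To make the detector design concrete, the sketch below computes a few of the RR-interval irregularity measures listed above and trains an SVM on synthetic "regular" versus "irregular" RR series. The data, the reduced feature set, and the lack of proper validation are simplifications for illustration; this is not the BTL software or the thesis pipeline.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def rr_features(rr):
    """A few of the RR-interval irregularity measures listed in the abstract."""
    diffs = np.diff(rr)
    return [np.std(rr),                                # standard deviation
            np.std(rr) / np.mean(rr),                  # coefficient of variation
            np.sqrt(np.mean(diffs ** 2)),              # RMSSD
            np.median(np.abs(rr - np.median(rr)))]     # median absolute deviation

rng = np.random.default_rng(0)
regular = [rr_features(rng.normal(0.8, 0.02, 60)) for _ in range(50)]    # sinus-like
irregular = [rr_features(rng.uniform(0.4, 1.2, 60)) for _ in range(50)]  # AF-like
X = np.array(regular + irregular)
y = np.array([0] * 50 + [1] * 50)

clf = make_pipeline(StandardScaler(), SVC()).fit(X, y)
print(clf.score(X, y))   # resubstitution accuracy; a real study needs held-out records
```
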
36

Baum, Kristen Anne. "Feral Africanized honey bee ecology in a coastal prairie landscape". Texas A&M University, 2003. http://hdl.handle.net/1969/150.

Full text
37

Ramraj, Varun. "Exploiting whole-PDB analysis in novel bioinformatics applications". Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:6c59c813-2a4c-440c-940b-d334c02dd075.

Full text
Abstract:
The Protein Data Bank (PDB) is the definitive electronic repository for experimentally-derived protein structures, composed mainly of those determined by X-ray crystallography. Approximately 200 new structures are added weekly to the PDB, and at the time of writing, it contains approximately 97,000 structures. This represents an expanding wealth of high-quality information but there seem to be few bioinformatics tools that consider and analyse these data as an ensemble. This thesis explores the development of three efficient, fast algorithms and software implementations to study protein structure using the entire PDB. The first project is a crystal-form matching tool that takes a unit cell and quickly (< 1 second) retrieves the most related matches from the PDB. The unit cell matches are combined with sequence alignments using a novel Family Clustering Algorithm to display the results in a user-friendly way. The software tool, Nearest-cell, has been incorporated into the X-ray data collection pipeline at the Diamond Light Source, and is also available as a public web service. The bulk of the thesis is devoted to the study and prediction of protein disorder. Initially, trying to update and extend an existing predictor, RONN, the limitations of the method were exposed and a novel predictor (called MoreRONN) was developed that incorporates a novel sequence-based clustering approach to disorder data inferred from the PDB and DisProt. MoreRONN is now clearly the best-in-class disorder predictor and will soon be offered as a public web service. The third project explores the development of a clustering algorithm for protein structural fragments that can work on the scale of the whole PDB. While protein structures have long been clustered into loose families, there has to date been no comprehensive analytical clustering of short (~6 residue) fragments. A novel fragment clustering tool was built that is now leading to a public database of fragment families and representative structural fragments that should prove extremely helpful for both basic understanding and experimentation. Together, these three projects exemplify how cutting-edge computational approaches applied to extensive protein structure libraries can provide user-friendly tools that address critical everyday issues for structural biologists.
38

Jelínková, Jana. "Rozpoznání hudebního slohu z orchestrální nahrávky za pomoci technik Music Information Retrieval". Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2020. http://www.nusl.cz/ntk/nusl-413256.

Full text
Abstract:
As with all genres of popular music, classical music consists of many different subgenres. The aim of this work is to recognize those subgenres from orchestral recordings. It is focused on the time period from the very end of the 16th century to the beginning of the 20th century, which means that the Baroque, Classical and Romantic eras are researched. The Music Information Retrieval (MIR) approach was used to classify the chosen subgenres. In the first phase of MIR, parameters were extracted from the musical recordings and evaluated. Only the best parameters were used as input data for the machine learning classifiers, specifically kNN (K-Nearest Neighbor), LDA (Linear Discriminant Analysis), GMM (Gaussian Mixture Models) and SVM (Support Vector Machines). The final chapter summarizes the best results. According to the results, there is a significant difference between the Baroque era and the other researched eras, which led to better identification of Baroque era recordings. In contrast, the Classical era turned out to be relatively similar to the Romantic era, and therefore all classifiers had less success in identifying recordings from this era. The results are in line with music theory and the characteristics of the chosen musical eras.
Gli stili APA, Harvard, Vancouver, ISO e altri
39

Bílý, Ondřej. "Moderní řečové příznaky používané při diagnóze chorob". Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2011. http://www.nusl.cz/ntk/nusl-218971.

Testo completo
Abstract (sommario):
This work deals with the diagnosis of Parkinson's disease by analyzing the speech signal. The thesis first describes how the speech signal is produced, followed by a description of speech-signal analysis, its preparation and subsequent feature extraction. Parkinson's disease and the changes it causes in the speech signal are then described, together with the features used for its diagnosis (FCR, VSA, VOT, etc.). A further part of the work deals with feature selection and reduction using learning algorithms (SVM, ANN, k-NN) and their subsequent evaluation. The last part of the thesis describes a program for computing the features; the feature selection procedure is then described and finally all results are evaluated.
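As a hedged sketch of the feature selection and evaluation stage only (not the thesis program), the example below ranks placeholder speech features, keeps the strongest ones, and compares the three learning algorithms named in the abstract.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# placeholder data: 120 speakers x 30 speech features (e.g. FCR, VSA, VOT, ...)
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 30))
y = rng.integers(0, 2, size=120)        # 0 = healthy control, 1 = Parkinson's disease

for name, clf in [("SVM", SVC()),
                  ("ANN", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
                  ("k-NN", KNeighborsClassifier(5))]:
    # keep the 10 features with the strongest univariate relationship to the label
    model = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), clf)
    print(name, cross_val_score(model, X, y, cv=5).mean())
```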
Gli stili APA, Harvard, Vancouver, ISO e altri
40

(11181162), Jiexin Duan. "DISTRIBUTED NEAREST NEIGHBOR CLASSIFICATION WITH APPLICATIONS TO CROWDSOURCING". Thesis, 2021.

Cerca il testo completo
Abstract (sommario):
The aim of this dissertation is to study two problems of distributed nearest neighbor classification (DiNN) systematically. The first compares two DiNN classifiers based on different voting schemes: majority voting and weighted voting. The second extends the DiNN method to crowdsourcing applications, where each worker's data may have a different size and noisy labels due to low worker quality. Both statistical guarantees and numerical comparisons are studied in depth.

The first part of the dissertation focuses on distributed nearest neighbor classification in big data. The sheer volume and spatial/temporal disparity of big data may prohibit processing and storing the data centrally, which poses a considerable hurdle for nearest neighbor prediction since the entire training set must be memorized. One effective way to overcome this issue is the distributed learning framework. Through majority voting, the distributed nearest neighbor classifier achieves the same rate of convergence as its oracle version in terms of regret, up to a multiplicative constant that depends solely on the data dimension. This multiplicative difference can be eliminated by replacing majority voting with a weighted voting scheme. In addition, we provide sharp theoretical upper bounds on the number of subsamples that allow the distributed nearest neighbor classifier to reach the optimal convergence rate. It is interesting to note that the weighted voting scheme allows a larger number of subsamples than majority voting.
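The two aggregation schemes can be pictured with a small simulation. The sketch below (not the dissertation's implementation) splits the training data across several machines, runs a local kNN on each, and combines the local outputs either by majority voting over labels or by averaging the local class-probability estimates as a simple stand-in for weighted voting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(6000, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=6000) > 0).astype(int)
X_test, y_test, X_train, y_train = X[:1000], y[:1000], X[1000:], y[1000:]

s = 10                                                     # number of subsamples / machines
parts = np.array_split(rng.permutation(len(X_train)), s)
local = [KNeighborsClassifier(5).fit(X_train[i], y_train[i]) for i in parts]

hard = np.stack([m.predict(X_test) for m in local])        # s x n_test local labels
majority = (hard.mean(axis=0) > 0.5).astype(int)           # majority vote (binary labels)

soft = np.mean([m.predict_proba(X_test) for m in local], axis=0)
weighted = soft.argmax(axis=1)                             # averaged ("soft") vote

print("majority vote accuracy:", (majority == y_test).mean())
print("soft vote accuracy:    ", (weighted == y_test).mean())
```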

The second part of the dissertation extends DiNN methods to crowdsourcing applications. The noisy labels in crowdsourced data and the differing sizes of the worker data sets degrade the performance of DiNN methods. We propose an enhanced nearest neighbor classifier (ENN) to overcome this issue. Our proposed method achieves the same regret as its oracle version trained on expert data of the same size. We also propose two algorithms to estimate the worker quality when it is unknown in practice. One constructs worker-quality estimators from worker labels denoised by applying a kNN classifier to the expert data; unlike previous worker-quality estimation methods, which have no statistical guarantee, it achieves the same regret as ENN with observed worker quality. The other estimates the worker quality iteratively based on ENN, and it works well without the expert data required by most previous methods.
Gli stili APA, Harvard, Vancouver, ISO e altri
41

"Superseding neighbor search on uncertain data". 2009. http://library.cuhk.edu.hk/record=b5894020.

Testo completo
Abstract (sommario):
Yuen, Sze Man.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2009.
Includes bibliographical references (leaves [44]-46).
Abstract also in Chinese.
Thesis Committee --- p.i
Abstract --- p.ii
Acknowledgement --- p.iv
Chapter 1 --- Introduction --- p.1
Chapter 2 --- Related Work --- p.6
Chapter 2.1 --- Nearest Neighbor Search on Precise Data --- p.6
Chapter 2.2 --- NN Search on Uncertain Data --- p.8
Chapter 3 --- Problem Definitions and Basic Characteristics --- p.11
Chapter 4 --- The Full-Graph Approach --- p.16
Chapter 5 --- The Pipeline Approach --- p.19
Chapter 5.1 --- The Algorithm --- p.20
Chapter 5.2 --- Edge Phase --- p.24
Chapter 5.3 --- Pruning Phase --- p.27
Chapter 5.4 --- Validating Phase --- p.28
Chapter 5.5 --- Discussion --- p.29
Chapter 6 --- Extension --- p.31
Chapter 7 --- Experiment --- p.34
Chapter 7.1 --- Properties of the SNN-core --- p.34
Chapter 7.2 --- Efficiency of Our Algorithms --- p.38
Chapter 8 --- Conclusions and Future Work --- p.42
Chapter A --- List of Publications --- p.43
Bibliography --- p.44
Gli stili APA, Harvard, Vancouver, ISO e altri
42

Chamness, Kevin Andrew. "Multivariate fault detection and visualization in the semiconductor industry". Thesis, 2006. http://hdl.handle.net/2152/2830.

Testo completo
Gli stili APA, Harvard, Vancouver, ISO e altri
43

"Automatic text categorization for information filtering". 1998. http://library.cuhk.edu.hk/record=b5889734.

Testo completo
Abstract (sommario):
Ho Chao Yang.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1998.
Includes bibliographical references (leaves 157-163).
Abstract also in Chinese.
Abstract --- p.i
Acknowledgment --- p.iii
List of Figures --- p.viii
List of Tables --- p.xiv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Automatic Document Categorization --- p.1
Chapter 1.2 --- Information Filtering --- p.3
Chapter 1.3 --- Contributions --- p.6
Chapter 1.4 --- Organization of the Thesis --- p.7
Chapter 2 --- Related Work --- p.9
Chapter 2.1 --- Existing Automatic Document Categorization Approaches --- p.9
Chapter 2.1.1 --- Rule-Based Approach --- p.10
Chapter 2.1.2 --- Similarity-Based Approach --- p.13
Chapter 2.2 --- Existing Information Filtering Approaches --- p.19
Chapter 2.2.1 --- Information Filtering Systems --- p.19
Chapter 2.2.2 --- Filtering in TREC --- p.21
Chapter 3 --- Document Pre-Processing --- p.23
Chapter 3.1 --- Document Representation --- p.23
Chapter 3.2 --- Classification Scheme Learning Strategy --- p.26
Chapter 4 --- A New Approach - IBRI --- p.31
Chapter 4.1 --- Overview of Our New IBRI Approach --- p.31
Chapter 4.2 --- The IBRI Representation and Definitions --- p.34
Chapter 4.3 --- The IBRI Learning Algorithm --- p.37
Chapter 5 --- IBRI Experiments --- p.43
Chapter 5.1 --- Experimental Setup --- p.43
Chapter 5.2 --- Evaluation Metric --- p.45
Chapter 5.3 --- Results --- p.46
Chapter 6 --- A New Approach - GIS --- p.50
Chapter 6.1 --- Motivation of GIS --- p.50
Chapter 6.2 --- Similarity-Based Learning --- p.51
Chapter 6.3 --- The Generalized Instance Set Algorithm (GIS) --- p.58
Chapter 6.4 --- Using GIS Classifiers for Classification --- p.63
Chapter 6.5 --- Time Complexity --- p.64
Chapter 7 --- GIS Experiments --- p.68
Chapter 7.1 --- Experimental Setup --- p.68
Chapter 7.2 --- Results --- p.73
Chapter 8 --- A New Information Filtering Approach Based on GIS --- p.87
Chapter 8.1 --- Information Filtering Systems --- p.87
Chapter 8.2 --- GIS-Based Information Filtering --- p.90
Chapter 9 --- Experiments on GIS-based Information Filtering --- p.95
Chapter 9.1 --- Experimental Setup --- p.95
Chapter 9.2 --- Results --- p.100
Chapter 10 --- Conclusions and Future Work --- p.108
Chapter 10.1 --- Conclusions --- p.108
Chapter 10.2 --- Future Work --- p.110
Chapter A --- Sample Documents in the corpora --- p.111
Chapter B --- Details of Experimental Results of GIS --- p.120
Chapter C --- Computational Time of Reuters-21578 Experiments --- p.141
Gli stili APA, Harvard, Vancouver, ISO e altri
44

Lawson, Kathryn Sahara. "Defining activity areas in the Early Neolithic site at Foeni-Salaş (southwest Romania): A spatial analytic approach with geographical information systems in archaeology". 2007. http://hdl.handle.net/1993/2838.

Testo completo
Abstract (sommario):
Through the years, a great deal of archaeological research has focused on the earliest farming cultures of Europe (i.e. the Early Neolithic). However, little effort has been expended to uncover the type and nature of daily activities performed within Early Neolithic dwellings, particularly in the Balkans. This thesis conducts a spatial analysis of the Early Neolithic pit house levels of the Foeni-Salaş site in southwest Romania, in the northern half of the Balkans, to determine the kinds and locations of activities that occurred in these pit houses, the characteristic Early Neolithic dwellings of the northern Balkans. The data are analyzed using Geographic Information Systems (GIS) technology in an attempt to identify non-random patterns that indicate how the pit house inhabitants used their space. Both visual and statistical (Nearest Neighbor) techniques are used to identify spatial patterns. Spreadsheet data are incorporated into the map database in order to compare and contrast the results from the two techniques of analysis. Map data provide precise artefact locations, while spreadsheet data yield more generalized quad-centroid information; unlike the mapped data, the spreadsheet data also include artefacts recovered in sieves. Utilizing both data types gave a more complex and fuller understanding of how space was used at Foeni-Salaş. The results show that different types of activity areas are present within each of the pit houses. Comparison of interior to exterior artefact distributions demonstrates that most activities took place within the pit houses. The activities present include weaving, food preparation, butchering, hide processing, pottery making, ritual, and other activities related to the running of households. These activities were placed in specific locations relative to features within the pit house and to the physical structure of the pit house itself. This research adds to the growing body of archaeological research that implements GIS to answer questions and solve problems related to the spatial dimension of human behaviour.
February 2008
Gli stili APA, Harvard, Vancouver, ISO e altri
45

Cheema, Muhammad Aamir Computer Science & Engineering Faculty of Engineering UNSW. "CircularTrip and ArcTrip: effective grid access methods for continuous spatial queries". 2007. http://handle.unsw.edu.au/1959.4/40512.

Testo completo
Abstract (sommario):
A k nearest neighbor query q retrieves the k objects that lie closest to the query point q among a given set of objects P. With the availability of inexpensive location-aware mobile devices, the continuous monitoring of such queries has gained a lot of attention, and many methods have been proposed for continuously monitoring kNNs in highly dynamic environments. Multiple continuous queries require real-time results, and both the objects and the queries issue frequent location updates. The most popular spatial index, the R-tree, is not suitable for continuous monitoring of these queries because it handles frequent updates inefficiently. Recently, the interest of the database community has been shifting towards grid-based indexes for continuous queries, owing to their simplicity and efficient update handling. For kNN queries, the order in which the cells of the grid are accessed is very important. In this research, we present two efficient and effective grid access methods, CircularTrip and ArcTrip, which ensure that the number of cells visited for any continuous kNN query is minimal. Our extensive experimental study demonstrates that the CircularTrip-based continuous kNN algorithm outperforms existing approaches in terms of both efficiency and space requirements. Moreover, we show that CircularTrip and ArcTrip can be used for many other variants of nearest neighbor queries, such as constrained nearest neighbor queries, farthest neighbor queries and (k + m)-NN queries. All the algorithms presented for these queries preserve the properties that they visit the minimum number of cells for each query and that their space requirement is low. Our proposed techniques are flexible and efficient and can be used to answer any query that is a hybrid of the above-mentioned queries. For example, our algorithms can easily be used to efficiently monitor a (k + m) farthest neighbor query in a constrained region, with the flexibility that the spatial conditions constraining the region can be changed by the user at any time.
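To convey why the cell access order matters, the sketch below implements a much-simplified grid search that visits cells ring by ring around the query cell and stops as soon as no unvisited cell can contain a closer object; it illustrates the general idea only and is not the CircularTrip or ArcTrip algorithm.

```python
import math
import random
from collections import defaultdict

CELL = 0.05                                       # grid cell side length

def cell_of(p):
    return (int(p[0] // CELL), int(p[1] // CELL))

def build_grid(points):
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p)].append(p)
    return grid

def knn(grid, q, k):
    qc, best, r = cell_of(q), [], 0
    while True:
        # the square ring of cells at Chebyshev distance r from the query cell
        ring = [(qc[0] + dx, qc[1] + dy)
                for dx in range(-r, r + 1) for dy in range(-r, r + 1)
                if max(abs(dx), abs(dy)) == r]
        for c in ring:
            for p in grid.get(c, []):
                best.append((math.dist(q, p), p))
        best = sorted(best)[:k]
        # any point in an unvisited cell lies at least r * CELL away from q
        if len(best) == k and best[-1][0] <= r * CELL:
            return best
        r += 1

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(10000)]
print(knn(build_grid(pts), (0.5, 0.5), k=3))
```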
Gli stili APA, Harvard, Vancouver, ISO e altri
46

Chen, Hue-Ling, e 陳慧玲. "Design and Analysis of Nearest Neighbor Search Strategies". Thesis, 2002. http://ndltd.ncl.edu.tw/handle/87412624225347213682.

Testo completo
Abstract (sommario):
Master's thesis
National Sun Yat-sen University
Institute of Computer Science and Engineering
Academic year 90 (ROC calendar)
With the proliferation of wireless communications and rapid advances in technology, algorithms are needed for efficiently answering queries over large amounts of spatial data. Spatial data consist of spatial objects, including data of higher dimensions. Neighbor finding is one of the most important spatial operations in the field of spatial data structures. In recent years, many researchers have focused on finding efficient solutions to the nearest neighbor problem (NN), which involves determining the point in a data set that is nearest to a given query point; it is frequently used in Geographical Information Systems (GIS). A block B is said to be the neighbor of another block A if block B has the same property as block A and covers an equal-sized neighbor of block A. Jozef Voros has proposed a neighbor finding strategy on images represented by quadtrees, in which the four equal-sized neighbors of block A in the east, west, north, and south directions can be found. However, under Voros's strategy, the case in which the nearest neighbor lies in a diagonal direction (northeast, northwest, southeast, or southwest) is ignored. Moreover, there is no total ordering that preserves proximity when mapping spatial data from a higher-dimensional space to a 1D space. One way of effecting such a mapping is to utilize space-filling curves. Space-filling curves pass through every point in a space and give a one-to-one correspondence between the coordinates and the 1D sequence number of each point. The Peano curve, proposed by Orenstein, which derives the 1D coordinate of a point by simply interleaving the bits of the X and Y coordinates in the 2D space, can easily be used in neighbor finding; with data ordered by the RBG curve or the Hilbert curve, however, neighbor finding becomes complex. The RBG curve achieves savings in random disk accesses for range queries, and the Hilbert curve achieves the best clustering for range queries. Therefore, in this thesis we first show the case missing from Voros's strategy and show how to find it. Next, we show that the Peano curve is the best mapping function for nearest neighbor finding. We also give the transformation rules between the Peano curve and the other curves, so that we can efficiently find the nearest neighbor when the data are linearly ordered by the other curves. Our simulations show that the two proposed strategies work correctly and faster than the conventional strategies in nearest neighbor finding. Finally, we present a revised version of NA-Trees, which can serve exact match queries and range queries over a large, dynamic index, where an exact match query means finding a specific data object in a spatial database and a range query means reporting all data objects located in a specific range. By large, we mean that most of the index must be stored in secondary memory; by dynamic, we mean that insertions and deletions are intermixed with queries, so that the index cannot be built beforehand.
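The bit-interleaving mapping mentioned above is easy to reproduce. The following minimal sketch encodes and decodes a 2D grid coordinate as a Z-order (Peano) key and shows how an axis-aligned neighbour can be reached by decoding, stepping one cell, and re-encoding; the bit layout (X in even positions) is an assumption of the sketch.

```python
def peano_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y (x in even bit positions, y in odd ones)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def peano_decode(key: int, bits: int = 16) -> tuple:
    """Inverse mapping: recover (x, y) from an interleaved key."""
    x = y = 0
    for i in range(bits):
        x |= ((key >> (2 * i)) & 1) << i
        y |= ((key >> (2 * i + 1)) & 1) << i
    return x, y

assert peano_decode(peano_key(37, 91)) == (37, 91)

# the east neighbour of a cell: decode, step one cell in x, re-encode
x, y = peano_decode(peano_key(37, 91))
east_key = peano_key(x + 1, y)
```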
Gli stili APA, Harvard, Vancouver, ISO e altri
47

Li, Yung-Hsu, e 黎詠絮. "Adaptive Nearest Neighbor Discriminant Analysis for High-dimensional Data Classification". Thesis, 2010. http://ndltd.ncl.edu.tw/handle/22255030417583197406.

Testo completo
Abstract (sommario):
Master's thesis
Fu Jen Catholic University
Graduate Institute of Applied Statistics
Academic year 98 (ROC calendar)
With advances in technology, we can now collect a large number of data attributes. Data with massive numbers of attributes are called high-dimensional data; for example, each pixel in a hyperspectral image consists of hundreds or even thousands of bands. In high-dimensional data classification, however, the number of available training samples may be very limited; in fact, having only relatively small training sets is a common problem in high-dimensional data analysis. As a consequence of the curse of dimensionality, the classification accuracy may therefore be unsatisfactory. DANN is an adaptive classifier for high-dimensional data, but when the within-class covariance matrix is singular, which often occurs in high-dimensional problems, DANN performs poorly. In this thesis we propose DANN_PRDA to reduce the effect of high dimensionality in small-sample classification situations. Our study uses several different data sets, including hyperspectral images and face recognition data. We classify them with LDC, QDC, SVM, k-NN, DANN, DANN_DA, DANN_PRDA and other classifiers, and the experimental results show that the classification accuracy of DANN_PRDA exceeds that of the other classifiers.
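For readers unfamiliar with DANN, the sketch below gives a rough numpy rendering of the discriminant adaptive nearest neighbour idea: a local within-class matrix W and between-class matrix B are estimated around the query point and define an adapted metric for the final kNN vote. The ridge term added to W stands in for the kind of regularisation that the singularity problem calls for; this is an illustration under simplifying assumptions, not the thesis' DANN_PRDA method.

```python
import numpy as np

def _inv_sqrt(M, ridge=1e-3):
    # ridge keeps W invertible when the local sample is smaller than the dimension
    vals, vecs = np.linalg.eigh(M + ridge * np.eye(len(M)))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def dann_predict(X, y, x0, k=5, k_local=50, eps=1.0):
    # 1. Euclidean neighbourhood used to estimate the local metric
    idx = np.argsort(np.linalg.norm(X - x0, axis=1))[:k_local]
    Xl, yl = X[idx], y[idx]
    mu = Xl.mean(axis=0)
    d = X.shape[1]
    W, B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(yl):
        Xc = Xl[yl == c]
        pc = len(Xc) / len(Xl)
        W += pc * np.cov(Xc.T, bias=True)
        B += pc * np.outer(Xc.mean(axis=0) - mu, Xc.mean(axis=0) - mu)
    Wi = _inv_sqrt(W)
    Sigma = Wi @ (Wi @ B @ Wi + eps * np.eye(d)) @ Wi   # adapted local metric
    # 2. kNN vote under the adapted metric
    diffs = X - x0
    d_adapt = np.einsum("ij,jk,ik->i", diffs, Sigma, diffs)
    votes = y[np.argsort(d_adapt)[:k]]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)
print(dann_predict(X, y, x0=rng.normal(size=10)))
```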
Gli stili APA, Harvard, Vancouver, ISO e altri
48

Chen, Chih-Han, e 陳志翰. "Fault Diagnosis of Steam Turbine-Generator Sets Using K-Nearest Neighbor and Principal Component Analysis Methods". Thesis, 2015. http://ndltd.ncl.edu.tw/handle/85499230315301703030.

Testo completo
Abstract (sommario):
Master's thesis
Cheng Shiu University
Institute of Electrical Engineering
Academic year 103 (ROC calendar)
From the viewpoint of preventive maintenance, early detection of incipient faults in the fundamental equipment of steam power plants, especially the steam turbine-generator sets, has attracted considerable attention. Because the steam turbine-generator set is a vital device in the power system, its failure can lead to a wide-ranging system outage. Owing to the increasing capacity and structural complexity of steam turbine-generator sets, the relations among the components of a set have become closer than before. Research on vibration fault diagnosis is therefore not only of great importance for safe and stable machine operation but also a frontier issue in electrical engineering. This thesis presents a data mining approach based on a K-nearest neighbor (K-NN) classifier and principal component analysis (PCA) to diagnose the vibration faults of turbine-generator units. PCA is used to reduce the dimensionality of the input attributes through linear combinations of the original attributes. The test results demonstrate the feasibility of the proposed approach for diagnosing the vibration faults of turbine-generator units.
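A minimal sketch of such a PCA-plus-K-NN pipeline (not the thesis implementation) is shown below; the attribute count, the number of principal components and the fault class names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 24))        # 400 vibration samples x 24 attributes (placeholder)
y = rng.choice(["normal", "unbalance", "misalignment", "rub"], size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
model = make_pipeline(StandardScaler(),
                      PCA(n_components=5),        # linear combinations of the original attributes
                      KNeighborsClassifier(3))
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```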
Gli stili APA, Harvard, Vancouver, ISO e altri
49

Chang, Huan Ling, e 張華玲. "A Study on Spatial Analysis of National Monument through the Method of Nearest Neighbor Index in Tainan City". Thesis, 2017. http://ndltd.ncl.edu.tw/handle/pd99gr.

Testo completo
Abstract (sommario):
Master's thesis
University of Kang Ning
Graduate Institute of Leisure Management
Academic year 105 (ROC calendar)
The purpose of this study is to discuss the definition and classification of National Monuments and to analyze the spatial characteristics of the 22 National Monuments distributed across Tainan by marking them on a map and determining whether their distribution is concentrated or dispersed. The research method uses the Nearest-Neighbor Index to analyze the National Monuments located in 7 different districts. The findings are: (A) the spatial characteristics of the National Monuments not only convey humanistic character but also combine with the spatial characteristics of the different districts to serve as resources for local cultural tourism; (B) most of the National Monuments are located in the central and western districts of Tainan, where the population density is also concentrated, and within this limited geographical space these districts attract more tourists than others because of the concentration of National Monuments, their convenience, and the fact that many monuments lie within walking distance of one another; (C) comparing the distribution of National Monuments in the old Tainan City with that of present-day Tainan, the distribution has become more dispersed.
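The Nearest-Neighbor Index itself is straightforward to compute. The sketch below implements the usual Clark-Evans form, where R is the ratio of the observed mean nearest-neighbour distance to the value 0.5 / sqrt(n / A) expected under complete spatial randomness (R < 1 suggests clustering, R ≈ 1 randomness, R > 1 dispersion); the coordinates and study-area size are placeholders, not the actual monument locations.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbour_index(points, area):
    points = np.asarray(points, dtype=float)
    # k=2 because the closest hit at distance 0 is the point itself
    d, _ = cKDTree(points).query(points, k=2)
    observed = d[:, 1].mean()
    expected = 0.5 / np.sqrt(len(points) / area)
    return observed / expected

rng = np.random.default_rng(5)
pts = rng.uniform(0, 10_000, size=(22, 2))      # 22 sites in a 10 km x 10 km window (metres)
print("R =", nearest_neighbour_index(pts, area=10_000 * 10_000))
```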
Gli stili APA, Harvard, Vancouver, ISO e altri
50

Lo, Yu-Yan, e 羅玉燕. "Extracting Function-level Statements in Biological Expression Language from Biomedical Literature:A K Nearest Neighbor approach inspired by Principal Component Analysis". Thesis, 2016. http://ndltd.ncl.edu.tw/handle/a7p3u7.

Testo completo
Abstract (sommario):
Master's thesis
National Central University
Department of Computer Science and Information Engineering
Academic year 104 (ROC calendar)
Nowadays, understanding pathways is one of the main goals in the biomedical domain, because biological pathways involve various regulation mechanisms. Many regulation mechanisms have been discovered and reported in the biomedical literature, allowing life scientists to keep up with the latest results, and text mining for biomedical research is in high demand within the scientific community. Biological Expression Language (BEL) is designed to capture relationships between biological entities, such as genes, proteins and chemicals, in the scientific literature. It can not only describe positive/negative relationships between biomedical entities but also represent biomedical function-level information, such as complex abundance, chaperone protein, catalyst and so on. In related research, the best reported performance for function-level classification is 30.5%, and this performance affects the performance on full BEL statements. In order to enhance the integrity of full BEL statements, we propose a K-nearest neighbor (KNN) approach inspired by Principal Component Analysis (PCA) to recognize function-level terms automatically. In our experiments, the combination of PCA and KNN outperforms an SVM-based method, achieving an F-score of 59.70%. In conclusion, we hope that improved function-level classification will not only enhance the integrity of full BEL statements but also help to construct complete biological networks and accelerate biomedical research processes for life scientists.
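As a hedged sketch of the classification step only (the thesis' actual features and corpus are not reproduced here), the example below vectorises a few invented sentences, projects them onto a small number of principal directions using TruncatedSVD as the sparse-matrix analogue of PCA, and labels them with a kNN classifier; the sentences and function-level labels are placeholders loosely following the function types named in the abstract.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "complex of RELA and NFKB1 increases expression of the target gene",
    "the AP-1 complex binds the promoter region",
    "complex formation between BRCA1 and BARD1 was observed",
    "the protein complex assembles at the membrane",
    "HSP90 acts as a chaperone for the client kinase",
    "the chaperone activity of HSP70 stabilises the substrate",
    "DNAJ proteins function as co-chaperones during folding",
    "chaperone-mediated folding of the receptor was reported",
    "the enzyme catalyses phosphorylation of the substrate",
    "catalytic activity of the kinase increases under stress",
    "the phosphatase catalyses removal of the phosphate group",
    "catalysis by the ligase requires ATP",
]
labels = ["complex"] * 4 + ["chaperone"] * 4 + ["catalyst"] * 4

model = make_pipeline(TfidfVectorizer(),
                      TruncatedSVD(n_components=5, random_state=0),   # PCA-like projection
                      KNeighborsClassifier(3))
model.fit(texts, labels)
print(model.predict(["the kinase catalyses transfer of a phosphate group to its substrate"]))
```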
Gli stili APA, Harvard, Vancouver, ISO e altri