Dissertations / Theses on the topic 'Small datasets'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 26 dissertations / theses for your research on the topic 'Small datasets.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Shi, Xiaojin. "Visual learning from small training datasets /." Diss., Digital Dissertations Database. Restricted to UC campuses, 2005. http://uclibs.org/PID/11984.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Van Koten, Chikako. "Bayesian statistical models for predicting software effort using small datasets." University of Otago, Department of Information Science, 2007. http://adt.otago.ac.nz./public/adt-NZDU20071009.120134.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The need of today's society for new technology has resulted in the development of a growing number of software systems. Developing a software system is a complex endeavour that requires a large amount of time. This amount of time is referred to as software development effort. Software development effort is the sum of hours spent by all individuals involved; therefore, it is not equal to the duration of the development. Accurate prediction of the effort at an early stage of development is an important factor in the successful completion of a software system, since it enables the developing organization to allocate and manage their resources effectively. However, for many software systems, accurately predicting the effort is a challenge. Hence, a model that assists in the prediction is of active interest to software practitioners and researchers alike. Software development effort varies depending on many variables that are specific to the system, its developmental environment and the organization in which it is being developed. An accurate model for predicting software development effort can often be built specifically for the target system and its developmental environment. A local dataset of systems similar to the target system, developed in a similar environment, is then used to calibrate the model. However, such a dataset often consists of fewer than 10 software systems, causing a serious problem in the prediction, since the predictive accuracy of existing models deteriorates as the size of the dataset decreases. This research addressed this problem with a new approach using Bayesian statistics. This particular approach was chosen since the predictive accuracy of a Bayesian statistical model is not as dependent on a large dataset as that of other models: as the size of the dataset decreases to fewer than 10 software systems, the accuracy deterioration of the model is expected to be less than that of existing models. The Bayesian statistical model can also provide additional information useful for predicting software development effort, because it is capable of selecting important variables from multiple candidates. In addition, it is parametric and produces an uncertainty estimate. This research developed new Bayesian statistical models for predicting software development effort. Their predictive accuracy was then evaluated in four case studies using different datasets, and compared with other models applicable to the same small datasets. The results confirmed that the best new models are not only accurate but also consistently more accurate than their regression counterpart when calibrated with fewer than 10 systems. They can thus replace the regression model when using small datasets. Furthermore, one case study showed that the best new models are more accurate than a simple model that predicts the effort by calculating the average value of the calibration data. Two case studies also indicated that the best new models can be more accurate for some software systems than a case-based reasoning model. Since the case studies provided sufficient empirical evidence that the new models are generally more accurate than the existing models compared in the case of small datasets, this research has produced a methodology for predicting software development effort using the new models.
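The abstract does not reproduce the thesis's models, but the reason Bayesian approaches tolerate tiny calibration sets can be illustrated with a minimal conjugate Bayesian linear regression. This is a sketch only: the toy data, prior, noise variance, and all names below are our own illustrative assumptions, not the thesis's.

```python
import numpy as np

# Toy calibration set: 8 past projects (size in KLOC -> effort in person-hours).
# Values are invented for illustration only.
X = np.array([[1.2], [2.0], [3.1], [4.5], [5.0], [6.2], [7.8], [9.0]])
y = np.array([300., 520., 700., 1100., 1150., 1500., 1900., 2100.])

# Add an intercept column.
A = np.hstack([np.ones((len(X), 1)), X])

# Conjugate Bayesian linear regression with a Gaussian prior N(0, tau^2 I)
# on the coefficients and known noise variance sigma^2.  With so few points,
# the prior regularizes the fit and the posterior covariance flags uncertainty.
tau2, sigma2 = 1e4, 100.0**2
prior_prec = np.eye(A.shape[1]) / tau2
post_cov = np.linalg.inv(prior_prec + A.T @ A / sigma2)
post_mean = post_cov @ (A.T @ y / sigma2)

# Predictive mean and standard deviation for a new 5.5 KLOC project.
a_new = np.array([1.0, 5.5])
pred_mean = a_new @ post_mean
pred_std = np.sqrt(sigma2 + a_new @ post_cov @ a_new)
print(f"predicted effort: {pred_mean:.0f} +/- {pred_std:.0f} person-hours")
```

The posterior covariance widens the predictive interval when the calibration set is small, which corresponds to the uncertainty estimate the abstract mentions as a by-product of the Bayesian approach.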
3

Zhao, Amy(Xiaoyu Amy). "Learning distributions of transformations from small datasets for applied image synthesis." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/128342.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2020
Cataloged from PDF of thesis. "February 2020."
Includes bibliographical references (pages 75-91).
Much of the recent research in machine learning and computer vision focuses on applications with large labeled datasets. However, in realistic settings, it is much more common to work with limited data. In this thesis, we investigate two applications of image synthesis using small datasets. First, we demonstrate how to use image synthesis to perform data augmentation, enabling the use of supervised learning methods with limited labeled data. Data augmentation -- typically the application of simple, hand-designed transformations such as rotation and scaling -- is often used to expand small datasets. We present a method for learning complex data augmentation transformations, producing examples that are more diverse, realistic, and useful for training supervised systems than hand-engineered augmentation. We demonstrate our proposed augmentation method for improving few-shot object classification performance, using a new dataset of collectible cards with fine-grained differences. We also apply our method to medical image segmentation, enabling the training of a supervised segmentation system using just a single labeled example. In our second application, we present a novel image synthesis task: synthesizing time lapse videos of the creation of digital and watercolor paintings. Using a recurrent model of paint strokes and a novel training scheme, we create videos that tell a plausible visual story of the painting process.
by Amy (Xiaoyu) Zhao.
Ph.D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science.
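For context on what the thesis improves upon: conventional data augmentation applies simple, hand-designed transformations such as rotation and scaling. A minimal sketch of that baseline follows, using scipy.ndimage with invented parameter ranges; the thesis's learned-transformation models are not reproduced here.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def simple_augment(image, rng):
    """Hand-designed augmentation: random rotation and scaling.
    This is the conventional baseline the thesis improves on; the
    learned-transformation model itself is not reproduced here."""
    angle = rng.uniform(-15, 15)   # degrees; illustrative range
    scale = rng.uniform(0.9, 1.1)  # illustrative range
    out = rotate(image, angle, reshape=False, mode="nearest")
    out = zoom(out, scale, mode="nearest")
    # Crop or pad back to the original shape after zooming.
    h, w = image.shape[:2]
    out = out[:h, :w]
    pad_h, pad_w = h - out.shape[0], w - out.shape[1]
    if pad_h > 0 or pad_w > 0:
        out = np.pad(out, ((0, max(pad_h, 0)), (0, max(pad_w, 0))), mode="edge")
    return out

rng = np.random.default_rng(0)
augmented = [simple_augment(np.zeros((64, 64)), rng) for _ in range(5)]
```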
4

Arzamasov, Vadim, and K. Böhm (academic supervisor). "Comprehensible and Robust Knowledge Discovery from Small Datasets." Karlsruhe: KIT-Bibliothek, 2021. http://d-nb.info/1238148166/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Lazarovici, Allan 1979. "Development of gene-finding algorithms for fungal genomes : dealing with small datasets and leveraging comparative genomics." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/29681.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Thesis (M.Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
Includes bibliographical references (leaves 60-62).
A computer program called FUNSCAN was developed which identifies protein coding regions in fungal genomes. Gene structural and compositional properties are modeled using a Hidden Markov Model. Separate training and testing sets for FUNSCAN were obtained by aligning cDNAs from an organism to their genomic loci, generating a 'gold standard' set of annotated genes. The performance of FUNSCAN is competitive with other computer programs designed to identify protein coding regions in fungal genomes. A technique called 'Training Set Augmentation' is described which can be used to train FUNSCAN when only a small training set of genes is available. Techniques that combine alignment algorithms with FUNSCAN to identify novel genes are also discussed and explored.
by Allan Lazarovici.
M.Eng. and S.B.
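The HMM idea behind such gene finders can be illustrated with a toy two-state Viterbi decoder. Everything below (states, emission table, transition probabilities, the GC-content caricature) is invented for illustration and is far simpler than FUNSCAN's actual model.

```python
import numpy as np

# Toy 2-state gene-finding HMM: state 0 = intergenic, state 1 = coding.
# The invented emission table caricatures GC-richer coding regions.
states = ["intergenic", "coding"]
emit = {  # P(base | state); illustrative values only
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "coding":     {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
}
trans = np.log([[0.95, 0.05],   # P(next state | current state)
                [0.10, 0.90]])
start = np.log([0.5, 0.5])

def viterbi(seq):
    """Return the most likely state path for a DNA string."""
    n, k = len(seq), len(states)
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    for j, s in enumerate(states):
        score[0, j] = start[j] + np.log(emit[s][seq[0]])
    for i in range(1, n):
        for j, s in enumerate(states):
            cand = score[i - 1] + trans[:, j]
            back[i, j] = int(np.argmax(cand))
            score[i, j] = cand[back[i, j]] + np.log(emit[s][seq[i]])
    path = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [states[j] for j in reversed(path)]

print(viterbi("ATATACGCGCGCGATAT"))
```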
6

Horečný, Peter. "Metody segmentace obrazu s malými trénovacími množinami." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2020. http://www.nusl.cz/ntk/nusl-412996.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The goal of this thesis was to propose an image segmentation method capable of an effective segmentation process with small datasets. The recently published ODE neural network was used for this method, because its features should provide better generalization in tasks where only small datasets are available. The proposed ODE-UNet network was created by combining the UNet architecture with the ODE neural network, drawing on the benefits of both networks. ODE-UNet reached the following results on the ISBI dataset: Rand: 0.950272 and Info: 0.978061. These results are better than those obtained from the UNet model, which was also tested in this thesis, but it has been shown that the state of the art cannot be outperformed using ODE neural networks. However, the advantages of the ODE neural network over the tested UNet architecture and other methods were confirmed, and there is still room for improvement by extending this method.
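As a rough sketch of the ODE-block idea behind such an architecture (not the thesis's implementation): an ODE-defined feature transform can be approximated with fixed-step Euler integration in PyTorch, where the neural-ODE literature would normally use an adaptive solver. All names and sizes below are our own.

```python
import torch
import torch.nn as nn

class ODEBlock(nn.Module):
    """Fixed-step Euler approximation of an ODE-defined residual block:
    h(t+dt) = h(t) + dt * f(h(t)).  Neural-ODE papers use adaptive
    solvers; Euler keeps this sketch dependency-free."""
    def __init__(self, channels, steps=4):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.steps = steps

    def forward(self, h):
        dt = 1.0 / self.steps
        for _ in range(self.steps):
            h = h + dt * self.f(h)
        return h

# One could replace the plain double-conv blocks of a UNet encoder with
# ODEBlocks; how the thesis combines them exactly is not specified here.
x = torch.randn(1, 16, 64, 64)
print(ODEBlock(16)(x).shape)  # torch.Size([1, 16, 64, 64])
```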
7

Lucy, Caleb O. "Rapid Acquisition of Low Cost High-Resolution Elevation Datasets Using a Small Unmanned Aircraft System: An Application for Measuring River Geomorphic Change." Thesis, Boston College, 2015. http://hdl.handle.net/2345/bc-ir:104880.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Thesis advisor: Noah P. Snyder
Emerging methods for acquiring high-resolution topographic datasets have the potential to open new opportunities for quantitative geomorphic analysis. This study demonstrates a technique for rapidly obtaining structure-from-motion (SfM) photogrammetry-derived digital elevation models (DEMs) using aerial photographs acquired with a small unmanned aircraft system (sUAS). In conjunction with collection of aerial imagery, study sites are surveyed with a differential global positioning system (dGPS)-enabled total station (TPS) for georeferencing and accuracy assessment of sUAS SfM measurements. Results from sUAS SfM surveys of upland river channels in northern New England consistently produce DEMs and orthoimagery with ~1 cm pixel resolution. One-to-one point measurement comparisons demonstrate sUAS SfM systematically measures elevations about 0.16 ±0.23 m higher than TPS equivalents (0.28 m RMSE). Bathymetric (i.e. submerged or subaqueous) sUAS SfM measurements are 0.20 ±0.24 m (0.31 m RMSE) higher than TPS, whereas exposed (subaerial) points are 0.14 ±0.22 m (0.26 m RMSE) higher than TPS. Serial comparison of DEMs obtained before and after a two-year flood event indicates cut bank erosion and point bar deposition of ~0.10 m, consistent with expectations for channel evolution. DEMs acquired with sUAS SfM are of comparable resolution to, but a lower-cost alternative than, those from airborne light detection and ranging (lidar), the current standard for topographic imagery. Furthermore, lidar is not available for much of the United States, and sUAS SfM provides an efficient means for expanding coverage of this critical elevation dataset. Due to their utility in municipal, land use, and emergency planning, the demand for high-resolution topographic datasets continues to increase among governments, research institutions, and private sector consulting firms. Terrain analysis using sUAS SfM could therefore be a boon to river management and restoration in northern New England and other regions.
Thesis (MS) — Boston College, 2015
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Geology and Geophysics
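The geomorphic-change step of the abstract (comparing pre- and post-flood DEMs) reduces to a DEM of difference with an uncertainty threshold. A minimal sketch with synthetic grids follows; the 0.28 m RMSE figure is taken from the abstract, everything else is invented.

```python
import numpy as np

# Two co-registered DEMs (elevation grids in meters) from before and
# after the flood; random stand-ins here, real surveys in the thesis.
rng = np.random.default_rng(0)
dem_before = rng.normal(100.0, 1.0, (50, 50))
dem_after = dem_before + rng.normal(0.0, 0.05, (50, 50))

# DEM of difference: positive = deposition, negative = erosion.
dod = dem_after - dem_before

# Mask out change smaller than the survey uncertainty (~0.28 m RMSE per
# survey, combined in quadrature for the difference).
min_detectable = np.sqrt(0.28**2 + 0.28**2)
significant = np.where(np.abs(dod) > min_detectable, dod, np.nan)

print(f"mean change: {np.nanmean(dod):+.3f} m; "
      f"{np.count_nonzero(~np.isnan(significant))} cells above detection limit")
```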
8

Oppon, Ekow CruickShank. "Synergistic use of promoter prediction algorithms: a choice of small training dataset?" Thesis, University of the Western Cape, 2000. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_8222_1185436339.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:

Promoter detection, especially in prokaryotes, has always been an uphill task and may remain so, because of the many varieties of sigma factors employed by various organisms in transcription. The situation is made more complex by the fact that any seemingly unimportant sequence segment may be turned into a promoter sequence by an activator or repressor (if the actual promoter sequence is made unavailable). Nevertheless, a computational approach to promoter detection has to be pursued for a number of reasons. The obvious one that comes to mind is the long and tedious process involved in elucidating promoters in the 'wet' laboratories, not to mention the financial aspect of such endeavors. Promoter detection/prediction for an organism with few characterized promoters (M. tuberculosis), as envisaged at the beginning of this work, was never going to be easy. Even for the few known Mycobacterial promoters, most of the respective sigma factors associated with their transcription were not known. If the promoter-sigma information were available, the research would have been focused on categorizing the promoters according to sigma factors and training the methods on the respective categories, assuming that there would be enough training data for the respective categories. Most promoter detection/prediction studies have been carried out on E. coli because of the availability of a number of experimentally characterized promoters (about 310). Even then, no researcher to date has extended the research to the entire E. coli genome.

9

Forsberg, Fredrik, and Pierre Alvarez Gonzalez. "Unsupervised Machine Learning: An Investigation of Clustering Algorithms on a Small Dataset." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16300.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Context: With the rising popularity of machine learning, looking at its shortcomings is valuable in seeing how well machine learning is applicable. Is it possible to apply clustering to a small dataset? Objectives: This thesis consists of a literature study, a survey and an experiment. It investigates how two different unsupervised machine learning algorithms, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-means, run on a dataset gathered from a survey. Methods: We conducted a survey to see statistically what most people chose, and applied clustering to the survey data to confirm whether the clustering shows the same patterns that people picked statistically. Results: It was possible to identify patterns with clustering algorithms using a small dataset. The literature study shows examples where both algorithms have been used successfully. Conclusions: It is possible to see patterns using DBSCAN and K-means on a small dataset. The size of the dataset is not necessarily the only aspect to take into consideration; feature and parameter selection are both important as well, since the algorithms need to be tuned and customized to the data.
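Both algorithms are available off the shelf; a minimal sketch of running them on a small synthetic stand-in for the survey data (scikit-learn, with invented parameters that would need tuning, as the conclusions note):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# A tiny synthetic stand-in for the survey data (two loose groups).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (15, 2)),
                  rng.normal(3, 0.3, (15, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
dbscan = DBSCAN(eps=0.8, min_samples=4).fit(data)  # eps/min_samples must be tuned

print("k-means labels:", kmeans.labels_)
print("DBSCAN labels: ", dbscan.labels_)  # -1 marks points DBSCAN treats as noise
```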
10

Gay, Antonin. "Pronostic de défaillance basé sur les données pour la prise de décision en maintenance : Exploitation du principe d'augmentation de données avec intégration de connaissances à priori pour faire face aux problématiques du small data set." Electronic Thesis or Diss., Université de Lorraine, 2023. http://www.theses.fr/2023LORR0059.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This CIFRE PhD is a joint project between ArcelorMittal and the CRAN laboratory, with the aim to optimize industrial maintenance decision-making through the exploitation of the available sources of information, i.e. industrial data and knowledge, under the industrial constraints presented by the steel-making context. The current maintenance strategy on steel lines is based on regular preventive maintenance. Evolution of preventive maintenance towards a dynamic strategy is done through predictive maintenance. Predictive maintenance has been formalized within the Prognostics and Health Management (PHM) paradigm as a seven-step process. Among these PHM steps, this PhD's work focuses on decision-making and prognostics. The Industry 4.0 context puts emphasis on data-driven approaches, which require large amounts of data that industrial systems cannot systematically supply. The first contribution of the PhD consists in proposing an equation to link prognostics performance to the number of available training samples. This contribution allows prediction of the prognostics performance that could be obtained with additional data when dealing with small datasets. The second contribution of the PhD focuses on evaluating and analyzing the performance of data augmentation when applied to prognostics on small datasets. Data augmentation leads to an improvement of prognostics performance of up to 10%. The third contribution of the PhD consists in the integration of expert knowledge into data augmentation. Statistical knowledge integration proved efficient to avoid the performance degradation caused by data augmentation under some unfavorable conditions. Finally, the fourth contribution consists in the integration of prognostics into maintenance decision-making cost modeling and the evaluation of the prognostics impact on maintenance decision cost. It demonstrates that (i) the implementation of predictive maintenance reduces maintenance cost by up to 18-20% and (ii) the 10% prognostics improvement can reduce maintenance cost by an additional 1%.
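The thesis's performance-versus-sample-size equation is not given in the abstract; learning curves of this kind are often modeled with an inverse power law, and a hedged sketch of fitting such a curve (all numbers invented) might look like this:

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented prognostics scores measured at several training-set sizes.
n = np.array([5, 10, 20, 40, 80])
score = np.array([0.55, 0.66, 0.74, 0.79, 0.83])

def learning_curve(n, a, b, c):
    # Inverse power law: performance approaches the plateau a as n grows.
    return a - b * n ** (-c)

params, _ = curve_fit(learning_curve, n, score, p0=[0.9, 1.0, 0.5])
a, b, c = params
print(f"plateau ~ {a:.2f}; predicted score with 160 samples: "
      f"{learning_curve(160, a, b, c):.2f}")
```

A fit of this form supports exactly the decision the first contribution targets: estimating how much performance additional data could buy before collecting it.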
11

Tilgner, Martin. "Detekce chodců ve snímku pomocí metod strojového učení." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-400707.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This thesis deals with pedestrian detection using convolutional neural networks from the perspective of an autonomous vehicle, in particular with testing them in order to find good practices for building datasets for machine learning models. A total of ten machine learning models were trained, based on the meta-architectures Faster R-CNN with ResNet 101 as the feature extractor and SSDLite with the MobileNet_v2 feature extractor. These models were trained on datasets of various sizes. The best results were achieved on a dataset of 5000 images. In addition to these models, a new dataset focusing on pedestrians at night was created, along with a library of Python functions for working with datasets and for automatic dataset creation.
12

Guin, Agneev. "Terrain Classification to find Drivable Surfaces using Deep Neural Networks : Semantic segmentation for unstructured roads combined with the use of Gabor filters to determine drivable regions trained on a small dataset." Thesis, KTH, Robotik, perception och lärande, RPL, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-222021.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Autonomous vehicles face various challenges under difficult terrain conditions such as marginal rural or back-country roads, due to the lack of lane information, road signs or traffic signals. In this thesis, we investigate a novel approach of using Deep Neural Networks (DNNs) to classify off-road surfaces into types of terrain with the aim of supporting autonomous navigation in unstructured environments. For example, off-road surfaces can be classified as asphalt, gravel, grass, mud, snow, etc. Images from the camera mounted on a mining truck were used to perform semantic segmentation and to classify road surface types. Camera images were segmented manually for training into sets of 16 and 9 classes, for all relevant classes and the drivable classes respectively. A small but diverse dataset of 100 images was augmented and compiled along with nearby frames from the video clips to expand this dataset. Neural networks were used to test classification performance under these off-road conditions. Pre-trained AlexNet was compared to the networks without pre-training. Gabor filters, known to distinguish textured surfaces, were further used to improve the results of the neural network. The experiments show that pre-trained networks perform well with small datasets and many classes. A combination of Gabor filters with pre-trained networks can establish a dependable navigation path under difficult terrain conditions. While the results seem positive for images similar to the training image scenes, the networks fail to perform well in other situations. Though the tests imply that larger datasets are required for dependable results, this is a step closer to making autonomous vehicles drivable under off-road conditions.
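A Gabor filter bank of the kind the thesis combines with pre-trained networks can be built with OpenCV; the kernel parameters below are illustrative, not the thesis's settings.

```python
import cv2
import numpy as np

def gabor_features(gray, n_orientations=4):
    """Filter an image with a small Gabor bank; textured road surfaces
    (gravel vs. asphalt, say) respond differently across orientations.
    All kernel parameters here are illustrative choices."""
    responses = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0)
        responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    return np.stack(responses, axis=-1)  # H x W x n_orientations

gray = np.random.randint(0, 255, (64, 64), np.uint8).astype(np.float32)
print(gabor_features(gray).shape)
```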
13

Durand, Marie. "La découverte et la compréhension des profils d’apprenants : classification semi-supervisée et acquisition d’une langue seconde." Thesis, Paris 8, 2019. http://www.theses.fr/2019PA080029.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This thesis aims to develop an effective methodology for the discovery and description of the profiles of L2 learners based on acquisition data (perception, understanding and production). We want to detect patterns in the acquisition behaviours of subgroups of learners, taking into account the multidimensional aspect of the L2 learning process. The proposed methodology belongs to the field of artificial intelligence, more specifically to semi-supervised clustering techniques. Our algorithm has been applied to the database of the VILLA project, which includes the performance of learners from 5 different source languages (French, Italian, Dutch, German and English) with Polish as the target language. 156 adult learners were each tested with a variety of tasks in Polish during 14 hours of teaching sessions, starting from the initial exposure. These tests made it possible to evaluate their performance on the levels of linguistic analysis that are phonology, morphology, morphosyntax and lexicon. The database also includes their sensitivity to input characteristics, such as the frequency and transparency of lexical elements used in linguistic tasks. The similarity measure used in traditional clustering techniques is revisited in this work in order to evaluate the distance between two learners from an acquisitionist point of view. It is based on the identification of the learner's response strategy to a specific language test structure. We show that this measure makes it possible to detect the presence or absence in the learner's responses of a strategy close to the target language's inflectional system, and so enables our algorithm to provide a classification consistent with second language acquisition research. As a result, we claim that our algorithm may be relevant for the empirical establishment of learner profiles and for the discovery of new avenues of reflection on the acquisitional paths of ab initio learners.
14

Chen, Hung-Yu (陳泓佑). "Learning from small datasets containing nominal attributes." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/y2qgaw.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Ph.D. dissertation, National Cheng Kung University, Institute of Information Management, academic year 107 (2018/2019).
In many small-data learning problems, owing to the incomplete data structure, explicit information for decision makers is limited. Although machine learning algorithms are extensively applied to extract knowledge, most of them are developed without considering whether the training sets can fully represent the population properties. Focusing on small data that contain nominal inputs and continuous outputs, this dissertation develops an effective sample generating procedure based on fuzzy theories to tackle the learning issue through data preprocessing. According to the derived fuzzy relations between categories and continuous outputs, the possibilities of the combinations of categories (virtual samples) can be aggregated when continuous outputs are given. Proper virtual samples are further selected by applying a fuzzy alpha-cut to the possibility distributions, and these are added to the training sets to form new ones. In the experiment, sixteen datasets taken from the UC Irvine Machine Learning Repository are examined with back-propagation neural networks and support vector regressions. The results reveal that the forecasting accuracies of the two models are significantly improved when they are built with the proposed new training sets. Moreover, the results also indicate that the proposed method outperforms bootstrap aggregating and the synthetic minority over-sampling technique for nominal and continuous features (SMOTE-NC) with the greatest amount of statistical support.
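A heavily simplified sketch of screening virtual nominal-continuous combinations with a fuzzy alpha-cut: the triangular memberships, toy data, and threshold below are our own stand-ins for the fuzzy relations the dissertation derives.

```python
import numpy as np

# Toy small dataset: one nominal input with a continuous output.
categories = ["A", "A", "B", "B", "B", "C"]
outputs = np.array([10.0, 12.0, 20.0, 22.0, 24.0, 35.0])

def triangular_membership(cat, y):
    """Membership of output y in category cat, from a triangular fuzzy set
    built on that category's observed min/mean/max (a simplification of
    the fuzzy relations described in the abstract)."""
    ys = outputs[[i for i, c in enumerate(categories) if c == cat]]
    lo, peak, hi = ys.min(), ys.mean(), ys.max()
    if y <= lo or y >= hi:
        return 0.0
    return (y - lo) / (peak - lo) if y < peak else (hi - y) / (hi - peak)

# Generate candidate virtual samples and keep those above the alpha-cut.
alpha = 0.3
grid = np.linspace(outputs.min(), outputs.max(), 50)
virtual = [(c, y) for c in set(categories) for y in grid
           if triangular_membership(c, y) >= alpha]
print(len(virtual), "virtual samples retained")
```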
15

Chen, Chun-Wei (陳俊偉). "Applying Box-and-Whisker Plots for Learning from Small Datasets." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/24953023842861205688.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Chou, Tsai-Yuan (周才淵). "Generating Virtual Attributes by Fuzzy Clustering Algorithm for Small Datasets Learning." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/9dnjea.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Chen, Chien-Chih (陳建智). "Employing Dependent Virtual Samples for Learning More Information from Small Datasets." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/50934699766590312854.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Lin, Hong-Yang (林泓暘). "Generating Aggregated Weights to Improve the Predictive Accuracy of Single-Model Ensemble Numerical Predicting Method in Small Datasets." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/6b385s.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis, National Cheng Kung University, Department of Industrial and Information Management, academic year 105 (2016/2017).
In the age of the information explosion, it is easier than ever to access information, so how to extract and summarize useful information from limited data is an important question in small-data learning. Current studies of ensemble methods mostly focus on the process rather than the result. Data-mining methods can be divided into classification and prediction. In ensemble methods, voting is the most common way to deal with classification, but in numerical prediction problems the result is most commonly computed by averaging, which is easily affected by extreme values, especially under small-dataset conditions. We make an improvement to bagging: we use SVR as the prediction model, calculate the error value of each member based on the prediction model, derive a corresponding weight for each predicted value, and then calculate a compromise prediction aimed at the smallest error value. This stabilizes the system. We compare our method against the averaging method to examine the effect of our study, and a practical case from a panel factory confirms the improvement over the single-model ensemble method.
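A minimal sketch of the error-weighted single-model ensemble idea: bagged SVRs whose predictions are combined with inverse-error weights rather than a plain average. The exact weighting scheme of the thesis is not given in the abstract, so the in-bag absolute error used below is our own assumption.

```python
import numpy as np
from sklearn.svm import SVR

# Toy small dataset (20 samples, 1 feature).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (20, 1))
y = np.sin(X).ravel() * 5 + rng.normal(0, 0.3, 20)

# Bagging-style ensemble of SVRs; instead of a plain average, each model's
# prediction is weighted by the inverse of its in-bag training error, so
# extreme predictions from poorly fitted members count less.
models, weights = [], []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))      # bootstrap resample
    m = SVR(C=10.0).fit(X[idx], y[idx])
    err = np.mean(np.abs(m.predict(X[idx]) - y[idx])) + 1e-9
    models.append(m)
    weights.append(1.0 / err)
weights = np.array(weights) / np.sum(weights)

x_new = np.array([[4.2]])
preds = np.array([m.predict(x_new)[0] for m in models])
print("weighted ensemble prediction:", float(preds @ weights))
```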
19

Hult, Jim, and Pontus Pihl. "Inspecting product quality with computer vision techniques: Comparing traditional image processing methods with deep learning methods on small datasets in finding surface defects." Thesis, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-54056.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Quality control is an important part of any production line. It can be done manually but is most efficient if automated. Inspecting quality can include many different processes, but this thesis is focused on the visual inspection for cracks and scratches. The best way of doing this at the time of writing is with the help of Artificial Intelligence (AI), more specifically Deep Learning (DL). However, these methods need a training dataset beforehand, and for some smaller companies this might not be an option. This study tries to find an alternative visual inspection method that does not rely on a trained deep learning model, for when training data is severely limited. Our method is to use edge detection algorithms in combination with a template to find any edge that does not belong. These include scratches, cracks, or misaligned stickers. These anomalies are then highlighted in the original picture to show where the defect is. Since deep learning is the state of the art in visual inspection, it is expected to outperform template matching when sufficiently trained. To find where this occurs, the accuracy of template matching is compared to the accuracy of a deep learning model at different training levels. The deep learning model was trained on image-augmented datasets of sizes 6, 12, 24, 48, 84, 126, 180, 210, 315, and 423. Both template matching and the deep learning model were tested on the same balanced dataset of size 216. Half of the dataset was images of scratched units, and the other half was of unscratched units. This gave a baseline of 50%, where anything under would be worse than just guessing. Template matching achieved an accuracy of 88%, and the deep learning model's accuracy rose from 51% to 100% as the training set increased. This gives template matching better accuracy than a deep learning model trained on a dataset of 84 images or smaller, but a deep learning model trained on 126 images does start to outperform template matching. Template matching performed well where no data was available and training a deep learning model is not an option. And unlike a deep learning model, template matching would not need retraining to find other kinds of surface defects; it could also be used to find, for example, misplaced stickers, since with a template any edge that does not match is detected. The way a deep learning model is trained is highly customizable to the user's needs; due to resource and knowledge restrictions, a deep dive into this subject was not conducted. For template matching, only Canny edge detection was used when measuring accuracy; other edge detection methods, such as Sobel and Prewitt, were ruled out earlier in this study.
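A minimal sketch of the template-matching approach described above, using OpenCV's Canny detector. The thresholds, dilation size, and anomaly-count threshold are illustrative assumptions, not the thesis's tuned values.

```python
import cv2
import numpy as np

def edge_anomalies(image, template, threshold=30):
    """Compare a unit's Canny edge map against a defect-free template's;
    edges present in the image but absent from the (dilated) template are
    treated as defect candidates.  Parameter values are illustrative."""
    img_edges = cv2.Canny(image, 100, 200)
    tpl_edges = cv2.Canny(template, 100, 200)
    # Dilate the template edges so small alignment errors are tolerated.
    tolerant = cv2.dilate(tpl_edges, np.ones((5, 5), np.uint8))
    extra = cv2.subtract(img_edges, tolerant)  # edges not explained by template
    return int(np.count_nonzero(extra)) > threshold, extra

img = np.full((64, 64), 200, np.uint8)
cv2.line(img, (10, 10), (50, 50), 0, 1)        # simulated scratch
defective, mask = edge_anomalies(img, np.full((64, 64), 200, np.uint8))
print("defective:", defective)
```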
20

Lin, Wu-Kuo (林武國). "Rebuilding Sample Distributions for Small Dataset Learning." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/344kev.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Ph.D. dissertation, National Cheng Kung University, Department of Industrial and Information Management, academic year 106 (2017/2018).
Over the past few decades, numerous learning algorithms have been proposed to extract knowledge from data. The majority of these algorithms have been developed with the assumption that training sets can represent their populations. When the training sets capture only a few properties of their populations, the algorithms may extract minimal and/or biased knowledge for decision makers. This study develops a systematic procedure based on fuzzy theories to create new training sets by rebuilding the possible sample distributions; the procedure comprises new domain-estimation functions and a sample generation method. In this study, two real cases from a leading company in the thin film transistor liquid crystal display (TFT-LCD) industry are examined. Two learning algorithms, a back-propagation neural network and support vector regression, are employed for modeling, and two sample generation approaches, bootstrap aggregating (bagging) and the synthetic minority over-sampling technique (SMOTE), are employed to compare the accuracy of the models. The results indicate that the proposed method outperforms bagging and SMOTE with the greatest amount of statistical support.
21

Chang, Ya-Chun (張雅君). "A research on intelligent parameters searching in small dataset." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/08483726387452009983.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis, Tunghai University, Department of Industrial Engineering and Enterprise Information, academic year 98 (2009/2010).
In practice, experiments are the major methodology in the R&D stage of searching for the right parameter settings in new product development. However, the search procedure consumes a great deal of cost, time, and manpower, so a method that enhances the speed and quality of the search process would greatly benefit the product development process. This research focuses on developing a searching mechanism that works with small datasets to reach a better-quality region of parameter settings in a faster way. A goal-oriented method is developed that effectively uses information from previous experiments to limit the region explored next. This research adopts the Intervalized Kernel Density Estimation (IKDE) method to generate a virtual dataset based on the existing real small dataset, and a Support Vector Machine (SVM) is then used to find the classifier. Three improved methods were developed: 1) pure IKDE combined with SVM to construct a classifier; 2) limiting the generation of the virtual dataset while achieving an equal-quality classifier, which improves computational efficiency; and 3) using the roulette wheel method to explore the region of the virtual dataset without losing classifier quality, which showed a better convergence property. All the methods showed better quality than general random methods, and the last method converged faster than all the others.
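As a rough sketch of the pipeline, with ordinary Gaussian KDE standing in for the intervalized KDE (IKDE) of the thesis, and invented toy data throughout:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.svm import SVC

# Toy small dataset: 8 experiments, 2 process parameters, pass/fail label.
X = np.array([[1.0, 2.0], [1.2, 2.1], [0.9, 1.8], [1.1, 2.3],
              [3.0, 4.0], [3.2, 4.1], [2.9, 3.8], [3.1, 4.2]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Plain Gaussian KDE stands in for IKDE here: fit a density per class
# and draw virtual experiments from it.
X_virtual, y_virtual = [], []
for label in (0, 1):
    kde = gaussian_kde(X[y == label].T)
    samples = kde.resample(20, seed=0).T
    X_virtual.append(samples)
    y_virtual.append(np.full(20, label))

X_train = np.vstack([X] + X_virtual)
y_train = np.concatenate([y] + y_virtual)
clf = SVC(kernel="rbf").fit(X_train, y_train)  # classifier on real + virtual data
print(clf.predict([[1.05, 2.0], [3.05, 4.0]]))
```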
22

Wen, I-Hsiang (溫怡翔). "A New Data Transformation Model for Small Dataset Learning." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/81814135131034384018.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Ph.D. dissertation, National Cheng Kung University, Department of Industrial and Information Management, academic year 104 (2015/2016).
In most highly competitive manufacturing industries, the sample sizes in pilot runs are usually very small, in order to quickly launch new products. However, it is always difficult for engineers to improve quality in mass production runs based on the limited data obtained in this way. Past research has demonstrated that adding artificial samples can be an effective approach when learning with small datasets. However, a prior analysis of the data is needed to deduce the appropriate sample distributions within which the artificial samples are generated. The Johnson transformation is one of the well-known models that can be applied to bring data close to a normal distribution when certain statistical assumptions are satisfied. The sample size required for such data transformation methods is usually large, and this motivates the current study to develop a new method suitable for small datasets. Accordingly, this research proposes the Small-Johnson Data Transformation (SJDT) method to transform small raw datasets to normal distributions in order to generate virtual samples. When compared with four other methods, the results obtained with a real small dataset drawn from the Thin Film Transistor Liquid Crystal Display (TFT-LCD) industry in Taiwan demonstrate that the proposed method is able to effectively improve forecasting ability with small sample sizes.
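The classical Johnson-transformation route that the thesis adapts can be sketched as follows; the SJDT modification for small samples is not reproduced here, and the data are invented.

```python
import numpy as np
from scipy import stats

# A small, skewed raw sample (invented values).
x = np.array([1.1, 1.3, 1.4, 1.6, 1.7, 2.0, 2.6, 4.8])

# Classical route: fit a Johnson SU distribution, then map the data to
# normal scores via the probability integral transform.
params = stats.johnsonsu.fit(x)
u = stats.johnsonsu.cdf(x, *params)
z = stats.norm.ppf(np.clip(u, 1e-6, 1 - 1e-6))  # approximately N(0, 1)

# Virtual samples: draw in the normal space and map back.
z_new = np.random.default_rng(0).normal(size=5)
x_virtual = stats.johnsonsu.ppf(stats.norm.cdf(z_new), *params)
print(np.round(x_virtual, 2))
```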
23

Chiang, Yu-Chun (江裕群). "Generating fuzzy-rule based attributes to improve small dataset learning." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/49511140842879638349.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Ling, Wei-Shan (凌偉珊). "Constructing a new virtual sample generation technique for small dataset learning." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/74105744105429620295.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis, National Cheng Kung University, Department of Industrial and Information Management (in-service master's program), academic year 104 (2015/2016).
Since the rise of the Internet generation, big data has become the hottest topic, and recently small data has drawn attention as well. Small data is difficult to analyze and make predictions from, because it is hard to obtain and costly. Virtual sample generation has proved to be an effective way to solve small-data problems. The main technique is mega-trend diffusion (MTD), which defines the data range based on a uniform-distribution assumption and the skewness of the data. This study proposes a non-parametric virtual sample generation method for multi-modal populations. After data preprocessing, it captures the maximum amount of useful data by using a soft-DBSCAN clustering method, estimates the data range with the MTD algorithm, and generates virtual samples for prediction.
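A hedged sketch of mega-trend diffusion as commonly formulated in the small-data literature; the thesis's soft-DBSCAN per-mode step is omitted, and this simplified formula may differ from the thesis's exact variant.

```python
import numpy as np

def mtd_bounds(x, phi=1e-20):
    """Mega-trend-diffusion domain estimate for one attribute, following a
    commonly cited formulation; the thesis applies this per cluster/mode,
    which is omitted here."""
    x = np.asarray(x, float)
    u = (x.min() + x.max()) / 2.0          # set center
    n_l = max(np.sum(x < u), 1)            # samples left of center
    n_u = max(np.sum(x > u), 1)            # samples right of center
    skew_l, skew_u = n_l / (n_l + n_u), n_u / (n_l + n_u)
    var = x.var(ddof=1)
    lo = u - skew_l * np.sqrt(-2.0 * (var / n_l) * np.log(phi))
    hi = u + skew_u * np.sqrt(-2.0 * (var / n_u) * np.log(phi))
    return min(lo, x.min()), max(hi, x.max())

x = [4.1, 4.4, 4.6, 5.0, 5.8]
lo, hi = mtd_bounds(x)
# Virtual samples drawn uniformly inside the diffused domain.
virtual = np.random.default_rng(0).uniform(lo, hi, 10)
print(round(lo, 2), round(hi, 2))
```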
25

Mahdi, Md Safiur Rahman. "Identifying conserved microRNAs in a large dataset of wheat small RNAs." 2015. http://hdl.handle.net/1993/30677.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
MicroRNAs (miRNAs) play a vital role in regulating gene expression. Detecting conserved and novel miRNAs in very large genomic datasets generated using next generation sequencing platforms is a new research area in the field of gene regulation, but finding useful miRNA information from a large wheat genome is a challenging research project. We propose to design a toolchain that will identify conserved miRNAs using various software tools such as Basic Local Alignment Search Tool (BLAST), Bowtie 2, MAFFT and RNAfold. Our toolchain identified 36 wheat conserved miRNA families that matched with 232 experimental sequences. Moreover, we found 87 plant conserved miRNA families that matched between 613 experimental sequences and the miRBase dataset. In addition, we observed significant differential expression for the wheat exposed to the heat stress compared to those exposed to light and UV stresses or no stress (control).
October 2015
26

Wang, Bing-Min (王秉民). "Exploring neural network hyperparameters on small dataset and hand-crafted features: take credit scoring as an example." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/f3kq35.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis, National Cheng Kung University, Department of Electrical Engineering, academic year 106 (2017/2018).
Deep learning has achieved remarkable success in various fields, e.g. computer vision, natural language processing, and games, and has produced many novel techniques. These fields have large amounts of data with raw features, but numerous problems in other fields come with few data points and hand-crafted features, such as credit scoring, stock prediction, and HIV prediction. We want to explore whether deep learning techniques developed on those remarkable tasks also work in other machine learning tasks. We compared the combinations of 9 activation functions and 12 weight initializations, and found that the results from the original papers carry over to the credit scoring dataset. We further explored how regularization methods affect the results as the model gets deeper, and used the SMBO method to replace grid search and random search for hyperparameter tuning. Last, we compared the time needed to train a model between a neural network and an ensemble method (bstacking), and showed that the neural network can achieve better accuracy while using 0.27 times the training time. We showed that deep learning can still outperform a traditional machine learning method (bstacking) on a small dataset with hand-crafted features, and that we should not use smaller networks out of fear of overfitting; instead, use a big network and properly chosen regularization techniques to control overfitting. In a deep network, l2 and dropout are better choices than early stopping. From an efficiency point of view, some traditional machine learning algorithms need much more time to train than neural networks.
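A minimal sketch of the abstract's recommendation: a deliberately large network regularized with dropout (in the architecture) and l2 (via the optimizer's weight decay). Sizes, rates, and the stand-in data are invented, not the thesis's tuned values.

```python
import torch
import torch.nn as nn

# A "big" network for a small tabular, credit-scoring-style dataset, with
# the two regularizers the thesis found most useful in deep networks:
# dropout and l2 (weight_decay).  All values are illustrative.
model = nn.Sequential(
    nn.Linear(20, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(128, 20)                 # stand-in features
y = torch.randint(0, 2, (128,))          # stand-in labels
for _ in range(10):                      # a few training steps
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print(float(loss))
```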
