Academic literature on the topic 'Data / knowledge partitioning and distribution'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Data / knowledge partitioning and distribution.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Data / knowledge partitioning and distribution"

1

Rota, Jadranka, Tobias Malm, Nicolas Chazot, Carlos Peña, and Niklas Wahlberg. "A simple method for data partitioning based on relative evolutionary rates." PeerJ 6 (August 28, 2018): e5498. http://dx.doi.org/10.7717/peerj.5498.

Abstract:
Background: Multiple studies have demonstrated that partitioning of molecular datasets is important in model-based phylogenetic analyses. Commonly, partitioning is done a priori based on some known properties of sequence evolution, e.g. differences in rate of evolution among codon positions of a protein-coding gene. Here we propose a new method for data partitioning based on relative evolutionary rates of the sites in the alignment of the dataset being analysed. The rates are inferred using the previously published Tree Independent Generation of Evolutionary Rates (TIGER), and the partitioning is conducted using our novel Python script RatePartitions. We conducted simulations to assess the performance of our new method, and we applied it to eight published multi-locus phylogenetic datasets, representing different taxonomic ranks within the insect order Lepidoptera (butterflies and moths), and one phylogenomic dataset, which included ultra-conserved elements as well as introns.

Methods: We used TIGER rates to generate relative evolutionary rates for all sites in the alignments. Then, using RatePartitions, we divided the data into partitions based on their relative evolutionary rate. RatePartitions applies a simple formula that ensures that sites are distributed into partitions following the distribution of rates of the characters from the full dataset. This ensures that the invariable sites are placed in a partition with slowly evolving sites, avoiding the pitfalls of previously used methods, such as k-means. Different partitioning strategies were evaluated using BIC scores as calculated by PartitionFinder.

Results: Simulations did not highlight any misbehaviour of our partitioning approach, even under difficult parameter conditions or missing data. In all eight phylogenetic datasets, partitioning using TIGER rates and RatePartitions was significantly better, as measured by the BIC scores, than other partitioning strategies, such as the commonly used partitioning by gene and codon position. We compared the resulting topologies and node support for these eight datasets as well as for the phylogenomic dataset.

Discussion: We developed a new method of partitioning phylogenetic datasets without using any prior knowledge (e.g. DNA sequence evolution). This method is entirely based on the properties of the data being analysed and can be applied to DNA sequences (protein-coding, introns, ultra-conserved elements), protein sequences, as well as morphological characters. A likely explanation for why our method performs better than other tested partitioning strategies is that it accounts for the heterogeneity in the data to a much greater extent than when data are simply subdivided based on prior knowledge.
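The core idea is simple enough to sketch. Below is a minimal, illustrative Python version of rate-based site partitioning, assuming per-site rates have already been computed (e.g. by TIGER); the quantile-based binning here is a stand-in for illustration, not the actual RatePartitions formula.

```python
import numpy as np

def partition_by_rate(site_rates, n_partitions=4):
    """Group alignment sites into partitions by relative evolutionary rate.

    site_rates: 1-D array of per-site rates (e.g. TIGER output).
    Returns one index array per partition, ordered from the
    slowest-evolving sites to the fastest; invariant (near-zero rate)
    sites fall naturally into the slowest bin.
    """
    rates = np.asarray(site_rates, dtype=float)
    # Cut points follow the empirical rate distribution, so the binning
    # adapts to where the rates of the full dataset actually concentrate.
    edges = np.quantile(rates, np.linspace(0.0, 1.0, n_partitions + 1))
    partitions = []
    for k in range(n_partitions):
        lo, hi = edges[k], edges[k + 1]
        if k == n_partitions - 1:
            mask = (rates >= lo) & (rates <= hi)   # last bin is closed
        else:
            mask = (rates >= lo) & (rates < hi)
        partitions.append(np.flatnonzero(mask))
    return partitions

# Example: 1,000 sites with gamma-distributed rates, four partitions.
rng = np.random.default_rng(42)
for i, part in enumerate(partition_by_rate(rng.gamma(0.5, 1.0, 1000)), start=1):
    print(f"partition {i}: {part.size} sites")
```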
2

Shaikh, M. Bilal, M. Abdul Rehman, and Attaullah Sahito. "Optimizing Distributed Machine Learning for Large Scale EEG Data Set." Sukkur IBA Journal of Computing and Mathematical Sciences 1, no. 1 (2017): 114. http://dx.doi.org/10.30537/sjcms.v1i1.14.

Abstract:
Distributed Machine Learning (DML) has gained importance more than ever in this era of Big Data. There are many challenges in scaling machine learning techniques on distributed platforms. When it comes to scalability, improving processor technology for high-level computation of data is at its limit; however, increasing the number of machine nodes and distributing data along with computation looks like a viable solution. Different frameworks and platforms are available to solve DML problems. These platforms provide automated random distribution of datasets, which misses the power of user-defined intelligent data partitioning based on domain knowledge. We have conducted an empirical study which uses an EEG data set collected through the P300 Speller component of an ERP (Event Related Potential), which is widely used in BCI problems; it helps in translating the intention of a subject while performing any cognitive task. EEG data contain noise due to waves generated by other activities in the brain, which contaminates the true P300 Speller signal. Machine learning techniques could help in detecting errors made by the P300 Speller. We solve this classification problem by partitioning data into different chunks and preparing distributed models using an Elastic CV Classifier. To present a case of optimizing distributed machine learning, we propose an intelligent user-defined data partitioning approach that can improve the average accuracy of distributed machine learners. Our results show a better average AUC compared to the average AUC obtained after applying random data partitioning, which gives the user no control over data partitioning. The domain-specific intelligent partitioning by the user improves the average accuracy of the distributed learner. Our customized approach achieves 0.66 AUC on individual sessions and 0.75 AUC on mixed sessions, whereas random / uncontrolled data distribution records 0.63 AUC.
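As a rough illustration of the comparison the authors describe, the sketch below contrasts session-based (domain-informed) and random chunking, with one model trained per chunk as a stand-in for one distributed worker per partition. It uses synthetic data and scikit-learn's LogisticRegression in place of the paper's Elastic CV Classifier, so all names, shapes, and numbers are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def mean_worker_auc(X, y, chunks, X_test, y_test):
    """Train one model per data chunk (one worker per partition) and
    report the workers' average AUC on a common held-out test set."""
    aucs = []
    for c in np.unique(chunks):
        model = LogisticRegression(max_iter=1000).fit(X[chunks == c], y[chunks == c])
        aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    return float(np.mean(aucs))

# Hypothetical EEG features: 800 training trials x 32 features from 4
# recording sessions, plus 200 held-out trials; replace with real
# P300 Speller features in practice.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(800, 32)), rng.integers(0, 2, 800)
X_test, y_test = rng.normal(size=(200, 32)), rng.integers(0, 2, 200)

sessions = np.repeat(np.arange(4), 200)      # domain-informed: sessions kept intact
random_chunks = rng.permutation(sessions)    # baseline: random, uncontrolled split

print("session-based partitioning, mean AUC:",
      mean_worker_auc(X, y, sessions, X_test, y_test))
print("random partitioning, mean AUC:",
      mean_worker_auc(X, y, random_chunks, X_test, y_test))
```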
3

Liu, Richen, Liming Shen, Xueyi Chen, et al. "Sketch-Based Slice Interpretative Visualization for Stratigraphic Data." Journal of Imaging Science and Technology 63, no. 6 (2019): 60505–1. http://dx.doi.org/10.2352/j.imagingsci.technol.2019.63.6.060505.

Abstract:
In this article, the authors propose a stratigraphic slice interpretative visualization system, namely the slice analyzer. It enables domain experts, i.e., geologists and oil/gas exploration experts, to interactively interpret slices with domain knowledge, which helps them gain a better understanding of stratigraphic structures and the distribution of geological materials, e.g., underground flow path (UFP), river delta, floodplain, slump fan, etc. In addition to some domain-specific slice edit manipulations, a sketch-based sub-region partitioning approach is presented to help users divide a slice into individual sub-regions with homologous characteristics according to their domain knowledge. Consequently, the geological materials they are interested in can be extracted automatically and visualized by the proposed geological symbol definition algorithm. Feedback from domain experts suggests that the proposed system is capable of interpreting stratigraphic slices, in comparison with their currently used tools.
4

Zhu, Zichen, Xiao Hu, and Manos Athanassoulis. "NOCAP: Near-Optimal Correlation-Aware Partitioning Joins." Proceedings of the ACM on Management of Data 1, no. 4 (2023): 1–27. http://dx.doi.org/10.1145/3626739.

Abstract:
Storage-based joins are still commonly used today because the memory budget does not always scale with the data size. Among the many join algorithms developed, one that has been widely deployed and proven efficient is the Hybrid Hash Join (HHJ), which is designed to exploit any available memory to maximize the data that is joined directly in memory. However, HHJ cannot fully exploit detailed knowledge of the join attribute correlation distribution. In this paper, we show that given a correlation skew in the join attributes, HHJ partitions data in a suboptimal way. To show this, we derive the optimal partitioning using a new cost-based analysis of partitioning-based joins that is tailored for primary key–foreign key (PK-FK) joins, one of the most common join types. This optimal partitioning strategy has a high memory cost; thus, we further derive an approximate algorithm that has tunable memory cost and leads to near-optimal results. Our algorithm, termed NOCAP (Near-Optimal Correlation-Aware Partitioning) join, outperforms the state of the art for skewed correlations by up to 30%, and the textbook Grace Hash Join by up to 4×. Further, for a limited memory budget, NOCAP outperforms HHJ by up to 10%, even for uniform correlation. Overall, NOCAP dominates state-of-the-art algorithms and mimics the best algorithm for a memory budget varying from below √||relation|| to more than ||relation||.
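For context, here is a minimal Python sketch of the textbook hash-partitioning (Grace-style) join that NOCAP is benchmarked against; the correlation-aware partitioning itself is the paper's contribution and is not reproduced here. Relation layout and key positions are hypothetical.

```python
from collections import defaultdict

def grace_hash_join(R, S, key_r=0, key_s=0, n_parts=8):
    """Textbook hash-partitioning (Grace) join of relations R and S.

    Phase 1 partitions both inputs by the same hash so matching keys
    land in the same bucket; phase 2 joins each bucket pair with an
    in-memory hash table built on the R bucket.
    """
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in R:
        r_parts[hash(t[key_r]) % n_parts].append(t)
    for t in S:
        s_parts[hash(t[key_s]) % n_parts].append(t)

    out = []
    for p in range(n_parts):
        table = defaultdict(list)
        for t in r_parts[p]:                 # build phase
            table[t[key_r]].append(t)
        for t in s_parts[p]:                 # probe phase
            for match in table[t[key_s]]:
                out.append(match + t)
    return out

# PK-FK example: customers (primary key) joined with orders (foreign key).
customers = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]
print(grace_hash_join(customers, orders))    # 3 joined tuples
```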
5

Sineglazov, Victor, Olena Chumachenko, and Eduard Heilyk. "Semi-controlled Learning in Information Processing Problems." Electronics and Control Systems 4, no. 70 (2022): 37–43. http://dx.doi.org/10.18372/1990-5548.70.16754.

Abstract:
The article substantiates the need for further research into known machine learning methods and the development of new ones in the area of semi-supervised learning. It is shown that knowledge of the probability density of the initial data, obtained using unlabeled data, should carry information useful for deriving the conditional distribution of labels given the input data. If this is not the case, semi-supervised learning will not provide any improvement over supervised learning. It may even happen that the use of unlabeled data reduces the accuracy of the prediction. For semi-supervised learning to work, certain assumptions must hold, namely: the semi-supervised smoothness assumption, the clustering assumption (low-density partitioning), and the manifold assumption. A new hybrid semi-supervised learning algorithm using the label propagation method has been developed. An example of using the proposed algorithm is given.
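A minimal example of the label propagation building block the abstract names, using scikit-learn's LabelPropagation (the paper's hybrid algorithm itself is not public, so only the standard component is shown). The two-moons data illustrate a case where the cluster/low-density assumption holds.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

# Two-moons data: the classes are separated by a low-density region,
# which is exactly the setting where semi-supervised learning helps.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# Hide all but 10 labels; scikit-learn marks unlabeled points with -1.
y = np.full(len(y_true), -1)
labeled = np.random.default_rng(0).choice(len(y_true), size=10, replace=False)
y[labeled] = y_true[labeled]

model = LabelPropagation(kernel="rbf", gamma=20).fit(X, y)
accuracy = (model.transduction_ == y_true).mean()
print(f"transductive accuracy with 10/200 labels: {accuracy:.2f}")
```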
6

Sirbiladze, Gia, Bidzina Matsaberidze, Bezhan Ghvaberidze, Bidzina Midodashvili, and David Mikadze. "Fuzzy TOPSIS based selection index in the planning of emergency service facilities locations and goods transportation." Journal of Intelligent & Fuzzy Systems 41, no. 1 (2021): 1949–62. http://dx.doi.org/10.3233/jifs-210636.

Abstract:
The attributes influencing the decision-making process in planning the transportation of goods from selected facility locations in disaster zones are considered. Experts evaluate each candidate for humanitarian aid distribution centers (HADCs) (service centers) against each uncertainty factor in q-rung orthopair fuzzy sets (q-ROFS). To represent experts' knowledge in the input data for planning emergency service facility locations, a q-rung orthopair fuzzy TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) approach is developed. Based on the proposed fuzzy TOPSIS aggregation, a new objective function is introduced which maximizes a candidate HADC's selection index and reduces the risks of opening HADCs in disaster zones. The HADC location and goods transportation problem is reduced to the bi-criteria problem of partitioning the set of customers by the set of service centers: 1) minimization of the total costs of opened HADCs and goods transportation; 2) maximization of the HADC selection index. Partitioning-type transportation constraints are also constructed. Our approach to solving the constructed bi-criteria partitioning problem consists of two phases. In the first phase, based on the covering matrix, we generate a new matrix whose columns allow us to find all possible partitionings of the demand points by the opened HADCs. In the second phase, using the generated matrix and our exact algorithm, we find the partitionings (allocations of demand points to the opened centers) corresponding to the Pareto-optimal solutions. The constructed model is illustrated with a numerical example.
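As background, the sketch below computes the classic crisp TOPSIS closeness index; the paper's q-rung orthopair fuzzy extension replaces these crisp scores with q-ROFS aggregations, which is beyond a short example. The weights, criteria, and candidate scores are hypothetical.

```python
import numpy as np

def topsis_scores(decision_matrix, weights, benefit=None):
    """Classic (crisp) TOPSIS: closeness of each alternative to the
    ideal solution.  Rows = candidate centers, columns = criteria."""
    M = np.asarray(decision_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    benefit = np.ones(M.shape[1], bool) if benefit is None else np.asarray(benefit)

    V = w * M / np.linalg.norm(M, axis=0)        # weighted normalized matrix
    ideal = np.where(benefit, V.max(0), V.min(0))
    anti = np.where(benefit, V.min(0), V.max(0))
    d_pos = np.linalg.norm(V - ideal, axis=1)    # distance to ideal
    d_neg = np.linalg.norm(V - anti, axis=1)     # distance to anti-ideal
    return d_neg / (d_pos + d_neg)               # 1 = ideal, 0 = anti-ideal

# Hypothetical: 3 candidate HADCs scored on 3 criteria (the last is a cost).
scores = topsis_scores([[7, 9, 4], [8, 7, 6], [6, 8, 3]],
                       weights=[0.5, 0.3, 0.2],
                       benefit=[True, True, False])
print(scores.round(3))   # higher closeness = better selection index
```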
7

Smith, Bruce R., Christophe M. Herbinger, and Heather R. Merry. "Accurate Partition of Individuals Into Full-Sib Families From Genetic Data Without Parental Information." Genetics 158, no. 3 (2001): 1329–38. http://dx.doi.org/10.1093/genetics/158.3.1329.

Abstract:
Two Markov chain Monte Carlo algorithms are proposed that allow the partitioning of individuals into full-sib groups using single-locus genetic marker data when no parental information is available. These algorithms present a method of moving through the sibship configuration space and locating the configuration that maximizes an overall score on the basis of pairwise likelihood ratios of being full-sib or unrelated or maximizes the full joint likelihood of the proposed family structure. Using these methods, up to 757 out of 759 Atlantic salmon were correctly classified into 12 full-sib families of unequal size using four microsatellite markers. Large-scale simulations were performed to assess the sensitivity of the procedures to the number of loci and number of alleles per locus, the allelic distribution type, the distribution of families, and the independent knowledge of population allelic frequencies. The number of loci and the number of alleles per locus had the most impact on accuracy. Very good accuracy can be obtained with as few as four loci when they have at least eight alleles. Accuracy decreases when using allelic frequencies estimated in small target samples with skewed family distributions with the pairwise likelihood approach. We present an iterative approach that partly corrects that problem. The full likelihood approach is less sensitive to the precision of allelic frequencies estimates but did not perform as well with the large data set or when little information was available (e.g., four loci with four alleles).
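A toy sketch of the underlying idea: search over partitions for one that maximizes a sum of pairwise full-sib scores. The paper uses Markov chain Monte Carlo with genetic likelihood ratios; here a greedy hill-climb and a hand-made score matrix stand in for both, so this is illustrative only.

```python
import numpy as np

def pair_gain(score, group, i, g):
    """Summed pairwise score between individual i and the members of group g."""
    mask = (group == g)
    mask[i] = False                           # exclude i itself
    return score[i, mask].sum()

def greedy_sibship(score, n_passes=50):
    """Greedy hill-climb over sibship partitions.

    score[i, j] is a pairwise score for i and j being full sibs rather
    than unrelated (assumed precomputed from marker data).  Each pass
    tries moving every individual to whichever group, existing or new,
    maximizes its summed within-group score; passes repeat to convergence.
    """
    group = np.arange(score.shape[0])         # start: every individual alone
    for _ in range(n_passes):
        changed = False
        for i in range(len(group)):
            options = set(group.tolist()) | {int(group.max()) + 1}
            best = max(options, key=lambda g: pair_gain(score, group, i, g))
            if pair_gain(score, group, i, best) > pair_gain(score, group, i, group[i]):
                group[i] = best
                changed = True
        if not changed:
            break
    return group

# Toy data: two true families of three; within-family pairs score +1, others -1.
fam = np.array([0, 0, 0, 1, 1, 1])
S = np.where(np.equal.outer(fam, fam), 1.0, -1.0)
print(greedy_sibship(S))   # individuals 0-2 and 3-5 end up in the same groups
```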
8

Grard, Aline, and Jean-François Deliège. "Characterizing Trace Metal Contamination and Partitioning in the Rivers and Sediments of Western Europe Watersheds." Hydrology 10, no. 2 (2023): 51. http://dx.doi.org/10.3390/hydrology10020051.

Abstract:
Adsorption and desorption processes occurring on suspended and bed sediments were studied in two datasets from western European watersheds (Meuse and Mosel). Copper and zinc dissolved and total concentrations, total suspended sediment concentrations, mass concentrations, and grain sizes were analyzed. Four classes of mineral particle size were determined. Grain size distribution had to be considered in order to assess the trace metal particulate phase in the water column. The partitioning coefficients of trace metals between the dissolved and particulate phases were calculated. The objective of this study was to improve the description of the processes involved in the transport and fate of trace metals in river aquatic ecosystems. The study provides useful data for future modelling, management, and contamination assessment of river sediments. As confirmed by a literature review, the copper and zinc partitioning coefficients calculated in this study are reliable. The knowledge related to copper and zinc (e.g., partitioning coefficients) will allow us to begin investigations into environmental modelling, which will make it possible to consider new sorption processes and to better describe trace metal and sediment fates, as well as pressure-impact relationships.
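For readers unfamiliar with the quantity, the partitioning coefficient referred to here is conventionally Kd = (metal sorbed per kg of suspended sediment) / (dissolved concentration). Below is a minimal worked computation under that standard definition, with hypothetical numbers; the paper's per-grain-size treatment is more detailed than this.

```python
import math

def partition_coefficient(total_ug_per_l, dissolved_ug_per_l, tss_mg_per_l):
    """Conventional Kd (L/kg): metal content of suspended solids (ug/kg)
    divided by the dissolved concentration (ug/L)."""
    particulate_ug_per_l = total_ug_per_l - dissolved_ug_per_l
    tss_kg_per_l = tss_mg_per_l * 1e-6            # mg/L -> kg/L
    sorbed_ug_per_kg = particulate_ug_per_l / tss_kg_per_l
    return sorbed_ug_per_kg / dissolved_ug_per_l

# Hypothetical sample: 5.0 ug/L total Cu, 2.0 ug/L dissolved, 30 mg/L suspended solids.
kd = partition_coefficient(5.0, 2.0, 30.0)
print(f"Kd = {kd:.3g} L/kg, log10 Kd = {math.log10(kd):.2f}")   # ~5e4 L/kg, 4.70
```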
9

McDonald, H. Gregory. "Yukon to the Yucatan: Habitat partitioning in North American Late Pleistocene ground sloths (Xenarthra, Pilosa)." Journal of Palaeosciences 70, no. 1–2 (2021): 237–52. http://dx.doi.org/10.54991/jop.2021.17.

Abstract:
The late Pleistocene mammalian fauna of North America included seven genera of ground sloth, representing four families. This cohort of megaherbivores had an extensive geographic range in North America, from the Yukon in Canada to the Yucatan Peninsula in Mexico, and inhabited a variety of biomes. Within this latitudinal range there are taxa with a distribution limited to temperate latitudes, while others have a distribution restricted to tropical latitudes. Some taxa are better documented than others, and more is known about their palaeoecology and habitat preferences, while our knowledge of the palaeoecology of more recently discovered taxa remains limited. More information is needed to better understand what aspects of their palaeoecology allowed their dispersal from South America and their long-term success in North America, and ultimately the underlying causes of their extinction at the end of the Pleistocene. A summary overview of the differences in the palaeoecology of the late Pleistocene sloths in North America and their preferred habitats is presented, based on different data sources.
10

Dalton, Lori A., and Mohammadmahdi R. Yousefi. "Data Requirements for Model-Based Cancer Prognosis Prediction." Cancer Informatics 14s5 (January 2015): CIN.S30801. http://dx.doi.org/10.4137/cin.s30801.

Abstract:
Cancer prognosis prediction is typically carried out without integrating scientific knowledge available on genomic pathways, the effect of drugs on cell dynamics, or modeling mutations in the population. Recent work addresses some of these problems by formulating an uncertainty class of Boolean regulatory models for abnormal gene regulation, assigning prognosis scores to each network based on intervention outcomes, and partitioning networks in the uncertainty class into prognosis classes based on these scores. For a new patient, the probability distribution of the prognosis class was evaluated using optimal Bayesian classification, given patient data. It was assumed that (1) disease is the result of several mutations of a known healthy network, and that these mutations and their probability distribution in the population are known, and (2) only a single snapshot of the patient's gene activity profile is observed. It was shown that, even in ideal settings where cancer in the population and the effect of a drug are fully modeled, a single static measurement is typically not sufficient. Here, we study what measurements are sufficient to predict prognosis. In particular, we relax assumption (1) by addressing how population data may be used to estimate network probabilities, and extend assumption (2) to include static and time-series measurements of both population and patient data. Furthermore, we extend the prediction of prognosis classes to optimal Bayesian regression of prognosis metrics. Even though time-series data are preferable for inferring a stochastic dynamical network, we show that static data can be superior for prognosis prediction when constrained to small samples. Furthermore, although population data are helpful, performance is not sensitive to inaccuracies in the estimated network probabilities.
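A minimal numeric sketch of the classification step described above: the posterior probability of each prognosis class given one observed snapshot, marginalizing over the uncertainty class of networks. All distributions here are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def prognosis_posterior(likelihoods, network_prior, prognosis_class):
    """P(prognosis class | observed snapshot), marginalizing over the
    uncertainty class of networks.

    likelihoods[k]     = P(snapshot | network k)
    network_prior[k]   = P(network k) in the population
    prognosis_class[k] = class assigned to network k by its prognosis score
    """
    joint = likelihoods * network_prior                # P(snapshot, network k)
    post = np.zeros(prognosis_class.max() + 1)
    np.add.at(post, prognosis_class, joint)            # sum joints per class
    return post / post.sum()

# Hypothetical: 5 candidate networks grouped into 3 prognosis classes.
lik = np.array([0.02, 0.10, 0.05, 0.40, 0.01])         # P(snapshot | network)
prior = np.array([0.30, 0.20, 0.20, 0.20, 0.10])       # network probabilities
classes = np.array([0, 0, 1, 1, 2])                    # class label per network
print(prognosis_posterior(lik, prior, classes).round(3))   # -> [0.222 0.769 0.009]
```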