Journal articles on the topic 'Web data sets'

Consult the top 50 journal articles for your research on the topic 'Web data sets.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Alder, J. R., and S. W. Hostetler. "Web based visualization of large climate data sets." Environmental Modelling & Software 68 (June 2015): 175–80. http://dx.doi.org/10.1016/j.envsoft.2015.02.016.

2

G.V., Suresh, and Srinivasa Reddy E.V. "Uncertain Data Analysis with Regularized XGBoost." Webology 19, no. 1 (January 20, 2022): 3722–40. http://dx.doi.org/10.14704/web/v19i1/web19245.

Abstract:
Uncertainty is a ubiquitous element in available knowledge about the real world. Data sampling error, obsolete sources, network latency, and transmission error are all factors that contribute to this uncertainty. These kinds of uncertainty have to be handled cautiously, or else the classification results could be unreliable or even erroneous. Numerous methodologies have been developed to comprehend and control uncertainty in data. Uncertainty has many faces, i.e., inconsistency, imprecision, ambiguity, incompleteness, vagueness, unpredictability, noise, and unreliability. Missing information is inevitable in real-world data sets. While some conventional multiple imputation approaches are well studied and have shown empirical validity, they entail limitations in processing large datasets with complex data structures. In addition, these standard approaches tend to be computationally inefficient for medium and large datasets. In this paper, we propose a scalable multiple imputation framework based on XGBoost, bootstrapping and a regularized method. XGBoost, one of the fastest implementations of gradient boosted trees, is able to automatically retain interactions and non-linear relations in a dataset while achieving high computational efficiency with the aid of bootstrapping and regularized methods. In the context of high-dimensional data, this methodology provides less biased estimates and acceptable imputation variability compared with previous regression approaches. We validate our adaptive imputation approaches against standard methods on numerical and real data sets and show promising results.
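As a rough illustration of the general idea described in this abstract (not the authors' implementation), the following Python sketch imputes a numeric column by training regularized XGBoost regressors on bootstrap resamples of the observed rows; the function name, parameters and use of the public xgboost and pandas APIs are assumptions made for the example.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def impute_column(df, target, n_boot=5, seed=0):
    # Impute missing values of `target` using bootstrapped, regularized XGBoost models.
    # A full multiple-imputation workflow would keep each bootstrap's predictions as a
    # separate completed data set; here they are averaged for brevity.
    # Assumes the feature columns are numeric.
    rng = np.random.default_rng(seed)
    features = [c for c in df.columns if c != target]
    observed = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        return df.copy()
    predictions = []
    for _ in range(n_boot):
        boot = observed.sample(frac=1.0, replace=True,
                               random_state=int(rng.integers(1_000_000)))
        model = XGBRegressor(n_estimators=200, max_depth=4,
                             reg_lambda=1.0,  # L2 regularization
                             reg_alpha=0.1)   # L1 regularization
        model.fit(boot[features], boot[target])
        predictions.append(model.predict(missing[features]))
    completed = df.copy()
    completed.loc[missing.index, target] = np.mean(predictions, axis=0)
    return completed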
3

Fernández, Javier D., Miguel A. Martínez-Prieto, Pablo de la Fuente Redondo, and Claudio Gutiérrez. "Characterising RDF data sets." Journal of Information Science 44, no. 2 (January 9, 2017): 203–29. http://dx.doi.org/10.1177/0165551516677945.

Abstract:
The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
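The paper defines its own redundancy and structural metrics; purely as an illustrative sketch of the kind of basic profiling involved, the Python snippet below computes simple structural counts for an RDF data set with rdflib (the input file name is a placeholder).

from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")  # placeholder input file

subjects = set(g.subjects())
predicates = set(g.predicates())
objects = set(g.objects())

print("triples:            ", len(g))
print("distinct subjects:  ", len(subjects))
print("distinct predicates:", len(predicates))
print("distinct objects:   ", len(objects))
# Triples per distinct subject: a crude indicator of how densely subjects are
# described, related in spirit (but not identical) to the paper's metrics.
print("avg out-degree:     ", len(g) / max(len(subjects), 1))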
4

Endsley, K. A., and M. G. Billmire. "Distributed visualization of gridded geophysical data: the Carbon Data Explorer, version 0.2.3." Geoscientific Model Development 9, no. 1 (January 29, 2016): 383–92. http://dx.doi.org/10.5194/gmd-9-383-2016.

Abstract:
Due to the proliferation of geophysical models, particularly climate models, the increasing resolution of their spatiotemporal estimates of Earth system processes, and the desire to easily share results with collaborators, there is a genuine need for tools to manage, aggregate, visualize, and share data sets. We present a new, web-based software tool – the Carbon Data Explorer – that provides these capabilities for gridded geophysical data sets. While originally developed for visualizing carbon flux, this tool can accommodate any time-varying, spatially explicit scientific data set, particularly NASA Earth system science level III products. In addition, the tool's open-source licensing and web presence facilitate distributed scientific visualization, comparison with other data sets and uncertainty estimates, and data publishing and distribution.
5

Evans, William N., Helen Levy, and Kosali I. Simon. "Data Watch: Research Data in Health Economics." Journal of Economic Perspectives 14, no. 4 (November 1, 2000): 203–16. http://dx.doi.org/10.1257/jep.14.4.203.

Abstract:
In this paper, we discuss some important data sets that can be used by economists interested in conducting research in health economics. We describe six types of data sets: health components of data sets traditionally used by economists; longitudinal surveys of health and economic behavior; data on employer-provided insurance; cross-sectional surveys of households that focus on health; data on health care providers; and vital statistics. We summarize some of the leading surveys, discuss the availability of the data, identify how researchers have utilized these data and when possible, include a web address that contains more detailed information about each survey.
6

Ibrahim, Nadia, Alaa Hassan, and Marwah Nihad. "Big Data Analysis of Web Data Extraction." International Journal of Engineering & Technology 7, no. 4.37 (December 13, 2018): 168. http://dx.doi.org/10.14419/ijet.v7i4.37.24095.

Abstract:
In this study, large-scale data extraction techniques are considered; these include detecting patterns and hidden relationships between numerous factors and retrieving the required information. Rapid analysis of massive data can lead to innovation and to concepts of theoretical value. Compared with mining traditional data sets, mining the vast amount of large, heterogeneous and interdependent data has the ability to expand knowledge and ideas about the target domain. This research studied data mining on the Internet, where the various networks used to extract data from different locations can sometimes appear complex, and web technology has been used for information extraction and data analysis (Marwah et al., 2016). In this research, we extracted information from large numbers of web pages and examined the pages of a site using Java code, and we added the extracted information to a dedicated database for the web page. We used a data network function to obtain accurate results for evaluating and categorizing the pages found, which identifies trusted or risky web pages, and exported the data to a CSV file. We then examined and categorized these data using WEKA to obtain accurate results. We concluded from the results that the applied data mining algorithms outperform other techniques in the classification and extraction of data and deliver high performance.
7

De Souza, Jessica Oliveira, and Jose Eduardo Santarem Segundo. "Mapeamento de Problemas de Qualidade no Linked Data." Journal on Advances in Theoretical and Applied Informatics 1, no. 1 (October 6, 2015): 38. http://dx.doi.org/10.26729/jadi.v1i1.1043.

Abstract:
Since the Semantic Web was created in order to improve the current web user experience, Linked Data is the primary means by which Semantic Web applications are realised in full, respecting appropriate criteria and requirements. Therefore, the quality of the data and information stored in linked data sets is essential to meet the basic Semantic Web objectives. Hence, this article aims to describe and present specific quality dimensions and their related quality issues.
8

Rodriguez-Garcia, Mercedes, Antonio Balderas, and Juan Manuel Dodero. "Privacy Preservation and Analytical Utility of E-Learning Data Mashups in the Web of Data." Applied Sciences 11, no. 18 (September 13, 2021): 8506. http://dx.doi.org/10.3390/app11188506.

Abstract:
Virtual learning environments contain valuable data about students that can be correlated and analyzed to optimize learning. Modern learning environments based on data mashups that collect and integrate data from multiple sources are relevant for learning analytics systems because they provide insights into students' learning. However, data sets involved in mashups may contain personal information of a sensitive nature that raises legitimate privacy concerns. Average privacy preservation methods are based on preemptive approaches that limit the published data in a mashup based on access control and authentication schemes. Such limitations may reduce the analytical utility of the data exposed to gain students' learning insights. In order to reconcile utility and privacy preservation of published data, this research proposes a new data mashup protocol capable of merging and k-anonymizing data sets in cloud-based learning environments without jeopardizing the analytical utility of the information. The implementation of the protocol is based on linked data so that data sets involved in the mashups are semantically described, thereby enabling their combination with relevant educational data sources. The k-anonymized data sets returned by the protocol still retain essential information for supporting general data exploration and statistical analysis tasks. The analytical and empirical evaluation shows that the proposed protocol prevents individuals' sensitive information from being re-identified.
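The mashup protocol itself is not reproduced here; the sketch below only illustrates the k-anonymity property the abstract relies on, namely that every combination of quasi-identifier values occurs at least k times. The pandas usage and the column names are hypothetical.

import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    # True if every quasi-identifier combination appears at least k times.
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical e-learning records after generalising age into bands.
records = pd.DataFrame({
    "age_band":  ["18-20", "18-20", "18-20", "21-25", "21-25", "21-25"],
    "country":   ["ES",    "ES",    "ES",    "ES",    "ES",    "ES"],
    "avg_grade": [7.5,     8.1,     6.9,     9.0,     5.5,     7.2],
})
print(is_k_anonymous(records, ["age_band", "country"], k=3))  # True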
9

Xiang-Wei, Li, Zheng Gang, and Kang Yu-Xue. "A Rough Sets Based Data Preprocessing Algorithm for Web Structure Mining." Information Technology Journal 11, no. 8 (July 15, 2012): 1127–30. http://dx.doi.org/10.3923/itj.2012.1127.1130.

10

Ma, Dong Lin, and Wei Jun Gao. "An Association-Analysis Based Web Mining Preprocessing Algorithm." Applied Mechanics and Materials 121-126 (October 2011): 3642–46. http://dx.doi.org/10.4028/www.scientific.net/amm.121-126.3642.

Abstract:
Aiming to overcome the problem of redundant data in web mining, this paper proposes an association-analysis-based algorithm. Firstly, we construct the relational information system from the original data sets. Secondly, we use the attribute reduction theory of rough sets to produce the core of the information system; the core is the most important and necessary information in the original information system and cannot be reduced further, so it yields the same analytical effect as the original data sets and can be used to build a classification model. Thirdly, we construct an indiscernibility matrix from the reduced information system and, finally, obtain the classification of the original data sets. The experiments show that the proposed algorithm achieves high efficiency and avoids redundant data in follow-up processing.
11

Weichselbraun, Albert, Daniel Streiff, and Arno Scharl. "Consolidating Heterogeneous Enterprise Data for Named Entity Linking and Web Intelligence." International Journal on Artificial Intelligence Tools 24, no. 02 (April 2015): 1540008. http://dx.doi.org/10.1142/s0218213015400084.

Abstract:
Linking named entities to structured knowledge sources paves the way for state-of-the-art Web intelligence applications which assign sentiment to the correct entities, identify trends, and reveal relations between organizations, persons and products. For this purpose this paper introduces Recognyze, a named entity linking component that uses background knowledge obtained from linked data repositories, and outlines the process of transforming heterogeneous data silos within an organization into a linked enterprise data repository which draws upon popular linked open data vocabularies to foster interoperability with public data sets. The presented examples use comprehensive real-world data sets from Orell Füssli Business Information, Switzerland's largest business information provider. The linked data repository created from these data sets comprises more than nine million triples on companies, the companies' contact information, key people, products and brands. We identify the major challenges of tapping into such sources for named entity linking, and describe required data pre-processing techniques to use and integrate such data sets, with a special focus on disambiguation and ranking algorithms. Finally, we conduct a comprehensive evaluation based on business news from the New Journal of Zurich and AWP Financial News to illustrate how these techniques improve the performance of the Recognyze named entity linking component.
12

Oh, Solgil, Sujin Yoo, Yuri Kim, Jisoo Song, and Seongbin Park. "Implementation of a System That Helps Novice Users Work with Linked Data." Electronics 10, no. 11 (May 22, 2021): 1237. http://dx.doi.org/10.3390/electronics10111237.

Abstract:
On the Semantic Web, resources are connected to each other by the IRI. As the basic unit is comprised of linked data, machines can use semantic data and reason their relations without additional intervention on the Semantic Web. However, it is necessary for users who first encounter the Semantic Web to understand its underlying structure and some grammatical rules. This study suggests linking data sets of the Semantic Web through the Euler diagram, which does not require any prior knowledge. We performed a user study with our relationship-building system and verified that users could better understand linked data through the usage of the system. Users can indirectly be guided by using our Euler diagram-based data relationship-building system to understand the Semantic Web and its data linkage system. We also expect that the data sets defined through our system can be used in various applications.
13

Li, Quanzhi, Sameena Shah, Xiaomo Liu, and Armineh Nourbakhsh. "Data Sets: Word Embeddings Learned from Tweets and General Data." Proceedings of the International AAAI Conference on Web and Social Media 11, no. 1 (May 3, 2017): 428–36. http://dx.doi.org/10.1609/icwsm.v11i1.14859.

Abstract:
A word embedding is a low-dimensional, dense and real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that are different from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from just tweet data, we also built embedding sets from the general data and the combination of tweets and the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words from the general data. In this paper, we also present two experiments demonstrating how to use the data sets in some NLP tasks, such as tweet sentiment analysis and tweet topic classification tasks.
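The embeddings described above are distributed as ready-made data sets; as a hedged sketch of how tweet embeddings of this kind are typically trained, the snippet below runs gensim's Word2Vec (gensim 4 API) on a toy list of pre-tokenized tweets. The corpus and hyperparameters are placeholders, not the authors' settings.

from gensim.models import Word2Vec

# Toy pre-tokenized "tweets"; a real corpus would contain millions of them.
tweets = [
    ["great", "game", "tonight", "#nba"],
    ["traffic", "is", "terrible", "downtown"],
    ["loving", "this", "new", "phone"],
]

model = Word2Vec(
    sentences=tweets,
    vector_size=100,  # embedding dimensionality
    window=5,
    min_count=1,
    sg=1,             # skip-gram
    workers=2,
)

vector = model.wv["game"]                        # 100-dimensional vector for "game"
similar = model.wv.most_similar("game", topn=3)  # nearest neighbours in the embedding space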
14

Dzemyda, Gintautas, Virginijus Marcinkevičius, and Viktor Medvedev. "WEB APPLICATION FOR LARGE-SCALE MULTIDIMENSIONAL DATA VISUALIZATION." Mathematical Modelling and Analysis 16, no. 1 (June 24, 2011): 273–85. http://dx.doi.org/10.3846/13926292.2011.580381.

Abstract:
In this paper, we present an approach of the web application (as a service) for data mining oriented to the multidimensional data visualization. This paper focuses on visualization methods as a tool for the visual presentation of large-scale multidimensional data sets. The proposed implementation of such a web application obtains a multidimensional data set and as a result produces a visualization of this data set. It also supports different configuration parameters of the data mining methods used. Parallel computation has been used in the proposed implementation to run the algorithms simultaneously on different computers.
15

Sharma, Vagisha, Josh Eckels, Birgit Schilling, Christina Ludwig, Jacob D. Jaffe, Michael J. MacCoss, and Brendan MacLean. "Panorama Public: A Public Repository for Quantitative Data Sets Processed in Skyline." Molecular & Cellular Proteomics 17, no. 6 (February 27, 2018): 1239–44. http://dx.doi.org/10.1074/mcp.ra117.000543.

Abstract:
To address the growing need for a centralized, community resource of published results processed with Skyline, and to provide reviewers and readers immediate visual access to the data behind published conclusions, we present Panorama Public (https://panoramaweb.org/public.url), a repository of Skyline documents supporting published results. Panorama Public is built on Panorama, an open source data management system for mass spectrometry data processed with the Skyline targeted mass spectrometry environment. The Panorama web application facilitates viewing, sharing, and disseminating results contained in Skyline documents via a web-browser. Skyline users can easily upload their documents to a Panorama server and allow other researchers to explore uploaded results in the Panorama web-interface through a variety of familiar summary graphs as well as annotated views of the chromatographic peaks processed with Skyline. This makes Panorama ideal for sharing targeted, quantitative results contained in Skyline documents with collaborators, reviewers, and the larger proteomics community. The Panorama Public repository employs the full data visualization capabilities of Panorama which facilitates sharing results with reviewers during manuscript review.
16

Niestroj, M. G., D. A. McMeekin, and P. Helmholz. "INTRODUCING A FRAMEWORK FOR CONFLATING ROAD NETWORK DATA WITH SEMANTIC WEB TECHNOLOGIES." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences IV-2/W5 (May 29, 2019): 231–38. http://dx.doi.org/10.5194/isprs-annals-iv-2-w5-231-2019.

Abstract:
Road network asset management is a challenging task as many data sources with different road asset location accuracies are available. In Australia and New Zealand, transport agencies are investigating the harmonisation of road asset data, whereby two or more data sets are merged to create a new data set. Currently, identifying relations between road assets of the same meaning is not always possible, as road authorities of these countries use their own data structures and standards. This paper employs Semantic Web technologies, such as RDF/Turtle ontologies and semantic rules, to enable road network conflation (merging multiple data sets without creating a new data set) as a first step towards data harmonisation by means of information exchange, and shifts road network data from intersections and road nodes to data sets considering the accuracy of the data sets in the selected area. The data integration from GeoJSON into RDF/Turtle files is processed with Python. A geographic coordinate shifting algorithm reads unique data entries that have been extracted from RDF/Turtle into JSON-LD and saves the processed data in their original file format, so that a closed data flow can be achieved.
17

Hu, Ye, and Jürgen Bajorath. "Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer." F1000Research 3 (March 11, 2014): 69. http://dx.doi.org/10.12688/f1000research.3713.1.

Abstract:
In 2012, we reported 30 compound data sets and/or programs developed in our laboratory in a data article and made them freely available to the scientific community to support chemoinformatics and computational medicinal chemistry applications. These data sets and computational tools were provided for download from our website. Since publication of this data article, we have generated 13 new data sets with which we further extend our collection of publicly available data and tools. Due to changes in web servers and website architectures, data accessibility has recently been limited at times. Therefore, we have also transferred our data sets and tools to a public repository to ensure full and stable accessibility. To aid in data selection, we have classified the data sets according to scientific subject areas. Herein, we describe new data sets, introduce the data organization scheme, summarize the database content and provide detailed access information in ZENODO (doi: 10.5281/zenodo.8451 and doi:10.5281/zenodo.8455).
18

H.G., Mohan, Nandish M., and Devaraj F.V. "Automatic Composition of Machine Learning Models as Web Services across Data Sets." International Journal of Computer Sciences and Engineering 10, no. 2 (February 28, 2022): 7–10. http://dx.doi.org/10.26438/ijcse/v10i2.710.

19

Brady, Siobhan M., and Nicholas J. Provart. "Web-Queryable Large-Scale Data Sets for Hypothesis Generation in Plant Biology." Plant Cell 21, no. 4 (April 2009): 1034–51. http://dx.doi.org/10.1105/tpc.109.066050.

20

Garbow, Zachary A., Nicholas R. Olson, David A. Yuen, and John M. Boggs. "Interactive Web-Based Map: Applications to Large Data Sets in the Geosciences." Visual Geosciences 6, no. 3 (November 2001): 1–14. http://dx.doi.org/10.1007/s10069-001-1018-z.

21

Garbow, Zachary A., Gordon Erlebacher, David A. Yuen, John M. Boggs, and Fabien W. Dubuffet. "Web-Based interrogation of large-scale geophysical data sets from handheld devices." Visual Geosciences 8, no. 1 (May 2003): 1–20. http://dx.doi.org/10.1007/s10069-003-0007-9.

22

Makris, Christos, Yannis Panagis, Evangelos Sakkopoulos, and Athanasios Tsakalidis. "Efficient and adaptive discovery techniques of Web Services handling large data sets." Journal of Systems and Software 79, no. 4 (April 2006): 480–95. http://dx.doi.org/10.1016/j.jss.2005.06.002.

23

Dahiya, Vandna, and Sandeep Dalal. "A Scalable Approach for Data Mining – AHUIM." Webology 18, no. 1 (April 1, 2021): 92–103. http://dx.doi.org/10.14704/web/v18i1/web18029.

Abstract:
Utility itemset mining, which finds item sets based on utility factors, has established itself as an essential form of data mining. Utility is defined in terms of quantity and some interest factor. Various methods have been developed by researchers to mine these itemsets, but most of them are not scalable. At present, a scalable approach is required that can fulfill the growing needs of data mining. A novel Spark-based technique, called Absolute High Utility Itemset Mining (AHUIM), is proposed in this paper for mining data in a distributed way. The technique is suitable for small as well as large datasets, and its performance is measured on various parameters such as speed, scalability, and accuracy.
24

Yau, Ng Qi, and Wan Zainon. "UNDERSTANDING WEB TRAFFIC ACTIVITIES USING WEB MINING TECHNIQUES." International Journal of Engineering Technologies and Management Research 4, no. 9 (February 1, 2020): 18–26. http://dx.doi.org/10.29121/ijetmr.v4.i9.2017.96.

Abstract:
Web usage mining is the computational process of discovering patterns in large data sets using methods from artificial intelligence, machine learning, statistics and database systems, with the goal of extracting valuable information from the access logs of World Wide Web servers and transforming it into an understandable structure for further use. The main focus of this paper is on exploring methods that expedite the log mining process, presenting the results of that process through data visualization, and comparing data-mining algorithms. For the comparison between classification techniques, precision, recall and ROC area are the measures used to compare the algorithms. Based on this study, Naïve Bayes and Bayes Network prove to be the best-performing algorithms for this task.
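The study compares classifiers in WEKA; as a rough, non-authoritative stand-in, the Python snippet below shows the same kind of comparison (precision, recall and ROC area) with scikit-learn on synthetic data in place of real server-log features, and a decision tree stands in for WEKA's Bayesian network classifier, which has no direct scikit-learn equivalent.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Synthetic stand-in for features derived from web server logs.
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

classifiers = [("Naive Bayes", GaussianNB()),
               ("Decision tree", DecisionTreeClassifier(random_state=1))]
for name, clf in classifiers:
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    score = clf.predict_proba(X_te)[:, 1]
    print(name,
          "precision=%.3f" % precision_score(y_te, pred),
          "recall=%.3f" % recall_score(y_te, pred),
          "roc_auc=%.3f" % roc_auc_score(y_te, score))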
25

Abdulkadium, Ahmed Mahdi, Raid Abd Alreda Shekan, and Haitham Ali Hussain. "Application of Data Mining and Knowledge Discovery in Medical Databases." Webology 19, no. 1 (January 20, 2022): 4912–24. http://dx.doi.org/10.14704/web/v19i1/web19329.

Abstract:
Technical improvements in computer-based healthcare information applications and hardware are making the collection of, and access to, healthcare data more manageable, and tools now exist to analyse and examine this medical data once it has been acquired and stored. Analysis of recorded medical data may help in the identification of hidden features and patterns that could significantly increase our understanding of disease onset and treatment. Significantly, progress in information and communications technologies (ICT) has outpaced our capacity to assess, summarise, and extract insight from the data. Today, database management systems equip us with the fundamental tools for the effective storage and lookup of massive data sets, but the question of how to enable human beings to interpret and analyse such huge data remains a challenging and unsolved problem. Sophisticated methods for automated data mining and knowledge discovery are therefore required to deal with large data. In this study, a machine learning approach was employed to acquire knowledge that will aid various personnel in taking decisions that help ensure the sustainability objectives on health are achieved. Finally, current data mining methodologies, methods and deployment tools that are most helpful for healthcare services are addressed in depth.
26

Rajawat, Anand Singh, Pradeep Bedi, S. B. Goyal, Sandeep Kautish, Zhang Xihua, Hanan Aljuaid, and Ali Wagdy Mohamed. "Dark Web Data Classification Using Neural Network." Computational Intelligence and Neuroscience 2022 (March 28, 2022): 1–11. http://dx.doi.org/10.1155/2022/8393318.

Abstract:
There are several issues associated with mining Dark Web structural patterns (including much redundant and irrelevant information), which facilitates numerous types of cybercrime such as illegal trade, forums, terrorist activity, and illegal online shopping. Understanding online criminal behavior is challenging because the data are available in vast amounts. An approach is needed for learning criminal behavior and for improving labeled data for user profiling, since mining Dark Web structural patterns over multidimensional data sets gives uncertain results, and uncertain classification results make it impossible to predict user behavior. Because multidimensional data contain mixed features, they adversely influence classification, and the flood of data associated with the Dark Web has prevented appropriate solutions from being delivered where they are needed. In the research design, a Fusion NN (neural network)-S3VM model for criminal network activity prediction is proposed; the NN-S3VM combination can improve the prediction.
27

Yarlagadda, Dedeepya. "Survey on Big data Analytics." International Journal for Research in Applied Science and Engineering Technology 10, no. 9 (September 30, 2022): 79–84. http://dx.doi.org/10.22214/ijraset.2022.46566.

Abstract:
Big Data is a term used to describe data sets that are too large or complex for standard relational databases to capture, manage, and process in a timely manner. Big data has at least one of the following characteristics: high volume, high velocity, or wide variety. Artificial intelligence, the web, social media, and the Internet of Things are all accelerating the growth and complexity of data through new forms and sources of data. Sensors, computers, video, log files, transactional applications, and social media, for instance, all generate huge amounts of data in real time.
28

Buncher, Brandon, and Matias Carrasco Kind. "Probabilistic cosmic web classification using fast-generated training data." Monthly Notices of the Royal Astronomical Society 497, no. 4 (July 13, 2020): 5041–60. http://dx.doi.org/10.1093/mnras/staa2008.

Abstract:
We present a novel method of robust probabilistic cosmic web particle classification in three dimensions using a supervised machine learning algorithm. Training data were generated using a simplified ΛCDM toy model with pre-determined algorithms for generating haloes, filaments, and voids. While this framework is not constrained by physical modelling, it can be generated substantially more quickly than an N-body simulation without loss in classification accuracy. For each particle in this data set, measurements were taken of the local density field magnitude and directionality. These measurements were used to train a random forest algorithm, which was used to assign class probabilities to each particle in a ΛCDM, dark matter-only N-body simulation with 256³ particles, as well as on another toy model data set. By comparing the trends in the ROC curves and other statistical metrics of the classes assigned to particles in each data set using different feature sets, we demonstrate that the combination of measurements of the local density field magnitude and directionality enables accurate and consistent classification of halo, filament, and void particles in varied environments. We also show that this combination of training features ensures that the construction of our toy model does not affect classification. The use of a fully supervised algorithm allows greater control over the information deemed important for classification, preventing issues arising from arbitrary hyperparameters and mode collapse in deep learning models. Due to the speed of training data generation, our method is highly scalable, making it particularly suited for classifying large data sets, including observed data.
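The paper's toy-model generator and its density/directionality features are specific to the study; the sketch below only illustrates the generic step of fitting a random forest and reading off per-class probabilities with scikit-learn, using synthetic stand-in features and labels.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Stand-in training set: two features per particle (e.g. local density magnitude
# and a directionality measure) and labels 0 = void, 1 = filament, 2 = halo.
X_train = rng.normal(size=(3000, 2))
y_train = rng.integers(0, 3, size=3000)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Per-particle class probabilities, analogous to the probabilistic
# classification described in the abstract.
X_new = rng.normal(size=(5, 2))
probabilities = clf.predict_proba(X_new)  # shape (5, 3)
print(probabilities)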
29

Hettne, Kristina, Reinout van Schouwen, Eleni Mina, Eelke van der Horst, Mark Thompson, Rajaram Kaliyaperumal, Barend Mons, Erik van Mulligen, Jan A. Kors, and Marco Roos. "Explain your data by Concept Profile Analysis Web Services." F1000Research 3 (July 25, 2014): 173. http://dx.doi.org/10.12688/f1000research.4830.1.

Abstract:
The Concept Profile Analysis technology (overlapping co-occurring concept sets based on knowledge contained in biomedical abstracts) has led to new biomedical discoveries, and users have been able to interact with concept profiles through the interactive tool “Anni” (http://biosemantics.org/anni). However, Anni provides no way for users to save their procedures, results, or related provenance. Here we present a new suite of Web Service operations that allows bioinformaticians to design and execute their own Concept Profile Analysis workflow, possibly as part of a larger bioinformatics analysis. The source code can be downloaded from ZENODO at http://www.dx.doi.org/10.5281/zenodo.10963.
30

Duan, Long Zhen, Zhi Xin Zou, and Gui Fen Wang. "The Web Classification Based on ROUGH-GA-BP." Advanced Materials Research 328-330 (September 2011): 1037–40. http://dx.doi.org/10.4028/www.scientific.net/amr.328-330.1037.

Abstract:
The shortcomings of the BP algorithm are analyzed, and a text classification algorithm based on rough-GA-BP is constructed by combining the genetic algorithm with rough set theory. This algorithm reduces the dimensionality of the text input vector using a data reduction method based on rough set theory and presents a genetic algorithm approach for feature selection. Experimental results indicate that this method is more effective than traditional methods.
31

Islam, A. K. M. Saiful, and Michael Piasecki. "A generic metadata description for hydrodynamic model data." Journal of Hydroinformatics 8, no. 2 (March 1, 2006): 141–48. http://dx.doi.org/10.2166/hydro.2006.017b.

Abstract:
Sharing of data sets between numerical models is considered an important and pressing issue in the modeling community, because of (i) the time consumed to convert data sets and (ii) the need to connect different types of numerical codes to better map inter-connectedness of aquatic domains. One of the reasons for the data sharing problem arises from the lack of sufficient description of the data, or lack of metadata, which is due to the absence of a standardized framework for these metadata sets. This paper describes the development of a metadata framework for hydrodynamic data descriptions using the Geographic Information Metadata, 19115:2003 standard published by the International Standards Organization (ISO). This standard has been chosen not only because of its extent and adequacy to describe geospatial data, but also because of its widespread use and flexibility to extend the coverage. The latter is particularly important, as further extensions of the metadata standard are needed to provide a comprehensive metadata representation of hydrodynamics and their I/O data. In order to enable the community to share and reuse numerical code data sets, however, they need to be published in both human and machine understandable format. One such format is the Web Ontology language (OWL), whose syntax is compliant with the Extensible Markup Language (XML). In this paper, we present an extensive metadata profile using the available elements of ISO 19115:2003 as well as its extension rules. Based on the metadata profile, an explicit specification or ontology for the model data domain has been created using OWL. The use of OWL not only permits flexibility when extending the coverage but also to share data sets as resources across the internet as part of the Semantic Web. We demonstrate the use of the framework using a two-dimensional finite element code and its associated data sets.
32

Redmon, Rob J., Juan V. Rodriguez, Janet C. Green, Dan Ober, Gordon Wilson, Delores Knipp, Liam Kilcommons, and Robert McGuire. "Improved Polar and Geosynchronous Satellite Data Sets Available in Common Data Format at the Coordinated Data Analysis Web." Space Weather 13, no. 5 (May 2015): 254–56. http://dx.doi.org/10.1002/2015sw001176.

33

Denney, Dennis. "Web-Portal Reservoir Knowledge Base Integrates Engineering, Production, Geoscience, and Economics Data Sets." Journal of Petroleum Technology 63, no. 10 (October 1, 2011): 98–101. http://dx.doi.org/10.2118/1011-0098-jpt.

34

Rogers, Frank. "EDUCATIONAL FUZZY DATA-SETS AND DATA MINING IN A LINEAR FUZZY REAL ENVIRONMENT." Journal of Honai Math 2, no. 2 (August 8, 2019): 77–84. http://dx.doi.org/10.30862/jhm.v2i2.81.

Abstract:
Educational data mining is the process of converting raw data from educational systems to useful information that can be used by educational software developers, students, teachers, parents, and other educational researchers. Fuzzy educational datasets are datasets consisting of uncertain values. The purpose of this study is to develop and test a classification model under uncertainty unique to the modern student. This is done by developing a model of the uncertain data that come from an educational setting with Linear Fuzzy Real data. Machine learning was then used to understand students and their optimal learning environment. The ability to predict student performance is important in a web or online environment. This is true in the brick and mortar classroom as well and is especially important in rural areas where academic achievement is lower than ideal.
35

Zhang, Linlin, and Sujuan Zhang. "Research on information classification and storage in cloud computing data center based on group collaboration intelligent clustering." Web Intelligence 19, no. 1-2 (December 3, 2021): 159–68. http://dx.doi.org/10.3233/web-210464.

Abstract:
In order to overcome the long processing times and low accuracy of traditional methods, a cloud computing data center information classification and storage method based on group collaborative intelligent clustering is proposed. Cloud computing data center information is collected in real time through an information acquisition terminal and then transmitted. An optimization function for the information classification storage location is constructed using the group collaborative intelligent clustering algorithm, and the optimal solutions for all storage locations are evolved to obtain an elite set. According to the information attribute characteristics, different information is allocated to different elite sets to realize the classified storage of information in the cloud computing data center. The experimental results show that the longest information classification and storage time is only 0.6 s, the highest information loss rate is 10.0%, and the accuracy rate exceeds 80%.
36

Chen, I.-Cheng, and I.-Ching Hsu. "Open Taiwan Government data recommendation platform using DBpedia and Semantic Web based on cloud computing." International Journal of Web Information Systems 15, no. 2 (June 17, 2019): 236–54. http://dx.doi.org/10.1108/ijwis-02-2018-0015.

Abstract:
Purpose – In recent years, governments around the world are actively promoting the Open Government Data (OGD) to facilitate reusing open data and developing information applications. Currently, there are more than 35,000 data sets available on the Taiwan OGD website. However, the existing Taiwan OGD website only provides keyword queries and lacks a friendly query interface. This study aims to address these issues by defining a DBpedia cloud computing framework (DCCF) for integrating DBpedia with Semantic Web technologies into Spark cluster cloud computing environment. Design/methodology/approach – The proposed DCCF is used to develop a Taiwan OGD recommendation platform (TOGDRP) that provides a friendly query interface to automatically filter out the relevant data sets and visualize relationships between these data sets. Findings – To demonstrate the feasibility of TOGDRP, the experimental results illustrate the efficiency of the different cloud computing models, including Hadoop YARN cluster model, Spark standalone cluster model and Spark YARN cluster model. Originality/value – The novel solution proposed in this study is a hybrid approach for integrating Semantic Web technologies into Hadoop and Spark cloud computing environment to provide OGD data sets recommendation.
37

HOGO, MOFREH, MIROSLAV SNOREK, and PAWAN LINGRAS. "TEMPORAL VERSUS LATEST SNAPSHOT WEB USAGE MINING USING KOHONEN SOM AND MODIFIED KOHONEN SOM BASED ON THE PROPERTIES OF ROUGH SETS THEORY." International Journal on Artificial Intelligence Tools 13, no. 03 (September 2004): 569–91. http://dx.doi.org/10.1142/s0218213004001697.

Abstract:
Temporal Web usage mining involves the application of data mining techniques to temporal Web usage data to discover temporal usage patterns, which describe the temporal behavior of users on an Internet Web site during different time slices. Clustering and classification are two important functions in Web mining, yet classes and associations in Web mining do not necessarily have crisp boundaries. Conventional clustering techniques are therefore unsuitable for finding such clusters and associations, and conventional classification algorithms provide crisp classes, which are not suitable in real-world applications. This opens the way to using non-conventional clustering techniques, such as fuzzy and rough sets, in Web mining clustering applications. Recent research introduced an adaptation of the Kohonen SOM based on the properties of rough sets theory to find interval set clusters for users on the Internet. This paper compares latest-snapshot Web usage mining with temporal Web usage mining, and compares temporal Web usage mining using the conventional Kohonen SOM with the modified Kohonen SOM based on the properties of rough sets theory.
38

Kiziloluk, Soner, and Ahmet Bedri Ozer. "Web Pages Classification with Parliamentary Optimization Algorithm." International Journal of Software Engineering and Knowledge Engineering 27, no. 03 (April 2017): 499–513. http://dx.doi.org/10.1142/s0218194017500188.

Abstract:
In recent years, data on the Internet has grown exponentially, attaining enormous dimensions. This situation makes it difficult to obtain useful information from such data. Web mining is the process of using data mining techniques such as association rules, classification, clustering, and statistics to discover and extract information from Web documents. Optimization algorithms play an important role in such techniques. In this work, the parliamentary optimization algorithm (POA), which is one of the latest social-based metaheuristic algorithms, has been adopted for Web page classification. Two different data sets (Course and Student) were selected for experimental evaluation, and HTML tags were used as features. The data sets were tested using different classification algorithms implemented in WEKA, and the results were compared with those of the POA. The POA was found to yield promising results compared to the other algorithms. This study is the first to propose the POA for effective Web page classification.
39

Sarkar, Prakash. "Quantifying the Cosmic Web using the Shapefinder diagonistic." Proceedings of the International Astronomical Union 11, S308 (June 2014): 250–53. http://dx.doi.org/10.1017/s1743921316009960.

Abstract:
One of the most successful methods for quantifying the structures in the Cosmic Web is the Minkowski Functionals. In 3D, there are four Minkowski Functionals: Area, Volume, Integrated Mean Curvature and Integrated Gaussian Curvature. To define the Minkowski Functionals, one should define a surface. We have developed a method based on the Marching Cubes 33 algorithm to generate a surface from discrete data sets. Next we calculate the Minkowski Functionals and Shapefinders from the triangulated polyhedral surface. Applying this methodology to different data sets, we obtain interesting results related to the geometry, morphology and topology of the large-scale structure.
40

Hu, Zhongyi, Raymond Chiong, Ilung Pranata, Yukun Bao, and Yuqing Lin. "Malicious web domain identification using online credibility and performance data by considering the class imbalance issue." Industrial Management & Data Systems 119, no. 3 (April 8, 2019): 676–96. http://dx.doi.org/10.1108/imds-02-2018-0072.

Abstract:
Purpose – Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper is to investigate the use of machine learning techniques for malicious web domain identification by considering the class imbalance issue (i.e. there are more benign web domains than malicious ones). Design/methodology/approach – The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use the SMOTE for oversampling and PSO for undersampling. Findings – By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective. Practical implications – This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification. Originality/value – Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.
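The paper's integrated SMOTE-plus-PSO resampler is its own contribution and is not reproduced here; the fragment below shows only the SMOTE half with the imbalanced-learn package on a synthetic imbalanced data set, to make the class-imbalance handling concrete. All parameters are illustrative.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for an imbalanced web-domain data set
# (roughly 95% benign vs 5% malicious).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("before resampling:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after resampling: ", Counter(y_res))
# The paper additionally applies PSO-guided undersampling of the majority
# class, which is not shown in this sketch.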
41

XIE, MING. "MULTI-GRANULARITY KNOWLEDGE MINING ON THE WEB." International Journal of Software Engineering and Knowledge Engineering 22, no. 01 (February 2012): 1–16. http://dx.doi.org/10.1142/s0218194012500015.

Abstract:
We tackle the problem of knowledge mining on the Web. In this paper, we propose the MGKM algebraic system for iteratively searching document sets, and then develop an approach to extract topics on the web with the Multi-Granularity Knowledge Mining (MGKM) algorithm. The proposed approach maps the data space of the original method to a sentence vector space, improving the original DBCO algorithm. We outline the interface between our scheme and the current data Web, and show that, in contrast to existing approaches, no exponential blowup is produced by MGKM. Based on experiments with real-world data sets from 310 users at three study sites, we demonstrate that knowledge mining with the proposed approach is efficient, especially for large-scale web learning resources. According to user rating data from four learning sites over 150 days, the average increase in user ratings after the system is used reaches 25.18%.
42

Maria Brunetti, Josep, and Roberto García. "User-centered design and evaluation of overview components for semantic data exploration." Aslib Journal of Information Management 66, no. 5 (September 9, 2014): 519–36. http://dx.doi.org/10.1108/ajim-12-2013-0153.

Abstract:
Purpose – The growing volumes of semantic data available in the web result in the need for handling the information overload phenomenon. The potential of this amount of data is enormous but in most cases it is very difficult for users to visualize, explore and use this data, especially for lay-users without experience with Semantic Web technologies. The paper aims to discuss these issues. Design/methodology/approach – The Visual Information-Seeking Mantra “Overview first, zoom and filter, then details-on-demand” proposed by Shneiderman describes how data should be presented in different stages to achieve an effective exploration. The overview is the first user task when dealing with a data set. The objective is that the user is capable of getting an idea about the overall structure of the data set. Different information architecture (IA) components supporting the overview tasks have been developed, so they are automatically generated from semantic data, and evaluated with end-users. Findings – The chosen IA components are well known to web users, as they are present in most web pages: navigation bars, site maps and site indexes. The authors complement them with Treemaps, a visualization technique for displaying hierarchical data. These components have been developed following an iterative User-Centered Design methodology. Evaluations with end-users have shown that they get easily used to them despite the fact that they are generated automatically from structured data, without requiring knowledge about the underlying semantic technologies, and that the different overview components complement each other as they focus on different information search needs. Originality/value – Obtaining semantic data sets overviews cannot be easily done with the current semantic web browsers. Overviews become difficult to achieve with large heterogeneous data sets, which is typical in the Semantic Web, because traditional IA techniques do not easily scale to large data sets. There is little or no support to obtain overview information quickly and easily at the beginning of the exploration of a new data set. This can be a serious limitation when exploring a data set for the first time, especially for lay-users. The proposal is to reuse and adapt existing IA components to provide this overview to users and show that they can be generated automatically from the thesaurus and ontologies that structure semantic data while providing a comparable user experience to traditional web sites.
43

Sekhar Babu, B., P. Lakshmi Prasanna, and P. Vidyullatha. "Personalized web search on e-commerce using ontology based association mining." International Journal of Engineering & Technology 7, no. 1.1 (December 21, 2017): 286. http://dx.doi.org/10.14419/ijet.v7i1.1.9487.

Abstract:
Nowadays, the World Wide Web has grown into a familiar medium for investigating new information, business trends, trading strategies and so on. Several organizations and companies are also embracing the web in order to present their products or services across the world. E-commerce is a kind of business or sales transaction that involves the transfer of information across the web or internet. In this situation a huge amount of data is obtained and dumped into web services. This data overload gives rise to difficulties in determining accurate and valuable information; hence web data mining is used as a tool to discover and mine knowledge from the web. Web data mining technology can be applied by e-commerce organizations to offer personalized e-commerce solutions and better meet the needs of customers. A data mining algorithm such as ontology-based association rule mining using the Apriori algorithm extracts useful information from large data sets. We implement the above data mining technique in Java, and data sets are dynamically generated while transactions are processed, extracting various patterns.
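The paper implements ontology-based association mining in Java; as a simplified, non-authoritative illustration of the underlying Apriori step only, the Python snippet below mines association rules from a few hypothetical e-commerce transactions with the mlxtend package.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical e-commerce transactions (items bought together).
transactions = [
    ["laptop", "mouse", "laptop_bag"],
    ["laptop", "mouse"],
    ["phone", "phone_case"],
    ["laptop", "laptop_bag"],
    ["phone", "phone_case", "charger"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])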
44

Chisholm, Andrew, and Ben Hachey. "Entity Disambiguation with Web Links." Transactions of the Association for Computational Linguistics 3 (December 2015): 145–56. http://dx.doi.org/10.1162/tacl_a_00129.

Abstract:
Entity disambiguation with Wikipedia relies on structured information from redirect pages, article text, inter-article links, and categories. We explore whether web links can replace a curated encyclopaedia, obtaining entity prior, name, context, and coherence models from a corpus of web pages with links to Wikipedia. Experiments compare web link models to Wikipedia models on well-known CoNLL and TAC data sets. Results show that using 34 million web links approaches Wikipedia performance. Combining web link and Wikipedia models produces the best-known disambiguation accuracy of 88.7 on standard newswire test data.
45

Quiroz, Andres, Eric Huang, and Luca Ceriani. "A Robust and Extensible Tool for Data Integration Using Data Type Models." Proceedings of the AAAI Conference on Artificial Intelligence 29, no. 2 (January 25, 2015): 3993–98. http://dx.doi.org/10.1609/aaai.v29i2.19060.

Abstract:
Integrating heterogeneous data sets has been a significant barrier to many analytics tasks, due to the variety in structure and level of cleanliness of raw data sets requiring one-off ETL code. We propose HiperFuse, which significantly automates the data integration process by providing a declarative interface, robust type inference, extensible domain-specific data models, and a data integration planner which optimizes for plan completion time. The proposed tool is designed for schema-less data querying, code reuse within specific domains, and robustness in the face of messy unstructured data. To demonstrate the tool and its reference implementation, we show the requirements and execution steps for a use case in which IP addresses from a web clickstream log are joined with census data to obtain average income for particular site visitors (IPs), and offer preliminary performance results and qualitative comparisons to existing data integration and ETL tools.
46

Comander, J. "Argus---A New Database System for Web-Based Analysis of Multiple Microarray Data Sets." Genome Research 11, no. 9 (September 1, 2001): 1603–10. http://dx.doi.org/10.1101/gr.186601.

47

Claverie, Jean-Michel, and Thi Ngan Ta. "ACDtool: a web-server for the generic analysis of large data sets of counts." Bioinformatics 35, no. 1 (July 18, 2018): 170–71. http://dx.doi.org/10.1093/bioinformatics/bty640.

48

Chen, Shang-Yang, Gaurav Gadhvi, and Deborah R. Winter. "MAGNET: A web-based application for gene set enrichment analysis using macrophage data sets." PLOS ONE 18, no. 1 (January 11, 2023): e0272166. http://dx.doi.org/10.1371/journal.pone.0272166.

Abstract:
Characterization of gene lists obtained from high-throughput genomic experiments is an essential task to uncover the underlying biological insights. A common strategy is to perform enrichment analyses that utilize standardized biological annotations, such as GO and KEGG pathways, which attempt to encompass all domains of biology. However, this approach provides generalized, static results that may fail to capture subtleties associated with research questions within a specific domain. Thus, there is a need for an application that can provide precise, relevant results by leveraging the latest research. We have therefore developed an interactive web application, Macrophage Annotation of Gene Network Enrichment Tool (MAGNET), for performing enrichment analyses on gene sets that are specifically relevant to macrophages. Using the hypergeometric distribution, MAGNET assesses the significance of overlapping genes with annotations that were curated from published manuscripts and data repositories. We implemented numerous features that enhance utility and user-friendliness, such as the simultaneous testing of multiple gene sets, different visualization options, option to upload custom datasets, and downloadable outputs. Here, we use three example studies compared against our current database of ten publications on mouse macrophages to demonstrate that MAGNET provides relevant and unique results that complement conventional enrichment analysis tools. Although specific to macrophage datasets, we envision MAGNET will catalyze developments of similar applications in other domains of interest. MAGNET can be freely accessed at the URL https://magnet-winterlab.herokuapp.com. Website implemented in Python and PostgreSQL, with all major browsers supported. The source code is available at https://github.com/sychen9584/MAGNET.
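MAGNET's curated annotations and web interface are specific to the tool; the snippet below merely illustrates the hypergeometric overlap test mentioned in the abstract, using scipy with made-up numbers.

from scipy.stats import hypergeom

# Hypothetical enrichment test: a query list of 150 genes shares 25 genes with
# a curated macrophage gene set of 400 genes, out of a background of 20,000.
universe = 20000   # M: total genes in the background
gene_set = 400     # n: genes carrying the curated annotation
query = 150        # N: genes in the uploaded list
overlap = 25       # k: genes in both

# P(overlap >= k) under the hypergeometric null distribution.
p_value = hypergeom.sf(overlap - 1, universe, gene_set, query)
print(f"enrichment p-value: {p_value:.3e}")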
49

Barbosa, Luciano. "Learning representations of Web entities for entity resolution." International Journal of Web Information Systems 15, no. 3 (August 19, 2019): 346–58. http://dx.doi.org/10.1108/ijwis-07-2018-0059.

Abstract:
Purpose – Matching instances of the same entity, a task known as entity resolution, is a key step in the process of data integration. This paper aims to propose a deep learning network that learns different representations of Web entities for entity resolution. Design/methodology/approach – To match Web entities, the proposed network learns the following representations of entities: embeddings, which are vector representations of the words in the entities in a low-dimensional space; convolutional vectors from a convolutional layer, which capture short-distance patterns in word sequences in the entities; and bag-of-word vectors, created by a bow layer that learns weights for words in the vocabulary based on the task at hand. Given a pair of entities, the similarity between their learned representations is used as a feature to a binary classifier that identifies a possible match. In addition to those features, the classifier also uses a modification of inverse document frequency for pairs, which identifies discriminative words in pairs of entities. Findings – The proposed approach was evaluated in two commercial and two academic entity resolution benchmarking data sets. The results have shown that the proposed strategy outperforms previous approaches in the commercial data sets, which are more challenging, and have similar results to its competitors in the academic data sets. Originality/value – No previous work has used a single deep learning framework to learn different representations of Web entities for entity resolution.
50

Fosci, Paolo, and Giuseppe Psaila. "Towards Flexible Retrieval, Integration and Analysis of JSON Data Sets through Fuzzy Sets: A Case Study." Information 12, no. 7 (June 22, 2021): 258. http://dx.doi.org/10.3390/info12070258.

Abstract:
How to exploit the incredible variety of JSON data sets currently available on the Internet, for example, on Open Data portals? The traditional approach would require getting them from the portals, then storing them into some JSON document store and integrating them within the document store. However, once data are integrated, the lack of a query language that provides flexible querying capabilities could prevent analysts from successfully completing their analysis. In this paper, we show how the J-CO Framework, a novel framework that we developed at the University of Bergamo (Italy) to manage large collections of JSON documents, is a unique and innovative tool that provides analysts with querying capabilities based on fuzzy sets over JSON data sets. Its query language, called J-CO-QL, is continuously evolving to increase potential applications; the most recent extensions give analysts the capability to retrieve data sets directly from web portals as well as constructs to apply fuzzy set theory to JSON documents and to provide analysts with the capability to perform imprecise queries on documents by means of flexible soft conditions. This paper presents a practical case study in which real data sets are retrieved, integrated and analyzed to effectively show the unique and innovative capabilities of the J-CO Framework.
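J-CO-QL is the framework's own query language and is not shown here; purely as a sketch of the underlying idea of soft (fuzzy) conditions over JSON documents, the Python snippet below attaches a trapezoidal membership degree to each document and filters by an alpha-cut. The field names and thresholds are invented.

import json

def membership_hot(temperature):
    # Trapezoidal membership for "high temperature": 0 below 20, 1 above 30.
    if temperature <= 20:
        return 0.0
    if temperature >= 30:
        return 1.0
    return (temperature - 20) / 10.0

documents = json.loads(
    '[{"city": "Rome", "temp": 28}, {"city": "Oslo", "temp": 12}, {"city": "Athens", "temp": 33}]'
)
for doc in documents:
    doc["hot_degree"] = membership_hot(doc["temp"])

# Soft selection: keep documents whose membership exceeds an alpha-cut of 0.6.
hot_cities = [doc for doc in documents if doc["hot_degree"] > 0.6]
print(hot_cities)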