Journal articles on the topic 'STATISTICAL FEATURE RANKING'

1

MANSOORI, EGHBAL G. "USING STATISTICAL MEASURES FOR FEATURE RANKING." International Journal of Pattern Recognition and Artificial Intelligence 27, no. 01 (February 2013): 1350003. http://dx.doi.org/10.1142/s0218001413500031.

Abstract:
Feature ranking is a fundamental preprocessing step for feature selection, before performing any data mining task. Essentially, when there are too many features in the problem, dimensionality reduction through discarding weak features is highly desirable. In this paper, we have developed an efficient feature ranking algorithm for selecting the more relevant features prior to derivation of classification predictors. Unlike ranking criteria that rely on the training error of a predictor based on a feature, our approach is distance-based, employing only the statistical distribution of classes in each feature. It uses a scoring function as the ranking criterion to evaluate the correlation between each feature and the classes. This function comprises three measures for each class: the statistical between-class distance, the interclass overlapping measure, and an estimate of class impurity. To compute the statistical parameters used in these measures, a normalized histogram, obtained for each class, is employed as its a priori probability density. Since the proposed algorithm examines each feature individually, it provides a fast and cost-effective method for feature ranking. We have tested the effectiveness of our approach on some benchmark data sets with high dimensions. For this purpose, some top-ranked features are selected and used in some rule-based classifiers as the target data mining task. Compared with some popular feature ranking methods, the experimental results show that our approach performs better, as it identifies the more relevant features, which leads to lower classification error.
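To make the histogram-based idea in this abstract concrete, here is a minimal sketch of a univariate, distribution-only ranking. It is not the paper's scoring function: it implements only a simple class-overlap term (one minus the average pairwise overlap of normalized class histograms), and the function names, the bin count, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def class_histogram(values, lo, hi, bins):
    """Normalized histogram (entries sum to 1) of `values` over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def separation_score(feature, labels, bins=10):
    """1 - average pairwise histogram overlap; higher means better separated classes."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes")
    lo, hi = min(feature), max(feature)
    hists = []
    for c in classes:
        vals = [x for x, y in zip(feature, labels) if y == c]
        hists.append(class_histogram(vals, lo, hi, bins))
    pairs = list(combinations(hists, 2))
    # Overlap of two normalized histograms = sum of bin-wise minima (0 to 1).
    overlap = sum(sum(min(p, q) for p, q in zip(a, b)) for a, b in pairs) / len(pairs)
    return 1.0 - overlap


def rank_features(columns, labels, bins=10):
    """Return feature names ordered from most to least class-separating."""
    scores = {name: separation_score(col, labels, bins) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: "good" separates the classes, "noise" does not.
columns = {"good": [1, 2, 3, 9, 10, 11], "noise": [5, 1, 9, 5, 1, 9]}
labels = [0, 0, 0, 1, 1, 1]
print(rank_features(columns, labels))
```

Because each feature is scored on its own, the cost grows linearly with the number of features, which is the property the abstract highlights.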
2

Naim, Faradila, Mahfuzah Mustafa, Norizam Sulaiman, and Zarith Liyana Zahari. "Dual-Layer Ranking Feature Selection Method Based on Statistical Formula for Driver Fatigue Detection of EMG Signals." Traitement du Signal 39, no. 3 (June 30, 2022): 1079–88. http://dx.doi.org/10.18280/ts.390335.

Abstract:
Electromyography (EMG) signals are one of the most studied inputs for driver drowsiness detection systems. As the number of EMG features available can be daunting, finding the most significant and minimal subset of features is desirable. Hence, a simplified feature selection method is necessary. This work proposed a dual-layer ranking feature selection algorithm based on statistical formula of EMG signals for driver fatigue detection. In the beginning, in the first layer, 21 filter algorithms were calculated to rank 47 sets of EMG features (25 time-domain and 9 frequency-domain) and applied to six classifiers. Then, in the second layer, all the ranks were re-ranked based on the statistical formula (average, median, mode and variance). The classification performance of all rankings was compared along with the number of features. The highest classification accuracy achieved was 95% for 12 features using the Average Statistical Rank (ASR) and LDA classifier. It is conclusive that a combination of features from the time domain and frequency domain can deliver better performance compared to a single domain feature. Concurrently, the statistical rank ASR performed better than the single filter rank by reducing the number of features. The proposed model can be a benchmark for the enhanced feature selection method for EMG driver fatigue signal.
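The two-layer scheme described here can be sketched roughly as follows: each filter produces a score per feature, the scores are turned into ranks, and a statistic of the ranks (the average or the median, for instance) gives the final order. This is only an illustration under assumptions; the feature names, the three score lists, and the tie-breaking rule are hypothetical and not taken from the paper.

```python
from statistics import mean, median


def scores_to_ranks(scores):
    """Convert scores (higher is better) to ranks, 1 = best; ties keep input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def aggregate_ranks(rank_lists, statistic=mean):
    """Second layer: combine the per-filter ranks of each feature with `statistic`."""
    return [statistic(per_feature) for per_feature in zip(*rank_lists)]


def rerank(feature_names, filter_scores, statistic=mean):
    """Return feature names ordered by their aggregated rank (best first)."""
    rank_lists = [scores_to_ranks(s) for s in filter_scores]
    combined = aggregate_ranks(rank_lists, statistic)
    return [name for _, name in sorted(zip(combined, feature_names))]


# Hypothetical scores from three filter methods for three EMG features.
names = ["rms", "mav", "zc"]
filter_scores = [
    [0.9, 0.5, 0.1],
    [0.8, 0.2, 0.4],
    [0.3, 0.9, 0.1],
]
print(rerank(names, filter_scores, statistic=mean))
print(rerank(names, filter_scores, statistic=median))
```

Swapping `statistic` for another function of the ranks gives the other second-layer variants without changing the first layer.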
3

Soheili, Majid, Amir-Masoud Eftekhari Moghadam, and Mehdi Dehghan. "Statistical Analysis of the Performance of Rank Fusion Methods Applied to a Homogeneous Ensemble Feature Ranking." Scientific Programming 2020 (September 10, 2020): 1–14. http://dx.doi.org/10.1155/2020/8860044.

Abstract:
Feature ranking, as a subcategory of feature selection, is an essential preprocessing technique that ranks all features of a dataset such that many important features denote a lot of information. Ensemble learning has two advantages. First, it is based on the assumption that combining the outputs of different models can lead to a better outcome than the output of any individual model. Second, scalability is an intrinsic characteristic that is crucial in coping with a large-scale dataset. In this paper, a homogeneous ensemble feature ranking algorithm is considered, and the nine rank fusion methods used in this algorithm are analyzed comparatively. The experimental studies are performed on six real medium-sized datasets, and the area under the feature-forward-addition curve criterion is assessed. Finally, the statistical analysis by repeated-measures analysis of variance reveals that there is no big difference in the performance of the rank fusion methods applied in a homogeneous ensemble feature ranking; however, this difference is statistically significant, and the B-Min method performs slightly better.
4

Mogstad, Magne, Joseph Romano, Azeem Shaikh, and Daniel Wilhelm. "Statistical Uncertainty in the Ranking of Journals and Universities." AEA Papers and Proceedings 112 (May 1, 2022): 630–34. http://dx.doi.org/10.1257/pandp.20221064.

Abstract:
Economists are obsessed with rankings of institutions, journals, or scholars according to the value of some feature of interest. These rankings are invariably computed using estimates rather than the true values of such features. As a result, there may be considerable uncertainty concerning the ranks. In this paper, we consider the problem of accounting for such uncertainty by constructing confidence sets for the ranks. We consider the problem of constructing marginal confidence sets for the rank of, say, a particular journal as well as simultaneous confidence sets for the ranks of all journals.
5

Zhang, Zhicheng, Xiaokun Liang, Wenjian Qin, Shaode Yu, and Yaoqin Xie. "matFR: a MATLAB toolbox for feature ranking." Bioinformatics 36, no. 19 (July 8, 2020): 4968–69. http://dx.doi.org/10.1093/bioinformatics/btaa621.

Abstract:
Summary: Nowadays, it is feasible to collect massive features for quantitative representation and precision medicine, and thus, automatic ranking to figure out the most informative and discriminative ones becomes increasingly important. To address this issue, 42 feature ranking (FR) methods are integrated to form a MATLAB toolbox (matFR). The methods apply mutual information, statistical analysis, structure clustering and other principles to estimate the relative importance of features in specific measure spaces. Specifically, these methods are summarized, and an example shows how to apply a FR method to sort mammographic breast lesion features. The toolbox is easy to use and flexible to integrate additional methods. Importantly, it provides a tool to compare, investigate and interpret the features selected for various applications. Availability and implementation: The toolbox is freely available at http://github.com/NicoYuCN/matFR. A tutorial and an example with a dataset are provided.
6

SADEGHI, SABEREH, and HAMID BEIGY. "A NEW ENSEMBLE METHOD FOR FEATURE RANKING IN TEXT MINING." International Journal on Artificial Intelligence Tools 22, no. 03 (June 2013): 1350010. http://dx.doi.org/10.1142/s0218213013500103.

Abstract:
Dimensionality reduction is a necessary task in data mining when working with high dimensional data. A type of dimensionality reduction is feature selection. Feature selection based on feature ranking has received much attention from researchers. The major reasons are its scalability, ease of use, and fast computation. Feature ranking methods can be divided into different categories and may use different measures for ranking features. Recently, ensemble methods have entered the field of ranking and achieved higher accuracy than other methods. Accordingly, in this paper a heterogeneous ensemble-based algorithm for feature ranking is proposed. The base ranking methods in this ensemble structure are chosen from different categories like information theoretic, distance based, and statistical methods. The results of the base ranking methods are then fused into a final feature subset by means of a genetic algorithm. The diversity of the base methods improves the quality of the initial population of the genetic algorithm and thus reduces the convergence time of the genetic algorithm. In most ranking methods, it is the user's task to determine the threshold for choosing the appropriate subset of features. This is a problem, which may cause the user to try many different values to select a good one. In the proposed algorithm, the difficulty of determining a proper threshold by the user is decreased. The performance of the algorithm is evaluated on four different text datasets and the experimental results show that the proposed method outperforms all other five feature ranking methods used for comparison. One advantage of the proposed method is that it is independent of the classification method used.
7

Novakovic, Jasmina, Perica Strbac, and Dusan Bulatovic. "Toward optimal feature selection using ranking methods and classification algorithms." Yugoslav Journal of Operations Research 21, no. 1 (2011): 119–35. http://dx.doi.org/10.2298/yjor1101119n.

Abstract:
We presented a comparison between several feature ranking methods used on two real datasets. We considered six ranking methods that can be divided into two broad categories: statistical and entropy-based. Four supervised learning algorithms are adopted to build models, namely, IB1, Naive Bayes, C4.5 decision tree and the RBF network. We showed that the selection of ranking methods could be important for classification accuracy. In our experiments, ranking methods with different supervised learning algorithms give quite different results for balanced accuracy. Our cases confirm that, in order to be sure that a subset of features giving the highest accuracy has been selected, the use of many different indices is recommended.
8

Leguia, Marc G., Zoran Levnajić, Ljupčo Todorovski, and Bernard Ženko. "Reconstructing dynamical networks via feature ranking." Chaos: An Interdisciplinary Journal of Nonlinear Science 29, no. 9 (September 2019): 093107. http://dx.doi.org/10.1063/1.5092170.

9

Wang, W., P. Jones, and D. Partridge. "A Comparative Study of Feature-Salience Ranking Techniques." Neural Computation 13, no. 7 (July 1, 2001): 1603–23. http://dx.doi.org/10.1162/089976601750265027.

Abstract:
We assess the relative merits of a number of techniques designed to determine the relative salience of the elements of a feature set with respect to their ability to predict a category outcome, for example, which features of a character contribute most to accurate character recognition. A number of different neural-net-based techniques have been proposed (by us and others) in addition to a standard statistical technique, and we add a technique based on inductively generated decision trees. The salience of the features that compose a proposed set is an important problem to solve efficiently and effectively, not only for neural computing technology but also in order to provide a sound basis for any attempt to design an optimal computational system. The focus of this study is the efficiency and the effectiveness with which high-salience subsets of features can be identified in the context of ill-understood and potentially noisy real-world data. Our two simple approaches, weight clamping using a neural network and feature ranking using a decision tree, generally provide a good, consistent ordering of features. In addition, linear correlation often works well.
10

Werner, Tino. "A review on instance ranking problems in statistical learning." Machine Learning 111, no. 2 (November 18, 2021): 415–63. http://dx.doi.org/10.1007/s10994-021-06122-3.

Abstract:
Ranking problems, also known as preference learning problems, define a widely spread class of statistical learning problems with many applications, including fraud detection, document ranking, medicine, chemistry, credit risk screening, image ranking or media memorability. While there already exist reviews concentrating on specific types of ranking problems like label and object ranking problems, there does not yet seem to exist an overview concentrating on instance ranking problems that both includes developments in distinguishing between different types of instance ranking problems as well as careful discussions about their differences and the applicability of the existing ranking algorithms to them. In instance ranking, one explicitly takes the responses into account with the goal to infer a scoring function which directly maps feature vectors to real-valued ranking scores, in contrast to object ranking problems where the ranks are given as preference information with the goal to learn a permutation. In this article, we systematically review different types of instance ranking problems and the corresponding loss functions resp. goodness criteria. We discuss the difficulties when trying to optimize those criteria. As for a detailed and comprehensive overview of existing machine learning techniques to solve such ranking problems, we systematize existing techniques and recapitulate the corresponding optimization problems in a unified notation. We also discuss to which of the instance ranking problems the respective algorithms are tailored and identify their strengths and limitations. Computational aspects and open research problems are also considered.
11

Polaka, Inese. "Feature Selection Approaches In Antibody Display." Environment. Technology. Resources. Proceedings of the International Scientific and Practical Conference 2 (August 5, 2015): 16. http://dx.doi.org/10.17770/etr2011vol2.998.

Abstract:
Molecular diagnostics tools provide specific data that have high dimensionality due to many factors analyzed in one experiment and few records due to high costs of the experiments. This study addresses the problem of dimensionality in melanoma patient antibody display data by applying data mining feature selection techniques. The article describes feature selection ranking and subset selection approaches and analyzes the performance of various methods evaluating selected feature subsets using classification algorithms C4.5, Random Forest, SVM and Naïve Bayes, which have to differentiate between cancer patient data and healthy donor data. The feature selection methods include correlation-based, consistency based and wrapper subset selection algorithms as well as statistical, information evaluation, prediction potential of rules and SVM feature selection evaluation of single features for ranking purposes.
12

Hasan and Kim. "A Hybrid Feature Pool-Based Emotional Stress State Detection Algorithm Using EEG Signals." Brain Sciences 9, no. 12 (December 13, 2019): 376. http://dx.doi.org/10.3390/brainsci9120376.

Abstract:
Human stress analysis using electroencephalogram (EEG) signals requires a detailed and domain-specific information pool to develop an effective machine learning model. In this study, a multi-domain hybrid feature pool is designed to identify most of the important information from the signal. The hybrid feature pool contains features from two types of analysis: (a) statistical parametric analysis from the time domain, and (b) wavelet-based bandwidth specific feature analysis from the time-frequency domain. Then, a wrapper-based feature selector, Boruta, is applied for ranking all the relevant features from that feature pool instead of considering only the non-redundant features. Finally, the k-nearest neighbor (k-NN) algorithm is used for final classification. The proposed model yields an overall accuracy of 73.38% for the total considered dataset. To validate the performance of the proposed model and highlight the necessity of designing a hybrid feature pool, the model was compared to non-linear dimensionality reduction techniques, as well as those without feature ranking.
13

Seo, Jinwook, and Ben Shneiderman. "A Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data." Information Visualization 4, no. 2 (May 19, 2005): 96–113. http://dx.doi.org/10.1057/palgrave.ivs.9500091.

Abstract:
Interactive exploration of multidimensional data sets is challenging because: (1) it is difficult to comprehend patterns in more than three dimensions, and (2) current systems often are a patchwork of graphical and statistical methods leaving many researchers uncertain about how to explore their data in an orderly manner. We offer a set of principles and a novel rank-by-feature framework that could enable users to better understand distributions in one (1D) or two dimensions (2D), and then discover relationships, clusters, gaps, outliers, and other features. Users of our framework can view graphical presentations (histograms, boxplots, and scatterplots), and then choose a feature detection criterion to rank 1D or 2D axis-parallel projections. By combining information visualization techniques (overview, coordination, and dynamic query) with summaries and statistical methods users can systematically examine the most important 1D and 2D axis-parallel projections. We summarize our Graphics, Ranking, and Interaction for Discovery (GRID) principles as: (1) study 1D, study 2D, then find features (2) ranking guides insight, statistics confirm. We implemented the rank-by-feature framework in the Hierarchical Clustering Explorer, but the same data exploration principles could enable users to organize their discovery process so as to produce more thorough analyses and extract deeper insights in any multidimensional data application, such as spreadsheets, statistical packages, or information visualization tools.
14

Lee, Won-Yung, Sang Hyuk Kim, Siwoo Lee, Young Woo Kim, and Ji-Hwan Kim. "Exploratory Analysis of the Sasang Constitution by Combining Network Analysis and Information Entropy." Healthcare 10, no. 11 (November 10, 2022): 2248. http://dx.doi.org/10.3390/healthcare10112248.

Abstract:
Sasang constitutional medicine is a unique concept in Korean medicine that can provide valuable insights into personalized healthcare and disease treatment. In this study, we combined network analysis and information entropy to systematically investigate the related information of Sasang constitutional (SC) types. A feature network was constructed using SC type and clinical information. The SC type-associated features and feature classes were identified using statistical analysis and entropy ranking. The patient network was constructed based on SC-type-associated features. We found that the feature network was closely connected within the features of the same classes and between several feature class pairs, including the symptom class. Most of the separation values between the feature classes, including the symptom class, were negative. In addition, we found 42 clinical features related to the SC type, and two important classes -personality and cold/heat- that increase the entropy ranking of the SC type. In the patient network, we found sparsely connected modules between SC types and a positive separation value between the Taeeumin–Soeumin and Taeeumin–Soyangin pairs. Our data-driven approach provides a deeper understanding of modernized forms of SC types and suggests that SC type is a practically useful concept for stratified healthcare and personalized medicine.
15

Patil, Abhijeet R., Jongwha Chang, Ming-Ying Leung, and Sangjin Kim. "Analyzing high dimensional correlated data using feature ranking and classifiers." Computational and Mathematical Biophysics 7, no. 1 (December 31, 2019): 98–120. http://dx.doi.org/10.1515/cmb-2019-0008.

Abstract:
The Illumina Infinium HumanMethylation27 (Illumina 27K) BeadChip assay is a relatively recent high-throughput technology that allows over 27,000 CpGs to be assayed. The Illumina 27K methylation data is less commonly used in comparison to gene expression in bioinformatics. It provides a critical need to find the optimal feature ranking (FR) method for handling the high dimensional data. The optimal FR method on the classifier is not well known, and choosing the best performing FR method becomes more challenging in high dimensional data setting. Therefore, identifying the statistical methods which boost the inference is of crucial importance in this context. This paper describes the detailed performances of FR methods such as Fisher score, information gain, chi-square, and minimum redundancy and maximum relevance on different classification methods such as Adaboost, Random Forest, Naive Bayes, and Support Vector Machines. Through simulation study and real data applications, we show that the Fisher score as an FR method, when applied on all the classifiers, achieved best prediction accuracy with significantly small number of ranked features.
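For readers unfamiliar with the Fisher score mentioned in this abstract, a minimal sketch follows. It uses the common textbook definition, the class-size-weighted squared distance of class means from the overall mean divided by the class-size-weighted within-class variance. Whether the paper uses exactly this variant is an assumption, and the function names and toy data are hypothetical.

```python
from statistics import fmean, pvariance


def fisher_score(feature, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    overall = fmean(feature)
    between = within = 0.0
    for c in set(labels):
        vals = [x for x, y in zip(feature, labels) if y == c]
        between += len(vals) * (fmean(vals) - overall) ** 2
        within += len(vals) * pvariance(vals)  # population variance inside the class
    if within == 0.0:
        # Degenerate case: no spread inside any class.
        return float("inf") if between > 0.0 else 0.0
    return between / within


def rank_by_fisher(columns, labels):
    """Return feature names ordered from highest to lowest Fisher score."""
    scores = {name: fisher_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: "cpg_a" differs between classes, "cpg_b" does not.
columns = {"cpg_a": [1, 2, 3, 7, 8, 9], "cpg_b": [1, 5, 9, 1, 5, 9]}
labels = [0, 0, 0, 1, 1, 1]
print(fisher_score(columns["cpg_a"], labels))  # 54 / 4 = 13.5
print(rank_by_fisher(columns, labels))
```

The score treats every feature independently, so correlated features can all rank highly; redundancy-aware methods such as minimum redundancy and maximum relevance address that separately.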
16

Rana, Bharti, Akanksha Juneja, and Ramesh Kumar Agrawal. "Relevant Feature Subset Selection from Ensemble of Multiple Feature Extraction Methods for Texture Classification." International Journal of Computer Vision and Image Processing 5, no. 1 (January 2015): 48–65. http://dx.doi.org/10.4018/ijcvip.2015010103.

Abstract:
Performance of texture classification for a given set of texture patterns depends on the choice of feature extraction technique. Integration of features from various feature extraction methods not only eliminates risk of method selection but also brings benefits from the participating methods which play complementary role among themselves to represent underlying texture pattern. However, it comes at the cost of a large feature vector which may contain redundant features. The presence of such redundant features leads to high computation time, memory requirement and may deteriorate the performance of the classifier. In this research work, in the first phase, a pool of texture features is constructed by integrating features from seven well known feature extraction methods. In the second phase, a few popular feature subset selection techniques are investigated to determine a minimal subset of relevant features from this pool of features. In order to check the efficacy of the proposed approach, performance is evaluated on publicly available Brodatz dataset, in terms of classification error. Experimental results demonstrate substantial improvement in classification performance over existing feature extraction techniques. Furthermore, ranking and statistical test also strengthen the results.
17

Salama, Mostafa A., and Ghada Hassan. "A Novel Feature Selection Measure Partnership-Gain." International Journal of Online and Biomedical Engineering (iJOE) 15, no. 04 (February 27, 2019): 4. http://dx.doi.org/10.3991/ijoe.v15i04.9831.

Abstract:
Multivariate feature selection techniques search for the optimal features subset to reduce the dimensionality and hence the complexity of a classification task. Statistical feature selection techniques measure the mutual correlation between features as well as the correlation of each feature to the target feature. However, adding a feature to a feature subset could deteriorate the classification accuracy even though this feature positively correlates to the target class. Although most of existing feature ranking/selection techniques consider the interdependency between features, the nature of interaction between features in relationship to the classification problem is still not well investigated. This study proposes a technique for forward feature selection that calculates the novel measure Partnership-Gain to select a subset of features whose partnership constructively correlates to the target feature classification. Comparative analysis to other well-known techniques shows that the proposed technique has either an enhanced or a comparable classification accuracy on the datasets studied. We present a visualization of the degree and direction of the proposed measure of features’ partnerships for a better understanding of the measure’s nature.
18

Shi, Xin, and Jiangming Kan. "The Sensitivity Feature Analysis for Tree Species Based on Image Statistical Properties." Forests 14, no. 5 (May 21, 2023): 1057. http://dx.doi.org/10.3390/f14051057.

Abstract:
While the statistical properties of images are vital in forestry engineering, the usefulness of these properties in various forestry tasks may vary, and certain image properties might not be enough to adequately describe a particular tree species. To address this problem, we propose a novel method to comprehensively analyze the relationship between various image statistical properties and images of different tree species, and to determine the subset of features that best describe each individual tree species. In this study, we employed various image statistical properties to quantify images of five distinct tree species from diverse places. Multiple feature-filtering methods were used to find the feature subset with the greatest correlation with the tree species category variable. Support Vector Machines (SVM) were employed to determine the number of features with the greatest correlation with the tree species, and a grid search was used to optimize the model. For each type of tree species image, we obtained the important ranking of all features in this type of tree species, and the sensitive feature subset of various tree species according to the order of features was determined by adding them to the Deep Support Vector Data Description (Deep SVDD). Finally, the feasibility of using a sensitive subset of the tree species was confirmed. The experimental results revealed that by utilizing the filtering method in conjunction with SVM, a total of eight feature subsets with the highest correlation with tree species categories were identified. Additionally, the sensitive feature subsets of different tree species exhibited significant differences. Remarkably, employing the sensitive feature subset of each tree species resulted in F1-score higher than 0.7 for all tree species. 
These experimental results demonstrate that the sensitive feature subset of tree species based on image statistical properties can serve as a potential representation of a specific tree species, while features that are less strongly associated with tree species may be significant in related areas, such as forestry protection and other related fields.
19

Grover, Chhaya, and Neelam Turk. "Optimal Statistical Feature Subset Selection for Bearing Fault Detection and Severity Estimation." Shock and Vibration 2020 (August 26, 2020): 1–18. http://dx.doi.org/10.1155/2020/5742053.

Abstract:
The performance of bearing fault detection systems based on machine learning techniques largely depends on the selected features. Hence, selection of an ideal number of dominant features from a comprehensive list of features is needed to decrease the number of computations involved in fault detection. In this paper, we attempted statistical time-domain features, namely, Hjorth parameters (activity, mobility, and complexity) and normal negative log likelihood for Gaussian mixture model (GMM) for the first time in addition to 26 other established statistical features for identification of bearing fault type and severity. Two datasets are derived from a publicly available database of Case Western Reserve University to identify the capability of features in fault identification under various fault sizes and motor loads. Features have been investigated using a two-step approach—filter-based ranking with 3 metrics followed by feature subset selection with 11 search techniques. The results indicate that the set of features root mean square, geometric mean, zero crossing rate, Hjorth parameter—mobility, and normal negative log likelihood for GMM outperforms other features. We also compared the diagnostic performance of normal negative log likelihood for GMM with the established feature normal negative log likelihood for single Gaussian. The selected set of statistical features is validated using ensemble rule-based classifiers and showed an average accuracy of 96.75% with proposed statistical features subset and 99.63% with all 30 features. F-measure and G-mean scores are also calculated to investigate their performance on datasets with class imbalance. The diagnostic effectiveness of the features was further validated on a bearing dataset obtained from an operating thermal power plant. 
The results obtained show that our newly proposed feature subset plays a major role in achieving good classification results and has a future potential of being used in a high-dimensional dataset with multidomain features.
20

Vashisht, Rohit, and Syed Afzal Murtaza Rizvi. "An Empirical Study of Heterogeneous Cross-Project Defect Prediction Using Various Statistical Techniques." International Journal of e-Collaboration 17, no. 2 (April 2021): 55–71. http://dx.doi.org/10.4018/ijec.2021040104.

Abstract:
Cross-project defect prediction (CPDP) forecasts flaws in a target project through defect prediction models (DPM) trained on defect data of another project. However, CPDP has a prevalent problem (i.e., distinct projects must have identical features to describe themselves). This article focuses on heterogeneous CPDP (HCPDP) modeling, which does not require the same metric set between two applications and builds DPM based on metrics showing comparable distribution in their values for a given pair of datasets. This paper evaluates HCPDP modeling empirically and theoretically; it comprises three main phases: feature ranking and feature selection, metric matching, and finally, predicting defects in the target application. The experiments were carried out on 13 benchmark datasets of three open source projects. Results show that the performance of HCPDP is very much comparable to the baseline within-project defect prediction (WPDP), and that the XG boosting classification model gives the best results when used in conjunction with Kendall's method of correlation, as compared to the other set of classifiers.
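The abstract does not say how Kendall's correlation enters the pipeline, so the sketch below only illustrates the coefficient itself and one plausible use: ranking software metrics by the absolute value of their Kendall tau with a defect count. It computes the tau-a variant (no correction for ties); the metric names and data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs; ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def rank_metrics(metrics, defects):
    """Order metric names by the strength (absolute value) of their tau with `defects`."""
    tau = {name: kendall_tau_a(col, defects) for name, col in metrics.items()}
    return sorted(tau, key=lambda name: abs(tau[name]), reverse=True)


# Hypothetical per-module metrics and defect counts.
metrics = {"loc": [10, 20, 30, 40], "comments": [3, 1, 4, 2]}
defects = [0, 1, 2, 3]
print(kendall_tau_a(metrics["loc"], defects))       # 1.0, perfectly concordant
print(kendall_tau_a(metrics["comments"], defects))  # 0.0, three concordant and three discordant pairs
print(rank_metrics(metrics, defects))
```

This pairwise loop is quadratic in the number of samples; for large datasets a library routine with tie handling would be the more practical choice.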
21

Sireci, Stephen. "Beyond Ranking of Nations: Innovative Research on PISA." Teachers College Record: The Voice of Scholarship in Education 117, no. 1 (January 2015): 1–8. http://dx.doi.org/10.1177/016146811511700101.

Abstract:
In this article, I review and provide comments on the articles that comprise this special issue on research conducted using PISA data. The articles represent a variety of issues and methods related to contemporary educational assessments and education policies. They feature state-of-the-art statistical analyses and instructive exploration of complex issues related to international assessment of students’ math, reading, and science achievement. A common theme underlying the articles is improving the interpretations of the results of educational assessments. Some articles address this theme via post hoc analysis or discussion of results, while others conduct research that informs future test development efforts.
22

Sireci, Stephen. "Beyond Ranking of Nations: Innovative Research on PISA." Teachers College Record: The Voice of Scholarship in Education 117, no. 1 (January 2015): 1–8. http://dx.doi.org/10.1177/016146811511700114.

Abstract:
In this article, I review and provide comments on the articles that comprise this special issue on research conducted using PISA data. The articles represent a variety of issues and methods related to contemporary educational assessments and education policies. They feature state-of-the-art statistical analyses and instructive exploration of complex issues related to international assessment of students’ math, reading, and science achievement. A common theme underlying the articles is improving the interpretations of the results of educational assessments. Some articles address this theme via post hoc analysis or discussion of results, while others conduct research that informs future test development efforts.
23

KHOSHGOFTAAR, TAGHI M., KEHAN GAO, and AMRI NAPOLITANO. "AN EMPIRICAL STUDY OF FEATURE RANKING TECHNIQUES FOR SOFTWARE QUALITY PREDICTION." International Journal of Software Engineering and Knowledge Engineering 22, no. 02 (March 2012): 161–83. http://dx.doi.org/10.1142/s0218194012400013.

Abstract:
The primary goal of software quality engineering is to produce a high quality software product through the use of some specific techniques and processes. One strategy is applying data mining techniques to software metric and defect data collected during the software development process to identify potential low-quality program modules. In this paper, we investigate the use of feature selection in the context of software quality estimation (also referred to as software defect prediction), where a classification model is used to predict whether program modules (instances) are fault-prone or not-fault-prone. Seven filter-based feature ranking techniques are examined. Among them, six are commonly used, and the other one, named signal to noise ratio (SNR), is rarely employed. The objective of the paper is to compare these seven techniques for various software data sets and assess their effectiveness for software quality modeling. A case study is performed on 16 software data sets, and classification models are built with five different learners and evaluated with two performance metrics. Our experimental results are summarized based on statistical tests for significance. The main conclusion is that the SNR technique performs as well as the best performer of the six commonly used techniques.
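The abstract names signal-to-noise ratio (SNR) as a ranking criterion without giving its formula. A minimal sketch follows, assuming the commonly used two-class definition (difference of class means divided by the sum of class standard deviations). The absolute value, the metric names `loc` and `comments`, and the toy data are all assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, stdev

def snr_score(values, labels):
    """Assumed SNR: |mean_pos - mean_neg| / (std_pos + std_neg)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = stdev(pos) + stdev(neg)
    return abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0

def rank_features(columns, labels):
    """Order feature names from highest to lowest SNR score."""
    scores = {name: snr_score(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical software metrics; label 1 = fault-prone, 0 = not fault-prone.
columns = {
    "loc": [10, 12, 11, 30, 32, 31],
    "comments": [5, 9, 7, 6, 8, 7],
}
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features(columns, labels)
```

Because each feature is scored independently of any learner, this kind of filter is cheap to compute, which is the usual argument for filter-based ranking over wrapper methods.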
24

Yin, Xiaoxia, Samra Irshad, and Yanchun Zhang. "Classifiers fusion for improved vessel recognition with application in quantification of generalized arteriolar narrowing." Journal of Innovative Optical Health Sciences 13, no. 01 (November 25, 2019): 1950021. http://dx.doi.org/10.1142/s1793545819500214.

Abstract:
This paper attempts to estimate a diagnostically relevant measure, i.e., the Arteriovenous Ratio, with improved retinal vessel classification using feature ranking strategies and a decision-combination scheme over multiple classifiers. The features exploited for retinal vessel characterization are based on statistical measures of the histogram, different filter responses of images, and local gradient information. The feature selection process is based on two feature ranking approaches (the Pearson Correlation Coefficient technique and the Relief-F method) to rank the features, followed by use of the maximum classification accuracy of three supervised classifiers (k-Nearest Neighbor, Support Vector Machine and Naïve Bayes) as a threshold for feature subset selection. Retinal vessels are labeled using the selected feature subset and the proposed hybrid classification scheme, i.e., decision fusion of multiple classifiers. The comparative analysis shows an increase in vessel classification accuracy as well as in Arteriovenous Ratio calculation performance. The system is tested on three databases: a local dataset of 44 images and two publicly available databases, INSPIRE-AVR containing 40 images and VICAVR containing 58 images. The local database also contains images with pathologically diseased structures. The performance of the proposed system is assessed by comparing the experimental results with the gold standard estimations as well as with the results of previous methodologies. Overall, accuracies of 90.45%, 93.90% and 87.82% are achieved in retinal blood vessel separation, with mean errors of 0.0565, 0.0650 and 0.0849 in Arteriovenous Ratio calculation, for the Local, INSPIRE-AVR and VICAVR datasets, respectively.
25

Lee, Younghee Cheri, and Soomin Jwa. "Feature Importance Ranking of Translationese Markers in L2 Writing: A Corpus-Based Statistical Analysis Across Disciplines." English Teaching 78, no. 2 (June 30, 2023): 55–81. http://dx.doi.org/10.15858/engtea.78.2.202206.55.

26

Lee, Younghee Cheri, and Soomin Jwa. "Feature Importance Ranking of Translationese Markers in L2 Writing: A Corpus-Based Statistical Analysis Across Disciplines." English Teaching 78, no. 2 (June 30, 2023): 55–81. http://dx.doi.org/10.15858/engtea.78.2.202306.55.

27

Orzechowska, Paula. "In search of phonotactic preferences." Yearbook of the Poznan Linguistic Meeting 2, no. 1 (September 1, 2016): 167–93. http://dx.doi.org/10.1515/yplm-2016-0008.

Abstract:
The objective of this contribution is to provide an analysis of consonant clusters based on the assumption that phonotactic preferences are encoded in phonological features of individual segments forming a cluster. This encoding is expressed by a set of parameters established for the following features: complexity, place of articulation, manner of articulation and voicing. On the basis of empirically observed tendencies of feature distribution and co-occurrence, novel phonotactic preferences for English word-initial consonant clusters are proposed. Statistical methods allow us to weigh the preferences and determine a ranking of phonological features in cluster formation.
28

Chimlek, Sutasinee, Part Pramokchon, and Punpiti Piamsa-nga. "The Selection of Useful Visual Words for Class-Imbalanced Data in Image Classification." International Journal of Electrical and Computer Engineering (IJECE) 6, no. 1 (February 1, 2016): 307. http://dx.doi.org/10.11591/ijece.v6i1.8633.

Abstract:
The bag of visual words (BOVW) has recently been used for image classification in large datasets. A major problem of image classification using BOVW is high dimensionality, with most features usually being irrelevant and different BOVW for multi-view images in each class. Therefore, the selection of significant visual words for multi-view images in each class is an essential method to reduce the size of BOVW while retaining the high performance of image classification. Many feature scores for ranking produce low classification performance for class imbalanced distributions and multi-views in each class. We propose a feature score based on the statistical t-test technique, which is a statistical evaluation of the difference between two sample means, to assess the discriminating power of each individual feature. The multi-class image classification performance of the proposed feature score is compared with four modern feature scores, such as Document Frequency (DF), Mutual information (MI), Pointwise Mutual information (PMI) and Chi-square statistics (CHI). The results show that the average F1-measure performance on the Paris dataset and the SUN397 dataset using the proposed feature score are 92% and 94%, respectively, while all other feature scores do not exceed 80%.
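The abstract describes scoring each feature with a t-test on the difference between two sample means. The sketch below is one plausible reading, assuming a one-vs-rest split per class and Welch's unequal-variance t-statistic; the paper's exact score may differ. The visual-word counts and class names are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def t_score(values, labels, target):
    """Absolute t-statistic of one feature for class `target` versus the rest."""
    inside = [v for v, y in zip(values, labels) if y == target]
    outside = [v for v, y in zip(values, labels) if y != target]
    return abs(welch_t(inside, outside))

# Hypothetical visual-word counts over six images from two classes.
labels = ["tower", "tower", "tower", "bridge", "bridge", "bridge"]
word_a = [4, 5, 6, 0, 1, 2]   # much more frequent in "tower" images
word_b = [2, 3, 4, 2, 4, 3]   # same mean in both classes
score_a = t_score(word_a, labels, "tower")
score_b = t_score(word_b, labels, "tower")
```

Dividing by the standard error means the score reflects each group's sample size and spread, which is one reason a t-based score may behave differently from frequency-based scores under class imbalance.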
29

Chimlek, Sutasinee, Part Pramokchon, and Punpiti Piamsa-nga. "The Selection of Useful Visual Words for Class-Imbalanced Data in Image Classification." International Journal of Electrical and Computer Engineering (IJECE) 6, no. 1 (February 1, 2016): 307. http://dx.doi.org/10.11591/ijece.v6i1.pp307-319.

Abstract:
The bag of visual words (BOVW) has recently been used for image classification in large datasets. A major problem of image classification using BOVW is high dimensionality, with most features usually being irrelevant and different BOVW for multi-view images in each class. Therefore, the selection of significant visual words for multi-view images in each class is an essential method to reduce the size of BOVW while retaining the high performance of image classification. Many feature scores for ranking produce low classification performance for class imbalanced distributions and multi-views in each class. We propose a feature score based on the statistical t-test technique, which is a statistical evaluation of the difference between two sample means, to assess the discriminating power of each individual feature. The multi-class image classification performance of the proposed feature score is compared with four modern feature scores, such as Document Frequency (DF), Mutual information (MI), Pointwise Mutual information (PMI) and Chi-square statistics (CHI). The results show that the average F1-measure performance on the Paris dataset and the SUN397 dataset using the proposed feature score are 92% and 94%, respectively, while all other feature scores do not exceed 80%.
30

Lin, Xixun, Yanchun Liang, Limin Wang, Xu Wang, Mary Qu Yang, and Renchu Guan. "A Knowledge Base Completion Model Based on Path Feature Learning." International Journal of Computers Communications & Control 13, no. 1 (February 12, 2018): 71. http://dx.doi.org/10.15837/ijccc.2018.1.3104.

Abstract:
Large-scale knowledge bases, as the foundations for promoting the development of artificial intelligence, have attracted increasing attention in recent years. These knowledge bases contain billions of facts in triple format; yet, they suffer from sparse relations between entities. Researchers proposed the path ranking algorithm (PRA) to solve this fatal problem. To improve the scalability of knowledge inference, PRA exploits random walks to find Horn clauses with chain structures to predict new relations given existing facts. This method can be regarded as a statistical classification issue for statistical relational learning (SRL). However, large-scale knowledge base completion demands superior accuracy and scalability. In this paper, we propose the path feature learning model (PFLM) to achieve this urgent task. More precisely, we define a two-stage model: the first stage aims to learn path features from the existing knowledge base and extra parsed corpus; the second stage uses these path features to predict new relations. The experimental results demonstrate that the PFLM can learn meaningful features and can achieve significant and consistent improvements compared with previous work.
31

Balogun, Abdullateef O., Shuib Basri, Saipunidzam Mahamad, Said J. Abdulkadir, Malek A. Almomani, Victor E. Adeyemo, Qasem Al-Tashi, Hammed A. Mojeed, Abdullahi A. Imam, and Amos O. Bajeh. "Impact of Feature Selection Methods on the Predictive Performance of Software Defect Prediction Models: An Extensive Empirical Study." Symmetry 12, no. 7 (July 9, 2020): 1147. http://dx.doi.org/10.3390/sym12071147.

Abstract:
Feature selection (FS) is a feasible solution for mitigating the high dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). Moreover, many empirical studies on the impact and effectiveness of FS methods on SDP models often lead to contradictory experimental results and inconsistent findings. These contradictions can be attributed to relative study limitations such as small datasets, limited FS search methods, and unsuitable prediction models in the respective scope of studies. It is hence critical to conduct an extensive empirical study to address these contradictions to guide researchers and buttress the scientific tenacity of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naïve Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The ensuing prediction models were evaluated based on accuracy and AUC values. Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used for statistical ranking of the studied FS methods. The experimental results showed that there is no one best FS method, as their respective performance depends on the choice of classifiers, performance evaluation metrics, and dataset. However, we recommend the use of statistical-based, probability-based, and classifier-based filter feature ranking (FFR) methods, respectively, in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended as it outperforms the conventional SFS and LHS-based WFS methods.
32

Liu, Rui, Yangze Lu, Meng Huang, Qijia Xie, and Yalong Tu. "Research on excavation and identification method of typical interference signal during transformer partial discharge test." Journal of Physics: Conference Series 2246, no. 1 (April 1, 2022): 012026. http://dx.doi.org/10.1088/1742-6596/2246/1/012026.

Abstract:
In order to remove interference signals and external discharge signals during partial discharge tests and obtain the effective partial discharge signal, this paper studies typical interference signal mining and identification methods for transformer partial discharge tests. First, based on the experience of previous researchers in partial discharge research, 47 feature parameters including statistical, texture and shape features are extracted, and the original feature space is constructed from them, which provides the data basis for the later optimization of the feature space and signal classification and identification. Next, the random forest algorithm is used to measure and rank the importance of the feature parameters, and the original feature space is optimized according to the ranking results to obtain the optimized feature space. Finally, the optimized feature space is used as the input of the linear discriminant analysis algorithm and the K-nearest neighbor algorithm to classify and identify the partial discharge signal and typical interference signals; the recognition accuracy of these two algorithms is compared and analyzed, and the K-nearest neighbor algorithm is selected as the final classification and identification algorithm.
33

Zhao, Jin Xian, Long Li, and Min Liu. "The Research of Integrated Dynamic Analysis of Metro Project Objectives Based on PCA." Applied Mechanics and Materials 584-586 (July 2014): 2577–80. http://dx.doi.org/10.4028/www.scientific.net/amm.584-586.2577.

Abstract:
This paper establishes an index system for metro projects that incorporates four control targets of engineering project management (quality, cost, construction period, and safety) into an integrated dynamic system. We reduce the dimension of the 17 indexes in the set and obtain four principal components using principal component analysis (PCA), a multivariate statistical method. According to the feature vectors of the four principal components, the components are classified and reorganized. By comparing the score rankings of the four redefined principal components with the evaluation score rankings, we conduct an integrated dynamic analysis of metro project objectives. Finally, the results of a case study show that the conclusion is in line with the possible situations of metro project management.
34

Omejc, Nina, Manca Peskar, Aleksandar Miladinović, Voyko Kavcic, Sašo Džeroski, and Uros Marusic. "On the Influence of Aging on Classification Performance in the Visual EEG Oddball Paradigm Using Statistical and Temporal Features." Life 13, no. 2 (January 31, 2023): 391. http://dx.doi.org/10.3390/life13020391.

Abstract:
The utilization of a non-invasive electroencephalogram (EEG) as an input sensor is a common approach in the field of the brain–computer interfaces (BCI). However, the collected EEG data pose many challenges, one of which may be the age-related variability of event-related potentials (ERPs), which are often used as primary EEG BCI signal features. To assess the potential effects of aging, a sample of 27 young and 43 older healthy individuals participated in a visual oddball study, in which they passively viewed frequent stimuli among randomly occurring rare stimuli while being recorded with a 32-channel EEG set. Two types of EEG datasets were created to train the classifiers, one consisting of amplitude and spectral features in time and another with extracted time-independent statistical ERP features. Among the nine classifiers tested, linear classifiers performed best. Furthermore, we show that classification performance differs between dataset types. When temporal features were used, maximum individuals’ performance scores were higher, had lower variance, and were less affected overall by within-class differences such as age. Finally, we found that the effect of aging on classification performance depends on the classifier and its internal feature ranking. Accordingly, performance will differ if the model favors features with large within-class differences. With this in mind, care must be taken in feature extraction and selection to find the correct features and consequently avoid potential age-related performance degradation in practice.
35

MÀRQUEZ, LLUÍS, and ALESSANDRO MOSCHITTI. "Special issue on statistical learning of natural language structured input and output." Natural Language Engineering 18, no. 2 (March 14, 2012): 147–53. http://dx.doi.org/10.1017/s135132491200006x.

Abstract:
During the last decade, machine learning and, in particular, statistical approaches have become more and more important for research in Natural Language Processing (NLP) and Computational Linguistics. Nowadays, most stakeholders of the field use machine learning, as it can significantly enhance both system design and performance. However, machine learning requires careful parameter tuning and feature engineering for representing language phenomena. The latter becomes more complex when the system input/output data is structured, since the designer has both to (i) engineer features for representing structure and model interdependent layers of information, which is usually a non-trivial task; and (ii) generate a structured output using classifiers, which, in their original form, were developed only for classification or regression. Research in empirical NLP has been tackling this problem by constructing output structures as a combination of the predictions of independent local classifiers, eventually applying post-processing heuristics to correct incompatible outputs by enforcing global properties. More recently, some advances of the statistical learning theory, namely structured output spaces and kernel methods, have brought techniques for directly encoding dependencies between data items in a learning algorithm that performs global optimization. Within this framework, this special issue aims at studying, comparing, and reconciling the typical domain/task-specific NLP approaches to structured data with the most advanced machine learning methods. In particular, the selected papers analyze the use of diverse structured input/output approaches, ranging from re-ranking to joint constraint-based global models, for diverse natural language tasks, i.e., document ranking, syntactic parsing, sequence supertagging, and relation extraction between terms and entities. Overall, the experience with this special issue shows that, although a definitive unifying theory for encoding and generating structured information in NLP applications is still far from being shaped, some interesting and effective best practices can be defined to guide practitioners in modeling their own natural language applications on complex data.
36

Iqbal, Talha, Adnan Elahi, William Wijns, Bilal Amin, and Atif Shahzad. "Improved Stress Classification Using Automatic Feature Selection from Heart Rate and Respiratory Rate Time Signals." Applied Sciences 13, no. 5 (February 24, 2023): 2950. http://dx.doi.org/10.3390/app13052950.

Abstract:
Time-series features are the characteristics of data periodically collected over time. The calculation of time-series features helps in understanding the underlying patterns and structure of the data, as well as in visualizing the data. The manual calculation and selection of time-series features from a large temporal dataset are time-consuming. It requires researchers to consider several signal-processing algorithms and time-series analysis methods to identify and extract meaningful features from the given time-series data. These features are the core of a machine learning-based predictive model and are designed to describe the informative characteristics of the time-series signal. For accurate stress monitoring, it is essential that these features are not only informative but also well-distinguishable and interpretable by the classification models. Recently, a lot of work has been carried out on automating the extraction and selection of time-series features. In this paper, a correlation-based time-series feature selection algorithm is proposed and evaluated on the stress-predict dataset. The algorithm calculates a list of 1578 features of heart rate and respiratory rate signals (combined) using the tsfresh library. These features are then shortlisted to the more specific time-series features using Principal Component Analysis (PCA) and Pearson, Kendall, and Spearman correlation ranking techniques. A comparative study of conventional statistical features (such as mean, standard deviation, median, and mean absolute deviation) versus correlation-based selected features is performed using linear (logistic regression), ensemble (random forest), and clustering (k-nearest neighbours) predictive models. The correlation-based selected features achieved higher classification performance, with an accuracy of 98.6% compared to 67.4% for the conventional statistical features. The outcome of the proposed study suggests that it is vital to have better analytical features rather than conventional statistical features for accurate stress classification.
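To make the correlation-ranking step concrete, the sketch below ranks features by the absolute value of their correlation with the class label, using Pearson's coefficient and Kendall's tau. It is an illustration only: the feature names `hr_mean` and `rr_std`, the toy values, and the use of the simple tau-a variant (which does not correct for ties) are assumptions, and the paper's pipeline built on tsfresh is not reproduced here.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, ties count as zero."""
    s = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    n = len(x)
    return s / (n * (n - 1) / 2)

def rank_by(corr, features, target):
    """Order feature names by decreasing absolute correlation with the target."""
    return sorted(features, key=lambda name: abs(corr(features[name], target)), reverse=True)

# Hypothetical features; target 1 = stressed, 0 = not stressed.
target = [0, 0, 0, 1, 1, 1]
features = {
    "rr_std": [3, 1, 2, 2, 3, 1],
    "hr_mean": [60, 62, 61, 80, 85, 83],
}
pearson_rank = rank_by(pearson, features, target)
kendall_rank = rank_by(kendall_tau_a, features, target)
```

With a binary target, many pairs are tied on the label, so tau-a cannot reach 1 even for a perfectly separating feature; a tie-corrected variant such as tau-b would be the usual choice in practice.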
37

Mila Desi Anasanti, Khairunisa Hilyati, and Annisa Novtariany. "The Exploring feature selection techniques on Classification Algorithms for Predicting Type 2 Diabetes at Early Stage." Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) 6, no. 5 (November 2, 2022): 832–39. http://dx.doi.org/10.29207/resti.v6i5.4419.

Abstract:
Predicting early Type 2 diabetes (T2D) is critical for improved care and better T2D outcomes. An accurate and efficient T2D prediction relies on unbiased relevant features. In this study, we searched for important features to predict T2D by integrating ML-based models for feature selection and classification from 520 individuals newly diagnosed with diabetes or who will develop it. We used standard machine learning classifications, such as logistic regression (LR), Gaussian naive Bayes (NB), decision tree (DT), random forest (RF), support vector machine (SVM) with linear basis function, and k-nearest neighbors (KNN). We set out to systematically explore the viability of main feature selection representing each different technique, such as a statistical filter method (F-score), an entropy-based filter method (mutual information), an ensemble-based filter method (random forest importance), and a stochastic optimization (simultaneous perturbation feature selection and ranking (SpFSR)). We used a stratified 10-fold cross-validation technique and assessed the performance of discrimination, calibration, and clinical utility. We attained the highest accuracy of 98% using RF with the full set of features (16 features), then used RF as a classifier wrapper to select the important features. We observed a combination of SpFSR and RF as the best model with a P-value above 0.05 (P-value = 0.26), statistically attaining the same accuracy as the full features. The study's findings support the efficiency and usefulness of the suggested method for choosing the most important features of diabetic data: polyuria, gender, polydipsia, age, itching, sudden weight loss, delayed healing, and alopecia.
38

Kshatriya, T. Tarun, R. Kumaraperumal, D. Muthumanickam, S. Pazhanivelan, K. P. Ragunath, and M. Nivas Raj. "Identifying Prominent Environmental Covariates Using Variable Selection Methodologies for Digital Soil Mapping of Tamil Nadu, India." International Journal of Environment and Climate Change 13, no. 9 (July 29, 2023): 2358–76. http://dx.doi.org/10.9734/ijecc/2023/v13i92469.

Abstract:
High dimensional datasets that depict intricate spatial variations are necessary to predict complex landscape structures and the corresponding soil properties taking into account the size of the research region in addition to the data attributes. The number and quality of the input datasets taken into consideration essentially determine the quantity and quality of the soil properties that may be predicted thanks to data-driven learning algorithms. The use of variable selection strategies both before and after the prediction can have a significant impact on the outcome and can lower the related computing load. The majority of commonly used variable selection techniques such as correlation analysis, stepwise regression and recursive feature elimination, among others perform recursive statistical/mathematical comparison to identify the significant covariates that improve the effectiveness of the algorithm proposed. In order to identify the effective environmental variables in predicting the soil attribute, this article investigated a widely used recursive ranking method called recursive feature elimination. The covariate layer that produced the lowest RMSE was placed first according to the rankings of the covariates provided by recursive feature elimination. The findings showed that among other factors physiography, mean rainfall, rock outcrop difference ratio, elevation and mean temperature will be effective in predicting the soil properties required for digital soil mapping.
39

Djemili, Rafik. "Analysis of statistical coefficients and autoregressive parameters over intrinsic mode functions (IMFs) for epileptic seizure detection." Biomedical Engineering / Biomedizinische Technik 65, no. 6 (November 18, 2020): 693–704. http://dx.doi.org/10.1515/bmt-2019-0233.

Abstract:
Epilepsy is a persistent neurological disorder impacting over 50 million people around the world. It is characterized by repeated seizures defined as brief episodes of involuntary movement that might entail the human body. Electroencephalography (EEG) signals are usually used for the detection of epileptic seizures. This paper introduces a new feature extraction method for the classification of seizure and seizure-free EEG time segments. The proposed method relies on the empirical mode decomposition (EMD), statistics and autoregressive (AR) parameters. The EMD method decomposes an EEG time segment into a finite set of intrinsic mode functions (IMFs) from which statistical coefficients and autoregressive parameters are computed. Nevertheless, the calculated features could be of high dimension as the number of IMFs increases; the Student's t-test and the Mann–Whitney U test were thus employed for feature ranking in order to discard less significant features. The obtained features have been used for the classification of seizure and seizure-free EEG signals by the application of a feed-forward multilayer perceptron neural network (MLPNN) classifier. Experimental results obtained on the EEG database provided by the University of Bonn, Germany, demonstrated the effectiveness of the proposed method, whose performance, assessed by the classification accuracy (CA), is compared to other existing performances reported in the literature.
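The Mann–Whitney U test mentioned in this abstract can be illustrated with a small ranking sketch. The paper ranks features by test significance; for brevity, the version below instead uses the U statistic as an effect size (how far U / (n1 * n2) lies from 0.5), which is an assumption of this example. The feature names and values are hypothetical.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: number of (a, b) pairs with a > b, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def separation(a, b):
    """Rescale U to [0, 1]: 0 means no separation, 1 means complete separation."""
    u = mann_whitney_u(a, b)
    return abs(u / (len(a) * len(b)) - 0.5) * 2

# Hypothetical feature values for seizure and seizure-free segments.
features = {
    "imf1_variance": ([5, 6, 7], [1, 2, 3]),
    "imf2_skewness": ([1, 4, 5], [2, 3, 6]),
}
scores = {name: separation(a, b) for name, (a, b) in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Unlike the t-test, the U statistic depends only on the ordering of the values, so it is less sensitive to outliers and to departures from normality.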
40

Wiliński, A., and S. Osowski. "Ensemble of data mining methods for gene ranking." Bulletin of the Polish Academy of Sciences: Technical Sciences 60, no. 3 (December 1, 2012): 461–70. http://dx.doi.org/10.2478/v10175-012-0058-x.

Abstract:
The paper presents an ensemble of data mining methods for discovering the most important genes and gene sequences generated by gene expression arrays that are responsible for the recognition of a particular type of cancer. The analyzed methods include the correlation of the feature with a class, application of statistical hypotheses, the Fisher measure of discrimination and application of the linear Support Vector Machine for characterization of the discrimination ability of the features. In the first step of ranking we apply each method individually, choosing the genes most often selected in the cross validation of the available data set. In the next step we combine the results of the different selection methods and once again choose the genes most frequently appearing in the selected sets. On the basis of this we form the final ranking of the genes. The most important genes form the input information delivered to the Support Vector Machine (SVM) classifier, responsible for the final recognition of tumor from non-tumor data. Different forms of checking the correctness of the proposed ranking procedure have been applied. The first relies on mapping the distribution of selected genes onto the two-coordinate system formed by the two most important principal components of the PCA transformation and applying cluster quality measures. The other depicts the results in graphical form by presenting the gene expressions as pixel intensities for the available data. The final confirmation of the quality of the proposed ranking method is provided by the classification results for recognition of the cancer cases from the non-cancer (normal) ones, performed using the Gaussian kernel SVM. The results of selection of the most significant genes used by the SVM for recognition of prostate cancer cases from normal cases have confirmed good accuracy. The presented methodology is of potential use for practical application in bioinformatics.
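The combination step described in this abstract (choosing the genes that appear most frequently across the sets selected by different methods) can be sketched as simple frequency voting. This is a hedged illustration rather than the authors' procedure: the gene identifiers, the tie-breaking by name, and the cut-off `k` are assumptions.

```python
from collections import Counter

def vote(selections, k):
    """Return the k items that appear in the most selection sets."""
    counts = Counter(gene for selected in selections for gene in selected)
    # Sort by descending count, then by name so the result is deterministic.
    ordered = sorted(counts, key=lambda gene: (-counts[gene], gene))
    return ordered[:k]

# Hypothetical gene sets chosen by three different ranking methods.
selections = [
    {"g1", "g2", "g3"},
    {"g2", "g3", "g4"},
    {"g3", "g5", "g2"},
]
top_genes = vote(selections, k=2)
```

Voting of this kind favours genes that several criteria agree on, at the cost of ignoring how highly each individual method ranked them.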
41

Potharlanka, Jhansi Lakshmi, Maruthi Padmaja Turumella, and Radha Krishna P. "A Study on Class Imbalancing Feature Selection and Ensembles on Software Reliability Prediction." International Journal of Open Source Software and Processes 10, no. 4 (October 2019): 20–43. http://dx.doi.org/10.4018/ijossp.2019100102.

Abstract:
Software quality can be improved by early software defect prediction models. However, class imbalance due to under-representation of defects and the irrelevant metrics used to predict them are two major challenges that hinder model performance. This article presents a new two-stage framework, Ensemble of Hybrid Feature selection (EHF) with Weighted Support Vector Machine Boosting (WSVMBoost), which further enhances model performance. The EHF is an ensemble feature ranking of feature selection models, such as filters and embedded models, used to select the relevant metrics. The classification ensembles, namely Random Forest, RUSBoost, and WSVMBoost, and the base learners, namely Decision Tree and SVM, are also explored in this study using five software reliability datasets. According to the statistical tests, EHF with WSVMBoost attained the best mean rank in terms of performance among the feature selection hybrids in predicting software defects. Additionally, this study has shown that both McCabe and Halstead method-level metrics are equally important in improving model performance.
42

Fushing, Hsieh, Michael P. McAssey, and Brenda McCowan. "Computing a ranking network with confidence bounds from a graph-based Beta random field." Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 467, no. 2136 (August 3, 2011): 3590–612. http://dx.doi.org/10.1098/rspa.2011.0268.

Abstract:
We address two largely overlooked, fundamental issues in computing a ranking hierarchy within a society: which information in the network is relevant, and what effect chance has on the hierarchy. To properly account for uncertainty from limited data, we construct a random field in a matrix form having entry-wise posterior Beta distributions based on a graph of pairwise conflict outcomes. To evaluate relevant network information using information transitivity, another random matrix of synthesized transitive dominance odds is computed collectively along observed dominance paths. These two matrices are coupled together to fuse both direct and indirect dominance information. An ensemble of realizations of this fused random matrix facilitates an ensemble of optimal ranking networks by means of simulated annealing. Conditional statistical inferences regarding network features are derived, manifesting the effect of uncertainty. Our computational approach is suitable for large graphs of pairwise conflict outcomes, and can accommodate tremendous data heterogeneity—a typical feature in such studies. We also demonstrate the infeasibility of the classical maximum-likelihood approach, and expose the mechanistic flaws that stem from completely ignoring relevant information residing in the graph. We analyse two real datasets of decisive conflict outcomes, the first involving college football teams, and the second involving an adult rhesus macaque society in captivity.
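To make the idea of an entry-wise posterior Beta distribution concrete, the following sketch computes the posterior for a single pair of individuals from their win and loss counts. The uniform Beta(1, 1) prior, the function name `beta_posterior` and the example counts are assumptions; the paper's full construction (transitive dominance odds, matrix fusion, simulated annealing) is not reproduced here.

```python
def beta_posterior(wins, losses, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) for the probability that i dominates j.

    Assumes a Beta(prior_a, prior_b) prior (uniform by default) and
    independent binary conflict outcomes between the pair.
    """
    a = prior_a + wins
    b = prior_b + losses
    post_mean = a / (a + b)
    post_var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return a, b, post_mean, post_var


# Hypothetical pair: i beat j three times and lost once.
a, b, post_mean, post_var = beta_posterior(wins=3, losses=1)
```

With few observed conflicts the posterior variance stays large, which is one simple way to see how limited data translates into uncertainty about the hierarchy.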
43

Shi, Di, Gunnar Gidion, Leonhard M. Reindl, and Stefan J. Rupitsch. "Automatic Life Detection Based on Efficient Features of Ground-Penetrating Rescue Radar Signals." Sensors 23, no. 15 (July 28, 2023): 6771. http://dx.doi.org/10.3390/s23156771.

Abstract:
Good feature engineering is a prerequisite for accurate classification, especially in challenging scenarios such as detecting the breathing of living persons trapped under building rubble using bioradar. Unlike monitoring patients’ breathing through the air, the measuring conditions of a rescue bioradar are very complex. The ultimate goal of search and rescue is to determine the presence of a living person, which requires extracting representative features that can distinguish measurements with the presence of a person and without. To address this challenge, we conducted a bioradar test scenario under laboratory conditions and decomposed the radar signal into different range intervals to derive multiple virtual scenes from the real one. We then extracted physical and statistical quantitative features that represent a measurement, aiming to find those features that are robust to the complexity of rescue-radar measuring conditions, including different rubble sites, breathing rates, signal strengths, and short-duration disturbances. To this end, we utilized two methods, Analysis of Variance (ANOVA), and Minimum Redundancy Maximum Relevance (MRMR), to analyze the significance of the extracted features. We then trained the classification model using a linear kernel support vector machine (SVM). As the main result of this work, we identified an optimal feature set of four features based on the feature ranking and the improvement in the classification accuracy of the SVM model. These four features are related to four different physical quantities and independent from different rubble sites.
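The ANOVA step mentioned in this abstract can be sketched as a one-way F-statistic computed per feature, with one group of values per class (for example, measurements with and without a person present). The function name `anova_f` and the sample values are hypothetical, and MRMR is not shown.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a single feature.

    `groups` holds one list of feature values per class. A larger F means
    the class means differ more relative to the within-class scatter.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ssw == 0:
        # Degenerate case: no within-class variation.
        return float("inf") if ssb > 0 else 0.0
    return (ssb / (k - 1)) / (ssw / (n - k))


# Hypothetical feature values for two classes.
f_separated = anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
f_identical = anova_f([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
```

Ranking features by this statistic in descending order gives a simple univariate ranking; it does not account for redundancy between features, which is the gap that MRMR is meant to address.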
44

Carafini, Adriano, Isabel C. N. Sacco, and Marcus Fraga Vieira. "Pelvic floor pressure distribution profile in urinary incontinence: a classification study with feature selection." PeerJ 7 (December 9, 2019): e8207. http://dx.doi.org/10.7717/peerj.8207.

Abstract:
Background: Pelvic floor pressure distribution profiles, obtained by a novel instrumented non-deformable probe, were used as the input to a feature extraction, selection, and classification approach to test their potential for an automatic diagnostic system for objective female urinary incontinence assessment. We tested the performance of different feature selection approaches and different classifiers, and sought to establish the group of features that provides the greatest discrimination capability between continent and incontinent women. Methods: The available data for evaluation consisted of intravaginal spatiotemporal pressure profiles acquired from 24 continent and 24 incontinent women while performing four pelvic floor maneuvers: the maximum contraction maneuver, Valsalva maneuver, endurance maneuver, and wave maneuver. Feature extraction was guided by previous studies on the characterization of pressure profiles in the vaginal canal, where the extracted features were tested concerning their repeatability. Feature selection was achieved through a combination of a ranking method and a complete non-exhaustive subset search algorithm: branch and bound and recursive feature elimination. Three classifiers were tested: k-nearest neighbors (k-NN), support vector machine, and logistic regression. Results: None of the classifiers employed outperformed the others; however, k-NN presented statistical inferiority in one of the maneuvers. The best result was obtained through the application of recursive feature elimination on the features extracted from all the maneuvers, resulting in 77.1% test accuracy, 74.1% precision, and 83.3% recall, using SVM. Moreover, the best feature subset, obtained by observing the selection frequency of every single feature during the application of branch and bound, was directly employed in the classification, thus reaching 95.8% accuracy.
Although not at the level required by an automatic system, the results show the potential use of pelvic floor pressure distribution profiles data and provide insights into the pelvic floor functioning aspects that contribute to urinary incontinence.
45

Mularska-Kucharek, Monika, and Kamil Brzeziński. "The Economic Dimension of Social Trust." European Spatial Research and Policy 23, no. 2 (June 8, 2017): 83–95. http://dx.doi.org/10.1515/esrp-2016-0012.

Abstract:
Social trust is increasingly seen as a non-economic determinant of economic development. Its positive impact on the economic sphere of social life, proven by numerous studies, is an incentive for new research initiatives examining the level of social trust, since the results may be vital for local policy-making. The main aim of the article is to study the relationship between social trust and economic development. To accomplish this goal, a social trust indicator and an economic ranking list for the researched units were created. The statistical analyses performed demonstrated a statistically significant correlation between the examined phenomena and showed that the highest developmental level is a characteristic feature of the districts with a high level of social trust. This confirms the claims of Polish and international scholars who see trust as a non-economic determinant of economic development.
46

Baek, Seung Hyun, Alberto Garcia-Diaz, and Yuanshun Dai. "Multi-choice wavelet thresholding based binary classification method." Methodology 16, no. 2 (June 18, 2020): 127–46. http://dx.doi.org/10.5964/meth.2787.

Abstract:
Data mining is one of the most effective statistical methodologies to investigate a variety of problems in areas including pattern recognition, machine learning, bioinformatics, chemometrics, and statistics. In particular, statistically sophisticated procedures that emphasize reliability of results and computational efficiency are required for the analysis of high-dimensional data. Optimization principles can play a significant role in the rationalization and validation of specialized data mining procedures. This paper presents a novel three-step methodology based on Multi-Choice Wavelet Thresholding (MCWT), consisting of three processes: perception (dimension reduction), decision (feature ranking), and cognition (model selection). In these steps, three concepts, namely wavelet thresholding, support vector machines for classification, and information complexity, are integrated to evaluate learning models. Three published data sets are used to illustrate the proposed methodology. Additionally, performance comparisons with recent and widely applied methods are shown.
47

Mairesse, F., M. A. Walker, M. R. Mehl, and R. K. Moore. "Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text." Journal of Artificial Intelligence Research 30 (November 28, 2007): 457–500. http://dx.doi.org/10.1613/jair.2349.

Abstract:
It is well known that utterances convey a great deal of information about the speaker in addition to their semantic content. One such type of information consists of cues to the speaker's personality traits, the most fundamental dimension of variation between humans. Recent work explores the automatic detection of other types of pragmatic variation in text and conversation, such as emotion, deception, speaker charisma, dominance, point of view, subjectivity, opinion and sentiment. Personality affects these other aspects of linguistic production, and thus personality recognition may be useful for these tasks, in addition to many other potential applications. However, to date, there is little work on the automatic recognition of personality traits. This article reports experimental results for recognition of all Big Five personality traits, in both conversation and text, utilising both self and observer ratings of personality. While other work reports classification results, we experiment with classification, regression and ranking models. For each model, we analyse the effect of different feature sets on accuracy. Results show that for some traits, any type of statistical model performs significantly better than the baseline, but ranking models perform best overall. We also present an experiment suggesting that ranking models are more accurate than multi-class classifiers for modelling personality. In addition, recognition models trained on observed personality perform better than models trained using self-reports, and the optimal feature set depends on the personality trait. A qualitative analysis of the learned models confirms previous findings linking language and personality, while revealing many new linguistic markers.
48

Sindhu, R., Ruzelita Ngadiran, Yasmin Mohd Yacob, Nik Adilah Hanin Zahri, M. Hariharan, and Kemal Polat. "A Hybrid SCA Inspired BBO for Feature Selection Problems." Mathematical Problems in Engineering 2019 (April 2, 2019): 1–18. http://dx.doi.org/10.1155/2019/9517568.

Abstract:
A recent research trend is to hybridize two or more metaheuristic algorithms to obtain superior solutions to optimization problems. This paper proposes a newly developed wrapper-based feature selection method based on the hybridization of Biogeography Based Optimization (BBO) and the Sine Cosine Algorithm (SCA) for handling feature selection problems. The position update mechanism of SCA is introduced into the BBO algorithm to enhance the diversity among the habitats. In BBO, the mutation operator is removed and replaced by the position update mechanism of SCA, applied after the migration operator, to enhance the global search ability of basic BBO. This mechanism tends to produce highly fit solutions in the upcoming iterations, which results in improved diversity of habitats. The performance of this Improved BBO (IBBO) algorithm is investigated using fourteen benchmark datasets. Experimental results of IBBO are compared with eight other search algorithms. The results show that IBBO is able to outperform the other algorithms on the majority of the datasets. Furthermore, the strength of IBBO is demonstrated through various numerical experiments such as statistical analysis, convergence curves, ranking methods, and test functions. The results of the simulation have revealed that IBBO has produced very competitive and promising results compared to the other search algorithms.
49

Sysyn, Mykola, Ulf Gerber, Olga Nabochenko, Yangyang Li, and Vitalii Kovalchuk. "INDICATORS FOR COMMON CROSSING STRUCTURAL HEALTH MONITORING WITH TRACK-SIDE INERTIAL MEASUREMENTS." Acta Polytechnica 59, no. 2 (April 30, 2019): 170–81. http://dx.doi.org/10.14311/ap.2019.59.0170.

Abstract:
This paper focuses on the experimental study of an alteration in the railway crossing dynamic response due to rolling surface degradation during a crossing’s lifecycle. The maximal acceleration measured with the track-side measurement system, as well as the impact position monitoring, show no significant statistical relation to the rolling surface degradation. Additional spectral features are extracted from the acceleration measurements with a wavelet transform to improve the information usage. Reliable prediction of the railway crossing remaining useful life (RUL) demands trustworthy indicators of structural health that change systematically during the lifecycle. Popular simple machine learning methods such as principal component analysis and partial least squares regression are used to retrieve two indicators from the experimental information. Feature ranking and selection are used to remove redundant information and increase the relation of the indicators to the lifetime.
50

Kovács, Melinda, Ferenc Lilik, and Szilvia Nagy. "On Selecting, Ranking, and Quantifying Features for Building a Liver CT Diagnosis Aiding Computational Intelligence Method." Applied Sciences 13, no. 6 (March 8, 2023): 3462. http://dx.doi.org/10.3390/app13063462.

Abstract:
The liver is one of the most common locations for incidental findings during abdominal CT scans. There are multiple types of disease that can arise within the liver and many of them are nodular. The ultimate goal of our research is to develop an expert knowledge-based system using fuzzy signatures, to support decisions during diagnosis of the most frequent of these nodular lesions. Since the literature contains limited information about the graphical properties of CT images that must be taken into consideration and their relationship to one another, in this paper we focused on selecting and ranking the input parameters using expert knowledge and determining their importance. Six visual attributes of lesions (size, shape, density, homogeneity, contour, and other features) were selected based on textbooks of radiology and expert opinion. The importance of these attributes was ranked by radiologist experts using questionnaires and a pairwise comparison technique. The most important feature was found to be the density of the lesion on the various CT phases, and the least important was the size; the order of the remaining attributes was other features, contour, homogeneity, and shape, with a Kendall concordance coefficient of 0.612. Weights for the attributes, to be used in the future fuzzy signatures, were also determined. As a last step, several statistical parameter-based quantities were generated to represent the above abstract attributes and evaluated by comparing them to expert opinions.
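For readers unfamiliar with the Kendall concordance coefficient quoted in this abstract, the sketch below computes Kendall's W for a set of complete rankings without ties, using W = 12 S / (m^2 (n^3 - n)) for m raters and n items. The function name `kendalls_w` and the example rankings are hypothetical and are not the study's questionnaire data.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W for untied rankings.

    `rankings` is a list with one entry per rater; each entry lists the
    rank (1..n) that rater gave to each of the n items, in item order.
    W is 1 for perfect agreement and 0 when rank totals are all equal.
    """
    m = len(rankings)          # number of raters
    n = len(rankings[0])       # number of items ranked
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings of three attributes by two raters.
w_agree = kendalls_w([[1, 2, 3], [1, 2, 3]])
w_oppose = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

A value such as the reported 0.612 therefore sits between these extremes, indicating moderate to substantial agreement among the raters; tied ranks would need a correction term that this sketch omits.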