Journal articles on the topic 'Protein Representation Learning'

Consult the top 50 journal articles for your research on the topic 'Protein Representation Learning.'

1

Kim, Paul T., Robin Winter, and Djork-Arné Clevert. "Unsupervised Representation Learning for Proteochemometric Modeling." International Journal of Molecular Sciences 22, no. 23 (November 28, 2021): 12882. http://dx.doi.org/10.3390/ijms222312882.

Abstract:
In silico protein–ligand binding prediction is an ongoing area of research in computational chemistry and machine learning based drug discovery, as an accurate predictive model could greatly reduce the time and resources necessary for the detection and prioritization of possible drug candidates. Proteochemometric modeling (PCM) attempts to create an accurate model of the protein–ligand interaction space by combining explicit protein and ligand descriptors. This requires the creation of information-rich, uniform and computer interpretable representations of proteins and ligands. Previous studies in PCM modeling rely on pre-defined, handcrafted feature extraction methods, and many methods use protein descriptors that require alignment or are otherwise specific to a particular group of related proteins. However, recent advances in representation learning have shown that unsupervised machine learning can be used to generate embeddings that outperform complex, human-engineered representations. Several different embedding methods for proteins and molecules have been developed based on various language-modeling methods. Here, we demonstrate the utility of these unsupervised representations and compare three protein embeddings and two compound embeddings in a fair manner. We evaluate performance on various splits of a benchmark dataset, as well as on an internal dataset of protein–ligand binding activities and find that unsupervised-learned representations significantly outperform handcrafted representations.
2

Heinzinger, Michael, Christian Dallago, and Burkhard Rost. "Protein matchmaking through representation learning." Cell Systems 12, no. 10 (October 2021): 948–50. http://dx.doi.org/10.1016/j.cels.2021.09.007.

3

Fasoulis, Romanos, Georgios Paliouras, and Lydia E. Kavraki. "Graph representation learning for structural proteomics." Emerging Topics in Life Sciences 5, no. 6 (October 19, 2021): 789–802. http://dx.doi.org/10.1042/etls20210225.

Abstract:
The field of structural proteomics, which is focused on studying the structure–function relationship of proteins and protein complexes, is experiencing rapid growth. Since the early 2000s, structural databases such as the Protein Data Bank have stored increasing amounts of protein structural data, and modeled structures have become increasingly available. This, combined with the recent advances in graph-based machine-learning models, enables the use of protein structural data in predictive models, with the goal of creating tools that will advance our understanding of protein function. As with the application of graph learning tools to molecular graphs, which is currently undergoing rapid development, there is an increasing trend toward using graph learning approaches on protein structures. In this short review paper, we survey studies that use graph learning techniques on proteins and examine their successes and shortcomings, while also discussing future directions.
4

Rives, Alexander, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, et al. "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences." Proceedings of the National Academy of Sciences 118, no. 15 (April 5, 2021): e2016239118. http://dx.doi.org/10.1073/pnas.2016239118.

Abstract:
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
5

Warikoo, Neha, Yung-Chun Chang, and Shang-Pin Ma. "Gradient Boosting over Linguistic-Pattern-Structured Trees for Learning Protein–Protein Interaction in the Biomedical Literature." Applied Sciences 12, no. 20 (October 11, 2022): 10199. http://dx.doi.org/10.3390/app122010199.

Abstract:
Protein-based studies contribute significantly to gathering functional information about biological systems; therefore, the protein–protein interaction detection task is one of the most researched topics in the biomedical literature. To this end, many state-of-the-art systems using syntactic tree kernels (TK) and deep learning have been developed. However, these models are computationally complex and have limited learning interpretability. In this paper, we introduce a linguistic-pattern-representation-based Gradient-Tree Boosting model, i.e., LpGBoost. It uses linguistic patterns to optimize and generate semantically relevant representation vectors for learning over the gradient-tree boosting. The patterns are learned via unsupervised modeling by clustering invariant semantic features. These linguistic representations are semi-interpretable with rich semantic knowledge, and owing to their shallow representation, they are also computationally less expensive. Our experiments with six protein–protein interaction (PPI) corpora demonstrate that LpGBoost outperforms the SOTA tree-kernel models, as well as the CNN-based interaction detection studies for BioInfer and AIMed corpora.
6

Chornozhuk, S. "The New Geometric “State-Action” Space Representation for Q-Learning Algorithm for Protein Structure Folding Problem." Cybernetics and Computer Technologies, no. 3 (October 27, 2020): 59–73. http://dx.doi.org/10.34229/2707-451x.20.3.6.

Abstract:
Introduction. Spatial protein structure folding is an important and topical problem in computational biology. Given the mathematical model of the task, finding an optimal protein conformation in a three-dimensional grid is an NP-hard problem. Therefore, reinforcement learning techniques such as Q-learning can be used to solve the problem. The article proposes a new geometric “state-action” space representation that differs significantly from all alternative representations used for this problem. The purpose of the article is to analyze existing state and action space representations for the Q-learning algorithm applied to the protein structure folding problem, reveal their advantages and disadvantages, and propose the new geometric “state-space” representation. The existing and proposed approaches are then compared, conclusions are drawn, and possible steps for further research are described. Result. The proposed algorithm is compared with others on 10 known chains of length 48, first proposed in [16]. For each chain, the Q-learning algorithm with the proposed “state-space” representation outperformed the same Q-learning algorithm with the existing alternative “state-space” representations in terms of both the average and the minimal energy of the resulting conformations. Moreover, many existing representations are used for 2D protein structure prediction; during the experiments, both the existing and the proposed representations were slightly changed or extended to solve the problem in 3D, which is a more computationally demanding task. Conclusion. The quality of the Q-learning algorithm with the proposed geometric “state-action” space representation has been experimentally confirmed, which indicates that further research is promising. Several possible directions for future research, such as combining the proposed approach with deep learning techniques, have already been suggested. Keywords: spatial protein structure, combinatorial optimization, relative coding, machine learning, Q-learning, Bellman equation, state space, action space, basis in 3D space.
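For orientation, the textbook tabular Q-learning update that this abstract builds on can be sketched as follows. This is a generic illustration, not the article's geometric representation: the state and action labels, the reward, and the function name `q_update` are all hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the new Q(state, action).

    q maps (state, action) pairs to values; alpha is the learning rate and
    gamma the discount factor (both values here are arbitrary assumptions).
    """
    # Best value reachable from next_state; 0.0 when no actions remain.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    # Bellman target: immediate reward plus discounted future value.
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical step on a lattice: placing the next residue to the "left".
q = defaultdict(float)
new_value = q_update(q, state="s0", action="left", reward=1.0,
                     next_state="s1", next_actions=["left", "right"])
```

In a folding setting the reward would typically be derived from the energy of the resulting conformation, but how states and actions are encoded is exactly what the article varies, so that part is left abstract here.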
7

Yao, Yu, Xiuquan Du, Yanyu Diao, and Huaixu Zhu. "An integration of deep learning with feature embedding for protein–protein interaction prediction." PeerJ 7 (June 17, 2019): e7126. http://dx.doi.org/10.7717/peerj.7126.

Abstract:
Protein–protein interactions are closely related to protein function and drug discovery. Hence, accurately identifying protein–protein interactions will help us to understand the underlying molecular mechanisms and significantly facilitate drug discovery. However, the majority of existing computational methods for protein–protein interaction prediction focus on feature extraction and the combination of features, and there have been limited gains from the state-of-the-art models. In this work, a new residue representation method named Res2vec is designed for protein sequence representation. Residue representations obtained by Res2vec describe residue–residue interactions from the raw sequence more precisely and supply more effective inputs for the downstream deep learning model. Combining effective feature embedding with powerful deep learning techniques, our method provides a general computational pipeline to infer protein–protein interactions, even when protein structure knowledge is entirely unknown. The proposed method, DeepFE-PPI, is evaluated on the S. cerevisiae and human datasets. The experimental results show that DeepFE-PPI achieves 94.78% (accuracy), 92.99% (recall), 96.45% (precision), 89.62% (Matthews correlation coefficient, MCC) on the S. cerevisiae dataset and 98.71% (accuracy), 98.54% (recall), 98.77% (precision), 97.43% (MCC) on the human dataset. In addition, we evaluate the performance of DeepFE-PPI on five independent species datasets, and all the results are superior to those of the existing methods. The comparisons show that DeepFE-PPI is capable of predicting protein–protein interactions with a novel residue representation method and a deep learning classification framework at an acceptable level of accuracy. The code, along with instructions to reproduce this work, is available from https://github.com/xal2019/DeepFE-PPI.
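The four metrics reported in this abstract have standard definitions over a binary confusion matrix. The sketch below computes them from hypothetical counts; the numbers are invented for illustration and are not the article's data, and `binary_metrics` is an assumed helper name.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, recall, precision and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Matthews correlation coefficient; defined as 0.0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "mcc": mcc}


# Hypothetical counts: 100 interacting and 100 non-interacting pairs.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a useful companion metric when the two classes are imbalanced.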
8

Garruss, Alexander S., Katherine M. Collins, and George M. Church. "Deep representation learning improves prediction of LacI-mediated transcriptional repression." Proceedings of the National Academy of Sciences 118, no. 27 (June 29, 2021): e2022838118. http://dx.doi.org/10.1073/pnas.2022838118.

Abstract:
Recent progress in DNA synthesis and sequencing technology has enabled systematic studies of protein function at a massive scale. We explore a deep mutational scanning study that measured the transcriptional repression function of 43,669 variants of the Escherichia coli LacI protein. We analyze structural and evolutionary aspects that relate to how the function of this protein is maintained, including an in-depth look at the C-terminal domain. We develop a deep neural network to predict transcriptional repression mediated by the lac repressor of Escherichia coli using experimental measurements of variant function. When measured across 10 separate training and validation splits using 5,009 single mutations of the lac repressor, our best-performing model achieved a median Pearson correlation of 0.79, exceeding any previous model. We demonstrate that deep representation learning approaches, first trained in an unsupervised manner across millions of diverse proteins, can be fine-tuned in a supervised fashion using lac repressor experimental datasets to more effectively predict a variant’s effect on repression. These findings suggest a deep representation learning model may improve the prediction of other important properties of proteins.
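The evaluation protocol mentioned in this abstract, a median Pearson correlation over several training and validation splits, can be sketched in a few lines. The measured and predicted values below are made up for illustration, and the function names are assumptions rather than anything taken from the article.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def median_pearson(splits):
    """Median correlation over (measured, predicted) pairs, one per split."""
    return statistics.median(pearson(m, p) for m, p in splits)


# Three hypothetical validation splits of (measured, predicted) values.
splits = [
    ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
    ([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]),
]
score = median_pearson(splits)
```

Reporting the median rather than the mean keeps a single unusually easy or hard split from dominating the summary.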
9

Rahman, Julia, Nazrul Islam Mondal, Khaled Ben Islam, and Al Mehedi Hasan. "Feature Fusion Based SVM Classifier for Protein Subcellular Localization Prediction." Journal of Integrative Bioinformatics 13, no. 1 (March 1, 2016): 23–33. http://dx.doi.org/10.1515/jib-2016-288.

Abstract:
Because of the importance of protein subcellular localization in different branches of life science and drug discovery, researchers have focused their attention on protein subcellular localization prediction. Effective representation of features from protein sequences plays a vital role in protein subcellular localization prediction, especially for machine learning techniques. A single feature representation such as pseudo amino acid composition (PseAAC), physiochemical property model (PPM), or amino acid index distribution (AAID) contains insufficient information from protein sequences. To deal with this problem, we propose two feature fusion representations, AAIDPAAC and PPMPAAC, for use with a Support Vector Machine classifier; they fuse PseAAC with AAID and PPM, respectively. We evaluated the performance of both single and fused feature representations on a Gram-negative bacterial dataset. AAIDPAAC achieved at least 3% higher actual accuracy, and PPMPAAC 2% higher locative accuracy, than the single feature representations.
10

Jin, Chen, Zhuangwei Shi, Chuanze Kang, Ken Lin, and Han Zhang. "TLCrys: Transfer Learning Based Method for Protein Crystallization Prediction." International Journal of Molecular Sciences 23, no. 2 (January 16, 2022): 972. http://dx.doi.org/10.3390/ijms23020972.

Abstract:
The X-ray diffraction technique is one of the most common methods of ascertaining protein structures, yet only 2–10% of proteins can produce diffraction-quality crystals. Several computational methods have been proposed so far to predict protein crystallization. Nevertheless, the current state-of-the-art computational methods are limited by the scarcity of experimental data. Thus, the prediction accuracy of existing models has not reached the ideal level. To address the problems above, we propose a novel transfer-learning-based framework for protein crystallization prediction, named TLCrys. The framework proceeds in two steps: pre-training and fine-tuning. The pre-training step adopts an attention mechanism to extract both global and local information from the protein sequences. The representation learned in the pre-training step is regarded as knowledge to be transferred and fine-tuned to enhance the performance of crystallization prediction. During pre-training, TLCrys adopts a multi-task learning method, which not only improves the learning ability of protein encoding, but also enhances the robustness and generalization of the protein representation. The multi-head self-attention layer guarantees that different levels of the protein representation can be extracted by the fine-tuning step. During transfer learning, the fine-tuning strategy used by TLCrys improves the task-specialized learning ability of the network. Our method significantly outperforms all previous predictors in five crystallization stages of prediction. Furthermore, the proposed methodology generalizes well to other protein sequence classification tasks.
11

Löchel, Hannah F., Dominic Eger, Theodor Sperlea, and Dominik Heider. "Deep learning on chaos game representation for proteins." Bioinformatics 36, no. 1 (June 21, 2019): 272–79. http://dx.doi.org/10.1093/bioinformatics/btz493.

Abstract:
Motivation: Classification of protein sequences is one big task in bioinformatics and has many applications. Different machine learning methods exist and are applied on these problems, such as support vector machines (SVM), random forests (RF) and neural networks (NN). All of these methods have in common that protein sequences have to be made machine-readable and comparable in the first step, for which different encodings exist. These encodings are typically based on physical or chemical properties of the sequence. However, due to the outstanding performance of deep neural networks (DNN) on image recognition, we used frequency matrix chaos game representation (FCGR) for encoding of protein sequences into images. In this study, we compare the performance of SVMs, RFs and DNNs, trained on FCGR encoded protein sequences. While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, we modified it to work also for protein sequences, resulting in n-flakes representation, an image with several icosagons. Results: We could show that all applied machine learning techniques (RF, SVM and DNN) show promising results compared to the state-of-the-art methods on our benchmark datasets, with DNNs outperforming the other methods and that FCGR is a promising new encoding method for protein sequences. Availability and implementation: https://cran.r-project.org/. Supplementary information: Supplementary data are available at Bioinformatics online.
12

Li, Yan, Yu-Ren Zhang, Ping Zhang, Dong-Xu Li, and Tian-Long Xiao. "Protein–Protein Interactions Prediction Base on Multiple Information Fusion via Graph Representation Learning." Journal of Biomaterials and Tissue Engineering 12, no. 4 (April 1, 2022): 807–12. http://dx.doi.org/10.1166/jbt.2022.2953.

Abstract:
Protein–protein interactions (PPIs) have a critical impact on biological processes in cells. Traditional experimental methods for detecting PPIs are costly in labor, materials, and time. Therefore, there is a great need for computational methods to predict PPIs. Most existing computational methods are based on the sequence characteristics or internal structural characteristics of proteins, and most rely on a single type of feature. We therefore propose a novel method to predict PPIs based on multiple information fusion through graph representation learning. First, the properties of each protein are computed from the known protein sequences using k-mers. Then, the known protein relationship pairs are assembled into an adjacency graph, and a graph representation learning method, the graph convolutional network, is used to fuse the attributes of each protein with the graph structure information to obtain features containing several kinds of information. Finally, the multi-information features are fed into a random forest classifier for prediction and classification. Experimental results indicate that our method achieves an accuracy of 78.83% and an AUC of 86.10%. In conclusion, our method has excellent application prospects for predicting unknown PPIs.
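The first step of this pipeline, deriving per-protein attributes from k-mers, might look roughly like the sketch below. It covers only k-mer frequency counting; the graph convolution and random forest stages are not shown, and the choice of k, the normalization, and the name `kmer_features` are assumptions rather than details from the article.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_features(sequence, k=2):
    """Return a normalized k-mer frequency vector of length 20**k.

    Windows containing non-standard residues are ignored, so the vector
    sums to 1 whenever at least one valid k-mer is present.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Hypothetical toy sequence: its 2-mers are AC, CA, AC.
features = kmer_features("ACAC", k=2)
```

Each protein then maps to a fixed-length vector regardless of sequence length, which is what allows it to serve as a node attribute in a graph model.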
13

Orasch, Oliver, Noah Weber, Michael Müller, Amir Amanzadi, Chiara Gasbarri, and Christopher Trummer. "Protein–Protein Interaction Prediction for Targeted Protein Degradation." International Journal of Molecular Sciences 23, no. 13 (June 24, 2022): 7033. http://dx.doi.org/10.3390/ijms23137033.

Abstract:
Protein–protein interactions (PPIs) play a fundamental role in various biological functions; thus, detecting PPI sites is essential for understanding diseases and developing new drugs. PPI prediction is of particular relevance for the development of drugs employing targeted protein degradation, as their efficacy relies on the formation of a stable ternary complex involving two proteins. However, experimental methods to detect PPI sites are both costly and time-intensive. In recent years, machine learning-based methods have been developed as screening tools. While they are computationally more efficient than traditional docking methods and thus allow rapid execution, these tools have so far primarily been based on sequence information, and they are therefore limited in their ability to address spatial requirements. In addition, they have to date not been applied to targeted protein degradation. Here, we present a new deep learning architecture based on the concept of graph representation learning that can predict interaction sites and interactions of proteins based on their surface representations. We demonstrate that our model reaches state-of-the-art performance using AUROC scores on the established MaSIF dataset. We furthermore introduce a new dataset with more diverse protein interactions and show that our model generalizes well to this new data. These generalization capabilities allow our model to predict the PPIs relevant for targeted protein degradation, which we show by demonstrating the high accuracy of our model for PPI prediction on the available ternary complex data. Our results suggest that PPI prediction models can be a valuable tool for screening protein pairs while developing new drugs for targeted protein degradation.
14

Kabir, Anowarul, and Amarda Shehu. "GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction." Biomolecules 12, no. 11 (November 18, 2022): 1709. http://dx.doi.org/10.3390/biom12111709.

Abstract:
Protein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, from subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feature space. A graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO terms representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown superior over recent representative GO prediction methods. The second major contribution in this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state-of-the-art.
15

Cruz-Barbosa, Raúl, Erik-German Ramos-Pérez, and Jesús Giraldo. "Representation Learning for Class C G Protein-Coupled Receptors Classification." Molecules 23, no. 3 (March 19, 2018): 690. http://dx.doi.org/10.3390/molecules23030690.

16

Alley, Ethan C., Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M. Church. "Unified rational protein engineering with sequence-based deep representation learning." Nature Methods 16, no. 12 (October 21, 2019): 1315–22. http://dx.doi.org/10.1038/s41592-019-0598-1.

17

Díaz-Eufracio, Bárbara I., and José L. Medina-Franco. "Machine Learning Models to Predict Protein–Protein Interaction Inhibitors." Molecules 27, no. 22 (November 17, 2022): 7986. http://dx.doi.org/10.3390/molecules27227986.

Abstract:
Protein–protein interaction (PPI) inhibitors have an increasing role in drug discovery. It is hypothesized that machine learning (ML) algorithms can classify or identify PPI inhibitors. This work describes the performance of different algorithms and molecular fingerprints used in chemoinformatics to develop a classification model to identify PPI inhibitors, and makes the code freely available to the community, particularly the medicinal chemistry research groups working with PPI inhibitors. We found that classification algorithms perform differently depending on the features employed in the training process. Random forest (RF) models with the extended connectivity fingerprint of radius 2 (ECFP4) had the best classification abilities compared to models trained with ECFP6 or MACCS keys (166 bits). In general, logistic regression (LR) models had lower performance metrics than RF models, but ECFP4 was the most appropriate representation for LR. ECFP4 also generated models with high performance metrics with support vector machines (SVM). We also constructed ensemble models based on the top-performing models. As part of this work, and to help non-computational experts, we developed a freely available pipeline code.
18

Bussey, Thomas J., and MaryKay Orgill. "Biochemistry instructors’ use of intentions for student learning to evaluate and select external representations of protein translation." Chemistry Education Research and Practice 20, no. 4 (2019): 787–803. http://dx.doi.org/10.1039/c9rp00025a.

Abstract:
Instructors draw on their intentions for student learning in the enactment of curriculum, particularly in the selection and presentation of external representations of scientific phenomena. These representations both create opportunities for students to experience non-experiential biochemical phenomena, such as protein translation, and constrain the possibilities for student learning based on the limited number of features depicted and the visual cues used to draw viewers' attention to those features. In this study, we explore biochemistry instructors’ intentions for student learning about protein translation and how those intentions influence their selection of external representations for instruction. A series of instructor interviews was used to identify information that students need to know in order to develop a biochemically accurate understanding of protein translation. We refer to this information as the “critical features” of protein translation. Two dominant themes of critical features were identified: (1) components/structures of protein translation and (2) interactions/chemistry of protein translation. Three general components (the ribosome, the mRNA, and the tRNA) and two primary interactions (base pairing and peptide bond formation) were described by all instructors. Instructors tended to favor simpler, stylized representations that closely aligned with their stated critical features of translation for instructional purposes.
19

Tsubaki, Masashi, Masashi Shimbo, and Yuji Matsumoto. "Protein Fold Recognition with Representation Learning and Long Short-Term Memory." IPSJ Transactions on Bioinformatics 10 (2017): 2–8. http://dx.doi.org/10.2197/ipsjtbio.10.2.

20

Li, Bo, Lijun Cai, Bo Liao, Xiangzheng Fu, Pingping Bing, and Jialiang Yang. "Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features." Molecules 24, no. 5 (March 6, 2019): 919. http://dx.doi.org/10.3390/molecules24050919.

Abstract:
The prediction of protein subcellular localization is critical for inferring protein functions, gene regulation and protein–protein interactions. With the advances of high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which enables us to computationally predict yeast protein subcellular localization. However, widely used protein sequence representation techniques, such as amino acid composition and Chou’s pseudo amino acid composition (PseAAC), have difficulty extracting adequate information about the interactions between residues and the position distribution of each residue. Therefore, novel sequence representations are still urgently needed. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of the residues in the protein primary sequence, and novel statistics and information theory (NSI), reflecting local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, while other popular methods such as PseAAC and dipeptide composition adopt features with hundreds of dimensions. In practice, the feature representation is highly efficient in predicting protein subcellular localization. Even without using machine learning-based classifiers, a simple model based on the feature vector can achieve prediction accuracies of 0.8825 and 0.7736 for the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view features-based method that combines the two above-mentioned features with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization. This novel model achieves prediction accuracies of 0.927 and 0.871 for the CL317 and ZW225 datasets, respectively, better than other existing methods in the jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations with evidence from published articles in authoritative journals and books.
21

Cretin, Gabriel, Tatiana Galochkina, Alexandre G. de Brevern, and Jean-Christophe Gelly. "PYTHIA: Deep Learning Approach for Local Protein Conformation Prediction." International Journal of Molecular Sciences 22, no. 16 (August 17, 2021): 8831. http://dx.doi.org/10.3390/ijms22168831.

Abstract:
Protein Blocks (PBs) are a widely used structural alphabet describing local protein backbone conformation in terms of 16 possible conformational states, adopted by five consecutive amino acids. The representation of complex protein 3D structures as 1D PB sequences has previously been applied successfully to protein structure alignment and protein structure prediction. In the current study, we present a new model, PYTHIA (predicting any conformation at high accuracy), for the prediction of protein local conformations in terms of PBs directly from the amino acid sequence. PYTHIA is based on a deep residual inception-inside-inception neural network with convolutional block attention modules, predicting 1 of 16 PB classes from evolutionary information combined with physicochemical properties of individual amino acids. PYTHIA clearly outperforms the LOCUSTRA reference method for all PB classes and demonstrates strong performance for PB prediction on particularly challenging proteins from the CASP14 free modelling category.
22

Yan, Zichao, William L. Hamilton, and Mathieu Blanchette. "Graph neural representational learning of RNA secondary structures for predicting RNA-protein interactions." Bioinformatics 36, Supplement_1 (July 1, 2020): i276–i284. http://dx.doi.org/10.1093/bioinformatics/btaa456.

Abstract:
Motivation: RNA-protein interactions are key effectors of post-transcriptional regulation. Significant experimental and bioinformatics efforts have been expended on characterizing protein binding mechanisms on the molecular level, and on highlighting the sequence and structural traits of RNA that impact the binding specificity for different proteins. Yet our ability to predict these interactions in silico remains relatively poor. Results: In this study, we introduce RPI-Net, a graph neural network approach for RNA-protein interaction prediction. RPI-Net learns and exploits a graph representation of RNA molecules, yielding significant performance gains over existing state-of-the-art approaches. We also introduce an approach to rectify an important type of sequence bias caused by the RNase T1 enzyme used in many CLIP-Seq experiments, and we show that correcting this bias is essential in order to learn meaningful predictors and properly evaluate their accuracy. Finally, we provide new approaches to interpret the trained models and extract simple, biologically interpretable representations of the learned sequence and structural motifs. Availability and implementation: Source code can be accessed at https://www.github.com/HarveyYan/RNAonGraph. Supplementary information: Supplementary data are available at Bioinformatics online.
23

Xia, Chunqiu, Shi-Hao Feng, Ying Xia, Xiaoyong Pan, and Hong-Bin Shen. "Fast protein structure comparison through effective representation learning with contrastive graph neural networks." PLOS Computational Biology 18, no. 3 (March 24, 2022): e1009986. http://dx.doi.org/10.1371/journal.pcbi.1009986.

Abstract:
Protein structure alignment algorithms are often time-consuming, resulting in challenges for large-scale protein structure similarity-based retrieval. There is an urgent need for more efficient structure comparison approaches as the number of protein structures increases rapidly. In this paper, we propose an effective graph-based protein structure representation learning method, GraSR, for fast and accurate structure comparison. In GraSR, a graph is constructed based on the intra-residue distance derived from the tertiary structure. Then, deep graph neural networks (GNNs) with a short-cut connection learn graph representations of the tertiary structures under a contrastive learning framework. To further improve GraSR, a novel dynamic training data partition strategy and a length-scaling cosine distance are introduced. We objectively evaluate GraSR on SCOPe v2.07 and a newly released independent test set from the PDB database with a purpose-designed comprehensive performance metric. Compared with other state-of-the-art methods, GraSR achieves about 7%-10% improvement on the two benchmark datasets. GraSR is also much faster than alignment-based methods. We dig into the model and observe that the superiority of GraSR is mainly due to the learned discriminative residue-level and global descriptors. The web server and source code of GraSR are freely available at www.csbio.sjtu.edu.cn/bioinf/GraSR/ for academic use.
24

Dai, Bowen, and Chris Bailey-Kellogg. "Protein interaction interface region prediction by geometric deep learning." Bioinformatics 37, no. 17 (March 8, 2021): 2580–88. http://dx.doi.org/10.1093/bioinformatics/btab154.

Abstract:
Motivation: Protein–protein interactions drive wide-ranging molecular processes, and characterizing at the atomic level how proteins interact (beyond just the fact that they interact) can provide key insights into understanding and controlling this machinery. Unfortunately, experimental determination of three-dimensional protein complex structures remains difficult and does not scale to the increasingly large sets of proteins whose interactions are of interest. Computational methods are thus required to meet the demands of large-scale, high-throughput prediction of how proteins interact, but unfortunately, both physical modeling and machine learning methods suffer from poor precision and/or recall. Results: In order to improve performance in predicting protein interaction interfaces, we leverage the best properties of both data- and physics-driven methods to develop a unified Geometric Deep Neural Network, ‘PInet’ (Protein Interface Network). PInet consumes pairs of point clouds encoding the structures of two partner proteins, in order to predict their structural regions mediating interaction. To make such predictions, PInet learns and utilizes models capturing both geometrical and physicochemical molecular surface complementarity. In application to a set of benchmarks, PInet simultaneously predicts the interface regions on both interacting proteins, achieving performance equivalent to or even much better than the state-of-the-art predictor for each dataset. Furthermore, since PInet is based on joint segmentation of a representation of protein surfaces, its predictions are meaningful in terms of the underlying physical complementarity driving molecular recognition. Availability and implementation: PInet scripts and models are available at https://github.com/FTD007/PInet. Supplementary information: Supplementary data are available at Bioinformatics online.
25

van den Bent, Irene, Stavros Makrodimitris, and Marcel Reinders. "The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction." Evolutionary Bioinformatics 17 (January 2021): 117693432110626. http://dx.doi.org/10.1177/11769343211062608.

Abstract:
Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
26

Liu, Xianggen, Yunan Luo, Pengyong Li, Sen Song, and Jian Peng. "Deep geometric representations for modeling effects of mutations on protein-protein binding affinity." PLOS Computational Biology 17, no. 8 (August 4, 2021): e1009284. http://dx.doi.org/10.1371/journal.pcbi.1009284.

Abstract:
Modeling the impact of amino acid mutations on protein-protein interaction plays a crucial role in protein engineering and drug design. In this study, we develop GeoPPI, a novel structure-based deep-learning framework to predict the change of binding affinity upon mutations. Based on the three-dimensional structure of a protein, GeoPPI first learns a geometric representation that encodes topology features of the protein structure via a self-supervised learning scheme. These representations are then used as features for training gradient-boosting trees to predict the changes of protein-protein binding affinity upon mutations. We find that GeoPPI is able to learn meaningful features that characterize interactions between atoms in protein structures. In addition, through extensive experiments, we show that GeoPPI achieves new state-of-the-art performance in predicting the binding affinity changes upon both single- and multi-point mutations on six benchmark datasets. Moreover, we show that GeoPPI can accurately estimate the difference of binding affinities between a few recently identified SARS-CoV-2 antibodies and the receptor-binding domain (RBD) of the S protein. These results demonstrate the potential of GeoPPI as a powerful and useful computational tool in protein design and engineering. Our code and datasets are available at: https://github.com/Liuxg16/GeoPPI.
27

Xie, Ziwei, and Jinbo Xu. "Deep graph learning of inter-protein contacts." Bioinformatics 38, no. 4 (November 10, 2021): 947–53. http://dx.doi.org/10.1093/bioinformatics/btab761.

Abstract:
Motivation: Inter-protein (interfacial) contact prediction is very useful for in silico structural characterization of protein–protein interactions. Although deep learning has been applied to this problem, its accuracy is not as good as intra-protein contact prediction. Results: We propose a new deep learning method GLINTER (Graph Learning of INTER-protein contacts) for interfacial contact prediction of dimers, leveraging a rotational invariant representation of protein tertiary structures and a pretrained language model of multiple sequence alignments. Tested on the 13th and 14th CASP-CAPRI datasets, the average top L/10 precision achieved by GLINTER is 54% on the homodimers and 52% on all the dimers, much higher than 30% obtained by the latest deep learning method DeepHomo on the homodimers and 15% obtained by BIPSPI on all the dimers. Our experiments show that GLINTER-predicted contacts help improve selection of docking decoys. Availability and implementation: The software is available at https://github.com/zw2x/glinter. The datasets are available at https://github.com/zw2x/glinter/data. Supplementary information: Supplementary data are available at Bioinformatics online.
28

Wang, Xian-Fang, Peng Gao, Yi-Feng Liu, Hong-Fei Li, and Fan Lu. "Predicting Thermophilic Proteins by Machine Learning." Current Bioinformatics 15, no. 5 (October 14, 2020): 493–502. http://dx.doi.org/10.2174/1574893615666200207094357.

Abstract:
Background: Thermophilic proteins can maintain good activity at high temperature; therefore, studying them is important for understanding the thermal stability of proteins. Objective: To address the low precision and low efficiency of thermophilic protein prediction, a prediction method based on feature fusion and machine learning is proposed in this paper. Methods: For the selected thermophilic datasets, the protein sequences were first characterized by feature fusion, combining g-gap dipeptide composition, entropy density and autocorrelation coefficients. Then, Kernel Principal Component Analysis (KPCA) was used to reduce the dimension of the resulting sequence features in order to reduce training time and improve efficiency. Finally, the classification model was built using a classification algorithm. Results: A variety of classification algorithms were trained and tested on the selected thermophilic dataset. By comparison, the accuracy of the Support Vector Machine (SVM) under the jackknife method was over 92%. The other evaluation indicators also showed that the SVM performed best. Conclusion: Because it uses an effective feature representation method and a robust classifier, the proposed method is suitable for predicting thermophilic proteins and is superior to most reported methods.
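The g-gap dipeptide feature named in this abstract can be illustrated with a short sketch. The abstract does not give the authors' exact formulation, so the following assumes the usual definition: a pair (a, b) is counted when residue a at position i is followed by residue b at position i + g + 1, and counts are normalized to frequencies. The function name and the handling of non-standard residues are illustrative choices, not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(sequence, gap=0):
    """Return a dict mapping each of the 400 residue pairs to its frequency.

    Assumed definition: pair (a, b) is counted whenever residue a at
    position i is followed by residue b at position i + gap + 1.
    gap=0 gives the ordinary (adjacent) dipeptide composition.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short for this gap: return an all-zero vector.
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


features = g_gap_dipeptide_composition("ACAC", gap=1)
print(features["AA"], features["CC"])
```

Each gap value yields a 400-dimensional vector, which is why a dimensionality reduction step such as the KPCA mentioned in the abstract is a natural follow-up when several gaps and other descriptors are fused.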
29

Zhang, Haiping, Konda Mani Saravanan, Jinzhi Lin, Linbu Liao, Justin Tze-Yang Ng, Jiaxiu Zhou, and Yanjie Wei. "DeepBindPoc: a deep learning method to rank ligand binding pockets using molecular vector representation." PeerJ 8 (April 6, 2020): e8864. http://dx.doi.org/10.7717/peerj.8864.

Abstract:
Accurate identification of ligand-binding pockets in a protein is important for structure-based drug design. In recent years, several deep learning models have been developed to learn important physical–chemical and spatial information to predict ligand-binding pockets in a protein. However, ranking the native ligand-binding pockets from a pool of predicted pockets is still a hard task for computational molecular biologists using a single web-based tool. Hence, we believe that by training on data closer to real applications and by providing ligand information, an enhanced model for identifying accurate pockets can be obtained. In this article, we propose a new deep learning method called DeepBindPoc for identifying and ranking ligand-binding pockets in proteins. The model is built using information about the binding pocket and the associated ligand. We take advantage of the mol2vec tool to represent both the given ligand and the pocket as vectors and construct a densely fully connected layer model. During training, important features for pocket-ligand binding are automatically extracted and high-level information is preserved appropriately. DeepBindPoc demonstrated a strong complementary advantage for the detection of native-like pockets when combined with popular traditional methods, such as fpocket and P2Rank. The proposed method is extensively tested and validated with standard procedures on multiple datasets, including a dataset of G-protein coupled receptors. The systematic testing and validation of our method suggest that DeepBindPoc is a valuable tool for ranking near-native pockets in theoretically modeled proteins with an unknown experimental active site but a known ligand. The DeepBindPoc model described in this article is available at GitHub (https://github.com/haiping1010/DeepBindPoc) and the webserver is available at (http://cbblab.siat.ac.cn/DeepBindPoc/index.php).
30

Liu, Xiang, Huitao Feng, Jie Wu, and Kelin Xia. "Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction." PLOS Computational Biology 18, no. 4 (April 6, 2022): e1009943. http://dx.doi.org/10.1371/journal.pcbi.1009943.

Abstract:
With the great advancements in experimental data, computational power and learning algorithms, artificial intelligence (AI) based drug design has begun to gain momentum recently. AI-based drug design has great promise to revolutionize pharmaceutical industries by significantly reducing the time and cost of drug discovery processes. However, a major issue that remains for all AI-based learning models is efficient molecular representation. Here we propose, for the first time, Dowker complex (DC) based molecular interaction representations and Riemann zeta function based molecular featurization. Molecular interactions between proteins and ligands (or others) are modeled as Dowker complexes. A multiscale representation is generated by using a filtration process, during which a series of DCs are generated at different scales. Combinatorial (Hodge) Laplacian matrices are constructed from these DCs, and the Riemann zeta functions from their spectral information can be used as molecular descriptors. To validate our models, we consider protein-ligand binding affinity prediction. Our DC-based machine learning (DCML) models, in particular the DC-based gradient boosting tree (DC-GBT), are tested on the three most commonly used datasets, i.e., PDBbind-2007, PDBbind-2013 and PDBbind-2016, and extensively compared with other existing state-of-the-art models. We find that our DC-based descriptors achieve state-of-the-art results and perform better than all machine learning models with traditional molecular descriptors. Our Dowker complex based machine learning models can be used in other tasks in AI-based drug design and molecular data analysis.
31

Jing, Xiaoyang, Qimin Dong, Ruqian Lu, and Qiwen Dong. "Protein Inter-Residue Contacts Prediction: Methods, Performances and Applications." Current Bioinformatics 14, no. 3 (March 7, 2019): 178–89. http://dx.doi.org/10.2174/1574893613666181109130430.

Abstract:
Background: Protein inter-residue contact prediction plays an important role in the field of protein structure and function research. As a low-dimensional representation of protein tertiary structure, protein inter-residue contacts could greatly help de novo protein structure prediction methods to reduce the conformational search space. Over the past two decades, various methods have been developed for protein inter-residue contact prediction. Objective: We provide a comprehensive and systematic review of protein inter-residue contact prediction methods. Results: Protein inter-residue contact prediction methods are roughly classified into five categories: correlated mutations methods, machine-learning methods, fusion methods, template-based methods and 3D model-based methods. In this paper, we first describe the common definition of protein inter-residue contacts and show their typical applications. Then, we present a comprehensive review of the three main categories of protein inter-residue contact prediction: correlated mutations methods, machine-learning methods and fusion methods. We also analyze the constraints of each category. Furthermore, we compare several representative methods on the CASP11 dataset and discuss the performance of these methods in detail. Conclusion: Correlated mutations methods achieve better performance for long-range contacts, while machine-learning methods perform well for short-range contacts. Fusion methods can take advantage of both machine-learning and correlated mutations methods. Employing more effective fusion strategies could help to further improve the performance of fusion methods.
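As a rough illustration of what an inter-residue contact map is, the sketch below builds one from 3D coordinates. The abstract does not state its definition, so this assumes the convention commonly used in CASP assessments: two residues are in contact when their representative atoms (typically Cβ) are closer than 8 Å, and contacts are binned by sequence separation into short (6 to 11), medium (12 to 23) and long (24 or more) range. The function names and the toy coordinates are hypothetical.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=6):
    """Return a symmetric boolean matrix of residue-residue contacts.

    coords: one (x, y, z) tuple per residue, in angstroms (assumed to be
    C-beta positions). Pairs closer in sequence than min_separation are
    ignored, since they are trivially near each other.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = True
    return contacts


def range_class(i, j):
    """Classify a residue pair by sequence separation (assumed CASP-style bins)."""
    separation = abs(i - j)
    if separation >= 24:
        return "long"
    if separation >= 12:
        return "medium"
    if separation >= 6:
        return "short"
    return "local"


# Toy example: ten residues placed 1 angstrom apart along a straight line.
toy_coords = [(float(k), 0.0, 0.0) for k in range(10)]
toy_map = contact_map(toy_coords)
print(toy_map[0][6], toy_map[0][8], range_class(0, 30))
```

The separation bins matter for the review's conclusion, which contrasts method families by how well they recover long-range versus short-range contacts.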
32

Gokcan, Hatice, and Olexandr Isayev. "Prediction of protein pKa with representation learning." Chemical Science 13, no. 8 (2022): 2462–74. http://dx.doi.org/10.1039/d1sc05610g.

33

Sun, Miao, Dong Si, Matthew Conover, Natalie Stephenson, Jesse Eickholt, Renzhi Cao, and John Smith. "TopQA: a topological representation for single-model protein quality assessment with machine learning." International Journal of Computational Biology and Drug Design 13, no. 1 (2020): 144. http://dx.doi.org/10.1504/ijcbdd.2020.10026784.

34

Smith, John, Matthew Conover, Natalie Stephenson, Jesse Eickholt, Dong Si, Miao Sun, and Renzhi Cao. "TopQA: a topological representation for single-model protein quality assessment with machine learning." International Journal of Computational Biology and Drug Design 13, no. 1 (2020): 144. http://dx.doi.org/10.1504/ijcbdd.2020.105095.

35

Bramer, David, and Guo-Wei Wei. "Atom-specific persistent homology and its application to protein flexibility analysis." Computational and Mathematical Biophysics 8, no. 1 (February 17, 2020): 1–35. http://dx.doi.org/10.1515/cmb-2020-0001.

Abstract:
Recently, persistent homology has had tremendous success in biomolecular data analysis. It works by examining the topological relationship or connectivity of a group of atoms in a molecule at a variety of scales, then rendering a family of topological representations of the molecule. However, persistent homology is rarely employed for the analysis of atomic properties, such as biomolecular flexibility analysis or B-factor prediction. This work introduces atom-specific persistent homology to provide a local atomic level representation of a molecule via a global topological tool. This is achieved through the construction of a pair of conjugated sets of atoms and corresponding conjugated simplicial complexes, as well as conjugated topological spaces. The difference between the topological invariants of the pair of conjugated sets is measured by Bottleneck and Wasserstein metrics and leads to an atom-specific topological representation of individual atomic properties in a molecule. Atom-specific topological features are integrated with various machine learning algorithms, including gradient boosting trees and convolutional neural network for protein thermal fluctuation analysis and B-factor prediction. Extensive numerical results indicate the proposed method provides a powerful topological tool for analyzing and predicting localized information in complex macromolecules.
36

Wiercioch, Magdalena. "Exploring the Potential of Spherical Harmonics and PCVM for Compounds Activity Prediction." International Journal of Molecular Sciences 20, no. 9 (May 2, 2019): 2175. http://dx.doi.org/10.3390/ijms20092175.

Abstract:
Biologically active chemical compounds may provide remedies for several diseases. Meanwhile, Machine Learning techniques applied to Drug Discovery, which are cheaper and faster than wet-lab experiments, have the capability to more effectively identify molecules with the expected pharmacological activity. Therefore, it is urgent and essential to develop more representative descriptors and reliable classification methods to accurately predict molecular activity. In this paper, we investigate the potential of a novel representation based on Spherical Harmonics fed into a Probabilistic Classification Vector Machines classifier, namely SHPCVM, for the compound activity prediction task. We make use of representation learning to acquire features that describe the molecules as precisely as possible. To verify the performance of SHPCVM, ten-fold cross-validation tests are performed on twenty-one G protein-coupled receptors (GPCRs). Experimental outcomes (accuracy of 0.86), assessed by classification accuracy, precision, recall, Matthews’ Correlation Coefficient and Cohen’s kappa, reveal that combining our relatively short Spherical Harmonics-based representation with Probabilistic Classification Vector Machines can achieve very satisfactory performance for GPCRs.
37

Wang, Yanbin, Zhu-Hong You, Shan Yang, Xiao Li, Tong-Hai Jiang, and Xi Zhou. "A High Efficient Biological Language Model for Predicting Protein–Protein Interactions." Cells 8, no. 2 (February 3, 2019): 122. http://dx.doi.org/10.3390/cells8020122.

Abstract:
Many life activities and key functions in organisms are maintained by different types of protein–protein interactions (PPIs). In order to accelerate the discovery of PPIs for different species, many computational methods have been developed. Unfortunately, even though computational methods are constantly evolving, efficient methods for predicting PPIs from protein sequence information have not been found for many years due to limiting factors including both methodology and technology. Inspired by the similarity of biological sequences and languages, developing a biological language processing technology may provide a brand new theoretical perspective and feasible method for the study of biological sequences. In this paper, a pure biological language processing model is proposed for predicting protein–protein interactions only using a protein sequence. The model was constructed based on a feature representation method for biological sequences called bio-to-vector (Bio2Vec) and a convolution neural network (CNN). The Bio2Vec obtains protein sequence features by using a “bio-word” segmentation system and a word representation model used for learning the distributed representation for each “bio-word”. The Bio2Vec supplies a frame that allows researchers to consider the context information and implicit semantic information of a bio sequence. A remarkable improvement in PPIs prediction performance has been observed by using the proposed model compared with state-of-the-art methods. The presentation of this approach marks the start of “bio language processing technology,” which could cause a technological revolution and could be applied to improve the quality of predictions in other problems.
38

Zhang, Ting-He, and Shao-Wu Zhang. "Advances in the Prediction of Protein Subcellular Locations with Machine Learning." Current Bioinformatics 14, no. 5 (June 28, 2019): 406–21. http://dx.doi.org/10.2174/1574893614666181217145156.

Abstract:
Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, it is highly desirable to develop computational methods for efficiently and effectively identifying protein subcellular locations. In particular, the rapidly increasing number of protein sequences entering the genome databases has called for the development of automated analysis methods. Methods: In this review, we describe the recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) Protein subcellular location benchmark dataset construction, ii) Protein feature representation and feature descriptors, iii) Common machine learning algorithms, iv) Cross-validation test methods and assessment metrics, v) Web servers. Result & Conclusion: Given the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. The first is the selection of novel and effective features (e.g., statistical, physical-chemical, evolutionary) from the sequences and structures of proteins. The second is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins with multiple location sites.
39

Bae, Haelee, and Hojung Nam. "GraphATT-DTA: Attention-Based Novel Representation of Interaction to Predict Drug-Target Binding Affinity." Biomedicines 11, no. 1 (December 27, 2022): 67. http://dx.doi.org/10.3390/biomedicines11010067.

Abstract:
Drug-target binding affinity (DTA) prediction is an essential step in drug discovery. Drug-target protein binding occurs at specific regions between the protein and drug, rather than the entire protein and drug. However, existing deep-learning DTA prediction methods do not consider the interactions between drug substructures and protein sub-sequences. This work proposes GraphATT-DTA, a DTA prediction model that constructs the essential regions for determining interaction affinity between compounds and proteins, modeled with an attention mechanism for interpretability. We make the model consider the local-to-global interactions with the attention mechanism between compound and protein. As a result, GraphATT-DTA shows an improved prediction of DTA performance and interpretability compared with state-of-the-art models. The model is trained and evaluated with the Davis dataset, the human kinase dataset; an external evaluation is achieved with the independently proposed human kinase dataset from the BindingDB dataset.
40

Rassinoux, A. M. "Knowledge Representation and Management." Yearbook of Medical Informatics 19, no. 01 (August 2010): 64–67. http://dx.doi.org/10.1055/s-0038-1638691.

Abstract:
Summary Objectives: To summarize current outstanding research in the field of knowledge representation and management. Method: Synopsis of the articles selected for the IMIA Yearbook 2010. Results: Four interesting papers, dealing with structured knowledge, have been selected for the section knowledge representation and management. Combining the newest techniques in computational linguistics and natural language processing with the latest methods in statistical data analysis, machine learning and text mining has proved to be efficient for turning unstructured textual information into meaningful knowledge. Three of the four selected papers for the section knowledge representation and management corroborate this approach and depict various experiments conducted to extract meaningful knowledge from unstructured free texts such as extracting cancer disease characteristics from pathology reports, or extracting protein-protein interactions from biomedical papers, as well as extracting knowledge for the support of hypothesis generation in molecular biology from the Medline literature. Finally, the last paper addresses the level of formally representing and structuring information within clinical terminologies in order to render such information easily available and shareable among the health informatics community. Conclusions: Delivering common powerful tools able to automatically extract meaningful information from the huge amount of electronically unstructured free texts is an essential step towards promoting sharing and reusability across applications, domains, and institutions thus contributing to building capacities worldwide.
41

Nido, Gonzalo S., Ludovica Bachschmid-Romano, Ugo Bastolla, and Alberto Pascual-García. "Learning structural bioinformatics and evolution with a snake puzzle." PeerJ Computer Science 2 (December 5, 2016): e100. http://dx.doi.org/10.7717/peerj-cs.100.

Abstract:
We propose here a working unit for teaching basic concepts of structural bioinformatics and evolution through the example of a wooden snake puzzle, strikingly similar to toy models widely used in the literature of protein folding. In our experience, developed at a Master’s course at the Universidad Autónoma de Madrid (Spain), the concreteness of this example helps to overcome difficulties caused by the interdisciplinary nature of this field and its high level of abstraction, in particular for students coming from traditional disciplines. The puzzle allows us to discuss a simple algorithm for finding folded solutions, through which we introduce the concept of the configuration space and the contact matrix representation. This is a central tool for comparing protein structures, for studying simple models of protein energetics, and even for a qualitative discussion of folding kinetics, through the concept of the Contact Order. It also allows a simple representation of misfolded conformations and their free energy. These concepts motivate evolutionary questions, which we address by simulating a structurally constrained model of protein evolution, again modelled on the snake puzzle. In this way, we can discuss the analogy between evolutionary concepts and statistical mechanics that facilitates the understanding of both concepts. The proposed examples and literature are accessible, and we provide supplementary material (see ‘Data Availability’) to reproduce the numerical experiments. We also suggest possible directions to expand the unit. We hope that this work will further stimulate the adoption of games in teaching practice.
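The Contact Order mentioned in this abstract can be computed directly from a list of contacts. The abstract gives no formula, so the sketch below assumes the widely used relative contact order: the average sequence separation of contacting residue pairs, divided by the chain length. The function name and the example contacts are hypothetical, and the article's own definition may differ in detail.

```python
def relative_contact_order(contacts, length):
    """Average sequence separation of contacts, normalized by chain length.

    contacts: iterable of (i, j) residue index pairs that are in contact.
    length:   number of residues in the chain.
    Returns 0.0 when there are no contacts or the length is not positive.
    """
    contacts = list(contacts)
    if not contacts or length <= 0:
        return 0.0
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (length * len(contacts))


# Hypothetical 10-residue chain with two contacts.
print(relative_contact_order([(0, 4), (1, 3)], length=10))
```

A low value means that contacts are mostly between residues close in sequence, while a high value indicates many contacts between residues far apart in sequence, which is the property used in qualitative discussions of folding kinetics.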
42

Qu, Kaiyang, Leyi Wei, and Quan Zou. "A Review of DNA-binding Proteins Prediction Methods." Current Bioinformatics 14, no. 3 (March 7, 2019): 246–54. http://dx.doi.org/10.2174/1574893614666181212102030.

Abstract:
Background: DNA-binding proteins, which bind to DNA, exist widely in living cells and participate in many cell activities. They take part in DNA-related cell activities such as DNA replication, transcription, recombination, and DNA repair. Objective: Given the importance of DNA-binding proteins, their prediction has been a popular research topic over the past decades. In this article, we review current machine-learning methods for predicting DNA-binding proteins in terms of feature representation methods, classifiers, measurements, datasets, and existing web servers. Method: Prediction methods for DNA-binding proteins can be divided into two types: those based on amino acid composition and those based on protein structure. In this article, we follow these two types to introduce the application of machine learning in DNA-binding protein prediction. Results: Machine learning plays an important role in the classification of DNA-binding proteins and gives better results. The best ACC is above 80%. Conclusion: Machine learning can be widely used in many aspects of biological information, especially in protein classification. Some issues should be considered in future work. First, the relationship between the number of features and performance must be explored. Second, because many features are used to predict DNA-binding proteins, solutions for high-dimensional spaces should be proposed.
43

Jin, Yuan, Jiarui Lu, Runhan Shi, and Yang Yang. "EmbedDTI: Enhancing the Molecular Representations via Sequence Embedding and Graph Convolutional Network for the Prediction of Drug-Target Interaction." Biomolecules 11, no. 12 (November 29, 2021): 1783. http://dx.doi.org/10.3390/biom11121783.

Abstract:
The identification of drug-target interaction (DTI) plays a key role in drug discovery and development. Benefitting from large-scale drug databases and verified DTI relationships, many machine-learning methods have been developed to predict DTIs. However, due to the difficulty in extracting useful information from molecules, the performance of these methods is limited by the representation of drugs and target proteins. This study proposes a new model called EmbedDTI to enhance the representation of both drugs and target proteins, and improve the performance of DTI prediction. For protein sequences, we leverage language modeling for pretraining the feature embeddings of amino acids and feed them to a convolutional neural network model for further representation learning. For drugs, we build two levels of graphs to represent compound structural information, namely the atom graph and substructure graph, and adopt a graph convolutional network with an attention module to learn the embedding vectors for the graphs. We compare EmbedDTI with the existing DTI predictors on two benchmark datasets. The experimental results show that EmbedDTI outperforms the state-of-the-art models, and the attention module can identify the components crucial for DTIs in compounds.
44

Tubiana, Jérôme, Simona Cocco, and Rémi Monasson. "Learning Compositional Representations of Interacting Systems with Restricted Boltzmann Machines: Comparative Study of Lattice Proteins." Neural Computation 31, no. 8 (August 2019): 1671–717. http://dx.doi.org/10.1162/neco_a_01210.

Abstract:
A restricted Boltzmann machine (RBM) is an unsupervised machine learning bipartite graphical model that jointly learns a probability distribution over data and extracts their relevant statistical features. RBMs were recently proposed for characterizing the patterns of coevolution between amino acids in protein sequences and for designing new sequences. Here, we study how the nature of the features learned by RBM changes with its defining parameters, such as the dimensionality of the representations (size of the hidden layer) and the sparsity of the features. We show that for adequate values of these parameters, RBMs operate in a so-called compositional phase in which visible configurations sampled from the RBM are obtained by recombining these features. We then compare the performance of RBM with other standard representation learning algorithms, including principal or independent component analysis (PCA, ICA), autoencoders (AE), variational autoencoders (VAE), and their sparse variants. We show that RBMs, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA. In addition, this stochastic mapping is not prescribed a priori as in VAE, but learned from data, which allows RBMs to show good performance even with shallow architectures. All numerical results are illustrated on synthetic lattice protein data that share similar statistical features with real protein sequences and for which ground-truth interactions are known.
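The stochastic mapping between data configurations and representations described in this abstract can be sketched for the simplest textbook case, a binary RBM. This is only an illustration under stated assumptions: the paper's models for protein sequences may use categorical visible units and other hidden-unit types, and every name below (`hidden_probabilities`, `gibbs_step`, the weight layout `weights[i][j]`) is hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(visible, weights, hidden_bias):
    """P(h_j = 1 | v) for a binary RBM; weights[i][j] couples visible i and hidden j."""
    return [
        sigmoid(hidden_bias[j] + sum(weights[i][j] * visible[i] for i in range(len(visible))))
        for j in range(len(hidden_bias))
    ]


def visible_probabilities(hidden, weights, visible_bias):
    """P(v_i = 1 | h) for a binary RBM."""
    return [
        sigmoid(visible_bias[i] + sum(weights[i][j] * hidden[j] for j in range(len(hidden))))
        for i in range(len(visible_bias))
    ]


def sample_binary(probabilities, rng):
    """Draw each unit independently from its Bernoulli probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def gibbs_step(visible, weights, visible_bias, hidden_bias, rng):
    """One alternating Gibbs update: sample h given v, then v given h."""
    hidden = sample_binary(hidden_probabilities(visible, weights, hidden_bias), rng)
    new_visible = sample_binary(visible_probabilities(hidden, weights, visible_bias), rng)
    return new_visible, hidden


# Toy model with 3 visible and 2 hidden units; all parameters are made up.
toy_weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
toy_rng = random.Random(0)
toy_visible, toy_hidden = gibbs_step([1, 0, 1], toy_weights, [0.0] * 3, [0.0] * 2, toy_rng)
```

Because the hidden units are sampled rather than computed deterministically, repeated calls to `gibbs_step` on the same input can yield different representations, which is the contrast with PCA or ICA that the abstract draws.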
45

Wei, Lesong, Xiucai Ye, Tetsuya Sakurai, Zengchao Mu, and Leyi Wei. "ToxIBTL: prediction of peptide toxicity based on information bottleneck and transfer learning." Bioinformatics 38, no. 6 (January 6, 2022): 1514–24. http://dx.doi.org/10.1093/bioinformatics/btac006.

Abstract:
Motivation: Recently, peptides have emerged as a promising class of pharmaceuticals for the treatment of various diseases, poised between traditional small molecule drugs and therapeutic proteins. However, one of the key bottlenecks preventing their use as therapeutics is their toxicity toward human cells, and few available algorithms for predicting toxicity are specially designed for short-length peptides. Results: We present ToxIBTL, a novel deep learning framework that utilizes the information bottleneck principle and transfer learning to predict the toxicity of peptides as well as proteins. Specifically, we use evolutionary information and physicochemical properties of peptide sequences and integrate the information bottleneck principle into a feature representation learning scheme, by which relevant information is retained and the redundant information is minimized in the obtained features. Moreover, transfer learning is introduced to transfer the common knowledge contained in proteins to peptides, which aims to improve the feature representation capability. Extensive experimental results demonstrate that ToxIBTL not only achieves a higher prediction performance than state-of-the-art methods on the peptide dataset, but also has a competitive performance on the protein dataset. Furthermore, a user-friendly online web server is established as the implementation of the proposed ToxIBTL. Availability and implementation: The proposed ToxIBTL and data are freely accessible at http://server.wei-group.net/ToxIBTL. Our source code is available at https://github.com/WLYLab/ToxIBTL. Supplementary information: Supplementary data are available at Bioinformatics online.
46

Wang, Duolin, Yanchun Liang, and Dong Xu. "Capsule network for protein post-translational modification site prediction." Bioinformatics 35, no. 14 (December 6, 2018): 2386–94. http://dx.doi.org/10.1093/bioinformatics/bty977.

Abstract:
Motivation: Computational methods for protein post-translational modification (PTM) site prediction provide a useful approach for studying protein functions. The prediction accuracy of the existing methods has significant room for improvement. A recent deep-learning architecture, Capsule Network (CapsNet), which can characterize the internal hierarchical representation of input data, presents a great opportunity to solve this problem, especially using small training data. Results: We proposed a CapsNet for predicting protein PTM sites, including phosphorylation, N-linked glycosylation, N6-acetyllysine, methyl-arginine, S-palmitoyl-cysteine, pyrrolidone-carboxylic-acid and SUMOylation sites. The CapsNet outperformed the baseline convolutional neural network architecture MusiteDeep and other well-known tools in most cases and provided promising results for practical use, especially in learning from small training data. The capsule length also gives an accurate estimate for the confidence of the PTM prediction. We further demonstrated that the internal capsule features could be trained as a motif detector of phosphorylation sites when no kinase-specific phosphorylation labels were provided. In addition, CapsNet generates robust representations that have strong discriminant power in distinguishing kinase substrates from different kinase families. Our study sheds some light on the recognition mechanism of PTMs and applications of CapsNet on other bioinformatic problems. Availability and implementation: The codes are free to download from https://github.com/duolinwang/CapsNet_PTM. Supplementary information: Supplementary data are available at Bioinformatics online.
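The remark that capsule length estimates prediction confidence can be made concrete with the standard capsule "squash" nonlinearity from the original capsule network formulation, which rescales a capsule's raw output so that its length lies between 0 and 1. The sketch below is generic and is not drawn from the CapsNet_PTM code; the function names and the toy input are assumptions.

```python
import math


def squash(raw_capsule, epsilon=1e-12):
    """Standard capsule squash: keep the direction, map the length into [0, 1)."""
    norm_sq = sum(x * x for x in raw_capsule)
    norm = math.sqrt(norm_sq)
    # scale = |s|^2 / (1 + |s|^2) / |s|; epsilon guards against division by zero.
    scale = norm_sq / (1.0 + norm_sq) / (norm + epsilon)
    return [x * scale for x in raw_capsule]


def capsule_length(capsule):
    """Euclidean length of a capsule vector, read as a confidence-like score."""
    return math.sqrt(sum(x * x for x in capsule))


# Toy raw capsule output with length 5; after squashing its length is 25/26.
toy_confidence = capsule_length(squash([3.0, 4.0]))
```

Short raw vectors are shrunk toward length 0 and long ones saturate just below 1, which is why the length of an output capsule can be read as how strongly the network supports that class.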
47

Xu, Chang, Limin Jiang, Zehua Zhang, Xuyao Yu, Renhai Chen, and Junhai Xu. "An Integrated Prediction Method for Identifying Protein-Protein Interactions." Current Proteomics 17, no. 4 (June 29, 2020): 271–86. http://dx.doi.org/10.2174/1570164616666190306152318.

Abstract:
Background: Protein-Protein Interactions (PPIs) play a key role in various biological processes. Many methods have been developed to predict protein-protein interactions and protein interaction networks. However, many existing applications are limited because they rely on a large number of homologous proteins and interaction marks. Methods: In this paper, we propose a novel integrated learning approach (RF-Ada-DF) with a sequence-based feature representation for identifying protein-protein interactions. Our method first constructs a sequence-based feature vector to represent each pair of proteins, via Multivariate Mutual Information (MMI) and Normalized Moreau-Broto Autocorrelation (NMBAC). Then, we feed the 638-dimensional features into an integrated learning model for judging interaction pairs and non-interaction pairs. Furthermore, this integrated model embeds Random Forest in the AdaBoost framework and turns weak classifiers into a single strong classifier. Meanwhile, we also employ double fault detection in order to suppress over-adaptation during the training process. Results: To evaluate the performance of our method, we conduct several comprehensive tests for PPI prediction. On the H. pylori dataset, our method achieves 88.16% accuracy and 87.68% sensitivity; the accuracy is increased by 0.57%. On the S. cerevisiae dataset, our method achieves 95.77% accuracy and 93.36% sensitivity; the accuracy is increased by 0.76%. On the Human dataset, our method achieves 98.16% accuracy and 96.80% sensitivity; the accuracy is increased by 0.6%. Experiments show that our method achieves better results than other outstanding methods for sequence-based PPI prediction. The datasets and codes are available at https://github.com/guofei-tju/RF-Ada-DF.git.
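As a rough illustration of one of the two descriptors named in this abstract, the sketch below computes a normalized Moreau-Broto autocorrelation over a sequence of per-residue property values, using the usual form in which the sum of products at a given lag is divided by the number of residue pairs. The abstract does not state which physicochemical properties, lag range, or normalization the authors use, so the per-sequence standardization, the function names, and the toy values are assumptions.

```python
def standardize(values):
    """Rescale property values to zero mean and unit variance (assumed preprocessing)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def nmbac(values, lag):
    """Normalized Moreau-Broto autocorrelation at one lag.

    values: one numeric property value per residue, in sequence order.
    Returns the average of values[i] * values[i + lag] over all valid i.
    """
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


def nmbac_features(values, max_lag):
    """One feature per lag from 1 to max_lag for a single property."""
    scaled = standardize(values)
    return [nmbac(scaled, lag) for lag in range(1, max_lag + 1)]


# Toy property profile for a six-residue peptide; the numbers are made up.
toy_features = nmbac_features([1.8, -4.5, 2.5, -3.5, 4.5, 3.8], max_lag=3)
```

Repeating this for several properties and concatenating the per-lag values gives a fixed-length vector regardless of sequence length, which is what makes such descriptors usable as classifier input.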
48

Nair B.J, Bipin, and Lijo Joy. "A hybrid approach for hot spot prediction and deep representation of hematological protein – drug interactions." International Journal of Engineering & Technology 7, no. 1.9 (March 1, 2018): 145. http://dx.doi.org/10.14419/ijet.v7i1.9.9752.

Abstract:
In our research work, we collect data on drugs and proteins related to hematic diseases, then apply feature extraction and classification to predict hot spots and non-hot spots, and then predict the hot region using a prediction algorithm. In parallel, we extract features from the hematological drugs using molecular fingerprints, classify them using a classifier, apply deep learning to reduce the dimensionality, and finally use a machine learning algorithm to predict which drug will interact, with the help of a hybrid approach.
49

Wang, Huiqing, Juan Wang, Zhipeng Feng, Ying Li, and Hong Zhao. "PD-BertEDL: An Ensemble Deep Learning Method Using BERT and Multivariate Representation to Predict Peptide Detectability." International Journal of Molecular Sciences 23, no. 20 (October 16, 2022): 12385. http://dx.doi.org/10.3390/ijms232012385.

Abstract:
Peptide detectability is defined as the probability of identifying a peptide from a mixture of standard samples, which is a key step in protein identification and analysis. Exploring effective methods for predicting peptide detectability is helpful for disease treatment and clinical research. However, most existing computational methods for predicting peptide detectability rely on a single type of information. With the increasing complexity of feature representation, it is necessary to explore the influence of multivariate information on peptide detectability. Thus, we propose an ensemble deep learning method, PD-BertEDL. Bidirectional encoder representations from transformers (BERT) is introduced to capture the context information of peptides. Context information, sequence information, and physicochemical information of peptides were combined to construct the multivariate feature space of peptides. We use different deep learning methods to capture the high-quality features of different categories of peptide information and use the average fusion strategy to integrate the three model prediction results to solve the heterogeneity problem and to enhance the robustness and adaptability of the model. The experimental results show that PD-BertEDL is superior to the existing prediction methods, and that it can effectively predict peptide detectability and provide strong support for protein identification and quantitative analysis, as well as disease treatment.
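The average fusion strategy mentioned in this abstract can be sketched as a plain arithmetic mean of the per-peptide probabilities produced by each model. How PD-BertEDL weights or thresholds its three sub-models is not given in the abstract, so the equal weights, the 0.5 decision threshold, and the function names here are assumptions.

```python
def average_fusion(model_probabilities):
    """Average per-sample probabilities across models.

    model_probabilities: one list per model, each holding one probability per sample.
    """
    if not model_probabilities:
        raise ValueError("at least one model is required")
    n_samples = len(model_probabilities[0])
    if any(len(p) != n_samples for p in model_probabilities):
        raise ValueError("all models must score the same samples")
    return [sum(column) / len(column) for column in zip(*model_probabilities)]


def to_labels(probabilities, threshold=0.5):
    """Turn fused probabilities into 0/1 calls (threshold is an illustrative choice)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up outputs of three models for two peptides.
toy_fused = average_fusion([[0.2, 0.9], [0.4, 0.7], [0.6, 0.8]])
toy_labels = to_labels(toy_fused)
```

Averaging lets a confident error from one model be offset by the other two, which is one way to read the robustness claim in the abstract.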
50

Nguyen, Trinh‐Trung‐Duong, Nguyen‐Quoc‐Khanh Le, Quang‐Thai Ho, Dinh‐Van Phan, and Yu‐Yen Ou. "Using Language Representation Learning Approach to Efficiently Identify Protein Complex Categories in Electron Transport Chain." Molecular Informatics 39, no. 10 (July 16, 2020): 2000033. http://dx.doi.org/10.1002/minf.202000033.
