Journal articles on the topic "Protein language models"

Below are the top 50 journal articles on research into the topic "Protein language models".

1. Tang, Lin. "Protein language models using convolutions". Nature Methods 21, no. 4 (April 2024): 550. http://dx.doi.org/10.1038/s41592-024-02252-3.

2. Ali, Sarwan, Prakash Chourasia, and Murray Patterson. "When Protein Structure Embedding Meets Large Language Models". Genes 15, no. 1 (December 23, 2023): 25. http://dx.doi.org/10.3390/genes15010025.

Abstract:
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
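
The contact-map representation at the heart of this approach is simple to sketch. Below is a minimal, hypothetical Python illustration; the 8 Å C-alpha cutoff and the flattening of the upper triangle into a feature vector are common conventions assumed here, not necessarily the paper's exact recipe.

```python
import numpy as np

def contact_map(ca_coords: np.ndarray, threshold: float = 8.0) -> np.ndarray:
    """Binary contact map from C-alpha coordinates of shape (n_residues, 3).
    Residues i and j are 'in contact' if their C-alpha atoms lie within
    `threshold` angstroms of each other."""
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    return (dists < threshold).astype(np.int8)

# Toy usage: five residues spaced 3.8 A apart along the x-axis.
coords = np.arange(5)[:, None] * np.array([3.8, 0.0, 0.0])
cm = contact_map(coords)

# Flatten the upper triangle into a fixed-layout vector that could then be
# combined with LLM-derived sequence features for a supervised classifier.
features = cm[np.triu_indices_from(cm, k=1)]
```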

3. Ferruz, Noelia, and Birte Höcker. "Controllable protein design with language models". Nature Machine Intelligence 4, no. 6 (June 2022): 521–32. http://dx.doi.org/10.1038/s42256-022-00499-z.

4. Li, Xiang, Zhuoyu Wei, Yueran Hu, and Xiaolei Zhu. "GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models". International Journal of Biological Macromolecules 280 (November 2024): 135599. http://dx.doi.org/10.1016/j.ijbiomac.2024.135599.

5. Singh, Arunima. "Protein language models guide directed antibody evolution". Nature Methods 20, no. 6 (June 2023): 785. http://dx.doi.org/10.1038/s41592-023-01924-w.

6. Tran, Chau, Siddharth Khadkikar, and Aleksey Porollo. "Survey of Protein Sequence Embedding Models". International Journal of Molecular Sciences 24, no. 4 (February 14, 2023): 3775. http://dx.doi.org/10.3390/ijms24043775.

Abstract:
Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).
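
As a concrete illustration of the embedding-and-compare workflow these models enable, the sketch below mean-pools per-residue ESM-2 representations into fixed-size vectors and compares a reference sequence to a variant by cosine similarity. It uses the fair-esm package; the toy sequences and the choice of the small 8M-parameter checkpoint are assumptions for a quick demo, not the survey's exact protocol.

```python
# pip install fair-esm torch
import torch
import esm

model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()  # small checkpoint for a quick demo
model.eval()
batch_converter = alphabet.get_batch_converter()

def embed(seq: str) -> torch.Tensor:
    """Fixed-size embedding: mean of the final-layer per-residue representations."""
    _, _, tokens = batch_converter([("query", seq)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[model.num_layers])
    reps = out["representations"][model.num_layers]
    return reps[0, 1:-1].mean(dim=0)  # drop BOS/EOS tokens, average over residues

wt  = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # hypothetical reference sequence
mut = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA")  # hypothetical point mutant
print(torch.cosine_similarity(wt, mut, dim=0).item())
```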

7. Pokharel, Suresh, Pawel Pratyush, Hamid D. Ismail, Junfeng Ma, and Dukka B. KC. "Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction". International Journal of Molecular Sciences 24, no. 21 (November 6, 2023): 16000. http://dx.doi.org/10.3390/ijms242116000.

Abstract:
O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite the great progress in experimentally mapping O-GlcNAc sites, there is an unmet need to develop robust prediction tools that can effectively locate the presence of O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for prediction of protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three protein sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of O-GlcNAc sites and also evaluated various ensemble strategies to integrate embeddings from these protein language models. Upon investigation, the decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on these individual language models as well as other fusion approaches and other existing predictors in almost all of the parameters evaluated. The precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and diseases. Moreover, these findings also indicate the effectiveness of combined uses of multiple protein language models in post-translational modification prediction and open exciting avenues for further research and exploration in other protein downstream tasks. LM-OGlcNAc-Site’s web server and source code are publicly available to the community.
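
Decision-level fusion of this kind boils down to training one classifier per embedding model and averaging their predicted probabilities. The sketch below uses random stand-in features; the embedding dimensions, classifier choice, and threshold are illustrative assumptions, not the published LM-OGlcNAc-Site pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for per-site embeddings from three pLMs (dimensions are assumptions).
X_ankh, X_esm2, X_prott5 = (rng.normal(size=(200, d)) for d in (768, 1280, 1024))
y = rng.integers(0, 2, size=200)  # 1 = O-GlcNAcylated site, 0 = not

# One classifier per language model ...
heads = [LogisticRegression(max_iter=1000).fit(X, y)
         for X in (X_ankh, X_esm2, X_prott5)]

# ... fused at the decision level by averaging predicted probabilities.
def fused_proba(feature_sets):
    return np.mean([h.predict_proba(X)[:, 1]
                    for h, X in zip(heads, feature_sets)], axis=0)

predictions = fused_proba([X_ankh, X_esm2, X_prott5]) > 0.5
```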

8. Wang, Wenkai, Zhenling Peng, and Jianyi Yang. "Single-sequence protein structure prediction using supervised transformer protein language models". Nature Computational Science 2, no. 12 (December 19, 2022): 804–14. http://dx.doi.org/10.1038/s43588-022-00373-3.

9. Pang, Yihe, and Bin Liu. "IDP-LM: Prediction of protein intrinsic disorder and disorder functions based on language models". PLOS Computational Biology 19, no. 11 (November 22, 2023): e1011657. http://dx.doi.org/10.1371/journal.pcbi.1011657.

Abstract:
Intrinsically disordered proteins (IDPs) and regions (IDRs) are a class of functionally important proteins and regions that lack stable three-dimensional structures under native physiologic conditions. They participate in critical biological processes and thus are associated with the pathogenesis of many severe human diseases. Identifying IDPs/IDRs and their functions will be helpful for a comprehensive understanding of protein structures and functions, and will inform studies of rational drug design. Over the past decades, the exponential growth in the number of proteins with sequence information has deepened the gap between uncharacterized and annotated disordered sequences. Protein language models have recently demonstrated powerful abilities to capture complex structural and functional information from enormous quantities of unlabelled protein sequences, providing an opportunity to apply them to uncover intrinsic disorder and its biological properties from amino acid sequences. In this study, we propose a computational predictor called IDP-LM for predicting intrinsic disorder and disorder functions by leveraging pre-trained protein language models. IDP-LM takes as its exclusive inputs the embeddings extracted from three pre-trained protein language models: ProtBERT, ProtT5 and a disorder-specific language model (IDP-BERT). Ablation analysis showed that IDP-BERT provides fine-grained feature representations of disorder and that the combination of the three language models is key to the performance improvement of IDP-LM. Evaluation results on independent test datasets demonstrated that IDP-LM provides high-quality predictions for intrinsic disorder and four common disordered functions.

10. Weber, Leon, Kirsten Thobe, Oscar Arturo Migueles Lozano, Jana Wolf, and Ulf Leser. "PEDL: extracting protein–protein associations using deep language models and distant supervision". Bioinformatics 36, Supplement_1 (July 1, 2020): i490–i498. http://dx.doi.org/10.1093/bioinformatics/btaa430.

Abstract:
Motivation: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance.
Results: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA.
Availability and implementation: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article.
Supplementary information: Supplementary data are available at Bioinformatics online.

11. Wang, Yang. "Enhanced protein function prediction by fusion embedding based on protein language models". Highlights in Science, Engineering and Technology 66 (September 20, 2023): 177–84. http://dx.doi.org/10.54097/hset.v66i.11697.

Abstract:
Natural language models can be applied to tasks beyond natural language, such as protein prediction, but prediction quality is often low and the models consume substantial computational resources. This paper proposes a fusion embedding model that fuses information of different dimensions to improve prediction performance while reducing computational cost. The approach is validated on the downstream task of protein function prediction and provides a reference for solving practical tasks with fusion embedding methods.

12. Sun, Yuanfei, and Yang Shen. "Variant effect prediction using structure-informed protein language models". Biophysical Journal 122, no. 3 (February 2023): 473a. http://dx.doi.org/10.1016/j.bpj.2022.11.2537.

13. Qu, Yang, Zitong Niu, Qiaojiao Ding, Taowa Zhao, Tong Kong, Bing Bai, Jianwei Ma, Yitian Zhao, and Jianping Zheng. "Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction". International Journal of Molecular Sciences 24, no. 22 (November 18, 2023): 16496. http://dx.doi.org/10.3390/ijms242216496.

Abstract:
Machine learning has been increasingly utilized in the field of protein engineering, and research directed at predicting the effects of protein mutations has attracted increasing attention. So far, the best results have been achieved by methods based on protein language models, which are trained on large numbers of unlabeled protein sequences to capture the hidden evolutionary rules in protein sequences and are therefore able to predict fitness from sequence alone. Although numerous such models and methods have been successfully employed in practical protein engineering, most studies have been limited to constructing more complex language models that capture richer protein sequence feature information and using that information for unsupervised protein fitness prediction. Considerable untapped potential remains in these models, for example whether prediction performance can be further improved by integrating different models. Furthermore, owing to the nonlinear relationship between protein fitness and the quantification of specific functionalities, how to utilize large-scale models to predict mutational effects on quantifiable protein properties has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting the mutational effects of proteins that integrates protein sequence features extracted from multiple large protein language models, as well as evolutionary coupling features extracted from homologous sequences, while comparing the differences between linear regression and deep learning models in mapping these features to quantifiable functional changes. We tested our approach on a dataset of 17 protein deep mutational scans and show that the integrated approach together with linear regression achieves higher prediction accuracy and generalization. Moreover, we further illustrate the reliability of the integrated approach by exploring differences in predictive performance across species and protein sequence lengths, and by visualizing the clustering of ensemble and non-ensemble features.
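
The regression arm of such an ensemble is easy to sketch: concatenate features from several language models (plus evolutionary-coupling scores) and fit a regularized linear model against measured fitness. Everything below, from the feature dimensions to the synthetic data, is a placeholder, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_variants = 500
X = np.hstack([
    rng.normal(size=(n_variants, 1280)),  # stand-in: pLM A mean-pooled embeddings
    rng.normal(size=(n_variants, 1024)),  # stand-in: pLM B mean-pooled embeddings
    rng.normal(size=(n_variants, 1)),     # stand-in: evolutionary-coupling score
])
y = rng.normal(size=n_variants)           # measured fitness from a deep mutational scan

model = RidgeCV(alphas=np.logspace(-3, 3, 13))  # regularization strength chosen by CV
print(cross_val_score(model, X, y, cv=5, scoring="r2"))
```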

14. Thumuluri, Vineet, Hannah-Marie Martiny, Jose J. Almagro Armenteros, Jesper Salomon, Henrik Nielsen, and Alexander Rosenberg Johansen. "NetSolP: predicting protein solubility in Escherichia coli using language models". Bioinformatics 38, no. 4 (November 27, 2021): 941–46. http://dx.doi.org/10.1093/bioinformatics/btab801.

Abstract:
Motivation: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased.
Results: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences.
Availability and implementation: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0.
Supplementary information: Supplementary data are available at Bioinformatics online.

15. Deutschmann, Nicolas, Aurelien Pelissier, Anna Weber, Shuaijun Gao, Jasmina Bogojeska, and María Rodríguez Martínez. "Do domain-specific protein language models outperform general models on immunology-related tasks?" ImmunoInformatics 14 (June 2024): 100036. http://dx.doi.org/10.1016/j.immuno.2024.100036.

16. Wang, Bo, and Wenjin Li. "Advances in the Application of Protein Language Modeling for Nucleic Acid Protein Binding Site Prediction". Genes 15, no. 8 (August 18, 2024): 1090. http://dx.doi.org/10.3390/genes15081090.

Abstract:
Protein and nucleic acid binding site prediction is a critical computational task that benefits a wide range of biological processes. Previous studies have shown that feature selection holds particular significance for this prediction task, making the generation of more discriminative features a key area of interest for many researchers. Recent progress has shown the power of protein language models in handling protein sequences, in leveraging the strengths of attention networks, and in successful applications to tasks such as protein structure prediction. This naturally raises the question of the applicability of protein language models in predicting protein and nucleic acid binding sites. Various approaches have explored this potential. This paper first describes the development of protein language models. Then, a systematic review of the latest methods for predicting protein and nucleic acid binding sites is conducted by covering benchmark sets, feature generation methods, performance comparisons, and feature ablation studies. These comparisons demonstrate the importance of protein language models for the prediction task. Finally, the paper discusses the challenges of protein and nucleic acid binding site prediction and proposes possible research directions and future trends. The purpose of this survey is to furnish researchers with actionable suggestions for comprehending the methodologies used in predicting protein–nucleic acid binding sites, fostering the creation of protein-centric language models, and tackling real-world obstacles encountered in this field.

17. Bhat, Suhaas, Garyk Brixi, Kalyan Palepu, Lauren Hong, Vivian Yudistyra, Tianlai Chen, Sophia Vincoff, Lin Zhao, and Pranam Chatterjee. "Abstract C118: Design of programmable peptide-guided oncoprotein degraders via generative language models". Molecular Cancer Therapeutics 22, no. 12_Supplement (December 1, 2023): C118. http://dx.doi.org/10.1158/1535-7163.targ-23-c118.

Abstract:
Targeted protein degradation of pathogenic proteins represents a powerful new treatment strategy for multiple cancers. Unfortunately, a sizable portion of these proteins are considered "undruggable" by standard small molecule-based approaches, including PROTACs and molecular glues, largely due to their disordered nature, instability, and lack of binding site accessibility. As a more modular strategy, we have developed a genetically-encoded protein architecture by fusing target-specific peptides to E3 ubiquitin ligase domains for selective and potent intracellular degradation of oncoproteins. To enable programmability of our system, we develop a suite of algorithms that enable the design of target-specific peptides via protein language model (pLM) embeddings, without the requirement of 3D structures. First, we train a model that leverages pLM embeddings to efficiently select high-affinity peptides from natural protein interaction interfaces. Next, we develop a high-accuracy discriminator, based on the contrastive language-image pretraining (CLIP) architecture underlying OpenAI's DALL-E model, to prioritize and screen peptides with selectivity to a specified target oncoprotein. As input to the discriminator, we create a Gaussian diffusion generator to sample a pLM latent space, fine-tuned on experimentally-valid peptide sequences. Finally, to enable de novo design of binding peptides, we train an instance of GPT-2 with protein interacting sequences to enable peptide generation conditioned on target oncoprotein sequences. Our models demonstrate low perplexities across both existing and generated peptide sequences, highlighting their robust generative capability. By experimentally fusing model-derived peptides to E3 ubiquitin ligase domains, we reliably identify candidates exhibiting robust and selective endogenous degradation of diverse "undruggable" oncoproteins in cancer cell models, including tumorigenic regulators such as β-catenin and TRIM8, as well as oncogenic fusion proteins such as EWS-FLI1, PAX3-FOXO1, and DNAJB1-PRKACA. We further show that our peptide-guided degraders have negligible off-target effects via whole-cell proteomics and demonstrate their modulation of transcriptional and apoptotic pathways, motivating further translation of our therapeutic platform. Together, our work establishes a CRISPR-analogous system for programmable protein degradation applications across the oncoproteome.

18. Mardikoraem, Mehrsa, and Daniel Woldring. "Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods". Pharmaceutics 15, no. 5 (April 25, 2023): 1337. http://dx.doi.org/10.3390/pharmaceutics15051337.

Abstract:
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance include handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physiochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Elaboration on performance is provided over protein fitness, protein size, and sampling techniques. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased the predictive performance of the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was rigorous enough in stability prediction (F1-score = 92%).
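
The SMOTE-plus-classifier arm of this comparison can be sketched with scikit-learn and imbalanced-learn as below; the random feature matrix stands in for One-Hot/UniRep/ESM encodings, and the key design point is that oversampling is applied only to the training split, never to the held-out test set.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 64))             # stand-in for encoded protein sequences
y = (rng.random(1000) < 0.05).astype(int)   # ~5% high-fitness positives

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # train split only

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print(f1_score(y_te, clf.predict(X_te)))
```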

19. Nana Teukam, Yves Gaetan, Loïc Kwate Dassi, Matteo Manica, Daniel Probst, Philippe Schwaller, and Teodoro Laino. "Language models can identify enzymatic binding sites in protein sequences". Computational and Structural Biotechnology Journal 23 (December 2024): 1929–37. http://dx.doi.org/10.1016/j.csbj.2024.04.012.

20. Yadalam, Pradeep Kumar, Ramya Ramadoss, Pradeep Kumar R, and Jishnu Krishna Kumar. "Pre-Trained Language Models Based Sequence Prediction of Wnt-Sclerostin Protein Sequences in Alveolar Bone Formation". Journal of Pioneering Medical Science 12, no. 3 (December 31, 2023): 55–60. http://dx.doi.org/10.61091/jpms202312311.

Abstract:
Background and Introduction: Osteocytes, the most numerous bone cells, produce sclerostin. A predictive model of the sclerostin protein sequence can help create novel medications and promote alveolar bone formation in periodontitis and other oral bone illnesses, including osteoporosis. Neural networks examine protein variants for protein engineering and predict the impact of variants on structure and function. Proteins with improved function and stability have been engineered using LLMs and CNNs. Sequence-based models, especially protein LLMs, predict variant effects, fitness, post-translational modifications, biophysical properties, and protein structure. CNNs trained on structural data also improve enzyme function. Whether these models differ or make similar predictions is unknown. This study uses pre-trained language models to predict Wnt-sclerostin protein sequences in alveolar bone formation. Methods: Sclerostin and related proteins (UniProt IDs Q9BQB4, Q9BQB4-1, Q9BQB4-2, Q6X4U4, O75197) were identified and quality-checked, and their FASTA sequences were analyzed with DeepBIO, a one-stop web service that lets researchers build biological deep-learning architectures and visualize biological sequencing data. LLM-based Reformer, AAPNP, TEXTRGNN, VDCNN, and RNN-CNN models were trained on sequence-based datasets; each dataset was randomly partitioned into 1000 training and 200 testing examples to tune hyperparameters and measure performance. Results: The Reformer, AAPNP, TEXTRGNN, VDCNN, and RNN-CNN models achieved 93%, 64%, 51%, 91%, and 64% accuracy, respectively. Conclusion: Protein sequence-based large language models are growing rapidly, and this line of R&D is helping solve complicated challenges.

21. Wang, Yan, Huiting Sun, Nan Sheng, Kai He, Wenjv Hou, Ziqi Zhao, Qixing Yang, and Lan Huang. "ESMSec: Prediction of Secreted Proteins in Human Body Fluids Using Protein Language Models and Attention". International Journal of Molecular Sciences 25, no. 12 (June 9, 2024): 6371. http://dx.doi.org/10.3390/ijms25126371.

Abstract:
The secreted proteins of human body fluid have the potential to be used as biomarkers for diseases. These biomarkers can be used for early diagnosis and risk prediction of diseases, so the study of secreted proteins of human body fluid has great application value. In recent years, the deep-learning-based transformer language model has transferred from the field of natural language processing (NLP) to the field of proteomics, leading to the development of protein language models (PLMs) for protein sequence representation. Here, we propose a deep learning framework called ESM Predict Secreted Proteins (ESMSec) to predict three types of proteins secreted in human body fluid. The ESMSec is based on the ESM2 model and attention architecture. Specifically, the protein sequence data are firstly put into the ESM2 model to extract the feature information from the last hidden layer, and all the input proteins are encoded into a fixed 1000 × 480 matrix. Secondly, multi-head attention with a fully connected neural network is employed as the classifier to perform binary classification according to whether they are secreted into each body fluid. Our experiment utilized three human body fluids that are important and ubiquitous markers. Experimental results show that ESMSec achieved average accuracy of 0.8486, 0.8358, and 0.8325 on the testing datasets for plasma, cerebrospinal fluid (CSF), and seminal fluid, which on average outperform the state-of-the-art (SOTA) methods. The outstanding performance results of ESMSec demonstrate that the ESM can improve the prediction performance of the model and has great potential to screen the secretion information of human body fluid proteins.
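
The classifier stage described above (per-residue ESM-2 features padded to a 1000 × 480 matrix, self-attention, then a binary head) might look roughly like the PyTorch sketch below; the layer sizes and mean-pooling are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class SecretionHead(nn.Module):
    """Toy attention classifier over per-residue ESM-2 features."""
    def __init__(self, dim: int = 480, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):                  # x: (batch, 1000, 480)
        h, _ = self.attn(x, x, x)          # self-attention over residue positions
        return self.head(h.mean(dim=1))    # mean-pool, then one secretion logit

logits = SecretionHead()(torch.randn(4, 1000, 480))  # toy batch of four proteins
```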

22. Zhu, Yi-Heng, Chengxin Zhang, Dong-Jun Yu, and Yang Zhang. "Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction". PLOS Computational Biology 18, no. 12 (December 22, 2022): e1010793. http://dx.doi.org/10.1371/journal.pcbi.1010793.

Abstract:
Accurate identification of protein function is critical to elucidate life mechanisms and design new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase of the GO prediction accuracy compared to the state-of-the-art approaches in all aspects of molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in the utilization of pre-trained transformer language models which can extract discriminative functional pattern from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, it was found that the combination of the network scores with the complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotations from sequence alone.

23. Lin, Zeming, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model". Science 379, no. 6637 (March 17, 2023): 1123–30. http://dx.doi.org/10.1126/science.ade2574.

Abstract:
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.
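
The models described here are released in the fair-esm package, including the ESMFold structure predictor; a minimal single-sequence folding call might look like this sketch (the sequence is an arbitrary toy example, and the checkpoint download plus its openfold dependencies are substantial).

```python
# pip install "fair-esm[esmfold]"  -- heavy: large checkpoint plus openfold deps
import torch
import esm

model = esm.pretrained.esmfold_v1().eval()

# Arbitrary toy sequence; any single protein sequence works.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # atomic-level structure as PDB text

with open("prediction.pdb", "w") as fh:
    fh.write(pdb_string)
```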

24. Strodthoff, Nils, Patrick Wagner, Markus Wenzel, and Wojciech Samek. "UDSMProt: universal deep sequence models for protein classification". Bioinformatics 36, no. 8 (January 8, 2020): 2401–9. http://dx.doi.org/10.1093/bioinformatics/btaa003.

Abstract:
Motivation: Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to single classification tasks and rely on handcrafted features, such as position-specific-scoring matrices from expensive database searches. We argue that this level of performance can be reached or even be surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and transferring it to specific tasks by a simple fine-tuning step.
Results: We put forward a universal deep sequence model that is pre-trained on unlabeled protein sequences from Swiss-Prot and fine-tuned on protein classification tasks. We apply it to three prototypical tasks, namely enzyme class prediction, gene ontology prediction and remote homology and fold detection. The proposed method performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them. These results stress the possibility of inferring protein properties from the sequence alone and, on more general grounds, the prospects of modern natural language processing methods in omics. Moreover, we illustrate the prospects for explainable machine learning methods in this field by selected case studies.
Availability and implementation: Source code is available under https://github.com/nstrodt/UDSMProt.
Supplementary information: Supplementary data are available at Bioinformatics online.

25. Gonzales, Mark Edward M., Jennifer C. Ureta, and Anish M. S. Shrestha. "Protein embeddings improve phage-host interaction prediction". PLOS ONE 18, no. 7 (July 24, 2023): e0289030. http://dx.doi.org/10.1371/journal.pone.0289030.

Abstract:
With the growing interest in using phages to combat antimicrobial resistance, computational methods for predicting phage-host interactions have been explored to help shortlist candidate phages. Most existing models consider entire proteomes and rely on manual feature engineering, which poses difficulty in selecting the most informative sequence properties to serve as input to the model. In this paper, we framed phage-host interaction prediction as a multiclass classification problem that takes as input the embeddings of a phage’s receptor-binding proteins, which are known to be the key machinery for host recognition, and predicts the host genus. We explored different protein language models to automatically encode these protein sequences into dense embeddings without the need for additional alignment or structural information. We show that the use of embeddings of receptor-binding proteins presents improvements over handcrafted genomic and protein sequence features. The highest performance was obtained using the transformer-based protein language model ProtT5, resulting in a 3% to 4% increase in weighted F1 and recall scores across different prediction confidence thresholds, compared to using selected handcrafted sequence features.

26. Becker, Felix, and Mario Stanke. "learnMSA2: deep protein multiple alignments with large language and hidden Markov models". Bioinformatics 40, Supplement_2 (September 1, 2024): ii79–ii86. http://dx.doi.org/10.1093/bioinformatics/btae381.

Abstract:
Motivation: For the alignment of large numbers of protein sequences, tools are predominant that decide to align two residues using only simple prior knowledge, e.g. amino acid substitution matrices, and using only part of the available data. The accuracy of state-of-the-art programs declines with decreasing sequence identity and when increasingly large numbers of sequences are aligned. Recently, transformer-based deep-learning models started to harness the vast amount of protein sequence data, resulting in powerful pretrained language models whose main purpose is to generate high-dimensional numerical representations (embeddings) for individual sites that agglomerate evolutionary, structural, and biophysical information.
Results: We extend the traditional profile hidden Markov model so that it takes as inputs unaligned protein sequences and the corresponding embeddings. We fit the model with gradient descent using our existing differentiable hidden Markov layer. All sequences and their embeddings are jointly aligned to a model of the protein family. We report that our upgraded HMM-based aligner, learnMSA2, combined with the ProtT5-XL protein language model aligns on average almost 6 percentage points more columns correctly than the best amino acid-based competitor and scales well with sequence number. The relative advantage of learnMSA2 over other programs tends to be greater when the sequence identity is lower and when the number of sequences is larger. Our results strengthen the evidence on the rich information contained in protein language models' embeddings and their potential downstream impact on the field of bioinformatics.
Availability and implementation: https://github.com/Gaius-Augustus/learnMSA, PyPI and Bioconda; evaluation: https://github.com/felbecker/snakeMSA.

27. Outeiral, Carlos, and Charlotte M. Deane. "Codon language embeddings provide strong signals for use in protein engineering". Nature Machine Intelligence 6, no. 2 (February 23, 2024): 170–79. http://dx.doi.org/10.1038/s42256-024-00791-0.

Abstract:
Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models’ capacities surpassing the size of the very datasets they were trained on. Here we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. In some tasks, such as species recognition, prediction of protein and transcript abundance or melting point estimation, we show that a language model trained on codons outperforms every other published protein language model, including some that contain over 50 times more parameters. These results indicate that, in addition to commonly studied scale and model complexity, the information content of biological data provides an orthogonal direction to improve the power of machine learning in biology.
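
The essential difference from amino acid-level models is the tokenization unit, which a hypothetical helper (not the authors' code) makes concrete:

```python
def codon_tokens(cds: str) -> list[str]:
    """Split a coding sequence into codon tokens, the unit a codon-level
    language model consumes instead of single amino acids."""
    cds = cds.upper().replace("U", "T")
    assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

# 'ATGGCTGCATAA' -> ['ATG', 'GCT', 'GCA', 'TAA']. Synonymous codons such as
# GCT and GCA (both alanine) remain distinct tokens, which is exactly the
# signal an amino acid tokenization throws away.
print(codon_tokens("ATGGCTGCATAA"))
```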

28. Medina-Ortiz, David, Seba Contreras, Diego Fernández, Nicole Soto-García, Iván Moya, Gabriel Cabas-Mora, and Álvaro Olivera-Nappa. "Protein Language Models and Machine Learning Facilitate the Identification of Antimicrobial Peptides". International Journal of Molecular Sciences 25, no. 16 (August 14, 2024): 8851. http://dx.doi.org/10.3390/ijms25168851.

Abstract:
Peptides are bioactive molecules whose functional versatility in living organisms has led to successful applications in diverse fields. In recent years, the amount of data describing peptide sequences and function collected in open repositories has substantially increased, allowing the application of more complex computational models to study the relations between peptide composition and function. This work introduces AMP-Detector, a sequence-based classification model for the detection of peptides’ functional biological activity, focusing on accelerating the discovery and de novo design of potential antimicrobial peptides (AMPs). AMP-Detector introduces a novel sequence-based pipeline to train binary classification models, integrating protein language models and machine learning algorithms. This pipeline produced 21 models targeting antimicrobial, antiviral, and antibacterial activity, achieving average precision exceeding 83%. Benchmark analyses revealed that our models outperformed existing methods for AMPs and delivered comparable results for other types of biological activity. Utilizing the Peptide Atlas, we applied AMP-Detector to discover over 190,000 potential AMPs and demonstrated how it can be combined with generative learning to aid de novo design, resulting in over 500 novel AMPs. The combination of our methodology, robust models, and a generative design strategy offers a significant advancement in peptide-based drug discovery and represents a pivotal tool for therapeutic applications.

29. Chu, Hongkang, and Taigang Liu. "Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models". International Journal of Molecular Sciences 25, no. 8 (April 19, 2024): 4507. http://dx.doi.org/10.3390/ijms25084507.

Abstract:
Identification of druggable proteins can greatly reduce the cost of discovering new potential drugs. Traditional experimental approaches to exploring these proteins are often costly, slow, and labor-intensive, making them impractical for large-scale research. In response, recent decades have seen a rise in computational methods. These alternatives support drug discovery by creating advanced predictive models. In this study, we proposed a fast and precise classifier for the identification of druggable proteins using a protein language model (PLM) with fine-tuned evolutionary scale modeling 2 (ESM-2) embeddings, achieving 95.11% accuracy on the benchmark dataset. Furthermore, we made a careful comparison to examine the predictive abilities of ESM-2 embeddings and position-specific scoring matrix (PSSM) features by using the same classifiers. The results suggest that ESM-2 embeddings outperformed PSSM features in terms of accuracy and efficiency. Recognizing the potential of language models, we also developed an end-to-end model based on the generative pre-trained transformers 2 (GPT-2) with modifications. To our knowledge, this is the first time a large language model (LLM) GPT-2 has been deployed for the recognition of druggable proteins. Additionally, a more up-to-date dataset, known as Pharos, was adopted to further validate the performance of the proposed model.

30. Pang, Yihe, and Bin Liu. "DMFpred: Predicting protein disorder molecular functions based on protein cubic language model". PLOS Computational Biology 18, no. 10 (October 31, 2022): e1010668. http://dx.doi.org/10.1371/journal.pcbi.1010668.

Abstract:
Intrinsically disordered proteins and regions (IDPs/IDRs) are widespread in living organisms and perform various essential molecular functions. These functions are summarized in six general categories: entropic chain, assembler, scavenger, effector, display site, and chaperone. The alteration of IDP functions is responsible for many human diseases; therefore, identifying the functions of disordered proteins is helpful for studies of drug target discovery and rational drug design. Experimental identification of the molecular functions of IDPs in the wet lab is an expensive and laborious procedure that is not applicable on a large scale. Some computational methods have been proposed, mainly focusing on predicting the entropic chain function of IDRs, while computational predictive methods for the remaining five important categories of disordered molecular functions are still needed. Motivated by the growing number of experimentally annotated functional sequences and the need to expand the coverage of disordered protein function predictors, we proposed DMFpred for disordered molecular function prediction, covering disordered assembler, scavenger, effector, display site and chaperone. DMFpred employs the Protein Cubic Language Model (PCLM), which incorporates three protein language models for characterizing the sequence, structural and functional features of proteins, and attention-based alignment for understanding the relationships among the three captured features and generating a joint representation of proteins. The PCLM was pre-trained with large-scale IDR sequences and fine-tuned with functionally annotated sequences for molecular function prediction. The predictive performance evaluation on five categories of functional and multi-functional residues suggested that DMFpred provides high-quality predictions. The web server of DMFpred can be freely accessed from http://bliulab.net/DMFpred/.

31. Valentini, Giorgio, Dario Malchiodi, Jessica Gliozzo, Marco Mesiti, Mauricio Soto-Gomez, Alberto Cabri, Justin Reese, Elena Casiraghi, and Peter N. Robinson. "The promises of large language models for protein design and modeling". Frontiers in Bioinformatics 3 (November 23, 2023). http://dx.doi.org/10.3389/fbinf.2023.1304099.

Abstract:
The recent breakthroughs of Large Language Models (LLMs) in the context of natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the “language of proteins” invite the application and adaptation of LLMs to protein modeling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Indeed, protein language models have already been trained to accurately predict protein properties and to generate novel, functionally characterized proteins, achieving state-of-the-art results. In this paper we discuss the promises and the open challenges raised by this novel and exciting research area, and we propose our perspective on how LLMs will affect protein modeling and design.

32. Avraham, Orly, Tomer Tsaban, Ziv Ben-Aharon, Linoy Tsaban, and Ora Schueler-Furman. "Protein language models can capture protein quaternary state". BMC Bioinformatics 24, no. 1 (November 14, 2023). http://dx.doi.org/10.1186/s12859-023-05549-w.

Abstract:
Background: Determining a protein’s quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, that apply computational natural-language models to proteins successfully capture secondary structure, protein cell localization and other characteristics, from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction.
Results: We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary state related information is included in such embeddings.
Conclusions: QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems, as well as for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab.research.google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb.

33. Boshar, Sam, Evan Trop, Bernardo P. de Almeida, Liviu Copoiu, and Thomas Pierrot. "Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks". Bioinformatics, August 30, 2024. http://dx.doi.org/10.1093/bioinformatics/btae529.

Abstract:
Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, due to few tasks pairing proteins with the coding DNA sequences (CDS) that can be processed by gLMs.
Results: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive and even outperform their pLM counterparts on some tasks. The best performance was achieved using the retrieved CDS compared to sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data and, in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.
Availability and implementation: We make our inference code, 3mer pre-trained model weights and datasets available.
Supplementary information: Supplementary data are available at Bioinformatics online.

34. An, Jingmin, and Xiaogang Weng. "Collectively encoding protein properties enriches protein language models". BMC Bioinformatics 23, no. 1 (November 8, 2022). http://dx.doi.org/10.1186/s12859-022-05031-z.

Abstract:
Natural language processing models pre-trained on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning on specific in-domain tasks. However, few studies have focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.

35. McWhite, Claire Darnell, Isabel Armour-Garb, and Mona Singh. "Leveraging protein language models for accurate multiple sequence alignments". Genome Research, July 6, 2023, gr.277675.123. http://dx.doi.org/10.1101/gr.277675.123.

Abstract:
Multiple sequence alignment is a critical step in the study of protein sequence and function. Typically, multiple sequence alignment algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. While successful, standard methods struggle on sets of proteins with low sequence identity - the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverage massive sequence datasets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to multiple sequence alignment, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of multiple sequence alignment algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
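
As a toy illustration of the underlying idea, scoring residue pairs by embedding similarity instead of a substitution matrix, the sketch below runs Needleman-Wunsch over cosine similarities of per-residue embeddings. Note that the published method instead clusters and orders embeddings, avoiding pairwise dynamic programming and guide trees altogether; this is only a didactic stand-in.

```python
import numpy as np

def embedding_alignment_score(E1: np.ndarray, E2: np.ndarray, gap: float = -0.5) -> float:
    """Needleman-Wunsch global alignment score where the substitution matrix is
    replaced by cosine similarities between per-residue embeddings (rows)."""
    E1 = E1 / np.linalg.norm(E1, axis=1, keepdims=True)
    E2 = E2 / np.linalg.norm(E2, axis=1, keepdims=True)
    S = E1 @ E2.T                                  # residue-vs-residue similarity
    n, m = S.shape
    F = np.zeros((n + 1, m + 1))
    F[1:, 0] = gap * np.arange(1, n + 1)
    F[0, 1:] = gap * np.arange(1, m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i - 1, j - 1] + S[i - 1, j - 1],  # match/mismatch
                          F[i - 1, j] + gap,                  # gap in sequence 2
                          F[i, j - 1] + gap)                  # gap in sequence 1
    return F[n, m]  # traceback to recover the alignment is omitted for brevity

# Toy usage with random stand-ins for contextual embeddings.
rng = np.random.default_rng(3)
print(embedding_alignment_score(rng.normal(size=(30, 64)), rng.normal(size=(28, 64))))
```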

36. Jing, Xiaoyang, Fandi Wu, Xiao Luo, and Jinbo Xu. "Single-sequence protein structure prediction by integrating protein language models". Proceedings of the National Academy of Sciences 121, no. 13 (March 20, 2024). http://dx.doi.org/10.1073/pnas.2308788121.

Abstract:
Protein structure prediction has been greatly improved by deep learning in the past few years. However, the most successful methods rely on multiple sequence alignment (MSA) of the sequence homologs of the protein under prediction. In nature, a protein folds in the absence of its sequence homologs, and thus an MSA-free structure prediction method is desirable. Here, we develop a single-sequence-based protein structure prediction method, RaptorX-Single, by integrating several protein language models and a structure generation module, and then study its advantage over MSA-based methods. Our experimental results indicate that, in addition to running much faster than MSA-based methods such as AlphaFold2, RaptorX-Single outperforms AlphaFold2 and other MSA-free methods in predicting the structure of antibodies (after fine-tuning on antibody data), proteins with very few sequence homologs, and single mutation effects. By comparing different protein language models, our results show that not only the scale but also the training data of protein language models impact performance. RaptorX-Single also compares favorably to MSA-based AlphaFold2 when the protein under prediction has a large number of sequence homologs.
APA, Harvard, Vancouver, ISO, and other styles
37

Vitale, Rosario, Leandro A. Bugnon, Emilio Luis Fenoy, Diego H. Milone and Georgina Stegmayer. "Evaluating large language models for annotating proteins". Briefings in Bioinformatics 25, no. 3 (27 March 2024). http://dx.doi.org/10.1093/bib/bbae177.

Full text
Abstract (summary):
To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful in automatically growing the Pfam annotations, although at a low rate compared with protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam
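The transfer-learning protocol, frozen pLM embeddings followed by a small supervised model, reduces to a few lines of scikit-learn. In this sketch the embedding step is stubbed with random vectors and the family count is illustrative; the pipeline shape, not the numbers, is the point.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Stand-ins: one mean-pooled pLM embedding per protein (dim 1024)
# and a Pfam family label per protein.
X = rng.normal(size=(2000, 1024))
y = rng.integers(0, 20, size=2000)         # 20 toy families

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Small supervised model on top of frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))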
APA, Harvard, Vancouver, ISO, and other styles
38

Lin, Peicong, Huanyu Tao, Hao Li and Sheng-You Huang. "Protein–protein contact prediction by geometric triangle-aware protein language models". Nature Machine Intelligence, 19 October 2023. http://dx.doi.org/10.1038/s42256-023-00741-2.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Haselbeck, Florian, Maura John, Yuqi Zhang, Jonathan Pirnay, Juan Pablo Fuenzalida-Werner, Rubén D. Costa and Dominik G. Grimm. "Superior protein thermophilicity prediction with protein language model embeddings". NAR Genomics and Bioinformatics 5, no. 4 (11 October 2023). http://dx.doi.org/10.1093/nargab/lqad087.

Full text
Abstract (summary):
Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species that do not overlap with species in the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one nonthermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
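The evaluation the abstract describes, nested cross-validation scored with the Matthews correlation coefficient, can be reproduced generically with scikit-learn. The sketch below stubs the embeddings and labels; only the evaluation scaffolding reflects the paper's setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 256))            # stand-in pLM embeddings
y = rng.integers(0, 2, size=500)           # thermophilic yes/no

mcc = make_scorer(matthews_corrcoef)

# Inner loop tunes the regularization strength; outer loop estimates
# generalization. Both are scored with MCC.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     {"C": [0.01, 0.1, 1, 10]}, scoring=mcc, cv=3)
outer_scores = cross_val_score(inner, X, y, scoring=mcc, cv=5)
print("nested-CV MCC: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))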
APA, Harvard, Vancouver, ISO, and other styles
40

Ieremie, Ioan, Rob M. Ewing and Mahesan Niranjan. "Protein language models meet reduced amino acid alphabets". Bioinformatics, 3 February 2024. http://dx.doi.org/10.1093/bioinformatics/btae061.

Full text
Abstract (summary):
Motivation
Protein Language Models (PLMs), which borrowed ideas for modelling and inference from Natural Language Processing, have demonstrated the ability to extract meaningful representations in an unsupervised way. This led to significant performance improvements in several downstream tasks. Clustering amino acids based on their physicochemical properties to achieve reduced alphabets has been of interest in past research, but their application to PLMs or folding models is unexplored.
Results
Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information impacts learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost in alphabet reduction methods. We further show the ability of a structure prediction model (ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 of the 50 target proteins, reduced alphabets improve structural predictions with LDDT-Cα differences of up to 19%.
Availability
Trained models and code are available at github.com/Ieremie/reduced-alph-PLM
Supplementary information
Supplementary data are available at Bioinformatics online.
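Alphabet reduction itself is just a many-to-one mapping over residues. The grouping below is one common hydropathy-style scheme chosen for illustration, not one of the paper's exact alphabets.

# One illustrative 8-group reduction (groupings vary across schemes;
# this is not the paper's exact mapping).
GROUPS = {
    "AVLIMC": "A",   # hydrophobic
    "FWY":    "F",   # aromatic
    "ST":     "S",   # small hydroxyl
    "NQ":     "N",   # amide
    "DE":     "D",   # acidic
    "KRH":    "K",   # basic
    "G":      "G",   # glycine kept on its own
    "P":      "P",   # proline kept on its own
}
TABLE = str.maketrans({aa: rep for aas, rep in GROUPS.items() for aa in aas})

def reduce_alphabet(seq: str) -> str:
    """Translate a full 20-letter protein sequence to the reduced alphabet."""
    return seq.upper().translate(TABLE)

print(reduce_alphabet("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))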
APA, Harvard, Vancouver, ISO, and other styles
41

Pudžiuvelytė, Ieva, Kliment Olechnovič, Egle Godliauskaite, Kristupas Sermokas, Tomas Urbaitis, Giedrius Gasiunas and Darius Kazlauskas. "TemStaPro: protein thermostability prediction using sequence representations from protein language models". Bioinformatics, 20 March 2024. http://dx.doi.org/10.1093/bioinformatics/btae157.

Full text
Abstract (summary):
Motivation
Reliable prediction of protein thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled using machine learning and by taking advantage of the recent blossoming of deep learning methods for sequence analysis. These methods can facilitate training on more data and, possibly, enable the development of more versatile thermostability predictors for multiple ranges of temperatures.
Results
We applied the principle of transfer learning to predict protein thermostability using embeddings generated by protein language models (pLMs) from an input protein sequence. We used large pLMs that were pre-trained on hundreds of millions of known sequences. The embeddings from such models allowed us to efficiently train and validate a high-performing prediction method using over one million sequences that we collected from organisms with annotated growth temperatures. Our method, TemStaPro (Temperatures of Stability for Proteins), was used to predict the thermostability of CRISPR-Cas Class II effector proteins (C2EPs). Predictions indicated sharp differences among groups of C2EPs in terms of thermostability and were largely in tune with previously published and our newly obtained experimental data.
Availability and Implementation
TemStaPro software and the related data are freely available from https://github.com/ievapudz/TemStaPro and https://doi.org/10.5281/zenodo.7743637.
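One way to frame a multi-range thermostability predictor of this kind is as a bank of binary "stable above t °C" classifiers over pLM embeddings. The sketch below uses that framing with stubbed embeddings and invented thresholds; it is a guess at the shape of the method, not TemStaPro itself.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Stand-ins: per-protein pLM embeddings and annotated growth temperatures.
X = rng.normal(size=(1000, 256))
temps = rng.uniform(20, 90, size=1000)

# One binary classifier per temperature threshold.
thresholds = [40, 45, 50, 55, 60, 65]
clfs = {t: LogisticRegression(max_iter=1000).fit(X, temps >= t)
        for t in thresholds}

def predict_profile(x):
    """P(stable above t) for each threshold; a monotone profile
    localizes the protein's temperature range."""
    return {t: float(clf.predict_proba(x.reshape(1, -1))[0, 1])
            for t, clf in clfs.items()}

print(predict_profile(X[0]))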
APA, Harvard, Vancouver, ISO, and other styles
42

Kabir, Anowarul, Asher Moldwin, Yana Bromberg and Amarda Shehu. "In the Twilight Zone of Protein Sequence Homology: Do Protein Language Models Learn Protein Structure?" Bioinformatics Advances, 17 August 2024. http://dx.doi.org/10.1093/bioadv/vbae119.

Full text
Abstract (summary):
Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
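The zero-shot protocol the authors profile amounts to scoring sequence pairs by embedding similarity with no task-specific training. A minimal version, with random vectors standing in for mean-pooled pLM embeddings, looks like this.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(4)

# Stand-ins for mean-pooled pLM embeddings: a query and a database.
query = rng.normal(size=1280)
database = {f"prot_{i}": rng.normal(size=1280) for i in range(1000)}

# Zero-shot homology: rank database proteins by similarity to the query;
# the twilight zone is where structure-awareness would have to show up.
hits = sorted(database.items(), key=lambda kv: cosine(query, kv[1]),
              reverse=True)[:5]
for name, emb in hits:
    print(name, round(cosine(query, emb), 3))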
APA, Harvard, Vancouver, ISO, and other styles
43

Chen, Bo, Ziwei Xie, Jiezhong Qiu, Zhaofeng Ye, Jinbo Xu and Jie Tang. "Improved the heterodimer protein complex prediction with protein language models". Briefings in Bioinformatics, 16 June 2023. http://dx.doi.org/10.1093/bib/bbad221.

Full text
Abstract (summary):
AlphaFold-Multimer has greatly improved protein complex structure prediction, but its accuracy also depends on the quality of the multiple sequence alignment (MSA) formed by the interacting homologs (i.e. interologs) of the complex under prediction. Here we propose a novel method, ESMPair, that can identify interologs of a complex using protein language models. We show that ESMPair can generate better interologs than the default MSA generation method in AlphaFold-Multimer. Our method results in better complex structure prediction than AlphaFold-Multimer by a large margin (+10.7% in terms of the Top-5 best DockQ), especially when the predicted complex structures have low confidence. We further show that by combining several MSA generation methods, we can yield even better complex structure prediction accuracy than AlphaFold-Multimer (+22% in terms of the Top-5 best DockQ). By systematically analyzing the factors that influence our algorithm's performance, we find that the diversity of the MSA of interologs significantly affects the prediction accuracy. Moreover, we show that ESMPair performs particularly well on complexes from eukaryotes.
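The interolog-pairing step can be caricatured as: within each species, rank chain-A and chain-B homologs by a pLM-derived score and pair same-rank entries. The sketch below uses pooled-embedding cosine similarity as the score; the published method derives its ranking from other internal signals of the language model, so treat this as a schematic, not ESMPair.

import numpy as np

rng = np.random.default_rng(5)

def score(emb, ref):
    """Similarity of a homolog embedding to the query chain embedding
    (a stand-in for the paper's pLM-derived ranking score)."""
    return float(emb @ ref / (np.linalg.norm(emb) * np.linalg.norm(ref)))

ref_a, ref_b = rng.normal(size=512), rng.normal(size=512)

# Homologs of chain A and chain B found in the same species.
homs_a = {f"a{i}": rng.normal(size=512) for i in range(6)}
homs_b = {f"b{i}": rng.normal(size=512) for i in range(4)}

# Rank each side by score, then pair same-rank entries to build one
# interolog row of the paired MSA per rank (zip truncates to the shorter).
rank_a = sorted(homs_a, key=lambda k: score(homs_a[k], ref_a), reverse=True)
rank_b = sorted(homs_b, key=lambda k: score(homs_b[k], ref_b), reverse=True)
pairs = list(zip(rank_a, rank_b))
print(pairs)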
APA, Harvard, Vancouver, ISO, and other styles
44

Tang Tian-Yi, Xiong Yi-Ming, Zhang Rui-Ge, Zhang Jian, Li Wen-Fei, Wang Jun and Wang Wei. "Progress in Protein Pre-training Models Integrated with Structural Knowledge". Acta Physica Sinica, 2024, 0. http://dx.doi.org/10.7498/aps.73.20240811.

Full text
Abstract (summary):
The AI revolution sparked by natural language and image processing has brought new ideas and research paradigms to the field of protein computing. One significant advancement is the development of pre-trained protein language models through self-supervised learning from massive protein sequences. These pre-trained models encode various information about protein sequences, evolution, structures, and even functions, which can be easily transferred to various downstream tasks and demonstrate robust generalization capabilities. Recently, researchers have been further developing multimodal pre-trained models that integrate more diverse types of data. This review summarizes recent studies in this direction from the following aspects. Firstly, it reviews protein pre-trained models that integrate protein structures into language models; this is of particular importance, since protein structure is the primary determinant of protein function. Secondly, pre-trained models that integrate protein dynamic information are introduced. These models may benefit downstream tasks such as protein-protein interactions, soft docking of ligands, and interactions involving allosteric proteins and intrinsically disordered proteins. Thirdly, pre-trained models that integrate knowledge such as gene ontology are described. Fourthly, we briefly introduce pre-trained models in the RNA field. Lastly, we introduce the most recent developments in protein design and discuss how these models relate to the aforementioned pre-trained models that integrate protein structure information.
APA, Harvard, Vancouver, ISO, and other styles
45

Livesey, Benjamin J., and Joseph A. Marsh. "Advancing variant effect prediction using protein language models". Nature Genetics, 10 August 2023. http://dx.doi.org/10.1038/s41588-023-01470-3.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Nijkamp, Erik, Jeffrey A. Ruffolo, Eli N. Weinstein, Nikhil Naik and Ali Madani. "ProGen2: Exploring the boundaries of protein language models". Cell Systems, October 2023. http://dx.doi.org/10.1016/j.cels.2023.10.002.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Marquet, Céline, Michael Heinzinger, Tobias Olenyi, Christian Dallago, Kyra Erckert, Michael Bernhofer, Dmitrii Nechaev and Burkhard Rost. "Embeddings from protein language models predict conservation and variant effects". Human Genetics, 30 December 2021. http://dx.doi.org/10.1007/s00439-021-02411-y.

Full text
Abstract (summary):
The emergence of SARS-CoV-2 variants stressed the demand for tools that interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the cost in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
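The VESPA recipe described above is deliberately simple: a logistic regression over three per-variant inputs (a conservation prediction, a BLOSUM62 score, and a pLM mask-reconstruction probability). The sketch below assembles that feature matrix with stubbed values to show the shape of the approach; it is not the authors' code.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 5000  # one row per single amino acid variant (SAV)

# Three per-SAV features, as named in the abstract (values stubbed here):
conservation = rng.uniform(0, 1, n)   # predicted conservation at the site
blosum62 = rng.integers(-4, 5, n)     # substitution score, wild-type -> mutant
mask_prob = rng.uniform(0, 1, n)      # pLM probability of the mutant residue

X = np.column_stack([conservation, blosum62, mask_prob])
y = rng.integers(0, 2, n)             # toy effect / no-effect labels

lr = LogisticRegression().fit(X, y)
# The predicted probability serves as the variant effect score.
print(lr.predict_proba(X[:3])[:, 1])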
APA, Harvard, Vancouver, ISO, and other styles
48

Si, Yunda, and Chengfei Yan. "Improved inter-protein contact prediction using dimensional hybrid residual networks and protein language models". Briefings in Bioinformatics, 9 February 2023. http://dx.doi.org/10.1093/bib/bbad039.

Full text
Abstract (summary):
Knowledge of contacting residue pairs between interacting proteins is very useful for the structural characterization of protein–protein interactions (PPIs). However, accurately identifying the tens of contacting residue pairs from hundreds of thousands of inter-protein residue pairs is extremely challenging, and the performance of state-of-the-art inter-protein contact prediction methods is still quite limited. In this study, we developed a deep learning method for inter-protein contact prediction, referred to as DRN-1D2D_Inter. Specifically, we employed pre-trained protein language models to generate structural-information-enriched input features for residual networks formed by dimensional hybrid residual blocks to perform inter-protein contact prediction. Extensively benchmarking DRN-1D2D_Inter on multiple datasets, including both heteromeric PPIs and homomeric PPIs, we show that DRN-1D2D_Inter consistently and significantly outperformed two state-of-the-art inter-protein contact prediction methods, GLINTER and DeepHomo, even though both of the latter leveraged the native structures of the interacting proteins in the prediction while DRN-1D2D_Inter made its predictions purely from sequences. We further show that applying the predicted contacts as constraints for protein–protein docking can significantly improve its performance for protein complex structure prediction.
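"Dimensional hybrid" residual blocks mix 1D and 2D convolutions over the pairwise feature map. The PyTorch block below is one plausible reading of that idea, with invented kernel sizes and normalization choices; it is not the published architecture.

import torch
import torch.nn as nn

class HybridResidualBlock(nn.Module):
    """Residual block over a pairwise (B, C, L, L) feature map that mixes
    a full 2D convolution with row- and column-wise 1D convolutions
    (a guess at the 'dimensional hybrid' idea, not the published block)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv2d = nn.Conv2d(channels, channels, 3, padding=1)
        self.row = nn.Conv2d(channels, channels, (1, 5), padding=(0, 2))
        self.col = nn.Conv2d(channels, channels, (5, 1), padding=(2, 0))
        self.norm = nn.InstanceNorm2d(channels, affine=True)
        self.act = nn.ELU()

    def forward(self, x):
        h = self.conv2d(x) + self.row(x) + self.col(x)
        return x + self.act(self.norm(h))

x = torch.randn(1, 64, 100, 100)         # pairwise features, 100 x 100 residues
print(HybridResidualBlock(64)(x).shape)  # torch.Size([1, 64, 100, 100])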
APA, Harvard, Vancouver, ISO, and other styles
49

Harrigan, William L., Barbra D. Ferrell, K. Eric Wommack, Shawn W. Polson, Zachary D. Schreiber and Mahdi Belcaid. "Improvements in viral gene annotation using large language models and soft alignments". BMC Bioinformatics 25, no. 1 (25 April 2024). http://dx.doi.org/10.1186/s12859-024-05779-6.

Full text
Abstract (summary):
Background
The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings.
Results
Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect.
Conclusion
The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
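Soft alignment swaps the substitution matrix in classical pairwise alignment for per-residue embedding similarity. A compact Needleman-Wunsch over cosine scores (with stubbed embeddings and an invented gap penalty) shows the mechanics; the paper's actual algorithm and parameters may differ.

import numpy as np

def soft_align(E1, E2, gap=-0.5):
    """Global alignment score where the substitution score for (i, j) is
    the cosine similarity of residue embeddings, not a scoring matrix."""
    E1 = E1 / np.linalg.norm(E1, axis=1, keepdims=True)
    E2 = E2 / np.linalg.norm(E2, axis=1, keepdims=True)
    S = E1 @ E2.T                                  # (n, m) similarity
    n, m = S.shape
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = gap * np.arange(n + 1)
    F[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i-1, j-1] + S[i-1, j-1],   # match via embeddings
                          F[i-1, j] + gap,             # gap in sequence 2
                          F[i, j-1] + gap)             # gap in sequence 1
    return F[n, m]

rng = np.random.default_rng(7)
e1, e2 = rng.normal(size=(30, 64)), rng.normal(size=(28, 64))
print("soft alignment score:", round(soft_align(e1, e2), 3))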
APA, Harvard, Vancouver, ISO, and other styles
50

Hie, Brian L., Kevin K. Yang and Peter S. Kim. "Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins". Cell Systems, February 2022. http://dx.doi.org/10.1016/j.cels.2022.01.003.

Full text
APA, Harvard, Vancouver, ISO, and other styles