Artykuły w czasopismach: „Protein language models”

1

Tang, Lin. "Protein language models using convolutions". Nature Methods 21, nr 4 (kwiecień 2024): 550. http://dx.doi.org/10.1038/s41592-024-02252-3.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

2

Ali, Sarwan, Prakash Chourasia i Murray Patterson. "When Protein Structure Embedding Meets Large Language Models". Genes 15, nr 1 (23.12.2023): 25. http://dx.doi.org/10.3390/genes15010025.

Pełny tekst źródła

Streszczenie:

Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.

Style APA, Harvard, Vancouver, ISO itp.

3

Ferruz, Noelia, i Birte Höcker. "Controllable protein design with language models". Nature Machine Intelligence 4, nr 6 (czerwiec 2022): 521–32. http://dx.doi.org/10.1038/s42256-022-00499-z.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

4

Li, Xiang, Zhuoyu Wei, Yueran Hu i Xiaolei Zhu. "GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models". International Journal of Biological Macromolecules 280 (listopad 2024): 135599. http://dx.doi.org/10.1016/j.ijbiomac.2024.135599.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

5

Singh, Arunima. "Protein language models guide directed antibody evolution". Nature Methods 20, nr 6 (czerwiec 2023): 785. http://dx.doi.org/10.1038/s41592-023-01924-w.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

6

Tran, Chau, Siddharth Khadkikar i Aleksey Porollo. "Survey of Protein Sequence Embedding Models". International Journal of Molecular Sciences 24, nr 4 (14.02.2023): 3775. http://dx.doi.org/10.3390/ijms24043775.

Pełny tekst źródła

Streszczenie:

Derived from the natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which are widely diverse in length and amino acid composition, in fixed-size numerical vectors (embeddings). We surveyed representative embedding models such as Esm, Esm1b, ProtT5, and SeqVec, along with their derivatives (GoPredSim and PLAST), to conduct the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be less than 200 amino acids long, contain fewer aspartates and glutamates, and are enriched for cysteine. Less than half of these proteins can be annotated with GO terms with high confidence. The distribution of the cosine similarity scores of benign and pathogenic mutations to the reference human proteins shows a statistically significant difference. The differences in embeddings of the reference TEM-1 and mutants have low to no correlation with minimal inhibitory concentrations (MIC).

Style APA, Harvard, Vancouver, ISO itp.

7

Pokharel, Suresh, Pawel Pratyush, Hamid D. Ismail, Junfeng Ma i Dukka B. KC. "Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction". International Journal of Molecular Sciences 24, nr 21 (6.11.2023): 16000. http://dx.doi.org/10.3390/ijms242116000.

Pełny tekst źródła

Streszczenie:

O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite the great progress in experimentally mapping O-GlcNAc sites, there is an unmet need to develop robust prediction tools that can effectively locate the presence of O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for prediction of protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three protein sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of O-GlcNAc sites and also evaluated various ensemble strategies to integrate embeddings from these protein language models. Upon investigation, the decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on these individual language models as well as other fusion approaches and other existing predictors in almost all of the parameters evaluated. The precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and diseases. Moreover, these findings also indicate the effectiveness of combined uses of multiple protein language models in post-translational modification prediction and open exciting avenues for further research and exploration in other protein downstream tasks. LM-OGlcNAc-Site’s web server and source code are publicly available to the community.

Style APA, Harvard, Vancouver, ISO itp.

8

Weissenow, Konstantin, i Burkhard Rost. "Are protein language models the new universal key?" Current Opinion in Structural Biology 91 (kwiecień 2025): 102997. https://doi.org/10.1016/j.sbi.2025.102997.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

9

Wang, Wenkai, Zhenling Peng i Jianyi Yang. "Single-sequence protein structure prediction using supervised transformer protein language models". Nature Computational Science 2, nr 12 (19.12.2022): 804–14. http://dx.doi.org/10.1038/s43588-022-00373-3.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

10

Kenlay, Henry, Frédéric A. Dreyer, Aleksandr Kovaltsuk, Dom Miketa, Douglas Pires i Charlotte M. Deane. "Large scale paired antibody language models". PLOS Computational Biology 20, nr 12 (6.12.2024): e1012646. https://doi.org/10.1371/journal.pcbi.1012646.

Pełny tekst źródła

Streszczenie:

Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best performing antibody-specific language models developed to date which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively using the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large scale data sets and high-performance computing for enhancing antibody design for therapeutic development.

Style APA, Harvard, Vancouver, ISO itp.

11

Pang, Yihe, i Bin Liu. "IDP-LM: Prediction of protein intrinsic disorder and disorder functions based on language models". PLOS Computational Biology 19, nr 11 (22.11.2023): e1011657. http://dx.doi.org/10.1371/journal.pcbi.1011657.

Pełny tekst źródła

Streszczenie:

Intrinsically disordered proteins (IDPs) and regions (IDRs) are a class of functionally important proteins and regions that lack stable three-dimensional structures under the native physiologic conditions. They participate in critical biological processes and thus are associated with the pathogenesis of many severe human diseases. Identifying the IDPs/IDRs and their functions will be helpful for a comprehensive understanding of protein structures and functions, and inform studies of rational drug design. Over the past decades, the exponential growth in the number of proteins with sequence information has deepened the gap between uncharacterized and annotated disordered sequences. Protein language models have recently demonstrated their powerful abilities to capture complex structural and functional information from the enormous quantity of unlabelled protein sequences, providing opportunities to apply protein language models to uncover the intrinsic disorders and their biological properties from the amino acid sequences. In this study, we proposed a computational predictor called IDP-LM for predicting intrinsic disorder and disorder functions by leveraging the pre-trained protein language models. IDP-LM takes the embeddings extracted from three pre-trained protein language models as the exclusive inputs, including ProtBERT, ProtT5 and a disorder specific language model (IDP-BERT). The ablation analysis shown that the IDP-BERT provided fine-grained feature representations of disorder, and the combination of three language models is the key to the performance improvement of IDP-LM. The evaluation results on independent test datasets demonstrated that the IDP-LM provided high-quality prediction results for intrinsic disorder and four common disordered functions.

Style APA, Harvard, Vancouver, ISO itp.

12

Zhao, Long, Qiang He, Huijia Song, Tianqian Zhou, An Luo, Zhenguo Wen, Teng Wang i Xiaozhu Lin. "Protein A-like Peptide Design Based on Diffusion and ESM2 Models". Molecules 29, nr 20 (21.10.2024): 4965. http://dx.doi.org/10.3390/molecules29204965.

Pełny tekst źródła

Streszczenie:

Proteins are the foundation of life, and designing functional proteins remains a key challenge in biotechnology. Before the development of AlphaFold2, the focus of design was primarily on structure-centric approaches such as using the well-known open-source software Rosetta3. Following the development of AlphaFold2, deep-learning techniques for protein design gained prominence. This study proposes a new method to generate functional proteins using the diffusion model and ESM2 protein language model. Diffusion models, which are widely used in image and natural language generation, are used here for protein design, facilitating the controlled generation of new sequences. The ESM2 model, trained on the basis of large-scale protein sequence data, provides a deep understanding of the context of the sequence, thus improving the model’s ability to generate biologically relevant proteins. In this study, we used the Protein A-like peptide as a model study object, combined the diffusion model and the ESM2 model to generate new peptide sequences from minimal input data, and verified their biological activities through experiments such as the BLI affinity test. In conclusion, we developed a new method for protein design that provides a novel strategy to meet the challenges of generic protein generation.

Style APA, Harvard, Vancouver, ISO itp.

13

Weber, Leon, Kirsten Thobe, Oscar Arturo Migueles Lozano, Jana Wolf i Ulf Leser. "PEDL: extracting protein–protein associations using deep language models and distant supervision". Bioinformatics 36, Supplement_1 (1.07.2020): i490—i498. http://dx.doi.org/10.1093/bioinformatics/btaa430.

Pełny tekst źródła

Streszczenie:

Abstract Motivation A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. Results We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. Availability and implementation PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. Supplementary information Supplementary data are available at Bioinformatics online.

Style APA, Harvard, Vancouver, ISO itp.

14

Wang, Yang. "Enhanced protein function prediction by fusion embedding based on protein language models". Highlights in Science, Engineering and Technology 66 (20.09.2023): 177–84. http://dx.doi.org/10.54097/hset.v66i.11697.

Pełny tekst źródła

Streszczenie:

Natural language models can accomplish non-natural language tasks such as protein prediction, but the actual prediction effect is low and occupies large computational resources. In this paper, a fusion embedding model is proposed to improve the prediction effect of the model and reduce the computational cost of the model by fusing information of different dimensions. The paper is validated by the downstream task of protein function prediction, which provides a reference for solving practical tasks using fusion embedding methods.

Style APA, Harvard, Vancouver, ISO itp.

15

Sun, Yuanfei, i Yang Shen. "Variant effect prediction using structure-informed protein language models". Biophysical Journal 122, nr 3 (luty 2023): 473a. http://dx.doi.org/10.1016/j.bpj.2022.11.2537.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

16

Qu, Yang, Zitong Niu, Qiaojiao Ding, Taowa Zhao, Tong Kong, Bing Bai, Jianwei Ma, Yitian Zhao i Jianping Zheng. "Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction". International Journal of Molecular Sciences 24, nr 22 (18.11.2023): 16496. http://dx.doi.org/10.3390/ijms242216496.

Pełny tekst źródła

Streszczenie:

Machine learning has been increasingly utilized in the field of protein engineering, and research directed at predicting the effects of protein mutations has attracted increasing attention. Among them, so far, the best results have been achieved by related methods based on protein language models, which are trained on a large number of unlabeled protein sequences to capture the generally hidden evolutionary rules in protein sequences, and are therefore able to predict their fitness from protein sequences. Although numerous similar models and methods have been successfully employed in practical protein engineering processes, the majority of the studies have been limited to how to construct more complex language models to capture richer protein sequence feature information and utilize this feature information for unsupervised protein fitness prediction. There remains considerable untapped potential in these developed models, such as whether the prediction performance can be further improved by integrating different models to further improve the accuracy of prediction. Furthermore, how to utilize large-scale models for prediction methods of mutational effects on quantifiable properties of proteins due to the nonlinear relationship between protein fitness and the quantification of specific functionalities has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting mutational effects of proteins integrating protein sequence features extracted from multiple large protein language models, as well as evolutionarily coupled features extracted in homologous sequences, while comparing the differences between linear regression and deep learning models in mapping these features to quantifiable functional changes. We tested our approach on a dataset of 17 protein deep mutation scans and indicated that the integrated approach together with linear regression enables the models to have higher prediction accuracy and generalization. Moreover, we further illustrated the reliability of the integrated approach by exploring the differences in the predictive performance of the models across species and protein sequence lengths, as well as by visualizing clustering of ensemble and non-ensemble features.

Style APA, Harvard, Vancouver, ISO itp.

17

Thumuluri, Vineet, Hannah-Marie Martiny, Jose J. Almagro Armenteros, Jesper Salomon, Henrik Nielsen i Alexander Rosenberg Johansen. "NetSolP: predicting protein solubility in Escherichia coli using language models". Bioinformatics 38, nr 4 (27.11.2021): 941–46. http://dx.doi.org/10.1093/bioinformatics/btab801.

Pełny tekst źródła

Streszczenie:

Abstract Motivation Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. Results In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. Availability and implementation The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. Supplementary information Supplementary data are available at Bioinformatics online.

Style APA, Harvard, Vancouver, ISO itp.

18

Deutschmann, Nicolas, Aurelien Pelissier, Anna Weber, Shuaijun Gao, Jasmina Bogojeska i María Rodríguez Martínez. "Do domain-specific protein language models outperform general models on immunology-related tasks?" ImmunoInformatics 14 (czerwiec 2024): 100036. http://dx.doi.org/10.1016/j.immuno.2024.100036.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

19

Liu, Dan, Francesca Young, Kieran D. Lamb, David L. Robertson i Ke Yuan. "Prediction of virus-host associations using protein language models and multiple instance learning". PLOS Computational Biology 20, nr 11 (19.11.2024): e1012597. http://dx.doi.org/10.1371/journal.pcbi.1012597.

Pełny tekst źródła

Streszczenie:

Predicting virus-host associations is essential to determine the specific host species that viruses interact with, and discover if new viruses infect humans and animals. Currently, the host of the majority of viruses is unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts the host species for viruses from viral sequences only. It also identifies important viral proteins that significantly contribute to host prediction. The method combines a pre-trained large protein language model (ESM) and attention-based multiple instance learning to allow protein-orientated predictions. Our results show that protein embeddings capture stronger predictive signals than sequence composition features, including amino acids, physiochemical properties, and DNA k-mers. In multi-host prediction tasks, EvoMIL achieves median F1 score improvements of 10.8%, 16.2%, and 4.9% in prokaryotic hosts, and 1.7%, 6.6% and 11.5% in eukaryotic hosts. EvoMIL binary classifiers achieve impressive AUC over 0.95 for all prokaryotic hosts and range from roughly 0.8 to 0.9 for eukaryotic hosts. Furthermore, EvoMIL identifies important proteins in the prediction task. We found them capturing key functions in virus-host specificity.

Style APA, Harvard, Vancouver, ISO itp.

20

Wang, Bo, i Wenjin Li. "Advances in the Application of Protein Language Modeling for Nucleic Acid Protein Binding Site Prediction". Genes 15, nr 8 (18.08.2024): 1090. http://dx.doi.org/10.3390/genes15081090.

Pełny tekst źródła

Streszczenie:

Protein and nucleic acid binding site prediction is a critical computational task that benefits a wide range of biological processes. Previous studies have shown that feature selection holds particular significance for this prediction task, making the generation of more discriminative features a key area of interest for many researchers. Recent progress has shown the power of protein language models in handling protein sequences, in leveraging the strengths of attention networks, and in successful applications to tasks such as protein structure prediction. This naturally raises the question of the applicability of protein language models in predicting protein and nucleic acid binding sites. Various approaches have explored this potential. This paper first describes the development of protein language models. Then, a systematic review of the latest methods for predicting protein and nucleic acid binding sites is conducted by covering benchmark sets, feature generation methods, performance comparisons, and feature ablation studies. These comparisons demonstrate the importance of protein language models for the prediction task. Finally, the paper discusses the challenges of protein and nucleic acid binding site prediction and proposes possible research directions and future trends. The purpose of this survey is to furnish researchers with actionable suggestions for comprehending the methodologies used in predicting protein–nucleic acid binding sites, fostering the creation of protein-centric language models, and tackling real-world obstacles encountered in this field.

Style APA, Harvard, Vancouver, ISO itp.

21

Bhat, Suhaas, Garyk Brixi, Kalyan Palepu, Lauren Hong, Vivian Yudistyra, Tianlai Chen, Sophia Vincoff, Lin Zhao i Pranam Chatterjee. "Abstract C118: Design of programmable peptide-guided oncoprotein degraders via generative language models". Molecular Cancer Therapeutics 22, nr 12_Supplement (1.12.2023): C118. http://dx.doi.org/10.1158/1535-7163.targ-23-c118.

Pełny tekst źródła

Streszczenie:

Abstract Targeted protein degradation of pathogenic proteins represents a powerful new treatment strategy for multiple cancers. Unfortunately, a sizable portion of these proteins are considered “undruggable” by standard small molecule-based approaches, including PROTACs and molecular glues, largely due to their disordered nature, instability, and lack of binding site accessibility. As a more modular strategy, we have developed a genetically-encoded protein architecture by fusing target-specific peptides to E3 ubiquitin ligase domains for selective and potent intracellular degradation of oncoproteins. To enable programmability of our system, we develop a suite of algorithms that enable the design of target-specific peptides via protein language model (pLM) embeddings, without the requirement of 3D structures. First, we train a model that leverages pLM embeddings to efficiently select high-affinity peptides from natural protein interaction interfaces. Next, we develop a high-accuracy discriminator, based on the contrastive language-image pretraining (CLIP) architecture underlying OpenAI's DALL-E model, to prioritize and screen peptides with selectivity to a specified target oncoprotein. As input to the discriminator, we create a Gaussian diffusion generator to sample a pLM latent space, fine-tuned on experimentally-valid peptide sequences. Finally, to enable de novo design of binding peptides, we train an instance of GPT-2 with protein interacting sequences to enable peptide generation conditioned on target oncoprotein sequences. Our models demonstrate low perplexities across both existing and generated peptide sequences, highlighting their robust generative capability. By experimentally fusing model-derived peptides to E3 ubiquitin ligase domains, we reliably identify candidates exhibiting robust and selective endogenous degradation of diverse, "undruggable" oncoproteins in cancer cell models, including tumorigenic regulators such as β-catenin and TRIM8, as well as oncogenic fusion proteins, such as EWS-FLI1, PAX3-FOXO1, and DNAJB1-PRKACA. We further show that our peptide-guided degraders have negligible off-target effects via whole-cell proteomics and demonstrate their modulation of transcriptional and apoptotic pathways, motivating further translation of our therapeutic platform. Together, our work establishes a CRISPR-analogous system for programmable protein degradation applications across the oncoproteome. Citation Format: Suhaas Bhat, Garyk Brixi, Kalyan Palepu, Lauren Hong, Vivian Yudistyra, Tianlai Chen, Sophia Vincoff, Lin Zhao, Pranam Chatterjee. Design of programmable peptide-guided oncoprotein degraders via generative language models [abstract]. In: Proceedings of the AACR-NCI-EORTC Virtual International Conference on Molecular Targets and Cancer Therapeutics; 2023 Oct 11-15; Boston, MA. Philadelphia (PA): AACR; Mol Cancer Ther 2023;22(12 Suppl):Abstract nr C118.

Style APA, Harvard, Vancouver, ISO itp.

22

Mardikoraem, Mehrsa, i Daniel Woldring. "Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods". Pharmaceutics 15, nr 5 (25.04.2023): 1337. http://dx.doi.org/10.3390/pharmaceutics15051337.

Pełny tekst źródła

Streszczenie:

Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance include handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physiochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Elaboration on performance is provided over protein fitness, protein size, and sampling techniques. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased the predictive performance of the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was rigorous enough in stability prediction (F1-score = 92%).

Style APA, Harvard, Vancouver, ISO itp.

23

Nana Teukam, Yves Gaetan, Loïc Kwate Dassi, Matteo Manica, Daniel Probst, Philippe Schwaller i Teodoro Laino. "Language models can identify enzymatic binding sites in protein sequences". Computational and Structural Biotechnology Journal 23 (grudzień 2024): 1929–37. http://dx.doi.org/10.1016/j.csbj.2024.04.012.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

24

Yadalam, Pradeep kumar, Ramya Ramadoss, Pradeep kumar R i Jishnu Krishna Kumar. "Pre-Trained Language Models Based Sequence Prediction of Wnt-Sclerostin Protein Sequences in Alveolar Bone Formation". Journal of Pioneering Medical Science 12, nr 3 (31.12.2023): 55–60. http://dx.doi.org/10.61091/jpms202312311.

Pełny tekst źródła

Streszczenie:

Background and Introduction: Osteocytes, the most numerous bone cells, create sclerostin. The sclerostin protein sequence predictive model helps create novel medications and produce alveolar bone in periodontitis and other oral bone illnesses, including osteoporosis. Neural networks examine protein variants for protein engineering and predict their structure and function impacts. Proteins with improved function and stability have been engineered using LLMs and CNNs. Sequence-based models, especially protein LLMs, predict variation effects, fitness, post-translational modifications, biophysical properties, and protein structure. CNNs trained on structural data also improve enzyme function. It is unknown if these models differ or forecast similarly. This study seeks Pre-trained language models to predict Wnt-Sclerostin Protein sequences in alveolar bone formation. Methods: Using UniProt ID, sclerostin and related proteins (Q9BQB4, Q9BQB4-1, Q9BQB4-2, Q6X4U4, O75197) were identified and quality-checked. Deepbio analyzed FASTA sequences. Deep Bio is a one-stop web service allowing academics to build any biological deep-learning architecture. DeepBIO used deep learning to improve and visualize biological sequencing data. LLM BASED Reformer, AAPNP, TEXTRGNN, VDCNN, and split sequence-based datasets into test and training. We randomly partitioned each dataset into 1000 training and 200 testing sets to change hyperparameters and measure performance. Results: Reformer, AAPNP, TEXTRGNN, VDCNN, RNN CNN exhibit 93, 64, 51, 91, and 64 percent accuracy. Conclusion: Protein sequence-based massive language models are growing, and R\&D is solving complicated challenges.

Style APA, Harvard, Vancouver, ISO itp.

25

Wang, Yan, Huiting Sun, Nan Sheng, Kai He, Wenjv Hou, Ziqi Zhao, Qixing Yang i Lan Huang. "ESMSec: Prediction of Secreted Proteins in Human Body Fluids Using Protein Language Models and Attention". International Journal of Molecular Sciences 25, nr 12 (9.06.2024): 6371. http://dx.doi.org/10.3390/ijms25126371.

Pełny tekst źródła

Streszczenie:

The secreted proteins of human body fluid have the potential to be used as biomarkers for diseases. These biomarkers can be used for early diagnosis and risk prediction of diseases, so the study of secreted proteins of human body fluid has great application value. In recent years, the deep-learning-based transformer language model has transferred from the field of natural language processing (NLP) to the field of proteomics, leading to the development of protein language models (PLMs) for protein sequence representation. Here, we propose a deep learning framework called ESM Predict Secreted Proteins (ESMSec) to predict three types of proteins secreted in human body fluid. The ESMSec is based on the ESM2 model and attention architecture. Specifically, the protein sequence data are firstly put into the ESM2 model to extract the feature information from the last hidden layer, and all the input proteins are encoded into a fixed 1000 × 480 matrix. Secondly, multi-head attention with a fully connected neural network is employed as the classifier to perform binary classification according to whether they are secreted into each body fluid. Our experiment utilized three human body fluids that are important and ubiquitous markers. Experimental results show that ESMSec achieved average accuracy of 0.8486, 0.8358, and 0.8325 on the testing datasets for plasma, cerebrospinal fluid (CSF), and seminal fluid, which on average outperform the state-of-the-art (SOTA) methods. The outstanding performance results of ESMSec demonstrate that the ESM can improve the prediction performance of the model and has great potential to screen the secretion information of human body fluid proteins.

Style APA, Harvard, Vancouver, ISO itp.

26

Zhu, Yi-Heng, Chengxin Zhang, Dong-Jun Yu i Yang Zhang. "Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction". PLOS Computational Biology 18, nr 12 (22.12.2022): e1010793. http://dx.doi.org/10.1371/journal.pcbi.1010793.

Pełny tekst źródła

Streszczenie:

Accurate identification of protein function is critical to elucidate life mechanisms and design new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase of the GO prediction accuracy compared to the state-of-the-art approaches in all aspects of molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in the utilization of pre-trained transformer language models which can extract discriminative functional pattern from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, it was found that the combination of the network scores with the complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotations from sequence alone.

Style APA, Harvard, Vancouver, ISO itp.

27

Lin, Zeming, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin i in. "Evolutionary-scale prediction of atomic-level protein structure with a language model". Science 379, nr 6637 (17.03.2023): 1123–30. http://dx.doi.org/10.1126/science.ade2574.

Pełny tekst źródła

Streszczenie:

Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.

Style APA, Harvard, Vancouver, ISO itp.

28

Valentini, Giorgio, Dario Malchiodi, Jessica Gliozzo, Marco Mesiti, Mauricio Soto-Gomez, Alberto Cabri, Justin Reese, Elena Casiraghi i Peter N. Robinson. "The promises of large language models for protein design and modeling". Frontiers in Bioinformatics 3 (23.11.2023). http://dx.doi.org/10.3389/fbinf.2023.1304099.

Pełny tekst źródła

Streszczenie:

The recent breakthroughs of Large Language Models (LLMs) in the context of natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the “language of proteins” invite the application and adaptation of LLMs to protein modelling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Indeed, protein language models have been already trained to accurately predict protein properties, generate novel functionally characterized proteins, achieving state-of-the-art results. In this paper we discuss the promises and the open challenges raised by this novel and exciting research area, and we propose our perspective on how LLMs will affect protein modeling and design.

Style APA, Harvard, Vancouver, ISO itp.

29

Mall, Raghvendra, Rahul Kaushik, Zachary A. Martinez, Matt W. Thomson i Filippo Castiglione. "Benchmarking protein language models for protein crystallization". Scientific Reports 15, nr 1 (18.01.2025). https://doi.org/10.1038/s41598-025-86519-5.

Pełny tekst źródła

Streszczenie:

Abstract The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to overcome the high attrition rate, cost of experiments and extensive trial-and-error settings, for predicting the crystallization propensities of proteins based on their sequences. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a be-spoke framework democratizing the usage of PLMs for the task of predicting crystallization propensities of proteins. By comparing LightGBM / XGBoost classifiers built on the average embedding representations of proteins learned by different PLMs, such as ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM, SaProt with the performance of state-of-the-art sequence-based methods like DeepCrystal, ATTCrys and CLPred, we identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers utilizing embeddings from ESM2 model with 30 and 36 transformer layers and 150 and 3000 million parameters respectively have performance gains by 3- $$5\%$$ than all compared models for various evaluation metrics, including AUPR (Area Under Precision-Recall Curve), AUC (Area Under the Receiver Operating Characteristic Curve), and F1 on independent test sets. Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins. Starting with 3000 generated proteins and through a step of filtration processes including consensus of all open PLM-based classifiers, sequence identity through CD-HIT, secondary structure compatibility, aggregation screening, homology search and foldability evaluation, we identified a set of 5 novel proteins as potentially crystallizable.

Style APA, Harvard, Vancouver, ISO itp.

30

Avraham, Orly, Tomer Tsaban, Ziv Ben-Aharon, Linoy Tsaban i Ora Schueler-Furman. "Protein language models can capture protein quaternary state". BMC Bioinformatics 24, nr 1 (14.11.2023). http://dx.doi.org/10.1186/s12859-023-05549-w.

Pełny tekst źródła

Streszczenie:

Abstract Background Determining a protein’s quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, that apply computational natural-language models to proteins successfully capture secondary structure, protein cell localization and other characteristics, from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction. Results We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary state related information is included in such embeddings. Conclusions QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems, as well as for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab.research.google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb.

Style APA, Harvard, Vancouver, ISO itp.

31

Boshar, Sam, Evan Trop, Bernardo P. de Almeida, Liviu Copoiu i Thomas Pierrot. "Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks". Bioinformatics, 30.08.2024. http://dx.doi.org/10.1093/bioinformatics/btae529.

Pełny tekst źródła

Streszczenie:

Abstract Motivation Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, due to few tasks pairing proteins with the coding DNA sequences (CDS) that can be processed by gLMs. Results In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive and even outperform their pLMs counterparts on some tasks. The best performance was achieved using the retrieved CDS compared to sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data, and in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics. Availability and Implementation We make our inference code, 3mer pre-trained model weights and datasets available. Supplementary information Supplementary data are available at Journal Name online.

Style APA, Harvard, Vancouver, ISO itp.

32

An, Jingmin, i Xiaogang Weng. "Collectively encoding protein properties enriches protein language models". BMC Bioinformatics 23, nr 1 (8.11.2022). http://dx.doi.org/10.1186/s12859-022-05031-z.

Pełny tekst źródła

Streszczenie:

AbstractPre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain tasks. However, few studies focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.

Style APA, Harvard, Vancouver, ISO itp.

33

Narang, Kush, Abhigyan Nath, William Hemstrom i Simon K. S. Chu. "HaloClass: Salt-Tolerant Protein Classification with Protein Language Models". Protein Journal, 21.10.2024. http://dx.doi.org/10.1007/s10930-024-10236-7.

Pełny tekst źródła

Streszczenie:

AbstractSalt-tolerant proteins, also known as halophilic proteins, have unique adaptations to function in high-salinity environments. These proteins have naturally evolved in extremophilic organisms, and more recently, are being increasingly applied as enzymes in industrial processes. Due to an abundance of salt-tolerant sequences and a simultaneous lack of experimental structures, most computational methods to predict stability are sequence-based only. These approaches, however, are hindered by a lack of structural understanding of these proteins. Here, we present HaloClass, an SVM classifier that leverages ESM-2 protein language model embeddings to accurately identify salt-tolerant proteins. On a newer and larger test dataset, HaloClass outperforms existing approaches when predicting the stability of never-before-seen proteins that are distal to its training set. Finally, on a mutation study that evaluated changes in salt tolerance based on single- and multiple-point mutants, HaloClass outperforms existing approaches, suggesting applications in the guided design of salt-tolerant enzymes.

Style APA, Harvard, Vancouver, ISO itp.

34

McWhite, Claire Darnell, Isabel Armour-Garb i Mona Singh. "Leveraging protein language models for accurate multiple sequence alignments". Genome Research, 6.07.2023, gr.277675.123. http://dx.doi.org/10.1101/gr.277675.123.

Pełny tekst źródła

Streszczenie:

Multiple sequence alignment is a critical step in the study of protein sequence and function. Typically, multiple sequence alignment algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. While successful, standard methods struggle on sets of proteins with low sequence identity - the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverage massive sequence datasets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to multiple sequence alignment, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of multiple sequence alignment algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.

Style APA, Harvard, Vancouver, ISO itp.

35

Vitale, Rosario, Leandro A. Bugnon, Emilio Luis Fenoy, Diego H. Milone i Georgina Stegmayer. "Evaluating large language models for annotating proteins". Briefings in Bioinformatics 25, nr 3 (27.03.2024). http://dx.doi.org/10.1093/bib/bbae177.

Pełny tekst źródła

Streszczenie:

Abstract In UniProtKB, up to date, there are more than 251 million proteins deposited. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, however at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learningṪhis requires the use of protein large language models (LLMs), trained with self-supervision on big unnanotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the actual prediction of protein domain annotations. Results are significatively better than state-of-the-art for protein families classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLMs embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a github repo. Full source code and data are available at https://github.com/sinc-lab/llm4pfam

Style APA, Harvard, Vancouver, ISO itp.

36

Jing, Xiaoyang, Fandi Wu, Xiao Luo i Jinbo Xu. "Single-sequence protein structure prediction by integrating protein language models". Proceedings of the National Academy of Sciences 121, nr 13 (20.03.2024). http://dx.doi.org/10.1073/pnas.2308788121.

Pełny tekst źródła

Streszczenie:

Protein structure prediction has been greatly improved by deep learning in the past few years. However, the most successful methods rely on multiple sequence alignment (MSA) of the sequence homologs of the protein under prediction. In nature, a protein folds in the absence of its sequence homologs and thus, a MSA-free structure prediction method is desired. Here, we develop a single-sequence-based protein structure prediction method RaptorX-Single by integrating several protein language models and a structure generation module and then study its advantage over MSA-based methods. Our experimental results indicate that in addition to running much faster than MSA-based methods such as AlphaFold2, RaptorX-Single outperforms AlphaFold2 and other MSA-free methods in predicting the structure of antibodies (after fine-tuning on antibody data), proteins of very few sequence homologs, and single mutation effects. By comparing different protein language models, our results show that not only the scale but also the training data of protein language models will impact the performance. RaptorX-Single also compares favorably to MSA-based AlphaFold2 when the protein under prediction has a large number of sequence homologs.

Style APA, Harvard, Vancouver, ISO itp.

37

Haselbeck, Florian, Maura John, Yuqi Zhang, Jonathan Pirnay, Juan Pablo Fuenzalida-Werner, Rubén D. Costa i Dominik G. Grimm. "Superior protein thermophilicity prediction with protein language model embeddings". NAR Genomics and Bioinformatics 5, nr 4 (11.10.2023). http://dx.doi.org/10.1093/nargab/lqad087.

Pełny tekst źródła

Streszczenie:

Abstract Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthew’s correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species not overlapping with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one nonthermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.

Style APA, Harvard, Vancouver, ISO itp.

38

Lin, Peicong, Huanyu Tao, Hao Li i Sheng-You Huang. "Protein–protein contact prediction by geometric triangle-aware protein language models". Nature Machine Intelligence, 19.10.2023. http://dx.doi.org/10.1038/s42256-023-00741-2.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

39

Ieremie, Ioan, Rob M. Ewing i Mahesan Niranjan. "Protein language models meet reduced amino acid alphabets". Bioinformatics, 3.02.2024. http://dx.doi.org/10.1093/bioinformatics/btae061.

Pełny tekst źródła

Streszczenie:

Abstract Motivation Protein Language Models (PLMs), which borrowed ideas for modelling and inference from Natural Language Processing, have demonstrated the ability to extract meaningful representations in an unsupervised way. This led to significant performance improvement in several downstream tasks. Clustering amino acids based on their physical-chemical properties to achieve reduced alphabets has been of interest in past research, but their application to PLMs or folding models is unexplored. Results Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information impacts learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost in alphabet reduction methods. We further show the ability of a structure prediction model(ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 proteins out of the 50 targets, reduced alphabets improve structural predictions with LDDT-Cα differences of up to 19%. Availability Trained models and code are available at github.com/Ieremie/reduced-alph-PLM Supplementary information Supplementary data are available at Bioinformatics online.

Style APA, Harvard, Vancouver, ISO itp.

40

Pudžiuvelytė, Ieva, Kliment Olechnovič, Egle Godliauskaite, Kristupas Sermokas, Tomas Urbaitis, Giedrius Gasiunas i Darius Kazlauskas. "TemStaPro: protein thermostability prediction using sequence representations from protein language models". Bioinformatics, 20.03.2024. http://dx.doi.org/10.1093/bioinformatics/btae157.

Pełny tekst źródła

Streszczenie:

Abstract Motivation Reliable prediction of protein thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled using machine learning and by taking advantage of the recent blossoming of deep learning methods for sequence analysis. These methods can facilitate training on more data and, possibly, enable development of more versatile thermostability predictors for multiple ranges of temperatures. Results We applied the principle of transfer learning to predict protein thermostability using embeddings generated by protein language models (pLMs) from an input protein sequence. We used large pLMs that were pre-trained on hundreds of millions of known sequences. The embeddings from such models allowed us to efficiently train and validate a high-performing prediction method using over one million sequences that we collected from organisms with annotated growth temperatures. Our method, TemStaPro (Temperatures of Stability for Proteins), was used to predict thermostability of CRISPR-Cas Class II effector proteins (C2EPs). Predictions indicated sharp differences among groups of C2EPs in terms of thermostability and were largely in tune with previously published and our newly obtained experimental data. Availability and Implementation TemStaPro software and the related data are freely available from https://github.com/ievapudz/TemStaPro and https://doi.org/10.5281/zenodo.7743637.

Style APA, Harvard, Vancouver, ISO itp.

41

Kabir, Anowarul, Asher Moldwin, Yana Bromberg i Amarda Shehu. "In the Twilight Zone of Protein Sequence Homology: Do Protein Language Models Learn Protein Structure?" Bioinformatics Advances, 17.08.2024. http://dx.doi.org/10.1093/bioadv/vbae119.

Pełny tekst źródła

Streszczenie:

Abstract Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.

Style APA, Harvard, Vancouver, ISO itp.

42

Chen, Bo, Ziwei Xie, Jiezhong Qiu, Zhaofeng Ye, Jinbo Xu i Jie Tang. "Improved the heterodimer protein complex prediction with protein language models". Briefings in Bioinformatics, 16.06.2023. http://dx.doi.org/10.1093/bib/bbad221.

Pełny tekst źródła

Streszczenie:

Abstract AlphaFold-Multimer has greatly improved the protein complex structure prediction, but its accuracy also depends on the quality of the multiple sequence alignment (MSA) formed by the interacting homologs (i.e. interologs) of the complex under prediction. Here we propose a novel method, ESMPair, that can identify interologs of a complex using protein language models. We show that ESMPair can generate better interologs than the default MSA generation method in AlphaFold-Multimer. Our method results in better complex structure prediction than AlphaFold-Multimer by a large margin (+10.7% in terms of the Top-5 best DockQ), especially when the predicted complex structures have low confidence. We further show that by combining several MSA generation methods, we may yield even better complex structure prediction accuracy than Alphafold-Multimer (+22% in terms of the Top-5 best DockQ). By systematically analyzing the impact factors of our algorithm we find that the diversity of MSA of interologs significantly affects the prediction accuracy. Moreover, we show that ESMPair performs particularly well on complexes in eucaryotes.

Style APA, Harvard, Vancouver, ISO itp.

43

Xiang, Wenkai, Zhaoping Xiong, Huan Chen, Jiacheng Xiong, Wei Zhang, Zunyun Fu, Mingyue Zheng, Bing Liu i Qian Shi. "FAPM: Functional Annotation of Proteins using Multi-Modal Models Beyond Structural Modeling". Bioinformatics, 14.11.2024. http://dx.doi.org/10.1093/bioinformatics/btae680.

Pełny tekst źródła

Streszczenie:

Abstract Motivation Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and “tail labels” with few known examples. Previous methods mainly focused on protein sequence features, overlooking the semantic meaning of protein labels. Results We introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM's flexibility allows it to incorporate extra text prompts, like taxonomy information, enhancing both its predictive performance and explainability. This novel approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation. The online demo is at: https://huggingface.co/spaces/wenkai/FAPM_demo. Supplementary information Supplementary data are available at Bioinformatics online.

Style APA, Harvard, Vancouver, ISO itp.

44

Tang Tian-Yi, Xiong Yi-Ming, Zhang Rui-Ge, Zhang Jian, Li Wen-Fei, Wang Jun i Wang Wei. "Progress in Protein Pre-training Models Integrated with Structural Knowledge". Acta Physica Sinica, 2024, 0. http://dx.doi.org/10.7498/aps.73.20240811.

Pełny tekst źródła

Streszczenie:

The AI revolution sparked by natural language and image processing has brought new ideas and research paradigms to the field of protein computing. One significant advancement is the development of pre-trained protein language models through self-supervised learning from massive protein sequences. These pre-trained models encode various information about protein sequences, evolution, structures, and even functions, which can be easily transferred to various downstream tasks and demonstrate robust generalization capabilities. Recently, researchers are further developing multimodal pre-trained models that integrate more diverse types of data. This review summarizes the recent studies in this direction from the following aspects. Firstly, it reviews protein pre-trained models that integrate protein structures into language models; this is of particular importance since protein structure is the primary determinant of its function. Secondly, the pre-trained models that integrate protein dynamic information are introduced. These models may benefit downstream tasks such as protein-protein interactions, soft docking of ligands, and interactions involving allosteric proteins and intrinsic disordered proteins. Thirdly, the pre-trained models that integrate knowledge such as gene ontology are described. Fourthly, we briefly introduce pretrained models in RNA fields. Lastly, we introduce the most recent developments in protein designs and discuss the relations of these models with respect to the aforementioned pre-trained models that integrate protein structure information.

Style APA, Harvard, Vancouver, ISO itp.

45

Marquet, Céline, Michael Heinzinger, Tobias Olenyi, Christian Dallago, Kyra Erckert, Michael Bernhofer, Dmitrii Nechaev i Burkhard Rost. "Embeddings from protein language models predict conservation and variant effects". Human Genetics, 30.12.2021. http://dx.doi.org/10.1007/s00439-021-02411-y.

Pełny tekst źródła

Streszczenie:

AbstractThe emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient—MCC—for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.

Style APA, Harvard, Vancouver, ISO itp.

46

Livesey, Benjamin J., i Joseph A. Marsh. "Advancing variant effect prediction using protein language models". Nature Genetics, 10.08.2023. http://dx.doi.org/10.1038/s41588-023-01470-3.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

47

Nijkamp, Erik, Jeffrey A. Ruffolo, Eli N. Weinstein, Nikhil Naik i Ali Madani. "ProGen2: Exploring the boundaries of protein language models". Cell Systems, październik 2023. http://dx.doi.org/10.1016/j.cels.2023.10.002.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

48

Hu, Yue, Bin Huang, Chun Zi Zang i Jia Jie Xu. "Detection of circular permutations by Protein Language Models". Computational and Structural Biotechnology Journal, grudzień 2024. https://doi.org/10.1016/j.csbj.2024.12.029.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

49

Zhang, Qiang, Wanyi Chen, Ming Qin, Yuhao Wang, Zhongji Pu, Keyan Ding, Yuyue Liu i in. "Integrating protein language models and automatic biofoundry for enhanced protein evolution". Nature Communications 16, nr 1 (11.02.2025). https://doi.org/10.1038/s41467-025-56751-8.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

50

Si, Yunda, i Chengfei Yan. "Improved inter-protein contact prediction using dimensional hybrid residual networks and protein language models". Briefings in Bioinformatics, 9.02.2023. http://dx.doi.org/10.1093/bib/bbad039.

Pełny tekst źródła

Streszczenie:

Abstract The knowledge of contacting residue pairs between interacting proteins is very useful for the structural characterization of protein–protein interactions (PPIs). However, accurately identifying the tens of contacting ones from hundreds of thousands of inter-protein residue pairs is extremely challenging, and performances of the state-of-the-art inter-protein contact prediction methods are still quite limited. In this study, we developed a deep learning method for inter-protein contact prediction, which is referred to as DRN-1D2D_Inter. Specifically, we employed pretrained protein language models to generate structural information-enriched input features to residual networks formed by dimensional hybrid residual blocks to perform inter-protein contact prediction. Extensively bechmarking DRN-1D2D_Inter on multiple datasets, including both heteromeric PPIs and homomeric PPIs, we show DRN-1D2D_Inter consistently and significantly outperformed two state-of-the-art inter-protein contact prediction methods, including GLINTER and DeepHomo, although both the latter two methods leveraged the native structures of interacting proteins in the prediction, and DRN-1D2D_Inter made the prediction purely from sequences. We further show that applying the predicted contacts as constraints for protein–protein docking can significantly improve its performance for protein complex structure prediction.

Style APA, Harvard, Vancouver, ISO itp.

Artykuły w czasopismach na temat „Protein language models”

Utwórz poprawne odniesienie w stylach APA, MLA, Chicago, Harvard i wielu innych