Department of Biological Sciences
Permanent URI for this collection: http://localhost:4000/handle/123456789/1922
10 results
Search Results
Item
flDPnn2: accurate and fast predictor of intrinsic disorder in proteins
(Elsevier, 2024-09) Basu, Sushmita
Prediction of the intrinsic disorder in protein sequences is an active research area, with well over 100 predictors released to date. These efforts are motivated by the functional importance and high abundance of intrinsic disorder, combined with relatively low amounts of experimental annotations. The disorder predictors are periodically evaluated by independent assessors in the Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiments. The recently completed CAID2 experiment assessed close to 40 state-of-the-art methods, demonstrating that some of them produce accurate results. In particular, the flDPnn2 method, which is the successor of flDPnn that performed well in the CAID1 experiment, secured the overall most accurate results on the Disorder-NOX dataset in CAID2. flDPnn2 implements a number of improvements when compared to its predecessor, including changes to the inputs, increased size of the deep network model that we retrained on a larger training set, and addition of an alignment module. Using results from CAID2, we show that flDPnn2 produces accurate predictions very quickly, modestly improving over the accuracy of flDPnn and reducing the runtime by half, to about 27 s per protein.

Item
DescribePROT database of residue-level protein structure and function annotations
(Springer, 2024-11) Basu, Sushmita
DescribePROT is a freely available online database of structural and functional descriptors of proteins at the amino acid level. It provides access to 13 diverse descriptors that include sequence conservation, putative secondary structure, solvent accessibility, intrinsic disorder, signal peptides, and putative annotations of residues that interact with proteins, peptides and nucleic acids.
These data can be used to elucidate protein functions, to support efforts to develop therapeutics, and to develop and evaluate future predictors of protein structure and function. DescribePROT includes 7.8 billion predictions for 1.4 million proteins from 83 complete proteomes of popular model organisms. This information can be downloaded at multiple levels of scope (entire database, specific organisms, and individual proteins) and can be interacted with using a graphical interface that simultaneously displays data on multiple descriptors. We describe the contents of this resource, provide directions on how to use its interface, and offer instructions on how to obtain and interact with the underlying data.

Item
Taxonomy-specific assessment of intrinsic disorder predictions at residue and region levels in higher eukaryotes, protists, archaea, bacteria and viruses
(Elsevier, 2024-12) Basu, Sushmita
Intrinsic disorder predictors were evaluated in several studies, including the two large CAID experiments. However, these studies are biased towards eukaryotic proteins and focus primarily on the residue-level predictions. We provide a first-of-its-kind assessment that comprehensively covers the taxonomy and evaluates predictions at the residue and disordered region levels. We curate a benchmark dataset that uniformly covers eukaryotic, archaeal, bacterial, and viral proteins. We find that predictive performance differs substantially across taxonomy, where viruses are predicted most accurately, followed by protists and higher eukaryotes, while bacterial and archaeal proteins suffer lower levels of accuracy. These trends are consistent across predictors. We also find that current tools, except for flDPnn, struggle with reproducing native distributions of the numbers and sizes of the disordered regions.
Moreover, analysis of two variants of disorder predictions derived from the AlphaFold2 predicted structures reveals that they produce accurate residue-level propensities for archaea, bacteria and protists. However, they underperform for higher eukaryotes and generally struggle to accurately identify disordered regions. Our results motivate development of new predictors that target bacteria and archaea and which produce accurate results at both residue and region levels. We also stress the need to include the region-level assessments in future assessments.

Item
Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences
(OUP, 2025-01) Basu, Sushmita
Computational prediction of nucleic acid-binding residues in protein sequences is an active field of research, with over 80 methods that were released in the past 2 decades. We identify and discuss 87 sequence-based predictors that include dozens of recently published methods that are surveyed for the first time. We overview historical progress and examine multiple practical issues that include availability and impact of predictors, key features of their predictive models, and important aspects related to their training and assessment. We observe that the past decade has brought increased use of deep neural networks and protein language models, which contributed to substantial gains in the predictive performance. We also highlight advancements in vital and challenging issues that include cross-predictions between deoxyribonucleic acid (DNA)-binding and ribonucleic acid (RNA)-binding residues and targeting the two distinct sources of binding annotations, structure-based versus intrinsic disorder-based. The methods trained on the structure-annotated interactions tend to perform poorly on the disorder-annotated binding and vice versa, with only a few methods that target and perform well across both annotation types.
The cross-predictions are a significant problem, with some predictors of DNA-binding or RNA-binding residues indiscriminately predicting interactions with both nucleic acid types. Moreover, we show that methods with web servers are cited substantially more than tools without implementation or with no longer working implementations, motivating the development and long-term maintenance of the web servers. We close by discussing future research directions that aim to drive further progress in this area.

Item
pLMMoRF: A web server that accurately predicts membrane-interacting molecular recognition features by employing a protein language model
(Elsevier, 2025-09) Basu, Sushmita
Interactions between proteins and lipids are crucial for numerous cellular processes. Some of the lipid interacting segments in protein sequences are intrinsically disordered regions (IDRs), which may gain secondary structures upon binding. We collected experimentally annotated lipid-interacting IDRs, named membrane molecular recognition features (MemMoRFs). We used this dataset to develop and test an accurate and relatively fast sequence-based MemMoRF predictor, pLMMoRF, thereby supporting tedious and costly experimental identification of MemMoRFs. Our predictor utilizes a protein language model (pLM) which we processed to generate inputs to a deep convolutional neural network. We considered various pLMs (ESM-2, ProstT5, ProtT5 and Ankh) and applied feature selection to reduce their outputs, creating a more compact neural network model. pLMMoRF leverages the Ankh-based model, selected for its higher accuracy compared to our other models. Tests on low similarity test datasets demonstrate that pLMMoRF is more accurate than the sole current predictor of MemMoRFs, CoMemMoRFPred. Moreover, pLMMoRF has a relatively small computational footprint because of the compact network size and use of dedicated GPU nodes. This allowed us to make MemMoRF predictions for the human proteome.
We analyzed these predictions and made them publicly available, facilitating an improved understanding of functions of membrane-coupled proteins. Our work underscores the importance of selecting key embedding features to enhance predictive performance and reduce computational footprint of sequence-based predictors of protein functions. The web server for the pLMMoRF predictor and the predictions for human proteins

Item
Prediction of intrinsic disorder functions with DEPICTER2
(Springer Nature, 2025-07) Basu, Sushmita
DEPICTER2 is a modern web server that provides convenient access to a broad selection of sequence-based predictions of intrinsic disorder and disorder functions. It incorporates six state-of-the-art methods that include ANCHOR2, DFLpred, DisoLipPred, DisoRDPbind, flDPnn, and MoRFCHiBi_Light, which predict disordered linkers and disordered regions that bind proteins, peptides, DNA, RNA, and lipids. DEPICTER2 facilitates selection of any combination of the six methods and batch predictions for multiple protein sequences. The prediction process is fully automated, performed on the server side, and does not require installation of any software. We describe and motivate selection of the six predictors, detail the prediction process, and explain how to interact with this web resource. We focus on the aspects related to the prediction of intrinsic disorder functions and provide a case study that illustrates how to interpret results produced by DEPICTER2.

Item
MERIT: accurate prediction of multi ligand-binding residues with hybrid deep transformer network, evolutionary couplings and transfer learning
(Elsevier, 2025-08) Basu, Sushmita
Multi-ligand binding residues (MLBRs) are amino acids in protein sequences that interact with multiple different ligands that include proteins, peptides, nucleic acids, and a variety of small molecules. MLBRs are implicated in a number of cellular functions and targeted in the context of multiple human diseases.
There are many sequence-based predictors of residues that interact with specific ligand types and they can be collectively used to identify MLBRs. However, there are no methods that directly predict MLBRs. To this end, we conceptualize, design, evaluate and release MERIT (Multi-binding rEsidues pRedIcTor). This tool relies on a custom-crafted deep neural network that implements a number of innovative features, such as a multi-layered/step architecture with transformer modules that we train using a custom-designed loss function, computation of evolutionary couplings, and application of transfer learning. These innovations boost predictive performance, which we demonstrate using an ablation analysis. In particular, they reduce the number of cross-predictions, defined as residues that interact with a single ligand type that are incorrectly predicted as MLBRs. We compare MERIT against a representative selection of current and popular ligand-specific predictors, meta-predictors that combine their results to identify MLBRs, and a baseline regression-based predictor. These tests reveal that MERIT provides accurate predictions and statistically outperforms these alternatives. Moreover, using two test datasets, one with MLBRs and another with only the single ligand binding residues, we show that MERIT consistently produces relatively low false positive rates, including low rates of cross-predictions. The web server and datasets from this study are freely available at http://biomine.cs.vcu.edu/servers/MERIT/.

Item
Prediction of nucleic acid binding residues in protein sequences: recent advances and future prospects
(Elsevier, 2025-10) Basu, Sushmita
Computational prediction of DNA-binding residues (DBRs) and the RNA-binding residues (RBRs) in protein sequences is an active area of research, with about 90 predictors and 20 that were published over the last two years.
The new predictors rely on sophisticated deep neural networks and protein language models, produce accurate predictions, and are conveniently available as code and/or web servers. However, we identified a shortage of tools that predict these interactions in intrinsically disordered regions and of tools capable of predicting residues that interact with specific RNA and DNA types. Moreover, cross-predictions between RBRs and DBRs should be quantified and minimized to ensure that future tools accurately differentiate between these two distinct types of nucleic acids.

Item
Comparative assessment of binding residue predictions in intrinsically disordered regions
(Wiley, 2025-09) Basu, Sushmita
Dozens of impactful methods that predict intrinsically disordered regions (IDRs) in protein sequences that interact with proteins and/or nucleic acids were developed. Their training and assessment rely on the IDR-level binding annotations, while the equivalent structure-trained methods predict more granular annotations of binding amino acids (AA). We compiled a new benchmark dataset that annotates binding AA in IDRs and applied it to complete a first-of-its-kind assessment of predictions of the disordered binding residues. We evaluated a representative collection of 14 methods, used several hundred low-similarity test proteins, and focused on the challenging task of differentiating these binding residues from other disordered AA and considering ligand type-specific predictions (protein–protein vs. protein–nucleic acid interactions). We found that current methods struggle to accurately predict binding IDRs among disordered residues; however, better-than-random tools predict disordered binding residues significantly better than binding IDRs. We identified at least one relatively accurate tool for predicting disordered protein-binding and disordered nucleic acid-binding AA.
Analysis of cross-predictions between interactions with proteins and nucleic acids revealed that most methods are ligand-type-agnostic. Only two predictors of the nucleic acid-binding IDRs and two predictors of the protein-binding IDRs can be considered as ligand-type-specific. We also discussed several potential future directions that would move this field forward by producing more accurate methods that target the prediction of binding residues, reduce cross-predictions, and cover a broader range of ligand types.

Item
flDPnn3: Fast and accurate prediction of intrinsic disorder in protein sequences
(Elsevier, 2026-01) Basu, Sushmita
flDPnn3 provides fast and highly accurate predictions of intrinsic disorder. Compared to its earlier versions, it uses a more sophisticated sequence-derived profile as input, covering a modern protein language model and additional predicted disorder functions, while maintaining a similarly small computational footprint. flDPnn3 and over 70 other disorder predictors were independently evaluated on the Disorder-NOX dataset by assessors in CAID3 (3rd Critical Assessment of protein Intrinsic Disorder prediction). A side-by-side comparison in CAID3, including low-sequence-similarity subsets of the CAID3 test data, reveals that our method matches the predictive quality of the best disorder predictors. The runtime analysis shows that flDPnn3 produces results between 3 and 8 times faster than similarly accurate disorder predictors and can be used to produce predictions at the whole-proteome scale. Additionally, flDPnn3 achieves 100% coverage by predicting all proteins, while some other accurate tools fail to predict some proteins. The CAID3 results also demonstrate that flDPnn3 is significantly more accurate than its previous versions, flDPnn and flDPnn2, which were among the top-ranked methods in CAID1 and CAID2, respectively. The flDPnn3 web server supports batch predictions, provides interactive visualization of results, offers a tutorial page,