Department of Biological Sciences

Permanent URI for this collection: http://localhost:4000/handle/123456789/1922

CoMemMoRFPred: sequence-based prediction of MemMoRFs by combining predictors of intrinsic disorder, MoRFs and disordered lipid-binding regions
(Elsevier, 2023-11) Basu, Sushmita
Molecular recognition features (MoRFs) are a commonly occurring type of intrinsically disordered regions (IDRs) that undergo disorder-to-order transition upon binding to partner molecules. We focus on recently characterized and functionally important membrane-binding MoRFs (MemMoRFs). Motivated by the lack of computational tools that predict MemMoRFs, we use a dataset of experimentally annotated MemMoRFs to conceptualize, design, evaluate and release an accurate sequence-based predictor. We rely on state-of-the-art tools that predict residues that possess key characteristics of MemMoRFs, such as intrinsic disorder, disorder-to-order transition and lipid-binding. We identify and combine results from three tools that include flDPnn for the disorder prediction, DisoLipPred for the prediction of disordered lipid-binding regions, and MoRFCHiBiLight for the prediction of disorder-to-order transitioning protein binding regions. Our empirical analysis demonstrates that combining results produced by these three methods generates accurate predictions of MemMoRFs. We also show that use of a smoothing operator produces predictions that closely mimic the number and sizes of the native MemMoRF regions.
Comparative assessment of binding residue predictions in intrinsically disordered regions
(Wiley, 2025-09) Basu, Sushmita
Dozens of impactful methods that predict intrinsically disordered regions (IDRs) in protein sequences that interact with proteins and/or nucleic acids were developed. Their training and assessment rely on the IDR-level binding annotations, while the equivalent structure-trained methods predict more granular annotations of binding amino acids (AA). We compiled a new benchmark dataset that annotates binding AA in IDRs and applied it to complete a first-of-its-kind assessment of predictions of the disordered binding residues. We evaluated a representative collection of 14 methods, used several hundred low-similarity test proteins, and focused on the challenging task of differentiating these binding residues from other disordered AA and considering ligand type-specific predictions (protein–protein vs. protein–nucleic acid interactions). We found that current methods struggle to accurately predict binding IDRs among disordered residues; however, better-than-random tools predict disordered binding residues significantly better than binding IDRs. We identified at least one relatively accurate tool for predicting disordered protein-binding and disordered nucleic acid-binding AA. Analysis of cross-predictions between interactions with protein and nucleic acids revealed that most methods are ligand-type-agnostic. Only two predictors of the nucleic acid-binding IDRs and two predictors of the protein-binding IDRs can be considered as ligand-type-specific. We also discussed several potential future directions that would move this field forward by producing more accurate methods that target the prediction of binding residues, reduce cross-predictions, and cover a broader range of ligand types.
Computational prediction of disordered binding regions
(Elsevier, 2023) Basu, Sushmita
One of the key features of intrinsically disordered regions (IDRs) is their ability to interact with a broad range of partner molecules. Multiple types of interacting IDRs were identified including molecular recognition fragments (MoRFs), short linear sequence motifs (SLiMs), and protein-, nucleic acids- and lipid-binding regions. Prediction of binding IDRs in protein sequences has gained momentum in recent years. We survey 38 predictors of binding IDRs that target interactions with a diverse set of partners, such as peptides, proteins, RNA, DNA and lipids. We offer a historical perspective and highlight key events that fueled efforts to develop these methods. These tools rely on a diverse range of predictive architectures that include scoring functions, regular expressions, traditional and deep machine learning and meta-models. Recent efforts focus on the development of deep neural network-based architectures and extending coverage to RNA, DNA and lipid-binding IDRs. We analyze availability of these methods and show that providing implementations and webservers results in much higher rates of citations/use. We also make several recommendations to take advantage of modern deep network architectures, develop tools that bundle predictions of multiple and different types of binding IDRs, and work on algorithms that model structures of the resulting complexes.
Conservation and coevolution determine evolvability of different classes of disordered residues in human intrinsically disordered proteins
(Wiley, 2021-10) Basu, Sushmita
Structure, function, and evolution are interdependent properties of proteins. Diversity of protein functions arising from structural variations is a potential driving force behind protein evolvability. Intrinsically disordered proteins or regions (IDPs or IDRs) lack well-defined structure under normal physiological conditions, yet they are highly functional. Increased occurrence of IDPs in eukaryotes compared to prokaryotes indicates a strong correlation between protein evolution and disorderedness. IDPs generally have a higher evolutionary rate compared to globular proteins. Structural pliability allows IDPs to accommodate multiple mutations without affecting their functional potential. Nevertheless, how evolutionary signals vary between different classes of disordered residues (DRs) in IDPs is poorly understood. This study addresses variation of evolutionary behavior in terms of residue conservation and intra-protein coevolution among structural and functional classes of DRs in IDPs. Analyses are performed on 579 human IDPs, which are classified based on length of IDRs, interacting partners and functional classes. We find short IDRs are less conserved than long IDRs or full IDPs. Functional classes which require flexibility and specificity to perform their activity evolve comparatively slower than others. Disorder promoting amino acids evolve faster than order promoting amino acids. Pro, Gly, Ile, and Phe have a unique coevolving nature, which further emphasizes their roles in IDPs. This study sheds light on evolutionary footprints in different classes of DRs from human IDPs and enhances our understanding of the structural and functional potential of IDPs.
DEPICTER2: a comprehensive webserver for intrinsic disorder and disorder function prediction
(OUP, 2023-05) Basu, Sushmita
Intrinsic disorder in proteins is relatively abundant in nature and essential for a broad spectrum of cellular functions. While disorder can be accurately predicted from protein sequences, as was empirically demonstrated in recent community-organized assessments, it is rather challenging to collect and compile a comprehensive prediction that covers multiple disorder functions. To this end, we introduce the DEPICTER2 (DisorderEd PredictIon CenTER) webserver that offers convenient access to a curated collection of fast and accurate disorder and disorder function predictors. This server includes a state-of-the-art disorder predictor, flDPnn, and five modern methods that cover all currently predictable disorder functions: disordered linkers and protein, peptide, DNA, RNA and lipid binding. DEPICTER2 allows selection of any combination of the six methods and batch predictions of up to 25 proteins per request, and provides interactive visualization of the resulting predictions.
DescribePROT database of residue-level protein structure and function annotations
(Springer, 2024-11) Basu, Sushmita
DescribePROT is a freely available online database of structural and functional descriptors of proteins at the amino acid level. It provides access to 13 diverse descriptors that include sequence conservation, putative secondary structure, solvent accessibility, intrinsic disorder, and signal peptides, and putative annotations of residues that interact with proteins, peptides and nucleic acids. These data can be used to elucidate protein functions, to support efforts to develop therapeutics, and to develop and evaluate future predictors of protein structure and function. DescribePROT includes 7.8 billion predictions for 1.4 million proteins from 83 complete proteomes of popular model organisms. This information can be downloaded at multiple levels of scope (entire database, specific organisms, and individual proteins) and can be interacted with using a graphical interface that simultaneously displays data on multiple descriptors. We describe the contents of this resource, provide directions on how to use its interface, and offer instructions on how to obtain and interact with the underlying data.
DescribePROT in 2023: more, higher-quality and experimental annotations and improved data download options
(OUP, 2023-11) Basu, Sushmita
The DescribePROT database of amino acid-level descriptors of protein structures and functions has been substantially expanded since its release in 2020. This expansion includes a substantial increase in the size, scope, and quality of the underlying data, the addition of experimental structural information, the inclusion of new data download options, and an upgraded graphical interface. DescribePROT currently covers 19 structural and functional descriptors for proteins in 273 reference proteomes generated by 11 accurate and complementary predictive tools. Users can search our resource in multiple ways, interact with the data using the graphical interface, and download data at various scales including individual proteins, entire proteomes, and the whole database. The annotations in DescribePROT are useful for a broad spectrum of studies that include investigations of protein structure and function, development and validation of predictive tools, and efforts to understand the molecular underpinnings of diseases and to develop therapeutics.
Do sequence neighbours of intrinsically disordered regions promote structural flexibility in intrinsically disordered proteins?
(Elsevier, 2020-02) Basu, Sushmita
Intrinsically disordered proteins (IDPs) are crucial players in various cellular activities. Several experimental and computational analyses have been conducted to study structural pliability and functional potential of IDPs. Despite active research in the past few decades, what induces structural disorder in IDPs, and how, is still elusive. Many studies testify that sequential and spatial neighbours often play important roles in determining structural and functional behaviour of proteins. Considering this fact, we assessed sequence neighbours of intrinsically disordered regions (IDRs) to understand if they have any role to play in inducing structural flexibility in IDPs. Our analysis includes 97% eukaryotic IDPs and 3% from bacteria and viruses. Physicochemical and structural parameters including amino acid propensity, hydrophobicity, secondary structure propensity, relative solvent accessibility, B-factor and atomic packing density are used to characterise the neighbouring residues of IDRs (NRIs). We show that NRIs exhibit a unique nature, which makes them stand out from both ordered and disordered residues. They show correlative occurrences of residue pairs like Ser-Thr and Gln-Asn, indicating their tendency to avoid strong biases of order or disorder promoting amino acids. We also find differential preferences of amino acids between N- and C-terminal neighbours, which might indicate a plausible directional effect on the dynamics of adjacent IDRs. We designed an efficient prediction tool using Random Forest to distinguish the NRIs from the ordered residues. Our findings will contribute to understanding the behaviour of IDPs, and may provide a potential lead in deciphering the role of IDRs in protein folding and assembly.
Effect of neighbouring residues in conformational plasticity of intrinsically disordered proteins
(Elsevier, 2018-02) Basu, Sushmita
The concept of unstructured proteins has opened new avenues in the field of structural biology. Intrinsically disordered proteins (IDPs) are a new class of proteins which have been found to be major players in many significant cellular functions. IDPs have been characterised by their physicochemical properties as well as their molecular interaction behaviour. Detailed study of IDPs can lead to a better understanding of protein folding and function. To understand the source of disorderedness in the disordered regions (IDRs) in IDPs, we studied how the sequence environment of a disordered region correlates with its randomness. Here, we analysed physicochemical and structural features like amino acid propensities, net charge, hydropathy index, secondary structure propensity, relative surface accessibility, interaction density and H-bonds to characterise the neighbours of the IDRs. Five residues on each of the N- and C-terminal sides of the disordered region are considered as the neighbours of IDRs. These neighbouring residues are found to be enriched in disorder promoting amino acids and have a higher propensity to form loops than other secondary structures. Solvent accessibility of neighbouring residues also showed an increasing trend as we move towards the IDRs. The variation of other parameters along with the above observation indicates that the neighbouring residues of IDRs induce a degree of flexibility in the adjoining IDRs. Based on our findings, we are designing an algorithm using random forest, which shall predict the disordered region based on its neighbouring sequences. The information on IDRs and their neighbours can be useful for proteins to be expressed or characterised for the first time. It can also provide a lead in understanding the molecular mechanism behind the polymorphic interactions that are involved with IDPs.
flDPnn2: accurate and fast predictor of intrinsic disorder in proteins
(Elsevier, 2024-09) Basu, Sushmita
Prediction of the intrinsic disorder in protein sequences is an active research area, with well over 100 predictors released to date. These efforts are motivated by the functional importance and high levels of abundance of intrinsic disorder, combined with relatively low amounts of experimental annotations. The disorder predictors are periodically evaluated by independent assessors in the Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiments. The recently completed CAID2 experiment assessed close to 40 state-of-the-art methods, demonstrating that some of them produce accurate results. In particular, the flDPnn2 method, which is the successor of flDPnn that performed well in the CAID1 experiment, secured the overall most accurate results on the Disorder-NOX dataset in CAID2. flDPnn2 implements a number of improvements when compared to its predecessor including changes to the inputs, increased size of the deep network model that we retrained on a larger training set, and addition of an alignment module. Using results from CAID2, we show that flDPnn2 produces accurate predictions very quickly, modestly improving over the accuracy of flDPnn and reducing the runtime by half, to about 27 s per protein.
flDPnn3: Fast and accurate prediction of intrinsic disorder in protein sequences
(Elsevier, 2026-01) Basu, Sushmita
flDPnn3 provides fast and highly accurate predictions of intrinsic disorder. Compared to its earlier versions, it uses a more sophisticated sequence-derived profile as input, covering a modern protein language model and additional predicted disorder functions, while maintaining a similarly small computational footprint. flDPnn3 and over 70 other disorder predictors were independently evaluated on the Disorder-NOX dataset by assessors in CAID3 (3rd Critical Assessment of protein Intrinsic Disorder prediction). A side-by-side comparison in CAID3, including low-sequence-similarity subsets of the CAID3 test data, reveals that our method matches the predictive quality of the best disorder predictors. The runtime analysis shows that flDPnn3 produces results between 3 and 8 times faster than similarly accurate disorder predictors and can be used to produce predictions at the whole-proteome scale. Additionally, flDPnn3 achieves 100% coverage by predicting all proteins, while some other accurate tools fail to predict some proteins. The CAID3 results also demonstrate that flDPnn3 is significantly more accurate than its previous versions, flDPnn and flDPnn2, which were among the top-ranked methods in CAID1 and CAID2, respectively. The flDPnn3’s web server supports batch predictions, provides interactive visualization of results, offers a tutorial page,
HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins
(OUP, 2023-12) Basu, Sushmita
Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups, those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) vs. intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. The majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses a deep transformer network to combine predictions generated by the three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression.
Impaired nuclear transport induced by juvenile ALS causing P525L mutation in NLS domain of FUS: A molecular mechanistic study
(Elsevier, 2022-04) Basu, Sushmita
Amyotrophic lateral sclerosis (ALS) and fronto-temporal lobar degeneration (FTLD) are progressive neurological disorders affecting motor neurons. Cellular aggregates of fused in sarcoma (FUS) protein are found in cytoplasm of ALS and FTLD patients. Nuclear localisation signal (NLS) domain of FUS binds to Karyopherin β2 (Kapβ2), which drives nuclear transport of FUS from cytoplasm. Several pathogenic mutations are reported in FUS NLS, which are associated with its impaired nuclear transport and cytoplasmic mis-localisation. P525L mutation in NLS is most commonly found in cases of juvenile ALS (jALS), which affects individuals below 25 years of age. jALS progresses aggressively causing death within a year of its onset. This study elucidates the molecular mechanism behind jALS-causing P525L mutation hindering nuclear transport of FUS. We perform multiple molecular dynamics simulations in aqueous and hydrophobic solvent to understand the effect of the mutation at molecular level. Dynamics of Kapβ2-FUS complex is better captured in hydrophobic solvent compared to aqueous solvent. P525 and Y526 (PY-motif) of NLS exhibit fine-tuned stereochemical arrangement, which is essential for optimum Kapβ2 binding. P525L causes loss of several native contacts at interface leading to weaker binding, which promotes self-aggregation of FUS in cytoplasm. Native complex samples closed conformation, while mutant complex exhibits open conformation exposing hydrophilic residues of Kapβ2 to hydrophobic solvent. Mutant complex also fails to exhibit spring-like motion essential for its transport through nuclear pore complex. This study provides a mechanistic insight of binding affinity between NLS and Kapβ2 that inhibits self-aggregation of FUS preventing the disease condition.
MERIT: accurate prediction of multi ligand-binding residues with hybrid deep transformer network, evolutionary couplings and transfer learning
(Elsevier, 2025-08) Basu, Sushmita
Multi-ligand binding residues (MLBRs) are amino acids in protein sequences that interact with multiple different ligands that include proteins, peptides, nucleic acids, and a variety of small molecules. MLBRs are implicated in a number of cellular functions and targeted in a context of multiple human diseases. There are many sequence-based predictors of residues that interact with specific ligand types and they can be collectively used to identify MLBRs. However, there are no methods that directly predict MLBRs. To this end, we conceptualize, design, evaluate and release MERIT (Multi-binding rEsidues pRedIcTor). This tool relies on a custom-crafted deep neural network that implements a number of innovative features, such as a multi-layered/step architecture with transformer modules that we train using a custom-designed loss function, computation of evolutionary couplings, and application of transfer learning. These innovations boost predictive performance, which we demonstrate using an ablation analysis. In particular, they reduce the number of cross-predictions, defined as residues that interact with a single ligand type that are incorrectly predicted as MLBRs. We compare MERIT against a representative selection of current and popular ligand-specific predictors, meta-predictors that combine their results to identify MLBRs, and a baseline regression-based predictor. These tests reveal that MERIT provides accurate predictions and statistically outperforms these alternatives. Moreover, using two test datasets, one with MLBRs and another with only the single ligand binding residues, we show that MERIT consistently produces relatively low false positive rates, including low rates of cross-predictions. The web server and datasets from this study are freely available at http://biomine.cs.vcu.edu/servers/MERIT/.
Molecular mechanism of the enhanced viral fitness contributed by secondary mutations in the hemagglutinin protein of oseltamivir resistant H1N1 influenza viruses: Modeling studies of antibody and receptor binding
(Elsevier, 2015-02) Basu, Sushmita
The envelope protein hemagglutinin (HA) of influenza viruses is primarily associated with host antibody and receptor interactions. The HA protein is known to maintain a functional balance with neuraminidase (NA), the other major envelope protein. Prior to 2007–2008, human seasonal H1N1 viruses possessing the NA H274Y mutation, which confers oseltamivir resistance, generally had low growth capability. Subsequently, secondary mutations that compensate for the deleterious effect of the NA H274Y mutation have been identified. The molecular mechanism of how the defect could be counteracted by these secondary mutations is not fully understood. We studied here the effect of three such mutations (T86K, K144E and R192K) in the HA protein, which are located at either the HA receptor binding site or in the H1N1 antigenic sites. Molecular docking and dynamics studies showed that, of the three mutations, the R192K mutation could have mediated neutralizing antibody escape and decreased receptor binding affinity, either or both of which may have contributed to increased viral fitness. The study suggests the molecular basis of enhanced viral fitness induced by secondary mutations in the evolution of oseltamivir-resistant influenza strains.
pLMMoRF: A web server that accurately predicts membrane-interacting molecular recognition features by employing a protein language model
(Elsevier, 2025-09) Basu, Sushmita
Interactions between proteins and lipids are crucial for numerous cellular processes. Some of the lipid interacting segments in protein sequences are intrinsically disordered regions (IDRs), which may gain secondary structures upon binding. We collected experimentally annotated lipid-interacting IDRs, named membrane molecular recognition features (MemMoRFs). We used this dataset to develop and test an accurate and relatively fast sequence-based MemMoRF predictor, pLMMoRF, thereby supporting tedious and costly experimental identification of MemMoRFs. Our predictor utilizes a protein language model (pLM) which we processed to generate inputs to a deep convolutional neural network. We considered various pLMs (ESM-2, ProstT5, ProtT5 and Ankh) and applied feature selection to reduce their outputs, creating a more compact neural network model. pLMMoRF leverages the Ankh-based model, selected for its higher accuracy compared to our other models. Tests on low similarity test datasets demonstrate that pLMMoRF is more accurate than the sole current predictor of MemMoRFs, CoMemMoRFPred. Moreover, pLMMoRF has a relatively small computational footprint because of the compact network size and use of dedicated GPU nodes. This allowed us to make MemMoRF predictions for the human proteome. We analyzed these predictions and made them publicly available, facilitating an improved understanding of functions of membrane-coupled proteins. Our work underscores the importance of selecting key embedding features to enhance predictive performance and reduce computational footprint of sequence-based predictors of protein functions. The web server for the pLMMoRF predictor and the predictions for human proteins
Prediction of intrinsic disorder functions with DEPICTER2
(Springer Nature, 2025-07) Basu, Sushmita
DEPICTER2 is a modern web server that provides convenient access to a broad selection of sequence-based predictions of intrinsic disorder and disorder functions. It incorporates six state-of-the-art methods that include ANCHOR2, DFLpred, DisoLipPred, DisoRDPbind, flDPnn, and MoRFCHiBi_Light, which predict disordered linkers and disordered regions that bind proteins, peptides, DNA, RNA, and lipids. DEPICTER2 facilitates selection of any combination of the six methods and batch predictions for multiple protein sequences. The prediction process is fully automated, performed on the server side, and does not require installation of any software. We describe and motivate selection of the six predictors, detail the prediction process, and explain how to interact with this web resource. We focus on the aspects related to the prediction of intrinsic disorder functions and provide a case study that illustrates how to interpret results produced by DEPICTER2.
Prediction of nucleic acid binding residues in protein sequences: recent advances and future prospects
(Elsevier, 2025-10) Basu, Sushmita
Computational prediction of DNA-binding residues (DBRs) and RNA-binding residues (RBRs) in protein sequences is an active area of research, with about 90 predictors, 20 of which were published over the last two years. The new predictors rely on sophisticated deep neural networks and protein language models, produce accurate predictions, and are conveniently available as code and/or web servers. However, we identified a shortage of tools that predict these interactions in intrinsically disordered regions and of tools capable of predicting residues that interact with specific RNA and DNA types. Moreover, cross-predictions between RBRs and DBRs should be quantified and minimized to ensure that future tools accurately differentiate between these two distinct types of nucleic acids.
qNABpredict: Quick, accurate, and taxonomy-aware sequence-based prediction of content of nucleic acid binding amino acids
(Wiley, 2022-12) Basu, Sushmita
Protein sequence-based predictors of nucleic acid (NA)-binding include methods that predict NA-binding proteins and NA-binding residues. The residue-level tools produce more details but suffer high computational cost since they must predict every amino acid in the input sequence and rely on multiple sequence alignments. We propose an alternative approach that predicts content (fraction) of the NA-binding residues, offering more information than the protein-level prediction and much shorter runtime than the residue-level tools. Our first-of-its-kind content predictor, qNABpredict, relies on a small, rationally designed and fast-to-compute feature set that represents relevant characteristics extracted from the input sequence and a well-parametrized support vector regression model. We provide two versions of qNABpredict, a taxonomy-agnostic model that can be used for proteins of unknown taxonomic origin and more accurate taxonomy-aware models that are tailored to specific taxonomic kingdoms: archaea, bacteria, eukaryota, and viruses. Empirical tests on a low-similarity test dataset show that qNABpredict is 100 times faster and generates statistically more accurate content predictions when compared to the content extracted from results produced by the residue-level predictors. We also show that qNABpredict's content predictions can be used to improve results generated by the residue-level predictors. We release qNABpredict as a convenient webserver and source code at http://biomine.cs.vcu.edu/servers/qNABpredict/. This new tool should be particularly useful to predict details of protein–NA interactions for large protein families and proteomes.
Real-time observation of macroscopic helical morphologies under optical microscope: a curious case of π–π stacking driven molecular self-assembly of an organic gelator devoid of hydrogen bonding
(Wiley, 2022-12) Basu, Sushmita
Supramolecular assemblies such as tubules/helix/double helix/helical tape etc. are usually submicron objects preventing direct observation under optical microscope. Chiral-pure form of these assemblies is important for potential applications. Herein, we report a rare phenomenon wherein a DMSO gel of a simple terpyridine derivative [(4-CNPhe)4PyTerp] produced macroscopic helical morphologies (μm length scale) which could be observed under optical microscope, formation of which could be monitored by optical videography, stable enough to withstand acidic vapour, robust enough to display reversible gel↔sol in response to acidic and ammonia vapour and sturdy enough to be maneuvered with a needle. These properties appeared to be unique to the title compound as the other related derivatives failed to display such assembly structures. SXRD and MD simulation studies suggested that weak interactions (π-π stacking) played a crucial role in the self-assembly process.