BITS Faculty Publications

Permanent URI for this communityhttp://localhost:4000/handle/123456789/1867

Browse

Search Results

Now showing 1 - 10 of 29

SAC: a framework for measuring and inducing personality traits in llms with dynamic intensity control
(2025-06) Challa, Jagat Sesh; Kumar, Dhruv
Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: \textit{Frequency}, \textit{Depth}, \textit{Threshold}, \textit{Effort}, and \textit{Willingness}. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation. Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.
The Impact of Large Language Models on K-12 Education in Rural India: A Thematic Analysis of Student Volunteer's Perspectives
(2025-05) Kumar, Dhruv; Challa, Jagat Sesh; Ramachandran, Veena
AI-driven education, particularly Large Language Models (LLMs), has the potential to address learning disparities in rural K-12 schools. However, research on AI adoption in rural India remains limited, with existing studies focusing primarily on urban settings. This study examines the perceptions of volunteer teachers on AI integration in rural education, identifying key challenges and opportunities. Through semi-structured interviews with 23 volunteer educators in Rajasthan and Delhi, we conducted a thematic analysis to explore infrastructure constraints, teacher preparedness, and digital literacy gaps. Findings indicate that while LLMs could enhance personalized learning and reduce teacher workload, barriers such as poor connectivity, lack of AI training, and parental skepticism hinder adoption. Despite concerns over over-reliance and ethical risks, volunteers emphasize that AI should be seen as a complementary tool rather than a replacement for traditional teaching. Given the potential benefits, LLM-based tutors merit further exploration in rural classrooms, with structured implementation and localized adaptations to ensure accessibility and equity.
Transformers for vision: a survey on innovative methods for computer vision
(IEEE, 2025-05) Kumar, Dhruv; Chalapathi, G.S.S.
Transformers have emerged as a groundbreaking architecture in the field of computer vision, offering a compelling alternative to traditional convolutional neural networks (CNNs) by enabling the modeling of long-range dependencies and global context through self-attention mechanisms. Originally developed for natural language processing, transformers have now been successfully adapted for a wide range of vision tasks, leading to significant improvements in performance and generalization. This survey provides a comprehensive overview of the fundamental principles of transformer architectures, highlighting the core mechanisms such as self-attention, multi-head attention, and positional encoding that distinguish them from CNNs. We delve into the theoretical adaptations required to apply transformers to visual data, including image tokenization and the integration of positional embeddings. A detailed analysis of key transformer-based vision architectures such as ViT, DeiT, Swin Transformer, PVT, Twins, and CrossViT are presented, alongside their practical applications in image classification, object detection, video understanding, medical imaging, and cross-modal tasks. The paper further compares the performance of vision transformers with CNNs, examining their respective strengths, limitations, and the emergence of hybrid models. Finally, current challenges in deploying ViTs, such as computational cost, data efficiency, and interpretability, and explore recent advancements and future research directions including efficient architectures, self-supervised learning, and multimodal integration are discussed.
Investigating pedagogical teacher and student LLM agents: genetic adaptation meets retrieval augmented generation across learning style
(2025-05) Kumar, Dhruv
Effective teaching requires adapting instructional strategies to accommodate the diverse cognitive and behavioral profiles of students, a persistent challenge in education and teacher training. While Large Language Models (LLMs) offer promise as tools to simulate such complex pedagogical environments, current simulation frameworks are limited in two key respects: (1) they often reduce students to static knowledge profiles, and (2) they lack adaptive mechanisms for modeling teachers who evolve their strategies in response to student feedback. To address these gaps, \textbf{we introduce a novel simulation framework that integrates LLM-based heterogeneous student agents with a self-optimizing teacher agent}. The teacher agent's pedagogical policy is dynamically evolved using a genetic algorithm, allowing it to discover and refine effective teaching strategies based on the aggregate performance of diverse learners. In addition, \textbf{we propose Persona-RAG}, a Retrieval Augmented Generation module that enables student agents to retrieve knowledge tailored to their individual learning styles. Persona-RAG preserves the retrieval accuracy of standard RAG baselines while enhancing personalization, an essential factor in modeling realistic educational scenarios. Through extensive experiments, we demonstrate how our framework supports the emergence of distinct and interpretable teaching patterns when interacting with varied student populations. Our results highlight the potential of LLM-driven simulations to inform adaptive teaching practices and provide a testbed for training human educators in controlled, data-driven environments.
Automated type annotation in Python using large language models
(2025-08) Kumar, Dhruv
Type annotations in Python enhance maintainability and error detection. However, generating these annotations manually is error prone and requires extra effort. Traditional automation approaches like static analysis, machine learning, and deep learning struggle with limited type vocabularies, behavioral over approximation, and reliance on large labeled datasets. In this work, we explore the use of LLMs for generating type annotations in Python. We develop a generate check repair pipeline: the LLM proposes annotations guided by a Concrete Syntax Tree representation, a static type checker (Mypy) verifies them, and any errors are fed back for iterative refinement. We evaluate four LLM variants: GPT 4oMini, GPT 4.1mini (general-purpose), and O3Mini, O4Mini (reasoning optimized), on 6000 code snippets from the ManyTypes4Py benchmark. We first measure the proportion of code snippets annotated by LLMs for which MyPy reported no errors (i.e., consistent results): GPT 4oMini achieved consistency on 65.9% of cases (34.1% inconsistent), while GPT 4.1mini, O3Mini, and O4Mini each reached approximately 88.6% consistency (around 11.4% failures). To measure annotation quality, we then compute exact-match and base-type match accuracies over all 6000 snippets: GPT 4.1mini and O3Mini perform the best, achieving up to 70.5% exact match and 79.1% base type accuracy, requiring under one repair iteration on average. Our results demonstrate that general-purpose and reasoning optimized LLMs, without any task specific fine tuning or additional training can be effective in generating consistent type this http URL perform competitively with traditional deep learning techniques which require large labeled dataset for training. While our work focuses on Python, the pipeline can be extended to other optionally typed imperative languages like Ruby
Loop invariant generation: a hybrid framework of reasoning optimised LLMs and SMT solvers
(2025-08) Kumar, Dhruv
Loop invariants are essential for proving the correctness of programs with loops. Developing loop invariants is challenging, and fully automatic synthesis cannot be guaranteed for arbitrary programs. Some approaches have been proposed to synthesize loop invariants using symbolic techniques and more recently using neural approaches. These approaches are able to correctly synthesize loop invariants only for subsets of standard benchmarks. In this work, we investigate whether modern, reasoning-optimized large language models can do better. We integrate OpenAI's O1, O1-mini, and O3-mini into a tightly coupled generate-and-check pipeline with the Z3 SMT solver, using solver counterexamples to iteratively guide invariant refinement. We use Code2Inv benchmark, which provides C programs along with their formal preconditions and postconditions. On this benchmark of 133 tasks, our framework achieves 100% coverage (133 out of 133), outperforming the previous best of 107 out of 133, while requiring only 1-2 model proposals per instance and 14-55 seconds of wall-clock time. These results demonstrate that LLMs possess latent logical reasoning capabilities which can help automate loop invariant synthesis. While our experiments target C-specific programs, this approach should be generalizable to other imperative languages.
The role of generative AI tools in shaping mechanical engineering education from an undergraduate perspective
(Springer Nature, 2025-03) Challa, Jagat Sesh; Kumar, Dhruv
This study evaluates the effectiveness of three leading generative AI tools-ChatGPT, Gemini, and Copilot-in undergraduate mechanical engineering education using a mixed-methods approach. The performance of these tools was assessed on 800 questions spanning seven core subjects, covering multiple-choice, numerical, and theory-based formats. While all three AI tools demonstrated strong performance in theory-based questions, they struggled with numerical problem-solving, particularly in areas requiring deep conceptual understanding and complex calculations. Among them, Copilot achieved the highest accuracy (60.38%), followed by Gemini (57.13%) and ChatGPT (46.63%). To complement these findings, a survey of 172 students and interviews with 20 participants provided insights into user experiences, challenges, and perceptions of AI in academic settings. Thematic analysis revealed concerns regarding AI’s reliability in numerical tasks and its potential impact on students’ problem-solving abilities. Based on these results, this study offers strategic recommendations for integrating AI into mechanical engineering curricula, ensuring its responsible use to enhance learning without fostering dependency. Additionally, we propose instructional strategies to help educators adapt assessment methods in the era of AI-assisted learning. These findings contribute to the broader discussion on AI’s role in engineering education and its implications for future learning methodologies.
Rubric is all you need: enhancing llm-based code evaluation with question-specific rubrics
(2025-03) Challa, Jagat Sesh; Kumar, Dhruv
Since the disruption in LLM technology brought about by the release of GPT-3 and ChatGPT, LLMs have shown remarkable promise in programming-related tasks. While code generation remains a popular field of research, code evaluation using LLMs remains a problem with no conclusive solution. In this paper, we focus on LLM-based code evaluation and attempt to fill in the existing gaps. We propose multi-agentic novel approaches using question-specific rubrics tailored to the problem statement, arguing that these perform better for logical assessment than the existing approaches that use question-agnostic rubrics. To address the lack of suitable evaluation datasets, we introduce two datasets: a Data Structures and Algorithms dataset containing 150 student submissions from a popular Data Structures and Algorithms practice website, and an Object Oriented Programming dataset comprising 80 student submissions from undergraduate computer science courses. In addition to using standard metrics (Spearman Correlation, Cohen's Kappa), we additionally propose a new metric called as Leniency, which quantifies evaluation strictness relative to expert assessment. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings, providing better feedback aligned with instructional goals beyond mere syntactic correctness
System test case design from requirements specifications: insights and challenges of using chatgpt
(2024) Kumar, Dhruv
System testing is essential in any software development project to ensure that the final products meet the requirements. Creating comprehensive test cases for system testing from requirements is often challenging and time-consuming. This paper explores the effectiveness of using Large Language Models (LLMs) to generate test case designs from Software Requirements Specification (SRS) documents. In this study, we collected the SRS documents of five software engineering projects containing functional and non-functional requirements, which were implemented, tested, and delivered by respective developer teams. For generating test case designs, we used ChatGPT-4o Turbo model. We employed prompt-chaining, starting with an initial context-setting prompt, followed by prompts to generate test cases for each use case. We assessed the quality of the generated test case designs through feedback from the same developer teams as mentioned above. Our experiments show that about 87 percent of the generated test cases were valid, with the remaining 13 percent either not applicable or redundant. Notably, 15 percent of the valid test cases were previously not considered by developers in their testing. We also tasked ChatGPT with identifying redundant test cases, which were subsequently validated by the respective developers to identify false positives and to uncover any redundant test cases that may have been missed by the developers themselves. This study highlights the potential of leveraging LLMs for test generation from the Requirements Specification document and also for assisting developers in quickly identifying and addressing redundancies, ultimately improving test suite quality and efficiency of the testing procedure.
Debatebench: a challenging long context reasoning benchmark for large language models
(2025-02) Kumar, Dhruv
We introduce DebateBench, a novel dataset consisting of an extensive collection of transcripts and metadata from some of the world's most prestigious competitive debates. The dataset consists of British Parliamentary debates from prestigious debating tournaments on diverse topics, annotated with detailed speech-level scores and house rankings sourced from official adjudication data. We curate 256 speeches across 32 debates with each debate being over 1 hour long with each input being an average of 32,000 tokens. Designed to capture long-context, large-scale reasoning tasks, DebateBench provides a benchmark for evaluating modern large language models (LLMs) on their ability to engage in argumentation, deliberation, and alignment with human experts. To do well on DebateBench, the LLMs must perform in-context learning to understand the rules and evaluation criteria of the debates, then analyze 8 seven minute long speeches and reason about the arguments presented by all speakers to give the final results. Our preliminary evaluation using GPT o1, GPT-4o, and Claude Haiku, shows that LLMs struggle to perform well on DebateBench, highlighting the need to develop more sophisticated techniques for improving their performance.

BITS Faculty Publications

Browse

Filters

Settings

Sort By

Results per page

Search Results