Browsing by Author "Sharma, Yashvardhan"

AENeT: an attention-enabled neural architecture for fake news detection using contextual features
(Springer, 2021) Narang, Pratik; Sharma, Yashvardhan
In the current era of social media, the popularity of smartphones and social media platforms has increased exponentially. Through these electronic media, fake news has been rising rapidly with the advent of new sources of information, which are highly unreliable. Checking whether a particular news article is genuine or fake is not easy for any end user. Search engines like Google are also not capable of judging the fakeness of a news article because they are restricted to limited query keywords. In this paper, our end goal is to design an efficient deep learning model to detect the degree of fakeness in a news statement. We propose a simple network architecture that combines the use of contextual embedding as word embedding and uses attention mechanisms with relevant metadata available. The efficacy and efficiency of our models are demonstrated on several real-world datasets. Our model achieved 46.36% accuracy on the LIAR dataset, which outperforms the current state of the art by 1.49%.
AENeT: an attention-enabled neural architecture for fake news detection using contextual features
(Springer, 2021-08) Sharma, Yashvardhan
In the current era of social media, the popularity of smartphones and social media platforms has increased exponentially. Through these electronic media, fake news has been rising rapidly with the advent of new sources of information, which are highly unreliable. Checking whether a particular news article is genuine or fake is not easy for any end user. Search engines like Google are also not capable of judging the fakeness of a news article because they are restricted to limited query keywords. In this paper, our end goal is to design an efficient deep learning model to detect the degree of fakeness in a news statement. We propose a simple network architecture that combines the use of contextual embedding as word embedding and uses attention mechanisms with relevant metadata available. The efficacy and efficiency of our models are demonstrated on several real-world datasets. Our model achieved 46.36% accuracy on the LIAR dataset, which outperforms the current state of the art by 1.49%.
Anaphora Resolution from Social Media Text
(CEUR-WS, 2022) Sharma, Yashvardhan
Anaphora resolution for social media texts is an essential yet difficult task for text understanding. An important characteristic of anaphora is that it creates a connection between the antecedent and the anaphor buried in the anaphoric sentence. This paper outlines the methods used to locate anaphora and their antecedents in a particular text. The text is a social media tweet for the SocAnaRes-IL 2022 challenge that was part of FIRE 2022. The proposed model uses a Neural Co-reference Network for the anaphora resolution.
Application of Java Relationship Graphs (JRG) to plagiarism detection in Java Projects: A Neo4j Graph Database Approach
(ACM Digital Library, 2021) Sharma, Yashvardhan; Arora, Ritu
A significant role is played by visualization of complex projects as graphs with nodes and edges in the area of software engineering. A graphical visualization of the whole project is the best way to understand it as a whole and effectively comprehend the dependencies between the participating entities. Graph databases are always easy to understand and work with, when it comes to complex projects. Leveraging the concepts of graph database for software engineering education, plagiarism detection, component evaluation etc. can be accomplished. This paper shows how to make use of a graph database obtained from a Java project for plagiarism detection. Graph databases along with graph algorithms have great applications in the field of software testing, plagiarism detection, partial evaluation and many more.
Application of Java Relationship Graphs to Academics for Detection of Plagiarism in Java Projects
(Springer, 2022-01) Arora, Ritu; Sharma, Yashvardhan
In today’s online learning environment, plagiarism detection tools are increasingly used by teachers and instructors to restrain students from plagiarism. Moreover, the need for plagiarism detection tools to detect plagiarism in programming assignments has also increased manifold. In this paper, we present the application of Neo4j Graph Databases to detect similarity between Java program submissions by students in academics. This is done by converting a Java program into a specialized dependency graph and then implementing various comparison techniques on this graph. The two graph comparison techniques proposed and implemented in this paper are based on structural comparison of graphs by node-type count comparison and elemental comparison of method nodes in graphs by body-element-count comparison. The results of these two techniques are combined with the call graph-based technique, proposed in an earlier work, to calculate an overall similarity index between program codes. This study captures a large category of changes that may be introduced to the code for plagiarism.
Applying TF-IDF and BERT-based Variants under Multilabel Classification for Emotion Detection in Urdu Language
(CEUR-WS, 2022) Sharma, Yashvardhan
Nowadays, the use of emojis is very common to show our emotions with just a single image instead of long sentences describing our emotions. Each emoji describes a particular emotion, such as anger, disgust, fear, sadness, surprise, and happiness. Now if we are given a task to identify emotions in a text, that means we have to tag a text with multiple emojis, each pointing to a different emotion. This paper aims to check for multiple emotions in an Urdu text, which comes under the category of multi-label classification. We have used pre-trained BERT models to add basic knowledge about a language (Urdu in our case). Over the pre-trained model, we added the classification layer using PyTorch. The output layer has seven nodes, six of which are for six emotions, and the seventh is for neutral. FIRE 2022 provided the Urdu tweet dataset used here as part of the subtask “Multi-label emotion classification in Urdu” of the main task “Emothreat: Emotion and Threat detection in Urdu.”
Applying Transfer Learning using BERT-Based Models for Hate Speech Detection
(CEUR-WS, 2021) Sharma, Yashvardhan; Chauhan, Gajendra Singh
Hateful and Offensive speech is rising along with social media. This issue has motivated researchers to devise novel approaches which perform better than the traditional algorithms. This paper presents the methods adopted by the BITS Pilani team for Subtask 1A of the Hate Speech and Offensive Content Identification in English and Indo-Aryan Language task proposed by the Forum of Information Retrieval Evaluation in 2021. We have used data augmentation to make the models generalize better. We have experimented with different feature extraction techniques along with machine learning algorithms. But, fine-tuning the pre-trained BERT-based models using transfer learning gave us the best results for all the given languages on the test set. We got the highest Macro-F1 of 0.7993 for the English Language, 0.7612 for the Hindi Language, and 0.8306 for the Marathi Language using the pre-trained BERT-based models.
ArabiziVec: A Set of Arabizi Word Embeddings for Informal Arabic Sentiment Analysis
(Sentic, 2023) Sharma, Yashvardhan
The current circumstances of the Arab world have provided bloggers and commenters with various subjects to discuss. Therefore, Arabic-generated content in social media is ramping up continuously. An informal written form of spoken Arabic called Arabizi has recently emerged as a commonly used language in the Arabic space, attracting great interest for sentiment analysis tasks. However, only a few sentiment resources exist, and state-of-the-art language models such as BERT and FastText do not consider Arabizi yet. This paper presents the first version of ArabiziVec, a set of pre-trained distributed word representations. ArabiziVec provides six different word embedding models to deal with Arabizi sentiment analysis challenges. The presented work surpasses all of the baseline sets for each experiment, regardless of whether the test set is from a previously published dataset or an extracted one. To the best of our knowledge, this is one of the first few resources that deals with Arabizi content and semantics in the context of sentiment analysis.
ATSSI: Abstractive Text Summarization Using Sentiment Infusion
(Elsevier, 2016) Sharma, Yashvardhan
Text summarization is the condensing of text such that redundant data are removed and important information is extracted and represented in the shortest way possible. With the explosion of the abundant data present on social media, it has become important to analyze this text for seeking information and use it for the advantage of various applications and people. For the past few years, the task of automatic summarization has stirred interest among the Natural Language Processing and Text Mining communities, especially when it comes to opinion summarization. Opinions play a pivotal role in decision making in society. Others' opinions and suggestions are the base for an individual or a company while making decisions. In this paper, we propose a graph-based technique that generates summaries of redundant opinions and uses sentiment analysis to combine the statements. The summaries thus generated are abstraction-based summaries and are well formed to convey the gist of the text.
Automatic Subjective Answer Evaluation
(ICPRAM, 2023) Sharma, Yashvardhan
The evaluation of answer scripts is vital for assessing a student’s performance. The manual evaluation of the answers can sometimes be biased. The assessment depends on various factors, including the evaluator’s mental state, their relationship with the student, and their level of expertise in the subject matter. These factors make evaluating descriptive answers a very tedious and time-consuming task. Automatic scoring approaches can be utilized to simplify the evaluation process. This paper presents an automated answer script evaluation model that intends to reduce the need for human intervention, minimize bias brought on by evaluator psychological changes, save time, maintain track of evaluations, and simplify extraction. The proposed method can automatically weigh the assessing element and produce results nearly identical to an instructor’s. We compared the model’s grades to the grades of the teacher, as well as the results of several keyword matching and similarity check techniques, in order to evaluate the developed model.
BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers
(Association for Computational Linguistics, 2022) Sharma, Yashvardhan
Code-Mixed text data consists of sentences having words or phrases from more than one language. Most multi-lingual communities worldwide communicate using multiple languages, with English usually one of them. Hinglish is a Code-Mixed text composed of Hindi and English but written in Roman script. This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system. For the HinglishEval task, the proposed model uses multilingual BERT to find the similarity between synthetically generated and human-generated sentences to predict the quality of synthetically generated Hinglish sentences.
BITS Pilani at SemEval-2024 Task 10: Fine-tuning BERT and Llama 2 for Emotion Recognition in Conversation
(Association for Computational Linguistics, 2024) Sharma, Yashvardhan
Emotion Recognition in Conversation (ERC) aims to assign an emotion to a dialogue in a conversation between people. The first subtask of the EDiReF shared task aims to assign an emotion to a Hindi-English code-mixed conversation. For this, our team proposes a system to identify the emotion based on fine-tuning large language models on the MaSaC dataset. For our study, we have fine-tuned two LLMs, BERT and Llama 2, to perform sequence classification to identify the emotion of the text.
BITS Pilani at SemEval-2024 Task 9: Prompt Engineering with GPT-4 for Solving Brainteasers
(Association for Computational Linguistics, 2024) Sharma, Yashvardhan
Solving brainteasers is a task that requires complex reasoning prowess. The increase of research in natural language processing has led to the development of massive large language models with billions (or trillions) of parameters that are able to solve difficult questions due to their advanced reasoning capabilities. The SemEval BRAINTEASER shared task consists of sentence and word puzzles along with options containing the answer for the puzzle. Our team uses OpenAI’s GPT-4 model along with prompt engineering to solve these brainteasers.
BITS-P at WAT 2023: Improving Indic Language Multimodal Translation by Image Augmentation using Diffusion Models
(Association for Computational Linguistics, 2023) Sharma, Yashvardhan
This paper describes the proposed system for multimodal machine translation. We have participated in multimodal translation tasks for English into three Indic languages: Hindi, Bengali, and Malayalam. We leverage the inherent richness of multimodal data to bridge the gap of ambiguity in translation. We fine-tuned the ‘No Language Left Behind’ (NLLB) machine translation model for multimodal translation, further enhancing the model accuracy by image data augmentation using latent diffusion. Our submission achieves the best BLEU score for English-Hindi, English-Bengali, and English-Malayalam language pairs for both Evaluation and Challenge test sets.
Bits2020@ Dravidian-CodeMix-FIRE2020: Sub-Word Level Sentiment Analysis of Dravidian Code Mixed Data
(CEUR-WS, 2020) Sharma, Yashvardhan
This paper presents the methodologies implemented while classifying Dravidian code-mixed comments according to their polarity in the evaluation of the track ‘Sentiment Analysis for Dravidian Languages in Code-Mixed Text’ proposed by the Forum of Information Retrieval Evaluation in 2020. The implemented method used a sub-word level representation to capture the sentiment of the text. Using a Long Short Term Memory (LSTM) network along with language-specific preprocessing, the model classified the text according to its polarity. With F1-scores of 0.61 and 0.60, the model achieved an overall rank of 5 and 12 in the Tamil and Malayalam tasks respectively.
BITS_PILANI@DPIL-FIRE2016: Paraphrase Detection in Hindi Language using Syntactic Features of Phrase
(CFIR, 2016-12) Sharma, Yashvardhan
Paraphrasing means expressing or conveying the same meaning or essence of a sentence or text using different words or rearrangement of words. Paraphrase detection is a challenge, especially in Indian languages like Hindi, because it is very essential to understand the semantics of the language. Detecting paraphrases is very relevant in real life because it has a lot of importance in applications like Information Retrieval, Extraction and Text Summarization. This paper focuses on using Machine Learning classification techniques for detecting paraphrases in Hindi language for the DPIL Task in FIRE 2016. A feature vector based approach has been used for detecting paraphrases. The task involves checking whether a given pair of sentences conveys the same information and meaning even if they are written in different forms. Given a pair of sentences in Hindi, the proposed technique labels whether the pair of sentences are Paraphrases (P), Semi-Paraphrases (SP) or Not Paraphrases (NP).
BITS_PILANI@IMRiDis-FIRE 2017: Information Retrieval from Microblog during Disasters
(CEUR, 2017-12) Sharma, Yashvardhan
Microblogging sites like Twitter are increasingly being used for aiding relief operations during disaster events. In such situations, identifying actionable information like needs and availabilities of various types of resources is critical for effective coordination of post disaster relief operations. However, such critical information is usually submerged within a lot of conversational content, such as sympathy for the victims of the disaster. Hence, automated IR techniques are needed to find and process such information. In this paper, we utilize word vector embeddings along with fastText sentence classification algorithm to perform the task of classification of tweets posted during natural disasters.
Bits_Pilani@INLI-FIRE-2017: Indian Native Language Identification using Deep Learning
(CEUR, 2017) Sharma, Yashvardhan
The task of Native Language Identification involves identifying the prior or first learnt language of a user based on their writing technique and/or analysis of speech and phonetics in a second language. There is a surplus of such data present on social media sites and in organised datasets from bodies like Educational Testing Service (ETS), which can be exploited to develop language learning systems and forensic linguistics. In this paper we propose a deep neural network for this task using a hierarchical paragraph encoder with attention mechanism to identify relevant features over tendencies and errors a user makes with a second language for the INLI task in FIRE 2017. The task involves six Indian languages as the prior/native set and English as the second language, with data collected from users' social media accounts.
Building a data warehousing infrastructure based on service oriented architecture
(IEEE, 2012) Sharma, Yashvardhan
This paper analyses the possibility and advantages of providing on-demand data warehousing as an attempt to reach a new level of business intelligence. Providing data warehousing as a service is primarily aimed at medium and small business organizations, which have large volumes of data but are incapable of having a suitable data warehousing infrastructure. A suitable architecture based on service orientation is proposed to provide different components of data warehousing as a service, and the paper describes how this architecture provides support for integration and discovery of services. Features of web services and data warehousing are combined to implement the proposed architecture. Data warehousing can be supported by service orientation, which has the ability to join various services from different areas of the data warehouse to create composite applications. These composite applications can take the form of common services. Services are composed in such a way that they are generic to all data warehouses.
Catchphrase Extraction from Legal Documents Using LSTM Networks
(CEUR, 2017-12) Sharma, Yashvardhan
Legal texts usually have a complex structure and reading through them is a time-consuming and strenuous task. Hence it is essential to provide the legal practitioners a concise representation of the text. Catchphrases are those phrases which state the important issues present in the text, thus effectively characterizing it. This paper proposes an approach for the subtask 1 of the task IRLed (Information Retrieval from Legal Documents), FIRE 2017. The proposed algorithm uses a three step approach for extracting catchphrases from legal documents.