Department of Computer Science and Information Systems

Permanent URI for this collection: http://localhost:4000/handle/123456789/1928

Search Results

Now showing 1 - 10 of 12
  • Item
    On the Universality of Deep Contextual Language Models
    (2021-12) Goyal, Poonam
    Deep Contextual Language Models (LMs) like ELMo, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning. Furthermore, multilingual versions of such models like XLM-R and mBERT have given promising results in zero-shot cross-lingual transfer, potentially enabling NLP applications in many under-served and under-resourced languages. Due to this initial success, pre-trained models are being used as 'Universal Language Models' as the starting point across diverse tasks, domains, and languages. This work explores the notion of 'Universality' by identifying seven dimensions across which a universal model should be able to scale, that is, perform equally well or reasonably well, to be useful across diverse settings. We outline the current theoretical and empirical results that support model performance across these dimensions, along with extensions that may help address some of their current limitations. Through this survey, we lay the foundation for understanding the capabilities and limitations of massive contextual language models and help discern research gaps and directions for future work to make these LMs inclusive and fair to diverse applications, users, and linguistic phenomena.
  • Item
    FAID: Feature Aftermath for Irony Discernment
    (IEEE, 2019) Sharma, Yashvardhan
    This paper addresses the problem of identifying sarcasm in social media text, which can be used to improve sentiment analysis techniques. After thorough analysis, some features were identified that could help in the recognition of sarcasm. In the state of the art, features have been extracted from data sets comprising standalone sentences. The proposed algorithm analyzes the impact of these features, and of combinations of them, on a review data set in which each review has three or more sentences, so that the context of a sentence is also taken into consideration by the machine before it classifies a review.
  • Item
    Encoder-Decoder Architectures for Generating Questions
    (Elsevier, 2018) Sharma, Yashvardhan
    With textual data exploding on the internet in the form of e-books, legal documents and product information, there is an opportunity to harness it for applications that can aid human tasks. Systems for question generation can be used for compiling frequently asked questions, creating school quizzes and serving the purpose of unified AI. This study explores various encoder-decoder architectures for generating questions from text inputs, using Stanford’s SQuAD dataset for the training, development and test sets; evaluation metrics such as BLEU and ROUGE, along with training time, were used to compare the effectiveness of the models. The article builds upon the work of the current end-to-end system by using gated recurrent units in place of long short-term memory, which gives similar accuracy with less training time. It also shows the successful use of a convolution-based encoder for this task, which gives results comparable to the current state-of-the-art system with much less training time.
  • Item
    Bits_Pilani@INLI-FIRE-2017: Indian Native Language Identification using Deep Learning
    (CEUR, 2017) Sharma, Yashvardhan
    The task of Native Language Identification involves identifying the prior or first learnt language of a user based on their writing technique and/or analysis of speech and phonetics in a second language. There is a surplus of such data present on social media sites and in organised datasets from bodies like Educational Testing Service (ETS), which can be exploited to develop language learning systems and forensic linguistics. In this paper we propose a deep neural network for this task, using a hierarchical paragraph encoder with an attention mechanism to identify relevant features over the tendencies and errors a user makes in the second language, for the INLI task in FIRE 2017. The task involves six Indian languages as the prior/native set and English as the second language, with text collected from users' social media accounts.
  • Item
    Catchphrase Extraction from Legal Documents Using LSTM Networks
    (CEUR, 2017-12) Sharma, Yashvardhan
    Legal texts usually have a complex structure, and reading through them is a time-consuming and strenuous task. Hence it is essential to provide legal practitioners with a concise representation of the text. Catchphrases are those phrases which state the important issues present in the text, thus effectively characterizing it. This paper proposes an approach for subtask 1 of the task IRLed (Information Retrieval from Legal Documents), FIRE 2017. The proposed algorithm uses a three-step approach for extracting catchphrases from legal documents.
  • Item
    Named Entity Recognition for Code Mixing in Indian Languages using Hybrid Approach
    (CEUR, 2016-12) Sharma, Yashvardhan
    Automating the process of Named Entity Recognition in Social Media Text has received a lot of attention over the past few years. Named Entities are real-world objects such as Person, Organization, Product, Location. Identifying these entities in social media text is an important and challenging task due to the informal nature of the text present on social media. One such challenge faced in recognizing named entities in Indian Social Media Text is Code Mixing. Code Mixing is the usage of more than one language in a sentence. India being a multilingual country, its people tend to know more than one language, which in turn results in code mixing of text when they express their opinions. This paper describes the proposed approach for the shared task CMEE-IL (Code Mix Entity Extraction in Indian Language), FIRE 2016. The proposed algorithm uses a hybrid of a dictionary-based approach and a supervised classification approach for identifying entities in Code Mix Text of Indian Languages such as Hindi-English and Tamil-English.
  • Item
    Sentiment analysis for mixed script Indic sentences
    (IEEE, 2016) Sharma, Yashvardhan
    India is a multi-lingual and multi-script country. Developing natural language processing techniques for Indic languages is an active area of research. With the advent of social media, there has been an increasing trend of mixing different languages to convey thoughts in social media text. Users are more comfortable in their regional language and tend to express their thoughts by mixing words from multiple languages. In this paper, we have attempted to develop a system for mining sentiments from code-mixed sentences that combine English with four other Indian languages (Tamil, Telugu, Hindi and Bengali). Due to the complex nature of the problem, the technique used is divided into two stages, viz. Language Identification and Sentiment Mining. The evaluated results are compared to a baseline obtained from sentences machine-translated into English, and are found to be around 8% better in terms of precision. The proposed approach is flexible and robust enough to handle additional languages for identification as well as anomalous foreign or extraneous words.
  • Item
    Query Labelling for Indic Languages using a hybrid approach
    (CEUR, 2015) Sharma, Yashvardhan
    With the boom in the internet, social media text has been increasing day by day. Much of the user-generated content on the internet is written in a very informal way. Usually people tend to write text on social media using indigenous script. To understand a script different from ours is a difficult task. Moreover, nowadays a large number of the queries received by search engines consist of transliterated text. Hence providing a common platform to deal with the problem of transliterated text becomes really important. This paper presents our approach to handling the labeling of queries as part of the FIRE2015 shared task on Mixed-Script Information Retrieval. Tokens in the query are labeled on the basis of a hybrid approach which involves rule-based and machine learning techniques. Each annotation has been dealt with separately but sequentially.
  • Item
    TwiBiNG: A Bipartite News Generator Using Twitter
    (CEUR, 2014) Sharma, Yashvardhan
    Online Journalism is being seen as the future of Journalism. News Professionals are vying to capture newsworthy stories that emerge from the crowd. Live Social Media, especially Twitter, is generating enormous volumes of data every minute. It becomes difficult to select, among others, credible and relevant tweets that may form quality news. The problem intensifies due to the informal language that Twitter permits. Headlines generated by solving this problem may still not be relevant and may face the question of authenticity. Given a set of keywords and a time period, this problem becomes manageable and can be solved efficiently. We propose a bipartite algorithm that clusters authentic tweets based on key phrases and ranks the clusters based on trends in each timeslot.
  • Item
    An Interactive System leveraging Automatic Speech Recognition and Machine Translation for learning Hindi as a Second Language
    (IEEE, 2022) Rohil, Mukesh Kumar
    When English speakers are in the early stages of learning Hindi, they often attempt to formulate sentences in Hindi by a verbatim translation of English words to corresponding Hindi words. For this reason, they are unable to learn Hindi sentences correctly. We have tried to overcome this problem through the use of technology for second language learners. The use of Automatic Speech Recognition and Machine Translation for second language learning, here the learning of Hindi by English speakers, has been illustrated by taking English speech as input, translating the given English sentences and words into Hindi, and then displaying the equivalent construct in Devanagari script. The interactive system under study displays and speaks the same. It has been observed that a second language can be learnt faster by frequently listening to the vocabulary and sentences of the language. Thus the system furnishes the functionality of speaking the sentence in Hindi once it is represented in Devanagari script. The English sentences and words from the grammar tool books are given as input to the system for experimentation. We have observed that the critical problem encountered while doing so is the translation of English to Hindi. Another problem encountered at times is insertion error for letters (only surfaced). The system cannot correctly translate sentences in the continuous tense and the perfect continuous tense. The overall accuracy of the system, otherwise, is approximately 67%, which can help second language learners in the beginning.