Department of Computer Science and Information Systems

Permanent URI for this collection: http://localhost:4000/handle/123456789/1928

Search Results

Now showing 1 - 10 of 12
  • Item
    Incremental MapReduce for K-Medoids Clustering of Big Time-Series Data
    (IEEE, 2018) Jangiti, Saikishor
    Data mining results need to be refreshed regularly, because earlier results become stale and obsolete as data changes and evolves. Clustering is an important data mining technique that groups similar data points together. To mine today's exponentially growing data, MapReduce, a parallel programming framework, can be combined with the k-medoids clustering algorithm to arrive at the optimum results quickly. Owing to the parallel processing architecture of Hadoop, the proposed iterative algorithm, which processes incremental data using an intermediate key file, performed better than conventional k-medoids.
  • Item
    Catur Approach to Assess the Quality of Big Data Using Decision Tree and Multidimensional Model
    (AENSI Publisher, 2015) K., Pradheep Kumar
    This paper designs and develops multidimensional and decision-tree-based frameworks for assessing the quality of big data. Since the datasets in a big data environment are both complex and multidimensional, the quality of big data is better viewed through multiple dimensions. Most enterprises face a number of challenges in managing the quality of big data during initial setup, during migration from a traditional database, or after the big data environment has been built. This paper uses the multidimensional model proposed for Knowledge Management Systems to design critical quality dimensions for big data. Based on an extensive literature review, this work proposes a classification of big data quality into quality factors such as accessibility, consistency, integrity, usability, relevance, completeness, compatibility, conformity and accuracy. Since very few appropriate data stewards or frameworks are available for confirming quality dimensions, this paper aims to develop hybrid approaches that use the multidimensional model and decision-tree-based methods for automatic quality checks. Using a decision tree, multiple if-then rules can be formed to decide on the quality of data based on the specific constraints developed for big data. The paper also aims to provide a quality framework and measures that can serve as a data quality firewall, much like an internet firewall, to proactively find quality issues and apply rules based on the decision tree algorithms so that bad, inconsistent or invalid data or access is prevented from entering the big data environment.
  • Item
    Fuzzy-Based Querying Approach for Multidimensional Big Data Quality Assessment
    (2017) K., Pradheep Kumar
    This paper designs a fuzzy-based approach to assess the standards and quality of big data. It also serves as a platform for organizations that intend to migrate their existing database environment to a big data environment. Data is assessed using a multidimensional approach based on quality factors such as accuracy, completeness, reliability and usability. These factors are analysed by constructing decision trees to identify the quality aspects that need to be improved. In this work, fuzzy queries have been designed. The queries are grouped into sets, namely Excellent, Optimal, Fair and Hybrid. A query set is chosen based on the fuzzy data sets formed and the query compatibility index. A data set that has a very high degree of membership is assigned a Fair query set. A data set with a medium degree of membership is assigned an Optimal query set. A data set that has a lesser degree of membership is assigned an Excellent query set. A data set that needs a combination of queries from all of the above is assigned a Hybrid query set. The fuzzy-query-based approach reduces the query compatibility index by 36% compared to a normal query set approach.
  • Item
    Unpredictable Password Generation using Graphical Authentication and Decentralized Encryption
    (IJSEAS, 2016) Ganesan, Akshaya
    Privacy concerns which data can be safely disclosed without leaking sensitive information. The objective of a knowledge-based secure system is to select stronger passwords for the users and to provide them with secret keys. In this paper, a multi-authority decentralized encryption scheme is proposed which provides secret keys without knowing the global identifier of the user. This scheme issues secret keys without any cooperation among the different authorities. Any authority is free to join or leave the system. Users can select passwords of higher strength using click-points. Persuasive technology is used for generating graphical passwords. Keywords: graphical passwords, privacy, decentralized encryption, secure system
  • Item
    CTI-Twitter: Gathering Cyber Threat Intelligence from Twitter using Integrated Supervised and Unsupervised Learning
    (IEEE, 2020) Agarwal, Vinti
    Cyber threat intelligence (CTI) can be gathered from multiple sources, and Twitter is one such open source platform where a large volume and variety of threat data is shared every day. The automated and timely mining of relevant threat knowledge from this data can be crucial for enriching existing threat intelligence platforms to proactively defend against cyber attacks. We propose CTI-Twitter: a novel framework combining supervised and unsupervised learning models to collect, process, analyze and generate threat-specific knowledge from tweets coming from multiple users. CTI-Twitter makes multi-fold contributions: i) collecting tweets through the Twitter API; ii) separating relevant threat tweets from irrelevant ones, and classifying the relevant ones into multiple classes of threats; iii) grouping the tweets belonging to each class using topic modeling; and iv) performing a data enrichment and verification process. We evaluate our proposed model on real-time tweets collected over about four months (in the year 2020) using the Twitter API. The encouraging results obtained indicate the effectiveness of CTI-Twitter in terms of timeliness and discovery of trending attack patterns and vulnerabilities.
  • Item
    Unwanted Traffic Identification in Large-Scale University Networks: A Case Study
    (Springer, 2016) Narang, Pratik
    To mitigate the malicious impact of P2P traffic on University networks, in this article the authors propose the design of payload-oblivious, privacy-preserving P2P traffic detectors. The proposed detectors do not rely on payload signatures and hence are resilient to P2P client and protocol changes, a phenomenon which is becoming increasingly frequent with newer, more popular P2P clients and protocols. The article also discusses newer designs to accurately distinguish P2P botnets from benign P2P applications. The datasets gathered from the testbed and other sources range from gigabytes to terabytes and contain both unstructured and structured data assimilated by running various applications within the University network. The approaches proposed in this article describe novel ways to handle the large amounts of data collected at unprecedented scale in the authors' University network.
  • Item
    Big Data Security Challenges and Preventive Solutions
    (Springer, 2019-10) Rohil, Mukesh Kumar
    Big data has opened the possibility of making great advancements in many scientific disciplines and has become a very interesting topic in academia and in industry. It has also contributed to innovation and to improvements in productivity and competitiveness. However, at present, there are various security risks involved in the collection, storage and use of big data. The leakage of private information caused by big data poses serious problems for users; in addition, incorrect or false big data may lead to wrong or invalid analysis results. The presented work analyzes the technical challenges of implementing big data security and privacy protection, and describes some key solutions to address the issues related to big data security and privacy.
  • Item
    Rapid Prototyping of Hierarchical Agglomerative Clustering Algorithms for Distributed Systems
    (IEEE, 2019) Goyal, Poonam; Goyal, Navneet
    Hierarchical Agglomerative Clustering (HAC) algorithms are used in many applications where clusters have a hierarchical relationship between them. Their parallelization is challenging because every agglomeration step depends on all previous agglomerations. Although a few parallel algorithms have been proposed for the SLINK HAC algorithm, only limited work has been done to parallelize other HAC algorithms. In this paper, we present a high-level abstraction, which provides a uniform way to specify any HAC algorithm, and a framework for its automatic parallelization on distributed memory systems. The abstraction is supported by constructs in a high-level, domain-specific language, and a compiler translates algorithms expressed in this language to efficient parallel code targeting distributed systems. Our experiments on multiple HAC algorithms show that the runtime performance achieved is comparable with state-of-the-art manual parallel implementations on Spark and MPI while requiring only a fraction of the programming effort. At runtime, master-slave execution is used, and load is balanced among the slaves in an algorithm-agnostic way, which is a significant contrast to the custom load-balancing techniques seen in the literature on parallel HAC algorithms.
  • Item
    Parallel SLINK for big data
    (Springer, 2019-06) Goyal, Navneet; Goyal, Poonam
    The major strength of hierarchical clustering algorithms is that they allow visual interpretation of clusters through dendrograms. Users can cut the dendrogram at different levels to get the desired number of clusters. A major problem with hierarchical algorithms is their quadratic runtime complexity, which limits the amount of data that can be clustered in a reasonable amount of time. Also, because of the agglomerative merging process, each iteration depends on the data of all previous iterations, making these algorithms difficult to parallelize. Thus, there is a need for an efficient parallel implementation of the SLINK algorithm which can scale to big data. We present a parallel SLINK algorithm, sGridSLINK, for shared memory architectures. sGridSLINK produces exactly the same dendrogram as the classical SLINK algorithm. We also present hGridSLINK, a parallel algorithm which fully exploits a multi-core cluster system. To the best of our knowledge, there is no hybrid parallel algorithm for SLINK available in the literature. The proposed algorithms exploit spatial locality of data to reduce the number of distance calculations. Adaptive gridding is used to counter skewness in data and to ensure load balancing. Extensive experiments are carried out to establish the efficiency and scalability of the proposed parallel algorithms. sGridSLINK is approximately 840 times faster than the state-of-the-art algorithm using 55 threads on a 48-core machine on a real dataset having 6 million data points. It also achieves a speedup of 47.93 over the best known sequential SLINK, GridSLINK, on a real dataset using 48 threads on a 48-core machine. hGridSLINK achieves a maximum speedup of 68.26 on a 32-node cluster (32×4 processing elements) with respect to GridSLINK. The hGridSLINK algorithm is able to cluster 200 million data points in only 1317 s (less than 22 min). No existing parallel SLINK algorithm is capable of such efficient clustering of Big Data.
  • Item
    A Domain Specific Language for Clustering
    (Springer, 2016-11) Goyal, Navneet; Goyal, Poonam
    Clustering of large volumes of data is a complex problem which requires the use of sophisticated algorithms as well as High Performance Computing hardware such as a cluster of computers. It is highly desirable that data mining experts have a solution which, on the one hand, provides a simple interface for expressing their algorithms in terms of domain-specific idioms and, on the other hand, automatically generates parallel code that can run on a cluster of multicore nodes. The proposed Domain Specific Language (DSL), along with its parallelizing compiler, attempts to provide such a solution. In this paper, we give the design of the DSL, called DWARF. Various language constructs are described along with the rationale behind their inclusion in the language. The abstraction provided by DWARF is qualitatively compared with MapReduce, Spark, and other MPI-based implementations to establish the usefulness of the proposed clustering DSL.
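
The Parallel SLINK entry above notes that single-linkage clustering is hard to parallelize because each merge depends on all earlier merges, and that users can cut the dendrogram at different levels. The following is a minimal sketch of that idea only: it is a naive single-linkage procedure, not SLINK, sGridSLINK or hGridSLINK, and the function names `single_linkage` and `cut_dendrogram` and the sample points are hypothetical, chosen for illustration.

```python
from itertools import combinations
from math import dist


def single_linkage(points):
    """Naive single-linkage agglomeration (illustrative, not SLINK).

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    Original points have ids 0..n-1; the i-th merge creates cluster id n+i.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        # Each step scans the clusters produced by ALL earlier merges,
        # which is the sequential dependency mentioned in the abstract.
        for a, b in combinations(sorted(clusters), 2):
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


def cut_dendrogram(n_points, merges, k):
    """Replay the first n_points - k merges to obtain k clusters."""
    clusters = {i: [i] for i in range(n_points)}
    next_id = n_points
    for a, b, _ in merges[: n_points - k]:
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical sample data: two tight pairs and one distant outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = single_linkage(points)
two_clusters = cut_dendrogram(len(points), merges, 2)
three_clusters = cut_dendrogram(len(points), merges, 3)
```

Because the merge history is stored once, cutting at a different level only replays a prefix of it; the expensive part is building the history, which is what the parallel algorithms in the listing target.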