Department of Computer Science and Information Systems
Permanent URI for this collection: http://localhost:4000/handle/123456789/1928
32 results
Search Results
Item: An Efficient Density Based Incremental Clustering Algorithm in Data Warehousing Environment (IPCSIT, 2009)
Goyal, Navneet; Goyal, Poonam
Data warehouses are a good source of data for downstream data mining applications. New data arrives in data warehouses during the periodic refresh cycles. Appending new data to existing data requires that all patterns discovered earlier using various data mining algorithms be updated with each refresh. In this paper, we present an incremental density based clustering algorithm. Incremental DBSCAN is an existing incremental algorithm in which data can be added to or deleted from existing clusters, one point at a time. Our algorithm is capable of adding points in bulk to an existing set of clusters. In this new algorithm, the data points to be added are first clustered using the DBSCAN algorithm, and then these new clusters are merged with the existing clusters to produce the modified set of clusters. That is, we add clusters incrementally rather than adding points incrementally. It is found that the proposed incremental clustering algorithm produces the same clusters as those obtained by Incremental DBSCAN. We have used R*-trees as the data structure to hold the multidimensional data that we need to cluster. One of the major advantages of the proposed approach is that it allows us to see the clustering patterns of the new data along with the existing clustering patterns. Moreover, we can see the merged clusters as well. The proposed algorithm is capable of considerable savings, in terms of region queries performed, as compared to Incremental DBSCAN. Results are presented to support the claim.

Item: Designing self-adaptive websites using online hotlink assignment algorithm (ACM Digital Library, 2009-12)
Goyal, Navneet; Goyal, Poonam
An online hotlink assignment algorithm is proposed for designing adaptive websites. The objective is to reach desired pages on a website in the minimum number of clicks, thereby reducing the load on the web server.
As a consequence, the traffic on the internet is also reduced. The hotlinks are assigned based on the frequency of access of pages. We model a website as a single source directed graph. The optimal hotlink assignment problem is NP-hard for general graphs. The website graph is reduced to a Breadth First Search (BFS) tree which maintains the semantic relationships between web pages. The proposed online algorithm can place at most k hotlinks per page with a maximum of l hotlinks on the entire website, where k ≪ l. The input stream is simulated using the Zipf distribution. The results presented in the paper compare the performance of the online algorithm with the optimal offline algorithm.

Item: Concept based query recommendation (ACM Digital Library, 2011)
Goyal, Poonam
For a search engine, the challenge of finding relevant information on the web is becoming more and more difficult with the rapid growth and change in the content of the web. This difficulty further increases because queries submitted by users are general, imprecise, short and ambiguous. The relevance between a user's information need and the documents returned by a search engine is largely dependent on the query the user submits. In this paper, we propose a method to provide users with query recommendations, which are the concepts related to their information needs. In this work, we have extracted concepts from the web snippets and we have proposed two weight functions to measure the relevance between query and concepts. Related concepts with different meanings are selected and recommended as query suggestions. To evaluate our method, we have used a Google middleware for the extraction of concepts. We have estimated the relevance between the query and concepts using the proposed weight functions and compared it with the support of the concepts as well as with the TFIDF approach, using the standard information-retrieval metrics of precision and Mean Average Precision (MAP).
We show that our approach yields gains in average precision over the other existing approach for different types of queries.

Item: A robust approach for finding conceptually related queries using feature selection and tripartite graph structure (Sage, 2013-03)
Goyal, Poonam
The information explosion on the Internet has placed high demands on search engines. Despite the improvements in search engine technology, the precision of current search engines is still unsatisfactory. Moreover, the queries submitted by users are short, ambiguous and imprecise. This leads to a number of problems in dealing with similar queries. The problems include lack of common keywords, selection of different documents by the search engine, and lack of common clicks. These problems render the traditional query clustering methods unsuitable for query recommendations. In this paper, we propose a new query recommendation system. For this, we have identified conceptually related queries by capturing users’ preferences using click-through graphs of web search logs and by extracting the best features, relevant to the queries, from the snippets. The proposed system has an online feature extraction phase and an offline phase in which feature filtering and query clustering are performed. Query clustering is carried out by a new tripartite agglomerative clustering algorithm, Query-Document-Concept Clustering, in which the documents are used innovatively to decouple queries and features/concepts in a tripartite graph structure. This results in clusters of similar queries, associated clusters of documents and clusters of features. We model the query recommendation problem in four different ways. Two models are non-personalized and personalized content-ignorant models. The other two are non-personalized and personalized content-aware models. Three similarity measures are introduced to estimate different kinds of similarities.
Experimental results show that the proposed approach has better precision, recall and F-measure than the existing approaches.

Item: An approach for search result topic identification and labeling (ACM Digital Library, 2015-03)
Goyal, Poonam
Organizing search results is one of the challenging tasks for search engines because of the varied and dynamic intentions behind queries. As a consequence, search engines are not able to understand the exact user context, and thus retrieve large volumes of results, most of which are irrelevant to the user. Search Result Clustering (SRC) is a technique which groups the search results and presents users with the various intentions of the query. In this work, we propose an approach that first identifies the associated topics and represents them in the form of concepts, then forms groups of documents by assigning each document to the appropriate topic, and finally provides suitable labels for these topics. Experimental results show that the proposed method is able to produce encouraging results as compared to the most popular non-commercial methods, Lingo and STC, on standard datasets such as the ODP and Ambient datasets.

Item: A concurrent k-NN search algorithm for R-tree (ACM Digital Library, 2015-10)
Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
k-nearest neighbor (k-NN) search is one of the commonly used queries in database systems. It has applications in various domains such as data mining, decision support systems, information retrieval, multimedia and spatial databases. When k-NN search is performed over large data sets, spatial data indexing structures such as R-trees are commonly used to improve query efficiency. The best-first k-NN (BF-kNN) algorithm is the fastest known k-NN algorithm over R-trees. We present CBF-kNN, a concurrent BF-kNN for R-trees, which is the first concurrent version of k-NN we know of for R-trees. CBF-kNN uses one of the most efficient concurrent priority queues, known as the mound.
CBF-kNN overcomes the concurrency limitations of priority queues by using a tree-parallel mode of execution. CBF-kNN has an estimated speedup of O(p/k) for p threads. Experimental results on various real datasets show that the speedup in practice is close to this estimate.

Item: Parallel Framework for Efficient k-means Clustering (ACM Digital Library, 2015-10)
Goyal, Navneet; Goyal, Poonam
Handling and processing larger volumes of data requires efficient data mining algorithms. k-means is a very popular clustering algorithm for data mining, but its performance suffers because of the initial seeding problem. The computation time of the k-means algorithm is directly proportional to the number of data points, the number of dimensions, and the number of iterations; therefore, it is very expensive to process large numbers of data points sequentially. We propose an efficient parallel framework which includes dimensionality-reduction as well as data-size reduction techniques to improve k-means processing time and address the initial seeding problem. Our proposed parallel framework leverages the multi-node and multi-core architectures of a typical commodity cluster. We have validated our proposed approaches with real and synthetic datasets in a parallel environment setup. The experimental results clearly show significant improvements in k-means performance.

Item: Exploiting Visual and Textual Neighborhood Information to Improve Image-Tag Relevance (IEEE, 2017)
Goyal, Poonam
Many applications, such as image searching, image indexing, and image label recommendations, have started using tagged images to benefit from user input. However, tags tend to be imprecise, incomplete, and ambiguous. Moreover, tags are also biased towards the user's perspective, which degrades the performance of tag-based systems. Most of the existing methods use visual neighborhoods and/or tags to estimate image-tag relevance. We improve image-tag relevance by combining the visual neighborhood of images and the textual neighborhood of tags.
By doing this, we boost the ranking of informative tags of an image. Most of the image-tag relevance measures work well when large supporting data is available, but such data is typically not sufficient in real datasets. This problem of Void of Information (VoI) is addressed by exploiting tags of visual neighbors of the images. We also exploit external resources like Wikipedia and WordNet to strengthen the tags. The proposed approach, TVNTag (Textual Visual Neighborhood based Tag), exhibits up to 46.1% relative improvement in tag ranking and 79.5% in image ranking, with respect to the current state-of-the-art methods. The experiments are conducted for different tasks and evaluation scenarios on benchmarked social data, such as MIRFlickr, NUS-WIDE, and train10k.

Item: Rapid Prototyping of Hierarchical Agglomerative Clustering Algorithms for Distributed Systems (IEEE, 2019)
Goyal, Poonam; Goyal, Navneet
Hierarchical Agglomerative Clustering (HAC) algorithms are used in many applications where clusters have a hierarchical relationship between them. Their parallelization is challenging due to the dependence of every agglomeration step on all previous agglomerations. Although a few parallel algorithms have been proposed for the SLINK HAC algorithm, only limited work has been done to parallelize other HAC algorithms. In this paper, we present a high-level abstraction, which provides a uniform way to specify any HAC algorithm, and a framework for automatic parallelization of the same for distributed memory systems. The abstraction is supported by constructs in a high-level, domain-specific language, and a compiler translates algorithms expressed in this language to efficient parallel code targeting distributed systems. Our experiments on multiple HAC algorithms show that the runtime performance achieved is comparable with state-of-the-art manual parallel implementations on Spark and MPI while requiring only a fraction of the programming effort.
At runtime, master-slave execution is used, and load is balanced among the slaves in an algorithm-agnostic way, which is a significant contrast to custom load-balancing techniques seen in the literature on parallel HAC algorithms.

Item: Incremental models for query clustering and query-context aware document clustering (Inder Science, 2015)
Goyal, Navneet; Goyal, Poonam
The traditional query clustering algorithms are designed to work on previously collected data from the query stream. These algorithms become less and less effective with time because users' interests, query meanings and the popularity of topics change over time. So, there is a need for incremental algorithms which can accommodate the concept drift that surfaces as new data is added to the collection, without performing a complete re-clustering. We propose an incremental model for query clustering and query-context aware document clustering. The model periodically incorporates new information efficiently and can be applied in a distributed environment. The proposed incremental model retains the quality of both query and document clusters. The proposed model can be applied to the results of hierarchical query clustering algorithms that produce query and document clusters. The model is tested on three hierarchical clustering algorithms on different datasets, including the TREC Session Track 2011 dataset. We have also experimented with a variant of the proposed incremental model to compare performance. The proposed model and its variant not only achieve accuracy very close to that of static models in all the experiments, but also offer a significant speedup.
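
The general idea behind the last abstract, folding newly arrived data into existing clusters without a complete re-clustering, can be illustrated with a minimal sketch. This is not the authors' model. It assumes points are plain numeric tuples and uses a simple nearest-centroid rule with a distance cutoff; the names `update_clusters` and `threshold` are hypothetical and chosen only for illustration.

```python
import math


def update_clusters(centroids, counts, new_points, threshold):
    """Fold new_points into existing clusters without revisiting old data.

    centroids: list of coordinate tuples, one per existing cluster.
    counts:    list of ints, the number of points already in each cluster.
    Both lists are modified in place. Returns the cluster index assigned
    to each new point. (Illustrative rule only, not the paper's model.)
    """
    labels = []
    for point in new_points:
        # Find the nearest existing centroid.
        best, best_dist = None, math.inf
        for i, centroid in enumerate(centroids):
            d = math.dist(point, centroid)
            if d < best_dist:
                best, best_dist = i, d

        if best is not None and best_dist <= threshold:
            # Close enough: update the running mean of that cluster.
            n = counts[best]
            centroids[best] = tuple(
                (c * n + p) / (n + 1) for c, p in zip(centroids[best], point)
            )
            counts[best] = n + 1
        else:
            # Too far from every cluster: start a new one.
            centroids.append(tuple(point))
            counts.append(1)
            best = len(centroids) - 1
        labels.append(best)
    return labels
```

Only the per-cluster centroid and point count are kept, so each batch of new points is processed without touching previously clustered data; that is the property the sketch is meant to show, at the cost of a much cruder assignment rule than a hierarchical method would use.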