Abstract:
Set-wise clustering is a clustering technique for data streams that groups sets of objects based on their distribution patterns, with applications such as retail chain clustering, text-based community clustering, and restaurant categorization. The existing set-wise clustering method cannot handle variable and high-speed streams with reasonable accuracy. This paper presents an anytime set-wise clustering method for data streams called ANYSETCLUS. The method handles the variable inter-arrival rates of stream objects using a proposed indexing structure, AnySetClusTree, which stores a hierarchy of micro-clusters of multi-set entities at varying granularity. ANYSETCLUS is highly adaptive: it supports incremental model updates, segregates outliers, enables outlier-to-concept transition, and captures concept drift. The method also enables anytime offline clustering, in which it can generate multiple clusterings of varying granularity and purity depending on the time allowance available for final clustering. Experimental results affirm that the proposed method handles variable and high-speed streams more effectively than the state-of-the-art method. It achieves significantly higher micro-cluster purity for both low-speed and high-speed streams, whereas the state-of-the-art method is unable to generate valid clustering results for high-speed streams. The experiments further validate the proposed method’s capability for anytime offline clustering.
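
To illustrate the general idea of anytime insertion into a hierarchy of micro-clusters, the following is a minimal sketch under stated assumptions. It is not the AnySetClusTree itself: the names `Entry`, `Node`, `anytime_insert`, and `budget` are hypothetical, objects are plain numeric vectors rather than members of multi-set entities, and the time allowance is modeled as a maximum number of tree levels instead of wall-clock time. An object descends toward the closest micro-cluster at each level; if the budget runs out before a leaf is reached, the object is parked in a buffer at the current level, so coarser summaries stay up to date even on fast streams.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical micro-cluster summary: object count and linear sum."""
    n: int
    ls: List[float]
    child: Optional["Node"] = None  # finer-grained micro-clusters, if any
    buffer: List[List[float]] = field(default_factory=list)  # parked objects

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def absorb(self, x: List[float]) -> None:
        # Incremental update of the summary statistics.
        self.n += 1
        self.ls = [s + v for s, v in zip(self.ls, x)]


@dataclass
class Node:
    entries: List[Entry]


def sq_dist(a: List[float], b: List[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def anytime_insert(root: Node, x: List[float], budget: int) -> int:
    """Insert x, visiting at least one and at most `budget` levels.

    Returns the number of levels visited. `budget` is an assumed
    stand-in for the time available before the next object arrives.
    """
    node, depth = root, 0
    while True:
        # Pick the micro-cluster whose centroid is closest to x.
        entry = min(node.entries, key=lambda e: sq_dist(e.centroid(), x))
        entry.absorb(x)
        depth += 1
        if entry.child is None:
            return depth  # reached the finest granularity
        if depth >= budget:
            entry.buffer.append(x)  # out of time: park x at this level
            return depth
        node = entry.child
```

With a generous budget the object reaches a leaf micro-cluster; with a tight budget it updates only the upper levels and waits in a buffer, which a later insertion with spare time could carry further down.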