Sparse computation for large-scale data mining

Hochbaum, Dorit S.; Baumann, Philipp (2016). Sparse computation for large-scale data mining. IEEE Transactions on Big Data, 2(2), pp. 151-174. Institute of Electrical and Electronics Engineers 10.1109/TBDATA.2016.2576470

[img] Text
HocBau14 (2).pdf - Published Version
Restricted to registered users only
Available under License Publisher holds Copyright.

Download (185kB) | Request a copy

Leading machine learning techniques rely on inputs in the form of pairwise similarities between objects in the data set. The number of pairwise similarities grows quadratically in the size of the data set which poses a challenge in terms of scalability. One way to achieve practical efficiency for similarity-based techniques is to sparsify the similarity matrix. However, existing sparsification approaches consider the complete similarity matrix and remove some of the non-zero entries. This requires quadratic time and storage and is thus intractable for large-scale data sets. We introduce here a method called sparse computation that generates a sparse similarity matrix which contains only relevant similarities without computing first all pairwise similarities. The relevant similarities are identified by projecting the data onto a low-dimensional space in which groups of objects that share the same grid neighborhood are deemed of potential high similarity whereas pairs of objects that do not share a neighborhood are considered to be dissimilar and thus their similarities are not computed. The projection is performed efficiently even for massively large data sets. We apply sparse computation for the $K$ -nearest neighbors algorithm (KNN), for graph-based machine learning techniques of supervised normalized cut and K-supervised normalized cut (SNC and KSNC) and for support vector machines with radial basis function kernels (SVM), on real-world classification problems. Our empirical results show that the approach achieves a significant reduction in the density of the similarity matrix, resulting in a substantial reduction in tuning and testing times, while having a minimal effect (and often none) on accuracy. The low-dimensional projection is of further use in massively large data sets where the grid structure allows to easily identify groups of “almost identical” objects. Such groups of objects are then replaced by representatives, thus reducing the size of the matrix. This approach is effective, as illustrated here for data sets comprising up to 8.5 million objects.

Item Type:

Journal Article (Original Article)

Division/Institute:

03 Faculty of Business, Economics and Social Sciences > Department of Business Management > Institute of Financial Management > Professorship for Quantitative Methods in Business Administration

UniBE Contributor:

Baumann, Philipp

Subjects:

600 Technology > 650 Management & public relations

ISSN:

2332-7790

Publisher:

Institute of Electrical and Electronics Engineers

Language:

English

Submitter:

Juliana Kathrin Moser-Zurbrügg

Date Deposited:

26 Sep 2016 14:49

Last Modified:

05 Dec 2022 14:58

Publisher DOI:

10.1109/TBDATA.2016.2576470

BORIS DOI:

10.7892/boris.88124

URI:

https://boris.unibe.ch/id/eprint/88124

Actions (login required)

Edit item Edit item
Provide Feedback