Commit 0553aa4

Merge pull request #28 from PyDataBlog/experimental
Merging un-factored stable implementation.
2 parents 35a29c2 + 64e20f9 commit 0553aa4

13 files changed, +2086 -139 lines changed

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -9,4 +9,6 @@
 /benchmark/tune.json
 .benchmarkci/
 .idea/*
-.vscode/*
+.vscode/*
+test/experiments.jl
+/extras/.ipynb_checkpoints/*

README.md

Lines changed: 96 additions & 0 deletions
@@ -4,3 +4,99 @@
 [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/dev)
 [![Build Status](https://www.travis-ci.org/PyDataBlog/ParallelKMeans.jl.svg?branch=master)](https://www.travis-ci.org/PyDataBlog/ParallelKMeans.jl)
 [![Coverage Status](https://coveralls.io/repos/github/PyDataBlog/ParallelKMeans.jl/badge.svg?branch=master)](https://coveralls.io/github/PyDataBlog/ParallelKMeans.jl?branch=master)
+[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FPyDataBlog%2FParallelKMeans.jl.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2FPyDataBlog%2FParallelKMeans.jl?ref=badge_shield)
+_________________________________________________________________________________________________________
+**Authors:** [Bernard Brenyah](https://www.linkedin.com/in/bbrenyah/) & [Andrey Oskin](https://www.linkedin.com/in/andrej-oskin-b2b03959/)
+_________________________________________________________________________________________________________
+
+## Table Of Contents
+
+1. [Motivation](#Motivation)
+2. [Installation](#Installation)
+3. [Features](#Features)
+4. [Benchmarks](#Benchmarks)
+5. [Pending Features](#Pending-Features)
+6. [How To Use](#How-To-Use)
+7. [Release History](#Release-History)
+8. [How To Contribute](#How-To-Contribute)
+9. [Credits](#Credits)
+10. [License](#License)
+
+_________________________________________________________________________________________________________
+
+### Motivation
+A funny story actually led to the development of this package.
+What started off as a personal toy project trying to reconstruct the K-Means algorithm in native Julia blew up into a heated discussion on the Julia Discourse forums after I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey Oskin offered his help and together we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world.
+
+Say hello to our baby, `ParallelKMeans`!
+_________________________________________________________________________________________________________
+
+### Installation
+You can grab the latest stable version of this package by running the following in Julia.
+Don't forget to invoke Julia's package manager with `]` first.
+
+```julia
+pkg> add ParallelKMeans
+```
+
+For the few (and select) brave ones, the current experimental features can be had by adding the development version to your environment after invoking the package manager with `]`:
+
+```julia
+dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
+```
+
+Don't forget to check out the experimental branch, and you are good to go with bleeding-edge features and breakage!
+```bash
+git checkout experimental
+```
+_________________________________________________________________________________________________________
+
+### Features
+
+- Lightning-fast implementation of the Kmeans clustering algorithm, even on a single thread, in native Julia.
+- Support for a multi-threaded implementation of the Kmeans clustering algorithm.
+- Kmeans++ initialization for faster and better convergence.
+- Modified version of Elkan's triangle inequality to speed up the K-Means algorithm.
+
+_________________________________________________________________________________________________________
+
+### Benchmarks
+
+_________________________________________________________________________________________________________
+
+### Pending Features
+- [X] Implementation of Triangle inequality based on [Elkan C. (2003) "Using the Triangle Inequality to Accelerate
+K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf)
+- [ ] Support for DataFrame inputs.
+- [ ] Refactoring and finalization of API design.
+- [ ] GPU support.
+- [ ] Even faster Kmeans implementation based on current literature.
+- [ ] Optimization of code base.
+
+_________________________________________________________________________________________________________
+
+### How To Use
+
+```Julia
+
+```
+
+_________________________________________________________________________________________________________
+
+### Release History
+
+- 0.1.0 Initial release
+
+_________________________________________________________________________________________________________
+
+### How To Contribute
+
+_________________________________________________________________________________________________________
+
+### Credits
+
+_________________________________________________________________________________________________________
+
+### License
+
+[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FPyDataBlog%2FParallelKMeans.jl.svg?type=large)](https://app.fossa.com/projects/git%2Bgithub.com%2FPyDataBlog%2FParallelKMeans.jl?ref=badge_large)
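
The Features list in the README diff above names Kmeans++ initialization without showing it. As a rough illustration only, the seeding step might be sketched in plain Julia as follows. The function name `kmeanspp_seed`, its signature, and the one-observation-per-column layout are assumptions made for this sketch; it is not the code added by this commit.

```julia
using Random

# Hypothetical helper, not part of ParallelKMeans: a k-means++ seeding sketch.
# X is d×n with one observation per column; returns a d×k matrix of seeds.
function kmeanspp_seed(rng::AbstractRNG, X::AbstractMatrix{<:Real}, k::Integer)
    d, n = size(X)
    1 <= k <= n || throw(ArgumentError("k must be between 1 and the number of columns"))
    centroids = Matrix{Float64}(undef, d, k)
    centroids[:, 1] = X[:, rand(rng, 1:n)]   # first seed: uniform at random
    mindist = fill(Inf, n)                   # squared distance to the nearest seed so far
    for j in 2:k
        # Refresh nearest-seed distances using the seed chosen in the previous step.
        for i in 1:n
            s = 0.0
            for r in 1:d
                s += (X[r, i] - centroids[r, j - 1])^2
            end
            mindist[i] = min(mindist[i], s)
        end
        # Draw the next seed with probability proportional to mindist.
        target = rand(rng) * sum(mindist)
        acc = 0.0
        idx = n
        for i in 1:n
            acc += mindist[i]
            if acc >= target
                idx = i
                break
            end
        end
        centroids[:, j] = X[:, idx]
    end
    return centroids
end

rng = MersenneTwister(2020)
X = rand(rng, 3, 50)
C = kmeanspp_seed(rng, X, 4)   # 3×4 matrix whose columns are columns of X
```

Points far from every seed chosen so far get a larger sampling weight, which is what spreads the initial centroids out compared with uniform sampling.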

benchmark/bench01_distance.jl

Lines changed: 12 additions & 10 deletions
@@ -7,20 +7,22 @@ using Random
 suite = BenchmarkGroup()
 
 Random.seed!(2020)
-X = rand(100_000, 3)
-centroids = rand(2, 3)
-d = rand(100_000, 2)
-suite["100kx3"] = @benchmarkable ParallelKMeans.pairwise!($d, $X, $centroids)
+X = rand(3, 100_000)
+centroids = rand(3, 2)
+d = Vector{Float64}(undef, 100_000)
+suite["100kx3"] = @benchmarkable ParallelKMeans.colwise!($d, $X, $centroids)
 
-X = rand(100_000, 10)
-centroids = rand(2, 10)
-d = rand(100_000, 2)
-suite["100kx10"] = @benchmarkable ParallelKMeans.pairwise!($d, $X, $centroids)
+X = rand(10, 100_000)
+centroids = rand(10, 2)
+d = Vector{Float64}(undef, 100_000)
+suite["100kx10"] = @benchmarkable ParallelKMeans.colwise!($d, $X, $centroids)
 
 # for reference
 metric = SqEuclidean()
-suite["100kx10_distances"] = @benchmarkable Distances.pairwise!($d, $metric, $X, $centroids, dims = 1)
-
+#suite["100kx10_distances"] = @benchmarkable Distances.colwise!($d, $metric, $X, $centroids)
+dist = Distances.pairwise(metric, X, centroids, dims = 2)
+min = minimum(dist, dims=2)
+suite["100kx10_distances"] = @benchmarkable $d = min
 end # module
 
 BenchDistance.suite
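
The hunk above moves the benchmark data from one observation per row (`rand(100_000, 3)`) to one observation per column (`rand(3, 100_000)`) and shrinks `d` from a matrix to a vector with one entry per observation. Julia arrays are stored column-major, so with the new layout the features of a single observation sit next to each other in memory. What `ParallelKMeans.colwise!` actually computes is not visible in this diff; the sketch below assumes it is the squared Euclidean distance from each observation to its nearest centroid, which is what the reference lines (`pairwise` followed by `minimum`) suggest. `nearest_sqdist!` is a hypothetical stand-in, not the package function.

```julia
# Hypothetical stand-in, not ParallelKMeans.colwise!: for each column of X,
# store in d the squared Euclidean distance to the nearest centroid column.
function nearest_sqdist!(d::AbstractVector, X::AbstractMatrix, centroids::AbstractMatrix)
    size(X, 1) == size(centroids, 1) || throw(DimensionMismatch("feature counts differ"))
    length(d) == size(X, 2) || throw(DimensionMismatch("d needs one entry per column of X"))
    for i in 1:size(X, 2)
        best = Inf
        for j in 1:size(centroids, 2)
            s = 0.0
            for r in 1:size(X, 1)   # innermost loop walks down a column: contiguous memory
                s += (X[r, i] - centroids[r, j])^2
            end
            best = min(best, s)
        end
        d[i] = best
    end
    return d
end

# Tiny worked case: three 2-d observations, two centroids at (0, 0) and (3, 3).
X_demo = [0.0 1.0 4.0; 0.0 1.0 4.0]
centroids_demo = [0.0 3.0; 0.0 3.0]
d_demo = Vector{Float64}(undef, 3)
nearest_sqdist!(d_demo, X_demo, centroids_demo)
```

Writing into a preallocated `d` mirrors the benchmark's `Vector{Float64}(undef, 100_000)` and keeps allocation out of the timed region.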

benchmark/extras/README.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+# Skoffer comparison between Clustering, SingleThread mode of PKMeans and MultiThreadPKMeans
+
+```julia
+versioninfo()
+
+Julia Version 1.3.1
+Commit 2d5741174c (2019-12-30 21:36 UTC)
+Platform Info:
+  OS: Linux (x86_64-pc-linux-gnu)
+  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
+  WORD_SIZE: 64
+  LIBM: libopenlibm
+  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
+Environment:
+  JULIA_EDITOR = atom -a
+  JULIA_NUM_THREADS = 4
+```
+
+For `X = rand(60, 1_000_000); tol = 1e-6` output of `TimerOutputs`
+
+```
+                                         Time                  Allocations
+                                ──────────────────────   ───────────────────────
+         Tot / % measured:           1541s / 85.5%           19.5GiB / 99.4%
+
+ Section                ncalls     time   %tot     avg     alloc   %tot      avg
+ ───────────────────────────────────────────────────────────────────────────────
+ Clustering                  1     662s  50.2%    662s   18.6GiB  96.1%  18.6GiB
+   10 clusters               1    92.6s  7.03%   92.6s   2.35GiB  12.1%  2.35GiB
+   9 clusters                1    89.7s  6.81%   89.7s   2.34GiB  12.1%  2.34GiB
+   8 clusters                1    87.1s  6.62%   87.1s   2.33GiB  12.0%  2.33GiB
+   7 clusters                1    85.3s  6.48%   85.3s   2.32GiB  12.0%  2.32GiB
+   6 clusters                1    80.6s  6.12%   80.6s   2.32GiB  12.0%  2.32GiB
+   5 clusters                1    78.3s  5.95%   78.3s   2.31GiB  11.9%  2.31GiB
+   4 clusters                1    76.6s  5.82%   76.6s   2.30GiB  11.9%  2.30GiB
+   3 clusters                1    50.3s  3.82%   50.3s   1.58GiB  8.16%  1.58GiB
+   2 clusters                1    20.9s  1.59%   20.9s    732MiB  3.69%   732MiB
+ PKMeans Singlethread        2     491s  37.3%    245s    208MiB  1.05%   104MiB
+   9 clusters                1     131s  10.0%    131s   22.9MiB  0.12%  22.9MiB
+   10 clusters               1    89.5s  6.80%   89.5s   22.9MiB  0.12%  22.9MiB
+   7 clusters                1    77.3s  5.87%   77.3s   22.9MiB  0.12%  22.9MiB
+   8 clusters                1    59.4s  4.51%   59.4s   22.9MiB  0.12%  22.9MiB
+   6 clusters                1    44.1s  3.35%   44.1s   22.9MiB  0.12%  22.9MiB
+   5 clusters                1    35.1s  2.67%   35.1s   22.9MiB  0.12%  22.9MiB
+   4 clusters                1    32.9s  2.50%   32.9s   22.9MiB  0.12%  22.9MiB
+   3 clusters                1    14.6s  1.11%   14.6s   22.9MiB  0.12%  22.9MiB
+   2 clusters                2    6.52s  0.50%   3.26s   23.3MiB  0.12%  11.7MiB
+ PKMeans Multithread         1     165s  12.5%    165s    575MiB  2.90%   575MiB
+   9 clusters                1    37.2s  2.82%   37.2s   40.1MiB  0.20%  40.1MiB
+   8 clusters                1    33.1s  2.51%   33.1s   23.9MiB  0.12%  23.9MiB
+   10 clusters               1    25.8s  1.96%   25.8s   24.0MiB  0.12%  24.0MiB
+   6 clusters                1    20.9s  1.59%   20.9s   23.6MiB  0.12%  23.6MiB
+   7 clusters                1    16.4s  1.25%   16.4s   23.4MiB  0.12%  23.4MiB
+   5 clusters                1    13.1s  1.00%   13.1s   23.4MiB  0.12%  23.4MiB
+   4 clusters                1    9.90s  0.75%   9.90s   23.4MiB  0.12%  23.4MiB
+   3 clusters                1    4.97s  0.38%   4.97s    370MiB  1.87%   370MiB
+   2 clusters                1    3.26s  0.25%   3.26s   23.2MiB  0.12%  23.2MiB
+ ───────────────────────────────────────────────────────────────────────────────
+```
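
The three section totals in the table above can be turned into speedup ratios with a few lines of arithmetic. The snippet below only restates numbers from the table; the names `totals` and `speedup` are introduced here for illustration.

```julia
# Section totals in seconds, read from the TimerOutputs table above.
totals = (clustering = 662.0, pk_single = 491.0, pk_multi = 165.0)

# Ratio of baseline time to candidate time; values above 1 mean the candidate is faster.
speedup(baseline, candidate) = baseline / candidate

single_vs_clustering = speedup(totals.clustering, totals.pk_single)   # about 1.35
multi_vs_clustering  = speedup(totals.clustering, totals.pk_multi)    # about 4.01
multi_vs_single      = speedup(totals.pk_single, totals.pk_multi)     # about 2.98
```

Two caveats when reading these ratios: the run used `JULIA_NUM_THREADS = 4`, and the single-thread section reports two calls (its 2-cluster row was run twice), so its 491s total includes one extra short run.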

benchmark/extras/comparisons.jl

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+using Clustering
+using ParallelKMeans
+using Plots
+using BenchmarkTools
+using TimerOutputs
+using Random
+using ProgressMeter
+
+# Create a TimerOutput, this is the main type that keeps track of everything.
+const to = TimerOutput()
+
+Random.seed!(2020)
+X = rand(60, 1_000_000);
+# Timed assignments
+global a = Float64[]
+global b = Float64[]
+global c = Float64[]
+
+p = Progress(9, 10, "Computing clustering...")
+@timeit to "Clustering" begin
+    for i in 2:10
+        @timeit to "$i clusters" push!(a, Clustering.kmeans(X, i, tol=1e-6, maxiter=300).totalcost)
+        next!(p)
+    end
+end
+
+p = Progress(9, 10, "Computing singlethreaded ParallelKMeans...")
+@timeit to "PKMeans Singlethread" begin
+    for i in 2:10
+        @timeit to "$i clusters" push!(b, ParallelKMeans.kmeans(X, i, tol=1e-6, max_iters=300, verbose=false).totalcost)
+        next!(p)
+    end
+end
+
+p = Progress(9, 10, "Computing multithreaded ParallelKMeans...")
+@timeit to "PKMeans Multithread" begin
+    for i in 2:10
+        @timeit to "$i clusters" push!(c, ParallelKMeans.kmeans(X, i, ParallelKMeans.MultiThread(), tol=1e-6, max_iters=300, verbose=false).totalcost)
+        next!(p)
+    end
+end
+
+plot(a, label="Clustering.jl")
+plot!(b, label="Single-Thread ParallelKmeans")
+plot!(c, label="Multi-Thread ParallelKmeans")
+
+print(to)
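
The script above times each `k` once inside nested `@timeit` sections. For readers without `TimerOutputs` installed, the same loop shape can be approximated with Base's `@elapsed`. This is a sketch under stated assumptions: `time_over_k` and the `toy` workload are hypothetical names introduced here, and the warm-up call is an addition of this sketch rather than something the script does explicitly.

```julia
# Hypothetical helper: time f(X, k) for every k in ks, after one untimed warm-up call.
function time_over_k(f, X, ks)
    f(X, first(ks))                            # warm-up so JIT compilation is not timed
    return [(k, @elapsed(f(X, k))) for k in ks]
end

# Stand-in workload (an assumption for the sketch, not a clustering routine).
toy(X, k) = sum(abs2, X) / k

timings = time_over_k(toy, rand(4, 1_000), 2:10)   # vector of (k, seconds) pairs
```

Swapping `toy` for a closure around a real clustering call would give per-`k` wall-clock times comparable in spirit to the `"$i clusters"` rows of the TimerOutputs table.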

docs/src/index.md

Lines changed: 14 additions & 1 deletion
@@ -1,4 +1,17 @@
-# ParallelKMeans.jl
+# ParallelKMeans.jl Documentation
+
+```@contents
+```
+
+## Installation
+
+
+## Features
+
+
+## How To Use
+
+
 
 ```@index
 ```
