Commit 64e20f9

finalised TODO requests
1 parent 5587d6b commit 64e20f9

3 files changed: 40 additions, 9 deletions
README.md

Lines changed: 18 additions & 3 deletions
@@ -25,23 +25,38 @@
 _________________________________________________________________________________________________________

 ### Motivation
+A funny story actually led to the development of this package.
+What started off as a personal toy project trying to reconstruct the K-Means algorithm in native Julia blew up into a heated discussion on the Julia Discourse forums after I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey Oskin offered his help and together we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world.

+Say hello to our baby, `ParallelKMeans`!
 _________________________________________________________________________________________________________

 ### Installation
+You can grab the latest stable version of this package by simply running the following in Julia.
+Don't forget to invoke Julia's package manager with `]`.

-```bash
+```julia
+pkg> add ParallelKMeans
+```
+
+For the select few brave ones, you can grab the current experimental features by adding the experimental branch to your development environment after invoking the package manager with `]`:

+```julia
+dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
 ```

+Don't forget to check out the experimental branch and you are good to go with bleeding-edge features and breakage!
+```bash
+git checkout experimental
+```
 _________________________________________________________________________________________________________

 ### Features

 - Lightning-fast implementation of the Kmeans clustering algorithm, even on a single thread, in native Julia.
 - Support for a multi-threaded implementation of the Kmeans clustering algorithm.
 - Kmeans++ initialization for faster and better convergence.
-- Feature 4
+- Modified version of Elkan's triangle inequality to speed up the K-Means algorithm.

 _________________________________________________________________________________________________________

@@ -51,7 +66,7 @@

 ### Pending Features
 - [X] Implementation of triangle inequality based on [Elkan C. (2003) "Using the Triangle Inequality to Accelerate
--Mean"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf)
+K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf)
 - [ ] Support for DataFrame inputs.
 - [ ] Refactoring and finalization of API design.
 - [ ] GPU support.
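
The README lists Kmeans++ initialization as a feature without showing what it does. The block below is a minimal, self-contained sketch of k-means++ seeding, not the package's own code. The function name `kmeanspp_init`, the `rng` keyword, and the convention that each column of `X` is one observation are assumptions made for illustration. The returned named tuple of centroids and indices mirrors what the `smart_init` docstring in this commit describes.

```julia
using Random

# Hypothetical helper (not part of ParallelKMeans): k-means++ seeding.
# Assumes each column of X is one observation.
function kmeanspp_init(X::Matrix{Float64}, k::Int; rng = Random.default_rng())
    nrow, ncol = size(X)
    1 <= k <= ncol || throw(ArgumentError("k must be between 1 and the number of columns"))

    centroids = Matrix{Float64}(undef, nrow, k)
    indices = Vector{Int}(undef, k)

    # First centroid: a point chosen uniformly at random.
    indices[1] = rand(rng, 1:ncol)
    centroids[:, 1] = X[:, indices[1]]

    # Squared distance from every point to its nearest chosen centroid.
    d2 = [sum(abs2, X[:, j] .- centroids[:, 1]) for j in 1:ncol]

    for c in 2:k
        # Sample the next point with probability proportional to d2.
        cs = cumsum(d2)
        r = rand(rng) * cs[end]
        idx = findfirst(>(r), cs)
        idx = idx === nothing ? ncol : idx  # guard against floating-point round-off

        indices[c] = idx
        centroids[:, c] = X[:, idx]

        # Refresh nearest-centroid distances with the new centroid.
        for j in 1:ncol
            d2[j] = min(d2[j], sum(abs2, X[:, j] .- centroids[:, c]))
        end
    end

    return (centroids = centroids, indices = indices)
end

# Usage: four 2-D points stored as columns, three seeds requested.
X = [0.0 1.0 10.0 11.0; 0.0 1.0 10.0 11.0]
res = kmeanspp_init(X, 3; rng = MersenneTwister(42))
```

Because points that are already centroids have zero weight, the seeds are spread out, which is the usual reason this scheme tends to converge in fewer iterations than uniform random selection.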

docs/src/index.md

Lines changed: 14 additions & 1 deletion
@@ -1,4 +1,17 @@
-# ParallelKMeans.jl
+# ParallelKMeans.jl Documentation
+
+```@contents
+```
+
+## Installation
+
+
+## Features
+
+
+## How To Use
+
+

 ```@index
 ```

src/ParallelKMeans.jl

Lines changed: 8 additions & 5 deletions
@@ -84,7 +84,6 @@ end
 MultiThread() = MultiThread(Threads.nthreads())  # Uses all available cores by default


-
 """
     colwise!(target, x, y, mode)

@@ -98,8 +97,6 @@ following modes supported:

 This dispatch handles the colwise calculation for single threads.
 """
-colwise!(target, x, y) = colwise!(target, x, y, SingleThread())
-
 function colwise!(target, x, y, mode::SingleThread)
     @inbounds for j in axes(x, 2)
         res = 0.0
@@ -111,6 +108,10 @@ function colwise!(target, x, y, mode::SingleThread)
 end


+# TODO: Why is this being dispatched here and not in a function?
+colwise!(target, x, y) = colwise!(target, x, y, SingleThread())
+
+
 """
     spliiter(n, k)

@@ -182,8 +183,7 @@ design matrix (X) and desired groups (k) that a user supplies.
 `k-means++` algorithm is used by default with the normal random selection
 of centroids from X used if any other string is attempted.

-A tuple representing the centroids, number of rows, & columns respectively
-is returned.
+A named tuple representing centroids and indices respectively is returned.
 """
 function smart_init(X::Array{Float64, 2}, k::Int, mode::T = SingleThread();
         init::String="k-means++") where {T <: CalculationMode}
@@ -366,7 +366,9 @@ kmeans(alg::Lloyd, design_matrix::Array{Float64, 2}, k::Int, mode::T = SingleThr
 """
 function kmeans(alg::LightElkan, design_matrix::Array{Float64, 2}, k::Int, mode::T = SingleThread();
                 k_init::String = "k-means++", max_iters::Int = 300, tol = 1e-6, verbose::Bool = true, init = nothing) where {T <: CalculationMode}
+    # Get the dimensions of the design_matrix
     nrow, ncol = size(design_matrix)
+
     centroids = init == nothing ? smart_init(design_matrix, k, mode, init=k_init).centroids : deepcopy(init)
     new_centroids, centroids_cnt = create_containers(k, nrow, mode)
     # new_centroids = similar(centroids)
@@ -437,6 +439,7 @@ end
 """
 function update_centroids!(centroids, new_centroids, centroids_cnt, labels,
                            design_matrix, mode::MultiThread)
+
     mode.n == 1 && return update_centroids!(centroids, new_centroids[1], centroids_cnt[1], labels,
         design_matrix, SingleThread())

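
The three-argument `colwise!` that this commit moves below the single-thread method is a forwarding method: it supplies `SingleThread()` as the default mode and lets dispatch pick the implementation. The sketch below shows that pattern in isolation. The loop body is an assumption, since the diff only shows `res = 0.0` and the column loop; here it is taken to be the squared Euclidean distance from each column of `x` to the vector `y`.

```julia
# Stand-ins for the package's mode types, reduced to what the sketch needs.
abstract type CalculationMode end
struct SingleThread <: CalculationMode end

# Assumed semantics: target[j] = squared Euclidean distance between x[:, j] and y.
function colwise!(target, x, y, mode::SingleThread)
    @inbounds for j in axes(x, 2)
        res = 0.0
        for i in axes(x, 1)
            res += (x[i, j] - y[i])^2
        end
        target[j] = res
    end
    return target
end

# Forwarding method: omitting the mode falls back to the single-thread version.
colwise!(target, x, y) = colwise!(target, x, y, SingleThread())

# Usage: the columns (0, 0) and (3, 4) measured against the origin.
x = [0.0 3.0; 0.0 4.0]
y = [0.0, 0.0]
target = zeros(2)
colwise!(target, x, y)
```

One way to address the TODO would be a default positional argument, `mode::SingleThread = SingleThread()`, on the four-argument definition. Julia expands a default argument into the same pair of methods, so the choice is mostly about where the default is documented.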
