|
1 | 1 | # MLDatasets.jl |
2 | 2 |
|
3 | | -_This package represents a community effort to provide a common |
4 | | -interface for accessing common Machine Learning (ML) datasets. In |
5 | | -contrast to other data-related Julia packages, the focus of |
6 | | -`MLDatasets.jl` is specifically on downloading, unpacking, and |
7 | | -accessing benchmark dataset. Functionality for the purpose of |
8 | | -data processing or visualization is only provided to a degree |
9 | | -that is special to some dataset._ |
10 | | - |
11 | | -| **Package Status** | **Build Status** | |
| 3 | +| **Documentation** | **Build Status** | |
12 | 4 | |:------------------:|:-----------------:| |
13 | | -| [](LICENSE.md) [](https://JuliaML.github.io/MLDatasets.jl/stable) | [](https://github.com/JuliaML/MLDatasets.jl/actions)| |
| 5 | +| [![Docs][docs-stable-img]][docs-stable-url] [![Docs][docs-latest-img]][docs-latest-url] | [](https://github.com/JuliaML/MLDatasets.jl/actions)| |
14 | 6 |
|
15 | | -This package is a part of the |
16 | | -[`JuliaML`](https://github.com/JuliaML) ecosystem. Its |
17 | | -functionality is build on top of the package |
18 | | -[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl). |
19 | | - |
20 | | -## Introduction |
| 7 | +[docs-stable-img]: https://img.shields.io/badge/docs-stable-blue.svg |
| 8 | +[docs-latest-img]: https://img.shields.io/badge/docs-latest-blue.svg |
| 9 | +[docs-stable-url]: https://JuliaML.github.io/MLDatasets.jl/stable |
| 10 | +[docs-latest-url]: https://JuliaML.github.io/MLDatasets.jl/latest |
21 | 11 |
|
22 | | -The way `MLDatasets.jl` is organized is that each dataset has its |
23 | | -own dedicated sub-module. Where possible, those sub-module share |
24 | | -a common interface for interacting with the datasets. For example |
25 | | -you can load the training set and the test set of the MNIST |
26 | | -database of handwritten digits using the following commands: |
| 12 | +This package represents a community effort to provide a common interface for accessing common Machine Learning (ML) datasets. |
| 13 | +In contrast to other data-related Julia packages, the focus of `MLDatasets.jl` is specifically on downloading, unpacking, and accessing benchmark datasets. |
| 14 | +Functionality for the purpose of data processing or visualization is only provided to a degree that is special to some dataset. |
27 | 15 |
|
28 | | -```julia |
29 | | -using MLDatasets |
30 | | - |
31 | | -train_x, train_y = MNIST.traindata() |
32 | | -test_x, test_y = MNIST.testdata() |
33 | | -``` |
| 16 | +This package is a part of the |
| 17 | +[`JuliaML`](https://github.com/JuliaML) ecosystem. |
| 18 | +Its functionality is built on top of the package |
| 19 | +[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl). |
34 | 20 |
|
35 | | -To load the data the package looks for the necessary files in |
36 | | -various locations (see |
37 | | -[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl#configuration) |
38 | | -for more information on how to configure such defaults). If the |
39 | | -data can't be found in any of those locations, then the package |
40 | | -will trigger a download dialog to `~/.julia/datadeps/MNIST`. To |
41 | | -overwrite this on a case by case basis, it is possible to specify |
42 | | -a data directory directly in `traindata(dir = <directory>)` and |
43 | | -`testdata(dir = <directory>)`. |
44 | 21 |
|
45 | 22 | ## Available Datasets |
46 | 23 |
|
47 | | -Check out the **[latest |
48 | | -documentation](https://juliaml.github.io/MLDatasets.jl/latest)** |
49 | | - |
50 | | -Additionally, you can make use of Julia's native docsystem. |
51 | | -The following example shows how to get additional information |
52 | | -on `MNIST.traintensor` within Julia's REPL: |
53 | | - |
54 | | -```julia |
55 | | -?MNIST.traintensor |
56 | | -``` |
57 | | - |
58 | | -Each dataset has its own dedicated sub-module. As such, it makes |
59 | | -sense to document their functionality similarly distributed. Find |
60 | | -below a list of available datasets and links to their their |
61 | | -documentation. |
62 | | - |
63 | | -### Image Classification |
64 | | - |
65 | | -This package provides a variety of common benchmark datasets for |
66 | | -the purpose of image classification. |
67 | | - |
68 | | -Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels` |
69 | | -:------:|:-------:|:-------------:|:-------------:|:------------:|:------------: |
70 | | -[**MNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
71 | | -[**FashionMNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000 |
72 | | -[**CIFAR-10**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000 |
73 | | -[**CIFAR-100**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2) |
74 | | -[**SVHN-2**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/) (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032 |
75 | | - |
76 | | -(*) Note that the SVHN-2 dataset provides an additional 531131 observations aside from the training- and testset |
77 | | - |
78 | | -[**EMNIST**](https://www.nist.gov/itl/products-and-services/emnist-dataset) packages 6 different extensions of the MNIST dataset involving letters and digits and variety of test train split options. Each extension has the standard test/train data/labels nested under it as shown below. |
79 | | - |
80 | | -```julia |
81 | | -traindata = EMNIST.Balanced.traindata() |
82 | | -testdata = EMNIST.Balanced.testdata() |
83 | | -trainlabels = EMNIST.Balanced.trainlabels() |
84 | | -testlabels = EMNIST.Balanced.testlabels() |
85 | | -``` |
86 | | - |
87 | | -Dataset | Classes | `traindata` | `trainlabels` | `testdata` | `testlabels` | `balanced classes` |
88 | | -:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:|:------------: |
89 | | -**ByClass** | 62 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no |
90 | | -**ByMerge** | 47 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no |
91 | | -**Balanced** | 47 | 112800x28x28 | 112800x1 | 18800x28x28 | 18800x1 | yes |
92 | | -**Letters** | 26 | 124800x28x28 | 124800x1 | 20800x28x28 | 208000x1 | yes |
93 | | -**Digits** | 10 | 240000x28x28 | 240000x1 | 40000x28x28 | 40000x1 | yes |
94 | | -**MNIST** | 10 | 60000x28x28 | 60000x1 | 10000x28x28 | 10000x1 | yes |
| 24 | +Each dataset has its own dedicated sub-module. |
| 25 | +Find below a list of available datasets and links to their documentation. |
95 | 26 |
|
96 | | -### Misc. Datasets |
| 27 | +#### Vision |
| 28 | + - [CIFAR10](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/) |
| 29 | + - [CIFAR100](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/) |
| 30 | + - [EMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/EMNIST/) |
| 31 | + - [FashionMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/) |
| 32 | + - [MNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/) |
| 33 | + - [SVHN2](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/) |
97 | 34 |
|
98 | | -Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels` |
99 | | -:------:|:-------:|:-------------:|:-------------:|:------------:|:------------: |
100 | | -**Iris** | 3 | 4x150 | 150 | - | - |
101 | | -**BostongHousing** | - | 13x506 | 1x506 | - | - |
102 | 35 |
|
103 | | -### Language Modeling |
| 36 | +#### Miscellaneous |
| 37 | + - [BostonHousing](https://juliaml.github.io/MLDatasets.jl/latest/datasets/BostonHousing/) |
| 38 | + - [Iris](https://juliaml.github.io/MLDatasets.jl/latest/datasets/Iris/) |
104 | 39 |
|
105 | | -| | Train x | Train y | Test x | Test y | |
106 | | -|:--:|:-------:|:-------:|:------:|:------:| |
107 | | -| **PTBLM** | 42068 | 42068 | 3761 | 3761 | |
108 | | -| **UD_English** | 12543 | - | 2077 | - | |
109 | 40 |
|
110 | | -#### PTBLM |
| 41 | +#### Text |
| 42 | + - [PTBLM](https://juliaml.github.io/MLDatasets.jl/latest/datasets/PTBLM/) |
| 43 | + - [UD_English](https://juliaml.github.io/MLDatasets.jl/latest/datasets/UD_English/) |
111 | 44 |
|
112 | | -The `PTBLM` dataset consists of Penn Treebank sentences for |
113 | | -language modeling, available from |
114 | | -[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown |
115 | | -words are replaced with `<unk>` so that the total vocabulary size |
116 | | -becomes 10000. |
| 45 | +#### Graphs |
| 46 | + - To be added. |
117 | 47 |
|
118 | | -This is the first sentence of the PTBLM dataset. |
| 48 | +#### Audio |
| 49 | + - To be added. |
119 | 50 |
|
120 | | -```julia |
121 | | -x, y = PTBLM.traindata() |
122 | | - |
123 | | -x[1] |
124 | | -> ["no", "it", "was", "n't", "black", "monday"] |
125 | | -y[1] |
126 | | -> ["it", "was", "n't", "black", "monday", "<eos>"] |
127 | | -``` |
128 | | - |
129 | | -where `MLDataset` adds the special word: `<eos>` to the end of `y`. |
130 | | - |
131 | | -### Text Analysis (POS-Tagging, Parsing) |
132 | | - |
133 | | -#### UD English |
134 | | - |
135 | | -The [UD_English](https://github.com/UniversalDependencies/UD_English-EWT) |
136 | | -Universal Dependencies English Web Treebank dataset is an annotated corpus of morphological features, |
137 | | -POS-tags and syntactic trees. The dataset follows CoNLL-style |
138 | | -format. |
139 | | - |
140 | | -```julia |
141 | | -traindata = UD_English.traindata() |
142 | | -devdata = UD_English.devdata() |
143 | | -testdata = UD_English.devdata() |
144 | | -``` |
145 | | - |
146 | | -## Documentation |
147 | | - |
148 | | -Check out the **[latest |
149 | | -documentation](https://JuliaML.github.io/MLDatasets.jl/stable)** |
150 | | - |
151 | | -Additionally, you can make use of Julia's native docsystem. |
152 | | -The following example shows how to get additional information |
153 | | -on `MNIST.convert2image` within Julia's REPL: |
154 | | - |
155 | | -```julia |
156 | | -?MNIST.convert2image |
157 | | -``` |
158 | | -``` |
159 | | - convert2image(array) -> Array{Gray} |
160 | | -
|
161 | | - Convert the given MNIST horizontal-major tensor (or feature matrix) to a vertical-major Colorant array. The values are also color corrected according to |
162 | | - the website's description, which means that the digits are black on a white background. |
163 | | -
|
164 | | - julia> MNIST.convert2image(MNIST.traintensor()) # full training dataset |
165 | | - 28×28×60000 Array{Gray{N0f8},3}: |
166 | | - [...] |
167 | | -
|
168 | | - julia> MNIST.convert2image(MNIST.traintensor(1)) # first training image |
169 | | - 28×28 Array{Gray{N0f8},2}: |
170 | | - [...] |
171 | | -``` |
172 | 51 |
|
173 | 52 | ## Installation |
174 | 53 |
|
175 | | -To install `MLDatasets.jl`, start up Julia and type the following |
176 | | -code snippet into the REPL. It makes use of the native Julia |
| 54 | +To install `MLDatasets.jl`, start up Julia and type the following code snippet into the REPL. |
| 55 | +It makes use of the native Julia |
177 | 56 | package manager. |
178 | 57 |
|
179 | 58 | ```julia |
|