JuliaML
diff --git a/‎.github/workflows/Documenter.yml‎
Lines changed: 2 additions & 3 deletions b/‎.github/workflows/Documenter.yml‎
Lines changed: 2 additions & 3 deletions
diff --git a/‎README.md‎
Lines changed: 34 additions & 155 deletions b/‎README.md‎
Lines changed: 34 additions & 155 deletions
diff --git a/‎docs/Project.toml‎
Lines changed: 1 addition & 1 deletion b/‎docs/Project.toml‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎docs/make.jl‎
Lines changed: 27 additions & 9 deletions b/‎docs/make.jl‎
Lines changed: 27 additions & 9 deletions
diff --git a/‎docs/src/datasets/BostonHousing.md‎
Lines changed: 13 additions & 0 deletions b/‎docs/src/datasets/BostonHousing.md‎
Lines changed: 13 additions & 0 deletions
diff --git a/‎docs/src/datasets/EMNIST.md‎
Lines changed: 21 additions & 0 deletions b/‎docs/src/datasets/EMNIST.md‎
Lines changed: 21 additions & 0 deletions
diff --git a/‎docs/src/datasets/Iris.md‎
Lines changed: 13 additions & 0 deletions b/‎docs/src/datasets/Iris.md‎
Lines changed: 13 additions & 0 deletions
diff --git a/‎docs/src/datasets/PTBLM.md‎
Lines changed: 20 additions & 0 deletions b/‎docs/src/datasets/PTBLM.md‎
Lines changed: 20 additions & 0 deletions
diff --git a/‎docs/src/datasets/UD_English.md‎
Lines changed: 18 additions & 0 deletions b/‎docs/src/datasets/UD_English.md‎
Lines changed: 18 additions & 0 deletions
@@ -14,9 +14,8 @@ jobs:
   build:
     runs-on: ${{ matrix.os }}
     strategy:
-      matrix:
-        julia-version: [1]
-        os: [ubuntu-latest]
+      julia-version: 1.6
+      os: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
       - uses: julia-actions/setup-julia@latest
 
@@ -1,179 +1,58 @@
 # MLDatasets.jl
 
-_This package represents a community effort to provide a common
-interface for accessing common Machine Learning (ML) datasets. In
-contrast to other data-related Julia packages, the focus of
-`MLDatasets.jl` is specifically on downloading, unpacking, and
-accessing benchmark dataset. Functionality for the purpose of
-data processing or visualization is only provided to a degree
-that is special to some dataset._
-
-| **Package Status** | **Build Status**  |
+| **Documentation** | **Build Status**  |
 |:------------------:|:-----------------:|
-| [![License](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](LICENSE.md) [![Docs](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaML.github.io/MLDatasets.jl/stable) |  [![Build Status](https://github.com/JuliaML/MLDatasets.jl/workflows/Unit%20test/badge.svg)](https://github.com/JuliaML/MLDatasets.jl/actions)|
+| ![Docs][docs-stable-img](docs-stable-url) [![Docs][docs-latest-img](docs-latest-url) |  [![Build Status](https://github.com/JuliaML/MLDatasets.jl/workflows/Unit%20test/badge.svg)](https://github.com/JuliaML/MLDatasets.jl/actions)|
 
-This package is a part of the
-[`JuliaML`](https://github.com/JuliaML) ecosystem. Its
-functionality is build on top of the package
-[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl).
-
-## Introduction
+[docs-stable-img]: https://img.shields.io/badge/docs-stable-blue.svg
+[docs-latest-img]: https://img.shields.io/badge/docs-latest-blue.svg
+[docs-stable-url]: https://JuliaML.github.io/MLDatasets.jl/stable
+[docs-latest-url]: https://JuliaML.github.io/MLDatasets.jl/latest
 
-The way `MLDatasets.jl` is organized is that each dataset has its
-own dedicated sub-module. Where possible, those sub-module share
-a common interface for interacting with the datasets. For example
-you can load the training set and the test set of the MNIST
-database of handwritten digits using the following commands:
+This package represents a community effort to provide a common interface for accessing common Machine Learning (ML) datasets. 
+In contrast to other data-related Julia packages, the focus of `MLDatasets.jl` is specifically on downloading, unpacking, and accessing benchmark datasets. 
+Functionality for the purpose of data processing or visualization is only provided to a degree that is special to some dataset.
 
-```julia
-using MLDatasets
-
-train_x, train_y = MNIST.traindata()
-test_x,  test_y  = MNIST.testdata()
-```
+This package is a part of the
+[`JuliaML`](https://github.com/JuliaML) ecosystem. 
+Its functionality is built on top of the package
+[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl).
 
-To load the data the package looks for the necessary files in
-various locations (see
-[`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl#configuration)
-for more information on how to configure such defaults). If the
-data can't be found in any of those locations, then the package
-will trigger a download dialog to `~/.julia/datadeps/MNIST`. To
-overwrite this on a case by case basis, it is possible to specify
-a data directory directly in `traindata(dir = <directory>)` and
-`testdata(dir = <directory>)`.
 
 ## Available Datasets
 
-Check out the **[latest
-documentation](https://juliaml.github.io/MLDatasets.jl/latest)**
-
-Additionally, you can make use of Julia's native docsystem.
-The following example shows how to get additional information
-on `MNIST.traintensor` within Julia's REPL:
-
-```julia
-?MNIST.traintensor
-```
-
-Each dataset has its own dedicated sub-module. As such, it makes
-sense to document their functionality similarly distributed. Find
-below a list of available datasets and links to their their
-documentation.
-
-### Image Classification
-
-This package provides a variety of common benchmark datasets for
-the purpose of image classification.
-
-Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
-:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:
-[**MNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
-[**FashionMNIST**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/) | 10 | 28x28x60000 | 60000 | 28x28x10000 | 10000
-[**CIFAR-10**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/) | 10 | 32x32x3x50000 | 50000 | 32x32x3x10000 | 10000
-[**CIFAR-100**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/) | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2)
-[**SVHN-2**](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/) (*) | 10 | 32x32x3x73257 | 73257 | 32x32x3x26032 | 26032
-
-(*) Note that the SVHN-2 dataset provides an additional 531131 observations aside from the training- and testset
-
-[**EMNIST**](https://www.nist.gov/itl/products-and-services/emnist-dataset) packages 6 different extensions of the MNIST dataset involving letters and digits and variety of test train split options. Each extension has the standard test/train data/labels nested under it as shown below.
-
-```julia
-traindata = EMNIST.Balanced.traindata()
-testdata = EMNIST.Balanced.testdata()
-trainlabels = EMNIST.Balanced.trainlabels()
-testlabels = EMNIST.Balanced.testlabels()
-```
-
-Dataset | Classes | `traindata` | `trainlabels` | `testdata` | `testlabels` | `balanced classes`
-:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:|:------------:
-**ByClass** | 62 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no
-**ByMerge** | 47 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no
-**Balanced** | 47 | 112800x28x28 | 112800x1 | 18800x28x28 | 18800x1 | yes
-**Letters** | 26 | 124800x28x28 | 124800x1 | 20800x28x28 | 208000x1 | yes
-**Digits** | 10 | 240000x28x28 | 240000x1 | 40000x28x28 | 40000x1 | yes
-**MNIST** | 10 | 60000x28x28 | 60000x1 | 10000x28x28 | 10000x1 | yes
+Each dataset has its own dedicated sub-module. 
+Find below a list of available datasets and links to their documentation.
 
-### Misc. Datasets
+#### Vision
+  - [CIFAR10](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/)
+  - [CIFAR100](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/)
+  - [EMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/EMNIST/)
+  - [FashionMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/)
+  - [MNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/MNIST/)
+  - [SVHN2](https://juliaml.github.io/MLDatasets.jl/latest/datasets/SVHN2/)
 
-Dataset | Classes | `traintensor` | `trainlabels` | `testtensor` | `testlabels`
-:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:
-**Iris** | 3 | 4x150 | 150 | - | -
-**BostongHousing** | - | 13x506 | 1x506 | - | -
 
-### Language Modeling
+#### Miscellaneous
+  - [BostonHousing](https://juliaml.github.io/MLDatasets.jl/latest/datasets/BostonHousing/)
+  - [Iris](https://juliaml.github.io/MLDatasets.jl/latest/datasets/Iris/)
 
-|    | Train x | Train y | Test x | Test y |
-|:--:|:-------:|:-------:|:------:|:------:|
-| **PTBLM** | 42068 | 42068 | 3761 | 3761 |
-| **UD_English** | 12543 | - | 2077 | - |
 
-#### PTBLM
+#### Text
+  - [PTBLM](https://juliaml.github.io/MLDatasets.jl/latest/datasets/PTBLM/)
+  - [UD_English](https://juliaml.github.io/MLDatasets.jl/latest/datasets/UD_English/)
 
-The `PTBLM` dataset consists of Penn Treebank sentences for
-language modeling, available from
-[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown
-words are replaced with `<unk>` so that the total vocabulary size
-becomes 10000.
+#### Graphs
+  - To be added.
 
-This is the first sentence of the PTBLM dataset.
+#### Audio
+  - To be added.
 
-```julia
-x, y = PTBLM.traindata()
-
-x[1]
-> ["no", "it", "was", "n't", "black", "monday"]
-y[1]
-> ["it", "was", "n't", "black", "monday", "<eos>"]
-```
-
-where `MLDataset` adds the special word: `<eos>` to the end of `y`.
-
-### Text Analysis (POS-Tagging, Parsing)
-
-#### UD English
-
-The [UD_English](https://github.com/UniversalDependencies/UD_English-EWT)
-Universal Dependencies English Web Treebank dataset is an annotated corpus of morphological features,
-POS-tags and syntactic trees. The dataset follows CoNLL-style
-format.
-
-```julia
-traindata = UD_English.traindata()
-devdata = UD_English.devdata()
-testdata = UD_English.devdata()
-```
-
-## Documentation
-
-Check out the **[latest
-documentation](https://JuliaML.github.io/MLDatasets.jl/stable)**
-
-Additionally, you can make use of Julia's native docsystem.
-The following example shows how to get additional information
-on `MNIST.convert2image` within Julia's REPL:
-
-```julia
-?MNIST.convert2image
-```
-```
-  convert2image(array) -> Array{Gray}
-
-  Convert the given MNIST horizontal-major tensor (or feature matrix) to a vertical-major Colorant array. The values are also color corrected according to
-  the website's description, which means that the digits are black on a white background.
-
-  julia> MNIST.convert2image(MNIST.traintensor()) # full training dataset
-  28×28×60000 Array{Gray{N0f8},3}:
-  [...]
-
-  julia> MNIST.convert2image(MNIST.traintensor(1)) # first training image
-  28×28 Array{Gray{N0f8},2}:
-  [...]
-```
 
 ## Installation
 
-To install `MLDatasets.jl`, start up Julia and type the following
-code snippet into the REPL. It makes use of the native Julia
+To install `MLDatasets.jl`, start up Julia and type the following code snippet into the REPL. 
+It makes use of the native Julia
 package manger.
 
 ```julia
 
@@ -1,2 +1,2 @@
 [deps]
-Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
@@ -1,29 +1,47 @@
 using Documenter, MLDatasets
 
+DocMeta.setdocmeta!(MLDatasets, :DocTestSetup, :(using MLDatasets); recursive=true)
+
+# Build documentation.
+# ====================
+
 makedocs(
     modules = [MLDatasets],
+    doctest = true,
     clean = false,
-    format = Documenter.HTML(
-                prettyurls = haskey(ENV, "CI"), 
-                assets = [joinpath("assets", "favicon.ico")]
-            ),
     sitename = "MLDatasets.jl",
+    format = Documenter.HTML(
+        canonical = "https://juliadata.github.io/MLDatasets.jl/stable/",
+        assets = ["assets/favicon.ico"],
+        prettyurls = get(ENV, "CI", nothing) == "true",
+        collapselevel=3,
+    ),
+
     authors = "Hiroyuki Shindo, Christof Stocker",
-    linkcheck = !("skiplinks" in ARGS),
     pages = Any[
         "Home" => "index.md",
         "Available Datasets" => Any[
-            "Image Classification" => Any[
-                "MNIST handwritten digits" => "datasets/MNIST.md",
-                "Fashion MNIST" => "datasets/FashionMNIST.md",
+            "Vision" => Any[
+                "MNIST" => "datasets/MNIST.md",
+                "FashionMNIST" => "datasets/FashionMNIST.md",
                 "CIFAR-10" => "datasets/CIFAR10.md",
                 "CIFAR-100" => "datasets/CIFAR100.md",
                 "SVHN format 2" => "datasets/SVHN2.md",
             ],
+            "Miscellaneous" => Any[
+                "Iris" => "datasets/Iris.md",
+                "Boston Housing" => "datasets/BostonHousing.md",
+            ],
+            
+            "Text" => Any[
+                "PTBLM" => "datasets/PTBLM.md",
+                "UD_English" => "datasets/UD_English.md",
+            ],
+
         ],
-        hide("Indices" => "indices.md"),
         "LICENSE.md",
     ]
 )
 
+
 deploydocs(repo = "github.com/JuliaML/MLDatasets.jl.git")
@@ -0,0 +1,13 @@
+# Boston Housing
+
+```@docs
+BostonHousing
+```
+
+## API reference
+
+```@docs
+BostonHousing.feature_names
+BostonHousing.features
+BostonHousing.targets
+```
@@ -0,0 +1,21 @@
+# EMNIST
+
+[**EMNIST**](https://www.nist.gov/itl/products-and-services/emnist-dataset) packages 6 different extensions of the MNIST dataset involving letters and digits and variety of test train split options. Each extension has the standard test/train data/labels nested under it as shown below.
+
+```julia
+using MLDatasets: EMNIST
+
+traindata = EMNIST.Balanced.traindata()
+testdata = EMNIST.Balanced.testdata()
+trainlabels = EMNIST.Balanced.trainlabels()
+testlabels = EMNIST.Balanced.testlabels()
+```
+
+Dataset | Classes | `traindata` | `trainlabels` | `testdata` | `testlabels` | `balanced classes`
+:------:|:-------:|:-------------:|:-------------:|:------------:|:------------:|:------------:
+**ByClass** | 62 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no
+**ByMerge** | 47 | 697932x28x28 | 697932x1 | 116323x28x28 | 116323x1 | no
+**Balanced** | 47 | 112800x28x28 | 112800x1 | 18800x28x28 | 18800x1 | yes
+**Letters** | 26 | 124800x28x28 | 124800x1 | 20800x28x28 | 208000x1 | yes
+**Digits** | 10 | 240000x28x28 | 240000x1 | 40000x28x28 | 40000x1 | yes
+**MNIST** | 10 | 60000x28x28 | 60000x1 | 10000x28x28 | 10000x1 | yes
@@ -0,0 +1,13 @@
+# Iris
+
+```@docs
+Iris
+```
+
+## API reference
+
+```@docs
+Iris.features
+Iris.labels
+Iris.download
+```
@@ -0,0 +1,20 @@
+# PTBLM
+
+The `PTBLM` dataset consists of Penn Treebank sentences for
+language modeling, available from
+[tomsercu/lstm](https://github.com/tomsercu/lstm). The unknown
+words are replaced with `<unk>` so that the total vocabulary size
+becomes 10000.
+
+This is the first sentence of the PTBLM dataset.
+
+```julia
+x, y = PTBLM.traindata()
+
+x[1]
+> ["no", "it", "was", "n't", "black", "monday"]
+y[1]
+> ["it", "was", "n't", "black", "monday", "<eos>"]
+```
+
+where `MLDataset` adds the special word: `<eos>` to the end of `y`.
@@ -0,0 +1,18 @@
+
+# UD English
+
+The [UD_English](https://github.com/UniversalDependencies/UD_English-EWT)
+Universal Dependencies English Web Treebank dataset is an annotated corpus of morphological features,
+POS-tags and syntactic trees. The dataset follows CoNLL-style format.
+
+```julia
+traindata = UD_English.traindata()
+devdata = UD_English.devdata()
+testdata = UD_English.devdata()
+```
+
+## Data Size
+
+|    | Train x | Train y | Test x | Test y |
+|:--:|:-------:|:-------:|:------:|:------:|
+| **UD_English** | 12543 | - | 2077 | - |
Original file line number	Diff line number	Diff line change
`@@ -1,2 +1,2 @@`
`1`	`1`	`[deps]`
`2`		`-Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"`
	`2`	`+Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"`