Add solutions for provenance

Tania Allard · Tania Allard · commit ede711311e59 · 2018-05-04T11:38:12.000+01:00
diff --git a/03_ProcessData.ipynb b/03_ProcessData.ipynb
@@ -43,23 +43,6 @@
     "```"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "slideshow": {
-     "slide_type": "subslide"
-    }
-   },
-   "source": [
-    "# Documentation\n",
-    "\n",
-    "Documentation is an important part of a reproducible workflow.\n",
-    "\n",
-    "Take 5 minutes and identify which scripts/notebook have the best documentation. Why makes it a good documentation?\n",
-    "\n",
-    "A good point to start is checking the [Google Python style guidelines](https://google.github.io/styleguide/pyguide.html#Comments)"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {
@@ -133,6 +116,20 @@
     "</table>\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Running JupyterLab\n",
+    "\n",
+    "We will be using [JupyterLab](https://github.com/jupyterlab/jupyterlab) to write, execute, and modify our scripts and notebooks.\n",
+    "\n",
+    "You should have this installed already. Start an instance by typing the following in the shell:\n",
+    "```\n",
+    "$ jupyter lab\n",
+    "```"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
@@ -154,7 +151,26 @@
     "$ python src/data/01_subset-data-GBP.py data/raw/winemag-data-130k-v2.csv \n",
     "$ python src/visualization/02_visualize-wines.py data/interim/2018-05-09-winemag_priceGBP.csv \n",
     "$ python src/data/03_country-subset.py data/interim/2018-05-09-winemag_priceGBP.csv Chile\n",
-    "```\n"
+    "```\n",
+    "\n",
+    "😕 What problems did you encounter? \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "subslide"
+    }
+   },
+   "source": [
+    "# Documentation\n",
+    "\n",
+    "Documentation is an important part of a reproducible workflow.\n",
+    "\n",
+    "Take 5 minutes to identify which scripts or notebooks have the best documentation. What makes it good documentation?\n",
+    "\n",
+    "To learn more about documentation and Python style, see the [Google Python style guidelines](https://google.github.io/styleguide/pyguide.html#Comments)"
    ]
   },
   {
@@ -201,9 +217,9 @@
     "```\n",
     "Instead we need to do it like so:\n",
     "```python\n",
-    "subset = importlib.import_module('.data.01_subset-data-GBP', 'scripts')\n",
-    "plotwines = importlib.import_module('.visualization.02_visualize-wines', 'scripts')\n",
-    "country_sub = importlib.import_module('.data.03_country-subset', 'scripts')\n",
+    "subset = importlib.import_module('.data.01_subset-data-GBP', 'src')\n",
+    "plotwines = importlib.import_module('.visualization.02_visualize-wines', 'src')\n",
+    "country_sub = importlib.import_module('.data.03_country-subset', 'src')\n",
     "```\n",
     "\n",
     "<div class='info'> Note that we need to make sure that the other subpackages are imported into the main package </div>\n",
diff --git a/04_Testing.ipynb b/04_Testing.ipynb
@@ -35,7 +35,26 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We will start by testing some of our functions:\n",
+    "## Exceptions \n",
+    "Remember when you tried to run `02_visualize-wines.py`? It would not work unless you had created a `figures` directory beforehand.\n",
+    "\n",
+    "We can catch these kinds of errors by adding this piece of code:\n",
+    "```python\n",
+    "try:\n",
+    "    fig.savefig(fname, bbox_inches='tight')\n",
+    "except OSError:\n",
+    "    os.makedirs('figures')\n",
+    "    print('Creating figures directory')\n",
+    "    fig.savefig(fname, bbox_inches='tight')\n",
+    "```\n",
+    "Now our `runall` should work!!! 🎉🎉"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Unit testing\n",
     "Open `03_country-subset.py` and add the following function:\n",
     "    \n",
     "```python \n",
@@ -80,8 +99,9 @@
     "\n",
     "Now we can create our tests:\n",
     "```\n",
-    "$ touch tests/__init__.py\n",
-    "$ touch test_03_country_subset.py\n",
+    "$ mkdir tests                           # Create the tests directory\n",
+    "$ touch tests/__init__.py               # Helps pytest find the tests\n",
+    "$ touch tests/test_03_country_subset.py # Create our first test\n",
     "```\n",
     "⭐ Your test scripts name must start with: `test`"
    ]
@@ -96,8 +116,8 @@
     "\n",
     "country = importlib.import_module('.data.03_country-subset', 'src')\n",
     "\n",
-    "interim_data = \"data/interim/2018-04-30-winemag_priceGBP.csv\"\n",
-    "processed_data = \"data/processed/2018-04-30-winemag_Chile.csv\"\n",
+    "interim_data = \"data/interim/2018-05-09-winemag_priceGBP.csv\"\n",
+    "processed_data = \"data/processed/2018-05-03-winemag_Chile.csv\"\n",
     "\n",
     "def test_get_mean_price():\n",
     "    mean_price = country.get_mean_price(processed_data)\n",
@@ -106,15 +126,15 @@
     "\n",
     "And you can run it from the shell using:\n",
     "```\n",
-    "$ python -m pytest tests/test_03_country-subset.py\n",
+    "$ pytest\n",
     "```"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## What if you want all the decimal numbers?\n",
+    "### What if you want all the decimal numbers?\n",
     "\n",
     "``` python\n",
     "import importlib\n",
@@ -166,6 +186,57 @@
     "```    "
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Pytest tells us which tests passed and which did not:\n",
+    "\n",
+    "```python\n",
+    " {message}\n",
+    "    [left]:  {left}\n",
+    "    [right]: {right}\"\"\".format(obj=obj, message=message, left=left, right=right)\n",
+    "\n",
+    "        if diff is not None:\n",
+    "            msg += \"\\n[diff]: {diff}\".format(diff=diff)\n",
+    "\n",
+    ">       raise AssertionError(msg)\n",
+    "E       AssertionError: DataFrame are different\n",
+    "E\n",
+    "E       DataFrame shape mismatch\n",
+    "E       [left]:  (4472, 6)\n",
+    "E       [right]: (4472, 7)\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We now know what kind of bugs we can encounter.\n",
+    "Let's fix this: open `03_country-subset.py` and add the following lines:\n",
+    "\n",
+    "```python\n",
+    "def get_country(filename, country):\n",
+    "    # Load table\n",
+    "    wine = pd.read_csv(filename)\n",
+    "\n",
+    "    # Use the country name to subset data\n",
+    "    subset_country = wine[wine['country'] == country ].copy()\n",
+    "    subset_country.reset_index(drop=True, inplace=True) \n",
+    "\n",
+    "    # Constructing the fname\n",
+    "    today = datetime.datetime.today().strftime('%Y-%m-%d')\n",
+    "    fname = f'data/processed/{today}-winemag_{country}.csv'\n",
+    "\n",
+    "    # Saving the csv\n",
+    "    subset_country.to_csv(fname, index=False)\n",
+    "    print(fname)  # print the fname from here\n",
+    "\n",
+    "    return subset_country  # return the data frame\n",
+    "```"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -189,13 +260,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Past as Truth\n",
+    "## Past as Truth (regression tests)\n",
     "\n",
     "Regression tests assume that the past is “correct.” They are great for letting developers know when and how a code base has changed. They are not great for letting anyone know why the change occurred. The change between what a code produces now and what it computed before is called a regression.\n",
     "\n",
     "** How many times have you tried to run a script or a notebook you found online just to realize it is broken?**\n",
     "\n",
-    "Let's do some regression testing on the Jupyter notebook using *nbval*"
+    "Let's do some regression testing on the Jupyter notebook using [nbval](https://github.com/computationalmodelling/nbval)"
    ]
   },
   {
@@ -244,7 +315,23 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Make sure everything is commited to git before carrying on.\n"
+    "<div class='warn'>Make sure everything is committed to git before carrying on.</div>\n",
+    "<br>\n",
+    "Add the following line at the top of your `runall-wine-analysis` script, before the other imports:\n",
+    "\n",
+    "```python\n",
+    "import recipy\n",
+    "```\n",
+    "Run the script again with `python -m src.runall-wine-analysis`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can now track the provenance of your project. \n",
+    "\n",
+    "Try using `recipy latest` and `recipy gui`"
    ]
   },
   {
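Note on the `importlib` calls in the notebooks above: a plain `import` statement cannot name a module such as `03_country-subset`, because the name starts with a digit and contains a hyphen. That is why both the notebook and the test file go through `importlib.import_module` with a relative name and a package anchor. The sketch below reproduces the same call in isolation. The `demo_src` package and the `answer` function are hypothetical stand-ins created in a temporary directory, not files from this repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package whose module name is not a valid identifier.
# 'demo_src' and 'answer' are hypothetical names used only for illustration.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_src" / "data"
pkg.mkdir(parents=True)
(root / "demo_src" / "__init__.py").write_text("")
(pkg / "__init__.py").write_text("")
(pkg / "01_subset-data.py").write_text("def answer():\n    return 42\n")

# Make the temporary directory importable and refresh the import caches.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Relative module name first, package anchor second, as in the notebook.
subset = importlib.import_module(".data.01_subset-data", "demo_src")

print(subset.answer())
```

The second argument plays the role that `'src'` plays in the notebook: it is the package the leading dot is resolved against.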
diff --git a/solutions/02_visualize-wines-forTest.py b/solutions/02_visualize-wines-forTest.py
@@ -0,0 +1,75 @@
+#!/usr/bin/env python
+"""
+Module containing the functions to visualize the
+wine distribution using a subset of the data
+"""
+
+import sys
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+
+def create_plots(filename):
+    """
+    Create plots for the analysis
+    Args:
+    -----
+    filename: str
+        Path to the filename containing the wine data
+   
+    """
+    wine = pd.read_csv(filename)
+
+    # Calls the function that plots the distribution
+    print(plot_distribution(wine))
+
+    # Calls the function that plots the scatter plot
+    print(plot_scatter(wine))
+
+
+def plot_distribution(wine):
+    num_bins = 20
+
+    mu = 88 # mean of distribution
+    sigma = 3 # standard deviation of distribution
+
+    # Histogram of the data
+    fig, ax =  plt.subplots(figsize = (10,8))
+    n, bins, patches = plt.hist(wine['points'], num_bins, density=1, facecolor='blue', alpha=0.5)
+
+
+    # Add a 'best fit' line
+    y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
+         np.exp(-0.5 * (1 / sigma * (bins - mu))**2))
+
+    ax.plot(bins, y, '--')
+
+    ax.set_title(r'Distribution of Wine scores:  $\mu=88$, $\sigma=3$')
+    ax.set_ylabel('Probability density')
+    ax.set_xlabel('Points')
+
+    fname = 'figures/fig01_distribution-wine-scores.png'
+
+    fig.savefig(fname, bbox_inches = 'tight')
+    return (fname)
+
+
+def plot_scatter(wine):
+
+    fig, ax =  plt.subplots(figsize = (10,8))
+
+    plt.scatter(wine['points'], wine['price'])
+    ax.set_title('Scatter wine points vs price')
+    ax.set_ylabel('Price USD')
+    ax.set_xlabel('Points')
+
+    fname = f'figures/fig02_scatter-points-vs-price.png'
+    fig.savefig(fname, bbox_inches = 'tight')
+    return (fname)
+
+
+if __name__ == '__main__':
+    # Filename is passed by the user 
+    filename = sys.argv[1]
+    
+    create_plots(filename)
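Note on the 'best fit' line in `plot_distribution`: the expression is the normal probability density evaluated with the fixed values `mu = 88` and `sigma = 3`, so it is not fitted to the data. As a rough sanity check, the sketch below writes the same expression for a single float and compares it with `statistics.NormalDist.pdf` from the standard library (Python 3.8 or later). The function name `normal_pdf` is an assumption made for this example; the script itself evaluates the expression on the NumPy array `bins`.

```python
import math
from statistics import NormalDist

mu = 88     # mean used in plot_distribution
sigma = 3   # standard deviation used in plot_distribution


def normal_pdf(x, mu, sigma):
    """Same expression as the 'best fit' line, written for a single float."""
    scale = 1 / (math.sqrt(2 * math.pi) * sigma)
    return scale * math.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Compare against the standard library implementation at a few scores.
reference = NormalDist(mu, sigma)
for x in (80, 88, 95):
    assert math.isclose(normal_pdf(x, mu, sigma), reference.pdf(x))
```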
diff --git a/solutions/02_visualize-wines.py b/solutions/02_visualize-wines.py
@@ -51,8 +51,14 @@ def plot_distribution(wine):
     ax.set_xlabel('Points');
 
     fname = f'figures/fig01_distribution-wine-scores.png'
+    
+    try:
+        fig.savefig(fname, bbox_inches = 'tight')
+    except OSError as e:
+        os.makedirs('figures')
+        print('Creating figures directory')
+        fig.savefig(fname, bbox_inches='tight')
 
-    fig.savefig(fname, bbox_inches = 'tight')
     return (fname)
 
 
@@ -66,7 +72,14 @@ def plot_scatter(wine):
     ax.set_xlabel('Points')
 
     fname = f'figures/fig02_scatter-points-vs-price.png'
-    fig.savefig(fname, bbox_inches = 'tight')
+    
+    try:
+        fig.savefig(fname, bbox_inches = 'tight')
+    except OSError as e:
+        os.makedirs('figures')
+        print('Creating figures directory')
+        fig.savefig(fname, bbox_inches='tight')
+        
     return (fname)
 
 
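Note on the `try`/`except` blocks above: they retry the save after any `OSError`, including ones unrelated to a missing directory. One possible alternative, sketched below, is to create the parent directory before saving with `os.makedirs(..., exist_ok=True)`. The helper name `ensure_parent_dir` is hypothetical, and a plain file write stands in for `fig.savefig(fname)` so the example runs without matplotlib.

```python
import os
import tempfile


def ensure_parent_dir(fname):
    """Create the directory that will hold fname if it does not exist yet."""
    parent = os.path.dirname(fname)
    if parent:
        # exist_ok=True makes the call a no-op when the directory is present
        os.makedirs(parent, exist_ok=True)
    return fname


# Stand-in for fig.savefig(fname): write a small file into a new directory.
base = tempfile.mkdtemp()
fname = os.path.join(base, "figures", "fig01_distribution-wine-scores.png")
with open(ensure_parent_dir(fname), "wb") as handle:
    handle.write(b"placeholder")

print(os.path.isdir(os.path.join(base, "figures")))
```

In the plotting functions this would amount to calling the helper on `fname` just before `fig.savefig`, which keeps a single save call per function.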
diff --git a/solutions/03_country-subset.py b/solutions/03_country-subset.py
@@ -28,7 +28,6 @@ def get_country(filename, country):
 
     # Use the country name to subset data
     subset_country = wine[wine['country'] == country ].copy()
-    subset_country.reset_index(drop=True, inplace=True)
 
     # Subset the
 
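Note on dated file names: `get_country` stamps its output with today's date, while the test in `04_Testing.ipynb` points at fixed names such as `2018-05-03-winemag_Chile.csv`, so the hard-coded paths go stale whenever the scripts are rerun on another day. One way around this, sketched below under that assumption, is to pick the most recent matching file. `latest_file` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def latest_file(directory, pattern):
    """Return the newest file matching pattern, relying on the ISO date prefix.

    Hypothetical helper: names such as '2018-05-09-winemag_Chile.csv'
    sort chronologically when compared as plain strings.
    """
    matches = sorted(Path(directory).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r} in {directory}")
    return matches[-1]


# Demonstration with empty placeholder files in a temporary directory.
demo_dir = Path(tempfile.mkdtemp())
for name in ("2018-04-30-winemag_Chile.csv", "2018-05-03-winemag_Chile.csv"):
    (demo_dir / name).touch()

processed_data = latest_file(demo_dir, "*-winemag_Chile.csv")
print(processed_data.name)
```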
diff --git a/solutions/scripts/data/01_subset-data-GBP.py b/solutions/scripts/data/01_subset-data-GBP.py
diff --git a/solutions/scripts/runall-wine-analysis.py b/solutions/scripts/runall-wine-analysis.py