Commit ede7113

author
Tania Allard
committed
Add solutions for provenance
1 parent c2e49d2 commit ede7113

File tree

7 files changed

+236
-39
lines changed

03_ProcessData.ipynb

Lines changed: 37 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -43,23 +43,6 @@
4343
"```"
4444
]
4545
},
46-
{
47-
"cell_type": "markdown",
48-
"metadata": {
49-
"slideshow": {
50-
"slide_type": "subslide"
51-
}
52-
},
53-
"source": [
54-
"# Documentation\n",
55-
"\n",
56-
"Documentation is an important part of a reproducible workflow.\n",
57-
"\n",
58-
"Take 5 minutes and identify which scripts/notebook have the best documentation. Why makes it a good documentation?\n",
59-
"\n",
60-
"A good point to start is checking the [Google Python style guidelines](https://google.github.io/styleguide/pyguide.html#Comments)"
61-
]
62-
},
6346
{
6447
"cell_type": "markdown",
6548
"metadata": {
@@ -133,6 +116,20 @@
133116
"</table>\n"
134117
]
135118
},
119+
{
120+
"cell_type": "markdown",
121+
"metadata": {},
122+
"source": [
123+
"# Running Jupyter lab\n",
124+
"\n",
125+
"We will be using [Jupyter lab](https://github.com/jupyterlab/jupyterlab) to write, execute, and modify our scripts and notebooks. \n",
126+
"\n",
127+
"You should have this installed already. We are going to start an instance by typing in the shell:\n",
128+
"```\n",
129+
"$ jupyter lab\n",
130+
"```"
131+
]
132+
},
136133
{
137134
"cell_type": "markdown",
138135
"metadata": {
@@ -154,7 +151,26 @@
154151
"$ python src/data/01_subset-data-GBP.py data/raw/winemag-data-130k-v2.csv \n",
155152
"$ python src/visualization/02_visualize-wines.py data/interim/2018-05-09-winemag_priceGBP.csv \n",
156153
"$ python src/data/03_country-subset.py data/interim/2018-05-09-winemag_priceGBP.csv Chile\n",
157-
"```\n"
154+
"```\n",
155+
"\n",
156+
"😕 What problems did you encounter? \n"
157+
]
158+
},
159+
{
160+
"cell_type": "markdown",
161+
"metadata": {
162+
"slideshow": {
163+
"slide_type": "subslide"
164+
}
165+
},
166+
"source": [
167+
"# Documentation\n",
168+
"\n",
169+
"Documentation is an important part of a reproducible workflow.\n",
170+
"\n",
171+
"Take 5 minutes and identify which scripts/notebooks have the best documentation. What makes it good documentation?\n",
172+
"\n",
173+
"To learn more about documentation and Python style, see the [Google Python style guidelines](https://google.github.io/styleguide/pyguide.html#Comments)"
158174
]
159175
},
160176
{
@@ -201,9 +217,9 @@
201217
"```\n",
202218
"Instead we need to do it like so:\n",
203219
"```python\n",
204-
"subset = importlib.import_module('.data.01_subset-data-GBP', 'scripts')\n",
205-
"plotwines = importlib.import_module('.visualization.02_visualize-wines', 'scripts')\n",
206-
"country_sub = importlib.import_module('.data.03_country-subset', 'scripts')\n",
220+
"subset = importlib.import_module('.data.01_subset-data-GBP', 'src')\n",
221+
"plotwines = importlib.import_module('.visualization.02_visualize-wines', 'src')\n",
222+
"country_sub = importlib.import_module('.data.03_country-subset', 'src')\n",
207223
"```\n",
208224
"\n",
209225
"<div class='info'> Note that we need to make sure that the other subpackages are imported into the main package </div>\n",

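The `importlib.import_module` calls in the hunk above are needed because file names such as `01_subset-data-GBP.py` start with a digit and contain hyphens, so a plain `import` statement cannot name them. The following is a minimal, self-contained sketch of that pattern. The package `demo_src`, the module `01_subset-data.py`, and the function `answer` are hypothetical stand-ins created in a temporary directory; they are not part of this repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package whose module name is not a valid identifier.
# All names here (demo_src, 01_subset-data, answer) are made up for the demo.
root = Path(tempfile.mkdtemp())
data_pkg = root / "demo_src" / "data"
data_pkg.mkdir(parents=True)
(root / "demo_src" / "__init__.py").write_text("")
(data_pkg / "__init__.py").write_text("")
(data_pkg / "01_subset-data.py").write_text("def answer():\n    return 42\n")

# Make the temporary directory importable.
sys.path.insert(0, str(root))

# `import demo_src.data.01_subset-data` would be a SyntaxError, so the module
# is looked up by name, relative to the package given as second argument.
subset = importlib.import_module(".data.01_subset-data", "demo_src")
result = subset.answer()
print(subset.__name__, result)
```

The leading dot makes the name relative, which is why the second argument (the anchor package) is required; this matches the change from `'scripts'` to `'src'` in the hunk.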
04_Testing.ipynb

Lines changed: 97 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,26 @@
3535
"cell_type": "markdown",
3636
"metadata": {},
3737
"source": [
38-
"We will start by testing some of our functions:\n",
38+
"## Exceptions \n",
39+
"Remember when you tried to run `02_visualize-wines.py`? It would not work unless you had created a figures directory beforehand.\n",
40+
"\n",
41+
"We can catch these kinds of errors by adding this piece of code:\n",
42+
"```python\n",
43+
"try:\n",
44+
"    fig.savefig(fname, bbox_inches='tight')\n",
45+
"except OSError as e:\n",
46+
"    os.makedirs('figures')\n",
47+
"    print('Creating figures directory')\n",
48+
"    fig.savefig(fname, bbox_inches='tight')\n",
49+
"```\n",
50+
"Now our `runall` should work!!! 🎉🎉"
51+
]
52+
},
53+
{
54+
"cell_type": "markdown",
55+
"metadata": {},
56+
"source": [
57+
"## Unit testing\n",
3958
"Open `03_country-subset.py` and add the following function:\n",
4059
" \n",
4160
"```python \n",
@@ -80,8 +99,9 @@
8099
"\n",
81100
"Now we can create our tests:\n",
82101
"```\n",
83-
"$ touch tests/__init__.py\n",
84-
"$ touch test_03_country_subset.py\n",
102+
"$ mkdir tests # Create tests directory\n",
103+
"$ touch tests/__init__.py # Help find the test\n",
104+
"$ touch tests/test_03_country_subset.py # Create our first test\n",
85105
"```\n",
86106
"⭐ Your test scripts name must start with: `test`"
87107
]
@@ -96,8 +116,8 @@
96116
"\n",
97117
"country = importlib.import_module('.data.03_country-subset', 'src')\n",
98118
"\n",
99-
"interim_data = \"data/interim/2018-04-30-winemag_priceGBP.csv\"\n",
100-
"processed_data = \"data/processed/2018-04-30-winemag_Chile.csv\"\n",
119+
"interim_data = \"data/interim/2018-05-09-winemag_priceGBP.csv\"\n",
120+
"processed_data = \"data/processed/2018-05-03-winemag_Chile.csv\"\n",
101121
"\n",
102122
"def test_get_mean_price():\n",
103123
" mean_price = country.get_mean_price(processed_data)\n",
@@ -106,15 +126,15 @@
106126
"\n",
107127
"And you can run it from the shell using:\n",
108128
"```\n",
109-
"$ python -m pytest tests/test_03_country-subset.py\n",
129+
"$ pytest\n",
110130
"```"
111131
]
112132
},
113133
{
114134
"cell_type": "markdown",
115135
"metadata": {},
116136
"source": [
117-
"## What if you want all the decimal numbers?\n",
137+
"### What if you want all the decimal numbers?\n",
118138
"\n",
119139
"``` python\n",
120140
"import importlib\n",
@@ -166,6 +186,57 @@
166186
"``` "
167187
]
168188
},
189+
{
190+
"cell_type": "markdown",
191+
"metadata": {},
192+
"source": [
193+
"Pytest tells us which tests passed and which did not:\n",
194+
"\n",
195+
"```python\n",
196+
" {message}\n",
197+
" [left]: {left}\n",
198+
" [right]: {right}\"\"\".format(obj=obj, message=message, left=left, right=right)\n",
199+
"\n",
200+
" if diff is not None:\n",
201+
" msg += \"\\n[diff]: {diff}\".format(diff=diff)\n",
202+
"\n",
203+
"> raise AssertionError(msg)\n",
204+
"E AssertionError: DataFrame are different\n",
205+
"E\n",
206+
"E DataFrame shape mismatch\n",
207+
"E [left]: (4472, 6)\n",
208+
"E [right]: (4472, 7)\n",
209+
"```"
210+
]
211+
},
212+
{
213+
"cell_type": "markdown",
214+
"metadata": {},
215+
"source": [
216+
"We now know what kind of bugs we can encounter.\n",
217+
"Let's fix this: open `03_country-subset.py` and add the following lines:\n",
218+
"\n",
219+
"```python\n",
220+
"def get_country(filename, country):\n",
221+
" # Load table\n",
222+
" wine = pd.read_csv(filename)\n",
223+
"\n",
224+
" # Use the country name to subset data\n",
225+
" subset_country = wine[wine['country'] == country ].copy()\n",
226+
" subset_country.reset_index(drop=True, inplace=True) \n",
227+
"\n",
228+
" # Constructing the fname\n",
229+
" today = datetime.datetime.today().strftime('%Y-%m-%d')\n",
230+
" fname = f'data/processed/{today}-winemag_{country}.csv'\n",
231+
"\n",
232+
" # Saving the csv\n",
233+
" subset_country.to_csv(fname, index =False)\n",
234+
" print(fname) # print the fname from here\n",
235+
"\n",
236+
" return(subset_country) #returns the data frame\n",
237+
"```"
238+
]
239+
},
169240
{
170241
"cell_type": "markdown",
171242
"metadata": {},
@@ -189,13 +260,13 @@
189260
"cell_type": "markdown",
190261
"metadata": {},
191262
"source": [
192-
"# Past as Truth\n",
263+
"## Past as Truth (regression tests)\n",
193264
"\n",
194265
"Regression tests assume that the past is “correct.” They are great for letting developers know when and how a code base has changed. They are not great for letting anyone know why the change occurred. The change between what a code produces now and what it computed before is called a regression.\n",
195266
"\n",
196267
"** How many times have you tried to run a script or a notebook you found online just to realize it is broken?**\n",
197268
"\n",
198-
"Let's do some regression testing on the Jupyter notebook using *nbval*"
269+
"Let's do some regression testing on the Jupyter notebook using [nbval](https://github.com/computationalmodelling/nbval)"
199270
]
200271
},
201272
{
@@ -244,7 +315,23 @@
244315
"cell_type": "markdown",
245316
"metadata": {},
246317
"source": [
247-
"Make sure everything is commited to git before carrying on.\n"
318+
"<div class='warn'>Make sure everything is committed to git before carrying on.</div>\n",
319+
"<br>\n",
320+
"Add the following line to your `runall-wine-analysis` script\n",
321+
"\n",
322+
"```python\n",
323+
"import recipy\n",
324+
"```\n",
325+
"Run the script again with `python -m src.runall-wine-analysis`"
326+
]
327+
},
328+
{
329+
"cell_type": "markdown",
330+
"metadata": {},
331+
"source": [
332+
"You can now track the provenance of your project. \n",
333+
"\n",
334+
"Try using `recipy latest` and `recipy gui`"
248335
]
249336
},
250337
{
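The "Past as Truth" cell above describes regression tests as comparing what the code produces now with what it produced before. A minimal sketch of that idea using only the standard library is shown here. It does not use nbval, and `summarize`, `check_regression`, and the reference file name are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def summarize(prices):
    """Toy analysis step: return summary statistics for a list of prices."""
    return {"n": len(prices), "mean": round(sum(prices) / len(prices), 2)}


def check_regression(result, reference_path):
    """Compare `result` with the stored reference, creating it on first run."""
    reference_path = Path(reference_path)
    if not reference_path.exists():
        # First run: the current output becomes the "truth" for later runs.
        reference_path.write_text(json.dumps(result, sort_keys=True))
        return True
    return json.loads(reference_path.read_text()) == result


reference = Path(tempfile.mkdtemp()) / "summary_reference.json"

first_run = check_regression(summarize([10.0, 12.5, 20.0]), reference)
same_again = check_regression(summarize([10.0, 12.5, 20.0]), reference)
after_change = check_regression(summarize([10.0, 12.5, 21.0]), reference)
print(first_run, same_again, after_change)
```

The third call reports a regression because the input changed, but, as the notebook text notes, the check only says that the output differs, not why.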
Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,74 @@
+#!/usr/bin/env python
+"""
+Module containing the functions to visualize the
+wine distribution using a subset of the data
+"""
+
+import sys
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+
+def create_plots(filename):
+    """
+    Create plots for the analysis
+    Args:
+    -----
+    filename: str
+        Path to the filename containing the wine data
+
+    """
+    wine = pd.read_csv(filename)
+
+    # Calls the function that plots the distribution
+    print(plot_distribution(wine))
+
+    # Calls the function that plots the scatter plot
+    print(plot_scatter(wine))
+
+
+def plot_distribution(wine):
+    num_bins = 20
+
+    mu = 88 # mean of distribution
+    sigma = 3 # standard deviation of distribution
+
+    # Histogram of the data
+    fig, ax = plt.subplots(figsize = (10,8))
+    n, bins, patches = plt.hist(wine['points'], num_bins, density=1, facecolor='blue', alpha=0.5)
+
+    # Add a 'best fit' line
+    y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
+         np.exp(-0.5 * (1 / sigma * (bins - mu))**2))
+
+    ax.plot(bins, y, '--')
+
+    ax.set_title(r'Distribution of Wine scores: $\mu=88$, $\sigma=3$')
+    ax.set_ylabel('Probability density')
+    ax.set_xlabel('Points')
+
+    fname = f'figures/fig01_distribution-wine-scores.png'
+
+    fig.savefig(fname, bbox_inches = 'tight')
+    return (fname)
+
+
+def plot_scatter(wine):
+
+    fig, ax = plt.subplots(figsize = (10,8))
+
+    plt.scatter(wine['points'], wine['price'])
+    ax.set_title('Scatter wine points vs price')
+    ax.set_ylabel('Price USD')
+    ax.set_xlabel('Points')
+
+    fname = f'figures/fig02_scatter-points-vs-price.png'
+    fig.savefig(fname, bbox_inches = 'tight')
+    return (fname)
+
+
+if __name__ == '__main__':
+    # Filename is passed by the user
+    filename = sys.argv[1]
+
+    create_plots(filename)
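
The `__main__` block above reads `sys.argv[1]` directly, so running the script without an argument ends in an `IndexError`. One possible alternative, sketched here with the standard-library `argparse` module, prints a usage message instead. The `parse_args` helper and the description text are assumptions for illustration and are not part of this commit.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; `argv=None` makes argparse read sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Plot the wine score distribution and a points/price scatter."
    )
    parser.add_argument("filename", help="path to the CSV file with the wine data")
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of the real command line.
args = parse_args(["data/interim/2018-05-09-winemag_priceGBP.csv"])
print(args.filename)
```

In the script itself, the last lines would then become `create_plots(parse_args().filename)`.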

solutions/02_visualize-wines.py

Lines changed: 15 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -51,8 +51,14 @@ def plot_distribution(wine):
     ax.set_xlabel('Points');
 
     fname = f'figures/fig01_distribution-wine-scores.png'
+
+    try:
+        fig.savefig(fname, bbox_inches = 'tight')
+    except OSError as e:
+        os.makedirs('figures')
+        print('Creating figures directory')
+        fig.savefig(fname, bbox_inches='tight')
 
-    fig.savefig(fname, bbox_inches = 'tight')
     return (fname)
 

@@ -66,7 +72,14 @@ def plot_scatter(wine):
     ax.set_xlabel('Points')
 
     fname = f'figures/fig02_scatter-points-vs-price.png'
-    fig.savefig(fname, bbox_inches = 'tight')
+
+    try:
+        fig.savefig(fname, bbox_inches = 'tight')
+    except OSError as e:
+        os.makedirs('figures')
+        print('Creating figures directory')
+        fig.savefig(fname, bbox_inches='tight')
+
     return (fname)
 
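
The two hunks above wrap `fig.savefig` in a `try`/`except` that creates the `figures` directory on failure and then retries. The same pattern can be tried without matplotlib, as in this sketch that writes a plain text file into a temporary directory. The `save_report` helper is hypothetical, and catching `FileNotFoundError` (a subclass of `OSError`) instead of the broader `OSError` is a choice made for this example, not what the commit does.

```python
import os
import tempfile


def save_report(text, fname):
    """Write `text` to `fname`, creating the parent directory if it is missing.

    Returns True when the directory had to be created, False otherwise.
    """
    try:
        with open(fname, "w") as handle:
            handle.write(text)
        created = False
    except FileNotFoundError:  # raised when the parent directory does not exist
        os.makedirs(os.path.dirname(fname), exist_ok=True)
        with open(fname, "w") as handle:
            handle.write(text)
        created = True
    return created


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "figures", "fig01.txt")

first = save_report("placeholder", target)   # directory does not exist yet
second = save_report("placeholder", target)  # directory is already there
print(first, second)
```

Catching the narrower exception means unrelated failures, such as a permission error, are not mistaken for a missing directory.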

solutions/03_country-subset.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,6 @@ def get_country(filename, country):
 
     # Use the country name to subset data
     subset_country = wine[wine['country'] == country ].copy()
-    subset_country.reset_index(drop=True, inplace=True)
 
     # Subset the
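
The `get_country` function shown in `04_Testing.ipynb` builds its output path from today's date, while the test in the same notebook points at a fixed, dated file name. The sketch below isolates that file-name logic so it can be called with a fixed date. The `processed_filename` helper and its `day` parameter are hypothetical and do not appear in the repository.

```python
import datetime


def processed_filename(country, day=None):
    """Build the date-stamped CSV path; `day` defaults to today's date."""
    if day is None:
        day = datetime.date.today()
    return f"data/processed/{day:%Y-%m-%d}-winemag_{country}.csv"


# A fixed date gives a reproducible name that a test can rely on.
fixed = processed_filename("Chile", datetime.date(2018, 5, 3))
print(fixed)  # data/processed/2018-05-03-winemag_Chile.csv
```

Called without a date, the helper follows the same `strftime('%Y-%m-%d')` convention as the notebook version of `get_country`.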
