-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathgrouping_and_sorting.py
More file actions
90 lines (65 loc) · 3.08 KB
/
grouping_and_sorting.py
File metadata and controls
90 lines (65 loc) · 3.08 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# **This notebook is an exercise in the [Pandas](https://www.kaggle.com/learn/pandas) course.
# You can reference the tutorial at [this link](https://www.kaggle.com/residentmario/grouping-and-sorting).**
#
# ---
#
# # Introduction
#
# In these exercises we'll apply groupwise analysis to our dataset.
#
# Run the code cell below to load the data before running the exercises.
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#pd.set_option("display.max_rows", 5)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.grouping_and_sorting import *
print("Setup complete.")
# # Exercises
# ## 1.
# Who are the most common wine reviewers in the dataset?
# Create a `Series` whose index is the `taster_twitter_handle` category from the dataset,
# and whose values count how many reviews each person wrote.
# Your code here
reviews_written = reviews.groupby('taster_twitter_handle')['taster_name'].count()
# Check your answer
q1.check()
# ## 2.
# What is the best wine I can buy for a given amount of money?
# Create a `Series` whose index is wine prices and whose values is the maximum number of points
# a wine costing that much was given in a review.
# Sort the values by price, ascending (so that `4.0` dollars is at the top and `3300.0` dollars is at the bottom).
best_rating_per_price = reviews.groupby('price').points.max()
# Check your answer
q2.check()
# ## 3.
# What are the minimum and maximum prices for each `variety` of wine?
# Create a `DataFrame` whose index is the `variety` category from the dataset
# and whose values are the `min` and `max` values thereof.
price_extremes = reviews.groupby('variety')['price'].agg(['min', 'max'])
# Check your answer
q3.check()
# ## 4.
# What are the most expensive wine varieties?
# Create a variable `sorted_varieties` containing a copy of the dataframe from the previous question
# where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).
sorted_varieties = price_extremes.sort_values(['min','max'], ascending=(False, False))
# Check your answer
q4.check()
# ## 5.
# Create a `Series` whose index is reviewers and whose values is the average review score
# given out by that reviewer. Hint: you will need the `taster_name` and `points` columns.
reviewer_mean_ratings = reviews.groupby('taster_name')['points'].mean()
# Check your answer
q5.check()
# Are there significant differences in the average scores assigned by the various reviewers?
# Run the cell below to use the `describe()` method to see a summary of the range of values.
reviewer_mean_ratings.describe()
# ## 6.
#
# What combination of countries and varieties are most common?
# Create a `Series` whose index is a `MultiIndex`of `{country, variety}` pairs.
# For example, a pinot noir produced in the US should map to `{"US", "Pinot Noir"}`.
# Sort the values in the `Series` in descending order based on wine count.
country_variety_counts = reviews.groupby(['country', 'variety'])['variety'].count().sort_values(ascending=False)
# Check your answer
q6.check()