We can calculate statistics on a column of a DataFrame with Pandas’ aggregate statistics functions.
These are some main functions
Command | Meaning |
---|---|
mean | Average of all values in column |
std | Standard deviation |
median | Median |
max | Maximum value in column |
min | Minimum value in column |
count | Number of values in column |
nunique | Number of unique values in column |
unique | List of unique values in column |
For example, we can see a list of colors in the inventory: print(inventory.color.unique())
Summary statistics summarize and tell us about dataset.
# Print the mean of age
print(df["age"].mean())
# Print the median of age
print(df["age"].median())
.agg() method is useful to compute multiple statistics on multiple variables.
Removing duplicates is an essential skill to get accurate counts, because often you don’t want to count the same thing multiple times.
We can drop duplicate with this syntax:
df.drop_duplicates(subset="column-name")
Counting is a great way to get an overview of your data.
# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)
# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)
# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)
# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)
Pivot tables are the standard way of aggregating data in spreadsheets.
.pivot_table()
method is just an alternative to .groupby()
The .pivot_table() method has several useful arguments, including fill_value and margins.
Aggregate statistic functions can be applied across multiple rows by using a groupby
function.
groupby function creates a new Series, not a DataFrame, so we need to add reset_index()
.
Generally, you’ll always see a groupby statement followed by reset_index:
df.groupby('column1').column2.function().reset_index()
For example, to check the most expensive shoes in the orders
pricey_shoes = orders.groupby('shoe_type').price.max().reset_index()
Note
We can use groupby to more than one column by adding a list of columns:
For example, we create a DataFrame with the total number of shoes of each shoe_type/shoe_color.
When we’re using count(), it doesn’t really matter which column we perform the calculation on. We can calculate based on id column.
shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()
shoe_type | shoe_color | id | |
---|---|---|---|
0 | ballet flats | black | 2 |
1 | ballet flats | brown | 11 |
2 | ballet flats | navy | 17 |
3 | ballet flats | red | 13 |
4 | ballet flats | white | 7 |
5 | sandals | black | 3 |
6 | sandals | brown | 10 |
7 | sandals | navy | 13 |
8 | sandals | red | 14 |
9 | sandals | white | 10 |
10 | stilettos | black | 8 |
11 | stilettos | brown | 14 |
12 | stilettos | navy | 7 |
13 | stilettos | red | 16 |
14 | stilettos | white | 5 |
15 | wedges | brown | 13 |
16 | wedges | navy | 16 |
17 | wedges | red | 4 |
18 | wedges | white | 17 |
A pivot table is a table of statistics that summarizes the data of a more extensive table.
This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.
In Pandas, the command for pivot is:
df.pivot(columns='ColumnToPivot',
index='ColumnToBeRows',
values='ColumnToBeValues')
For example, to change the above table into pivot table, we use:
shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()
shoe_counts_pivot = shoe_counts.pivot(
columns = 'shoe_color',
index = 'shoe_type',
values = 'id').reset_index()
print(shoe_counts_pivot)
shoe_type | black | brown | navy | red | white | |
---|---|---|---|---|---|---|
0 | ballet flats | 2.0 | 11.0 | 17.0 | 13.0 | 7.0 |
1 | sandals | 3.0 | 10.0 | 13.0 | 14.0 | 10.0 |
2 | stilettos | 8.0 | 14.0 | 7.0 | 16.0 | 5.0 |
3 | wedges | nan | 13.0 | 16.0 | 4.0 | 17.0 |
If you’d like to perform more complicated than mean or, you can use the apply method and lambda functions.
The Pandas apply() function can apply a function on every value in a column or row of a DataFrame.
For example, this code overwrites the existing ‘Name’ columns by making names uppercase.
df['Name'] = df.Name.apply(upper)
# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()
# Get proportion for each type
sales_propn_by_type = sales_by_type / sum(sales_by_type)
print(sales_propn_by_type)