Apply computation within groups in a pandas DataFrame

One useful capability of pandas is to execute computations within groups of a DataFrame. This is achieved using groupby() followed by transform().

# Import libraries
import pandas as pd
import seaborn as sns

# Load sample data in a DataFrame
df = (
    sns.load_dataset('iris')
    .sample(n=12, random_state=24)
    .sort_values('species')
    .reset_index(drop=True)
    [['species', 'sepal_width']]
)
df
species sepal_width
0 setosa 3.4
1 setosa 3.7
2 setosa 3.2
3 setosa 3.1
4 setosa 3.8
5 versicolor 2.4
6 versicolor 2.7
7 versicolor 2.5
8 virginica 2.9
9 virginica 2.8
10 virginica 3.0
11 virginica 3.4

Compute mean at group-level

To get group-level statistics like mean, sum or count, use transform('function').

# Get group mean
df.assign(mean=
    df.groupby('species').transform('mean')
)
species sepal_width mean
0 setosa 3.4 3.440000
1 setosa 3.7 3.440000
2 setosa 3.2 3.440000
3 setosa 3.1 3.440000
4 setosa 3.8 3.440000
5 versicolor 2.4 2.533333
6 versicolor 2.7 2.533333
7 versicolor 2.5 2.533333
8 virginica 2.9 3.025000
9 virginica 2.8 3.025000
10 virginica 3.0 3.025000
11 virginica 3.4 3.025000

Standardize values

You can center values inside a group by substracting the group mean to each row.

# Standardize values
df.assign(standardized=
    df.groupby('species').transform(lambda x: x - x.mean())
)
species sepal_width standardized
0 setosa 3.4 -0.040000
1 setosa 3.7 0.260000
2 setosa 3.2 -0.240000
3 setosa 3.1 -0.340000
4 setosa 3.8 0.360000
5 versicolor 2.4 -0.133333
6 versicolor 2.7 0.166667
7 versicolor 2.5 -0.033333
8 virginica 2.9 -0.125000
9 virginica 2.8 -0.225000
10 virginica 3.0 -0.025000
11 virginica 3.4 0.375000

Rank values inside groups

Besides computing group-wise values, you can also rank values within each group.

# Rank values
df.assign(rank=
    df.groupby('species').transform(lambda x: x.rank())
)
species sepal_width rank
0 setosa 3.4 3.0
1 setosa 3.7 4.0
2 setosa 3.2 2.0
3 setosa 3.1 1.0
4 setosa 3.8 5.0
5 versicolor 2.4 1.0
6 versicolor 2.7 3.0
7 versicolor 2.5 2.0
8 virginica 2.9 2.0
9 virginica 2.8 1.0
10 virginica 3.0 3.0
11 virginica 3.4 4.0

Error when output has multiple columns

If the result of the transform generates more than 1 column, and you try to assign it to a column of an existing DataFrame, you will encounter an error like ValueError: Wrong number of items passed X, placement implies 1. To avoid this, pass only one column in the result:

df.assign(rank=
    df.groupby('species').transform(lambda x: x.rank()).iloc[:, 0]
)
species sepal_width rank
0 setosa 3.4 3.0
1 setosa 3.7 4.0
2 setosa 3.2 2.0
3 setosa 3.1 1.0
4 setosa 3.8 5.0
5 versicolor 2.4 1.0
6 versicolor 2.7 3.0
7 versicolor 2.5 2.0
8 virginica 2.9 2.0
9 virginica 2.8 1.0
10 virginica 3.0 3.0
11 virginica 3.4 4.0