# Introduction

When running an experiment, sometimes the **randomisation unit is different from the analysis unit.** For example, we randomise at the user level but analyse at the session level: sessions from the same user are then correlated, and the assumption of independence between observations no longer holds.

Since the independent and identically distributed (i.i.d.) assumption is violated, we cannot simply run a standard test on the raw data: it would typically underestimate the variance and produce misleading p-values.
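
To make the problem concrete, here is a small simulation (all numbers hypothetical) where sessions from the same user share that user's conversion propensity, so session-level observations are not independent:

```
# Hypothetical illustration: users are randomised, but each user
# contributes many sessions that share the same conversion propensity
import numpy as np

rng = np.random.default_rng(42)

n_users = 1000
user_propensity = rng.beta(2, 8, size=n_users)      # one propensity per user
n_sessions = rng.poisson(lam=40, size=n_users) + 1  # sessions per user

# All of a user's sessions depend on the same propensity, so outcomes
# are correlated within users rather than i.i.d. across sessions
conversions = rng.binomial(n=n_sessions, p=user_propensity)
```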

Several options are possible. One is to estimate the true variance with the Delta method, explained in a previous post: *Delta Method for A/B testing*.
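
As a rough sketch (illustrative code, not the exact implementation from that post), the Delta method approximates the variance of the ratio metric from per-user totals:

```
# Delta-method sketch: approximate variance of the ratio
# sum(conversions) / sum(sessions), computed from user-level totals
import numpy as np

def delta_var(conversions, sessions):
    x, y = np.asarray(conversions), np.asarray(sessions)
    n = len(x)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    r = mu_x / mu_y
    # First-order Taylor expansion of x_bar / y_bar around the means
    return (var_x - 2 * r * cov_xy + r**2 * var_y) / (n * mu_y**2)
```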

Another option is to perform bootstrapping, which we discuss in this article.

# Python implementation

The approach, step by step:

1. **As experiment data, we have a DataFrame of users** that were randomly assigned to the Control or Target group. We recorded their sessions in the app, and the number of sessions that generated a conversion.
2. **Define a function to calculate the difference** in conversion rates between groups.
3. **Calculate the difference for the observed data**, not yet resampled. In this example, there is a -2.29% difference in the Target group vs Control, as seen from the summary stats in step 1.
4. **Perform bootstrapping by randomly assigning groups *n* times.** At each iteration, the difference in conversion rates between the random groups is returned and appended to an array.
5. **Finally, compute the p-value** as the share of bootstrap samples where the *absolute* difference (because we're running a two-tailed test) in conversion rates is greater than the observed difference.

The table contains **one row per user**, with their `group`, total number of `sessions`, total number of `conversions`, and `conversion_rate` calculated as conversions over sessions:

```
| group | user_id | sessions | conversions | conversion_rate |
|:--------|:-----------------|---------:|------------:|----------------:|
| Control | b0cc6b25669f1cfb | 150 | 62 | 0.413333 |
| Target | 1cc2f0c081cff495 | 20 | 11 | 0.550000 |
| Control | 0dfa929aa7cea87a | 31 | 6 | 0.193548 |
| Target | 0dfa929aa7cea87a | 39 | 9 | 0.230769 |
| Control | e1916d7a661d210f | 3 | 2 | 0.666667 |
```
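
If you want to follow along without the original dataset, a toy DataFrame with the same structure can be generated like this (names and numbers are made up, so outputs will not match the article's exactly):

```
# Hypothetical data in the article's format
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_users = 981  # 488 Control + 493 Target in the summary below

df = pd.DataFrame({
    'user_id': [f'user_{i:04d}' for i in range(n_users)],
    'group': rng.choice(['Control', 'Target'], size=n_users),
    'sessions': rng.poisson(lam=40, size=n_users) + 1,
})
df['conversions'] = rng.binomial(n=df['sessions'], p=0.2)
df['conversion_rate'] = df['conversions'] / df['sessions']
```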

We check summary statistics, with the conversion rate of each group:

```
# Summary stats: users, sessions, conversions, conversion rate per group
df_summary = (
    df
    .groupby('group')
    .agg({'user_id': 'count', 'sessions': 'sum', 'conversions': 'sum'})
    .rename(columns={'user_id': 'users'})  # the count of rows is the number of users
    .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
)
df_summary
```

```
| group   | users | sessions | conversions | conversion_rate |
|:--------|------:|---------:|------------:|----------------:|
| Control |   488 |    37689 |        7662 |          0.2032 |
| Target  |   493 |    45106 |        8134 |          0.1803 |
```

Note: another possible option would be to look at the difference in *average (unweighted) user conversion rates* between groups.
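
As a sketch, that alternative only changes the statistic: it averages each user's own rate, so heavy users do not dominate the group metric.

```
# Unweighted alternative: mean of per-user conversion rates per group
unweighted = df.groupby('group')['conversion_rate'].mean()
unweighted['Target'] - unweighted['Control']
```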

```
# Function to get the statistic: difference in conversion rates between
# the two groups (second minus first, in groupby sort order)
def calculate_difference(data, numerator, denominator, group):
    conv_rates = (
        data
        .groupby(group)
        .agg({numerator: 'sum', denominator: 'sum'})
        .assign(conv_rate=lambda x: x[numerator] / x[denominator])
        ['conv_rate']
    )
    return conv_rates.iloc[1] - conv_rates.iloc[0]
```

```
# Actual observed difference (Target minus Control)
observed_difference = calculate_difference(df, 'conversions', 'sessions', 'group')
observed_difference
```

`-0.0229`

```
import numpy as np

# Resample n times: randomly split users into two arbitrary groups
# and recompute the difference for each random split
n_bootstrap = 10000
bootstrap_difference = []
for i in range(n_bootstrap):
    df['boot_group'] = 'A'
    df.loc[df.sample(frac=0.5).index, 'boot_group'] = 'B'
    bootstrap_difference.append(
        calculate_difference(df, 'conversions', 'sessions', 'boot_group')
    )
bootstrap_difference = np.array(bootstrap_difference)
```

**This is the very definition of the p-value**: if we repeatedly split the data into two random groups, how often would we get a more extreme difference than the one we observed?

```
# Calculate p-value
p_value = (np.abs(bootstrap_difference) >= np.abs(observed_difference)).sum() / n_bootstrap
print("p-value: {:.3f}".format(p_value))
```

`p-value: 0.465`

It happens to be non-significant, with a p-value much higher than 0.05.
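
One way to see this, assuming `matplotlib` is available, is to plot the resampled differences: the observed value sits well inside the distribution produced by random group assignments.

```
# Distribution of differences under random grouping vs the observed one
import matplotlib.pyplot as plt

plt.hist(bootstrap_difference, bins=50)
plt.axvline(observed_difference, color='red', linestyle='--', label='observed')
plt.axvline(-observed_difference, color='red', linestyle=':')
plt.xlabel('Difference in conversion rates between random groups')
plt.legend()
plt.show()
```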