Lesson 4 - Designing an experiment

Notes from Udacity A/B Testing course
a/b-testing
notes
Author

Oren Bochman

Published

Sunday, January 1, 2023

udacity

Lesson 4: Designing an experiment

Choose “Subject” — Unit of Diversion

We begin with two definitions:

Subject
the individual unit that gets assigned to either the control or the experiment group
Unit of diversion
how we define what an individual subject is in the experiment; typically a proxy for a user, such as a user-id, cookie, or event
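
A common way to implement diversion is to hash the unit's identifier together with the experiment name and map the hash to an arm. Below is a minimal sketch in Python; the experiment name, bucket count, and 50/50 split are illustrative assumptions, not anything prescribed by the course.

```python
import hashlib

def assign_arm(unit_id: str, experiment: str = "lesson4-demo") -> str:
    """Deterministically divert a unit (user-id, cookie, or event id) to an arm."""
    digest = hashlib.md5(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                      # 100 stable buckets per unit
    return "experiment" if bucket < 50 else "control"   # illustrative 50/50 split

# The same user-id always lands in the same arm, giving a consistent experience
# across visits (and across devices, as long as the user is signed in).
print(assign_arm("user-12345"), assign_arm("user-12345"))
```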

How does one choose the unit of diversion?

  1. Consistency
    1. User consistency:
      • If using user-id, users get a consistent experience as they change devices so long as they are signed in

      • If you test a change that crosses the boundary between signed-in and signed-out states, user-id might not work well.

        e.g. location of a sign-in bar, the layout of a page.

        In such cases, the use of cookies can ensure consistency across sign-in and sign-out, but not across devices.

    2. User Visibility: For user-visible changes, we should always consider using a user-id or cookie as the unit of diversion. For changes that are not visible to users, such as latency changes, backend infrastructure changes, and ranking changes, we can consider event-based diversion.
    3. What you want to measure:
      • Learning effect: does the user adapt to the change? You might still want to use a user-id or cookie.
      • Latency: does the user use the site less? You might still choose a user-id or cookie even though the change is not visible; it depends entirely on the measurement you are trying to get.
  2. Ethical Considerations
    • If you use a user-id, it is personally identifiable, so there are security and confidentiality concerns to address, and you might need user consent. This is less of an issue for cookie-based diversion.
  3. Variability Considerations
    • The unit of analysis (the denominator of your metric) needs to be consistent with the unit of diversion. Otherwise, the actual variability might be very different from what was calculated analytically, because the analytic calculation makes assumptions about the distribution of the underlying data and about which observations are independent.
    • If you use event-based diversion, you assume each event is independent. But if you use user-id or cookie-based diversion, the independence assumption no longer holds, since you are diverting groups of events and those events are correlated (see the sketch below).
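
To see why the analytic estimate can understate the variability under cookie-based diversion, one can compare it against an empirical estimate that resamples whole cookies. The sketch below uses simulated data; the click propensities, traffic volumes, and bootstrap settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 2,000 cookies; each cookie contributes several correlated page views.
n_cookies = 2_000
views_per_cookie = rng.integers(1, 20, size=n_cookies)
click_propensity = rng.beta(2, 18, size=n_cookies)        # per-cookie click propensity
clicks_per_cookie = rng.binomial(views_per_cookie, click_propensity)

views, clicks = views_per_cookie.sum(), clicks_per_cookie.sum()
ctr = clicks / views

# Analytic SE: treats every page view as an independent Bernoulli trial.
se_analytic = np.sqrt(ctr * (1 - ctr) / views)

# Empirical SE: bootstrap whole cookies, preserving within-cookie correlation.
boot = []
for _ in range(1_000):
    idx = rng.integers(0, n_cookies, size=n_cookies)
    boot.append(clicks_per_cookie[idx].sum() / views_per_cookie[idx].sum())
se_empirical = np.std(boot)

# With cookie-based diversion the empirical SE is typically larger than the analytic one.
print(f"analytic SE = {se_analytic:.4f}, empirical SE = {se_empirical:.4f}")
```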

Choose “Population” (who is eligible)

  • Inter-user experiments: different users on A and B sides
  • Intra-user experiments: expose the same user to this feature on and off over time, and analyze how users behave in different time windows.
    • Notes:
      1. Need to choose a comparable time window.
      2. With many features, there can be frustration or learning problems: users learn to use a particular feature in the first two weeks, and then wonder where it went when you turn it off.
  • Ranked order lists: you can run interleaved experiments where you expose the same users to A and B at the same time.
  • Interleaved experiments: Suppose you have two ranking algorithms, X and Y. Algorithm X would show results X1, X2, … XN in that order, and algorithm Y would show Y1, Y2, … YN. An interleaved experiment would show some interleaving of those results, for example, X1, Y1, X2, Y2, … with duplicate results removed. One way to measure this would be to compare the click-through rate or click-through probability of the results from the two algorithms (a sketch follows below).
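
Below is a minimal sketch of such an interleaving, simply alternating results from the two rankings and dropping duplicates. The ranking lists are illustrative, and real systems often use more careful schemes (e.g. team-draft interleaving); this only shows the basic idea.

```python
def interleave(ranking_x, ranking_y):
    """Alternate results from two rankings, skipping duplicates."""
    seen, merged = set(), []
    for x, y in zip(ranking_x, ranking_y):
        for result in (x, y):
            if result not in seen:
                seen.add(result)
                merged.append(result)
    return merged

# Clicks can then be credited to the algorithm that ranked the clicked result higher.
print(interleave(["x1", "x2", "x3", "x4"], ["x2", "y1", "y2", "y3"]))
# ['x1', 'x2', 'y1', 'x3', 'y2', 'x4', 'y3']
```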

Target Population

You need to decide in advance which of your users you are targeting. There are some easy divisions to consider, such as browser, geo-location, country, language, etc.

  • High-profile launch: you may want to restrict the number of people who see it before the official launch to avoid press coverage.
  • If you want to run internationally, you need to check that the language is correct.
  • Avoid overlap between the various experiments.
  • Only run your experiment on the affected traffic. Filtering the traffic affects the variability as well: running the experiment on global traffic, with the unaffected population included, dilutes the measured change (a dilution sketch follows below).

Cases in which don’t choose particular traffic:

  • Want to test the effect across the global population as not sure if you can target correctly
  • 90% of the total traffic might be affected, and does not worth the trouble to find the specific target
  • Need to check with the engineering team to better understand the features. Concern for potential interactions so that we might want to run a global experiment.
  • Use the same filters for the target and untargeted of the experiments.
  • Before launching a big change, run a global experiment to make sure you do not have an unintentional effect on the traffic you were not targeting
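
A quick way to see the dilution effect: if only a fraction of the traffic is affected by the change, the effect measured on global traffic shrinks by roughly that fraction, and the required sample size grows accordingly. The numbers below (effect size, affected fraction) are illustrative assumptions.

```python
# Illustrative numbers: a +1 pp true effect, but only 20% of global traffic is affected.
true_effect = 0.01
affected_fraction = 0.20

# On global traffic the observed effect is diluted by the affected fraction.
diluted_effect = true_effect * affected_fraction
print(f"observed effect on global traffic ≈ {diluted_effect:.3f}")   # ≈ 0.002

# Required sample size scales roughly with 1 / effect^2,
# so diluting the effect 5x inflates the needed traffic by roughly 25x.
inflation = (true_effect / diluted_effect) ** 2
print(f"sample-size inflation ≈ {inflation:.0f}x")
```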

Population and Cohort

We should now define a Cohort

Cohort
a subset of the population consisting of users who enter the experiment at the same time

In practice:

  • Define an “entering class”, and only look at users entering both sides at the same time.

  • Can also use other information to define cohort — e.g. users who have been using your site consistently for 2 months, users with both laptop and mobile associated with their user ID, etc.

  • Typically, cohorts are used when looking for user stability (e.g. measuring learning effects, examining user retention, increasing user activity, or anything requiring users to be established), or when you want to observe how your change affects users’ behavior rather than their history.

  • One cannot use a cohort of students who started the course before the experiment begins, as they may have already completed the course or lesson.

    One cannot use a cohort from before the experiment starts as a control either, as there might be other system changes that affect the experience.

    Therefore, one must use cohorts of users who start the course after the experiment begins, and split them between the control and experiment groups.

While the basic definition of a cohort has the advantage of removing temporal bias within the group, we frequently prefer to study groups from different times. What can be very useful is to trigger cohort enrollment by some event, e.g. enrollment by registration or by first purchase. We then compare members of these groups by tracking their behavior as a function of time since enrollment. This has the added advantage of letting us keep enrolling people into the cohort until we reach our statistical-power goals.
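
Below is a minimal sketch of aligning an event-triggered cohort on time since enrollment rather than on calendar time, using pandas. The column names, enrollment dates, and event log are illustrative assumptions.

```python
import pandas as pd

# Illustrative event log: one row per user action, plus each user's enrollment date.
events = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 3],
    "arm":         ["control", "control", "experiment", "experiment", "experiment"],
    "enrolled_on": pd.to_datetime(["2023-01-02"] * 2 + ["2023-01-05"] * 2 + ["2023-01-09"]),
    "event_date":  pd.to_datetime(["2023-01-03", "2023-01-09",
                                   "2023-01-06", "2023-01-12", "2023-01-10"]),
})

# Align every user on time since their own enrollment, not on calendar date.
events["days_since_enrollment"] = (events["event_date"] - events["enrolled_on"]).dt.days

# Compare arms at the same relative age, e.g. activity in the first week after enrollment.
first_week = events[events["days_since_enrollment"] < 7]
print(first_week.groupby("arm")["user_id"].count())
```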

Size

  • How to reduce the size of an experiment (a sample-size sketch follows this list):
    1. Increase α, β, or d_min.
    2. Make the unit of diversion and the unit of analysis the same.
    3. Restrict the experiment to the specific traffic that is actually affected, excluding traffic that would dilute the effect.
  • There might be cases in which you don’t know which fraction of the population will be affected by the feature change, so be conservative about the time needed for the experiment. You can run a pilot experiment, or just observe the experiment for the first couple of weeks to check which fraction is affected.
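
The sketch below shows how α, β, and d_min drive the required sample size per arm, using the standard normal-approximation formula for comparing two proportions. The baseline rate and the particular parameter values are illustrative assumptions, not the course’s numbers.

```python
from scipy.stats import norm

def sample_size_per_arm(p_baseline, d_min, alpha=0.05, beta=0.20):
    """Approximate subjects per arm needed to detect an absolute change of d_min."""
    z_alpha = norm.ppf(1 - alpha / 2)     # two-sided test
    z_beta = norm.ppf(1 - beta)
    p_avg = p_baseline + d_min / 2
    pooled_var = 2 * p_avg * (1 - p_avg)
    split_var = p_baseline * (1 - p_baseline) + (p_baseline + d_min) * (1 - p_baseline - d_min)
    n = ((z_alpha * pooled_var ** 0.5 + z_beta * split_var ** 0.5) / d_min) ** 2
    return int(round(n))

# Relaxing alpha, beta, or d_min shrinks the experiment considerably.
print(sample_size_per_arm(0.10, 0.02))              # ≈ 3,800 per arm
print(sample_size_per_arm(0.10, 0.03, beta=0.25))   # ≈ 1,600 per arm
```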

Duration and Exposure

The duration of the experiment is related to the percentage of traffic you send to the experiment and control each day: the more traffic per day, the less time you need (a sketch of this trade-off follows the list below).

  • Safety considerations: e.g. with a new feature you may not be sure how users will react, so keep the site the same for most people and expose the change to only a small portion of them.
  • Run a small percentage of traffic every day, including weekdays and weekends, rather than everything on a single day (especially a holiday), to account for other sources of variability.
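
A minimal sketch of the traffic/duration trade-off: given a required sample size per arm (for example, from a calculation like the one in the Size section), the duration is set by how much traffic you divert each day. The daily traffic and exposure fractions below are illustrative assumptions.

```python
import math

def days_needed(required_per_arm, daily_traffic, exposure_fraction):
    """Days needed when only a fraction of daily traffic is diverted, split across two arms."""
    diverted_per_arm_per_day = daily_traffic * exposure_fraction / 2
    return math.ceil(required_per_arm / diverted_per_arm_per_day)

# With 40,000 eligible visits per day and 20,000 subjects needed per arm:
print(days_needed(20_000, 40_000, 0.10))   # 10 days at 10% exposure
print(days_needed(20_000, 40_000, 0.50))   # 2 days at 50% exposure (more traffic, less time)
```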

Learning Effect — Users adapt to the changes.

  • Choose the unit of diversion correctly to capture this, such as a user-id or cookie.
  • Learning is also about how often (dosage) users see the change. Use a cohort instead of the entire population, selected by how often users have been exposed to the change or how long they have seen it.
  • For a high-risk change, run it on a small portion of users over a longer period of time.
  • Pre-period and post-period: A/A tests before and after the experiment. Differences observed in the post-period can be attributed to the learning effect (see the sketch below).
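
Below is a minimal sketch of the pre-period/post-period idea: keep the same group assignment throughout, serve identical experiences in the pre- and post-periods, and attribute any remaining post-period gap to learning. The data are simulated, and the effect sizes (including the carried-over learning effect) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000   # users per arm, same assignment in all three periods

def mean_engagement(base, lift, n):
    """Mean of a simulated engagement metric with a given lift."""
    return rng.normal(base + lift, 1.0, n).mean()

base = 10.0
pre_gap  = mean_engagement(base, 0.0, n) - mean_engagement(base, 0.0, n)   # A/A before: ~0
exp_gap  = mean_engagement(base, 0.5, n) - mean_engagement(base, 0.0, n)   # treatment on
post_gap = mean_engagement(base, 0.2, n) - mean_engagement(base, 0.0, n)   # A/A after

print(f"pre-period gap:  {pre_gap:+.2f}  (should be near zero)")
print(f"experiment gap:  {exp_gap:+.2f}")
print(f"post-period gap: {post_gap:+.2f}  (attributable to learning)")
```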

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:
@online{bochman2023,
  author = {Bochman, Oren},
  title = {Lesson 4 - {Designing} an Experiment},
  date = {2023-01-01},
  url = {https://orenbochman.github.io/notes/AB-testing/l4.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2023. “Lesson 4 - Designing an Experiment.” January 1, 2023. https://orenbochman.github.io/notes/AB-testing/l4.html.