1. Follow The Checklist Manifesto
A great experiment seems so simple! It’s mostly common sense – we all probably learned some version of this in grade school science class. But we often end up in situations where we don’t have the data we want, we haven’t collected enough samples for analysis, the results are vague, and we’re scrambling. That’s why it’s worth taking the time to make a plan before starting an experiment.
The above is a checklist for experiment design. A checklist is a simple way to cover your bases, and checklists are proven to reduce mistakes – that’s the premise of The Checklist Manifesto. If you can answer all the questions in the checklist, you’ll be in good shape (even though things can and likely will change).
2. State the Research Question
Basically, be sure to gather data with a question in mind or an idea of what you want to accomplish, and be as specific as possible. Without that, you may find yourself with an important question to answer but without the right data, or with data of poor quality.
Example 1: Take something that seems as simple as measuring customer retention by join date. First, what defines a join date? What if a customer’s contract changes, or you upsell them? And what about different segments of the population (contract types, age/gender of the customer, etc.)? It seems straightforward, but when you don’t ask a specific question, you cause confusion and waste time – especially when several parties are involved. You’ll end up in a rabbit hole of problem definition.
Another common issue is the analytical rabbit hole – when you’re trying to figure out a stopping point. There’s always another way to cut the data, transform it, or model it. If you have a clear idea of what you want to accomplish, this should lend itself to a timeframe and a measure of success or failure.
Stating your research question up front will also force you to analyze any relevant existing data you have – it will force you to size the problem and the prize.
3. Identify the Population
What population are we trying to describe or analyze? This could be all users of your service, or it could just be a subset.
Example 1: We may want to target weekday and weekend users differently. Other commonly used subpopulations are gender, age, market, and referrer (browser type, device, etc.).
Example 2: Another consideration is misleading data, especially when aggregating over different subpopulations. A classic example is Simpson’s Paradox. In the table below, treatment A performs better within each subgroup, but when we aggregate over both subgroups, treatment B performs better. This is because there was bias in the traffic to each group – in other words, the traffic was not split evenly across groups. This is a real example comparing two treatments for large and small kidney stones: larger kidney stones were typically given the stronger treatment A, which introduced the bias.
The lesson here is really to make sure that subpopulations (if they exist) are separated or are evenly distributed across treatment groups.
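The reversal is easy to reproduce in a few lines. The sketch below uses the commonly cited figures from the kidney-stone study mentioned above (successes and trials per treatment and stone size); treat the exact numbers as illustrative.

```python
# Simpson's paradox with the commonly cited kidney-stone figures:
# treatment A wins within each subgroup, yet B wins on the aggregate,
# because the subgroups were not evenly split across treatments.
results = {
    ("A", "small"): (81, 87),    # (successes, trials)
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(treatment, group=None):
    """Success rate for a treatment, optionally within one subgroup."""
    pairs = [v for (t, g), v in results.items()
             if t == treatment and (group is None or g == group)]
    won = sum(s for s, _ in pairs)
    total = sum(n for _, n in pairs)
    return won / total

# A beats B within each subgroup...
assert rate("A", "small") > rate("B", "small")
assert rate("A", "large") > rate("B", "large")
# ...but B beats A overall, because A mostly saw the harder (large) cases.
assert rate("B") > rate("A")
```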
4. Define the study protocol
Study protocols are statistics 101, so it’s worthwhile to quickly review.
First, we have prospective vs. retrospective studies. In prospective studies, we’re collecting subjects and then looking at subsequent events. In retrospective studies, we’re studying past events.
Second, we have experimental vs. observational studies. In an experimental study, the researcher controls the assignment of subjects to treatment groups – these are typically prospective studies. Terms like single blind and double blind refer to randomized, experimental studies. One nice advantage of randomized assignment is that we’re less susceptible to bias (recall Simpson’s paradox): subpopulations (old vs. young, men vs. women, etc.) should be evenly split between treatment groups. Observational studies are ones in which, for whatever reason, assignment isn’t possible – for example, studies on race or sex (not groups subjects can be assigned to) where people are prospectively monitored for disease.
Lastly, we can split studies into longitudinal, where data is collected over an extended time period, and cross-sectional, where data is collected at a single point in time.
Given this, we can choose a study protocol based on money, time, and other constraints.
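To see why randomized assignment protects against the kind of imbalance behind Simpson’s paradox, here is a minimal simulation; the subgroup labels and sizes are illustrative.

```python
import random

# Minimal sketch: randomly assigning subjects tends to balance
# subpopulations across treatment arms, so neither arm is loaded
# with the harder cases the way the kidney-stone study was.
random.seed(0)
subjects = ["small_stone"] * 5000 + ["large_stone"] * 5000

groups = {"A": [], "B": []}
for s in subjects:
    groups[random.choice("AB")].append(s)

def frac_large(group):
    """Share of the harder (large-stone) cases in one arm."""
    return group.count("large_stone") / len(group)

# Both arms see roughly the same mix of the subpopulation.
assert abs(frac_large(groups["A"]) - frac_large(groups["B"])) < 0.05
```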
Example 1: Imagine we want to determine whether customer reviews are causal to or simply correlated with sales of a particular product. We could conduct a prospective, experimental study to control review numbers and examine subsequent impact on sales. The pain points are the time needed to run the experiment, the cost of sales lost and, depending on the sample size, the number of users affected. An alternative option would be a retrospective study to examine the correlation between number of reviews and views per sale.
Example 2: A/B tests are prospective, experimental and cross-sectional. In A/B tests, we use the system infrastructure to help choose between alternate features for the website. The idea is to randomly split traffic between two different variants – the A configuration and the B configuration – and examine which does better in terms of revenue, conversion rate, or whatever else we choose. Using this method, we can optimize our site and the business.
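In practice, the random split is often implemented by hashing a user identifier, so each user sees a consistent variant across visits while traffic still divides roughly 50/50. A minimal sketch, with an illustrative experiment name and split:

```python
import hashlib

# Deterministic A/B bucketing: hash the user id with a per-experiment
# salt, take a bucket in [0, 100), and map buckets to variants.
# The experiment name and 50/50 split here are illustrative.
def assign_variant(user_id: str, experiment: str = "checkout_test") -> str:
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "A" if bucket < 50 else "B"

# The same user always lands in the same variant.
assert assign_variant("user42") == assign_variant("user42")

# Across many users, the split is close to 50/50.
share_a = sum(assign_variant(f"user{i}") == "A" for i in range(10_000)) / 10_000
assert 0.45 < share_a < 0.55
```

Hashing with a per-experiment salt also keeps assignments independent across experiments, so one test’s split doesn’t correlate with another’s.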
5. Identify the Data to Collect
Next, identify the data to collect. Think about the data you want. Is it categorical, numerical, censored, ranked, etc.? This affects the downstream analysis method and probably the sample size. Understand your data sources and how they work. What are the sources? How is the data structured? How was it collected? What’s the quality of data you are measuring? In other words, know the nuances of the data.
Example 1: For data science studies, data is often collected from disparate sources, aggregated, and then analyzed. An obvious but common problem is making sure this process actually runs smoothly. Having databases that go stale or go out of sync is more common than anyone likes to admit.
A good tip is to figure out ways to profile or check your data quality. Generally, if something is off and you do a good job with this, you’ll see it.
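Such profiling can be as simple as checking null rates and value ranges on each fresh pull. A minimal sketch, with made-up field names and thresholds:

```python
# Minimal data-profiling sketch: flag feeds with too many nulls or
# implausible values before analysis. Field names and thresholds
# are illustrative.
def profile(rows, field, lo, hi, max_null_rate=0.05):
    """Return simple quality stats for one field and an overall flag."""
    values = [r.get(field) for r in rows]
    nulls = sum(v is None for v in values)
    present = [v for v in values if v is not None]
    out_of_range = sum(not (lo <= v <= hi) for v in present)
    report = {
        "rows": len(rows),
        "null_rate": nulls / len(rows),
        "out_of_range": out_of_range,
    }
    report["ok"] = report["null_rate"] <= max_null_rate and out_of_range == 0
    return report

rows = [{"age": 34}, {"age": 29}, {"age": None}, {"age": 212}]
report = profile(rows, "age", lo=0, hi=120)
assert report["null_rate"] == 0.25   # too many nulls
assert report["out_of_range"] == 1   # 212 is implausible
assert not report["ok"]
```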
Example 2: Know what the data means. Very typically, there are several different ways of measuring important variables; depending on the analysis, we may care more about different segments of the population.
The better you know your data, the more reliable your analysis will be. People who know the data won’t trust your analysis unless they feel confident that you know it as well.
6. Set the metric and method for measuring success
In most of our cases, this metric will be something like purchase rate or conversion rate. To whatever degree possible, make sure all variables other than the chosen metric(s) are controlled (or fixed); the other steps in the checklist help ensure this. Have an analysis method in mind so that you’re collecting the right data and the right amount of data (i.e. the required sample size). To understand sample size determination, we first need to understand how populations are compared – ideally, through statistical hypothesis testing: the process of using observed, collected data to choose between competing hypotheses.
Example 1: For A/B testing, we try to falsify the null hypothesis that there is no difference in conversion rate between the A and B variants. For this, we can use a two-proportion z-test to calculate statistical significance. Working backward from the desired significance level and power, we can then calculate the required sample size.
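Both calculations fit in a few lines using the normal approximation. A minimal sketch, with illustrative conversion rates and the standard alpha = 0.05 / power = 0.8 critical values:

```python
import math

# Two-proportion z-test plus the usual normal-approximation
# sample-size formula for an A/B test. Example rates are illustrative.

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: equal conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def sample_size_per_group(p1, p2, z_alpha=1.96, z_power=0.8416):
    """n per arm to detect p1 vs p2 at alpha=0.05 with 80% power."""
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# 10% vs 15% conversion on 1,000 users each is clearly significant...
assert two_proportion_z(100, 1000, 150, 1000) < 0.01
# ...but detecting a 10% -> 12% lift needs roughly 4,000 users per arm.
assert 3500 < sample_size_per_group(0.10, 0.12) < 4200
```

Note how quickly the required sample size grows as the lift you want to detect shrinks – this is why the metric and the minimum detectable effect should be fixed before the test starts.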
Example 2: Often in user testing, we ask users to rate alternate designs of the site or a feature on different characteristics. An appropriate analysis method here is conjoint analysis, a statistical technique used in market research to determine how people value the different features that make up a product or service. The objective is to determine which combination of a limited number of attributes is most influential on respondent choice or decision making. Respondents are shown a controlled set of potential products or services, and by analyzing their preferences between these products, the implicit valuations (utilities or part-worths) of the individual attributes can be determined. These part-worths can then be used to build market models that estimate market share, revenue, and even profitability of new designs.
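At its simplest, part-worths can be recovered by regressing ratings on dummy-coded attribute levels. A minimal sketch with two made-up binary attributes and fabricated ratings chosen so the fit is exact:

```python
import numpy as np

# Minimal conjoint sketch: recover part-worths from profile ratings via
# dummy-coded least squares. Attributes, levels, and ratings are made up.
# Columns: intercept, screen=large, price=low.
profiles = np.array([
    [1, 0, 0],  # small screen, high price
    [1, 0, 1],  # small screen, low price
    [1, 1, 0],  # large screen, high price
    [1, 1, 1],  # large screen, low price
])
ratings = np.array([5.0, 8.0, 7.0, 10.0])  # one respondent's scores

part_worths, *_ = np.linalg.lstsq(profiles, ratings, rcond=None)
base, screen_large, price_low = part_worths

# This respondent values a low price (+3) more than a large screen (+2),
# so price should dominate the design trade-off.
assert abs(screen_large - 2.0) < 1e-6
assert abs(price_low - 3.0) < 1e-6
```

Real conjoint studies use fractional factorial designs and many respondents, but the estimation idea is the same.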
As you can see, there are many different methods for analysis. The right one is highly dependent on the data you are collecting and the study design. If you are unsure of the right method, consult a statistician or data scientist.
And while sample size calculation is very important for ensuring we run a study long enough, we can often still draw inferences even if we can’t hit a target. With A/B testing, we’re usually in a world of plenty in terms of samples. But in most medical studies, for example, cost and resources make large samples impossible – and reviewers understand that. You’ll see studies published in good journals with as few as 10-15 samples per treatment group. In the absence of the ideal, there are rules of thumb for minimum sample sizes; we can draw inferences from the samples we have and use them as a jumping-off point. To summarize: don’t “run to significance” – peeking at the data and stopping the moment a test looks significant inflates the false-positive rate – but don’t run forever either.
7. Review and Peer Review
After this is done, you’ll want to present your results – and a big part of presentation is making people feel comfortable that you’ve covered your bases. There’s no secret here: be your own harshest critic, clarify your assumptions, and check your work. Things will change as you go along, but the more you have planned upfront, the better off you will be. And don’t forget to have your study peer reviewed. Others may point out things you haven’t thought of, or data you didn’t know was available.
As you execute the study, documentation and publication are critical. You will likely have to make assumptions you didn’t anticipate – keep a log to track them! Without documented assumptions, you and others might misinterpret or misuse the results. Also keep track of unexpected findings. These may seem like a problem, but they’re an important source of future questions to answer, and they may be relevant to other parties.
8. Avoid The Pitfalls of Experiment Design
Lastly, for any study, you’ll get questions about reliability. These are the pitfalls of experiment design.
- Confounders: These are cases where you haven’t accounted for a correlated factor that drives the true relationship. For example, a study of people with lung cancer may show a correlation with drinking simply because many smokers also drink.
- Correlation vs. causation: Causation can be hard to show with retrospective studies, as in the case of the relationship between consumer reviews and sales previously discussed.
- Dependence between samples: If you assume independence and it isn’t true, this affects the power of your study. A common example is treating repeated measurements of the same subject over time as independent rather than correlated.
- Selection bias: Individuals or groups selected as a sample may not be representative of the population studied.
- Detection bias: This occurs when a phenomenon is more likely to be observed for a particular set of study subjects. For instance, doctors may be more likely to look for diabetes in obese patients than in thinner patients, leading to an inflation in diabetes among obese patients because of skewed detection efforts.
- Reporting bias: This involves a skew in the availability of data, such that observations of a certain kind are more likely to be reported.
- Exclusion bias: This arises from the systematic exclusion of certain individuals from the study.
- And many more!
9. In Action: The Checklist for an A/B Test
Pop Quiz! Below is the checklist in action for an A/B test. Did you guess the right answers?