If you work in growth, you will probably be involved in A/B or multivariate testing. I’ve run my fair share and have learned some lessons on what to do and what not to do. In this post, I will share a step-by-step process of doing a multivariate test using the bandit R package and a Bayesian approach.
This post is based on a collection of ideas and methods from other blogs, and therefore, is not entirely original. Nevertheless, my goal is to present an easy-to-follow guide for A/B/C/D testing with tools freely available to anyone.
It’s only fair to begin the post by acknowledging the blogs and sources that have been used for this post.
Bayes Theorem Explained with Legos
This is an introduction to Bayesian theory explained with Legos. It is intuitive and explains Bayes’ theorem well. I use the Bayesian method to evaluate test results for several reasons:
- In some cases, it saves you time or money as you can get results faster
- You can actually tell what you stand to lose by picking one variation over another
If you are using a frequentist method when doing A/B testing, you have to read Evan Miller’s blog. He has very useful ideas regarding A/B testing and Bayesian A/B testing. He also has code for declaring the winner in a Bayesian A/B test, but we will be using different code.
The blog, the presentation, and the paper he has written are all great. They explain the methodology and address some issues with Bayesian testing. Very insightful.
It’s not really their blog but their great product that gave me the idea of using the bandit algorithm for our testing. It is a great tool for A/B testing app store pages on mobile, and I recommend that everyone give it a shot. They are a great group of people to work with too (I’m a client).
This is the package we will be using for our Bayesian calculations. You can get into fancy models with MCMC and JAGS in R to calculate the results, but the point of this post is to enable anyone who does not know anything about Bayesian A/B testing to give it a shot. To preempt some comments: this tool is not a panacea, and in some cases, it is not 100% correct. However, it’s better than nothing, and in my experience, it’s better and cheaper than using a frequentist method.
Requirements & Assumptions
Okay, now that I have given proper acknowledgement, let’s move on to setting up a system that does A/B/C/D testing for us. You will need:
- 2 cups of testing data
- 1 tablespoon of R code
- 1 bottle of beer
- a Tableau instance (good to have)
Here are some of the assumptions I’ll be making for this test:
- We are running an A/B/C/D test on a mobile User Acquisition creative.
- The data is magically being tracked by our BI system and finds its way through to our database. I won’t discuss how this is happening, but all results are available to us.
- There is no significant time-based influence on our data (no seasonality of any kind).
- The data is obviously made up.
- We don’t need to worry about traffic, bidding type, bidding algorithm, or other things that could be affecting our campaigns.
- The data updates every 24 hours and looks like this:
The most common criticism of Bayesian methods is that the choice of prior adds a subjective bias to the analysis.
We are testing creatives that receive thousands of impressions or clicks. Therefore, it makes little practical difference which prior we use, because we can reasonably expect the evidence to overwhelm it. Please keep in mind that if your sample size is small, it’s important to put some thought into how you choose the prior. In our case, we will simply use a beta distribution with a=b=1. This is called a flat prior and is used when someone has little knowledge about the data and wants the prior to have the least effect on the outcome of the analysis.
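To make the point about priors concrete, here is a minimal base R sketch. With a Beta(a, b) prior and x conversions out of n exposures, the posterior is Beta(a + x, b + n - x). The counts and the second, more opinionated prior (a=2, b=50) are made-up values chosen only for illustration.

```r
# Posterior mean of a conversion rate under a Beta(a, b) prior,
# after observing x conversions out of n exposures.
post_mean <- function(x, n, a, b) (a + x) / (a + b + n)

# Hypothetical large sample: the two priors give nearly the same answer.
large_flat <- post_mean(150, 10000, a = 1, b = 1)
large_info <- post_mean(150, 10000, a = 2, b = 50)

# Hypothetical small sample: the choice of prior matters a lot.
small_flat <- post_mean(3, 20, a = 1, b = 1)
small_info <- post_mean(3, 20, a = 2, b = 50)

round(c(large_flat = large_flat, large_info = large_info,
        small_flat = small_flat, small_info = small_info), 4)
```

With thousands of observations the two posterior means are practically identical, while with 20 observations they differ substantially, which is why the flat prior is a reasonable default here.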
Step 1 – Load R
Launch an R instance.
For our purposes, you will need to install and run the following packages:
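A minimal sketch of the setup, assuming the package list below. The list is inferred from the functions used later in this post (bandit for the calculations, ggplot2 for plots, reshape2 for `melt`, scales for `percent`), so adjust it if your setup differs.

```r
# Packages assumed from the functions used later in this post.
pkgs <- c("bandit", "ggplot2", "reshape2", "scales")

# Check which ones are not installed yet, without attaching anything.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  # Print the install command rather than installing silently.
  message("Run first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  # Everything is available: attach the packages.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```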
Step 2 – Load Your Data
After we have installed and loaded the above libraries, it’s time to load our data. I’m not sure what sort of database you have, but if you are using Redshift, you can use an R package called RJDBC to create a connection to your database and pull your data via a query.
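A rough sketch of what such a connection could look like with RJDBC. Everything specific here is an assumption: the driver class name depends on the JDBC driver version you downloaded, and the jar path, URL, credentials, table, and column names are placeholders you would replace with your own.

```r
# Sketch only: pulls per-variation counts over JDBC.
# The driver class default, table name, and column names are assumptions.
fetch_test_data <- function(jar_path, url, user, password,
                            driver_class = "com.amazon.redshift.jdbc42.Driver",
                            sql = "SELECT variation, conversions, exposures FROM creative_test_daily") {
  drv  <- RJDBC::JDBC(driverClass = driver_class, classPath = jar_path)
  conn <- DBI::dbConnect(drv, url, user, password)
  on.exit(DBI::dbDisconnect(conn), add = TRUE)  # always close the connection
  DBI::dbGetQuery(conn, sql)
}

# Hypothetical usage (not run):
# df <- fetch_test_data("path/to/redshift-jdbc.jar",
#                       "jdbc:redshift://your-host:5439/your_db",
#                       "your_user", "your_password")
```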
We are evaluating conversion results, so there are two types of data groups: the population exposed to each variation (Group N) and the users that have converted (Group X). The results need to be manually imported to R:
- X: The number of users who clicked/converted/installed during the test, per variation.
- N: The total number of users who were exposed to the test, per variation.
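As a sketch, the two groups can be entered as plain numeric vectors with one element per variation. The counts below are invented placeholders, not the data from this test.

```r
# Hypothetical counts for variations 1 to 4 (placeholders, not real results).
x <- c(120, 85, 140, 110)      # X: conversions per variation
n <- c(5000, 5000, 5000, 5000) # N: users exposed per variation

# Basic sanity checks before any analysis.
stopifnot(length(x) == length(n), all(x >= 0), all(x <= n))

# Quick overview of the observed conversion rates.
overview <- data.frame(variation = seq_along(x), x = x, n = n, rate = x / n)
overview
```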
Step 3 – Getting results
Now, let’s use the bandit package to calculate the winners:
The table above gives us the Bayesian probability that each of the variations might be a winner. So for variation 1, the probability of being a winner is 35.04%. For variation 2, the probability is 31.23%, etc.
At this point, we don’t have enough information to make a call. We have to wait a bit longer and get more data.
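For readers who want to see what sits behind those probabilities, here is a base R sketch of the calculation under the flat Beta(1, 1) prior: draw from each variation’s posterior and count how often each one comes out on top. The counts are made-up placeholders. In the bandit package, `prob_winner(sim_post(x, n, ndraws = 100000))` should give the equivalent result, but check the package documentation for your version.

```r
set.seed(42)

# Hypothetical counts (placeholders, not the data from this post).
x <- c(120, 85, 140, 110)
n <- c(5000, 5000, 5000, 5000)
ndraws <- 100000

# One column of posterior draws per variation: Beta(1 + x, 1 + n - x).
post <- sapply(seq_along(x), function(i) rbeta(ndraws, x[i] + 1, n[i] - x[i] + 1))

# For each draw, find which variation has the highest conversion rate,
# then turn the counts into win probabilities.
winner <- max.col(post, ties.method = "first")
p_win  <- tabulate(winner, nbins = length(x)) / ndraws

round(p_win, 4)
```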
Step 4 – Repeat Step 2 & 3
After a day, we will have the following data:
Let’s rerun the same code and see the results:
You’ll see here that Variation 2 has almost 0 probability of being a winner. Let’s look at the data in a little bit more detail:
In this step, we calculate the following for each variation, given the posterior results:
- The Bayesian posterior probability that the variation is the best one (p_best)
- The confidence interval on the estimated amount by which the variation outperforms the next-lower alternative (lower & upper), along with the significance of that comparison
A couple of good things are happening here with Variation 2:
- p_best is almost negligible
- Moreover, the lower and upper bounds for the variation ranked 3rd (Variation 4 in our example) are positive and the significance is at 1%.
This is pretty important as our decision to pause a variation depends on:
- Value Remaining is almost 0. The “value remaining” in an experiment is the amount of increased conversion rate you could get by switching away from the winning variation.
So, we can go ahead and stop Variation 2.
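The lower, upper, and significance figures discussed above come from comparing each variation with the next-lower one. The bandit package provides `significance_analysis(x, n)` for this. The base R sketch below only approximates that idea with `prop.test` on adjacent ranks, using made-up counts, so its numbers will not necessarily match the package output.

```r
# Hypothetical counts (placeholders, not the data from this post).
x <- c(120, 85, 140, 110)
n <- c(5000, 5000, 5000, 5000)

rate <- x / n
ord  <- order(rate, decreasing = TRUE)  # variations from best to worst

adjacent <- data.frame(variation = ord, rate = rate[ord],
                       lower = NA_real_, upper = NA_real_, p_value = NA_real_)

# Compare each variation with the one ranked directly below it.
for (i in seq_len(length(ord) - 1)) {
  pair <- ord[c(i, i + 1)]
  pt <- prop.test(x = x[pair], n = n[pair])
  adjacent$lower[i]   <- pt$conf.int[1]  # CI on the difference in rates
  adjacent$upper[i]   <- pt$conf.int[2]
  adjacent$p_value[i] <- pt$p.value
}

adjacent
```

A positive lower bound means the higher-ranked variation is credibly ahead of the one below it. The last row has no comparison partner, so its values stay NA.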
Let’s run the experiment one more day:
So, Variation 2 had no chance of improving since we paused it. Variation 4 and 1 are looking bad too. Let’s see if we can make a call.
Step 5 – Visualization
It would be more helpful if we could graph the different variations so that we can see the overlap. You can do this with the following code:
colnames(pb2) <- c("Var 1", "Var 2", "Var 3", "Var 4")
You need to name each column according to the variant. This step is important as the rest of the visualization won’t work without named columns of the simulated posterior result.
ggplot(melt_pb2, aes(x = value, fill = Var2)) + geom_density(colour = "black", size = 1, alpha = 0.40) + scale_x_continuous("Conversion %", labels = percent) + scale_y_continuous("Density") + scale_fill_discrete("Variations") + geom_hline(yintercept = 0, size = 1, color = "black")
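Here is a self-contained sketch of the whole visualization step, from posterior draws to a long-format data frame ready for plotting. It uses made-up counts, and it builds the long table in base R instead of `melt`, keeping the same `Var2` and `value` column names that the plotting code expects. The plot itself is only built if ggplot2 is installed.

```r
set.seed(42)

# Hypothetical counts (placeholders, not the data from this post).
x2 <- c(240, 150, 290, 215)
n2 <- c(10000, 10000, 10000, 10000)
ndraws <- 20000

# Simulated posterior: one column per variation, flat Beta(1, 1) prior.
pb2 <- sapply(seq_along(x2), function(i) rbeta(ndraws, x2[i] + 1, n2[i] - x2[i] + 1))
colnames(pb2) <- paste("Var", seq_along(x2))

# Long format: one row per (variation, draw), like melt() on a matrix.
melt_pb2 <- data.frame(Var2  = rep(colnames(pb2), each = nrow(pb2)),
                       value = as.vector(pb2))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(melt_pb2, ggplot2::aes(x = value, fill = Var2)) +
    ggplot2::geom_density(alpha = 0.4) +
    ggplot2::labs(x = "Conversion rate", y = "Density", fill = "Variations")
  # print(p) draws the overlapping posterior densities
}
```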
Things to Note.
- The more impressions, the narrower and more sharply peaked the posterior distribution, and essentially, the more precise our estimate will be.
- Var 2 (the one we paused) vs. Var 3 (winning one) have essentially no overlap.
- Var 3 overlaps with Var 1 and Var 4. This means that although Var 3 is the most likely winner, there is still a chance that one of the other two variations would actually do better.
Step 6 – Calculate The Remaining Value
At this point, it would be a good idea to calculate the remaining value.
The remaining value is the distribution of the improvement amounts that another arm might have over the current best arm. Bandit compares the best arm against the second best. Essentially, it tells us the amount of improvement we might forego by selecting the winning variation. This is a very important element in the Bayesian methodology. With this method, we know how much we stand to lose if we select an alternative.
value_rem=value_remaining(x2, n2, alpha = 1, beta = 1, ndraws = 100000)
Let’s create a graph for the remaining value as well:
ggplot(data.frame(value = value_rem), aes(x = value)) + geom_density(colour = "black", fill = "red", size = 1, alpha = 0.4) + scale_x_continuous("Conversion %", labels = percent, limits = c(0, 0.03))
Although not ideal, we can see that there is a little bit of remaining value left. This means that the worst-case scenario, if Var 3 turned out to be the wrong pick, would be forgoing a gain of 0.01%. This is a pretty negligible gain, and at this point, we can either call Var 3 the winner or wait one more day.
Ideally, you want to call a winner when the remaining value is 0. If you are running out of money, time or both, then you have to evaluate the remaining value. If it’s small enough, then you can pick the winner and move on.
In our case, the difference is negligible and I will pick Var 3 as the winner.
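To show how the remaining value and a stopping decision fit together, here is a base R sketch with made-up counts. For each posterior draw it measures how far the best arm in that draw is ahead of the overall leading arm, relative to the leader. The 95th percentile summary and the 1% threshold are illustrative assumptions of mine, not values taken from this test or from the package.

```r
set.seed(42)

# Hypothetical counts (placeholders, not the data from this post).
x2 <- c(240, 150, 290, 215)
n2 <- c(10000, 10000, 10000, 10000)
ndraws <- 100000

# Posterior draws under the flat Beta(1, 1) prior, one column per variation.
post   <- sapply(seq_along(x2), function(i) rbeta(ndraws, x2[i] + 1, n2[i] - x2[i] + 1))
winner <- max.col(post, ties.method = "first")
best   <- which.max(tabulate(winner, nbins = ncol(post)))  # overall leader

# Relative improvement of the per-draw best arm over the overall leader.
row_max   <- post[cbind(seq_len(ndraws), winner)]
value_rem <- (row_max - post[, best]) / post[, best]

# Illustrative stopping rule (assumed threshold).
potential <- unname(quantile(value_rem, 0.95))
threshold <- 0.01
stop_now  <- potential < threshold

c(best = best, potential = potential, stop_now = stop_now)
```

Whenever the leader is also the best arm in a draw, the remaining value for that draw is exactly 0, so a distribution piled up at 0 is the signal that little is left to gain.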
Great things about this approach:
- You know how much you stand to lose
- Peeking is less of a concern (although not eliminated)
- You could get results faster
- You can answer the “which one is better” question in a more understandable way.