When and when not to A/B test

Split test vs. multi-armed bandit: simulation, source code and ready-to-use app

Photo by Daniil Vnoutchkov on Unsplash

This article discusses the differences of A/B or Split tests and multi-armed bandits for variant testing. It can be used to e.g. choose the best performing set of ads for a digital marketing campaign. Simulating and measuring the performance of both methods in various experiments shows that in most scenarios the Bandit dominates the Split and should be preferred. In addition to the simulation results, all of the required source code is provided plus a simple web app for variant optimization using the Bandit method.

App: optimizer.stagelink.com
Code: github.com/kinosal/optimizer

Story Outline

  1. What is an A/B test and what’s the problem?
  2. The alternative: Multi-armed bandits
  3. Experiment: Split vs. Bandit
  4. Parameters + Conclusion

So let’s get going:

What is an A/B test?

Photo by Robert Anasch on Unsplash

A/B, A/B/n or Split Tests – which is how we will call them – are used to determine which of several alternative scenarios or variants is more likely to generate a desired – successful – outcome or response by testing, measuring and comparing the behavior, i.e. the success rates, in each scenario. There are various use cases for such tests. This article will focus on maximising conversion rates, e.g. from users seeing an online advertisement to clicking on such ad or users completing a purchase after visiting a website.

The outcome or result of a Split Test can be described with a NxM matrix where N rows contain the number of successes and failures = trials – successes for all tested scenarios. In an example where Ad (scenario or variant) 1 has been shown to 1,000 people (trials), 10 of which clicked on the ad (successes), and Ad 2 has been shown to different 1,000 people of which 20 clicked on the ad, this matrix, which is also called a contingency table, looks like this:

Now the objective is to determine whether one of the two ads’ (scenarios’) success rates (likelihood to click) is “significantly” higher than of the other ad (scenario). This would mean there was a high likelihood that the samples drawn in all scenarios do not originate in the same population (or a population with the same – in our case binomial – distribution), which is our null-hypothesis H0 of the success rate being independent from the observed scenarios that we are seeking to invalidate or disprove. This is important so we can eventually conclude that the differences in the scenarios are in fact responsible for the different success rates, and not only the randomness within these scenarios. Statistical significance or confidence level can be explained as the probability of independence = the probability of wrongly rejecting H0 (a type I error) being lower than a predefined p-value. There are several methods to test for significance, we will be focussing on the (Pearson’s) chi-squared test without (Yates’) correction. The p-value can be derived from the observed data displayed in a contingency table via the chi-squared statistic which is defined as

where Oi,j are the observed successes and failures while Ei,j are their respective expected values if the underlying distribution was the same, i.e. 15 successes and 985 failures in the above example. The statistic thus indicates the dissimilarity of the two scenarios, if there is no difference in the observed scenarios chi-square equals zero. The p-value can now be calculated as the right-tailed area under the curve (the integral) of the chi-squared distribution with the proper degrees of freedom (the number of scenarios minus one in our case of two possible outcomes) from the value of the chi-square statistic (more about this later when conducting the experiment). If the p-value is lower than previously defined (as in the desired confidence or significance level) the null-hypothesis can be rejected and the outcomes deemed dependent on the observed scenarios. In the above example chi-squared equals 3.3841 and p is 0.0658 (which might not be low enough to disprove the notion of independence).

So what’s the problem?

As we have seen we need to achieve some level of confidence in our desired outcome (successes or clicks in the above example) to depend on a provided scenario. Only then we can “safely” disregard the inferior scenario or variant – e.g. ad – and thus maximize positive results.

The problem here is one of exploration and exploitation.

In order to exploit the situation by moving completely to the best scenario (e.g. show only that ad), we need to invest in exploring the situation until we are confident enough. This can be expensive since we might be paying to deliver an ad impression and/or because of the opportunity cost induced by not maximizing our return due to a “bad” choice of scenario(s). So, until we reach the desired confidence in our decision to move entirely to the designated superior scenario, we are purely paying to explore; after we have reached this point, we can fully exploit that situation (given the scenarios do not change over time). This doesn’t sound like an optimal solution, does it?

What if we could more gradually and earlier move towards the scenarios we believe might turn out to be superior and thus reduce exploration costs and increase exploitation returns, especially in the short run?

I’m glad you asked.

The alternative: Multi-armed bandits

Before going into detail, let’s quickly talk about the battle between frequentist and bayesian approaches to statistics:

Source: https://www.xkcd.com/1132, also compare the heated debate about this comic at https://stats.stackexchange.com/questions/43339/whats-wrong-with-xkcds-frequentists-vs-bayesians-comic

Until now we were living a frequentist life. However, since we don’t want to be supernova-ed, this is going to change. Bayesians update their existing beliefs about (the probability of events in) the (stochastic) world (from prior to posterior) by obtaining additional information conditionally related to those beliefs:

where A and B are events, P(A) and P(B) are the (marginal) probabilities of observing those events (where P(B) ≠ 0), P(A|B) and P(B|A) are conditional probabilities (e.g. the likelihood of A given that B is true).

There are many alternatives to “classical” frequentist A/B tests, this article will highlight a specific Bayesian approach that can be standardized and automated similar to a split test: the multi-armed bandit,

“a smarter version of an A/B test”

according to Kai Rikhye from InVision.

A brief introduction to multi-armed bandits

Photo by Steve Sawusch on Unsplash

Remember we want to more gradually and earlier move towards the scenarios we believe might turn out to be superior and thus reduce exploration costs and increase exploitation returns – a classic tradeoff. It is like us wondering which slot machine (one-armed bandit) to play in a casino; if we knew the success rate of each machine, we would only ever invest money into the machine with the highest winning probability and completely disregard all other options. However, if we do not know this, we need to explore the options in order to estimate all machines’ success rates while maximizing the overall (cumulative) return we can expect from playing. This optimization problem can be solved by testing different options and reinforcing them when a reward could be observed (hence a problem of Reinforcement Learning). Yet, finding the exact solution is hard. Luckily there are approximations sufficient for our use cases. In this article we will focus on a method called “Thompson sampling”, introduced by William R. Thompson back in 1933 and based on randomly drawing samples from each option’s beta distribution parameterized by its past successes α and failures β with the density function

where θ equals the success rate of the respective option and Γ denotes the gamma function. Important properties of the beta distribution are that the mean equals α / (α + β) (success rate = successes / trials) and that the distribution becomes more concentrated as α + β (the number of trials) grows. Thus the option with the highest drawn sample in a given drawing round which will be chosen for the next trial likely has a high success rate with high confidence and can be exploited or it has any success rate with low confidence and still needs to be explored.

Probability Density Functions, Source: https://en.wikipedia.org/wiki/Beta_distribution

Having established some theory, we can now move on to practically simulating the behavior of Split tests and multi-armed bandits when testing the performance of different variants.

Experiment: Split vs. Bandit


We are going to test two different methods (the chi-squared split test – “Split” hereafter – and the Thompson beta bandit – “Bandit”) with the objective to maximize the cumulated successes (e.g. ad clicks) by sequentially (over several periods, e.g. days) choosing trials (e.g. presenting ads to users) from multiple options (e.g. different ad creatives). Based on the above described different nature of these algorithms we expect different results depending on different environments controlled by the following variables: number of options, true success rates for these options, trials per period, confidence level (maximum p-value) and success rate instability (standard deviation). We are going to start with a simple environment with 2 options and fixed success rates (no deviation) before exploring alternative scenarios.

Basic Simulation

The basic experiment can even be simulated in a spreadsheet:

In this sheet the environment parameters are set in cells A1:I3. Based on these, the decisions and outcomes for the Split and the first 7 periods are calculated in cells A5:P13. As discussed before all trials per period will be distributed evenly over the options until a winner can be declared with statistical significance. For a scenario with two options, true success rates of 1% and 1.5% with no deviation, 2,000 trials per period and a maximum p-value of 0.1 (90% confidence) this method detects the higher success rate for option 2 after period 6 and only trials this option from then on. Columns L:P show the total, cumulated successes as well as the absolute and relative difference (the loss) from the optimal situation where always the best option has been chosen.

Cells A15:P23 analogously calculate these values for the Bandit. Instead of drawing one sample and choosing one option at a time, for comparability and practicability this method is also implemented with periodic decisions where the percentage of an option shown in period t equals the number of times it has been chosen by 100 random samples of the beta distribution constructed based on the data from periods 1 until t-1 (every individual sample and choice for all periods can be seen in cells A25:R126). With the sample environment variables described above we should find more cumulative successes and thus a smaller loss compared to the optimal choices for the Bandit then the Split. This can also be visualized by plotting the absolute and relative successes respectively over all periods:

Here it becomes obvious that – at least in this basic simulation – the Bandit dominates the Split over all periods. So let us examine the elasticity of this outcome by experimenting with alterations of the different parameters. You can do this sequentially in the provided spreadsheet, but since this and the future mutations we want to apply are much easier to simulate coded in Python, we will turn to that now.

The Split class is initiated with the number of options available, trials, successes and failures vectors as Numpy arrays as well as the (current) p-value:

class Split():
  def __init__(self, num_options):
    self.num_options = num_options
    self.trials = np.zeros(shape=(self.num_options,), dtype=int)
    self.successes = np.zeros(shape=(self.num_options,), dtype=int)
    self.failures = np.zeros(shape=(self.num_options,), dtype=int)
    self.p_value = 1.0

Then we need a method to add results (trials, successes and failures) for one period to a Split instance:

def add_results(self, option_id, trials, successes):
  self.trials[option_id] = self.trials[option_id] + trials
  self.successes[option_id] = self.successes[option_id] + successes
  self.failures[option_id] = \
    self.failures[option_id] + trials - successes

And lastly, we will recalculate the p-value for the current state:

def calculate_p_value(self):
  observations = []
  for i in range(self.num_options):
    observations.append([self.successes[i], self.failures[i]])
  self.p_value = \
    chi2_contingency(observed=observations, correction=False)[1]

The chi2_contingency function can be found in the stats module of the SciPy library and calculates the p-value from a given contingency table using the same process as described above for the spreadsheet.

The Bandit is initialized similarly, only that we do not require a p-value or a failures vector, but includes a prior tuple which could be used to create a bandit with existing prior information to construct the Beta distribution described above:

class BetaBandit():
  def __init__(self, num_options):
    self.num_options = num_options
    self.prior = (1.0, 1.0)
    self.trials = np.zeros(shape=(self.num_options,), dtype=float)
    self.successes = \
      np.zeros(shape=(self.num_options,), dtype=float)

Results can be added in the same way as for the Split (only without the failures).

Now we can choose one option as the highest sampled value from the underlying beta distributions of all options:

def choose_option(self):
  sampled_theta = []
  for i in range(self.num_options):
    dist = beta(self.prior[0] + self.successes[i],
                self.prior[1] + self.trials[i] - self.successes[i])
    sampled_theta += [dist.rvs()]
  return sampled_theta.index(max(sampled_theta))

The beta function is also part of the stats module of the SciPy library, rvs() returns a random variable based on the underlying distribution.

For batch simulation we need to repeat this choice of options:

def repeat_choice(self, repetitions):
  option_counts = np.zeros(shape=(self.num_options,), dtype=int)
  for _ in range(repetitions):
    option = self.choose_option()
    option_counts[option] += 1
  return option_counts

The basic simulation then runs adequately to the above spreadsheet example:

  1. Initialize Split and Bandit
  2. Add results for first period with equal distribution of trials over all options
  3. Recalculate p-value for Split
  4. If p-value < max p-value: Add results to Split with equal distribution of trials over all options, else: Add results to Split with all trials for best option (with most successes)
  5. Choose batch options for Bandit and add results with distribution of trials as in choices
  6. Start again from 3.

The results are (almost, accounting for the randomness of the beta sampling) the same as already seen above:

Before we continue with the elasticity experiments, let us examine 28 instead of 7 periods:

The long term differences between Split and Bandit are smaller since the Split finds the optimal option after the 6th period and after that allocates all trials to that option where the Bandit still occasionally tries the weaker option (although very sparsely). Still, also after 28 periods the Bandit has outperformed the Split.

Okay, you want to know if that changes after 100 periods? It does not:

After this basic experiment, we want to take a final closer look at how the various simulation parameters affect the results before coming to a conclusion about when and when not to A/B test.

Parameter Elasticity


We want to analyze the effect of some of the environment parameters on the experiment results, starting with different p-values for the Split (going back to a default observation length of 28 periods):

We can see that the results get better (closer to the optimum) with higher p-values. We have to be careful though to jump to a conclusion using higher p-values for our models since this will lead to more false positives (Type I errors) and thus increase the risk of choosing a sub-optimal option as the “winner”. For the remaining experiments we will set the p-value to 0.1 since we are okay with wrongly rejecting the null-hypothesis of no significant difference between the variants 10% of all times.


A look at different trial numbers for the Split (left) and the Bandit (right):

As we could expect, more trials lead to higher confidence in the Split test results and thus to finding the optimal option sooner. For both methods we can observe a seemingly asymptotic behavior (the marginal improvement by more trials decreases).

Success Rates

While we are able to influence the p-value and potentially also the number of trials when trying to find the best options to achieve our objectives in the real world, the true success rates are usually independent from our actions (not getting into Schroedinger here…). Thus the difference in the two methods will be even more important here (again Split on the left, bandit on the right):

Again, the Split is being dominated by the Bandit and even more so if the differences in true success rates of the available options is small. In the [0.01, 0.012] case, the Split does not even find the optimal option after 27 periods.


When the true success rates are not completely invariant (which they most likely aren’t in a “real world” environment), but normally distributed with a standard deviation of 50% of their mean, the results change quite drastically:

These results obviously depend on the induced random state, with a different state they might look very different:

Generally the dominance of the Bandit is less apparent in a more probabilistic environment.

Let us now compare the results for both methods under different uncertainty levels (success rate standard deviations):

More Options

Last but not least, we want to take a look at scenarios with more than two competing options. With 11 different options with equally distributed true success rates between 1% and 2% Split and Bandit results relative to optimal success look as follows (no deviation in success rates on the left, 25% standard deviation from mean on the right):

Both methods do worse when there are more options. Still, until the Split has found the optimal option, the Bandit outperforms it by a substantial margin. However, when the experiment runs long enough, the end result looks quite different:

While the Split beats the Bandit in a deterministic environment, the Bandit seems to suffer less from the introduced uncertainty even in the long run.

A real world example

The scenario here is a digital advertising campaign with 50 different ad combinations (options) with randomly (uniform) distributed click rates (true success rates) between 1% and 4%. The click rates’ standard deviation is 50% of their respective mean. There will be 5,000 trials over 28 periods each. We keep the p-value for the Split at 0.1. Here are the results:

From 8,904 possible successes the Split yields 4,346, the Bandit 3,968. The bad news is that both methods leave more than 50% of the potential on the table compared to a scenario where the true success rates were known for every period. The good news is that both methods lift the base rate (3,137 successes) by at least 26%.


To answer the initial question “when and when not to use A/B test”: especially when exploration is costly and short-term exploitation important and/or when dealing with an uncertain, i.e. probabilistic, environment, the Bandit provides a very good, mostly superior alternative in finding and using the best options and thus maximising efficiency. However, when the Split (the A/B test) is confident in having identified the best option(s) it might yield better results.

An obvious conclusion and start for further research/testing could therefore be to combine both methods in a hybrid model, using the Bandit until the Split confidence is large enough (the p-value is smaller than the defined maximum), then leveraging the Split to reduce the option set to fewer “winners” with the highest success rates and after that using the Bandit again to choose the option(s) for the next periods until the Split’s p-value is again small enough and so on…

There is one more caveat we have not yet covered though: the options’ success rates and their distributions respectively might change over time. If this is the case, we might not want to reduce the possible option set at all, but flexibly adjust our decision process with any new observed results. Since the foundation of the Split is based on limiting options, i.e. identifying winners, this will dramatically affect the goodness of fit of this method in such an environment. The Bandit, however, updating its belief about the world after every new observation (set), can be constructed to deal with such behaviors, e.g. by introducing a discount function that reduces the impact of older results when choosing the preferred options for the next period. To briefly highlight the Split’s defect with respect to changing success rates, have a look at a scenario where the rates’ means increase or decrease up to 5% per period (practically you can e.g. imagine the click rates of ads shown to one particular audience increase over time due to audience development e.g. by retargeting while the click rates of ads shown to another audience decline due to saturation):

As you can see, the Split now cannot even outperform the random base allocation anymore. The Bandit seems almost unaffected, although it could probably be improved further with mentioned discount for older results.

So all-in-all,

Bandits provide your go-to methodology when testing the performance of different variants or scenarios

and if you are still using A/B tests, you should at least consider a switch.

Before I leave you and look forward to receiving your feedback, comments and ideas I would like to thank you for staying with me for such a long and intense article. As a little goodie I have built an ad optimization tool for any of you who would like to use the Bandit for their digital marketing campaigns: optimizer.stagelink.com

In addition to that you can find all the code used for the discussed simulation on my Github: github.com/kinosal/optimizer

Thanks again and talk to you soon!

Source: https://towardsdatascience.com/when-and-when-not-to-a-b-test-c901f3ad96d9