True or False: A/B testing
Whether you are new to the field or an expert, about to launch your first test or running A/B tests every day, or simply looking to discover new things around this notion…
The DataMa team has gathered its A/B testing experts to answer all your most specific questions. Whether it’s about the number of tests, their duration, the method, the traffic split or the significance of these tests, all the answers you’re looking for about A/B tests are here! Guillaume, a decision scientist for more than 10 years, and Alex, a CRO expert, share their expertise with you.
1. It is possible to run several tests at the same time on the same page: TRUE
This question often comes up, and many people say it is not possible. But mathematically, of course it is possible to run several tests at the same time on the same page. As long as the allocation of visitors is completely random (i.e. it is not the case that even-numbered visitors go to pages A and A’ while odd-numbered visitors go to pages B and B’), the two tests are independent, and the volume is sufficient, then you can run all the tests you want.
As an example, websites like Homeaway or Booking run more than 200 tests per month, and there are not 200 different types of pages on these sites.
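As a sketch of what “completely random allocation” means in practice, here is a minimal Python example; the test names and visitor loop are illustrative assumptions, not part of any real tool:

```python
import random

random.seed(42)

def assign_visitor() -> dict:
    """Draw each test's variant independently and at random.

    Because the two draws are independent, running both tests on the
    same page does not bias either one. Test names are illustrative.
    """
    return {
        "test_1": random.choice(["A", "A'"]),
        "test_2": random.choice(["B", "B'"]),
    }

# With enough traffic, each of the four combinations gets ~25% of visitors.
counts = {}
for _ in range(100_000):
    groups = assign_visitor()
    key = (groups["test_1"], groups["test_2"])
    counts[key] = counts.get(key, 0) + 1
```

This is precisely the opposite of the even/odd-visitor scheme above: per-visitor independent draws keep the two tests orthogonal.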
2. It is only possible to run 50/50 tests: FALSE
Indeed, you can choose your own distribution!
However, with a 90/10 split, the duration of the test will be constrained by the arm with the least traffic (here, the 10% arm). If you are not confident about a test, or if you want to start slowly, you can begin with 90/10.
Of course, you can also do parallel A/B/C/D testing with several variables at the same time (e.g. 25/25/25/25).
3. The traffic split should be stable during the test: TRUE (or almost always)
Yes, ideally, the traffic split should remain stable in a classic A/B test. Now there are some exceptions:
- Often, in the first 2-3 days of a test, we run a ramp-up and start at 1/99 to check that the data comes through cleanly in production before moving to 50/50. This ramp-up period cannot be used to read the results.
- Even if it is more complicated statistically, there are multi-armed-bandit approaches (different from multi-variant testing): companies that are very advanced in A/B testing use them to vary their traffic split over time. The idea is to start at 50/50 and then, as one variant proves better, shift traffic towards it so as not to lose conversions. Be careful: this is really harder to manage and to read statistically.
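One common way to implement such a bandit is Thompson sampling; the sketch below, with assumed conversion rates, shows how traffic drifts towards the better variant over time:

```python
import random

random.seed(0)

# A minimal Thompson-sampling bandit: each arm keeps a Beta posterior
# over its conversion rate, and traffic drifts towards the better arm.
# The conversion rates below are assumptions for illustration.
true_rates = {"A": 0.10, "B": 0.15}
alpha = {"A": 1, "B": 1}  # posterior successes (+1 prior)
beta = {"A": 1, "B": 1}   # posterior failures (+1 prior)
served = {"A": 0, "B": 0}

for _ in range(20_000):
    # Sample a plausible rate for each arm from its posterior...
    draws = {arm: random.betavariate(alpha[arm], beta[arm])
             for arm in true_rates}
    arm = max(draws, key=draws.get)  # ...and serve the best-looking arm.
    served[arm] += 1
    if random.random() < true_rates[arm]:
        alpha[arm] += 1
    else:
        beta[arm] += 1

# Most of the traffic ends up on the genuinely better variant B.
```

Note how the allocation is no longer stable, which is exactly why reading such a test with classic fixed-split statistics is misleading.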
4. Significance cannot be computed for a metric that takes values above 1: FALSE
Often the question is the following: how significant is the variation in my average basket?
If I have increased my conversion by 2%, my A/B testing tool can tell if it is statistically significant.
However, if my average shopping basket increases from €320 to €325, is that significant? A/B testing tools cannot always answer, because their approaches often assume that the tracked variable is a Bernoulli variable (the conversion is either 1 or 0). So to obtain significance on the average basket or, for example, on revenue per user, you simply have to go back to the actual distribution of this metric.
- This is what we do with our DataMa Impact tool, which lets us read significance on metrics that are not of the classic 0/1 kind.
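Going back to the metric’s own distribution can be as simple as a two-sample test on the raw basket values. A minimal stdlib sketch, where the simulated baskets (mean €320 vs €325, spread €80) are assumptions for illustration:

```python
import math
import random
import statistics

random.seed(1)
# Simulated basket values; means and spread are illustrative assumptions.
basket_a = [random.gauss(320, 80) for _ in range(5_000)]
basket_b = [random.gauss(325, 80) for _ in range(5_000)]

def z_test(sample_a, sample_b):
    """Two-sample z-test on the means, using the metric's own
    distribution rather than assuming a 0/1 Bernoulli variable."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    z = (mean_b - mean_a) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p

z, p = z_test(basket_a, basket_b)
```

The test uses the observed variance of basket values, which is exactly the information a Bernoulli-only tool throws away.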
5. The Bayesian method is better than the frequentist method : IT DEPENDS
The market currently leans towards Bayesian tests, because they are easier to interpret. To present the results to your boss, you can simply say: “this is the percentage chance that A is better than B”. Frequentist tests are a bit more difficult to interpret, because you are testing the probability that what you observe would occur if there were no real difference. That is much harder to explain to a decision-maker, which is why many A/B testing tools favour the Bayesian approach.
- But from a statistical point of view, at DataMa we are more comfortable with the frequentist approach for reading significance results, so we do both! And besides, there are other tests…
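To make the two readings concrete, here is a sketch with assumed conversion counts: a Bayesian “chance that B beats A” from Beta posteriors, and a frequentist two-sided p-value from a two-proportion z-test:

```python
import math
import random

random.seed(2)

# Assumed counts for illustration: 10,000 visitors per arm.
conv_a, n_a = 500, 10_000   # 5.0% conversion
conv_b, n_b = 560, 10_000   # 5.6% conversion

# Bayesian reading: "what is the chance B beats A?"
# (Monte Carlo over Beta posteriors with a flat prior.)
samples = 100_000
b_beats_a = sum(
    random.betavariate(conv_b + 1, n_b - conv_b + 1)
    > random.betavariate(conv_a + 1, n_a - conv_a + 1)
    for _ in range(samples)
) / samples

# Frequentist reading: "how unlikely is this gap if A and B are identical?"
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (conv_b / n_b - conv_a / n_a) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
```

Both numbers describe the same data; the Bayesian one is a statement about the variants, the frequentist one a statement about the observation, which is why the first is easier to present to a decision-maker.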
6. All sites should run A/B tests: FALSE
In reality, not always! It doesn’t really make sense for very small sites (<10,000 visitors/month), because statistical significance requires a large volume of traffic and data! It is better to first do the sizing exercise: test duration, evaluation of significance, and so on.
For small sites, it is better to look at what others are doing or to do user testing rather than installing an A/B test tool! Do market research, build a culture of experimentation, collect data on your assets to see what works and what doesn’t…
- On the other hand, for larger sites, this is essential.
7. Significance must be monitored even if it is obtained earlier than expected: TRUE
This is a difficult question: at the very least, you can’t run an A/B test for less than a week (especially because of weekly seasonality). Moreover, even if the statistics say that the results are significant, you should not make a decision with less than 100-200 observations of success (purchase / conversion) in A and B. The basic rule is therefore to stick to the sizing done before the test.
On the other hand, it is possible to set some business rules before the test. For example, if the evolution is drastic and very rapid (e.g. 20% drop on the first day), or prolonged (significant loss of conversion for 5 days in a row), action must be taken to curb this excessive loss of business! Under these conditions, the test can be stopped.
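As a sketch of the pre-test sizing mentioned above, here is a standard normal-approximation formula; the base rate and target lift are assumptions for illustration:

```python
import math

def visitors_per_arm(base_rate: float, lift: float) -> int:
    """Approximate visitors needed per arm to detect a relative `lift`
    on `base_rate`, at 5% two-sided significance and 80% power
    (normal approximation for two proportions)."""
    z_alpha = 1.96   # two-sided alpha = 5%
    z_beta = 0.84    # power = 80%
    p1, p2 = base_rate, base_rate * (1 + lift)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         ) / (p2 - p1) ** 2
    return math.ceil(n)

# e.g. detecting a +10% relative lift on a 5% conversion rate
n = visitors_per_arm(0.05, 0.10)  # roughly 30,000+ visitors per arm
```

Doing this arithmetic before launch is what gives you a principled stopping point, instead of reacting to an early significance flag.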
8. A/A tests are useless: FALSE
In reality, this is one of the first steps to take when starting A/B testing. It is better to have an A/A test running permanently, if only to check the consistency of figures and results and that the A/A comparison stays flat.
- Moreover, over a period that is too short, we will regularly observe a significantly negative or positive A/A (!), which clearly shows the irrelevance of samples that are too small and/or periods that are too short (for all the statistical sceptics…).
If, over a sufficiently long period, we observe A/A results that are significantly up or down, this raises questions about the implementation of the tools, tracking cookies, web analytics issues… all fundamental questions on which many players still face challenges. But in these cases, the first reflex should be to ask yourself: did you run an A/A test to check that your web analytics were well aligned?
From a technological point of view, running an A/A test lets you make sure the technical tools are well aligned and see the difference between the conversion figures given by your A/B test tool and your analytics tool… Running an A/A test should be compulsory when you integrate a new tool, and as a “health check-up” every 6 months to make sure the tools are operational and aligned.
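The short-period A/A effect is easy to simulate: in the sketch below (assumed conversion rate and sample sizes), roughly 5% of identical A/A tests come out “significant” at the 5% level purely by chance:

```python
import math
import random

random.seed(3)

def aa_p_value(n: int, rate: float = 0.05) -> float:
    """Simulate one A/A test: the same conversion rate on both sides,
    then a two-proportion z-test between the two identical arms."""
    conv_a = sum(random.random() < rate for _ in range(n))
    conv_b = sum(random.random() < rate for _ in range(n))
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return 1.0
    z = (conv_b - conv_a) / (n * se)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# Over many small A/A tests (500 visitors per arm), about 5% are
# flagged "significant" even though A and A' are strictly identical.
runs = 1_000
false_positives = sum(aa_p_value(500) < 0.05 for _ in range(runs))
rate = false_positives / runs
```

This is the mechanism behind the “negative or positive A/A” observation above: it is not the tooling, just the expected false-positive rate of the test itself.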