Tools for analyzing and optimizing the sales effectiveness of websites are becoming more and more popular. As these tools spread, however, marketers also need to be educated in research methods and statistical inference, and that education is unfortunately still very poor. That is why we want to draw attention to the key problems surrounding A/B tests and highlight the issues you should absolutely keep in mind when deciding on this form of research.
Statistical inference is an extremely useful field of knowledge, and every marketer should take at least a basic course in it. It is a key element of decision-making in business and data-driven marketing if we want to act rationally and draw conclusions from the information available. This applies to both quantitative and qualitative data, since qualitative data can also be quantified to some extent (although various traps await us there, too). Interpreting the correlation between phenomena and assessing the likelihood that our conclusions are right (and not the result of chance) are key competences; without them, working with data can do us more harm than good.
It is important to realize that these percentages indicate a form of correlation and do not by themselves imply a causal relationship. If variant B performs significantly better, this means there is a strong correlation between the adjustment in variant B and the increase in conversions.
Significance is a term from statistics. When two phenomena are compared, the so-called ‘null hypothesis’ is assumed by default: the assumption that there is no relationship between them. That is quite logical: we need a starting point, and by default we assume that no relationship exists. The null hypothesis stands until the evidence shows otherwise, namely that there is a connection between the two phenomena.
If an A/B test is statistically significant, it means that the relationship is unlikely to be due to coincidence. Various ‘confidence levels’ are used for this. For A/B testing, 95% is the common choice. In the medical world, where one wants near-absolute certainty that a result is not coincidental, only confidence levels of 99.9% and above are accepted.
Put simply, a 95% confidence level means that if there were no real difference, a result like yours would still show up by chance about 5% of the time. There is, in other words, a 5% risk that your result is based on coincidence or variance. Significance is related to another statistical term: power.
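To see what that 5% risk means in practice, here is a minimal simulation: a sketch in Python with NumPy and SciPy, where the conversion rate, traffic numbers and number of simulated tests are illustrative assumptions. We run many A/A tests in which both variants are truly identical, and count how often a standard test still reports ‘significance’ at the 95% level.

```python
# Simulate many A/A tests where both variants are truly identical,
# and count how often a two-proportion z-test still flags "significance".
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
true_rate = 0.05        # both variants convert at 5%; there is no real effect
visitors = 10_000       # visitors per variant, per simulated test
n_tests = 2_000         # number of simulated A/A tests

false_positives = 0
for _ in range(n_tests):
    conv_a = rng.binomial(visitors, true_rate)
    conv_b = rng.binomial(visitors, true_rate)
    p_a, p_b = conv_a / visitors, conv_b / visitors
    p_pool = (conv_a + conv_b) / (2 * visitors)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / visitors))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    if p_value < 0.05:  # "significant" at the 95% confidence level
        false_positives += 1

print(f"False positive rate: {false_positives / n_tests:.1%}")  # ~5%
```

Roughly 5% of these tests will come out ‘significant’ even though nothing changed, which is exactly the coincidence risk the confidence level expresses.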
Power indicates the chance that you detect a real relationship when one actually exists, i.e. that you correctly reject the null hypothesis (the starting point that there is no correlation between the variables). A power of 80% means that the A/B test has an 80% chance of detecting a true effect of the assumed size in the sample. A high power is therefore good. We use a minimum of 80% power for our test results.
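As an illustration, this is roughly how power translates into required traffic. The sketch below (Python with SciPy; the baseline rate and hoped-for uplift are made-up assumptions) uses the standard sample-size formula for comparing two proportions at a 95% confidence level and 80% power.

```python
# Standard sample-size formula for a two-proportion test:
# how many visitors per variant to detect a given uplift?
from scipy.stats import norm

def sample_size_per_variant(p_a, p_b, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect the difference p_b - p_a."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # threshold for the desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return (z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2

# Example: baseline 5% conversion, hoping to detect a lift to 6%
print(round(sample_size_per_variant(0.05, 0.06)))  # ≈ 8,155 visitors per variant
```

The smaller the uplift you want to detect, the more visitors you need; that is why underpowered tests so often end in false ‘no effect’ conclusions.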
Broadly speaking, there are two ways to assess a test result: the frequentist method and the Bayesian method.
If you assess your results using the frequentist method, you test whether an effect is there or not: the method looks at how often a result like yours would recur over many repetitions in the long run. In the case of an A/B test, your hypothesis is what is being tested. If your hypothesis is not confirmed (i.e. the change does not lead to an improvement), you get a ‘no’ as a result. If it is confirmed, the result is a ‘yes’. Only two outcomes are possible.
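A minimal frequentist evaluation might look like the sketch below (Python with SciPy; the visitor and conversion counts are made-up example numbers). Note how the verdict is reduced to a ‘yes’ or a ‘no’.

```python
# Two-sided two-proportion z-test, reduced to a yes/no verdict.
import math
from scipy.stats import norm

def frequentist_verdict(conv_a, n_a, conv_b, n_b, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return ("yes" if p_value < alpha else "no"), p_value

verdict, p = frequentist_verdict(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(verdict, round(p, 3))  # e.g. "yes 0.028" for these example counts
```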
The Bayesian method works differently. There is no ‘yes’ or ‘no’ outcome; it works with probabilities instead, calculating the probability that your hypothesis is successful on a scale of 1% to 100%. The greater that probability, the more certain you can be of your case, but you decide for yourself at which point you consider an assumption acceptable. You also take other factors into account, such as costs and benefits.
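One common way to compute such a probability is to model each variant’s conversion rate with a Beta distribution and sample from it. The sketch below (Python with NumPy) assumes made-up counts and a uniform Beta(1, 1) prior; real testing tools may use different priors or closed-form calculations.

```python
# Bayesian probability that variant B beats the original,
# estimated by sampling from Beta posteriors.
import numpy as np

rng = np.random.default_rng(7)
conv_a, n_a = 500, 10_000   # original: 500 conversions out of 10,000 visitors
conv_b, n_b = 570, 10_000   # variant B: 570 conversions out of 10,000 visitors

# Posterior per conversion rate: Beta(1 + conversions, 1 + non-conversions)
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

prob_b_beats_a = (samples_b > samples_a).mean()
print(f"Probability that B beats A: {prob_b_beats_a:.1%}")  # on the 1%-100% scale
```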
We believe the Bayesian method is a more accurate way to assess a result on its merits. It allows for better expectation management: you are not only concerned with the average percentages, but you also look at what happens when those averages turn out to deviate in practice.
The final result is a risk analysis in which we look at all the possible effects that implementing a result can have. What happens in the best case? But also: what is the effect if we fall short of the expected improvement?
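Continuing the Bayesian sketch above (same assumed counts and prior), such a risk analysis could look at the optimistic and pessimistic scenarios, and at the expected loss if we implement the variant and it turns out to be worse after all.

```python
# Risk analysis on the Beta-posterior samples: best case, worst case,
# and the expected loss of shipping B if B is actually worse.
import numpy as np

rng = np.random.default_rng(7)
samples_a = rng.beta(1 + 500, 1 + 9_500, size=100_000)   # original
samples_b = rng.beta(1 + 570, 1 + 9_430, size=100_000)   # variant B

relative_uplift = (samples_b - samples_a) / samples_a
best_case = np.percentile(relative_uplift, 95)    # optimistic scenario
worst_case = np.percentile(relative_uplift, 5)    # pessimistic scenario
expected_loss = np.mean(np.maximum(samples_a - samples_b, 0))  # cost of a wrong call

print(f"Best case uplift:  {best_case:+.1%}")
print(f"Worst case uplift: {worst_case:+.1%}")
print(f"Expected loss if we ship B: {expected_loss:.4f} (conversion-rate points)")
```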
Enough about the underlying methods. Back to what matters: interpreting your results. Before we walk through a number of reports with you, we would like to go over the statistics that come back in such a report.
These are the figures that you can retrieve directly from your test (a short worked example follows after the list). They consist of:
Conversion rate: the number of conversions divided by the number of visitors gives the average conversion rate per variant.
Improvement: the difference between the average conversion percentages. It usually indicates an improvement, but in some cases a deterioration; in that case a negative percentage is displayed.
Confidence interval: each average percentage comes with a confidence interval, which indicates the lower and upper limit of the average percentage. This is also called the margin of error: a measure of how far the average might deviate in practice.
Reliability: the chance that the original will be beaten by the variant. The higher this percentage, the smaller the risk that a result is based on coincidence. Reliability is therefore displayed as a probability from 1% to 100%.
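To make these four statistics concrete, here is a small sketch (Python with SciPy). The counts are made-up, and the reliability is approximated here with a normal distribution; testing tools may calculate it differently, for instance with the Bayesian approach shown earlier.

```python
# The four report statistics: conversion rate, improvement,
# confidence interval (margin of error), and reliability.
import math
from scipy.stats import norm

conv_a, n_a = 500, 10_000    # original
conv_b, n_b = 570, 10_000    # variant B

p_a, p_b = conv_a / n_a, conv_b / n_b
uplift = (p_b - p_a) / p_a                      # relative improvement

z = norm.ppf(0.975)                             # 95% confidence level
def interval(p, n):
    margin = z * math.sqrt(p * (1 - p) / n)     # margin of error
    return p - margin, p + margin

lo_a, hi_a = interval(p_a, n_a)
lo_b, hi_b = interval(p_b, n_b)

se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
reliability = norm.cdf((p_b - p_a) / se_diff)   # chance B beats the original

print(f"Conversion rate A: {p_a:.2%}, B: {p_b:.2%}")
print(f"Improvement: {uplift:+.1%}")
print(f"95% interval A: {lo_a:.2%} to {hi_a:.2%}")
print(f"95% interval B: {lo_b:.2%} to {hi_b:.2%}")
print(f"Reliability: {reliability:.1%}")
```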
Having a grip on the data and the statistics is not the same as judging a result. After all, what are the figures above trying to tell you? Keep the following in mind while interpreting the data: