By Elisabeth Reitmayr

This is the third article in a three-part series that aims to add clarity and transparency around the way we work at ResearchGate. The first article covers the work that is needed before starting an experiment, and the second article discusses the setup of an experiment.

At ResearchGate, we run a lot of experiments to improve our product for our users. Our experiment design guidelines for product analysts establish the process for setting up those experiments from the analytical and statistical perspective to ensure we can evaluate the experiment as intended. These guidelines give some hints, but do not fully cover the product management, user research and design perspective, i.e. what to experiment on. In this third part of the series, we focus on sampling.

We are interested in your thoughts on these guidelines. Please send any feedback to elisabeth.reitmayr@hansek.net.


Sampling is an important part of experiment design. When we use a sample to make inferences about the population we are interested in, it is important to choose the right target group using an appropriate sampling mechanism to avoid bias. When we have bias in an experiment, this means that we do not adequately represent the population we are studying (read more about statistical bias here). To draw a valid conclusion from the experiment, it is equally important that the sample is large enough to detect the effect we want to detect (see Part 1).

Target group and sampling mechanism

The target group should be representative of the population we want to make inferences about. This means that if we run a test on a feature that can only be used by users who fulfil certain conditions (e.g. have a publication, are new to ResearchGate), both the experimental and the control group should only consist of users who fulfil this condition. Otherwise, we will introduce a selection bias (e.g. because users who have publications tend to be more active than users who do not have publications).

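Sometimes we want to expose an experiment only to a specific group of users. Let’s say we want to make it easier for Principal Investigators (leaders of a scientific lab) to add their lab to ResearchGate. In this case, all Principal Investigators on ResearchGate represent our population, and we should draw both samples randomly from the population of Principal Investigators.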

In rare cases, a stratified sample might help if you have a very small population or if your sample was already drawn in a biased way. For example, if you want to expose a new feature only to a small group of beta testers, you should be aware that this group will not be representative of the population, as more engaged users tend to be overrepresented in beta testing groups. (They are more likely to opt in to the beta group.) Therefore, you can draw a stratified sample from the beta group to make sure the distribution of engagement levels in your sample mirrors the distribution of engagement levels in your population. (Read more here on how to use a Bayesian approach to recover from selection bias.)
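As a rough illustration of the stratified sample described above, the following sketch draws users from a beta group so that the engagement mix of the sample matches the engagement mix of the population. The function name, the engagement labels and the population shares are hypothetical and only serve as an example; they are not part of our tooling.

```python
import random
from collections import defaultdict


def stratified_sample(users, population_shares, sample_size, seed=0):
    """Draw a sample whose strata shares mirror `population_shares`.

    users: list of (user_id, stratum) pairs, e.g. the members of the beta group.
    population_shares: dict mapping stratum -> share of that stratum in the
        population (assumed to sum to 1).
    Quotas are rounded per stratum, so the total can differ slightly from
    `sample_size` when the shares do not divide evenly.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the available users by stratum (here: engagement level).
    by_stratum = defaultdict(list)
    for user_id, stratum in users:
        by_stratum[stratum].append(user_id)

    sample = []
    for stratum, share in population_shares.items():
        quota = round(sample_size * share)
        candidates = by_stratum.get(stratum, [])
        if quota > len(candidates):
            raise ValueError(f"not enough users in stratum {stratum!r}")
        # Simple random sample without replacement within the stratum.
        sample.extend(rng.sample(candidates, quota))
    return sample


# Hypothetical beta group in which highly engaged users are overrepresented.
beta_group = (
    [(f"h{i}", "high") for i in range(70)]
    + [(f"m{i}", "medium") for i in range(20)]
    + [(f"l{i}", "low") for i in range(10)]
)
# Hypothetical engagement distribution in the full population.
shares = {"high": 0.2, "medium": 0.3, "low": 0.5}
sample = stratified_sample(beta_group, shares, sample_size=20)
```

Note that the smallest stratum in the beta group limits how large the stratified sample can get, which is one reason this approach is only useful in rare cases.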


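  • Is your experiment exposed exactly to the users you want to address (i.e. is the sample representative of the population you are interested in)?

  • Is there a potential bias in the way the audience is selected? (e.g., only new users/engaged users, etc.)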

  • Is each user only exposed to one variant?
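One common way to make sure each user only sees one variant is to derive the variant deterministically from the user ID. The sketch below is an illustration under our own assumptions (the function name, the experiment name and the use of SHA-256 are hypothetical choices, not a description of our experimentation platform).

```python
import hashlib


def assign_variant(user_id, experiment_name, variants=("control", "treatment")):
    """Map a user to exactly one variant of an experiment.

    Hashing the experiment name together with the user ID gives the same
    result on every call, so a returning user always sees the same variant.
    """
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


first_visit = assign_variant("user-123", "add-lab-flow")
second_visit = assign_variant("user-123", "add-lab-flow")
```

Including the experiment name in the hash means that the same user can land in different groups across different experiments, while staying in one group within each experiment.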

Sample size calculation


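  • Minimum detectable effect: the smallest effect you expect to see from the change you are introducing. This corresponds to the quantified expectation, as explained in Part 1. The smaller the effect we want to detect, the larger our sample needs to be.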

  • Statistical reliability (the confidence that an effect we detected is actually there) and statistical power (the ability to detect an effect when there is one): For a fixed sample size, there is a trade-off between reliability and power; generally, the higher the required reliability and power, the larger our sample needs to be. At ResearchGate, we set alpha (the significance level, i.e. the accepted false positive rate) to 5% and beta (the accepted false negative rate) to 20%, which corresponds to a confidence level of 95% and a statistical power of 80%. These parameters are the same across all experiments for consistency.

  • Variance in the variable of interest: The higher the variance in the variable we are interested in, the larger our sample needs to be.

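The sample size always needs to be calculated upfront, i.e. before implementing the experiment. If the required sample size is too large, we might not even want to run the experiment and choose another research method instead. We use a Frequentist approach to evaluate tests and can use third-party sample size calculators for this purpose.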
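To make the relationship between these inputs concrete, here is a minimal sketch of a sample size calculation for comparing two conversion rates. It assumes a two-sided test and the standard normal approximation for two proportions; third-party calculators may use slightly different formulas, so their results can deviate. The function name and the example rates are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            alpha=0.05, beta=0.20):
    """Approximate number of users needed in EACH variant.

    baseline_rate: conversion rate of the control, e.g. 0.10.
    min_detectable_effect: absolute lift we want to detect, e.g. 0.02.
    alpha: accepted false positive rate (two-sided).
    beta: accepted false negative rate (power = 1 - beta).
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(1 - beta)        # critical value for power
    # Variance of the difference between two independent proportions.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_large_effect = sample_size_per_variant(0.10, 0.02)
n_small_effect = sample_size_per_variant(0.10, 0.01)
```

With these example inputs, halving the minimum detectable effect roughly quadruples the required sample, which illustrates why small effects can make an experiment impractical.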


In case your experiment analysis requires hypothesis testing on breakdowns or multiple comparisons, this will change your sample size requirements due to alpha inflation. (The probability of at least one false positive grows quickly with the number of hypotheses you test: for m independent tests at significance level alpha, it is 1 - (1 - alpha)^m.) This has to be reflected in your sample size calculation (you can apply a p-value correction in your analysis; more on this in the next blog about experiment evaluation). Read more here.
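The following sketch shows how quickly alpha inflation builds up, and how a correction changes the significance level per test. It assumes independent tests, and it uses the Bonferroni correction purely as one well-known example; the guidelines do not prescribe a specific correction at this point. The function names are hypothetical.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / num_tests


single_test = family_wise_error_rate(0.05, 1)
ten_tests = family_wise_error_rate(0.05, 10)
corrected_alpha = bonferroni_alpha(0.05, 10)
# Error rate across ten tests when each one uses the corrected alpha.
ten_tests_corrected = family_wise_error_rate(corrected_alpha, 10)
```

Because the corrected alpha per test is smaller, it should be the value that goes into the sample size calculation, which in turn increases the required sample.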

Run time

General run time requirements

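  • The run time is influenced by the sample size requirements. (The experiment can only be stopped once the required sample size has been reached.)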

  • The run time should not be too short because otherwise the sample will be biased towards more active users. (Those tend to be overrepresented in the first days since they are more likely to log in and therefore more likely to join the experiment.)

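  • Consider seasonal effects that might influence user behavior:

      • Time series data in web/mobile is often non-stationary (i.e. the parameters of a variable such as conversion or retention, e.g. mean, median and variance, are not constant over time)

      • There might be seasonal effects, weekday effects, viral effects, SEO effects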

  • An absolutely non-scientific rule of thumb is that most experiments should run at least a week to account for the aforementioned potential effects.

Run time requirements for Multi-armed Bandit (MAB)

The multi-armed bandit is an algorithm that automatically chooses the variant that scored the highest according to the goal that was set. Once either a pre-defined period of time has passed or a pre-defined sample size threshold is crossed, the MAB defaults the experiment to the more successful variant for a large proportion of our users (read more here). This helps to decrease the "cost" of the experiment as we use the better-performing variant for the larger part of our user base. To set the right threshold, we need to define a minimum sample size or a minimum run time:

  • Calculate how long it takes to get the required sample size upfront and set the minimum exploration period accordingly.

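  • In case the minimum number of observations can be reached very quickly, set the minimum exploration period to at least 1 week (see previous paragraph).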
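The two rules above can be combined into a small helper. This is a sketch under simplifying assumptions: traffic is treated as constant per day, the required sample size is the total across all variants, and the function and parameter names are hypothetical.

```python
import math


def minimum_exploration_days(required_sample_size, daily_eligible_users,
                             minimum_days=7):
    """Days the bandit should explore before it may default to a variant.

    required_sample_size: total users needed across all variants.
    daily_eligible_users: expected number of new users entering per day.
    minimum_days: floor of one week to cover weekday and similar effects.
    """
    if daily_eligible_users <= 0:
        raise ValueError("daily_eligible_users must be positive")
    days_for_sample = math.ceil(required_sample_size / daily_eligible_users)
    # Never explore for less than the rule-of-thumb minimum of one week.
    return max(days_for_sample, minimum_days)


low_traffic = minimum_exploration_days(8000, 500)    # sample size is binding
high_traffic = minimum_exploration_days(8000, 4000)  # one-week floor is binding
```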

Edge cases: "soft" experiments

There will be edge cases where:

  • We have "medium risk" about an assumption, i.e. we ideally would want to test it experimentally; and

  • An experiment is not the ideal method to analyze the given problem, e.g. because the traffic to the feature is too small and we would have to wait for several months to reach the required sample size

Despite the low traffic, we might still want to get a quantitative understanding of the change to limit the risk of introducing the change (ensuring "we don't break things"). In this case, we suggest running a "soft experiment": we let the experiment run for e.g. 2 weeks in order to observe how the new variant performs, knowing that we will not reach the sample size required to run a Frequentist hypothesis test on it. We consider this a pragmatic solution to the situation we are facing, as having some data on a risky problem to make a decision is preferable to not having any data at all.

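Here, we are talking about the art, rather than the science, of experimentation: judgment is required. You should work with your team (PM, Design, and User Research) to decide whether a "soft experiment" is the best solution to the problem at hand. You should also make sure everyone is aware of the limitations of running the experiment only "halfway".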

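Once you decide to run a "soft experiment", do not run a Frequentist hypothesis test to evaluate the results of the experiment in this case. Nevertheless, we still set it up as an A/B test to see how the new variant compares to the control variant.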

Based on the typology suggested in this blog post, we can add the "soft" experiment in the grey area between high risk and low risk:

Image based on The Art of the Strategic Product Roadmap

Overlapping or concurrent experiments

Do not run more than one experiment on the same component concurrently except in the case that you have a full-factorial design (see Part 2).


Unit of observation

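Another important consideration for experimental design is the definition of the unit of observation. In theory, the unit of observation can be e.g. the user, the session, the user-login day etc. For example, if you are comparing which email variant is more likely to make users log in to RG, the unit of observation would be the user. Here, it is important to consider the statistical requirement of independent observations.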

Independent observations

To evaluate an experiment, it is important that the observations are independent. Two observations are independent if the occurrence of one observation provides no information about the occurrence of the other observation. The statistical models we use to evaluate experiments are based on the assumption that the observations in the sample are independent. If we violate this assumption, the conclusions we draw from the experiment might be flawed.

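This means that in most cases, experiments should be user-based rather than session-based (i.e. one row in the dataset to be evaluated corresponds to one user, not to one user session). If our sample contains multiple observations per user, the second observation of a user will not be independent of the first.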

This also means that each user may enter the experiment only once, otherwise we have the following problems:

  • The controls we impose on Type I and Type II error rates do not work as intended: The probability of getting a false positive is higher than the pre-defined level. (Read more here and here.)

  • More active users are overrepresented because it is likely that they join the experiment multiple times. This means that the results are biased towards more active users.
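To illustrate what a user-based dataset looks like in practice, the sketch below collapses session-level rows into one row per user before evaluation. The row layout (user ID, variant, converted flag) and the rule that a user counts as converted if any of their sessions converted are assumptions made for this example.

```python
def to_user_level(session_rows):
    """Collapse (user_id, variant, converted) session rows to one row per user.

    A user counts as converted if at least one of their sessions converted.
    Raises an error if a user was exposed to more than one variant.
    """
    users = {}
    for user_id, variant, converted in session_rows:
        if user_id not in users:
            users[user_id] = {"variant": variant, "converted": False}
        elif users[user_id]["variant"] != variant:
            raise ValueError(f"user {user_id!r} saw more than one variant")
        users[user_id]["converted"] = users[user_id]["converted"] or bool(converted)
    return users


# Hypothetical session log: the active user u1 appears twice.
sessions = [
    ("u1", "control", False),
    ("u1", "control", True),
    ("u2", "treatment", False),
]
user_rows = to_user_level(sessions)
```

After this step, the active user contributes a single observation, so they no longer carry more weight in the evaluation than a user with one session.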