• Find a dataset to analyze: Unlike ML, there aren’t many public datasets for A/B testing. In their NeurIPS paper, Liu et al. (2021) summarized a dozen online controlled experiment datasets with different levels of granularity. Read the paper, pick a dataset, and analyze the experiment (a minimal sketch of the mechanics follows below). Chances are, this process will expose gaps in your knowledge (How do you load, join, and aggregate the data? Which assumptions do you check? What test do you perform?) that merely reading a book or watching videos won’t.
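    As a sketch of those mechanics, here's what the load-join-aggregate-test loop can look like, assuming a hypothetical two-table schema (tiny inline frames stand in for the real dataset, which you'd load with pd.read_csv or a warehouse query):

    ```python
    import pandas as pd
    from scipy import stats

    # Hypothetical schema: most experiment datasets boil down to an
    # assignment table (user -> variant) plus an event/metric table.
    assignments = pd.DataFrame({
        "user_id": [1, 2, 3, 4],
        "variant": ["A", "A", "B", "B"],
    })
    events = pd.DataFrame({
        "user_id": [1, 1, 3, 4],
        "converted": [1, 1, 0, 1],
    })

    # Join, then aggregate to one row per user -- the unit of randomization.
    df = assignments.merge(events, on="user_id", how="left").fillna({"converted": 0})
    per_user = df.groupby(["variant", "user_id"], as_index=False)["converted"].max()

    # Welch's t-test on the 0/1 outcomes (close to a two-proportion
    # z-test at realistic sample sizes).
    a = per_user.loc[per_user["variant"] == "A", "converted"]
    b = per_user.loc[per_user["variant"] == "B", "converted"]
    print(stats.ttest_ind(a, b, equal_var=False))
    ```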

  • Implement power analysis and t-tests: While reading the Kohavi book, I didn’t understand the sample size formula, $16\sigma^2/\Delta^2$, so I Googled how to derive it and implemented different variations. For t-tests, I similarly implemented different variations (comparing two means vs. two proportions, pooled vs. unpooled variances) from scratch using NumPy and SciPy (see the sketch after this bullet).

    I know myself — I’m awful at memorizing random stuff (e.g., I don’t recall my own plate number) and have to make sense of something to remember it. IMO, it doesn’t matter how you choose to learn, but you need to be honest about whether it’s working and do whatever works for you.
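    Here’s a minimal sketch of what that exercise can look like (the function names and simulated data are mine, not from the book): deriving the per-group sample size from the normal approximation, which recovers the $16\sigma^2/\Delta^2$ rule of thumb at $\alpha = 0.05$ and 80% power, plus Welch’s unpooled-variance t-test checked against SciPy.

    ```python
    import numpy as np
    from scipy import stats

    def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
        # n = 2 * (z_{1-alpha/2} + z_power)^2 * sigma^2 / delta^2.
        # At alpha=0.05, power=0.8: 2 * (1.96 + 0.84)^2 ~= 15.7 ~= 16,
        # which is where the 16 * sigma^2 / delta^2 rule of thumb comes from.
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_power = stats.norm.ppf(power)
        return 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2

    def welch_t_test(x, y):
        # Two-sample t-test with unpooled variances, from scratch.
        nx, ny = len(x), len(y)
        vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
        t = (np.mean(x) - np.mean(y)) / np.sqrt(vx / nx + vy / ny)
        # Welch-Satterthwaite degrees of freedom.
        df = (vx / nx + vy / ny) ** 2 / (
            (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1)
        )
        return t, 2 * stats.t.sf(abs(t), df)

    print(sample_size_per_group(sigma=1.0, delta=0.1))  # ~1570 per group

    rng = np.random.default_rng(42)
    x, y = rng.normal(0.0, 1.0, 1000), rng.normal(0.1, 1.0, 1000)
    print(welch_t_test(x, y))                           # should match SciPy:
    print(stats.ttest_ind(x, y, equal_var=False))
    ```

    For comparing two proportions, the same skeleton applies with the Bernoulli variance $p(1-p)$ in place of $\sigma^2$.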

  • Consider complications: If you’re interviewing with companies that ask notoriously difficult A/B testing questions (e.g., Quora, Netflix, Uber, DoorDash, Google), you want to think deeper.

    For instance, when talking about A/B tests, we often assume independent groups, but some experiments use matched pairs (e.g., the same user sees both variants). What are the pros and cons of matched pairs vs. independent samples? And how do you analyze the results differently (paired t-tests, bootstrapping…)? A paired-analysis sketch follows this bullet.

    Another tricky problem is computing the average click-through rate (CTR) per user in each variant. Do you add all clicks together and divide by total impressions? Or do you compute each user’s CTR and average across users? Quora DS Tom Xing called the former “ratio-of-averages” and the latter “average-of-ratios”. What are their pros and cons? (A second sketch below contrasts the two.)

    Complications may also arise from the unit of randomization/analysis. Since an average Netflix user watches 0-1 movies per week, randomizing by user would mean month-long experiments. To compare algorithms A vs. B, Netflix often uses “interleaving”: showing items recommended by the two algorithms in alternating fashion ($A_1, B_1, A_2, B_2, \ldots, A_n, B_n$) and comparing the average metric of each algorithm. When A/B testing fails, you may need other causal inference methods. To prepare, you can read these companies’ engineering blogs.
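    Here’s the paired-analysis sketch on simulated data (the effect sizes and sample counts are made up for illustration): pairing on the user cancels user-to-user variance, which is why a matched-pairs design can detect a lift that an independent-samples test on the same data misses.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Simulated within-subject data: each user sees both variants, so the
    # per-user effect is shared between the two measurements.
    user_effect = rng.normal(0.0, 2.0, 500)            # large user-to-user variance
    control = user_effect + rng.normal(0.0, 0.5, 500)
    treatment = user_effect + rng.normal(0.1, 0.5, 500)  # true lift of 0.1

    # Independent-samples (Welch) t-test ignores the pairing: the user
    # variance inflates the standard error, so the lift is hard to detect.
    print(stats.ttest_ind(treatment, control, equal_var=False))

    # Paired t-test is a one-sample t-test on the differences: the shared
    # user effect cancels, giving a much smaller standard error.
    print(stats.ttest_rel(treatment, control))
    print(stats.ttest_1samp(treatment - control, 0.0))  # equivalent
    ```

    The trade-off: matched pairs require each user to experience both variants, so ordering and carryover effects become concerns; independent samples avoid those at the cost of higher variance.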
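    And a toy contrast of the two CTR definitions (the click and impression counts are made up): the pooled “ratio-of-averages” lets heavy users dominate, while “average-of-ratios” weights every user equally but inherits the noise of users with few impressions.

    ```python
    import pandas as pd

    # Hypothetical per-user click/impression counts in one variant.
    df = pd.DataFrame({
        "user":        ["a", "b", "c"],
        "clicks":      [90,   1,   0],
        "impressions": [1000, 10,  5],
    })

    # Ratio-of-averages: pool everything; heavy user "a" dominates.
    ratio_of_averages = df["clicks"].sum() / df["impressions"].sum()

    # Average-of-ratios: each user counts equally, including noisy
    # per-user CTRs from users with few impressions.
    average_of_ratios = (df["clicks"] / df["impressions"]).mean()

    print(ratio_of_averages)   # 91 / 1015  ~= 0.090
    print(average_of_ratios)   # (0.09 + 0.10 + 0.00) / 3 ~= 0.063
    ```

    One common pitfall: if you randomize by user but compute the ratio-of-averages over impressions, the analysis unit no longer matches the randomization unit, and a naive variance estimate will be off; hence techniques like the delta method or a user-level bootstrap.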