Caveats and Limitations of A/B Testing at Growth Tech Companies

For non-tech industry folks, an “A/B test” is just a randomized controlled trial: you split users (or other units) into treatment and control groups, then later compare key metrics across those groups to learn whether the treatment or the control experience performed better. For the purposes of this post, I will be talking specifically about end-user testing of some sort of digital application. Also, when I talk about effect sizes, I am talking in terms of percentages, so the assumption is that effects are comparable across user bases of different sizes.

Introduction

A/B tests are the gold standard of user testing, but there are a few fundamental limitations to A/B tests:

  1. When evaluating an A/B test, your metrics must be (a) measurable, and (b) actually being measured.
  2. A/B tests apply effects that are (a) on the margin, and (b) within the testing period.

Some of these points may seem obvious on their face, but they have pretty important implications that many businesses (specifically managers, and even many folks with “data” in their job titles) fail to consider.

The “TL;DR” version

What people expect is that as an app grows, sample sizes get larger, which increases the statistical power of experiments. Additionally, larger denominators in KPIs mean that fixed labor costs are spread across more users, so even lower effect sizes (on a percentage basis) become more acceptable as the user base grows.

Everything above is true, but there are some oft-neglected countervailing effects:

  1. Features “cannibalize” one another; or rather, they experience diminishing marginal returns as other features get added.
  2. As a consequence of the first point, the contemporaneous effects of experimentation may squeeze rapidly toward zero, even if intertemporal effects (which are often stronger anyway) remain non-zero.

The end result is that, at a growth company, it is not unreasonable to find that diminishing effect sizes erode the statistical power of experiments (at least those measuring contemporaneous or near-term effects) faster than growing sample sizes can restore it. But critically, this doesn’t mean that the features being tested are “bad.”

While drafting this post, I confirmed with data scientists at other companies that they, too, see a pattern of diminishing A/B test effect sizes over time across experiments in aggregate, and I discussed with them some of the internal politics surrounding it. That said, everything in this post is ultimately my own opinion and analysis.

Measurable metrics are usually a proxy for immeasurable sentiment

The implication of measurability is that you can’t tell whether users actually enjoy or hate a particular feature if that sentiment doesn’t show up in your metrics, e.g. they hate the feature but continue to use the app within the period of your testing.

The assumption that churn is a measure of user dissatisfaction is, ultimately, an assumption. It is more likely that contemporaneous churn signifies extreme dissatisfaction, and that minor dissatisfaction may exist without immediately obvious churn.

Some folks may mistakenly believe things like, “if we have enough users, we’ll be able to see the effects.” The mistake here is not really one of sample size, but of the timing of the effects: if user dissatisfaction leads to delayed churn, the test may conclude before the measurable quantity shows a statistically significant effect (even though the immeasurable effect, i.e. sentiment, was immediate).

The idea that users will not churn contemporaneously even if they’ll churn later sounds like a stretch (surely across 500,000 users in your experiment, you’ll see some sort of immediate effect, no?), but it’s not a stretch if you think about it from a user experience perspective. Measurable effects on churn or other KPIs may be delayed because users often do not immediately stop what they are doing just because a particular in-app flow becomes more annoying. Usually the biggest hurdle is convincing a user to log into the app in the first place, and a “high intent” user will not be immediately dissuaded from completing a flow even if they are dissatisfied with the smoothness of the experience along the way. Measuring the delayed effect may be especially slow if your typical user only logs in, say, once a month. (I highly recommend plotting cumulative impulse response functions of key metrics, such as average dollar spend, partitioned by experiment group; a rough sketch of what I mean follows this paragraph. You may be shocked how many effects linger past the end of the experiment. You may even see sustained effects in the first derivatives of your CIRFs!) User insistence on completing subpar flows is how you get zero effects within-period but non-zero effects in future periods, even with massive sample sizes.
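To make the CIRF idea concrete, here is a minimal sketch in pandas. Everything about the input is hypothetical: I am assuming a long-format table with one row per user per day since exposure, and the file name and column names (user_id, group, days_since_exposure, spend) are made up for illustration.

import pandas as pd

# Hypothetical long-format table: one row per user per day since exposure.
# Assumed columns: user_id, group ("treatment"/"control"), days_since_exposure, spend.
events = pd.read_parquet("experiment_events.parquet")  # placeholder source

# Impulse response: average daily spend per group, by days since exposure.
irf = (
    events
    .groupby(["group", "days_since_exposure"])["spend"]
    .mean()
    .unstack("group")
)

# Cumulative impulse response: running total of average spend over time.
cirf = irf.cumsum()

# The treatment-vs-control gap. Keep computing it past the official end of the
# experiment to see whether effects linger or only show up later.
cirf_gap = cirf["treatment"] - cirf["control"]
print(cirf_gap.tail())

If the gap (or its day-over-day slope) is still moving when the experiment officially ends, that is exactly the kind of intertemporal effect a standard readout will miss.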

While we’re here, it should also be noted that “measurable” is not a synonym for “being measured.” Many metrics that sound great in theory during a quick call with management may not ever be collected in practice for a wide variety of reasons. Some of those reasons will be bad reasons and should be identified as missing metrics, then rectified. Other reasons will be good reasons, like it’d involve 30 Jira points of labor, and oh well, that’s life. So the company’s “data-driven” understanding of what is going on is confined to what the company has the internal capacity to measure. This means that many theoretically measurable metrics will be de facto immeasurable. And many of the things you really, really want to know will be unknowable via data.

And finally, relying on a metric that is being measured also assumes that it is the correct metric to measure, and that there are no errors in what a particular number represents. Of course, these assumptions can be violated as well.

Some metrics are adversarial; some effects are non-stationary

Let’s say you are A/B testing a price for a simple consumer product, and you decide to go with whichever one yields the most profit to the business.

This is just a simple (price - marginal cost) * quantity calculation, and it should not usually be hard to calculate which is more profitable. The problem here isn’t the measurability of profit.
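To make the readout concrete, here is a toy version of that calculation (the prices, cost, and purchase counts are entirely made up for illustration):

# Toy numbers for illustration only.
marginal_cost = 5.00

arms = {
    "A": {"price": 9.99, "purchases": 10_400},
    "B": {"price": 12.99, "purchases": 8_100},
}

for name, arm in arms.items():
    profit = (arm["price"] - marginal_cost) * arm["purchases"]
    print(name, round(profit, 2))

The “data-driven” answer is whichever arm prints the larger number, which is exactly where the problems described below come in.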

Yet, there is still a hidden measurement problem, combined with an intertemporal effect problem. Specifically, the measurement problem is consumer surplus (i.e. it’s semi-measurable or deducible, but most likely not actually being measured at the majority of tech companies). The intertemporal effects are twofold: (1) users who receive low consumer surplus may be less likely to purchase again, and (2) the profit-maximizing price may change across periods due to competition and macroeconomic conditions.

Profit being a metric that is adversarial with the user is the root problem. Without a countervailing force being “measured” in some way (consumer surplus), you may end up pissing off users enough that they don’t come back a second time.

The other problem is competition, which may introduce non-stationarities in a broader sense. So optimal prices can change a lot over time, even if users have the memories of goldfish and there are no autoregressive (so to speak) intertemporal effects on a user-by-user basis.

I mention A/B testing of prices in particular because it is an especially dangerous thing to attempt without a deep understanding of pricing from the marketing side of things, plus a confidently held theory of how to attain desired business results with pricing. I don’t even mean it’s dangerous in an ethical sense (that’s true but outside the scope of this document); I mean it is dangerous from a strictly long-term business perspective. Even if your theory is that it’s optimal business to squeeze users for as much money as possible in the short term, this should be acknowledged as a theory, and it should be confidently held when being executed.

Many managers do not want to confidently state that squeezing users for every penny is desirable, perhaps because it is counter to liberal sensibilities. So they may instead state that to price a particular way is “data-driven.” It’s a farce to hide behind the descriptor “data-driven” in the pricing context, as short-term profit maximization is not a data-driven result; it’s a theory for how to run a business. The price that spits out of the A/B test as maximizing short-term profit is what’s “data-driven,” not the decision to go with said pricing strategy. A general form of this idea is true in all contexts of using KPIs to make decisions (e.g. is maximizing time users spend on an app actually a good thing?), but the pricing context is where it is most obvious that describing a decision as “data-driven” is just begging the question.

A/B tests tend to yield lower effect sizes as the app grows

If your app has 5 features (ABCDE), and you are A/B testing a 6th feature (F), the A/B test is testing the differences between the combination of ABCDE and ABCDEF. In other words, you are measuring the marginal impact of F conditional on ABCDE.

It is easy to imagine a few unintended problems arising from this. Imagine for example that all these features are more or less interchangeable in terms of their effects on KPIs in isolation, but the effects of these features tend to eat into each other as the features get piled on. In this case, the order in which these features get tested (and not the quality of the features in a vacuum) would be the primary determinant of which features evaluate well.

Note that this argument does not rely on the idea that the company tackles the most important features first, so that feature additions by their very nature become more fringe and smaller over time. Although there will certainly be some of that going on, this effect comes from a saturation of features in general. Diminishing marginal returns exist even if all the features are more or less the same quality. And that’s the key to understanding this limitation of A/B tests: you are testing the marginal effect conditional on whatever was contemporaneously in the app while you conducted the test.

In the worst-case scenario, diminishing marginal returns in quantity of features become negative marginal returns, and basically nothing else can be added to the app.
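To see how this plays out mechanically, here is a toy simulation. Every number and the saturating response curve are made up; the point is only that when engagement saturates in the number of features, the measured marginal lift of the next feature shrinks even though every feature is identical in quality.

import numpy as np

rng = np.random.default_rng(0)

def conversion_rate(n_features: int) -> float:
    # Toy assumption: engagement saturates as features pile on,
    # approaching a ceiling of 60%.
    return 0.60 * (1 - np.exp(-0.25 * n_features))

def simulated_ab_test(existing_features: int, users_per_arm: int = 500_000) -> float:
    # Control sees the existing features; treatment sees one more.
    p_control = conversion_rate(existing_features)
    p_treatment = conversion_rate(existing_features + 1)
    control = rng.binomial(users_per_arm, p_control) / users_per_arm
    treatment = rng.binomial(users_per_arm, p_treatment) / users_per_arm
    return (treatment - control) / control  # relative (percentage) lift

for n in [1, 3, 5, 10, 20]:
    print(n, f"{simulated_ab_test(n):+.2%}")

The measured lift of the “next” feature keeps shrinking as the app fills up, which is the cannibalization story above in miniature.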

Companies won’t usually tackle this problem head on (which is somewhat reasonable)

Let’s go back to our example of testing F, conditional on ABCDE. Imagine feature D was only ever tested conditional on ABC existing in the app. However, the fairest comparison of features D and F would be testing feature D conditional on ABCE, testing feature F conditional on ABCE, testing the interaction of DF conditional on ABCE, and testing neither conditional on ABCE. Given that ABCDE are already established features in the app, this would mean we need to re-test D.

To be clear, actually doing this “fair” comparison is a little unreasonable. Companies don’t tend to continually A/B test features that have proven themselves successful for a few reasons:

  • Non-zero maintenance cost of sustaining multiple experiences.
  • Difficulty in marketing a feature that not all users receive. (Imagine being advertised to about a feature that you can’t even see because you’re in the control group.)
  • Testing a worse experience comes with the opportunity cost (sometimes called “regret”) of showing some proportion of users something that doesn’t optimize a KPI.

Additionally, there are arguments for preferring to retain older features at full exposure, even if their marginal effects are actually worse. Imagine in the hypothetical “fair” comparison that F is actually slightly better than D. We still may prefer D over F because:

  • Maintaining tried and true features tends to be easier from the engineering perspective.
  • Users come to expect the existence of a feature the longer it is around.
  • Users are “loss averse” to the removal of a feature (i.e. going from “not X” to “X” back to “not X” is potentially worse than just starting and staying on “not X”).

A final reason not to tackle this problem is that the “dumb” interpretation of A/B testing isn’t obviously bad for operational purposes. Effectively, if F has basically no effect conditional on ABCDE existing, then it’s not obvious at first glance that there are problems with taking this literally to mean there is no effect of including or excluding “F,” unless modifying ABCDE is a reasonable option to pursue.

That said, many companies would still benefit from doing the following:

  1. Most importantly, have a theory and use common sense when building the app out. (Many things don’t need to be A/B tested.)
  2. When you start to see diminished effect sizes, run longer-duration A/B tests.
  3. Be careful to consider the temporal effect of when the experiment was started; experiments not run contemporaneously are often not reasonable to compare to one another.
  4. Measure IRFs and CIRFs of your KPIs with experiment exposure as the impulse, even past the experimentation period.

Many companies who experience this don’t have a strong institutional understanding of what’s going on

Product managers, data scientists, data analysts, and engineers tend to suck at thinking about their company holistically. In fairness, their job is usually to think about very small slices of the company very intensely. But spending the entire month of August optimizing some API to run faster, even if economical to do so, will obscure the true determinants of the app’s performance, which are often a grab bag of both banal and crazy factors, both intrinsic and extrinsic to the company and its operations. Some tech employees lose the plot and really do need to look at an overpriced MBB consulting deck about their company, unmarred by being too deep in the weeds on one thing.

So these tech employees may not even realize that their A/B test performance is getting worse; or they may notice, but not understand that it is not just a coincidence, that it is a perfectly reasonable thing to expect to happen.

Another reason many tech employees may not understand what’s going on is that they’ve been told A/B tests are good and the gold standard for testing features, without deeply understanding what they do and don’t measure. Unfortunately for managers who insist on being “data-driven,” comparing the results of A/B tests across different periods absolutely does require interpretation and subject matter expertise, because inter-period A/B tests are in some ways incomparable.

These limitations, if not acknowledged, lead to toxic internal politics

A “data-driven” culture that does not acknowledge simple limitations of data tools we use is one where data and data-ish rhetoric can be abused to push for nonsensical conclusions and to cause misery.

The main frustration that comes from diminishing marginal effects of A/B testing is that managers will see those beneath them as failing in some sense: upper managers thinking their product managers are failing; product managers thinking their data scientists are failing; and so on, all compared to the experiments and features of yesteryear.

The truth of course is more complicated and some combination of:

  1. The intertemporal effects dominate (implying the test needs to run for longer).
  2. The app is saturated with other, cannibalizing features (implying that the work is either not productive, or that other features should be removed).
  3. The feature makes the app better (or worse?), but in some fundamentally immeasurable-in-an-A/B-test way (requiring theory, intuition, and common sense to dictate what should happen).

Usually people over-rely on this sort of “data-driven” rhetoric when they lack confidence in what they are doing in some regards– maybe they are completely clueless, or maybe they sense what they are doing is unethical and need some sort of external validation / therapy in the form of a dashboard.

I am a data person at the end of the day. I think the highest value data is that which tells you that a prior belief was wrong, or tells you something when you had no real prior beliefs at all. A lot of features you’ll be adding to an app don’t really need to be justified through “data-driven” means if you have a strong prior belief that they make the app better. Maybe a feature doesn’t have an immediately obvious, measurable impact in the treatment group, but maybe it makes users happier over time and more likely to tell their friends to download the app (also, your app’s referral feature is probably garbage/unused and doesn’t capture 99% of word-of-mouth downloads, so those will never show up as attributable in an A/B test).

These strong prior beliefs may come from, say, subject matter expertise. And if your company is at the point where contemporaneous, measurable effect sizes are decaying your statistical power faster than sample size increases can keep up, or you are working on a problem with hard-to-measure or immeasurable effects, you’re going to need subject matter expertise to fill in that gap.

Unfortunately, this is all easier said than done. Middle managers don’t like hearing that tests need to run for longer. They don’t like hearing that their product idea “failed” an A/B test. They don’t like having to actively disavow a “data-driven” approach that upper management is pressing them to adhere strictly to (although they’ll likely be politically pressured into pretending to be “data-driven” by always reporting metrics that support the value of their own work, but that’s another story).

I’ll end on a personal anecdote of when an over-abundance of nominal commitments to “data-driven decisionmaking” led to some toxic internal politics:

I once had a “data-driven” upper manager who would say “show me the data” a lot in conversation whenever I said we should do something. The data this person claimed to want was never something that was actually being measured (always for good reasons or reasons outside of my control). In fact, that was usually why I was not showing data: it didn’t exist in some sense! But I’m not so credulous as to believe that the request for data was sincere; I suspect that this manager knew the data was unattainable in some sense, and was using “show me the data” as a bludgeon to deliberately suppress these conversations and “win” arguments.

This is more of a personal post than something intended to be profound. If you are looking for a point, you will not find one here. Frankly I am not even sure who the target audience is for this (probably “data scientists who hate themselves”?).

I was a data scientist for the past few years, but in 2022 I got a new job as a data engineer, and it’s been pretty good to me so far.

I’m still working alongside “data scientists,” and still do a little bit of that myself, but most of my “data science” work is directing and consulting on others’ work. I’ve been focusing more on the implementation of data science (“MLops”) and data engineering.

The main reason I soured on data science is that the work felt like it didn’t matter, in multiple senses of the words “didn’t matter”:

  • The work is downstream of engineering, product, and office politics, meaning it was often only as good as the weakest link in that chain.
  • Nobody knew or even cared what the difference was between good and bad data science work. Meaning you could absolutely suck at your job or be incredible at it and you’d get nearly the same regard in either case.
  • The work was often very low value-add to the business (often compensating for incompetence up the management chain).
  • When the work’s value-add exceeded the labor costs, it was often personally unfulfilling (e.g. tuning a parameter to make the business extra money).

Shitty management & insane projects

Management was by far my biggest gripe. I am completely exhausted by the absolute insanity that was the tech industry up to 2021. Companies all over were consistently pursuing things that could be reasoned about a priori as insane ideas– ideas any decently smart person should have known wouldn’t work before they were tried. Some projects could have saved whole person-years of labor had anyone possessed a better understanding of the business, customers, the broader economic / social landscape, financial accounting, and (far too underrated in tech) any relevant subject matter areas.

Those who have seen my Twitter posts know that I believe the role of the data scientist in a scenario of insane management is not to provide real, honest consultation, but to launder these insane ideas as having some sort of basis in objective reality even if they don’t. Managers will say they want to make data-driven decisions, but they really want decision-driven data. If you strayed from this role– e.g. by warning people not to pursue stupid ideas– your reward was their disdain, then they’d do it anyway, then it wouldn’t work (what a shocker). The only way to win is to become a stooge.

The reason managers pursued these insane ideas is partly that they were hired despite not having any subject matter expertise in the business or the company’s operations, and partly that VC firms had the strange idea that ballooning costs well in excess of revenue was “growth” and therefore good in all cases; the business equivalent of the Flat Earth Society. It was also beneficial for one’s personal career growth to manage an insane project (résumé lines such as “managed $10 million in top-line revenue,” failing to disclose that the COGS was $30 million). Basically, there’s a decent reward for succeeding and no downside for failing, and sometimes you will even be rewarded for your failures! So why not do something insane?

Also, it seems that VC firms like companies to run the same way their portfolios do– they want companies to try 100 different things, and if only 5 out of those 100 things work, then the VCs will consider that a success. On the ground floor, this creates a lot of misery, since the median employee at the company is almost certainly working on a product that is not destined to perform well, but the shareholders are happy, which is of course all that matters.

Shitty code & shitty data science

The median data scientist is horrible at coding and engineering in general. The few who are remotely decent at coding are often not good at engineering in the sense that they tend to over-engineer solutions, have a sense of self-grandeur, and want to waste time building their own platform stuff (folks, do not do this).

This leads to two feelings on my end:

  1. It got annoying not having some amount of authority over code and infra decisions. Working with data scientists without having control over infra feels like wading through piles of immutable shit.
  2. It was obvious that there is a general industry-wide need for people who are good at both data science and coding to oversee firms’ data science practices in a technical capacity.

Poor mentorship

I don’t want to be too snooty: in a sense, it’s fine for data scientists to suck at coding! Especially if they bring other valuable skills to the table, or if they’re starting out. And in another sense, bad code getting into production is a symptom of bad team design and management, more than any individual contributors’ faults! By describing the median data scientist’s coding skills as shitty, I’m just trying to be honest, not scornful.

The problem is that the median data scientist works at a small to medium-sized company that doesn’t build its data science practices around the conceit that the data scientists’ code will suck. They’d rather let a 23-year-old who knows how to pip install jupyterlab run loose and self-manage, or manage alongside other similarly situated 23-year-olds. Where is the adult in charge?

23-year-old data scientists should probably not work in start-ups, frankly; they should be working at companies that have actual capacity to onboard and delegate work to data folks fresh out of college. So many careers are being ruined before they’ve even started because data science kids went straight from undergrad to being the third data science hire at a series C company, where the first two hires either provide no mentorship, or provide shitty mentorship because they too started their careers in the same way.

Poor self-directed education

On the other hand, it’s not just the companies’ and managers’ faults; individual data scientists are also to blame for being really bad at career growth. This is not contemptible for people who are just starting out their careers, but at some point folks’ résumés start to outpace their actual accumulation of skills, and I cannot help but find that a teeny bit embarrassing.

It seems like the main career growth data scientists subject themselves to is learning the API of some gradient boosting tool or consuming superficial + shallow + irrelevant knowledge. I don’t really sympathize with this learning trajectory because I’ve never felt the main bottleneck to my work was that I needed some gradient boosting tool. Rather, the main bottlenecks I’ve faced were always crappy infrastructure and a lack of (quality) data, so it has always felt natural to focus my efforts toward learning that stuff to unblock myself.

My knowledge gaps have also historically been less grandiose than learning how some state-of-the-art language model works or pretending I understand some arXiv white paper ornate with LaTeX notation of advanced calculus. Personally, I’ve benefited a ton from reading the first couple chapters of advanced textbooks (while ignoring the last 75% of the textbook), and refreshing on embarrassingly pedestrian math knowledge like “how do logarithms work.” Yeah I admit it, I’m an old man and I’ve had to refresh on my high school pre-calc. Maybe it’s because I have 30k Twitter followers, but I live in constant anxiety that someone will pop quiz me with questions like “what is the formula for an F-statistic,” and that by failing to get it right I will vanish in a puff of smoke. So my brain tells me that I must always refresh myself on the basics. I admit this is perhaps a strange way to live one’s life, but it worked for me: after having gouged my eyes out on linear regression and basic math, it’s shockingly apparent to me how much people merely pretend to understand this stuff, and how much ostensible interest in more advanced topics is pure sophistry.

For the life of me I cannot see how reading a blog post that has sentences in it such as “DALL-E is a diffusion model with billions of parameters” would ever be relevant to my work. The median guy who is into this sort of superficial content consumption hasn’t actually gone through chapters in an advanced textbook in years if ever. Don’t take them at their word that they’ve actually grinded through the math because people lie about how well-read they are all the time (and it’s easy to tell when people are lying). Like bro, you want to do stuff with “diffusion models”? You don’t even know how to add two normal distributions together! You ain’t diffusing shit!

I don’t want to blame people for spending their free time doing things other than learning how to code or doing math exercises out of grad school textbooks. To actually become an expert in multiple things is oppressively time-consuming, and leaves little time for other stuff. There’s more to life than your dang job or the subject matters that may be relevant to your career. One of the main sins of the “data scientist” job is that it expects far too much from people.

But there’s also a part of me that’s just like, how can you not be curious? How can you write Python for 5 years of your life and never look at a bit of source code and try to understand how it works, why it was designed a certain way, and why a particular file in the repo is there? How can you fit a dozen regressions and not try to understand where those coefficients come from and the linear algebra behind it? I dunno, man.

Ultimately nobody really knows what they are doing, and that’s OK. But between companies not building around this observation, and individuals not self-directing their educations around this observation, it is just a bit maddening to feel stuck in stupid hell.

Data engineering has been relaxing

These are the things I’ve been enjoying about data engineering:

  • Sense of autonomy and control.
    • By virtue of what my job is, I have tons of control over the infrastructure.
    • Data engineering feels significantly less subject to the whims and direction of insane management.
  • Less need for sophistry.
    • My work is judged based on how good the data pipelines are, not based on how good-looking my slide decks are or how many buzzwords I can use in a sentence. Not to say data engineering doesn’t have buzzwords and trends, but those are peddled by SaaS vendors more than actual engineers.
  • More free time.
    • I dunno, it feels like becoming a data engineer cured my imposter syndrome? I feel like I have more ability to dick around in my spare time without feeling inadequate about some aspect of my job or expertise. But this is probably highly collinear with not being a lackey for product managers.
  • Obvious and immediate value that is not tied to a KPI.
    • I like being valued, what can I say.
    • Ultimately the data scientists need me more than I need them; I’m the reason their stuff is in production and runs smoothly.
    • I have a sense that, if my current place of business needed to chop employees, it would be a dumb decision to chop me over any data scientist.
  • Frankly, I feel really good at what I do.
    • As someone who has worked a variety of downstream data-related jobs, I have both a very strong sense of what my downstream consumers want, as well as the chops to QC/QA my own work with relative ease the way a good analyst would.
    • At my last company I had a lot of “I could totally do a better job at designing this” feelings regarding our data stack, and it has immensely fed my ego to have confirmed all of these suspicions myself.
    • This role gets to leverage basically everything I’ve learned in my career so far.

By far the most important thing here is the sense of independence. At this point it feels like the main person I should be complaining about is myself. (And self-loathing is so much healthier than hating a random product manager.) As long as my company’s data scientists are dependent on me to get code into production, I have veto power over a lot of bad code. So if they are putting bad code in production, that ultimately ends up being my fault.

I think my career trajectory made sense– there was no way I was hopping straight into data engineering and doing a good job of it without having done the following first:

  • See a few data stacks and form opinions about them as a downstream consumer.
  • Get better at coding.
  • Pick up on the lingo that data engineers use to describe data (which is distinct from how social scientists, financial professionals, data scientists, etc. describe data), like “entity,” “normalization,” “slowly-changing dimension type 2,” “CAP theorem,” “upsert,” “association table,” so on and so on.

So, ultimately I have no regrets about having done data science, but I am also enjoying the transition to data engineering. I continue to do data science in the sense that these roles are murkily defined (at both my “data scientist” and “data engineer” jobs I spend like 40% of the time writing downstream SQL transformations), I get to critique and provide feedback on data science work, and, hey, I actually did deploy some math-heavy code recently. Hell, you could argue I’m just a data scientist who manages the data science infra at a company.

Anyway, great transition, would recommend for anyone who is good at coding and currently hates their data science job. My advice is, if you want to get into this from the data science angle, make sure you are actively blurring the lines between data science and data engineering at your current job to prepare for the transition. Contribute code to your company’s workflow management repo; put stuff into production (both live APIs and batch jobs); learn how to do CI/CD, Docker, Terraform; and form opinions about the stuff your upstream engineers are feeding you (what you do and don’t like and why). In fact it is very likely this work is higher value and more fun than tuning hyperparameters anyway, so why not start now?

Sorry, this post has no point, so it’s ending rather anticlimactically.


If you Google for an explainer on the differences and use cases for the arithmetic mean vs geometric mean vs harmonic mean, I feel like everything you’ll find is pretty bad and won’t properly explain the intuition of what’s going on and why you’d ever use one over the others.

In fact, you sometimes will hear about someone choosing the geometric mean over the arithmetic mean because they want to put less weight on outliers– my position on this is that it should be criminal and punished with jail time and a lifetime revocation of all software licenses.

Note that this blog refers to log() a lot. Every time I use log(), I am talking about the natural logarithm, never the log base 10.

TLDR: The Common Link

The arithmetic mean, geometric mean, and harmonic mean all involve the following four steps:

  1. Define some invertible function 𝑓(·)
  2. Apply this function to each number in the set: 𝑓(x)
  3. Take the arithmetic mean of the transformed series: avg(𝑓(x))
  4. Invert the transformation on the average: 𝑓⁻¹(avg(𝑓(x)))

^ Given some function avg(x) = sum(x) / count(x). Or in more math-y terms, avg(x) = (1/N) * Σᵢ xᵢ.

The only difference between each type of mean is the function 𝑓(·). Those functions for each respective type of mean are the following:

  • Arithmetic mean: the function is the identity map: 𝑓(x) = x. The inverse of this function is trivially 𝑓⁻¹(x) = x.
  • Geometric mean: the function is the natural log: 𝑓(x) = log(x). The inverse of this function is 𝑓⁻¹(x) = e^x.
  • Harmonic mean: the function is the multiplicative inverse: 𝑓(x) = 1/x. Similar to the arithmetic mean, the function is involutory and thus the inverse is itself: 𝑓⁻¹(x) = 1/x.

We can also add the root mean square into the mix, which is not one of the “Pythagorean means” and is outside this article’s scope, but can be described the same way using some 𝑓(·):

  • Root mean square: the function is 𝑓(x) = x^2. The inverse of this function is 𝑓⁻¹(x) = √x.
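Here is a minimal sketch of that common recipe in numpy; the function name f_mean is my own made-up name, not standard terminology:

import numpy as np

def f_mean(x, f, f_inv):
    # The common recipe: transform, take the arithmetic mean, invert.
    return f_inv(np.mean(f(np.asarray(x, dtype=float))))

x = [1.0, 2.0, 4.0, 8.0]

identity = lambda v: v
arithmetic = f_mean(x, identity, identity)              # plain old average
geometric = f_mean(x, np.log, np.exp)                   # exp(avg(log(x)))
harmonic = f_mean(x, lambda v: 1 / v, lambda v: 1 / v)  # 1 / avg(1/x)
rms = f_mean(x, np.square, np.sqrt)                     # sqrt(avg(x^2))

print(arithmetic, geometric, harmonic, rms)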

And I leave it as an exercise to the reader to define the “f(x)” of mean kinetic temperature, another mean formula that can be expressed as some 𝑓⁻¹(avg(𝑓(x))).

This means that:

  • The geometric mean is basically exp(avg(ln(x))).
  • The harmonic mean is basically 1/avg(1/x).

We’re playing fast and loose with the math notation when we say that, but hopefully this makes perfect sense to the Pandas, R, Excel, etc. folks out there.

Another way of thinking about what’s happening: Each type of mean is the average of 𝑓(x), converted back to the units x was originally in.

In fact, you can imagine many situations in which we don’t care about the conversion back to the original units of x, and we simply take avg(𝑓(x)) and call it a day. (Later in this article we will discuss one such example.)

Basically, instead of thinking about it as arithmetic mean vs geometric mean vs harmonic mean, I believe it is far better to think of it as deciding between taking a mean of x vs log(x) vs 1/x.

If someone tells you “I took the geometric mean of x,” you should translate it as “I took the mean of log(x), then converted it back to the units of x.”

If someone tells you “I took the harmonic mean of x,” you should translate it as “I took the mean of 1/x, then converted it back to the units of x.”
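If you want to confirm those two translations numerically, scipy’s gmean and hmean make the comparison easy:

import numpy as np
from scipy.stats import gmean, hmean

x = np.array([2.0, 8.0, 32.0])

# "Geometric mean of x" == "mean of log(x), converted back to the units of x"
print(gmean(x), np.exp(np.mean(np.log(x))))   # both 8.0

# "Harmonic mean of x" == "mean of 1/x, converted back to the units of x"
print(hmean(x), 1 / np.mean(1 / x))           # both ~4.57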

𝑓(x) for the Geometric Mean

The geometric mean is basically just exp(avg(ln(x))) for some series x. You may be thinking to yourself, “hold on just a second, that’s not the formula I learned for geometric means, isn’t the geometric mean multiplicative?” Indeed, the geometric mean is usually defined like this:

geometric mean = (x₁ * x₂ * … * x_N)^(1/N)

The multiplicative formulation of the geometric mean is a big part of my complaint about how these things are taught, and it is a big motivation for why I wrote this blog post. This formulation has two problems:

  1. It does not provide a good intuition for the relationship between all the types of means.
  2. It does not provide a good intuition for when this type of mean is useful.

Let’s start with a slightly easier, non-geometric example to understand our motivation for why we want to use the geometric mean in the first place. Then we will move toward understanding why we prefer the formulation that uses logarithms.

Imagine a series xᵢ that starts at x₀ = 42, and each value in the series is the previous value plus a random innovation sampled from a normal distribution: Δx ~ N(1, 1²):

The code and its output:

>>> import numpy as np
>>> np.random.seed(42069)
>>> x = 42 + np.random.normal(1, 1, size=100).cumsum()
>>> print(x)
[ 42.92553097  43.49170572  44.8624312   46.51825442  48.63784536
 ...
 143.79217108 143.55446468 143.82917616]

On average, this series increments by 1, or: Δx̄ = 1. The code example with N=100 is fairly close to this:

>>> print(np.diff(x).mean())
1.019228739283586

A nice property of this series is that we can estimate the number xᵢ at any point out by taking x₀ and adding i * Δx̄:

xᵢ ≈ x₀ + i * Δx̄

For example, given x₀ = 42 and Δx̄ = 1, we expect the value at i = 200 to be: x₂₀₀ = 42 + 1 * 200 = 242.

One more note: you can see that if we did this with Δx ~ N(0, 1²), the end value would be pretty close to our starting value, since on average the change is 0, so the series doesn’t tend to go anywhere:

>>> y = 42 + np.random.normal(0, 1, size=100).cumsum()
>>> print(y)
[42.14327609 41.73220144 40.34751605 39.89921679 40.71819255
 ...
 40.82806017 40.40149929 40.08320282]

Now imagine instead that we’re working with percentages. Let’s try to create a series that acts like a percentage change but doesn’t drift anywhere, i.e. its average percent change is 0. A seemingly sensible thing to do here is to set the mean of the normal distribution to 1.00, because 42 * 1 * 1 * 1 * 1 * 1 * 1 * … = 42. If we do that (and set the variance to be small), then take the cumulative product, we get this:

>>> z = 42 * np.random.normal(1.00, 0.01, size=100).cumprod()
>>> z
array([42.29485307, 42.10421377, 41.92718004, 42.15880536, 42.40603275,
       ...
       37.34626113, 38.13230149, 37.96815914])

Oh dear, clearly the number is drifting downwards. Even though our normal distribution has a mean of 1.00, the percentage change is negative on average!

The reason why this happens is pretty straightforward: If you multiply a number by 101% then by 99%, you get a smaller number than you started with. If you do this again and then again a third time, the resulting number gets smaller and smaller even though (1.01 + 0.99 + 1.01 + 0.99 + 1.01 + 0.99) / 6 = 1.00.

Curious minds will be asking a few questions that are all variants of: what would we need to input as our “0%” so that the series actually, truly changes by 0% on average over time?

In the example of 101% and 99%, what we really needed to do was oscillate between 101.0101010101% and 99%, and this gives us an average percentage change over time of 0%. The formula that represents this is the geometric mean:

(0.99 * 1.01010101010101)^(1/2) = 1.00000

We can intuit here that the reason this works is because:

  • Percentages are multiplied.
  • Geometric means are like the multiplicative equivalent of the arithmetic mean.
  • Ergo, geometric means are good for percentage changes.

That’s all fine and dandy. But I would argue this approach is not the best way to think about it.

Just Use Logarithms

There is a much better way to think about everything we’ve done up to this point on geometric means, which is to use logarithms. Logarithms are, in my experience, criminally underused. Logarithms are a way to do the following:

  • turn multiplication into addition
  • turn exponentiation into multiplication

And we can see that the formula for the geometric mean consists of two steps:

  1. Multiplication of all x₁, x₂, …, followed by:
  2. Exponentiation by 1/N

If we were to instead use logarithms, it would become:

  1. Addition of all log(x₁), log(x₂), …, followed by:
  2. Multiplication by 1/N

Well golly, that looks an awful lot like taking the arithmetic mean.

We can just use the above example to confirm this works as intended:

(log(0.99) + log(1.01010101010101)) / 2 = 0

Of course, the final step to a proper geometric mean is that we need to convert back to the original domain, so let’s take the exp() of the arithmetic average of log(x), and we’re all set. (That said, we could also skip this final step, but more on that later.)

e^((log(0.99) + log(1.01010101010101)) / 2) = e^0 = 1

It would be negligent of me if I failed to point out another thing here, which is that the earlier example where we took a normal distribution and cumulatively multiplied it onto the number 42 was pretty bad. What we should have been doing is something like this:

>>> np.random.seed(123789)
>>> w = np.exp(np.log(42) + np.random.normal(0, 0.01, size=100).cumsum())
>>> print(w)
[42.04757853 42.03767307 41.71175204 41.35454934 41.62054759
 ...
 46.01279547 46.08962901 46.67943336 46.96112588]

Keep in mind that ln(1) = 0, and again that logarithms turn multiplication into addition, so we can just use the cumulative sum this time (rather than the cumulative product).

Note that in this formulation, the series is no longer normally distributed with respect to x. Instead, it’s log-normally distributed with respect to x. Which is just a fancy way of saying that it’s normally distributed with respect to log(x). Basically, you can put the “log” either in front of the word “normally” or in front of the “x”. “Log(x) is normal” and “x is log-normal” are equivalent statements.

The code above may look like it is using an ordinary normal distribution, but you know the series is log-normal because (1) the 42 was wrapped in log(), and (2) I end up wrapping the entire expression in exp().

When working with percentages, we like to use logarithms and log-normal distributions for a few reasons:

  • Log-normal distributions never go below 0, since logarithms are only defined for positive numbers. If we used, say, a normal distribution, there is always a chance the output number is negative, and if that happens things will go haywire.
  • The average of the logged series, log(x), equals the μ (“mu”) parameter of the log-normal distribution. Or rather, it may be more accurate to say that for any finite series that is log-normally distributed, the mean of log(x) is an unbiased estimator for μ.
  • Adding 2 or more normally distributed random variables together yields another normally distributed variable, but the same is not true when multiplying 2 or more normally distributed random variables. So if we use logs, we get to keep everything normal, which can be extremely convenient.
  • Adding is easier than multiplying in a lot of contexts. Some examples:
    • In SQL, the easiest way to take a “total product” in a GROUP BY statement is to do EXP(SUM(LN(x))).
    • In SQL, the easiest way to take the geometric mean in a GROUP BY statement is to do EXP(AVG(LN(x))) (but you already knew that, right?).
    • Linear regression prediction is just adding up the independent variables on the right-hand side, so estimating some log(y) using linear regression is a lot like expressing all the features as multiplicative of one another.
  • It just so happens that in a lot of contexts where percentages are an appropriate way to frame the problem, you’ll see log-normal distributions. This is especially true of real-world price data, or the changes in prices: log(priceₜ) – log(priceₜ₋₁). (Note if log(price) is normal, that means subtracting two log(price)’s is normal, because of the thing about adding two normal distributions.) Try a Shapiro-Wilk test on your log(x), or just make a Q-Q plot of log(x) and use your eyes.
  • log(1 + r) (with the natural log) is actually a pretty close approximation to r, especially for values close to 0.

Two more things– first, you can always use the np.random.lognormal() function and then take the cumulative product to confirm your high school pre-calculus:

>>> w = 42 * np.random.lognormal(0, 0.01, size=100).cumprod()
>>> print(w)
[42.04757853 42.03767307 41.71175204 41.35454934 41.62054759
 ...
 46.01279547 46.08962901 46.67943336 46.96112588]

Second, we might as well take the geometric mean of the deltas in the series, huh?

We expect that the geometric mean should be very close to 1.00, so when we subtract 1 from it we should see something very close to 0.00:

>>> np.exp(np.diff(np.log(w)).mean()) - 1
0.0011169703289617416

Additionally, we expect the geometric mean to be pretty close (but not equal) to the arithmetic mean of the percentage changes:

>>> (np.diff(w) / w[:-1]).mean()
0.001156197504173517

Last but not least, in my opinion, if you are calculating geometric means and log-normal distributions and whatnot, it is often perfectly fine to just take the average of log(x) rather than the geometric mean of x, or to use the normal distribution of log(x) instead of the log-normal distribution of x, things of that nature.

In a lot of quantitative contexts where logging your variable makes sense, you will probably just stay in the log form for the entirety of your mathematical calculations. Remember that the exp() at the very end is just a conversion back from log(x) to x, and sometimes that conversion is unnecessary! Get used to talking about your data that should be logged in its logged form.

In that sense, I mostly feel like a geometric mean is just a convenience for managers. We tell our managers that the average percentage change of this or that is [geometric mean of X minus one] percent, but in our modeling we do [mean of log(X)]. Honestly, I prefer to stick with the latter.

Re-contextualizing the Canonical Example of the Harmonic Mean

The most common example provided for a harmonic mean is the average speed traveled given a fixed distance. Wikipedia provides the example of a vehicle traveling 60kph on an outbound trip, and 20kph on a return trip.

The “true” average speed traveled for this trip was 30kph. The 20kph leg needs to be “weighted” 3 times as much as the 60kph leg because traveling a fixed distance at one-third the speed means you spend 3 times as long traveling at that speed. You could take the weighted average to get to this 30kph number, i.e. (20*3 + 60)/4 = 30. That works for mental math. But the more generalized way to solve this is to take reciprocals.

Given what we now know about the harmonic mean, we can see that what we are doing is:

  1. Convert the numbers to hours per kilometer: 𝑓(x) = 1/x
  2. Take the average of hours per kilometer.
  3. Convert back to kilometers per hour: 𝑓⁻¹(x) = 1/x

Our motivation for this is that we want a fixed denominator and a variable numerator so that the numbers we’re working with add up sensibly (e.g. a/c + b/c = (a+b)/c). In the original formulation where the numbers are reported in kilometers per hour, the denominator is variable (hours) and the numerator is fixed (kilometers).

Again, much like the geometric mean situation, our motivation in this case is to make the problem trivially and sensibly additive by applying an invertible transformation. Note that if the distance were variable but the time were fixed, then the numbers would already be additive and we would just take the plain average without any fuss. A worked version of the trip example follows below.
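Here is the trip example worked in numpy; the 60kph and 20kph figures come from the Wikipedia example above, and the rest is just the recipe:

import numpy as np

speeds_kph = np.array([60.0, 20.0])     # same distance traveled at each speed

hours_per_km = 1 / speeds_kph           # f(x) = 1/x
avg_hours_per_km = hours_per_km.mean()  # avg(f(x))
harmonic_mean = 1 / avg_hours_per_km    # f^-1(avg(f(x)))

print(harmonic_mean)                    # 30.0
print((20 * 3 + 60) / 4)                # the weighted-average shortcut, also 30.0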

Why Use Means in the First Place?

It may sound silly to even ask, but we should not take for granted the question of why we use means and whether we should be using them.

For example, imagine a time series of annual US GDP levels from 1982 to 2021. What happens if you take the mean of this? Answer: you get $11.535 trillion. OK, so what though? This value is not going to be useful for, say, predicting US GDP in 2022. US GDP in 2022 is more likely to be closer to the 2021 number (23.315 trillion) than it is to be anywhere near the average annual GDP between 1982 and 2021. Also, this number is extremely sensitive to the time periods you pick.

Clearly, it is not necessarily the case that taking an average of any and all data is useful. Sometimes (like in the above example) it is pretty useless.

Another example where it may be somewhat useful but misleading would be for household income. Most household income statistics use median household income, not mean household income, because these numbers are heavily skewed by very rich people.

To be clear, sometimes (arguably a majority of the time) it’s fine to use the mean over the median, even if it is skewed by a heavy tail of outliers! For example, insurance policies: the median payout per fire insurance policy is a very useless number– it’s almost assuredly $0. But you can’t price policies with that! Your insurance company will live and die based on means, not medians!

The only reason the median wins in the household income context is that we want to know what a typical or representative household is like, and the mean doesn’t give us a great idea of that due to the skew. Similarly, in the fire insurance case, a typical household with fire insurance gets a payout of $0. But the thing is we don’t actually care about typical or representative households when doing our actuarial assignment; that’s not what insurance is about.

In the case of the GDP example, the mean was useless because the domain over which we were trying to summarize the data was a meaningless and arbitrary domain, and the data generating process is non-stationary with respect to that domain. Taking the median would have been pretty useless too, which is to say none of these summary statistics are particularly useful for this data. In the household income example, we do want some sort of summary statistic, but we want every household at the top to offset every household at the bottom, hit for hit, until all that’s left is the 50th percentile.

In short:

  • For some data, neither the mean nor the median is useful.
  • Sometimes mean is useful and median is not.
  • Sometimes median is useful and mean is not.
  • Sometimes they’re both useful.
  • No, the median is not something you use just to adjust for skew (because sometimes you don’t want to adjust for skew, even for very heavy-tailed distributions). There are no hard and fast rules you can make based on a finite distribution’s shape alone. It all depends on the context!

OK, But Really, Why Use Means? (This Time With Math)

One reason we like to use means is because of the central limit theorem. Basically, if you have a random variable sampled from some stationary distribution with finite variance, the larger your sample gets, the more the sample mean (which is itself a random variable) converges to a normal distribution with mean equal to the population mean.

The central limit theorem is often misunderstood, and it’s hard to blame people because it’s a mouthful and it’s easy to conflate many of the concepts inside that mouthful. There are two distinct mentions of random variables, distributions, and means.

If more people understood the central limit theorem you’d see fewer silly statements out in the wild like “you can’t do a t-test on data unless it’s normally distributed.” Ahem… yes you can! The key here is that distribution of the underlying data is different from the distribution of the sample mean. And it only takes a modest sample size for the distribution of a sample mean to converge to normality, even when the underlying data is non-normal. One way to think of the central limit theorem is that it is saying “more or less all arbitrary distributions of data have a sample mean that converges to a normal distribution.” It’s not saying the underlying data is normal, it’s saying the sample mean is itself a random variable that comes from its own distribution, which converges to normality.
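If you want to see this for yourself, here is a quick simulation using an exponential distribution (an arbitrary choice of something heavily skewed and very much not normal):

import numpy as np

rng = np.random.default_rng(0)

# Underlying data: exponential(1), which is heavily skewed.
# Draw many samples of size n and look at the distribution of the sample mean.
for n in [2, 10, 50, 250]:
    sample_means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)
    skew = ((sample_means - sample_means.mean()) ** 3).mean() / sample_means.std() ** 3
    print(n, round(sample_means.mean(), 3), round(sample_means.std(), 3), round(skew, 3))

The sample mean hovers around the population mean (1.0), its spread shrinks like 1/sqrt(n), and its skewness melts away toward 0 (i.e. toward normality) as n grows, even though the underlying data stays just as skewed as ever.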

This serves as our first motivation for why we care about means: all finite-variance stationary distributions have them, they are pretty stable as the sample size increases, and they converge to their population means.

A second motivation is because the arithmetic mean minimizes the squared errors of any given sample. And sometimes that’s pretty useful. That is to say, if you have some sequence xᵢ of N numbers x₁, x₂, x₃, … x_N, and you pick some number y, the y that minimizes Σᵢ (xᵢ − y)² will be the arithmetic mean of the sequence xᵢ. (Also, the median minimizes the absolute error, i.e. Σᵢ |xᵢ − y|.) Basically the mean is the argmin of the very popular, very common squared error loss function.
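A quick numerical sanity check of both claims, using a coarse grid search over candidate values of y (the data here is arbitrary skewed junk for illustration):

import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0, sigma=1, size=500)   # arbitrary skewed data

candidates = np.linspace(x.min(), x.max(), 2_001)
sq_loss = ((x[None, :] - candidates[:, None]) ** 2).sum(axis=1)
abs_loss = np.abs(x[None, :] - candidates[:, None]).sum(axis=1)

print(candidates[sq_loss.argmin()], np.mean(x))     # argmin of squared error ~= mean
print(candidates[abs_loss.argmin()], np.median(x))  # argmin of absolute error ~= median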

A third motivation is because in many contexts, the mean gives us an estimator for a parameter we care about from an underlying distribution. For example:

  • The sample mean of a normally distributed random variable x converges to the mean parameter for that distribution.
  • The sample mean of a normally distributed random variable log(x) converges to the mean parameter of a log-normal distribution of x.
  • The sample mean of the centered squares, (xᵢ − x̄)², of a random variable x converges to a biased (but correctable) estimate of the distribution’s variance.
  • The sample mean of a Bernoulli distributed random variable x converges to the probability parameter of a Bernoulli distribution.
  • Given some r, the sample mean of a negative binomial distributed variable x, when divided by r, converges to the odds of the probability parameter p.

So by taking a mean, we can estimate these and many other distributions’ parameters.
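A couple of those bullets, verified numerically (the distribution parameters below are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(2)

# Bernoulli: the sample mean estimates the probability parameter p.
p = 0.3
bern = rng.binomial(1, p, size=100_000)
print(bern.mean())                              # ~0.3

# Log-normal: the sample mean of log(x) estimates mu.
mu, sigma = 1.5, 0.4
x = rng.lognormal(mean=mu, sigma=sigma, size=100_000)
print(np.log(x).mean())                         # ~1.5

# Variance: the mean of the centered squares is the (biased) variance estimate.
print(((x - x.mean()) ** 2).mean(), x.var())    # identical by definition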

Final Notes

  • I find the statement that “arithmetic mean > geometric mean > harmonic mean” to be a cute factoid but often worse than useless to point out, which is why it’s down here. More often than not, it is inappropriate to compare means like this. It either is or isn’t appropriate to take reciprocals before averaging, or to take the log before averaging. The comparison of magnitudes across these means is completely irrelevant to that decision, and this cute factoid may mislead people into thinking that it is relevant.
  • I think one technical reason the geometric mean is often defined using multiplication and an Nth root, rather than logs and exponentials, is that the multiplicative form technically allows negative numbers? However, I am not aware of any actual situations where a geometric mean would be useful but we’re allowing for negative numbers.
