Early in my career I ran acquisition at a startup in San Francisco, responsible for a six-figure monthly budget across Google and social. It was the best possible education in testing, mostly because the feedback was immediate and the money was real. When your changes move acquisition costs this week, you learn quickly what a well-designed test looks like. One quarter, a series of structured A/B tests cut our acquisition cost by 22.8% in three months without hurting return, and I walked away believing I understood experimentation.
I understood the mechanics. The purpose took longer.
The gap became obvious years later, sitting in readout meetings on the other side of the table. A test would end. The uplift was positive, the significance cleared the bar, the slide got a green tick, everyone nodded. And then, nothing. No rollout decision, no follow-up test, no change to the roadmap. The result was significant, and nobody could say what should be different on Monday. I have now seen far more experiments die of vague success than of failure, and vague success is worse, because failure at least teaches you something specific.
The root cause is almost always upstream of the statistics. The test was pointed at a metric instead of a decision. "Does the new component increase engagement" is a question a dashboard can love and a business cannot act on. Nobody agreed in advance what number would trigger what action, so no number could.
These days I own testing end to end, from hypothesis and sample sizing through to the statistical readout, and the structure I hold every test to has three parts. The statistics come first and matter: uplift with its uncertainty attached, honestly reported even when the honest version is less exciting. Then the part averages hide: which segments actually moved. An overall uplift is usually one audience responding strongly while others shrug, and knowing which is which is frequently worth more than the headline number. And finally the sentence the whole exercise exists for: what we do next. Scale it, kill it, or run the sharper test this result just revealed.
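To make the first two parts of that structure concrete, here is a minimal sketch of a readout that reports uplift with its interval attached, overall and per segment. It is an illustration under stated assumptions, not a description of any particular tooling: the function names, the segment names, and every count are invented for the example, and the interval is a plain Wald interval on the difference of two conversion rates, which is only one of several reasonable choices.

```python
from math import sqrt
from statistics import NormalDist


def uplift_with_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Absolute uplift (treatment rate minus control rate) with a Wald interval."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts, invented for illustration only:
# (control conversions, control users, treatment conversions, treatment users)
segments = {
    "new_visitors": (400, 10_000, 520, 10_000),
    "returning": (900, 10_000, 905, 10_000),
}


def readout(segments):
    """Return uplift and interval for each segment plus the pooled overall result."""
    rows = {}
    totals = [0, 0, 0, 0]
    for name, counts in segments.items():
        rows[name] = uplift_with_ci(*counts)
        totals = [t + c for t, c in zip(totals, counts)]
    rows["overall"] = uplift_with_ci(*totals)
    return rows


report = readout(segments)
for name, (diff, low, high) in report.items():
    # "moved" here means the interval excludes zero.
    verdict = "moved" if low > 0 or high < 0 else "no clear movement"
    print(f"{name:>12}: {diff:+.2%} [{low:+.2%}, {high:+.2%}] {verdict}")
```

With these made-up numbers the overall interval excludes zero, but only one of the two segments does, which is the pattern the headline number hides. Slicing by segment after the fact multiplies the chances of a spurious finding, so a sketch like this is best read as a prompt for the next, sharper test rather than as proof.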
That third part is where most readouts stop short, and it is the only part stakeholders remember a week later. So I apply the rule in reverse. Before a test runs, I write the endings in advance: if this wins, we do X; if it loses, we do Y; if it is flat, we learn Z. If we cannot fill in those sentences, the test is not ready, and no sample size calculator will rescue an experiment that was pointed at nothing.
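One way to hold a test to that rule is to write the endings down as data before launch and refuse to read out without them. The sketch below is a hypothetical illustration: the `DecisionPlan` class, its field names, and the thresholds are all assumptions made for the example, and the mapping from interval to outcome (a win requires the whole interval to clear a minimum worthwhile uplift, a loss requires the whole interval to sit below zero, everything else counts as flat) is one possible convention that a team would need to agree on for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionPlan:
    """Endings written before the test runs. All fields are illustrative."""

    min_worthwhile_uplift: float  # smallest uplift that would change what we do
    if_win: str
    if_lose: str
    if_flat: str

    def is_ready(self) -> bool:
        """A test is ready only when every ending has been filled in."""
        return all(s.strip() for s in (self.if_win, self.if_lose, self.if_flat))

    def next_step(self, ci_low: float, ci_high: float) -> str:
        """Map the uplift interval to the action agreed in advance."""
        if not self.is_ready():
            raise ValueError("test is not ready: an ending is missing")
        if ci_low >= self.min_worthwhile_uplift:
            return self.if_win
        if ci_high < 0:
            return self.if_lose
        return self.if_flat


# Hypothetical plan, invented for illustration.
plan = DecisionPlan(
    min_worthwhile_uplift=0.005,
    if_win="roll out to all traffic",
    if_lose="revert and drop the idea",
    if_flat="keep the current version and retest on the segment that moved",
)

print(plan.next_step(ci_low=0.0062, ci_high=0.0178))
```

The point of the structure is less the code than the order of operations: the three strings exist before any data does, so the readout can only select among actions that were already agreed.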
Run fewer tests. Point them at decisions. Write the "so what" before you write the hypothesis. The statistics are how you keep an experiment honest, but the decision is why you ran it.