Selecting success metrics in web experiments

Ron Kohavi is one of the most respected names in analytics and currently leads Microsoft’s Experimentation Platform. His publications, especially those since ~2001, are required reading if you’re in the business of data. I am constantly referencing his work, such as “Online Experiments: Lessons Learned,” an article written with Roger Longbotham and published in 2007:

A common pitfall in Web experiments is the use of multiple metrics. For an organization that seeks to run many experiments in a given domain, it’s strongly desirable to select a single quantitative measure, or overall evaluation criterion (OEC), to help determine whether a particular treatment is successful or not.

Selecting a single goal turns out to be hard: it’s much easier to come up with a set of goals than to whittle that set down to one. Even after settling on a goal, it’s difficult to ignore other data as you attempt to rationalize experiment results in light of preconceptions.

In the above paper Kohavi suggests selecting an OEC which accounts for “long-term objectives such as higher revenue… In many cases, the OEC is an estimate of users’ lifetime value.” In theory this is the right approach, but it can be challenging in practice, where most tests are designed to impact a single feature or channel. For example, how do you measure the success of a landing page A/B test comparing a red button versus a blue button? Taking Kohavi’s approach, you’d construct the following equation:

\text{Revenue} = (\text{Visitors})\left(\dfrac{\text{Visits}}{\text{Visitor}}\right)\left(\dfrac{\text{Registrations}}{\text{Visit}}\right)\left(\dfrac{\text{Customers}}{\text{Registration}}\right)\left(\dfrac{\text{Months}}{\text{Customer}}\right)\left(\dfrac{\text{Revenue}}{\text{Month}}\right)
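
To make the decomposition concrete, here is a minimal Python sketch of computing this OEC per variant from funnel counts. It is not taken from Kohavi’s paper, and every count and revenue figure below is a hypothetical placeholder for the red-versus-blue button example.

```python
# Sketch: a lifetime-value-style OEC computed from funnel counts.
# All numbers are hypothetical placeholders, not real experiment data.

def oec(visitors, visits, registrations, customers, customer_months, revenue):
    """Total revenue decomposed into the factors of the equation above."""
    factors = {
        "visits/visitor": visits / visitors,
        "registrations/visit": registrations / visits,
        "customers/registration": customers / registrations,
        "months/customer": customer_months / customers,
        "revenue/month": revenue / customer_months,
    }
    total = visitors
    for value in factors.values():
        total *= value  # the product telescopes back to total revenue
    return total, factors

# Hypothetical counts for a red vs. blue button landing-page test.
red = oec(visitors=10_000, visits=14_000, registrations=700,
          customers=140, customer_months=980, revenue=29_400.0)
blue = oec(visitors=10_000, visits=13_800, registrations=760,
           customers=130, customer_months=845, revenue=25_350.0)

for name, (total, factors) in (("red", red), ("blue", blue)):
    print(name, round(total, 2), factors)
```

The value of writing it this way is not the arithmetic (the product simply telescopes to total revenue) but that each factor isolates one step of the funnel, so you can see which step a treatment actually moved.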

This OEC takes into account both landing page performance (registrations per visit) and the downstream value of those conversions (revenue per registration). It’s a great OEC, but it’s also a stretch for most practitioners and most experiments: tracking these events with reliable attribution data is no small task. You also need to control for other variables, as Kohavi acknowledges:

Experimenters often ignore secondary metrics that impact the user experience such as JavaScript errors, customer-service calls, and Web-page loading time. Experiments at Amazon.com showed that every 100-ms increase in the page load time decreased sales by 1 percent, while similar work at Google revealed that a 500-ms increase in the search results display time reduced revenue by 20 percent.

Easier said than done, especially with limited resources. That is why many experimenters opt to define success metrics locally (landing page conversion rate, in this example) rather than globally. I have yet to come across a third-party tool that effectively solves this problem for non-technical audiences.
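
For contrast, here is what the “local” version of success typically looks like: a single conversion-rate comparison, ignoring everything downstream of the registration. This is a minimal sketch of one common approach (a two-proportion z-test), again with hypothetical counts rather than data from any real experiment.

```python
from math import erf, sqrt

# Local success metric: registrations per visit, compared between variants.
# Counts are hypothetical placeholders for the red vs. blue button test.

def two_proportion_z(reg_a, visits_a, reg_b, visits_b):
    """z statistic and two-sided p-value for the difference in conversion rates."""
    p_a, p_b = reg_a / visits_a, reg_b / visits_b
    pooled = (reg_a + reg_b) / (visits_a + visits_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tail approximation
    return z, p_value

red_reg, red_visits = 700, 14_000
blue_reg, blue_visits = 760, 13_800

print(f"red:  {red_reg / red_visits:.3%} registrations per visit")
print(f"blue: {blue_reg / blue_visits:.3%} registrations per visit")

z, p = two_proportion_z(red_reg, red_visits, blue_reg, blue_visits)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The appeal is obvious: it needs only two counters per variant and no attribution pipeline. The trade-off is equally obvious: a variant that wins on registrations per visit may still lose on the global OEC above if those registrations convert to customers or revenue at a lower rate.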