Why designers don’t like A/B testing

A while back I saw this tweet from a UX consultant that struck a nerve: “I don’t trust designers who don’t want their designs a/b tested. They’re not interested in knowing if they were wrong.”

I responded quickly in the short staccato bursts afforded by the 140-character limit while I was (at the time) jamming on some key designs needed by the Citrix CEO for his keynote the following day. Oops! Perhaps not the best time to engage in Twitter arguments ;-) For a long time now I’ve wanted to elaborate, beyond Twitter, on why (I think) many designers (certainly myself and most of my closest peers) do not love A/B testing.

And believe me, it has nothing to do with a lack of interest in being proven wrong.

As a former UI design consultant for two of the most famously data-driven internet firms, Netflix and LinkedIn, I totally understand the arguments for A/B testing and its commercial value at mercilessly incremental margins, literally nickels & dimes across millions of clicks. Yep, I get all that. Massive scale, long-tail value chains, and so forth. Tons of money! I get it.

However, as I responded via tweets, as a designer I must defend and assert aesthetic integrity as much as I can, keeping in mind key business metrics and technical limits. And, quite frankly, the first victims of A/B testing are beauty, elegance, charm, and grace. Instead we get an unsightly pastiche of uneven incrementalism, lacking any kind of holistic cohesiveness or any suggestion of a bold, vivid, nuanced vision that inspires users. A perplexing mashup of visuals, behaviors, and IA/navigation that leaves one gasping for air. It is the implicit charter of a high-quality design team (armed with user researchers and content strategists!) to propose something a user may not be able to imagine, that is significantly better, since they are so conditioned by mediocre design in the mainstream. (Look up Paul Rand’s famous quote.)

So why don’t many designers like A/B testing? I think it’s mainly the following:

* A/B testing may only be as effective as the designs being tested, which may or may not be high quality solutions. Users are not always the best judge of high quality design. That’s why you hire expert designers of seasoned skills, experience, judgment, and yes the conviction to make a call as to what’s better overall.

* As is true with any usability test, you gotta question the motives behind the participants’ answers/reactions. Instead, biz/tech folks look at A/B test results as “the truth” rather than a data point to be debated. Healthy skepticism is always warranted in any testing. Uncovering the rationale for a metric is vital.

* A/B testing is typically used for tightly focused comparisons of granular elements of an interface, resulting in poor pastiches with results drawn from different tests.

* How do you A/B test novel interaction models, conceptual paradigms, visual styles (by the way, visuals & interactions have a two-way rapport, they inform each other, can’t separate them–see Mike Kruzeniski’s talks) which may vary wildly from before? Would you A/B test the Wii or Dyson or Prius or iPhone? Against what???

* A/B testing locks you into just two comparative options, an exclusively binary (and thus limited) way of thinking. What about C or D or Z or some other alternatives? What if there are elements of A & B that could blend together to form another option? Avenues for generative design options are shut down by looking at only A and only B.

* Finally, A/B testing can undermine a strong, unified, cohesive design vision by just “picking what the user says”. A designer (and team) should have an opinion at the table and be willing to defend it, not simply cave in to a simplistic math test for interfaces.

——–

Ultimately, no A/B test proves a design “wrong”. Designs can’t be “proven” wrong, only shown to be in need of improvement or further iteration. Therein lies the real flaw of the original statement: the assumption that designs are either “right” or “wrong” is inaccurate. Instead, designs are “better” or “worse” depending on the audience, context, and purpose, not to mention business strategy. Designers (and researchers) seasoned in the craft of software understand this deeply.

A/B test results perpetuate a falsely comforting myth that designs can be graded like a math test, in which there’s a single right answer. Certainly this myth soothes the nerves of an anxious exec about to make a multi-million-dollar bet on the company’s future :-) Wanna relieve anxiety? Take Prozac. Wanna achieve top-quality design results? Then assert confidence in a rigorous creative process as promoted (and well articulated) by Adam Richardson, Luke Williams, Jon Kolko, and others…as well as in your design team. Because if you hired top-quality designers & researchers with a sensible PM and a skilled engineering team, you more than likely have a pretty darn good product on your hands.

At the end of the day, A/B testing should NOT be used as a litmus test of a design or a designer. It’s a single data point, that’s all. It can be compelling, no doubt. Its level of impact and value varies per product/company/market, however. And just like Roe v. Wade, which has become an unfair litmus test for Supreme Court candidates (as part of a greater political-media circus, a whole separate issue), using A/B testing this way only polarizes things and makes the vetting process of a design or designer unnecessarily difficult. And you risk dissuading top-quality design talent from joining the team’s cause for good, beautiful, useful designs that improve the human condition. After all, isn’t that what we are all fighting for? A/B testing is simply one tool; not something by which to judge the character or quality of a professional (nor her work) striving to do what’s right with integrity.

 

8 comments

  • > And, quite frankly, the first victims of A/B testing are beauty, elegance, charm, and grace.

    Right, and the first casualty of a designer’s-opinion-only approach is “beautiful” designs that users hate. (Of course, neither premise is necessarily true, this is just a baseless appeal to emotion. Who could be against “beauty, elegance, charm, and grace”?) Can you demonstrate from, say, ABtests.com or Whichtestwon.com that this is the case? Or have you simply had bad experiences at highly data-driven companies? Is Netflix, for example, guilty of sacrificing “beauty, elegance, charm, and grace”?

    > Instead we get an unsightly pastiche of uneven incrementalism, lacking any kind of holistic cohesiveness or any suggestion of a bold, vivid, nuanced vision that inspires users.

    Such as? Are you suggesting we go with poor-performing designs because in one or a few people’s opinion, they “inspire” users? How do you know users are inspired if their performance suffers?

    > It is the implicit charter of a high-quality design team (armed with user researchers and content strategists!) to propose something a user may not be able to imagine, that is significantly better, since they are so conditioned by mediocre design in the mainstream.

    IMO, it’s implicit that a high-quality design team could come up with 2 or more such designs, or design variations, to test. Not only that, but that they would come up with design variations to test that *are* cohesive, elegant, and beautiful.

    > A/B testing may only be as effective as the designs being tested, which may or may not be high quality solutions. Users are not always the best judge of high quality design. That’s why you hire expert designers of seasoned skills, experience, judgment, and yes the conviction to make a call as to what’s better overall.

    Make a call based on what? Gut feel? The quality of designs being tested is not a problem of A/B testing — if you only want users to choose from “high quality design”, only test “high quality design”. I don’t understand the objection here. If a designer is truly an expert with seasoned skills, they should be the first to push for A/B testing of their best ideas, not retreating into vague notions of what they feel is best. Designers should own and drive the A/B testing process, not feel like they are victims of it.

    > As is true with any usability test, you gotta question the motives behind the participants’ answers/reactions. Instead, biz/tech folks look at A/B test results as “the truth” rather than a data point to be debated. Healthy skepticism is always warranted in any testing. Uncovering the rationale for a metric is vital.

    People buy or they don’t. They click or they go. You can’t “question the motives” of hundreds or thousands of users; that doesn’t make sense. You can’t wish away the data.

    > A/B testing is typically used for tightly focused comparisons of granular elements of an interface, resulting in poor pastiches with results drawn from different tests.

    Poor A/B testing practice may be a problem, fine, but let’s not throw out the baby with the bathwater.

    > How do you A/B test novel interaction models, conceptual paradigms, visual styles (by the way, visuals & interactions have a two-way rapport, they inform each other, can’t separate them–see Mike Kruzeniski’s talks) which may vary wildly from before?

    Uh, you run an A/B test and measure what you’re doing.

    > Would you A/B test the Wii or Dyson or Prius or iPhone? Against what???

    Category error. We’re talking about graphic/interaction/web design. All those companies absolutely should A/B test their web sites, for example.

    > A/B testing locks you into just two comparative options, an exclusively binary (and thus limited) way of thinking. What about C or D or Z or some other alternatives? What if there are elements of A & B that could blend together to form another option? Avenues for generative design options are shut down by looking at only A and only B.

    Seriously? Even basic free tools like Google Website Optimizer let you A/B/n or multivariate test.

    > Finally A/B testing can undermine a strong, unified, cohesive design vision by just “picking what the user says”. A designer (and team) should have an opinion at the table and be willing to defend it, not simply cave into a simplistic math test for interfaces.

    Sure, and not A/B testing can undermine a business by costing them thousands of dollars by not “picking what the user says”. Pitting design teams against a “simplistic math test” is a silly way to polarize the issue. The designer or design team should be thrilled they have such a wealth of data at their disposal to come up with the best design possible. It’s not about fighting, defending, or caving, it’s about using the data to know — and not blindly guess about — what’s going on.

    > A/B test results perpetuate a falsely comforting myth that designs can be graded like a math test, in which there’s a single right answer.

    You assert this several times throughout the piece but never provide any evidence. Can you shed some light on real case studies where this happens?

    Indeed, if there is any “falsely comforting myth” doing the rounds, it’s that designers can guess and know what thousands of disparate users will prefer within a single percentage point of accuracy. If you can guess that well, I’d like you to pick my stocks.

    > And you risk dissuading top quality design talent from joining the team’s cause for good, beautiful, useful designs that improve the human condition.

    Whose “human condition” are you improving by potentially harming clients’ businesses, blindly rolling out designs which may in fact cost them thousands (or hundreds of thousands, or millions) of dollars, and jobs, and revenue, and so on…? This is another silly, polarizing argument.

    If you want “good, beautiful, useful designs”, then A/B test “good, beautiful, useful designs”. It’s not us vs them. We can have our cake and eat it too!
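    To make the A/B/n point above concrete, here is a minimal sketch of how a test harness might bucket users into more than two variants. This is an illustrative assumption, not any particular tool’s API; the user IDs and variant labels are hypothetical.

    ```python
    import hashlib

    def assign_variant(user_id, variants):
        """Deterministically bucket a user into one of n variants."""
        digest = hashlib.sha256(user_id.encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    # A/B/n: nothing limits the test to two arms.
    arms = ["A", "B", "C", "D"]
    print(assign_variant("user-42", arms))
    ```

    Because the bucketing is deterministic, the same user sees the same arm on every visit, so behavior can be measured consistently across the whole test.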

  • You do realize that the practice of A/B testing also allows an unlimited number of variations? a-z?
    Furthermore, if two ‘holistically aesthetic’ designs are A/B tested against each other for either bounce rate or outright conversion, why would that attack the aesthetics of a website?

  • According to my understanding about the process of A/B testing I would like to make a few comments:

    “A/B testing may only be as effective as the designs being tested, which may or may not be high quality solutions. Users are not always the best judge of high quality design”
    Users are not asked if they like A or B better. Their behavior is measured. Users do not make a judgement during an A/B test. You measure how they behave under condition A and compare to the behavior under condition B.

    “Instead, biz/tech folks look at A/B test results as “the truth” rather than a data point to be debated.”
    It looks pretty much like prejudice. I agree A/B testing is a data point. It doesn’t measure everything that might be important about a design.

    “How do you A/B test novel interaction models”
    You can get interesting data out of it. The importance lies in the design of questions that the A/B test should answer. Asking the wrong questions will screw up everything.

    “A/B testing is typically used for tightly focused comparisons of granular elements of an interface, resulting in poor pastiches with results drawn from different tests.”
    I am trying to understand this conclusion but cannot get my head around it. Could you elaborate a bit more on what you mean?

    “Would you A/B test the Wii or Dyson or Prius or iPhone? Against what???”
    You can A/B test an iPhone. It really depends on the metrics, however. If you ran an iPhone against an old-school smartphone (the ones before the iPhone came out), you would see that the iPhone encourages internet usage. Thus, if you were Steve Jobs, you knew you were on the right track.

    “A/B testing locks you into just two comparative options, an exclusively binary (and thus limited) way of thinking. What about C or D or Z or some other alternatives?”
    Nope. You can do A/B/C/D… testing as well. Who said you couldn’t?

    “Finally A/B testing can undermine a strong, unified, cohesive design vision by just “picking what the user says”.”
    Clearly, no. Let’s consider a website with an old, ugly design (à la Craigslist). But the CEO wants to open it up to new customers, so he needs a “better” design: one that attracts new customers AND keeps his old customers happy. Thus you A/B test the design candidates with your old customers and with your new customers.
    Then the CEO can make a well-informed decision about what to do.
    However, his options are not only to keep A or to take B. He could also ask for and try more designs.

    A/B testing gives valuable quantitative insights. But it doesn’t usually measure everything that is important.

    My two cents
    JE
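    The “measure how they behave under condition A and compare to condition B” point can be sketched with a standard two-proportion z-test on conversion counts. The numbers below are invented purely for illustration:

    ```python
    import math

    def two_proportion_z(conv_a, n_a, conv_b, n_b):
        """z-statistic comparing conversion rates of two variants."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        return (p_a - p_b) / se

    # Hypothetical counts: A converted 120/2400 visitors, B converted 150/2400.
    z = two_proportion_z(120, 2400, 150, 2400)
    print(round(z, 2))  # -1.88: B converts better, but short of the usual |z| > 1.96 bar
    ```

    Note that the statistic only says which arm converted better, not why, which is exactly where the qualitative debate JE describes still belongs.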

  • I feel proper A/B testing is more about testing content, not design. Those content changes may cause the design to change, but at that point you’ve changed so many variables, you no longer have an A/B test on your hands.

    Ideally you’re testing the information you’re presenting to your users and you trust your designers and their experience to properly present those different pieces of information. When you start testing things like button colors and the size of objects, you’re testing design theory. That kind of stuff can seem unproductive to me.

    I think you’re missing the point about A/B tests somewhat. I’m an SEO and I can get a site to rank on page 1 of Google’s results for any number of keywords. However, if those keywords are sending traffic to the site that doesn’t convert, then I haven’t done my job properly.

    The same applies to design. A website can have a lot of “beauty, elegance, charm, and grace” but if it’s not making the owner any money, the design isn’t doing its job.

    Every month I make a report for my SEO clients that shows the results of my efforts, in painstaking detail, down to every keyword and how much money they made off of it. My work is 100% accountable.

    Design is not quite so accountable. But with A/B and, especially, multivariate testing, it can become more accountable. Design elements can be proven to need improvement. I think that is first and foremost what designers feel uncomfortable with: their primarily emotion-driven discipline being exposed to data-driven accountability.

    I do however think there’s a limit to how far A/B and MVT testing should go. The cohesion of the design should be protected, lest the site looks messy and disjointed.

    But the client doesn’t necessarily care about that. It’s about the bottom line, after all. And think of some of the web’s biggest money-making websites: aesthetics isn’t very high on their list. (Amazon, for example: not pretty, but does it sell.)

  • I found this discussion fascinating, and wrote a full response here: http://beyroutey.posterous.com/design-by-data-why-ab-and-intuitive-design-bo

    Other comments have already covered that multivariate testing is possible in the A/B format. I think that misunderstanding contributes to a lot of the poor perceptions of A/B testing among designers.

    But I also think there’s something deeper here: a reluctance on the part of designers to accept that designs need to be prototyped tens of times before the best one can be found. Good design is not purely a result of divine inspiration and tacit knowledge, but also of having tried lots of possibilities.

    A/B testing helps filter out the solution-space once lots of possibilities have been generated, and to guide the decision process where many choices are arbitrary.

    I’m thinking the real disagreement here is in parts versus wholes. A/B testing is for testing parts of a website, but aspects such as grace and beauty are products of the whole being consistent throughout. Both have their place, and the interaction between the two should be a topic of discussion as opposed to a topic of debate where one is right and one is wrong.

    I’m loving that these conversations are taking place though.

    I don’t have time for a response as verbose as the ones above, but would like to point out just a couple of items. Any test is only as good as the person preparing it; tests can be skewed, abused, and bent to agendas. The key to a quality A/B test is measuring behavior, and the components being measured should be provided by trained designers. There is no reason why the two comparison elements can’t be well designed.
