Scott Huffman runs one of the least-known units at Google: the evaluation team that measures the impact of every little proposed change to the leading search engine. And with some 6,000 experiments run annually, he’s pretty busy. Not to mention, he runs mobile search at Google, too.
In a recent interview for my story on how Google’s trying to stay ahead in search, Huffman explained in detail how Google runs all those experiments—which include the use of hundreds of human evaluators in addition to Google’s massive computer infrastructure.
This is the third of a four-part series that began with search chief Udi Manber and Google Fellow Amit Singhal, who heads the core search result ranking unit. Next up on Sunday is Matt Cutts, head of the anti-Web spam unit.
Q: What does the evaluation unit do?
A: We try to measure every possible which way we can think of how good is Google, how good are our search results, how well are they serving our users. And we break that down all kinds of ways—by 100 locales [country plus language pairs], by different genres (product queries, health queries, local queries, long queries, queries that don’t happen very often, queries that are very popular) times how are we doing on those in France and Switzerland and other places.
So we look at that two ways. One is an ongoing thing, we want to know how we’re doing over time and that’s changing over time on a representative sample of the query stream that we get. Second, the group is tasked for a proposed search improvement with designing a set of experiments, a set of measurements, that allow us to say with some kind of statistical meaning is the change good?
Q: Can you give me a sense for how you approach evaluation?
A: We use two main kinds of evaluation data. One kind is we have human evaluators all over the world for whom we have a workflow system. They come to it and are fed things to evaluate. A typical thing is: Here is a query, you’re speaking French in Switzerland, here’s a URL, tell us on some kind of scale or some set of flags and description how good of a URL is that for that query.
The other data source we use is live experimentation with our users. A typical example where we use that more is for user interface changes to search. It’s hard to guess what people’s reaction will be to any particular UI change.
Q: How do you decide what query and URL result to evaluate?
A: It may be that we’re measuring over time. It may be that there’s some experiment, a new filter, a way to improve the ranking of something, and it changes the result for that query, so we want to get measurement of are the new results better.
Our set of tools allows us to design any kind of task that we want. Rate a URL by how good it is for a query is one that we’re doing all the time. I have a whole team of statisticians who are both experts at experimental design and at analyzing the data that comes out. They’re called search quality analysts, but they’re really statisticians. So we’ll design in some cases very specific experiments to measure particular things.
Q: Can you give me an example of how the process works?
A: One of the types of experiments that we run a lot is we will take a sample of queries, run it through the baseline system and the new system and we want to look at queries where the query changes things. You might think of this almost as QA [quality assurance], but in typical QA, bugs are always basically bad. So a fix is 100% good.
In search, it doesn’t really work like that. Almost anything you do is at least in ranking, you win some and you lose some. So someone comes in and says, “I’ve got a great idea, we’ll take all the documents that start with the letter Q and move them up three spaces.” I bet you can find some queries where that will help. But of course in aggregate, that’ll be terrible.
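The "you win some and you lose some" comparison can be illustrated with a minimal Python sketch. It is not Google's method: the function name, the per-query scores, and the sample queries are all hypothetical, and a changed score stands in here for a query whose results changed between the baseline and the new system.

```python
from statistics import mean

def win_loss_summary(baseline, candidate):
    """Tally wins and losses over queries whose score changed.

    baseline and candidate map query -> quality score (higher is better).
    Both the scoring scale and the data are illustrative assumptions.
    """
    deltas = {
        q: candidate[q] - baseline[q]
        for q in baseline
        if q in candidate and candidate[q] != baseline[q]
    }
    wins = sum(1 for d in deltas.values() if d > 0)
    losses = sum(1 for d in deltas.values() if d < 0)
    mean_delta = mean(deltas.values()) if deltas else 0.0
    return {
        "changed": len(deltas),
        "wins": wins,
        "losses": losses,
        "mean_delta": mean_delta,
    }

# Hypothetical scores for a rule like "boost documents starting with Q".
baseline = {"quilt patterns": 3, "quantum computing": 4, "flowers": 5, "weather": 4}
candidate = {"quilt patterns": 4, "quantum computing": 2, "flowers": 2, "weather": 4}

summary = win_loss_summary(baseline, candidate)
print(summary)
```

In this toy data the change helps one query and hurts two, so the aggregate is negative even though a win exists.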
Q: How does this work in practice?
A: Here’s one example that’s the bread and butter of what we do a lot of. We’re constantly working issues on things around stemming [what variances of the word should be part of the query] and synonymy [what synonyms should be part of the query]. So we had a project a little while back, for Chinese where the engineers were making our system for synonymy more aggressive. They were saying we should be more aggressive about putting in variants… to make a broader range of synonyms that would show [results].
We did an evaluation where we had the evaluators basically look at the old results and the new results and say what they liked about them, but they didn’t know which was old and new. It actually was highly positive. So we said this is great, that says this is better for users.
But the other thing we always do is we go in and look in more detail at what are some of the individual positive and negative things that we’re getting out of this. Are the positive things really that positive, will they really make a difference to our users? And maybe more important, for the negative things, how important are they, can we live with them?
And this time we went and looked at the negative things, and what we found was that even though in aggregate the change was pretty highly positive, we found things like, I forget specifically, it was things like we were synonymizing “small” and “big,” crazy things. Really bad, like this would be hugely embarrassing if we launch this. In that case, we went back and said, OK, you can’t launch this, but here, look at some of these examples and see if you can change something.
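A blind side-by-side comparison followed by a look at the individual losses might be sketched as follows. This is an illustration under stated assumptions, not the workflow system described in the interview: `blind_pair`, `unblind`, `run_blind_eval`, the toy rater, and the sample data are all invented for the example.

```python
import random

def blind_pair(old_results, new_results, rng):
    """Randomly place old/new on sides A and B; return the sides and the hidden key."""
    if rng.random() < 0.5:
        return (old_results, new_results), ("old", "new")
    return (new_results, old_results), ("new", "old")

def unblind(preference, key):
    """Map a rater's 'A' or 'B' choice back to 'old' or 'new'."""
    return key[0] if preference == "A" else key[1]

def run_blind_eval(pairs, rater, seed=0):
    """Collect which queries raters preferred under the new vs. the old system."""
    rng = random.Random(seed)
    prefer_new, prefer_old = [], []
    for query, old, new in pairs:
        sides, key = blind_pair(old, new, rng)
        pick = rater(query, sides[0], sides[1])  # rater never sees the key
        if unblind(pick, key) == "new":
            prefer_new.append(query)
        else:
            prefer_old.append(query)
    return prefer_new, prefer_old

# Hypothetical stand-in for a human rater: counts results judged relevant.
RELEVANT = {
    "small house": {"small house plans"},
    "cheap flights": {"cheap flights", "low-cost flights"},
    "car repair": {"car repair", "auto repair"},
}

def toy_rater(query, side_a, side_b):
    score_a = sum(1 for r in side_a if r in RELEVANT[query])
    score_b = sum(1 for r in side_b if r in RELEVANT[query])
    return "A" if score_a >= score_b else "B"

pairs = [
    ("small house", ["small house plans"], ["big house plans"]),
    ("cheap flights", ["cheap flights"], ["cheap flights", "low-cost flights"]),
    ("car repair", ["car repair"], ["car repair", "auto repair"]),
]

prefer_new, prefer_old = run_blind_eval(pairs, toy_rater)
print("new preferred:", prefer_new)
print("review these losses by hand:", prefer_old)
```

The aggregate favors the new system two to one, but the list of losses is what surfaces the one embarrassing case for manual review.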
Q: Do these kinds of variations based on language or geographic location happen a lot? Do you have to make changes pretty specific to those factors?
A: Only occasionally do we make changes only for certain locales. Most of what we do applies to many locales at once. Occasionally it works really well everywhere but Spain, [and we say], “What the hell’s going on?” Sometimes it’s a bug, sometimes it’s that there’s something different about how content is published in that country.
Q: I can imagine having to test so many variables could quickly become unscalable. How do you avoid that?
A: We try to focus our evals on a country level or a locale level where the greatest impacts are. Which are the most queries affected? In terms of queries to look at, what results to look at, we similarly pick the same way—samples that happen often enough.
This is why the set of statisticians is so important. This is one of their jobs, to try to help us design an evaluation that we can do, that will give us data that’s meaningful. Obviously Google gets zillions of queries a day. Lots of them have never happened before. Lots of them we’ll never see again. Obviously we can’t measure all the queries. A lot of search evaluation is understanding statistically whether something is a meaningful change and looking at the impact.
People have a tendency to pick what we would call really popular queries [to compare search engines]. Look, I typed “flowers” in both and this one showed me pictures and this one showed me flowers to buy, and I like pictures better. But we get a lot of queries that are a lot more rare than that, and we feel that our gap in terms of the competition really opens up there.
Q: So you feel that Google shines in that medium to long tail?
A: We start to see differences even with queries that are popular, but where either there’s a lesser-known navigational result—there’s really a right answer that should be at the top—we do start to see results open up between us and our competitors where you get past the obvious queries.
Q: How are personalized search results evaluated—any differently?
A: We do pretty specific kinds of eval for personalization. We obviously can’t use human evaluators in quite the same way, because we don’t know what they like and we can’t invade their privacy. We tend to use more click-based evaluation for personalization. You can look in aggregate where for all the folks who are logged-in users whom this kind of personalization is applying to, we take some small percentage and we’re going to apply some new kind of personalization—what happens to them vs. the control group?
A lot of the things we’ve done in the last year or two in personalization come out very, very clearly in that testing—like it really, really works. That’s nice for me because we don’t have to debate it for long.
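One common way to compare a treated group against a control group on clicks is a two-proportion z-test. The sketch below uses that test purely as an illustration; the interview does not say which statistics Google applies, and the function name and the counts are hypothetical.

```python
from math import erf, sqrt

def two_proportion_z(clicks_control, n_control, clicks_treated, n_treated):
    """Two-sided z-test for a difference in click rates between two groups."""
    p_control = clicks_control / n_control
    p_treated = clicks_treated / n_treated
    pooled = (clicks_control + clicks_treated) / (n_control + n_treated)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treated))
    z = (p_treated - p_control) / std_err
    # Standard normal CDF via erf, then a two-sided p-value.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical counts: users who clicked a top result in each group.
z, p_value = two_proportion_z(3000, 10000, 3200, 10000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A large positive z with a small p-value would correspond to the kind of result that "comes out very clearly" and needs little debate.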
Another thing that we spend a lot of time on is at the country level. Many countries speak English, but when I type in, say “bank,” I want pretty different answers if I’m in the U.S. vs. the U.K. vs. India vs. Australia. And today Google gives you very different answers for those. It also applies inside the country: in Dallas and Atlanta, you’ll get different results for “First Baptist Church.” Those kinds tend to be a little trickier for us.
Q: Who are these human raters?
A: They’re not volunteers. They’re paid, through contractors from third parties. We look for a basic level of education and communication skills, and in particular our one requirement is that they need to be able to [communicate at] some level in English. Other than that, what we’re really looking for is a broad cross-section of folks. Not a technology background, just like to use the Internet. We have some screening around testing of their ability to do some of the tasks we want them to do and follow the instructions.
Q: Can anyone sign up to do this?
A: The [temp agencies] find them through advertising in places, like Craigslist. The main thing is who will respond to a work-at-home ad in Luxembourg, or wherever. It used to pay $15 to $17 an hour in the U.S., but it depends where you are in the country. I had a cousin who was one of the evaluators for a while and she lived in South Dakota. She was down visiting and I said, sort of joking, “I got a side job for you if you need some extra money.” “Oh, what’s it pay?” “Oh, $16 or whatever.” She said, “$16 an hour! How do I get that job?” She was very excited.
Q: How important are the human raters vs. the more automated methods? And do you get alerts if search results or search behavior is not what you expect?
A: The human evaluators are pretty important for us today. The more automated or user behavior/click-based things really give you complementary kinds of data. Both have noise in them: Human evaluators make mistakes. Clicks are hard to interpret; people click or don’t click for all kinds of reasons.
The clicks obviously tell you what users are actually doing, and you get them at volume, at a real scale, but it’s hard to interpret. Human evaluators, there’s noise in terms of mistakes, but we can go deeper on specific examples. For this pretty rare query, we can generate examples that our ranking engineers can go look at.
Where we don’t get a correlation, that’s a big red flag. We have had cases where either the human evaluation was positive but from a click point of view it looked kinda negative, or vice versa. Then we have to look and see: Are we asking the human evaluators the wrong thing or is there something funny about how we’re measuring the clicks? So we use them to corroborate each other.
Q: And the automated part, how does that work?
A: We have a pretty comprehensive system that uses both of these kinds of data to do corroboration. We have it running on an ongoing basis. It’s sort of like if you’re running a data center, you have something on the machines all the time that’s checking memory usage or other performance factors.
At a [search] quality level, we have something similar. On a continuous basis in every one of our data centers, a large set of queries are being run in the background, and we’re looking at the results, looking up our evaluations of them and making sure that all of our quality metrics are within tolerance.
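The background check described above, running known queries and comparing a quality metric against a tolerance, could look roughly like this in Python. Everything here is an assumption made for illustration: the metric (mean stored score of the top results), the threshold, the function names, and the example.com URLs.

```python
from statistics import mean

def mean_top_score(results, scores, k=3):
    """Average the stored evaluator scores of the top-k results (unscored = 0.0)."""
    top = results[:k]
    return mean([scores.get(url, 0.0) for url in top]) if top else 0.0

def out_of_tolerance(run_query, scored_queries, threshold, k=3):
    """Return (query, metric) pairs whose metric falls below the threshold."""
    alerts = []
    for query, scores in scored_queries.items():
        metric = mean_top_score(run_query(query), scores, k)
        if metric < threshold:
            alerts.append((query, metric))
    return alerts

# Hypothetical evaluator scores for previously rated results.
scored_queries = {
    "california state home page": {
        "ca.example.com/home": 1.0,
        "ca.example.com/news": 0.5,
    },
    "texas state home page": {"tx.example.com/home": 1.0},
}

# Hypothetical stand-in for the live search system.
fake_index = {
    "california state home page": ["ca.example.com/home", "ca.example.com/news"],
    "texas state home page": ["spam.example.com/a", "spam.example.com/b"],
}

def run_query(query):
    return fake_index.get(query, [])

alerts = out_of_tolerance(run_query, scored_queries, threshold=0.5, k=2)
for query, metric in alerts:
    print(f"ALERT: {query!r} scored {metric:.2f}")
```

In a real deployment the alert would feed a paging or monitoring system rather than a print statement.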
These are queries that we have used as ongoing tests, sort of a sample of queries that we have scored results for; our evaluators have given scores to them. So we’re constantly running these across dozens of locales. Both broad query sets and navigational query sets, like “San Francisco bike shop” to the more mundane, like: Here’s every U.S. state and they have a home page and we better get that home page in the top results, and if we don’t … then literally somebody’s pager goes off.
Q: Google said on a recent analyst call that there was an acceleration of improvements. Are you doing more than you used to?
A: We’ve been launching several hundred in the past couple of years at least. Maybe now has leveled off after growing a lot for several years. We’re doing a lot more on the UI side. We’re trying to do a lot more experimentation—how can we push the envelope?
You don’t want to be just the 10 blue links. I see us definitely trying to be more aggressive with a lot of features that will start to show up on the search page. Even if you look at the search page today compared with a couple years ago, it’s actually quite a bit different.
Q: Why is it changing faster? What’s pushing Google to make more changes?
A: Google is constantly working on improving the core ranking. There I think we’re running fast and have been for a while. On the UI side, my impression is that the bar for what people expect from search engines is higher today. That’s partly because of the features Google and others have added.
When I type “movies” into Google, I expect Google to know where I am and bring back the times of the movies that are playing, not bring me just Web links that happen to match the word movies. When I type “pizza in san francisco,” I expect Google to bring me back a map and some links to good pizza places with reviews underneath.
At the same time, you can easily screw this up and go nuts on this stuff and make the search experience distracting. People make fun of the 10 blue links, but there’s a wonderful thing about 10 blue links, which is that they’re very predictable. Your eye knows where to go. It’s optimized for scanning and easily finding what you need. I don’t think we’re ready to pitch that out the window.
Q: Will there have to be some fundamental shift in UI to deal with the new things people want from search?
A: I don’t think we’ve hit that point yet. When I feel like Universal Search [which provides links not just to pages but to videos, maps, and other material] has broken, it’s not so much that you couldn’t make a great Universal results page, it’s that somehow we’re misfiring. Something’s showing up that shouldn’t. Our way of triggering our algorithms is off somehow. But not so much the paradigm is broken.
Q: What has kept the longtime search quality people here?
A: The folks that I work with, besides being the world’s experts at what they do, they just love it. Search is such a rich problem. It’s not like building an application. Once I build it, the main problem is done.
Search isn’t like that. There’s just an endless supply of very difficult, challenging, but incredibly interesting problems to work on. We’re just nowhere near making that fountain run out.