Kaggle's William Cukierski on Data Sharing, Competitions
Kaggle’s William Cukierski joins our experts discussing the untapped potential of data analysis in medicine, education, and elsewhere, along with the pitfalls that may lie ahead.
What’s the idea behind Kaggle?
Big data was coming down the pipeline and many organizations were saying, “Oh, this data thing is going to be so big and so important. And we’re collecting all this stuff and we’re not using it.” And then at the same time you had people who are really good at working with data, but they’re all siloed away in their jobs. You know, “I’m an insurance person; I only work on insurance problems. I don’t ever touch grocery store shopping data sets.” So Kaggle recognized this and tried to matchmake, through competitions.
Which competitions have yielded the most exciting results?
One of my favorites was called the Whale Detection Challenge. The right whale is an endangered species that lives in the Atlantic. Researchers had buoy networks out there constantly recording for whale noises, and they had an algorithm that worked OK. They said, “Let’s give it to Kaggle and see if they can do better.” And people ended up doing phenomenally on it. Now these buoys detect whale noises with something like 99 percent accuracy. I think it’s really cool that someone sitting at a desk in New York can take on a problem so far away, so removed from anything they’d ever do in a day-to-day job, and actually provide some benefit in a real-life case.
You’ve also looked at using data analysis for cancer research. Does Kaggle run many contests in health-care-related fields?
We haven’t gotten much traction in medical stuff at Kaggle. That’s largely because of the problems of giving out patient data. It’s very hard to get the HIPAA compliance and all the approvals.
Another problem is that the people and institutions who have this data hoard it. Pharmaceutical companies have data on pharmaceutical trials they keep in silos. There are some ham-handed efforts to share data, and places promise they’re all going to work together, but at the end of the day there is still this desire to keep things to themselves.
The privacy concerns are legitimate to a certain extent. You don’t want to give someone’s genome out and then have everyone find out it’s Sally Smith at 232 Main Street. But at the same time these concerns are extended too far. People really play a game, saying you can’t have anything even remotely useful to solve a problem unless it was specifically pulled for you and given to you. If you could get rid of that, you could make some really nice advances.
You’re running open contests where anyone can participate, but it also seems like organizations with this data might want to keep it close to the chest. Is there a tension there?
One of the biggest day-to-day challenges I face is convincing people that they can put out data and that it’s not going to threaten their organizational livelihood. Oftentimes the data isn’t inherently valuable in itself; the value lies entirely in being able to take action on it. If we get a data set from an organization and it’s all made public, and the way that the problem was solved is public, it still doesn’t matter, because nobody else has that same data and nobody else is in the position to keep getting that data and acting on it.
You have moved toward making more of the contests private, though, right?
Yeah, they’ve been our solution to that problem of what happens when an organization is just too big and has too many lawyers in-house and just says, “No. Nothing gets past the firewall.”
Where do you think that the hoopla around big data is most out of control?
I’ll have to rephrase your question and ask, where is it not out of control? It’s really hard to talk to people and not have that conversation be the dominant thing, not have somebody’s boss step in and be like, “Well, let’s do big data.”
I think people are particularly out of control on the volume front. They’ll say, “Oh, we have petabytes of data, we have terabytes of data.”
Most problems can be solved at much smaller scales. An example is lima beans going by on a conveyor belt. The company selling these lima beans wants to knock the spoiled ones out of the conveyor belt using a camera. You can imagine that once you’ve seen one brown lima bean, you’ve seen them all. You don’t need to have trillions of terabytes of data to solve that problem.
I’d say 95 percent of problems fit into that model. There are still 5 percent where the algorithms are very hungry: you feed in more and more data and they can make good use of it. Netflix’s movie recommendations are an example.
How do you deal with information overload in your personal life?
I’m one of the few people in the world, I think, who works on a data set in the morning to identify bird species and then works on a credit default modeling problem in the afternoon. So I am very much overloaded in the sense that every day I’m a totally new idiot in a different field. And it’s something that I’ve come to embrace.
For more conversation and video, visit: www.businessweek.com/fix-this/big-data.