When historians look back at the turmoil over prejudice and policing in the U.S. over the past few years, they’re unlikely to dwell on the case of Eric Loomis. Police in La Crosse, Wis., arrested Loomis in February 2013 for driving a car that was used in a drive-by shooting. He had been arrested a dozen times before. Loomis took a plea, and was sentenced to six years in prison plus five years of probation.
The episode was unremarkable compared with the deaths of Philando Castile and Alton Sterling at the hands of police, which were captured on camera and distributed widely online. But Loomis’s story marks an important point in a quieter debate over the role of fairness and technology in policing. Before his sentence, the judge in the case received an automatically generated risk score that determined Loomis was likely to commit violent crimes in the future.
Risk scores, generated by algorithms, are an increasingly common factor in sentencing. Computers crunch data—arrests, type of crime committed, and demographic information—and a risk rating is generated. The idea is to create a guide that’s less likely to be subject to unconscious biases, the mood of a judge, or other human shortcomings. Similar tools are used to decide which blocks police officers should patrol, where to put inmates in prison, and who to let out on parole. Supporters of these tools claim they’ll help solve historical inequities, but their critics say they have the potential to aggravate them, by hiding old prejudices under the veneer of computerized precision. Some people see them as a sterilized version of what brought protesters into the streets at Black Lives Matter rallies.
Loomis is a surprising fulcrum in this controversy: He’s a white man. But when Loomis challenged the state’s use of a risk score in his sentence, he cited many of the fundamental criticisms of the tools: that they’re too mysterious to be used in court, that they punish people for the crimes of others, and that they hold your demographics against you. Last week the Wisconsin Supreme Court ruled against Loomis, but the decision validated some of his core claims. The case, say legal experts, could serve as a jumping-off point for legal challenges questioning the constitutionality of these kinds of techniques.
To understand the algorithms being used all over the country, it’s good to talk to Richard Berk. He’s been writing them for decades (though he didn’t write the tool that created Loomis’s risk score). Berk, a professor at the University of Pennsylvania, is a shortish, bald guy, whose solid stature and I-dare-you-to-disagree-with-me demeanor might lead people to mistake him for an ex-cop. In fact, he’s a career statistician.
His tools have been used by prisons to determine which inmates to place in restrictive settings; parole departments to choose how closely to supervise people being released from prison; and police officers to predict whether people arrested for domestic violence will re-offend. He once created an algorithm that would tell the Occupational Safety and Health Administration which workplaces were likely to commit safety violations, but says the agency never used it for anything. Starting this fall, the state of Pennsylvania plans to run a pilot program using Berk’s system in sentencing decisions.
As his work has been put into use across the country, Berk’s academic pursuits have become progressively fantastical. He’s currently working on an algorithm that he says will be able to predict at the time of someone’s birth how likely she is to commit a crime by the time she turns 18. The only limit to applications like this, in Berk’s mind, is the data he can find to feed into them.
This kind of talk makes people uncomfortable, something Berk was clearly aware of on a sunny Thursday morning in May as he headed into a conference in the basement of a campus building at Penn to play the role of least popular man in the room. He was scheduled to participate in the first panel of the day, which was essentially a referendum on his work. Berk settled into his chair and prepared for a spirited debate about whether what he does all day is good for society.
The moderator, a researcher named Sandra Mayson, took the podium. “This panel is the Minority Report panel,” she said, referring to the Tom Cruise movie where the government employs a trio of psychic mutants to identify future murderers, then arrests these “pre-criminals” before their offenses occur. The comparison is so common it’s become a kind of joke. “I use it too, occasionally, because there’s no way to avoid it,” Berk said later.
For the next hour, the other members of the panel took turns questioning the scientific integrity, utility, and basic fairness of predictive techniques such as Berk’s. As it went on, he began to fidget in frustration. Berk leaned all the way back in his chair and crossed his hands over his stomach. He leaned all the way forward and flexed his fingers. He scribbled a few notes. He rested his chin in one hand like a bored teenager and stared off into space.
Eventually, the debate was too much for him: “Here’s what I, maybe hyperbolically, get out of this,” Berk said. “No data are any good, the criminal justice system sucks, and all the actors in the criminal justice system are biased by race and gender. If that’s the takeaway message, we might as well all go home. There’s nothing more to do.” The room tittered with awkward laughter.
Berk’s work on crime started in the late 1960s, when he was splitting his time between grad school and a social work job in Baltimore. The city exploded in violence following the assassination of Martin Luther King Jr. Berk’s graduate school thesis examined the looting patterns during the riots. “You couldn’t really be alive and sentient at that moment in time and not be concerned about what was going on in crime and justice,” he said. “Very much like today with the Ferguson stuff.”
In the mid-1990s, Berk began focusing on machine learning, where computers look for patterns in data sets too large for humans to sift through manually. To make a model, Berk inputs tens of thousands of profiles into a computer. Each one includes the data of someone who has been arrested, including how old they were when first arrested, what neighborhood they’re from, how long they’ve spent in jail, and so on. The data also contain information about who was re-arrested. The computer finds patterns, and those serve as the basis for predictions about which arrestees will re-offend.
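To make the idea of "finding patterns" concrete, here is a deliberately tiny sketch in Python. It is not Berk's method, which the article does not describe in detail; it simply groups a handful of invented profiles into coarse patterns and records how often people matching each pattern were re-arrested. The field names, the cutoffs in `bucket`, and every record in `PROFILES` are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy records: (age_at_first_arrest, prior_arrests, was_rearrested).
# These values are invented for illustration only.
PROFILES = [
    (16, 5, True),
    (17, 4, True),
    (15, 6, False),
    (30, 1, False),
    (28, 0, False),
    (35, 1, True),
    (19, 3, True),
    (40, 0, False),
]


def bucket(age_at_first_arrest, prior_arrests):
    """Reduce a profile to a coarse pattern (assumed cutoffs, not real ones)."""
    young = age_at_first_arrest < 21
    many_priors = prior_arrests >= 3
    return (young, many_priors)


def fit(profiles):
    """Learn the observed re-arrest rate for each pattern seen in the data."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [re-arrested, total]
    for age, priors, rearrested in profiles:
        key = bucket(age, priors)
        counts[key][0] += int(rearrested)
        counts[key][1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


def predict(model, age_at_first_arrest, prior_arrests, default=0.5):
    """Return the learned re-arrest rate for a new profile's pattern.

    Patterns never seen in training fall back to `default`.
    """
    return model.get(bucket(age_at_first_arrest, prior_arrests), default)


model = fit(PROFILES)
print(predict(model, 18, 4))  # pattern: young, many priors
print(predict(model, 33, 0))  # pattern: older, few priors
```

Real systems work with tens of thousands of profiles and far richer pattern-finding techniques, but the basic shape is the same: the prediction for a new arrestee is derived from what happened to people in the historical data who looked similar.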
To Berk, a big advantage of machine learning is that it eliminates the need to understand what causes someone to be violent. “For these problems, we don’t have good theory,” he said. Feed the computer enough data and it can figure it out on its own, without deciding on a philosophy of the origins of criminal proclivity. This is a seductive idea. But it’s also one that comes under criticism each time a supposedly neutral algorithm in any field produces worryingly non-neutral results. In one widely cited study, researchers showed that Google’s automated ad-serving software was more likely to show ads for high-paying jobs to men than to women. Another found that ads for arrest records show up more often when searching the web for distinctly black names than for white ones.
Computer scientists have a maxim, “Garbage in, garbage out.” In this case, the garbage would be decades of racial and socioeconomic disparities in the criminal justice system. Predictions about future crimes based on data about historical crime statistics have the potential to equate past patterns of policing with the predisposition of people in certain groups—mostly poor and nonwhite—to commit crimes.
Berk readily acknowledges this as a concern, then quickly dismisses it. Race isn’t an input in any of his systems, and he says his own research has shown his algorithms produce similar risk scores regardless of race. He also argues that the tools he creates aren’t used for punishment—more often they’re used, he said, to reverse long-running patterns of overly harsh sentencing, by identifying people whom judges and probation officers shouldn’t worry about.
Berk began working with Philadelphia’s Adult Probation and Parole Department in 2006. At the time, the city had a big murder problem and a small budget. There were a lot of people in the city’s probation and parole programs. City Hall wanted to know which people it truly needed to watch. Berk and a small team of researchers from the University of Pennsylvania wrote a model to identify which people were most likely to commit murder or attempted murder while on probation or parole. Berk generally works for free, and was never on Philadelphia’s payroll.
A common question, of course, is how accurate risk scores are. Berk says that in his own work, between 29 percent and 38 percent of predictions about whether someone is low-risk end up being wrong. But focusing on accuracy misses the point, he says. When it comes to crime, sometimes the best answers aren’t the most statistically precise ones. Just like weathermen err on the side of predicting rain because no one wants to get caught without an umbrella, court systems want technology that intentionally overpredicts the risk any individual poses. The same person could end up being described as either high-risk or not depending on where the government decides to set that line. “The policy position that is taken is that it’s much more dangerous to release Darth Vader than it is to incarcerate Luke Skywalker,” Berk said.
Philadelphia’s plan was to offer cognitive behavioral therapy to the highest-risk people, and offset the costs by spending less money supervising everyone else. When Berk posed the Darth Vader question, the parole department initially determined it’d be 10 times worse, according to Geoffrey Barnes, who worked on the project. Berk figured that at that threshold the algorithm would name 8,000 to 9,000 people as potential pre-murderers. Officials realized they couldn’t afford to pay for that much therapy, and asked for a model that was less harsh. Berk’s team twisted the dials accordingly. “We’re intentionally making the model less accurate, but trying to make sure it produces the right kind of error when it does,” Barnes said.
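One way to picture "twisting the dials" is as a change in a cost ratio. The Python sketch below assumes a textbook decision rule, not anything confirmed about the Philadelphia model: if wrongly releasing a dangerous person is judged `cost_ratio` times worse than wrongly flagging a harmless one, then a person is flagged whenever the predicted probability `p` satisfies `p * cost_ratio > (1 - p)`, which works out to `p > 1 / (1 + cost_ratio)`. The scores in `SCORES` are invented.

```python
def threshold_from_cost_ratio(cost_ratio):
    """Probability cutoff implied by an assumed miss-to-false-alarm cost ratio.

    Flag when p * cost_ratio > (1 - p), i.e. when p > 1 / (1 + cost_ratio).
    """
    return 1.0 / (1.0 + cost_ratio)


def count_flagged(scores, cost_ratio):
    """Count how many predicted probabilities exceed the implied cutoff."""
    cutoff = threshold_from_cost_ratio(cost_ratio)
    return sum(1 for p in scores if p > cutoff)


# Hypothetical predicted probabilities for seven people (invented values).
SCORES = [0.02, 0.05, 0.08, 0.12, 0.20, 0.35, 0.60]

for ratio in (10, 3, 1):
    cutoff = threshold_from_cost_ratio(ratio)
    flagged = count_flagged(SCORES, ratio)
    print(f"cost ratio {ratio}:1 -> cutoff {cutoff:.3f}, flagged {flagged} of {len(SCORES)}")
```

Under these assumptions, a 10-to-1 ratio flags anyone above roughly a 9 percent predicted probability, while lowering the ratio raises the bar and shrinks the list. The underlying scores never change; only the line drawn through them does, which is why the same person can be high-risk under one policy and not under another.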
The program later expanded to group everyone into high-, medium-, and low-risk populations, and the city significantly reduced how closely it watched parolees Berk’s system identified as low-risk. In a 2010 study, Berk and city officials reported that people who were given more lenient treatment were less likely to be arrested for violent crimes than people with similar risk scores who stayed with traditional parole or probation. People classified as high-risk were almost four times more likely to be charged with violent crimes.
Since then, Berk has created similar programs in Maryland’s and Pennsylvania’s statewide parole systems. In Pennsylvania, an internal analysis showed that between 2011 and 2014 about 15 percent of people who came up for parole received different decisions because of their risk scores. Those who were released during that period were significantly less likely to be re-arrested than those who had been released in years past. The conclusion: Berk’s software was helping the state make smarter decisions.
Laura Treaster, a spokeswoman for the state’s Board of Probation and Parole, says Pennsylvania isn’t sure how its risk scores are impacted by race. “This has not been analyzed yet,” she said. “However, it needs to be noted that parole is very different than sentencing. The board is not determining guilt or innocence. We are looking at risk.”
Sentencing, though, is the next frontier for Berk’s risk scores. And using algorithms to decide how long someone goes to jail is proving more controversial than using them to decide when to let people out early.
Wisconsin courts use Compas, a popular commercial tool made by a Michigan-based company called Northpointe. By the company’s account, the people it deems high-risk are re-arrested within two years in about 70 percent of cases. Part of Loomis’s challenge was specific to Northpointe’s practice of declining to share specific information about how its tool generates scores, citing competitive reasons. Not allowing a defendant to assess the evidence against him violated due process, he argued. (Berk shares the code for his systems, and criticizes commercial products such as Northpointe’s for not doing the same.)
As the court was considering Loomis’s appeal, the journalism website ProPublica published an investigation looking at 7,000 Compas risk scores in a single county in Florida over the course of 2013 and 2014. It found that black people were almost twice as likely as white people to be labeled high-risk, then not commit a crime, while it was much more common for white people who were labeled low-risk to re-offend than black people who received a low-risk score. Northpointe challenged the findings, saying ProPublica had miscategorized many risk scores and ignored results that didn’t support its thesis. Its analysis of the same data found no racial disparities.
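The kind of comparison ProPublica describes can be expressed as two error rates computed separately for each group: how often people who did not re-offend had been labeled high-risk, and how often people who did re-offend had been labeled low-risk. The Python sketch below shows that arithmetic on invented counts. The group names and every number are hypothetical and are not ProPublica's or Northpointe's data.

```python
def make_records(false_pos, true_neg, false_neg, true_pos):
    """Build a list of (labeled_high_risk, reoffended) pairs from four counts."""
    return (
        [(True, False)] * false_pos
        + [(False, False)] * true_neg
        + [(False, True)] * false_neg
        + [(True, True)] * true_pos
    )


def error_rates(records):
    """Return (false_positive_rate, false_negative_rate) for one group.

    False positive rate: share of non-re-offenders who were labeled high-risk.
    False negative rate: share of re-offenders who were labeled low-risk.
    Assumes the group contains at least one person of each outcome.
    """
    fp = sum(1 for high, re in records if high and not re)
    tn = sum(1 for high, re in records if not high and not re)
    fn = sum(1 for high, re in records if not high and re)
    tp = sum(1 for high, re in records if high and re)
    return fp / (fp + tn), fn / (fn + tp)


# Invented counts for two hypothetical groups.
GROUPS = {
    "group_a": make_records(false_pos=40, true_neg=60, false_neg=30, true_pos=70),
    "group_b": make_records(false_pos=20, true_neg=80, false_neg=50, true_pos=50),
}

for name, records in GROUPS.items():
    fpr, fnr = error_rates(records)
    print(f"{name}: false positive rate {fpr:.2f}, false negative rate {fnr:.2f}")
```

In this made-up example, group_a is wrongly labeled high-risk twice as often as group_b, while group_b is more often wrongly labeled low-risk. A tool can show that pattern even when race is never an input, which is part of why the two sides could look at the same records and disagree.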
Even as it upheld Loomis’s sentence, the Wisconsin Supreme Court cited the research on race to raise concerns about the use of tools like Compas. Going forward, it requires risk scores to be accompanied by disclaimers about their nontransparent nature and various caveats about their conclusions. It also says they can’t be used as the determining factor in a sentencing decision. The decision was the first time that such a high court had signaled ambivalence about the use of risk scores in sentencing.
Sonja Starr, a professor at the University of Michigan’s law school and a prominent critic of risk assessment, thinks that Loomis’s case foreshadows stronger legal arguments to come. Loomis made a demographic argument, saying that Compas rated him as riskier because of his gender, reflecting the historical patterns of men being arrested at higher rates than women. But he didn’t frame it as an argument that Compas violated the Equal Protection Clause of the 14th Amendment, which allowed the court to sidestep the core issue.
Loomis also didn’t argue that the risk scores serve to discriminate against poor people. “That’s the part that seems to concern judges, that every mark of poverty serves as a risk factor,” Starr said. “We should very easily see more successful challenges in other cases.”
Officials in Pennsylvania, which has been slowly preparing to use risk assessment in sentencing for the past six years, are sensitive to these potential pitfalls. The state’s experience shows how tricky it is to create an algorithm through the public policy process. To come up with a politically palatable risk tool, Pennsylvania established a sentencing commission. It quickly rejected commercial products like Compas, saying they were too expensive and too mysterious, so the commission began creating its own system.
Race was discarded immediately as an input. But every other factor became a matter of debate. When the state initially wanted to include location, which it determined to be statistically useful in predicting who would re-offend, the Pennsylvania Association of Criminal Defense Lawyers argued that it was a proxy for race, given patterns of housing segregation. The commission eventually dropped the use of location. Also in question: the system’s use of arrests, instead of convictions, since it seems to punish people who live in communities that are policed more aggressively.
Berk argues that eliminating sensitive factors weakens the predictive power of the algorithms. “If you want me to do a totally race-neutral forecast, you’ve got to tell me what variables you’re going to allow me to use, and nobody can, because everything is confounded with race and gender,” he said.
Starr says this argument confuses the differing standards in academic research and the legal system. In social science, it can be useful to calculate the relative likelihood that members of certain groups will do certain things. But that doesn’t mean a specific person’s future should be calculated based on an analysis of populationwide crime stats, especially when the data set being used reflects decades of racial and socioeconomic disparities. It amounts to a computerized version of racial profiling, Starr argued. “If the variables aren’t appropriate, you shouldn’t be relying on them,” she said.
Late this spring, Berk traveled to Norway to meet with a group of researchers from the University of Oslo. The Norwegian government gathers an immense amount of information about the country’s citizens and connects each of them to a single identification file, presenting a tantalizing set of potential inputs.
Torbjørn Skardhamar, a professor at the university, was interested in exploring how he could use machine learning to make long-term predictions. He helped set up Berk’s visit. Norway has lagged behind the U.S. in using predictive analytics in criminal justice, and the men threw around a few ideas.
Berk wants to predict at the moment of birth whether people will commit a crime by their 18th birthday, based on factors such as environment and the history of a new child’s parents. This would be almost impossible in the U.S., given that much of a person’s biographical information is spread out across many agencies and subject to many restrictions. He’s not sure if it’s possible in Norway, either, and he acknowledges he also hasn’t completely thought through how best to use such information.
Caveats aside, this has the potential to be a capstone project of Berk’s career. It also takes all of the ethical and political questions and extends them to their logical conclusion. Even in the movie Minority Report, the government peered only hours into the future—not years. Skardhamar, who is new to these techniques, said he’s not afraid of making mistakes: They’re talking about them now, he said, so they can avoid future errors. “These are tricky questions,” he said, mulling all the ways the project could go wrong. “Making them explicit—that’s a good thing.”