Justice by the Numbers: Meet the Statistician Trying to Fix Bias in Criminal Justice Algorithms

Machine learning tools are spreading through America's criminal justice system, and when the underlying data they rely on is incomplete, as it often is, the effects can be devastating.
Human Rights Data Analysis Group Lead Statistician Kristian Lum speaks onstage at TechCrunch Disrupt SF 2018 at Moscone Center on September 7th, 2018, in San Francisco, California.

Last February, Terrence Wilkerson told a room full of technologists about his first day at Rikers Island, New York City's infamous jail. "I saw someone get a new tattoo, which was a razor from here to here"—his finger sliced from the top of his head to his chin—"and that experience alone was one of the scariest."

Wilkerson, who wore a white button-down shirt and shoulder-length braids, has twice been arrested for robberies he says he didn't commit. The first time, when he was 20 years old, he couldn't afford the $35,000 bail and spent 10 months behind bars in pre-trial detention. "I didn't know what else to do and took a plea [bargain] to something I didn't do," he told the audience. The second time, when a girlfriend scrounged up cash for his $2,500 bail, he was able to help his lawyer fight his case from the outside and eventually got acquitted. "The jury, the judges ... they're not seeing you in shackles, beat up with scars on your face from being in jail," he said. "People see how you really are."

Wilkerson was speaking at the inaugural Conference on Fairness, Accountability, and Transparency, a gathering of academics and policymakers working to make the algorithms that govern growing swaths of our lives more just. The woman who'd invited him there was Kristian Lum, the 34-year-old lead statistician at the Human Rights Data Analysis Group, a San Francisco-based non-profit that has spent more than two decades applying advanced statistical models to expose human rights violations around the world. For the past three years, Lum has deployed those methods to tackle an issue closer to home: the growing use of machine learning tools in America's criminal justice system.

In New York's courts, two algorithmic risk assessment tools now help judges decide whether or not to detain defendants like Wilkerson before trial. It's not a perfect science, Lum argues: Because software of this kind is trained with historical data, with all its gaps and inequities, it risks reproducing past injustices. And the consequences can be dire for people like Wilkerson. "It comes down to 'Bias in, bias out,'" Lum says. Her goal is to reveal those biases and, when possible, find ways to correct for them using quantitative methods.

"We need to understand fairness in the broader context in which our software will operate," Lum told the audience. "We need to worry about whether we're legitimizing unfair practices with a stance that 'science or technology says this is fair.'"

* * *

Lum, who has a dark brown bob and freckles, was working on a Ph.D. in statistics at Duke University almost a decade ago when she cold-emailed Patrick Ball, HRDAG's founder, to ask for an internship. She spent a summer helping the group figure out which statistical model could best estimate how many people had died or disappeared in Casanare, a rural region hit hard by Colombia's internal conflict, between 1998 and 2007. In 2014, she joined the team full time.

In 2015, algorithm-based tools were gaining traction among police departments and courts, and police shootings of unarmed victims were making headlines. Lum and her colleagues decided that their skill set, typically deployed in the aftermath of war crimes or genocide, could now help protect Americans' rights. The main problem with algorithms in criminal justice, Lum argues, is the same one the group has encountered while tallying human rights abuses in places like Guatemala and Syria: The underlying data is often incomplete, and those gaps matter.

Missing data was at issue in Lum's first project, which aimed to estimate the true number of homicides perpetrated by police in the United States. Media outlets like the Guardian and the Washington Post had documented killings over fixed time periods, but Lum and her colleagues felt the counts were flawed—they captured only deaths that someone had reported, not those that no one recorded. In 2015, the Bureau of Justice Statistics took a step toward filling that gap by commissioning an independent research group to determine the true number of deaths at the hands of police officers. The researchers started with two national data sets on law enforcement homicides, one from the Bureau of Justice Statistics and one from the Federal Bureau of Investigation. They then applied a statistical technique called "capture-recapture analysis," also known as "multiple systems estimation" and often used for wildlife counts, to estimate how many other deaths weren't on either list. This involved identifying records that likely overlapped in the two sources and comparing them to the number that were unique. The report concluded that, between 2003 and 2009, and in 2011, police killed a total of about 7,400 Americans, and roughly 30 percent of these deaths went unreported.
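The capture-recapture idea can be illustrated with its simplest two-list version, the Lincoln-Petersen estimator. The counts below are hypothetical, chosen only to show the arithmetic, not taken from the actual report:

```python
def lincoln_petersen(n1, n2, overlap):
    """Estimate a total population from two overlapping lists.

    n1, n2: record counts on each list; overlap: records found on both.
    Valid only if the two lists are independent samples of the population.
    """
    if overlap == 0:
        raise ValueError("estimator undefined with no overlapping records")
    return n1 * n2 / overlap

# Hypothetical counts: list A records 4,000 deaths, list B records 5,000,
# and 2,700 records match across the two lists.
total = lincoln_petersen(4000, 5000, 2700)
print(round(total))  # 7407: the estimated total, including unlisted deaths
```

The fewer records the two lists share, the larger the estimated number of deaths that appear on neither.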

Lum and Ball felt this was an underestimate because the agency wrongly assumed the two data sets involved were completely independent. In fact, they argued, people of higher social status were more likely to appear on both lists (for one of the databases, more than half of states relied solely on media reports to count homicides). Based on patterns they'd gleaned from estimating killings in other countries, including by law enforcement, Lum and Ball used statistical techniques to account for the likelihood that the two lists weren't completely isolated. They estimated that police had actually killed 10,000 Americans over that time period—roughly 1,500 a year—and half weren't recorded in the national data.
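The direction of that bias can be checked with a toy simulation (a rough sketch, not HRDAG's actual method): if appearing on one list raises the chance of appearing on the other, the overlap between lists is inflated, and a two-list estimate of the kind above comes out too low.

```python
import random

random.seed(0)
TRUE_N = 10_000  # true number of deaths in the simulation

def estimate(dependence):
    """Two-list estimate when being on list A raises the probability
    of being on list B by `dependence` (0.0 means independent lists)."""
    n1 = n2 = overlap = 0
    for _ in range(TRUE_N):
        on_a = random.random() < 0.4
        on_b = random.random() < 0.5 + (dependence if on_a else 0.0)
        n1 += on_a
        n2 += on_b
        overlap += on_a and on_b
    return n1 * n2 / overlap

est_independent = estimate(0.0)  # recovers roughly the true 10,000
est_dependent = estimate(0.3)    # correlated lists: a clear undercount
```

This is the core of Lum and Ball's objection: treating correlated lists as independent systematically shrinks the estimated number of unrecorded deaths.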

"The fundamental issue HRDAG deals with is: What's missing?" Lum says. "We're pointing out that what you are collecting may not be representative of the whole truth, and some things are systemically missing."

In 2016, Lum turned to another area of criminal justice that suffered from incomplete data: predictive policing. She reproduced the algorithm behind PredPol, software employed by around 60 police departments in the U.S. that uses historical crime data to forecast "hotspots" for officers to target on a given day. Lum fed in open-source police data on drug crime from Oakland, California, where officials had briefly considered deploying PredPol. The result: The algorithm instructed police to almost exclusively target poor, minority neighborhoods, even though public-health data suggested drug use was spread more evenly across the city. That's because, according to Lum, those are the neighborhoods police have disproportionately targeted for drug enforcement over time. "The records used to train the algorithm are tainted with the police's own biases, and the algorithm essentially reproduced the patterns of the past," Lum says.
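The feedback loop Lum describes can be sketched in a few lines (a deliberately crude stand-in, not PredPol's actual model): two neighborhoods with identical true drug activity, where one simply starts with more recorded arrests because it was policed more heavily.

```python
import random

random.seed(1)
recorded = [60, 40]  # historical arrest counts: neighborhood 0 was
                     # policed more, not necessarily used more
TRUE_RATE = 0.5      # actual daily offending, identical in both places

for day in range(100):
    # "Predictive" step: patrol wherever the records show the most crime.
    target = 0 if recorded[0] >= recorded[1] else 1
    # Arrests can only be recorded where officers are actually sent.
    if random.random() < TRUE_RATE:
        recorded[target] += 1

# Neighborhood 0 accumulates every new record, while neighborhood 1's
# count never changes: the initial disparity locks in and grows.
```

Even with identical underlying behavior, the data the algorithm generates for itself confirms and amplifies its starting bias.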

* * *

Last year, Lum tackled the issue that transformed Wilkerson's life and has come under increasing scrutiny nationally: cash bail. Along with the New York Legal Aid Society and Stanford University statistician Mike Baiocchi, she studied the effect of setting bail on case outcomes for low-income defendants who passed through New York's criminal courts in 2015. Using a statistical method called "near-far matching," they showed that setting bail increased the chances someone would be found guilty, usually by taking a plea deal, by more than a third. Their findings appeared in the journal Observational Studies last October. "This was a really groundbreaking thing for us," says Joshua Norkin, a staff attorney at the New York Legal Aid Society. "We have long suspected that clients held on bail are subjected to disproportionate rates of conviction, but this became a lens to articulate how bad the problem was." The organization has used the findings to advocate for release of individual clients and is considering filing a class-action suit to challenge pre-trial detention.

Now, Lum is evaluating whether the risk assessment tools judges around the country use to make decisions about detention and bail in cases like Wilkerson's lead to unfair outcomes. For one project, she obtained data from New York City's Criminal Justice Agency about an algorithmic tool that recommends who should qualify for the city's two-year-old "supervised release" program, which lets defendants avoid jail before trial if they report regularly to a local non-profit. The software is trained to screen out people likely to be arrested again for a felony before trial and to identify those likely to go to jail because they can't afford bail. Lum is evaluating how well the tool conforms to various mathematical definitions of fairness.
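Those mathematical definitions of fairness often conflict with one another, but each is simple to state. One common criterion, equal false positive rates across demographic groups, can be checked like this (the numbers are invented purely for illustration):

```python
def false_positive_rate(predicted, actual):
    """Share of people who were not re-arrested that the tool
    nonetheless flagged as high risk."""
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)
    return false_pos / negatives

# 1 = flagged as likely to be re-arrested / actually re-arrested.
group_a_flagged = [1, 1, 0, 0, 1, 0]
group_a_rearrested = [1, 0, 0, 0, 1, 0]
group_b_flagged = [1, 0, 0, 1, 1, 0]
group_b_rearrested = [1, 0, 0, 0, 1, 1]

fpr_a = false_positive_rate(group_a_flagged, group_a_rearrested)  # 0.25
fpr_b = false_positive_rate(group_b_flagged, group_b_rearrested)  # 1/3
# By this criterion the tool is unfair: group B's non-reoffenders
# are wrongly flagged at a higher rate than group A's.
```

An audit like Lum's compares metrics of this kind across groups and across competing fairness definitions, since a tool can satisfy one while violating another.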

Lum is also part of a research council advising New York City on how to improve a second algorithmic tool, which helps judges decide whether to release people who have been arrested when they first appear in court, based on the likelihood they won't show up in the future. The city plans to roll out the updated tool this year.

Research on algorithms in criminal justice is a growing field, but Lum's commitment to deploying her knowledge in the real world is unusual, says Solon Barocas, a Cornell University professor who cofounded the conference where Wilkerson spoke. "She's not only looked very closely at the actual tools but also worked directly with the advocates and affected populations in a way others have not," he says. "She's a person that's equal parts scholar and advocate."

Lum believes in the power of statistics to illuminate hidden abuses and make algorithmic tools more just, but unlike some quantitative buffs, she is willing to question whether technology can solve every problem. She wonders whether efforts to improve risk assessment and predictive policing software, for example, harm efforts to overhaul structural problems. "When we're designing machine learning models and claiming they're fair ... are we actually helping sweeping reforms that are vitally needed, or are we justifying putting a Band-Aid on a broken system?" she asked the audience Wilkerson addressed. Later, she told me: "Sometimes the best solution is to abandon the quantitative or technical approach."
