Researchers have recently discovered that anyone can trick hate speech detectors with simple changes to their language—and typos are just one way that neo-Nazis are foiling the algorithms.

Erin Schrode didn't know much about the extreme right before she ran for Congress. "I'm not going to tell you I thought anti-Semitism was dead, but I had never personally been the subject of it," she says.

That changed when The Daily Stormer, a prominent neo-Nazi website, posted an article about her 2016 campaign. The comments section filled up with derogatory statements targeting Schrode because she is Jewish. But buried between slurs and racist images, one anonymous person—under the username "Forbesmag"—posted Schrode's email address, cell phone number, and links to her social media profiles.

Those eight lines of text sparked what Schrode describes as an "onslaught" of harassment, clogging her phone with messages that paired violent anti-Semitism with gut-wrenching misogyny. Schrode was oblivious to the whole thing until she woke up the next day in California. "I couldn't believe what I saw. I'd never seen so many notifications," she says.

The messages came via Twitter, Facebook, and Instagram. There were anti-Semitic slurs, threats of gang rape, and references to the Holocaust. One user had Photoshopped Schrode's likeness into an image of a concentration camp. "I received tens of thousands of messages that first day," says Schrode, who lost that year's Democratic primary in Marin, and is now an activist and co-founder of the non-profit Turning Green, which promotes sustainable lifestyles.

For years, social media companies have struggled to contain the sort of hate speech Schrode describes. When Facebook founder Mark Zuckerberg spoke before the Senate in April of 2018, he acknowledged that human moderators alone were not enough to remove toxic content from Facebook; they also needed help from technology, he said.

"Over time, we're going to shift increasingly to a method where more of this content is flagged up front by [artificial intelligence] tools that we develop," Zuckerberg said.

Zuckerberg estimated that A.I. could master the nuances of hate speech in five to 10 years. "But today, we're just not there," he told senators.

He's right: Researchers have recently discovered anyone can trick hate speech detectors with simple changes to their language—removing spaces in sentences, changing "S" to "$," or changing vowels to numbers.

In a 2018 paper called "Evading Hate Speech Detection," researchers from Finland's Aalto University demonstrated how easy it was to trick a range of hate-speech detection models using simple typos.

One focus of the paper was Google's Perspective API, a model designed to detect hate speech by assigning a "toxicity score" to words or sentences. For example, the phrase "I hate you" scores high, at 0.91, indicating that such a statement is "likely to be perceived as toxic."

But researchers found that the score drops dramatically if you remove the spaces between your words. "IHateYou" is "unlikely to be perceived as toxic," according to the Google tool, which scores the un-spaced version at a lowly 0.20.
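The kinds of changes described above can be sketched in a few lines of Python. This is a toy illustration only: the word weights and the function names (toy_score, remove_spaces, leetify) are hypothetical, and the scorer is a simple whole-word lookup, not Google's Perspective model or any model from the Aalto paper.

```python
import re

# Hypothetical word weights for illustration; not taken from any real detector.
TOXIC_WEIGHTS = {"hate": 0.9, "stupid": 0.7}

def toy_score(text: str) -> float:
    """Return the highest weight of any whole word found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return max((TOXIC_WEIGHTS.get(w, 0.0) for w in words), default=0.0)

def remove_spaces(text: str) -> str:
    """Join the words with no spaces, capitalizing each one ("IHateYou")."""
    return "".join(w[:1].upper() + w[1:] for w in text.split())

# Swap "s" for "$" and lowercase vowels for look-alike digits.
LEET = str.maketrans({"s": "$", "S": "$", "a": "4", "e": "3", "i": "1", "o": "0"})

def leetify(text: str) -> str:
    """Apply the character substitutions defined in LEET."""
    return text.translate(LEET)

original = "I hate you"
print(original, toy_score(original))          # I hate you 0.9
squashed = remove_spaces(original)
print(squashed, toy_score(squashed))          # IHateYou 0.0
disguised = leetify(original)
print(disguised, toy_score(disguised))        # I h4t3 y0u 0.0
```

The toy scorer fails for a simple reason: it only recognizes whole words it already knows. Production models are far more sophisticated, so this is an analogy for the reported behavior, not an explanation of it.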

"These attacks actually do work," Aalto University's Tommi Gröndahl says. "It's easy to fool these automatic detectors."

But typos are just one element of a new online language that is emerging to foil the algorithms meant to protect users. Neo-Nazis are also creating codewords to disguise hate speech online.

When Joel Finkelstein, director and co-founder of the Network Contagion Research Institute, began researching anti-Semitic language on the social media sites 4chan and Gab, the results included words he'd never heard before.

He already knew about some of the most famous anti-Semitic Internet codes, such as the (((triple parentheses))) that are used to brand another user as Jewish, marking them as targets for harassment. But Finkelstein, who is also a research fellow at the Anti-Defamation League, saw that hate speech was evolving in real time, as new slurs targeting minority groups kept appearing. He saw the emergence of violent acronyms such as "GTKRWN" ("gas the kikes, race war now"), and hashtags like #tgsnt, or "the greatest story never told" (code for "Hitler was right"). He also noted how anti-Semites were coining new codewords, such as "ZOG," "ZIO," or "turbokike," to use instead of "Jews."

Finkelstein's research follows past instances in which online trolls have used codewords to avoid detection by A.I. While the NCRI paper focused on anti-Semitism, a 2016 4chan post detailed how other minority groups can also be targets: Alongside the codeword for Jewish people ("skypes"), African Americans became "googles," Mexicans were "yahoos," Muslims were "skittles," and liberals became "car salesmen."

Even today, A.I. still struggles to referee words, such as "skypes," that have come to have two meanings.
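A minimal sketch shows why a double meaning is hard for a context-blind filter. The blocklist and the flags function below are hypothetical, and a fixed word list is only a stand-in for a detector that ignores context; no platform's actual system is shown here.

```python
import re

# Hypothetical blocklist holding one codeword mentioned in the article.
BLOCKLIST = {"skypes"}

def flags(text: str) -> bool:
    """Flag the text if any whole word appears on the blocklist."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(w in BLOCKLIST for w in words)

# Ordinary use of the word as a verb: flagged anyway (a false positive).
benign = "She skypes her grandmother every Sunday"
print(flags(benign))      # True

# A newer codeword that is not on the list yet: missed (a false negative).
new_code = "A post that uses shrinky dinks as a stand-in term"
print(flags(new_code))    # False
```

Adding a codeword to the list catches innocent uses of the same word, while leaving it off lets the coded use through. Either way, the filter never looks at the surrounding sentence.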

"What A.I. doesn't pick up at this point is the context, and that's what makes language hateful," says Brittan Heller, an affiliate with the Berkman Klein Center for Internet and Society at Harvard University.

With current hate speech models so imperfect, researchers are racing to design a more capable model—and experimenting with different techniques.

Creating an A.I. model to detect hate speech on Twitter was a steep learning curve for data expert Jason Carmel.

"I really didn't know that much about hatred before I started. I had the movie version of hate in my mind—the N-word or the K-word," Carmel says. "But what we learned is that hate is way more interesting than that. These people have evolved to use a language that's both specific and sinister, in that it hides itself from plain view."

Carmel is chief data officer on the project #WeCounterHate, a program run by the advertising agency Possible. The campaign uses A.I. to detect hate speech on Twitter before "countering" it. "Countering" involves replying to toxic messages with a tweet explaining that every time the hateful post is retweeted, a donation will be made to an organization supporting diversity and equality.

While the project has had success—"countered" posts see a 64 percent reduction in retweets, according to the agency—Carmel says the project's A.I. model is not yet able to adapt to the way hate speech is quickly evolving. And #WeCounterHate still relies on human moderators to verify its results.

"Our machine has to be taught; it's not a self-learning machine. That means it will understand modest changes in language but not massive flips," Carmel says.

Another NCRI co-founder, Jeremy Blackburn, who is working on the institute's own hate speech detection system, says that "supervised" models (the type of A.I. used by Google and by #WeCounterHate) can handle the task well because they learn what is hateful or toxic from data labeled by humans.

But humans can be prone to bias, and definitions of hate speech and toxicity can vary person-to-person.
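How much a supervised model depends on its labels can be seen in a toy sketch. The train and predict functions and the four example sentences are hypothetical, and the word-counting "model" is a deliberately crude stand-in; it is not how Perspective or #WeCounterHate is built.

```python
from collections import defaultdict

def train(examples):
    """For each word, estimate the share of examples containing it that were labeled toxic (1)."""
    seen = defaultdict(int)
    toxic = defaultdict(int)
    for text, label in examples:
        for word in set(text.lower().split()):
            seen[word] += 1
            toxic[word] += label
    return {w: toxic[w] / seen[w] for w in seen}

def predict(model, text):
    """Average the learned word scores; words never seen in training count as 0.0."""
    words = text.lower().split()
    return sum(model.get(w, 0.0) for w in words) / len(words) if words else 0.0

# Labels: 1 = toxic, 0 = neutral. These sentences are made up for illustration.
clean = [
    ("you are awful", 1),
    ("you are kind", 0),
    ("awful people everywhere", 1),
    ("kind people everywhere", 0),
]
# Same data, but one toxic example was marked neutral by a labeler.
noisy = clean[:2] + [("awful people everywhere", 0)] + clean[3:]

print(predict(train(clean), "awful"))   # 1.0
print(predict(train(noisy), "awful"))   # 0.5
```

With only four examples, a single disputed label halves the score the toy model gives the word. Real training sets are vastly larger, but the model still has no source of truth other than the labels people gave it.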

"I've dug through some of the training data that was used for [Google's] Perspective model and have seen what I consider quite a bit of mis-labeled data," Blackburn says. "There were many cases of anti-Semitic comments that were labeled as neutral or were given what I consider to be incorrect severity scores. There were examples along the lines of 'Jews run the government and are feeding us lies' that were marked as neutral."

The University of California–Berkeley's D-Lab is trying to avoid the limitations of supervised models by working with a partly unsupervised A.I. The team is training a model that builds on top of "natural language-processing models" called ELMo and BERT, algorithms that teach a computer what words mean.

Through this approach, "[The] A.I. is able to recognize the different senses that words take on in different contexts," says D-Lab Executive Director Claudia von Vacano.

Vacano hopes this model will eventually be able to detect hate speech in disguise. Take "shrinky dinks." Until last year, Shrinky Dinks were just plastic toys that children could shrink in the microwave. Now, in various corners of the Web, "Shrinky Dinks" is being used as a substitute phrase for "Jewish people," another example of how the most innocent-looking language can be co-opted as an anti-Semitic slur.

"Online hate speech is not a problem that we can solve without A.I.," Vacano writes in an email. "Manually reviewing large samples of user-generated text is cost-prohibitive, inaccurate, slow, and can have negative effects on human labelers, raising ethical considerations about non-A.I. approaches."

D-Lab is still training and evaluating its model. But the strategies to disguise online hate speech are constantly evolving, and it's possible that A.I. will always be playing catch-up.

Still, Vacano remains undeterred. "This work is never done," she says; "language is never static."


Pacific Standard's Ideas section is your destination for idea-driven features, voracious culture coverage, sharp opinion, and enlightening conversation.