Timothy Quinn, the co-founder of Hatebase, and his team have spent years compiling the vilest words on the Internet, and Hatebase has made understanding hate speech its primary mission.
Hatebase emerged from research at the Sentinel Project for Genocide Prevention and launched on March 25, 2013. It is a joint effort between the Sentinel Project and Mobiocracy, a software company.
The software developed for Hatebase was used to predict and prevent atrocities by analyzing the language used in conflict-ridden regions.
“What Sentinel discovered was that hate speech tends to precede escalation of these conflicts,” explained Quinn, according to TechCrunch. “I partnered with them to build Hatebase as a pilot project — basically a lexicon of multilingual hate speech. What surprised us was that a lot of other NGOs [non-governmental organizations] started using our data for the same purpose. Then we started getting a lot of commercial entities using our data. So last year we decided to spin it out as a startup.”
A lexicon of multilingual hate speech
So, what in the world is a lexicon? A lexicon is simply the vocabulary of a person, language, or branch of knowledge. You could say it is essentially a catalog of a language’s words.
A lexicon is different from a vocabulary. A vocabulary is simply the set of words a person knows in a particular language; this journalist, for example, grew up with an English vocabulary. Both a lexicon and a vocabulary consist of words in a language, but a lexicon encapsulates a wider knowledge of those words along with their proper usage.
The point is this: the lexicon of a particular country or nationality is more than a collection of dirty words. It incorporates an entire genre of slang particular to that language. Quinn believes the slang of a single language could fill a dictionary.
According to CBC Canada, Hatebase has gathered a growing list of over 3,600 terms considered to be hate speech, and the efforts have attracted some big-name partners around the world.
It’s not easy trying to figure out what constitutes hate speech, be it a word or a phrase, and this is where it’s great to have a platform to help find those words. “It’s a horrible job for a human being to do,” Quinn said. “You need some degree of automation to handle the worst of the worst.”
That said, Hatebase does not endorse censorship. To quote the company’s website: “We are passionate in our defense of free speech. We believe that the right to hold and express opinions, no matter how disagreeable, is one of the distinguishing characteristics of a free and open society.”
Finding early signs of violence
Looking at social media posts and messaging in particular, it would be valuable to know beforehand when killing or violence against a particular group is brewing. Suspects in the Toronto van attack, the El Paso Walmart shooting and the massacre at the mosque in New Zealand, to name a few, are believed to have spread hateful content online before the attacks.
Hatebase’s automated social media monitoring platform is called Hatebrain. Quinn says it is not designed to single out any one user, though a rise in hate speech can sometimes be indicative of escalating or coming violence. In fact, it’s impossible to identify an individual, because the places Hatebase looks for data don’t provide identity information.
“We’re not looking for the one active shooter,” Quinn said in an interview. “We’re looking for raw trends around language being used to discriminate against groups of people online.”
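The kind of raw trend monitoring Quinn describes can be sketched in a few lines. This is an illustrative example, not Hatebrain’s actual code: count sightings of flagged lexicon terms per day, then raise a flag when the latest day’s total rises sharply above a recent baseline. The lexicon entries and thresholds here are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical lexicon of flagged terms (real entries omitted)
LEXICON = {"term_a", "term_b", "term_c"}

def count_sightings(posts):
    """Count how many flagged lexicon terms appear across a batch of posts."""
    counts = Counter()
    for post in posts:
        for word in post.lower().split():
            if word in LEXICON:
                counts[word] += 1
    return counts

def rising_trend(daily_totals, window=3, factor=2.0):
    """Flag a trend when the latest day's total is at least `factor`
    times the average of the preceding `window` days."""
    if len(daily_totals) <= window:
        return False
    baseline = sum(daily_totals[-window - 1:-1]) / window
    return baseline > 0 and daily_totals[-1] >= factor * baseline

# Example: the last day's sightings double the three-day baseline
assert rising_trend([10, 12, 11, 24])
assert not rising_trend([10, 12, 11, 12])
```

The point of the sketch is that no individual is identified: only aggregate counts of term usage are tracked, which matches Quinn’s description of looking for group-level trends rather than a single user.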
Hatebase has a database that includes terms used in 97 languages — detected online more than one million times — from users in 184 countries. In Canada, gay people and women represent the most-targeted groups, according to a CBC Canada reporter who saw the country-specific page held by Hatebase.
The Canadian Civil Liberties Association (CCLA) told CBC that, while it was not familiar with Hatebase’s platform, speaking about the monitoring of hate speech in its broadest sense, there would be cause for concern if the data were used as the basis for excluding some points of view from online discussion. The CCLA is most concerned about the definition of hate speech being too restrictive.
Words “that most people in ordinary conversation would think is hate speech, is not hate speech under the law,” said the CCLA’s Cara Zwibel. The CCLA is concerned that “marginalized” groups or targeted populations would not only be silenced but could themselves become the focus of hate speech.
Hatebase defines hate speech as “any term which broadly categorizes a specific group of people based on malignant, qualitative and/or subjective attributes — particularly if those attributes pertain to ethnicity, nationality, religion, sexuality, disability or class.”
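That definition can be made concrete with a minimal sketch of what a lexicon entry might record. The field names below are illustrative assumptions, not Hatebase’s actual schema: the only part taken from the source is the list of targeted attributes in the definition above.

```python
from dataclasses import dataclass

# Hypothetical record shape for a lexicon entry; the field names are
# illustrative, not Hatebase's actual data model.
@dataclass
class LexiconEntry:
    term: str
    language: str            # e.g. "en"
    targeted_attribute: str  # which group attribute the term targets
    ambiguous: bool          # True if the term also has benign uses

# The attribute categories named in Hatebase's published definition
CATEGORIES = {"ethnicity", "nationality", "religion",
              "sexuality", "disability", "class"}

def is_hate_term(entry: LexiconEntry) -> bool:
    """Under the definition quoted above, a term qualifies when it
    categorizes a group based on one of the listed attributes."""
    return entry.targeted_attribute in CATEGORIES

entry = LexiconEntry("example_slur", "en", "nationality", ambiguous=True)
assert is_hate_term(entry)
```

The `ambiguous` flag hints at why, as Quinn notes below, very few classifications come out of such a system at 100 percent confidence: many terms have both hateful and benign uses, so context matters.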
Hatebase’s system is not perfect. No system can claim that prize. But Quinn says, “There are very few 100 percents coming out of Hatebrain. It varies a little from the machine learning approach others use. ML is great when you have an unambiguous training set, but with human speech, and hate speech, which can be so nuanced, that’s when you get bias floating in. We just don’t have a massive corpus of hate speech, because no one can agree on what hate speech is.”