The automated detection of racist discourse in Dutch social media

Journal contribution - Journal article

We present two experiments on the automated detection of racist discourse in Dutch social media. In both experiments, multiple classifiers are trained on the same training set. This training set consists of Dutch posts retrieved from two public Belgian social media pages which are likely to attract racist reactions. The posts were labeled as racist or non-racist by multiple annotators, who reached an acceptable agreement score. The different classification models all use the Support Vector Machine algorithm, but use different (sets of) linguistic features, which can be lexical, stylistic or dictionary-based. In the first experiment, the models are evaluated on a test set containing unseen comments retrieved from the same pages as the training set (and thus also skewed towards racism). In the second experiment, the same models from Experiment 1 are tested on an alternative test set, containing more neutral comments, retrieved from the social media page of a Belgian newspaper. In both experiments, the best performing model relies on a dictionary containing different word categories specifically related to racist discourse. It reaches an F-score of 0.47 (exp. 1) and 0.40 (exp. 2) for the racist class and ROC Area Under Curve scores of 0.64 (exp. 1) and 0.73 (exp. 2). The dictionaries, code, and the procedure for requesting the corpus are available at: https://github.com/clips/hades.
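As a rough illustration of the two ideas the abstract leans on, dictionary-based features and the F-score for a single class, the following minimal sketch may help. It is not the authors' implementation: the dictionary contents, the category names, the tokenizer, and the use of relative frequencies are all assumptions made for the example. In the setup described above, feature values of this kind would be passed to a Support Vector Machine, which is not shown here.

```python
import re
from typing import Dict, List, Set

# Hypothetical toy dictionary (category name -> lowercase words).
# These placeholder entries are NOT taken from the published dictionaries.
TOY_DICTIONARY: Dict[str, Set[str]] = {
    "category_a": {"foo", "bar"},
    "category_b": {"baz"},
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def dictionary_features(text: str, dictionary: Dict[str, Set[str]]) -> Dict[str, float]:
    """Return, per category, the share of tokens found in that category."""
    tokens = tokenize(text)
    total = len(tokens)
    return {
        category: (sum(token in words for token in tokens) / total if total else 0.0)
        for category, words in dictionary.items()
    }


def f_score(gold: List[int], predicted: List[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(dictionary_features("Foo bar baz qux", TOY_DICTIONARY))
    print(f_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Reporting the F-score for the racist class alone, as the abstract does, keeps a classifier that labels most posts as non-racist from looking better than it is.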
Journal: Computational Linguistics in the Netherlands Journal
Volume: 6
Pages: 3 - 20
Year of publication: 2016