Identifying Toxicity in the Digital Sphere: Bastián González-Bustamante presents his work at the 8th Monash-Warwick-Zurich Text-as-Data Workshop

Picture credits: Unsplash

In a world where social media is increasingly ubiquitous, the ability to identify and understand online toxicity and incivility is crucial for democracy and civil society. To that end, the project “Large Language Models (LLMs) to Identify Toxicity in the Digital Sphere during Protest Events in Latin America” seeks to develop technologies to detect and analyse toxicity on social networks during protest events in Latin America.

This project, funded by OpenAI, aims to create tools and models that researchers and practitioners can use to analyse and understand large amounts of text. As part of this effort, our research associate Bastián González-Bustamante presented his work “Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data” at the 8th Monash-Warwick-Zurich Text-as-Data Workshop, held virtually from 16 to 17 September.

In this paper, González-Bustamante evaluated the ability of different models, including Google’s Perspective API and OpenAI’s GPT models, to perform annotation tasks on political content, using a novel digital protest event dataset comprising more than three million digital interactions.

The results show that the Perspective API with a relaxed threshold, GPT-4o and Nous Hermes 2 Mixtral outperform the other models in zero-shot classification. Furthermore, the results suggest that Nous Hermes 2 and Mistral OpenOrca, despite having fewer parameters, perform the task well, making them attractive options with a good trade-off between performance, implementation cost and computational time.
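The “relaxed threshold” mentioned above refers to lowering the score cut-off at which a comment is labelled toxic, trading some precision for recall. A minimal sketch of that step is shown below; the score values, threshold values and function name are illustrative assumptions, not the paper’s actual pipeline.

```python
# Illustrative sketch: binary toxicity labelling from model confidence
# scores using a standard vs. a relaxed threshold. All values here are
# hypothetical, not taken from the paper.

STANDARD_THRESHOLD = 0.7  # assumed default cut-off
RELAXED_THRESHOLD = 0.5   # assumed relaxed cut-off (flags more comments)

def label_toxicity(scores, threshold):
    """Label each comment toxic (1) or non-toxic (0) by score cut-off."""
    return [1 if score >= threshold else 0 for score in scores]

# Hypothetical Perspective-style toxicity scores for four comments
scores = [0.92, 0.55, 0.30, 0.71]

print(label_toxicity(scores, STANDARD_THRESHOLD))  # stricter labelling
print(label_toxicity(scores, RELAXED_THRESHOLD))   # relaxed labelling
```

With the relaxed threshold, borderline comments (here, the 0.55 score) are also flagged as toxic, which is the trade-off the benchmark exploits.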

Our researcher’s work is available on arXiv, the open-access platform for electronic preprints founded in 1991 and one of the world’s leading venues for disseminating scientific and technological papers. By depositing their papers on arXiv, authors enable rapid dissemination and access for the scientific community.

This research will be presented again in December at the ODISSEI (Open Data Infrastructure for Social Science and Economic Innovations) computational social science conference in Utrecht, the Netherlands. We invite you to read Bastián González-Bustamante’s paper on arXiv and explore the results and findings presented. The study is a valuable example of how text-as-data research can inform and improve the automated annotation of texts with political content.

