Hello there. I study how machine learning can predict human disagreement during annotation. This work helps us model disagreement that is conventionally dismissed as annotation noise. Recent machine learning research has also surfaced instances where algorithms are biased against specific groups. I'm a PhD student at the Lab for Population Intelligence at RIT, led by Professor Christopher Homan.
I've interned at Meta (Facebook) in Summer 2022 and at RPI (IBM Watson Project) in Summer 2019.
In parallel, I'm also working with the University of Kelaniya in Sri Lanka to build an electronic medical record system for the country.
My previous research is in sociolinguistics, studying the evolution of Sri Lankan English across multiple generations.
I enjoy the DevOps side of systems and building end-to-end systems.
When I'm not at my desk, I enjoy traveling.
PhD in Computer Science, Current
Rochester Institute of Technology
BSc in Computer Science, 2017
University of Kelaniya
This paper examines social web content moderation from two key perspectives: automated methods (machine moderators) and human evaluators (human moderators). We conduct a noise audit at an unprecedented scale using nine machine moderators trained on well-known offensive speech data sets, evaluated on a corpus sampled from 92 million YouTube comments discussing a multitude of issues relevant to US politics. We introduce a first-of-its-kind data set of vicarious offense. We ask annotators: (1) whether they find a given social media post offensive; and (2) how offensive annotators sharing different political beliefs would find the same content. Our experiments with machine moderators reveal that moderation outcomes vary wildly across different machine moderators. Our experiments with human moderators suggest that (1) political leanings considerably affect first-person offense perspective; (2) Republicans are the worst predictors of vicarious offense; (3) predicting vicarious offense for Republicans is more challenging than predicting it for Independents and Democrats; and (4) disagreement across political identity groups considerably increases when sensitive issues such as reproductive rights or gun control/rights are discussed. Both experiments suggest that offense is, indeed, highly subjective and raise important questions concerning content moderation practices.
Annotator disagreement is often dismissed as noise or the result of poor annotation process quality. Others have argued that it can be meaningful. But lacking a rigorous statistical foundation, the analysis of disagreement patterns can resemble a high-tech form of tea-leaf reading. We contribute a framework for analyzing the variation of per-item annotator response distributions in data for humans-in-the-loop machine learning. We provide visualizations for, and use the framework to analyze the variance in, a crowdsourced dataset of hard-to-classify examples from the OpenImages archive.
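For concreteness, here is a minimal Python sketch of the basic objects such a framework operates on: empirical per-item annotator response distributions, with entropy as one simple per-item disagreement statistic. The function name and data layout are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np
from collections import defaultdict

def item_response_distributions(annotations, n_classes):
    """Empirical per-item label distributions from raw annotations.

    annotations: iterable of (item_id, label) pairs.
    Returns {item_id: probability vector over labels} plus each
    item's entropy, a simple per-item disagreement statistic.
    """
    counts = defaultdict(lambda: np.zeros(n_classes))
    for item, label in annotations:
        counts[item][label] += 1

    dists, entropies = {}, {}
    for item, c in counts.items():
        p = c / c.sum()
        dists[item] = p
        # Entropy is 0 for unanimous items and log(n_classes)
        # for maximally split ones.
        entropies[item] = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return dists, entropies
```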
We propose a fully Bayesian framework for learning ground truth labels from noisy annotators. Our framework ensures scalability by factoring a generative, Bayesian soft clustering model over label distributions into the classic Dawid and Skene joint annotator-data model. Earlier research along these lines has neither fully incorporated label distributions nor explored clustering by annotators only or data only. Our framework incorporates all of these properties within a graphical model designed to provide better ground truth estimates of annotator responses as input to any black-box supervised learning algorithm. We conduct supervised learning experiments with variations of our models and compare them to the performance of several baseline models.
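The Dawid and Skene model referenced above can be estimated with a simple EM loop. The sketch below shows that classic baseline estimator only, not the fully Bayesian, clustered extension from the paper; it assumes items and annotators are densely indexed from zero and that every item has at least one label.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Classic Dawid & Skene (1979) EM estimator.

    labels: integer array of (item, annotator, label) triples.
    Returns per-item posterior class probabilities and
    per-annotator confusion matrices.
    """
    n_items = labels[:, 0].max() + 1
    n_annotators = labels[:, 1].max() + 1

    # Initialize item posteriors with per-item label proportions
    # (a soft majority vote).
    T = np.zeros((n_items, n_classes))
    for i, _, l in labels:
        T[i, l] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and annotator confusion matrices,
        # with a small additive smoothing term.
        pi = T.mean(axis=0)
        conf = np.full((n_annotators, n_classes, n_classes), 1e-6)
        for i, a, l in labels:
            conf[a, :, l] += T[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute item posteriors from log-likelihoods.
        logT = np.tile(np.log(pi), (n_items, 1))
        for i, a, l in labels:
            logT[i] += np.log(conf[a, :, l])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return T, conf
```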
Supervised machine learning often requires human-annotated data. While annotator disagreement is typically interpreted as evidence of noise, population-level label distribution learning (PLDL) treats the collection of annotations for each data item as a sample of the opinions of a population of human annotators, among whom disagreement may be proper and expected, even with no noise present. From this perspective, a typical training set may contain a large number of very small-sized samples, one for each data item, none of which, by itself, is large enough to be considered representative of the underlying population’s beliefs about that item. We propose an algorithmic framework and new statistical tests for PLDL that account for sampling size. We apply them to previously proposed methods for sharing labels across similar data items. We also propose new approaches for label sharing, which we call neighborhood-based pooling.
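As an illustration of the neighborhood-based pooling idea, the sketch below smooths each item's small annotation sample with the label counts of its k nearest neighbors in an embedding space. The distance metric, the value of k, and the function names are assumptions made for this example, not the exact method or tests from the paper.

```python
import numpy as np

def neighborhood_pooling(label_counts, embeddings, k=5):
    """Pool each item's label counts with those of its k nearest
    neighbors, so small per-item annotation samples are smoothed
    by similar items.

    label_counts: (n_items, n_classes) raw annotation counts.
    embeddings:   (n_items, d) item feature vectors.
    """
    # Pairwise Euclidean distances (fine for small n_items).
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude the item itself

    pooled = label_counts.astype(float).copy()
    for i in range(len(label_counts)):
        neighbors = np.argsort(dists[i])[:k]
        pooled[i] += label_counts[neighbors].sum(axis=0)

    # Normalize to per-item label distributions.
    return pooled / pooled.sum(axis=1, keepdims=True)
```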
Sri Lankan English (SLE) has unique phonological, morphological, lexical, and syntactic features which have gradually developed since the introduction of English to Sri Lanka. Vocabulary is one of the first features to develop in SLE. Although the SLE vocabulary has been studied and recorded, its generational difference has not been examined. The objective of the study was to investigate whether the 'generational change' observable in the SLE vocabulary could be considered an evolution. This was done through a qualitative, comparative analysis of the vocabulary used in the decades 1955–1965 and 2005–2015. The theoretical base of the research was defined using two theories of language evolution: the apparent-time hypothesis and age-gradedness. The primary data was taken from the Ceylon Observer of the decade 1955–1965 and the Sunday Observer of the decade 2005–2015. The words were used in a questionnaire survey of 60 participants, of which 30 were aged 15–25 years and 30 were aged 65–75 years. The results of the survey were then analyzed in detail through 10 interviews. The surveys and the interviews were conducted to prove or disprove the age-gradedness of the SLE vocabulary and the apparent-time hypothesis in relation to the SLE vocabulary. Most of the vocabulary used disproved age-gradedness. The usages of these terms were found to be generation-specific, supporting the conclusion that the SLE vocabulary is not age-graded. The interviews supported the apparent-time hypothesis, as the older generation showed that their vocabulary had not changed significantly over the years. From these observations, it could be concluded that, within the scope of the research, the generational difference observable in the SLE vocabulary over 60 years could be termed an evolution.
Worked on the Facebook Creators Wellbeing Team on Public Conversations. Oversaw models for improving comment recommendation and ranking on Facebook Pages with varying populations of followers from around the globe.
Project: Introduced a multi-label, multi-task model to assist page administrators with comment management.
Collaborative project with the Faculty of Medicine and the Colombo North Teaching Hospital, Sri Lanka.