An AI Web Service for Abusive Content Detection (for Turkish Content)
Cyberbullying is a growing concern in today’s society, and machine learning is a topic in demand, so we combined the two: we trained a model on tens of thousands of tweets, comments and news items and built Abusive Content Detection. It turned out pretty well.
A-) How Did We Do It, a.k.a. Technicalities
First of all, bear in mind that the objective was not just to catch sentences with swear words but “abusive” sentences like “Sen ne kadar karakter yoksunu birisin?” (roughly, “How lacking in character you are!”). While gathering the data we kept reminding ourselves of this, because it is easy to catch swearing with a rule-based model but nearly impossible to catch subtle offense that way. Later in the project we concluded that it was best to use a combined classifier made of both rule-based and machine-learning-based models. With that out of the way, we can start.
Most “Machine Learning Projects” can be boiled down to three stages: gathering and labeling the data, extracting features from the data and training the model, and making improvements to the model. This project was no different.
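To make the three stages concrete, here is a minimal sketch in plain Python. It is not our production code: the four toy samples, the bag-of-words features and the Naive Bayes model are all assumptions picked for illustration, and the function names (`tokenize`, `train`, `predict`) are hypothetical.

```python
import math
import re
from collections import Counter

# Stage 1: gathered and labeled data (toy samples, for illustration only).
SAMPLES = [
    ("Sen ne kadar karakter yoksunu birisin?", "ABUSIVE"),
    ("Aptal mısın sen?", "ABUSIVE"),
    ("Bugün hava çok güzel.", "NORMAL"),
    ("Yarın maç var, izleyecek misin?", "NORMAL"),
]

def tokenize(text):
    # Stage 2a: feature extraction as a simple bag of lowercase words.
    return re.findall(r"\w+", text.lower())

def train(samples):
    # Stage 2b: count words per label (a multinomial Naive Bayes model,
    # used here as a stand-in; the real model may be something else).
    word_counts = {"ABUSIVE": Counter(), "NORMAL": Counter()}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["ABUSIVE"]) | set(word_counts["NORMAL"])
    return word_counts, doc_counts, vocab

def predict(model, text):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Log prior plus log likelihood with add-one smoothing.
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((counts[tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(SAMPLES)
print(predict(model, "Karakter yoksunu!"))  # ABUSIVE
print(predict(model, "Hava güzel"))         # NORMAL
```

Stage 3, improving the model, would then mean looking at where `predict` gets things wrong and going back to the data or the features.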
The first and probably hardest part was gathering the data. We collected thousands of samples from across the internet (mostly from Twitter, plus some popular forum sites in Turkey like DonanımHaber, and news websites) and labeled each one as ABUSIVE or NORMAL. Since this was most likely the hardest part of the project, I want to thank our summer intern Enes Mesut here.
Lastly, we tested our model and saw that it was struggling a bit to catch some swear words, so we added a dictionary-based classifier on top of it. Now it works pretty well if you ask me, but there is always room for improvement.
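The idea of putting a dictionary-based classifier on top of the model can be sketched roughly like this. Everything here is an assumption for illustration: the word list holds placeholder entries instead of real swear words, `fake_ml_score` stands in for the trained model, and the 0.5 threshold is arbitrary.

```python
import re
from typing import Callable

# Hypothetical placeholder entries; a real list would hold actual swear words.
SWEAR_WORDS = {"küfür1", "küfür2"}

def turkish_lower(text: str) -> str:
    # Python's default lower() maps "I" to "i" and turns "İ" into "i" plus a
    # combining dot, so the two Turkish capitals are handled first.
    return text.replace("İ", "i").replace("I", "ı").lower()

def tokenize(text: str) -> list:
    return re.findall(r"\w+", turkish_lower(text))

def dictionary_flag(text: str, lexicon=SWEAR_WORDS) -> bool:
    # Rule-based part: flag the text if any token is in the word list.
    return any(tok in lexicon for tok in tokenize(text))

def combined_label(text: str, ml_score: Callable[[str], float],
                   threshold: float = 0.5) -> str:
    # The dictionary catches explicit swearing; the model handles the rest.
    if dictionary_flag(text):
        return "ABUSIVE"
    return "ABUSIVE" if ml_score(text) >= threshold else "NORMAL"

def fake_ml_score(text: str) -> float:
    # Stand-in for the trained model: pretends it learned one abusive phrase.
    return 0.9 if "karakter yoksunu" in turkish_lower(text) else 0.1

print(combined_label("Sen ne kadar karakter yoksunu birisin?", fake_ml_score))  # ABUSIVE
print(combined_label("Bu bir KÜFÜR1 örneği", fake_ml_score))                    # ABUSIVE
print(combined_label("Bugün hava çok güzel.", fake_ml_score))                   # NORMAL
```

Checking the dictionary first means an explicit swear word is flagged even when the model's score is low, which is exactly the case the model was struggling with.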
B-) Who is it for?
I could cut it short and say it’s for everyone, but to be more specific: it’s pretty useful for online forum or blog owners, and most useful for parents who want to protect their children from the harsh climate of the Internet.
Let’s face it, the Internet is not a place for children. It’s filled with people from around the world who have different backgrounds and mindsets, so it’s only natural that bad things happen when these people interact with each other. We, humanity as a whole, are not yet at a stage of our existence where anyone and anything can be welcomed. Therefore one can argue that the Internet is not for kids. But at the same time it is a tool that makes nearly everything in life easier, so it is not reasonable to ban it for good. It is probably impossible to ban it for good anyway. One way or another kids can get on the Internet, and for the first time in history parents have no control over what their kids watch, read or see. We wish to give parents an opportunity to at least control what their children read on the Internet.
C-) An Ethical Concern
You have probably heard this many times before. Most of the time we, as machine learning engineers, have very little control over what our models learn from the data, and sometimes it becomes very hard to keep a model from becoming racist, sexist or homophobic. That’s because the data we use to train the model comes from actual human beings, and as we all know humans can sometimes be those things. For example, given enough tainted data, a model can classify a sentence that includes words like “gay, lezbiyen, eşcinsel” (“gay, lesbian, homosexual”) as abusive when in reality it is not. So when gathering the data we tried to keep these kinds of ethical points in mind, but as I said at the beginning of this part, most of the time we have very little control.
We created an Abusive Content Classifier to prevent text-based harassment on the Internet and let our users enjoy the best parts of it. It was a challenging process, but in the end it was worth it.
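One simple way to check for this kind of problem is to feed a classifier neutral sentences that contain these words and see whether any come back as abusive. The sketch below is only an illustration of that idea, not something taken from our pipeline: the template sentences are made up, and `biased_toy_classifier` is a hypothetical stand-in that imitates a model trained on tainted data.

```python
from typing import Callable, List

# Identity terms from the example above.
IDENTITY_TERMS = ["gay", "lezbiyen", "eşcinsel"]

# Hypothetical neutral templates; none of these sentences is abusive.
NEUTRAL_TEMPLATES = [
    "Arkadaşım {term}.",
    "{term} hakları hakkında bir haber okudum.",
]

def identity_false_positives(classify: Callable[[str], str]) -> List[str]:
    # Return every neutral probe sentence that the classifier marks ABUSIVE.
    flagged = []
    for term in IDENTITY_TERMS:
        for template in NEUTRAL_TEMPLATES:
            sentence = template.format(term=term)
            if classify(sentence) == "ABUSIVE":
                flagged.append(sentence)
    return flagged

def biased_toy_classifier(sentence: str) -> str:
    # Stand-in for a model that wrongly learned one identity term as abusive.
    return "ABUSIVE" if "eşcinsel" in sentence.lower() else "NORMAL"

for sentence in identity_false_positives(biased_toy_classifier):
    print("False positive:", sentence)
```

An empty list does not prove the model is fair, but a non-empty one is a clear sign that the training data needs another look.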
Machine Learning, Artificial Intelligence, AI, Abusive Content Detection, Harassment, Cyberbullying, Censorship, Swear Detection, Web Service
İlke Elvan, Machine Learning Engineer, VeriUs Technology