Open Access

Identifying vulgarity in Bengali social media textual content

Author(s) - Salim Sazzed

Publication year - 2021

Publication title - PeerJ Computer Science

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.806

H-Index - 24

ISSN - 2376-5992

DOI - 10.7717/peerj-cs.665

Subject(s) - vulgarity, Bengali, artificial intelligence, computer science, natural language processing, social media, lexicon, class (philosophy), benchmark (surveying), world wide web, literature, geography, art, geodesy

The presence of abusive and vulgar language in social media has become an issue of increasing concern in recent years. However, the prevalence and identification of vulgar language remain largely unexplored in low-resource languages such as Bengali. In this paper, we provide the first comprehensive analysis of the presence of vulgarity in Bengali social media content. We develop two benchmark corpora consisting of 7,245 reviews collected from YouTube and manually annotate them into vulgar and non-vulgar categories. The manual annotation reveals the ubiquity of vulgar and swear words in Bengali social media content (i.e., in the two corpora), ranging from 20% to 34%. To automatically identify vulgarity, we employ various approaches, such as classical machine learning (CML) classifiers, a Stochastic Gradient Descent (SGD) optimizer, a deep learning (DL) based architecture, and lexicon-based methods. We find that the swear/vulgar lexicon, although small in size, is effective at identifying vulgar language due to the high presence of some swear terms in Bengali social media. We observe that the performance of machine learning (ML) classifiers is affected by the class distribution of the dataset. The DL-based BiLSTM (Bidirectional Long Short-Term Memory) model yields the highest recall scores for identifying vulgarity in both datasets (i.e., in both original and class-balanced settings). In addition, the analysis reveals that vulgarity is highly correlated with negative sentiment in social media comments.
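
As a rough illustration of the lexicon-based approach mentioned in the abstract, the sketch below flags a comment as vulgar when any of its tokens appears in a swear lexicon, then computes recall for the vulgar class. It is not the paper's implementation: the lexicon entries (`swear_a`, `swear_b`), the sample comments, and the function names are hypothetical placeholders chosen for this example.

```python
import string

# Hypothetical placeholder lexicon; a real one would list Bengali swear/vulgar terms.
VULGAR_LEXICON = {"swear_a", "swear_b"}

# ASCII punctuation plus the Bengali danda (sentence terminator).
PUNCT = string.punctuation + "।"


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def is_vulgar(text, lexicon=VULGAR_LEXICON):
    """Flag a comment as vulgar if any token is in the lexicon."""
    return any(tok in lexicon for tok in tokenize(text))


def recall(y_true, y_pred):
    """Recall for the positive (vulgar) class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labelled comments (assumed data, for illustration only).
comments = [
    ("you are a swear_a!", True),
    ("nice video, thanks", False),
    ("what a SWEAR_B.", True),
    ("rude but uses no listed term", True),  # missed by the lexicon
]
y_true = [label for _, label in comments]
y_pred = [is_vulgar(text) for text, _ in comments]
print(f"recall = {recall(y_true, y_pred):.2f}")
```

The sketch splits on whitespace rather than using a `\w+` pattern, on the assumption that Bengali dependent vowel signs, being combining marks, may not be treated as word characters by Python's `re` module. The last toy comment shows the main limitation of a small lexicon: vulgar text that uses no listed term goes undetected, which lowers recall.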
