
Style change detection
Author(s) - Sukanya Nath
Publication year - 2022
Language(s) - English
Resource type - Dissertations/theses
DOI - 10.35662/unine-thesis-2931
Subject(s) - humanities, philosophy
Stylometry is the study of authors' writing styles, with applications including authorship attribution, verification, identification and profiling. By analyzing the stylometric features of a given text, the characteristic writing style of an author can be represented and distinguished from that of another author. The detection of different writing styles in the same document, suggesting multiple authorship, is called style change detection (SCD). Detecting multiple authorship is considerably challenging because the number of participating authors is not known a priori and because no additional reference corpus is available. The goal of this thesis is to leverage existing stylometry techniques and devise novel methods to distinguish the writing styles of authors in a multi-authored document in a simple and practical manner. We address this problem by decomposing it into three sub-problems. First, we need to determine whether a document is written by one or more authors. A binary classification approach is taken in which each document is transformed into a suitable feature matrix and fed to a variety of learning models such as logistic regression, support vector machines, random forests and neural networks. If a document is written by multiple authors, the second sub-problem is to determine the location of the style changes in the document, under the assumption that style changes may occur only at the end of a paragraph but not within a paragraph. Our approach is to transform the problem into an authorship verification problem in which two paragraphs are compared, i.e., the two paragraphs are either stylistically different or they are not. The document is broken into paragraphs and, with the help of word embeddings such as GloVe, an embedded feature vector is computed for each paragraph. A Siamese neural network is then trained on pairs of these feature vectors, and their mutual distance is measured with a suitable distance measure.
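The paragraph-level comparison described for the second sub-problem can be illustrated with a minimal Python sketch. It is not the method of the thesis: character trigram counts stand in for the GloVe-based paragraph vectors, a plain cosine distance with a fixed cut-off stands in for the trained Siamese network, and the function names and the threshold value of 0.5 are hypothetical choices made only for this illustration.

```python
import math
import re
from collections import Counter


def split_paragraphs(document):
    """Split a document on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def char_trigram_profile(text):
    """Assumed stand-in for an embedded feature vector: trigram counts."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty profile as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def style_change_flags(document, threshold=0.5):
    """Return one flag per paragraph boundary: 1 = style change, 0 = none.

    Style changes are only considered between consecutive paragraphs,
    matching the assumption stated for the second sub-problem.
    """
    profiles = [char_trigram_profile(p) for p in split_paragraphs(document)]
    return [
        1 if cosine_distance(profiles[i], profiles[i + 1]) > threshold else 0
        for i in range(len(profiles) - 1)
    ]
```

For a document with n paragraphs, style_change_flags returns n - 1 flags. In a learned setting, the fixed threshold would be replaced by the decision of a trained model.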
The third sub-problem is a form of authorship clustering and seeks to ascertain the number of distinct authors of the text. We start with the assumption that style changes may occur within a paragraph and propose two algorithms called Threshold Based Clustering (TBC) and Window Merge Clustering (WMC). The general approach is to segment the document into chunks of text called windows. Each window is converted into a feature vector of words or, more generally, of stylistic patterns. The mutual distances among the window feature vectors are measured using suitable distance measures, and a distance matrix is created for all the windows. The TBC algorithm sorts the pairs of windows by their distance and puts the closest windows in the same cluster, using appropriate thresholds for adding a new node and for merging clusters. The number of clusters indicates the number of authors. The WMC algorithm starts out like TBC and iteratively puts the closest windows in the same cluster. However, in each iteration, the windows in a cluster are merged to form a concatenated cluster, so that each cluster is represented by a combined representation of all of its members rather than by individual distances. Thus, the distance matrix is recalculated at each iteration. A variation of the third task aims to uniquely assign texts to their respective authors based on style changes. Under the assumption that style changes may occur only between paragraphs but not within them, we propose to use the style change locations derived in the solution of the second sub-problem. Thereafter, clustering approaches based on TBC and hierarchical clustering are used to determine the number of clusters, i.e., authors. We evaluate our methods on the PAN CLEF datasets and show that we can achieve state-of-the-art performance.
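The threshold-based idea can be sketched in Python as follows. The abstract does not specify how the thresholds are applied, so the rules below are assumptions made for illustration: a pair of unassigned windows starts a cluster, and an unassigned window joins an existing cluster, when the distance is at most add_threshold; two existing clusters are merged when a cross-cluster pair is at most merge_threshold; windows that are never assigned end up as singleton clusters. The function name and both parameter names are hypothetical.

```python
from itertools import combinations


def threshold_based_clustering(dist, add_threshold, merge_threshold):
    """Cluster windows from a symmetric distance matrix (list of lists).

    Returns a sorted list of clusters, each a sorted list of window
    indices. The number of clusters is the estimated number of authors.
    """
    n = len(dist)
    cluster_of = {}  # window index -> cluster id
    members = {}     # cluster id -> set of window indices
    next_id = 0

    # Visit window pairs from the closest to the most distant.
    pairs = sorted(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])
    for i, j in pairs:
        d = dist[i][j]
        ci, cj = cluster_of.get(i), cluster_of.get(j)
        if ci is None and cj is None:
            if d <= add_threshold:  # start a new cluster with both windows
                members[next_id] = {i, j}
                cluster_of[i] = cluster_of[j] = next_id
                next_id += 1
        elif ci is None:
            if d <= add_threshold:  # add window i to the cluster of j
                members[cj].add(i)
                cluster_of[i] = cj
        elif cj is None:
            if d <= add_threshold:  # add window j to the cluster of i
                members[ci].add(j)
                cluster_of[j] = ci
        elif ci != cj and d <= merge_threshold:  # merge two clusters
            for w in members[cj]:
                cluster_of[w] = ci
            members[ci] |= members.pop(cj)

    # Assumption: any window left unassigned forms its own cluster.
    for w in range(n):
        if w not in cluster_of:
            members[next_id] = {w}
            cluster_of[w] = next_id
            next_id += 1

    return sorted(sorted(c) for c in members.values())
```

A WMC-style variant would, after each assignment, concatenate the windows of a cluster and recompute the distance matrix on the merged representations instead of reusing the original pairwise distances.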