ALGORITHMS FOR IDENTIFYING THE AUTHOR OF THE TEXT

Abstract

The problem of classifying texts and identifying their actual author arises in many areas of human activity. It has found wide application in forensics, plagiarism-detection systems, the analysis of complaints and comments, and similar tasks. As a rule, the personal data submitted by the author must be verified against the text. Quite often, these data include the author's nationality, gender, and age. Modern methods and algorithms for identifying the author of a text make it possible to automate this process.

Modern algorithms rely on neural networks trained on labelled datasets. Such datasets are not always available, so they must be created, classified and labelled. Labelling a dataset requires algorithms that can identify the characteristic features of a text that reflect the author's data. The article proposes algorithms for finding and analysing the characteristic features of a text based on its deviation from the standard.

To determine the author's age group, a table of neologisms was created, indicating the age category of the people who typically use them. Datasets were labelled by the author's nationality (first language) on the basis of words borrowed from English, Spanish and French. To analyse the gender of the author, the frequency of use of words with certain characteristics is calculated, and the deviation value is used as the weight of the characteristic.
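
The two labelling ideas described here, a lookup in a neologism table and a frequency deviation used as a feature weight, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table entries, the feature groups, the baseline frequencies and all function names are hypothetical placeholders chosen only to make the example runnable, and the deviation is assumed to be a simple difference between observed and baseline relative frequency.

```python
import re
from collections import Counter

# Hypothetical neologism table (word -> age category); entries are
# illustrative placeholders, not taken from the article.
NEOLOGISM_AGE = {
    "rizz": "under_25",
    "yeet": "under_25",
    "groovy": "over_50",
}

# Hypothetical feature groups and their baseline ("standard") relative
# frequencies; the values are assumptions for illustration only.
FEATURE_WORDS = {
    "hedges": {"maybe", "perhaps", "somewhat"},
    "intensifiers": {"very", "really", "so"},
}
BASELINE_FREQ = {"hedges": 0.010, "intensifiers": 0.015}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def label_age(tokens):
    """Return the age category whose neologisms occur most often, or None."""
    counts = Counter(NEOLOGISM_AGE[t] for t in tokens if t in NEOLOGISM_AGE)
    return counts.most_common(1)[0][0] if counts else None


def feature_deviations(tokens):
    """Return observed minus baseline relative frequency per feature group.

    Under the assumption stated above, this deviation serves as the
    weight of the corresponding characteristic.
    """
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in FEATURE_WORDS}
    deviations = {}
    for name, words in FEATURE_WORDS.items():
        observed = sum(1 for t in tokens if t in words) / total
        deviations[name] = observed - BASELINE_FREQ[name]
    return deviations


tokens = tokenize("That was really, really groovy, maybe.")
age_label = label_age(tokens)
weights = feature_deviations(tokens)
```

In a labelling pipeline of this kind, `age_label` and `weights` would be stored alongside each text as its labels and feature weights; how the article combines them is not specified in the abstract.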

The algorithms above were used to label the datasets on which the neural network was built. The network was trained using three text classification models, each of which analyses the text according to the given characteristics that correspond to the author's data.

The developed neural network automatically labels text datasets and classifies texts by categories of the author's personal data, determining the probability that a text belongs to each class.

The neural network was tested on a dataset of English texts by various authors. The share of correctly identified author's personal data, according to the developed characteristics, was 96 per cent.

Keywords: dataset labelling, author data, anti-plagiarism algorithms, neural network.

Published
2024-01-08
How to Cite
Vanin, V., Zalevska, O., Mozharovsky, V., Yablonsky, P., & Spirintsev, D. (2024). ALGORITHMS FOR IDENTIFYING THE AUTHOR OF THE TEXT. Modern Problems of Modeling, (25), 52-59. Retrieved from http://magazine.mdpu.org.ua/index.php/spm/article/view/3200