ALGORITHMS FOR IDENTIFYING THE AUTHOR OF THE TEXT

Volodymyr Vanin; Olga Zalevska; Valeriy Mozharovsky; Petro Yablonsky; Dmytro Spirintsev

Volodymyr Vanin National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute» https://orcid.org/0000-0001-7008-7269
Olga Zalevska National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute» https://orcid.org/0000-0002-3163-1695
Valeriy Mozharovsky National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute» https://orcid.org/0009-0002-0884-4876
Petro Yablonsky National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute» https://orcid.org/0000-0002-1971-5140
Dmytro Spirintsev Bogdan Khmelnitsky Melitopol State Pedagogical University (Ukraine) https://orcid.org/0000-0001-5728-6626

Abstract

In many fields, texts need to be classified and their actual author identified. This task arises widely in forensics, plagiarism-checking systems, the analysis of complaints and comments, and so on. As a rule, the personal data submitted by an author must be verified against the text itself. Quite often, these data include the author's nationality, gender, and age. Modern methods and algorithms for identifying the author of a text make it possible to automate this verification.

Modern algorithms rely on neural networks trained on labelled datasets. Such datasets are not always available, so they often have to be created, classified and labelled. Labelling requires algorithms that can identify the characteristic features of a text that reflect the author's data. The article proposes algorithms for finding and analysing such features based on the text's deviation from the standard.
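The idea of measuring a text's deviation from a standard can be illustrated with a minimal Python sketch. It assumes that the "standard" is a table of baseline relative word frequencies; the function names, the tokenisation rule and the baseline values are hypothetical and are not taken from the article.

```python
import re
from collections import Counter


def relative_frequencies(text, vocabulary):
    """Return the share of tokens in `text` accounted for by each vocabulary word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in vocabulary}


def deviation_from_standard(text, baseline):
    """Return observed minus baseline relative frequency for each word."""
    observed = relative_frequencies(text, baseline.keys())
    return {word: observed[word] - baseline[word] for word in baseline}


# Hypothetical baseline frequencies (illustrative values only).
baseline = {"very": 0.01, "so": 0.01}
deviations = deviation_from_standard("It was so so very good", baseline)
```

A positive value means the word is used more often than the assumed baseline, a negative value less often.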

To determine the author's age group, a table of neologisms was created, indicating the age category of the people who typically use them. Datasets were labelled by the author's nationality (first language) on the basis of borrowed words from English, Spanish and French. To analyse the author's gender, the frequency of use of words with certain characteristics is calculated, and the deviation value is used as the weight of the characteristic.
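Table-based labelling of this kind might look roughly like the Python sketch below. It assumes a simple majority vote over the table entries found in the text; the table contents, the category names and the voting rule are hypothetical illustrations, not the article's actual table or procedure.

```python
import re
from collections import Counter

# Hypothetical neologism table: word -> age category (illustrative entries only).
NEOLOGISMS = {
    "rizz": "under_25",
    "yeet": "under_25",
    "groovy": "over_50",
}


def label_age_group(text, table=NEOLOGISMS):
    """Return the age category whose table words occur most often, or None."""
    tokens = re.findall(r"[a-z']+", text.lower())
    votes = Counter(table[token] for token in tokens if token in table)
    if not votes:
        return None  # no table word found, so the text stays unlabelled
    return votes.most_common(1)[0][0]
```

The same lookup pattern could be reused for the first-language label by swapping in a table of borrowed words, again as an assumption about how such a table would be applied.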

These algorithms were used to label the datasets on which the neural network was built. The network was then trained using three text classification models, each of which analyses the text according to the given characteristics that correspond to the author's data.

The developed neural network automatically labels text datasets and classifies texts by categories of the author's personal data, determining the probability that a text belongs to each class.
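One common way to turn a classifier's raw scores into per-class probabilities is the softmax function. The abstract does not say how the probabilities are computed, so the Python sketch below is only an assumption, and the class names and score values are hypothetical.

```python
import math


def softmax(scores):
    """Convert a mapping of class -> raw score into class -> probability."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - peak) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical raw scores for one text from an age-group classifier.
probabilities = softmax({"under_25": 2.0, "25_to_50": 0.5, "over_50": -1.0})
predicted = max(probabilities, key=probabilities.get)
```

The resulting values are non-negative and sum to one, so each can be read as the estimated probability that the text belongs to that class.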

The neural network was tested on a dataset of English texts by various authors. According to the developed characteristics, the author's personal data were identified correctly in 96 per cent of cases.

Keywords: dataset labelling, author data, anti-plagiarism algorithms, neural network.
