Due to the constant influx of electronic text documents, forensic linguists, administrators of Internet forums, and supervisors of social networks increasingly face the problem of uncertain authorship. Sometimes it is necessary to determine the exact identity of an author (e.g. when confidential company information is disclosed on an Internet forum); sometimes it is sufficient to reveal only the author’s characteristics, such as age (e.g. when online content is available only to adults) or gender (e.g. when a 50-year-old man is pretending to be a 15-year-old girl).
Research confirms that authorship can be identified by analysing an author’s writing style, but manual analysis requires huge human resources and is less accurate than automatic analysis (authorship attribution methods applied to English texts exceed 80% accuracy, whereas human effort achieves only ~55%). This is not surprising: a human cannot take so many different style factors into account at the same time. Although the concept of idiolect (an individual’s distinctive and unique use of language) was discussed for the Lithuanian language more than 40 years ago, authorship research based on automatic methods is a relatively new topic. Lithuanian differs considerably from other languages (e.g. English, for which the authorship problem has been widely investigated) in its relatively free word order within a sentence; its rich vocabulary (Lithuanian has ~0.5 million headwords, while English has only ~0.3 million); its rich morphology and word derivation system (inflectional endings, diminutive and hypocoristic suffixes); and its alphabet (diacritics are often omitted in non-normative texts). All these differences require deeper analysis, because the methods that achieve such high accuracy on English texts are not very effective on Lithuanian ones.
The aim of this project is to find automatic methods for solving authorship attribution and author profiling problems (profiling by age, gender and political attitude) for the Lithuanian language. The research covers various functional styles and language types (from normative language to Internet comments).
The project “Automatic Authorship Attribution and Author Profiling for the Lithuanian Language” (acronym ASTRA) (No. LIT-8-69) is implemented by Vytautas Magnus University (VMU) and Kaunas University of Technology (KUT). The project is funded by the Research Council of Lithuania. Project duration: March 1st, 2014 – December 31st, 2015.