Corpora

Text corpora prepared during the project:

For the authorship attribution problem:

  • STENOGRAMOS_INDV contains Lithuanian parliamentary transcripts (download)
  • FORUMAS_INDV contains Internet forum texts (download)
  • GROŽINĖ_INDV contains fiction texts (download)
  • INT_KOMENTARAI_INDV contains Internet comments (download)
  • INT_KOMENTARAI_INDV2 contains Internet comments (expanded) (download)

For the author profiling problem:

  • AMŽIUS_PROF contains Lithuanian parliamentary transcripts for author profiling by age (download)
  • GROŽ_AMŽIUS_PROF contains fiction texts for author profiling by age (download)
  • LYTIS_PROF contains Lithuanian parliamentary transcripts for author profiling by gender (download)
  • GROŽ_LYTIS_PROF contains fiction texts for author profiling by gender (download)
  • POLITINĖS_PAŽIŪROS_PROF contains Lithuanian parliamentary transcripts for author profiling by political attitude (download)

Meta information about the corpora (included in the downloads) is currently available only in Lithuanian, so if you have any questions, please do not hesitate to contact us.

You are welcome to use the corpora in your own research as well!