Authorship Attribution Database

The Authorship Attribution Database (AAD) contains short articles from 100 different authors, whose texts are uniformly distributed over 10 different subjects:
  • Miscellaneous,
  • Law,
  • Economics,
  • Sports,
  • Gastronomy,
  • Literature,
  • Politics,
  • Health,
  • Technology,
  • Tourism.

The sources were 15 Brazilian newspapers located all over the country. We chose 30 short articles from each author, for a total of 3,000 documents. The articles usually deal with controversial subjects and express the author's personal opinion. On average, the articles have 600 tokens (words) and 350 hapax legomena (words occurring only once). One aspect worth noting is that articles of this kind may go through a revision process, which can remove some of the author's personal characteristics from the text. In addition, authorship attribution using short articles poses an extra challenge, since the number of features that can be extracted is directly related to the size of the text.
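
As a rough illustration of the token and hapax counts mentioned above, the following minimal sketch computes both for a single text. The tokenization rule (lowercased runs of word characters), the function name token_stats, and the sample sentence are assumptions made for this example; they are not the procedure used to build the database.

```python
import re
from collections import Counter


def token_stats(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of hapax legomena) for one text."""
    # Assumption: a token is a maximal run of word characters, lowercased.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # A hapax legomenon is a word that occurs exactly once in the text.
    hapax = sum(1 for c in counts.values() if c == 1)
    return len(tokens), hapax


# Hypothetical sample sentence, not taken from the database.
sample = "O time jogou bem, mas o resultado não foi bom."
n_tokens, n_hapax = token_stats(sample)
print(f"tokens: {n_tokens}, hapax: {n_hapax}")
```

A different tokenizer (for example, one that keeps hyphenated words together or preserves case) would give slightly different counts, so figures obtained this way are only comparable when the same rule is applied to every article.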

How to obtain access to the data

The AAD database may be used for non-commercial research provided you acknowledge the source by citing the following paper in publications about your research:

  • P.J. Varela, E. Justino, L.S. Oliveira, Selecting Syntactic Attributes For Authorship Attribution, IEEE International Joint Conference on Neural Networks, 2011, 161–172. (pdf)

Click here to download the database (6.2 MB)

Related papers

Our latest results on this database can be found in the following reference:

  • Oliveira Jr., W., Oliveira, L. S., Justino, E., Comparing Compression Models for Authorship Attribution, Forensic Science International, 228(1-3):100-104, 2013. pdf.

This database is licensed under a Creative Commons Attribution 4.0 International License.