Rastro-DM: data mining with a trail - A methodology for documenting data mining projects and its application in the construction of a text classifier of documents associated with damages to the public treasury

  • Marcus Vinícius Borela de Castro
  • Remis Balaniuk

Abstract

This paper proposes Rastro-DM (Trail Data Mining), a methodology for documenting data mining (DM) projects that focuses not on the generated model but on the processes behind its construction, in order to leave a trail (rastro in Portuguese) of planned actions, training completed, results obtained, and lessons learned. The proposed practices complement structuring DM methodologies such as CRISP-DM, which establish a methodological and paradigmatic framework for the DM process. The application of these practices and their benefits are illustrated in a project called “Cladop”, created to classify PDF documents associated with the investigative process of damages to the Brazilian Federal Public Treasury. Building the Rastro-DM kit within a single project is a small step that can lead to an institutional leap once the trail is shared and used across the enterprise.

Author Biography

Marcus Vinícius Borela de Castro
Auditor at the Court of Accounts of Brazil (TCU) since 1996. Bachelor’s degree in Informatics from the Federal University of Viçosa (1990), and specialist in IT Governance from the University of Brasilia (2012) and in Data Analysis from the Brazilian Federal Court of Accounts’ Serzedello Corrêa Institute (2019).
Remis Balaniuk
Auditor at the Court of Accounts of Brazil since 1989. Bachelor’s degree in Computer Science from the University of Brasilia (1986), Master’s degree in Computer Science from UFRGS (1989), doctoral degree in Informatics from the Institut National Polytechnique of Grenoble (1996), and postdoctoral research in Computer Science at Stanford University (2002) and at the Institut National pour la Recherche en Informatique et Automatique (2000). Visiting researcher at the University of Oxford (2020).
Published
2021-10-20
Section
Articles