We recently got a Demo paper accepted at EDBT 2019 describing our HOTMapper tool that enables defining and storing mappings of Open Data sources into an integrated data source in MonetDb. The tool is currently used to store and manage mappings of educational Open Data made available yearly by INEP. The resulting mapped data source can be accessed by a web portal, LDE (Laboratório de Dados Educacionais) or through a RESTful API. The overall web platform has been introduced in a previous post. Up to this date (Jan. 2019), we map more than 100 tables and about 600 million records.
The creation of the tool is motivated by the need of extracting and integrating Open Data sources from large input CSV files. The input files are often de-normalized, with about hundreds of columns; new versions are made available periodically, with new columns and data.The three main issues are:
- Integrated source creation: we wanted to have one integrated source in order to efficiently query it.
- Schema evolution: the input sources have different columns, or the existing column definition may change as well.
- Data evolution: the data definitions may change as well, having different values through the periodical releases.
While these issues could be handled by existing ETL/mapping tools, they are overly complex to the task we need, and we could not afford maintaining them through the years. For this reason, we developed the HOTMapper tool. It consists of a CLI (Command Line Interface), that takes as input the Open Data sources is CSV, the set of mappings and that produces the integrated sources in the target database. The architecture is shown below.
The key input component is the set of mappings, which are written in CSV as well. It is a simple format containing the periodicity of the input data and the mappings are written using a subset of the SQL language. We explicitly wanted to have such simple format, to easy the maintenance over the years, and this has been proven effective so far. A set of examples is available at : https://gitlab.c3sl.ufpr.br/tools/hotmapper/tree/master/mapping_protocols .
The mappings are processed by the available CLI actions, which are: create (create new tables), drop (drop an existing table), insert (insert new data or columns), remap (modifies the initial table definitions), update (updates existing columns) and generate report. The Figure below show a quick view of a set of input data and a corresponding (simple) mapping.
The tool is being currently used and improved, by our team of internships students. A lot of people contributed, some of them were tool developers, tool users and tool testers: Rudolf Eckelberg, Hamer Iboshi, Victor Picussa, Henrique Ehrenfried, Rafael Castilho, Vytor Calixto, Fernando Erd, Glenda Train, Gabriel Ruschel, Gustavo Hornig. To close the post, the screenshot below shows the execution log during an INSERT action. It is not a fancy graphical interface, but it has been very useful for us!