Developing Open Data solutions is challenging for several reasons. The first challenge, once the application is defined, is to obtain the right data source and process it in a useful way. There are several governmental Open Data initiatives worldwide, such as those of Brazil, the USA, and the UK (to name only three). Two common ways of consuming their data are downloading the complete datasets or accessing them through an Open API. The latter spares developers from building their own data processing infrastructure, which many cannot (or do not want to) afford.

Developers who want to create an application often make the following assumption: give me tons of data and I will process it in a useful way. However, producing this data has a significant cost. A usual way is to produce large CSV or JSON files extracted from OLTP or OLAP systems. Another is to extract data from different sensors.

Our group (C3SL: Centro de Computação Científica para Software Livre) developed a monitoring and evaluation (M&E) system for the Brazilian Ministry of Communications, in which we had to deal with 1) the production and collection of data, 2) the transformation of huge data streams into useful information, and 3) the provision of intuitive graphical analytic interfaces. The system monitors digital inclusion initiatives, which, in short, are devices (computers) made available to the public at different points across the country. It receives information on availability, hardware and software inventory, and network bandwidth usage, amounting to billions of records. This makes it possible to check whether the devices are being used, and used correctly.
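
As a rough illustration of the second task (turning raw records into useful information), the sketch below estimates how often each device was heard from during a time window. It is not the system's actual implementation: the record layout `(device_id, timestamp)`, the function name `availability`, and the fixed reporting interval passed as `interval_s` are all assumptions made for the example.

```python
from collections import defaultdict


def availability(records, window_start, window_end, interval_s=300):
    """Return, per device, the fraction of reporting slots with at least one report.

    `records` is an iterable of (device_id, unix_timestamp) pairs; this layout
    is hypothetical and only serves the illustration.
    """
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")

    # Number of reporting slots in the window (ceiling division).
    total_slots = -(-(window_end - window_start) // interval_s)

    # For each device, remember which slots contained at least one report.
    seen = defaultdict(set)
    for device_id, ts in records:
        if window_start <= ts < window_end:
            seen[device_id].add((ts - window_start) // interval_s)

    return {device: len(slots) / total_slots for device, slots in seen.items()}
```

A device that reported in two out of four slots gets a ratio of 0.5; devices with no reports in the window simply do not appear in the result.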

The Data Collector module is a daemon that must be installed on each of the computers (or PoPs, points of presence) and that sends information to our Storage module. Because the monitored environment is heterogeneous, the collector takes the category of each device into account. It is available for major Linux distributions and needs to be lightweight in terms of both CPU and bandwidth. Every 5 minutes it sends indicators on hardware (CPU, RAM, and hard disk), software (installed operating system), and network. It also monitors routers through asynchronous SNMP requests. The data is collected and processed in our Storage module, and the resulting information can be viewed publicly.
Additional information about the challenges we encountered is available in the paper “Transparency Meets Management: a Monitoring and Evaluating Tool for Governmental Projects”, published at AICCSA 2017.
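
To make the collector's role more concrete, here is a minimal sketch of a periodic reporting loop using only the Python standard library. It is not the actual Data Collector: the payload layout, the function names, and the `send` callback (which stands in for whatever transport reaches the Storage module) are assumptions, and network indicators and SNMP polling are left out.

```python
import json
import os
import platform
import shutil
import time

REPORT_INTERVAL_S = 5 * 60  # the text states a 5-minute reporting period


def read_mem_total_kb(path="/proc/meminfo"):
    """Return total RAM in kB from a Linux meminfo file, or None if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def collect_indicators(device_id):
    """Build one report; the field names are hypothetical."""
    disk = shutil.disk_usage("/")
    try:
        load_1min = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1min = None  # load average not available on this platform
    return {
        "device_id": device_id,
        "timestamp": int(time.time()),
        "hardware": {
            "load_1min": load_1min,
            "mem_total_kb": read_mem_total_kb(),
            "disk_total_bytes": disk.total,
            "disk_free_bytes": disk.free,
        },
        "software": {"os": platform.system(), "release": platform.release()},
    }


def run(device_id, send, interval_s=REPORT_INTERVAL_S, max_reports=None):
    """Send one JSON report per interval; `send` is a caller-supplied callback."""
    sent = 0
    while max_reports is None or sent < max_reports:
        send(json.dumps(collect_indicators(device_id)))
        sent += 1
        if max_reports is None or sent < max_reports:
            time.sleep(interval_s)
```

For example, `run("pop-001", print, max_reports=1)` prints a single JSON report and returns. Passing the transport in as a callback keeps the collection logic independent of how reports actually reach the server.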
The system is available at http://simmc.c3sl.ufpr.br (in Portuguese) and the source code at https://gitlab.c3sl.ufpr.br/minicom/simmc.