Finding information over Open Data sources and combining them is a complex task, because it is necessary to find reliable information, to sanitize the data and to integrate it, prior to executing any query for data analysis.
Once the integration is done, it is necessary to know in details how the data relates to correctly combine the available data. National governments are typical Open Data producers that make available large amounts of unrelated open data. For instance, the Brazilian government publishes many data in two web portals, such as the Brazilian Open Data, or National Institute of Educational Researches (INEP)>. In addition, the integrated data demands maintenance efforts to keep data reliable and consistent. There are solutions that enable querying directly over the original data sources, but the data cleaning and combination problem remains, thus it continues to be a difficult task for data analysts.
The Blended Integrated Open Data (BIOD) https://biod.c3sl.ufpr.br/index_en.html is an integrated repository that can be accessed through an API to execute analytical queries. It is built using the framework called BlenDb. It has been developed by the Center of Computer Science and Free Software (C3SL) located at the Federal University of Parana.
The repository objective is to provide an analytical API, in which it is not necessary to know the relations between data, only the metrics (calculations over some measures), dimensions (degree of aggregation) and filters (selection of a subset of the data). The data combinations are found automatically by the BlenDb framework, when it is possible, based on a configuration file previously defined, but that is transparent from the API user. It relies on the similarity of the attributes names. We defined a virtual schema, which is an abstraction where each available metric has one corresponding relation. The framework finds the necessary joins, and it privileges the generation of queries with the smallest join path and also the smaller number of projection attributes.
The current version of BIOD is composed by the data sources and tables described above. So far, it contains more than 2 billions of records and more than 800 attributes. The degree of normalization depends on the original data source, thus some tables have more than hundreds of attributes. The original data is in Portuguese, but we provide English translation to ease the understanding.
A query in BlenDb has the following format:
http://biod.c3sl.ufpr.br/v1/data?METRIC_1,METRIC_N &dimensions=DIMENSION_1,DIMENSION_N &filters=CLAUSE_1_FILTER_1,CLAUSE_1_FILTER_N; CLAUSE_N_FILTER_1,CLAUSE_N_FILTER_N
- Metrics: it is necessary to set the metrics, which are calculations over the available measuers.
- Dimensions: it is possible, not mandatory, to specify a number of dimensions.
- Filters: the returned data can be filtered according to specific criteria. The filters can be separated by “;”, (AND) and “,” (OR)
The framework builds the query, by finding the tables, its attributes and the relationships between them, transparently to the user. We call this query format as RFQ – relation-free query, because it is not allowed to specify relations, only metrics, dimensions and filters. As already stated, the goal is avoid explicitly knowing and specifying JOINs over the integrated data.
Conventions of the attributes names:
We have defined a set of conventions to name the parameters of the API, to ease its understanding, following the example below:
Name | Aggregation | Type of data | Description |
---|---|---|---|
met:count:cidade:id | count | integer | Number of cities |
met:avg:docente:idade | avg | float | Average Professor’s age |
Note that the name met: represents the metric, count: is the type of aggregation, followed by the kind of data and attribute name, respectively cidade:id. On the second line, it uses the avg aggregation function. BIOD uses this convention for the following functions: SUM, AVG, MAX, MIN, COUNT. When specifying the dimensions or the filters, the prefix is dim:.
- Click here to download the metrics in CSV, or to access the JSON response.
- Click here to download the dimesions in CSV, or to access them in JSON.
Calling the BIOD API
Consider it is necessary to do the following analytic query: we need to return, by Brazilian region, the number of internet acess that are accessible in a set of cities, the GDB average, the population count, the number of higher education institutions and schools, filtered by active internet points, for the years 2014 to 2017, ordered by the GDP of each region.
This question can be answered with the API call below:
The result is a JSON object, as shown in the Figure below. The result can also be obtained in a CSV file, by adding an additional parameter at the end of the URL: ‘&format=csv’
The BIOD data can be directly integrated into a Google Sheet. The integration is explained in this link.
A detailed description about the data is available in Portuguese at arXiv.
A research paper describing the approach and a use case is published at the 22nd ICEIS 2020. Download the article here.
- Lucas F. de Oliveira, Alessandro Elias, Fabiola Santore, Diego Pasqualin, Luis C. E. Bona,
Marcos Sunyé and Marcos Didonet Del Fabro. Open Data Analytic Querying using a Relation-free API. 22nd ICEIS (may/2020), Online streaming