{"id":571,"date":"2018-08-31T13:36:53","date_gmt":"2018-08-31T16:36:53","guid":{"rendered":"http:\/\/web.inf.ufpr.br\/didonet\/?p=571"},"modified":"2019-02-05T11:04:38","modified_gmt":"2019-02-05T13:04:38","slug":"large-scale-open-data-processing-using-apache-spark","status":"publish","type":"post","link":"https:\/\/web.inf.ufpr.br\/didonet\/2018\/08\/31\/large-scale-open-data-processing-using-apache-spark\/","title":{"rendered":"Large-scale open data processing using Apache Spark"},"content":{"rendered":"<p>We have been working on different initiatives to process large amounts of open data and to produce useful information, for instance, the <a href=\"http:\/\/web.inf.ufpr.br\/didonet\/2018\/05\/28\/open-educational-data-from-requirements-to-end-users\/\">Educational Data Lab<\/a> or the Web Portal for <a href=\"http:\/\/web.inf.ufpr.br\/didonet\/2018\/03\/26\/searching-and-ranking-educational-resources\/\">Educational Resources<\/a>. One recurrent difficulty is the initial data analysis and transformations, where it is necessary to understand the data before loading it into some specific storage.<\/p>\n<p>In this phase it is often necessary to &#8220;play&#8221; with the data in several ways, i.e., to apply sets of transformations, to check the results, to re-apply them with changes, and so on. It is an interactive process by nature. Existing data extraction tools, especially full-fledged ETL tools, are not very easy to use (and that is putting it optimistically).<\/p>\n<p>To support this interactive process, our group, and more specifically Evandro Kuszera, developed Metamorfose, an interactive data transformation tool built on top of the <a href=\"http:\/\/spark.apache.org\">Apache Spark<\/a> framework. The tool has some nice features: 1) its graphical interface is simple, making it easy to process tabular data; 2) already processed data can be processed again, i.e., transformations can be chained into a workflow; 3) the mappings are written in JavaScript or SQL, making it easy to start coding. 
A screenshot of the mappings is shown below. Writing the transformations in JavaScript is especially useful for programmers, and it does not require installing a relational database.<\/p>\n<p><img fetchpriority=\"high\" decoding=\"async\" class=\" wp-image-578\" src=\"http:\/\/web.inf.ufpr.br\/didonet\/wp-content\/uploads\/sites\/12\/2018\/08\/screenViewDatasetSchema.png\" alt=\"\" width=\"691\" height=\"272\" srcset=\"https:\/\/web.inf.ufpr.br\/didonet\/wp-content\/uploads\/sites\/12\/2018\/08\/screenViewDatasetSchema.png 1215w, https:\/\/web.inf.ufpr.br\/didonet\/wp-content\/uploads\/sites\/12\/2018\/08\/screenViewDatasetSchema-300x118.png 300w, https:\/\/web.inf.ufpr.br\/didonet\/wp-content\/uploads\/sites\/12\/2018\/08\/screenViewDatasetSchema-768x302.png 768w, https:\/\/web.inf.ufpr.br\/didonet\/wp-content\/uploads\/sites\/12\/2018\/08\/screenViewDatasetSchema-1024x403.png 1024w\" sizes=\"(max-width: 691px) 100vw, 691px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>We have applied this tool to process several GB of open public data extracted from <a href=\"http:\/\/www.inep.gov.br\">INEP<\/a>, and it was quite useful for this initial processing. More information about the tool can be found in this <a href=\"http:\/\/www.inf.ufpr.br\/didonet\/files\/Metamorfose-poster-SBBDv1.pdf\">nice poster<\/a> and <a href=\"http:\/\/sbbd.org.br\/2018\/wp-content\/uploads\/sites\/5\/2018\/08\/011-sbbd_2018_comp.pdf\">article<\/a> (published as a demo paper at the Brazilian Symposium on Databases &#8211; SBBD 2018), as well as in its <a href=\"https:\/\/github.com\/evandrokuszera\/metamorfose\">source code here<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We have been working on different initiatives to process large amounts of open data and to produce useful information, for instance, the Educational Data Lab or the Web Portal for Educational Resources. 
One recurrent difficulty is the initial data analysis and transformations, where it is necessary to understand the data before loading it into some&hellip;<\/p>\n","protected":false},"author":21,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,18,17,22,21],"tags":[],"class_list":["post-571","post","type-post","status-publish","format-standard","hentry","category-c3sl","category-data-transformation","category-interoperability","category-javascript","category-spark"],"_links":{"self":[{"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/posts\/571","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/users\/21"}],"replies":[{"embeddable":true,"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/comments?post=571"}],"version-history":[{"count":6,"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/posts\/571\/revisions"}],"predecessor-version":[{"id":580,"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/posts\/571\/revisions\/580"}],"wp:attachment":[{"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/media?parent=571"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/categories?post=571"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/web.inf.ufpr.br\/didonet\/wp-json\/wp\/v2\/tags?post=571"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}