Generalized Artificial Neural Network

Artificial neural networks (ANNs) are flexible and popular tools for predicting continuous or categorical outputs through a non-linear combination of known inputs. However, the standard specification of an ANN does not introduce any source of uncertainty, and consequently there is no measure of uncertainty associated with the resulting predictions. The goal of this project is to propose a suitable stochastic specification of an ANN and thus provide appropriate measures of uncertainty for the ANN's predictions.
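
The sketch below is only a toy illustration, in Python/NumPy, of the kind of uncertainty measure one might attach to ANN predictions: it trains a small ensemble of one-hidden-layer networks from different random initializations on simulated data and reports the spread of their predictions. The data, network size, and the ensemble device itself are illustrative assumptions, not the stochastic specification this project aims to develop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative only): y = sin(x) + noise.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

def fit_mlp(X, y, hidden=20, epochs=2000, lr=0.01, seed=0):
    """Train a tiny one-hidden-layer tanh network by full-batch gradient descent."""
    r = np.random.default_rng(seed)
    W1 = r.normal(scale=0.5, size=(X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = r.normal(scale=0.5, size=(hidden, 1));          b2 = np.zeros(1)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                      # hidden activations
        pred = (H @ W2 + b2).ravel()
        g_pred = 2 * (pred - y)[:, None] / len(y)     # d(MSE)/d(pred)
        gW2 = H.T @ g_pred
        gb2 = g_pred.sum(0)
        gH = (g_pred @ W2.T) * (1 - H ** 2)           # backprop through tanh
        gW1 = X.T @ gH
        gb1 = gH.sum(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda Xn: (np.tanh(Xn @ W1 + b1) @ W2 + b2).ravel()

# An ensemble trained from different initial weights: the spread of the
# member predictions serves as a crude measure of predictive uncertainty.
ensemble = [fit_mlp(X, y, seed=s) for s in range(10)]
preds = np.stack([f(X) for f in ensemble])            # shape (10, 200)
mean, std = preds.mean(axis=0), preds.std(axis=0)
print("prediction near x = 0:", mean[100], "+/-", 2 * std[100])
```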

Computational implementation of statistical models

Statistical models are used in virtually all fields of science. The increasing complexity of such models brings computational challenges for their efficient implementation. The goal of this project is to provide efficient computational implementations of contemporary statistical models in programming languages such as R, Python, and Julia.

The range of statistical models is large, including generalized linear models, generalized linear mixed models, multivariate covariance generalized linear models, and others. Models for spatial data sets based on the class of spatial generalized linear mixed models are of main interest. In general, the computational implementation of such models requires strategies for parallel computing and GPU programming.
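
As a flavour of what such an implementation involves, the sketch below fits a Poisson generalized linear model with a log link by iteratively reweighted least squares, written in plain Python/NumPy. The data are simulated, and a production implementation in R, Python, or Julia (possibly parallel or GPU-accelerated, as the project intends) would go well beyond this.

```python
import numpy as np

def poisson_irls(X, y, tol=1e-8, max_iter=50):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                      # linear predictor
        mu = np.exp(eta)                    # mean under the log link
        W = mu                              # working weights: Var(y) = mu for Poisson
        z = eta + (y - mu) / mu             # working response
        XtW = X.T * W                       # X' diag(W)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.8])))
print(poisson_irls(X, y))                   # estimates should be close to [0.5, 0.8]
```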

Bayesian Networks for count data

Bayesian networks are a special class of statistical models whose goal is to describe the structure of conditional dependence between a set of random variables. In general, the process of fitting a Bayesian network is done in two steps: the first step learns the conditional structure, while the second quantifies the relations between the response variables. Bayesian networks are usually designed for continuous or binary response variables. The goal of this project is to propose a generalization of Bayesian networks for count response variables.
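
To make the two steps concrete, the sketch below scores candidate structures for a pair of count variables by summing node-wise BIC-type scores, where each node is modelled by a Poisson log-linear regression on its parents. The data format, variable names, and the scoring rule are illustrative assumptions, not the generalization this project intends to develop.

```python
import numpy as np
from scipy.optimize import minimize

def node_score(child, parents, data):
    """BIC-type score of one node given its parents, using a Poisson log-linear model.
    `data` is assumed to be a dict mapping variable names to 1-D count arrays."""
    y = data[child].astype(float)
    X = np.column_stack([np.ones_like(y)] + [data[p].astype(float) for p in parents])
    def nll(beta):                                   # negative log-likelihood, constants dropped
        eta = np.clip(X @ beta, -30, 30)             # clip to avoid overflow in exp
        return np.sum(np.exp(eta) - y * eta)
    res = minimize(nll, np.zeros(X.shape[1]), method="BFGS")
    return -res.fun - 0.5 * X.shape[1] * np.log(len(y))   # penalize model size

def dag_score(dag, data):
    """Score a DAG given as {child: [parent, ...]} by summing its node scores."""
    return sum(node_score(child, parents, data) for child, parents in dag.items())

# Step 1 (structure): compare two candidate structures on simulated counts where
# X influences Y; step 2 (quantification) is the fitted regression itself.
rng = np.random.default_rng(2)
x = rng.poisson(2.0, size=1000)
y = rng.poisson(np.exp(0.2 + 0.4 * x))
data = {"X": x, "Y": y}
print("score of X -> Y:", dag_score({"X": [], "Y": ["X"]}, data))
print("score of Y -> X:", dag_score({"Y": [], "X": ["Y"]}, data))
```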

Impact of data cleaning in statistical data analysis

Techniques for data cleaning have received a lot of attention in the database literature. In general, the process of data cleaning has two steps: first, we have to identify the corrupted data and, second, we have to fix it in some sense. This project aims to identify and quantify the impact of data cleaning techniques on the fitting of standard statistical models such as linear and generalized linear models.
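
To make the question concrete, the toy sketch below corrupts a few responses of a simulated linear-regression dataset, applies a naive two-step cleaning rule (flag observations with large robust residuals, then discard them), and compares the fitted slopes before and after cleaning. The corruption mechanism and the cleaning rule are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Simulate corruption: a handful of responses receive large data-entry errors.
dirty = y.copy()
bad = rng.choice(n, size=20, replace=False)
dirty[bad] += rng.normal(scale=25.0, size=20)

X = np.column_stack([np.ones_like(x), x])
fit = lambda yy, mask=slice(None): np.linalg.lstsq(X[mask], yy[mask], rcond=None)[0]

# Step 1: identify suspicious records (residuals beyond 3 robust SDs).
resid = dirty - X @ fit(dirty)
mad = np.median(np.abs(resid - np.median(resid)))
keep = np.abs(resid) < 3 * 1.4826 * mad

# Step 2: "fix" them (here: simply discard) and compare the fitted slopes.
print("true slope        : 2.000")
print("slope, dirty data :", round(fit(dirty)[1], 3))
print("slope, cleaned    :", round(fit(dirty, keep)[1], 3))
```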

DoricStore

In the DoricStore project, our goal is to design an in-memory column-store for high-performance emerging hardware. Over the last decade, columnar database systems, or column-stores for short, have taken advantage of the decomposition storage model (DSM) to boost the performance of read-optimized databases. Many different systems can leverage column-stores, such as Business Information Services (BIS), Customer Relationship Management (CRM), and electronic library catalogs. Many of these systems now present real-time analysis requirements that, together with emerging hardware, offer an opportunity to rethink the design of column-stores.

In this project, we give particular attention to multi-core machines and high-performance Hybrid Memory Cubes (HMC). We believe the HMC is particularly convenient for read-optimized databases, as it couples logic control chips directly to the memory stack; we therefore run query logic within these chips to avoid going to the CPU as much as possible, and otherwise seek efficient scheduling on multi-core machines. HMC can be built over DRAM or NAND flash, but this flexibility may present different challenges that we are working on as part of our research agenda. In particular, we are investigating what happens to the current state of column-stores when running atop multi-core machines and HMC, in order to propose new algorithms and data structures for topics such as scheduling, compression, vectorization, and late materialization.
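
For intuition only, the toy sketch below mimics the decomposition storage model in Python/NumPy: each attribute is kept in its own array, a predicate is evaluated against a single column to produce row positions, and the remaining columns are materialized late, only for the qualifying rows. The table, columns, and predicate are made up and say nothing about DoricStore's actual design.

```python
import numpy as np

# A toy column-store table: each attribute of "orders" is a separate array (DSM).
n = 1_000_000
rng = np.random.default_rng(4)
orders = {
    "order_id": np.arange(n, dtype=np.int64),
    "price":    rng.uniform(1, 500, size=n).astype(np.float32),
    "quantity": rng.integers(1, 20, size=n, dtype=np.int32),
}

# Late materialization: the scan touches only the 'price' column and yields row
# positions; the other columns are fetched only for the qualifying rows.
positions = np.flatnonzero(orders["price"] > 450.0)
result = {col: orders[col][positions] for col in ("order_id", "quantity")}
print(len(positions), "qualifying rows; total quantity:", int(result["quantity"].sum()))
```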

Continuous data stream preprocessing

Novel methods for continuous preprocessing are necessary to support the use of incremental machine learning algorithms in Internet-of-Things applications as well as other online data sources. Data preprocessing is an essential part of any machine learning solution. Real-world problems require transformations to raw data before it can be used to build machine learning models. Even trivial preprocessing, such as normalizing a feature, can be complicated in a streaming setting. The main reason is that statistics about the data are unknown a priori, e.g., the minimum and maximum values a given feature can exhibit. There are different approaches to scaling and discretizing features; still, as we move into more complex preprocessing problems (e.g., evolving data streams), the territory is largely unknown or has not been explored in depth.
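
As a small illustration of why even feature scaling becomes non-trivial online, the sketch below standardizes each arriving instance using running means and variances maintained with Welford's algorithm, i.e., with the statistics seen so far rather than global ones. The class, the simulated stream, and all settings are hypothetical.

```python
import numpy as np

class OnlineStandardScaler:
    """Streaming z-score scaler: mean and variance are updated one instance at a
    time with Welford's algorithm, since global statistics are unknown a priori."""
    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.m2 = np.zeros(n_features)      # running sum of squared deviations

    def partial_fit(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def transform(self, x):
        std = np.sqrt(self.m2 / max(self.n - 1, 1))
        return (np.asarray(x, dtype=float) - self.mean) / np.where(std > 0, std, 1.0)

# Usage on a simulated stream: scale each instance with the statistics seen so far.
rng = np.random.default_rng(5)
scaler = OnlineStandardScaler(n_features=3)
for t in range(1000):
    x = rng.normal(loc=[0, 5, -2], scale=[1, 3, 0.5])
    x_scaled = scaler.partial_fit(x).transform(x)
```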
 
This project encompasses both practical and theoretical challenges for the development of techniques for continuous feature extraction and transformation from raw online data sources. These techniques include, but are not limited to: 
  1. Feature sketching
  2. Feature scaling and discretization
  3. Invalid entries handling
  4. Dimensionality reduction
  5. Feature selection

Dealing with Imbalanced Data

Machine learning (ML) is often referred to as the main actor of the fourth industrial revolution, since it changes the way we live, work, and interact with other people. Despite its relevance, ML still has many open problems, such as dealing with imbalanced data. When a distribution is said to be imbalanced, it means that among the observed instances there is at least one small subset whose patterns differ from the rest, e.g., rare disease diagnostics, where the set of healthy patients is much larger than the set of sick ones. As many real-world applications present imbalanced traits in their data, this topic has been gaining attention over time. The biggest problem with imbalanced data is that ML algorithms are built to minimize a global error, so the patterns of underrepresented instances are sometimes ignored or not even discovered.

Along with imbalanced data, anomaly detection is also a big challenge. Although the problem has been well studied over the last few decades, the large amount of real-time data generated by computer networks, smartphones, wearables, and a wide range of sensors leads us to reconsider it. Basically, anomalies are data points that are inconsistent with the distribution of the majority of data points. Since in most of the aforementioned scenarios the data is generated by one or more sensors, anomalous data can be caused by a failure in one of them, by transmission errors, or by odd system behaviors (OSB). As an example of OSB, one can think of a heartbeat monitor that triggers an alert when there is a cardiac arrest, or a network intrusion monitor that acts in real time to prevent some kind of attack on the network. Nowadays, anomaly detection methods must be fast and performed incrementally, in order to ensure that detection keeps up with the current patterns of the data. My goal is to approach the anomaly detection problem in the context of evolving data streams, aiming to establish a relationship between imbalanced data and anomalies.
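
As one very simple baseline in this direction, the sketch below processes a simulated sensor stream incrementally and flags points that deviate from a sliding window's median by more than a few robust standard deviations (MAD-based). The detector, thresholds, and injected anomalies are illustrative assumptions and fall far short of the evolving-data-stream setting this project targets.

```python
import numpy as np
from collections import deque

class SlidingWindowDetector:
    """Flag a point as anomalous when it deviates from the recent window by more
    than `k` robust standard deviations (median/MAD), updated incrementally."""
    def __init__(self, window=200, k=4.0):
        self.buf = deque(maxlen=window)
        self.k = k

    def score_and_update(self, x):
        is_anomaly = False
        if len(self.buf) >= 30:                       # warm-up before flagging
            arr = np.fromiter(self.buf, dtype=float)
            med = np.median(arr)
            mad = np.median(np.abs(arr - med)) or 1e-9
            is_anomaly = abs(x - med) > self.k * 1.4826 * mad
        self.buf.append(x)
        return is_anomaly

# Simulated sensor stream with occasional spikes (imbalanced: very few anomalies).
rng = np.random.default_rng(6)
det = SlidingWindowDetector()
hits = 0
for t in range(5000):
    x = rng.normal(70, 2)                 # e.g. a heart-rate-like signal
    if t > 0 and t % 997 == 0:
        x += 40                           # injected anomaly
    hits += det.score_and_update(x)
print("flagged", hits, "anomalies")
```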

High-performance computing systems and applications

High-performance computing (HPC) is present in several projects in genetic research and machine learning, one of them being the famous Human Genome Project, which deals with roughly 30,000 genes and 3 billion chemical base pairs. HPC is therefore vital for processing big data applications in several fields. Meanwhile, experts in genetics or engineering often struggle to extract the best performance from multi-processing systems. Our goal is to identify relevant applications that could make use of HPC platforms and migrate their algorithms to extract the last drop of performance from such computational systems. To that end, we are focusing on genetics, together with our partners, to increase the computational power available to other researchers.