This task aims to develop an integrative data repository that consolidates the data sources available in the Plataforma de Gestão Inteligente da Cidade de Lisboa (PGIL) at CML that are relevant to city traffic analysis. All data available on the platform have already undergone a preparation and upload process, which eases the challenges of this task.
The target repository should be continuously updated as more recent data become available, and should be designed with particular care to facilitate the subsequent development of data mining algorithms.
In addition, particular attention will be paid to guaranteeing proper data cleaning, including the identification and possible treatment of errors, duplicates and missing values.

To this end, six major activities will be pursued: 

A1) Data access and extraction

This activity aims to guarantee access to city traffic data, as well as to other public data sources with potential relevance for understanding city traffic.

First, A1 will focus on traffic data either maintained or accessible by CML. In this context, two major car traffic data sources will be considered: 1) sensors placed on key roads in Lisbon that measure the frequency of cars passing per road lane, and 2) traffic data collected through the WAZE application, which measures the speed of car traffic on congested city roads. We will also consider other modalities of city mobility, including public transportation data (CARRIS), surveillance camera data (CCTV) and public bike routing data (GIRA).

Second, A1 will seek to secure adequate access to other potentially relevant data sources, including: 1) maps of pertinent buildings across the city (along with statistics whenever available), such as the locations of schools, commercial and office areas; 2) public events of interest, such as events held at large public halls and football games; 3) meteorological data collected at IST; 4) construction works and other activities with potential impact on car traffic; amongst other external sources.

A2) Data structure design and creation

This activity will guarantee the proper mapping of the raw data listed in A1 (accessed through different channels) into adequate spatiotemporal data structures.
Given the subsequent need to integrate the different data sources and to support advanced descriptive analytics, a multi-dimensional design will be used to model the target repository schema.
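
As an illustration only, a multi-dimensional (star) schema for traffic observations could look like the sketch below. The table names, column names and sample values are hypothetical and do not reflect the actual PGIL data model; SQLite is used merely because it ships with Python.

```python
import sqlite3

# Hypothetical star schema: one fact table referencing shared
# time and location dimensions. All names are illustrative assumptions.
SCHEMA = """
CREATE TABLE dim_time (
    time_id  INTEGER PRIMARY KEY,
    ts       TEXT NOT NULL UNIQUE,   -- ISO 8601 timestamp
    hour     INTEGER NOT NULL,
    weekday  INTEGER NOT NULL        -- 0 = Monday
);
CREATE TABLE dim_location (
    location_id INTEGER PRIMARY KEY,
    road_name   TEXT NOT NULL,
    latitude    REAL NOT NULL,
    longitude   REAL NOT NULL
);
CREATE TABLE fact_traffic (
    time_id       INTEGER NOT NULL REFERENCES dim_time(time_id),
    location_id   INTEGER NOT NULL REFERENCES dim_location(location_id),
    source        TEXT NOT NULL,     -- e.g. 'sensor' or 'waze'
    vehicle_count INTEGER,           -- NULL when the source reports no counts
    avg_speed_kmh REAL,              -- NULL when the source reports no speed
    PRIMARY KEY (time_id, location_id, source)
);
"""

def create_repository(path=":memory:"):
    """Create an empty repository with the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = create_repository()
# Made-up sample rows, for demonstration only.
conn.execute("INSERT INTO dim_time VALUES (1, '2019-01-07T08:00:00', 8, 0)")
conn.execute("INSERT INTO dim_location VALUES (1, 'Example Road', 38.72, -9.14)")
conn.execute("INSERT INTO fact_traffic VALUES (1, 1, 'sensor', 120, NULL)")
conn.execute("INSERT INTO fact_traffic VALUES (1, 1, 'waze', NULL, 17.5)")

# A descriptive query that slices the facts by the two dimensions.
rows = conn.execute("""
    SELECT t.hour, l.road_name, SUM(f.vehicle_count), AVG(f.avg_speed_kmh)
    FROM fact_traffic f
    JOIN dim_time t USING (time_id)
    JOIN dim_location l USING (location_id)
    GROUP BY t.hour, l.road_name
""").fetchall()
print(rows)
```

Keeping measures from different sources in one fact table keyed by the same dimensions is only one possible choice; separate fact tables per source that share the dimensions would serve equally well.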

A3) Data integration 

A multi-dimensional design will be used to integrate the different data sources into the target consolidated repository. To this end, the dimensions shared among the different data sources need to be properly identified and normalized to ensure dimensional conformance.
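
A minimal sketch of what dimensional conformance could mean in practice is given below: records from two sources are mapped onto the same time and location keys before being joined. The 15-minute grain, the UTC convention, the road-name normalization and the sample records are all assumptions made for illustration.

```python
from datetime import datetime, timezone

def to_time_key(ts: datetime, minutes: int = 15) -> str:
    """Floor a timezone-aware timestamp to a shared grain, expressed in UTC.

    Assumes `minutes` divides 60 evenly.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    floored = ts.replace(minute=ts.minute - ts.minute % minutes,
                         second=0, microsecond=0)
    return floored.strftime("%Y-%m-%dT%H:%M")

def normalize_road(name: str) -> str:
    """Collapse whitespace and case so spelling variants share one natural key."""
    return " ".join(name.split()).casefold()

class Dimension:
    """Assigns one surrogate key per normalized natural key."""

    def __init__(self):
        self._keys = {}

    def key_for(self, natural_key):
        return self._keys.setdefault(natural_key, len(self._keys) + 1)

time_dim, road_dim = Dimension(), Dimension()

def conform(ts_iso: str, road: str):
    """Return the (time_id, location_id) pair shared by all sources."""
    time_id = time_dim.key_for(to_time_key(datetime.fromisoformat(ts_iso)))
    road_id = road_dim.key_for(normalize_road(road))
    return time_id, road_id

# Two hypothetical records describing the same place and time slot,
# reported with different timezone offsets and spellings.
sensor_key = conform("2019-01-07T08:07:30+00:00", " Av. da Liberdade ")
waze_key = conform("2019-01-07T09:12:00+01:00", "AV.  DA LIBERDADE")
print(sensor_key, waze_key)
```

Real road matching would most likely rely on geographic identifiers rather than names; the string normalization here only stands in for that step.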

A4) Data cleaning 

This activity guarantees the absence of duplicates and well-defined errors in the consolidated data repository, and further provides mechanisms to detect outlier observations (possibly associated with less trivial data errors) and to categorize missing values whenever possible.
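
The three cleaning steps named above can be sketched as follows. The proposal does not prescribe specific methods, so the choices here are assumptions: duplicates are detected on a hypothetical (sensor, timestamp) key, outliers are flagged with a modified z-score based on the median absolute deviation, and missing values are categorized by the length of the gap they belong to.

```python
from statistics import median

def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (MAD-based) exceeds the threshold.

    Missing values (None) are never flagged.
    """
    present = [v for v in values if v is not None]
    if not present:
        return [False] * len(values)
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return [False] * len(values)
    return [v is not None and abs(0.6745 * (v - med) / mad) > threshold
            for v in values]

def categorize_missing(values, outage_run=3):
    """Label each missing value as part of an 'outage' (long gap) or 'sporadic'."""
    labels = [None] * len(values)
    i = 0
    while i < len(values):
        if values[i] is None:
            j = i
            while j < len(values) and values[j] is None:
                j += 1
            label = "outage" if j - i >= outage_run else "sporadic"
            for k in range(i, j):
                labels[k] = label
            i = j
        else:
            i += 1
    return labels

# Made-up readings, for demonstration only.
records = [
    {"sensor": "S1", "ts": "08:00", "count": 100},
    {"sensor": "S1", "ts": "08:00", "count": 100},   # exact duplicate
    {"sensor": "S1", "ts": "08:15", "count": 104},
]
unique = deduplicate(records, ("sensor", "ts"))
flags = flag_outliers([100, 104, 98, None, 101, 950])
missing = categorize_missing([100, None, 102, None, None, None, 99])
print(len(unique), flags, missing)
```

Flagged observations are only candidates for review: a very high count may reflect a genuine event rather than a sensor error, which is why the sketch flags rather than removes them.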

A5) Data loading

The routines necessary for the extraction, transformation and loading (ETL) of data from the original data sources into the target multi-dimensional database will be developed in this activity.

A6) Data updatability 

This activity provides an automatic mechanism for continuously updating the consolidated data repository whenever new data arrive from one or more of the data sources. To this end, the ETL process will be equipped with the capability to recognize updates in the data sources.
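
One common way to recognize updates is to keep a per-source "watermark" recording the most recent timestamp already loaded, and to load only newer records on each run. The sketch below illustrates that idea under stated assumptions: the table names, the record layout and the use of SQLite are hypothetical, and timestamps are assumed to be ISO 8601 strings in a single timezone so that string order matches chronological order.

```python
import sqlite3

def init(conn):
    """Create the hypothetical target table and the watermark table."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS observations (
            source TEXT NOT NULL,
            ts     TEXT NOT NULL,
            value  REAL,
            PRIMARY KEY (source, ts)
        );
        CREATE TABLE IF NOT EXISTS etl_watermark (
            source  TEXT PRIMARY KEY,
            last_ts TEXT NOT NULL
        );
    """)

def load_new(conn, source, records):
    """Load only records newer than the source's watermark.

    `records` is an iterable of (ts, value) pairs. Returns the number
    of rows loaded in this run.
    """
    row = conn.execute(
        "SELECT last_ts FROM etl_watermark WHERE source = ?", (source,)
    ).fetchone()
    watermark = row[0] if row else ""
    fresh = [(source, ts, value) for ts, value in records if ts > watermark]
    if not fresh:
        return 0
    with conn:  # one transaction: the data and the watermark move together
        conn.executemany(
            "INSERT OR REPLACE INTO observations VALUES (?, ?, ?)", fresh)
        conn.execute(
            "INSERT OR REPLACE INTO etl_watermark VALUES (?, ?)",
            (source, max(ts for _, ts, _ in fresh)))
    return len(fresh)

conn = sqlite3.connect(":memory:")
init(conn)
# Two consecutive extractions with overlapping content (made-up values).
first = load_new(conn, "sensor", [("2019-01-07T08:00", 120.0),
                                  ("2019-01-07T08:15", 131.0)])
second = load_new(conn, "sensor", [("2019-01-07T08:15", 131.0),
                                   ("2019-01-07T08:30", 140.0)])
total = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
print(first, second, total)
```

A timestamp watermark does not detect late-arriving or retroactively corrected records; if the sources exhibit those, a complementary check (for example, comparing record hashes over a recent window) would be needed.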

The integrative data repository resulting from these six activities will provide the data needed to accomplish the subsequent tasks 4-8. INESC-ID will be responsible for this task, aided by LNEC and CML. BI-1, BI-2 and BM-2 grant holders will also contribute to this task.