Lab 1: Data Profiling (cont.)

Data Distribution

After loading the data and understanding the variables available to describe the records, we proceed with the distribution analysis of each variable on its own, without relating it to the others.

Depending on whether a variable is numeric or symbolic, we can apply different functions to explore its nature. Let's start with the numeric ones.

Numeric Variables

The simplest way to describe a numeric variable is through its five-number summary (minimum, first quartile, median, third quartile and maximum), which exposes its range, complemented by other estimators such as the mean, the mode, the standard deviation and other percentiles.

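In dataframes, we can access it through the describe method, which lists the count, mean, standard deviation, minimum, quartiles and maximum of each variable. Its include parameter defaults to None, which means that, for a dataframe with both numeric and non-numeric columns, only the numeric variables are shown unless we set it explicitly.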
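As a minimal sketch of that call, the block below builds a small synthetic dataframe, since the lab dataset is not reproduced here. The column names, the values and the season variable are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lab dataset (names and values are illustrative only).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "pH": rng.normal(7.0, 0.5, 200),
    "Ammonium": rng.exponential(2.0, 200),
    "season": rng.choice(["spring", "summer", "autumn", "winter"], 200),
})

# Default call: only the numeric columns are summarised.
summary = data.describe()

# include="all" also reports count/unique/top/freq for non-numeric columns.
summary_all = data.describe(include="all")

print(summary)
```

The rows labelled 25%, 50% and 75% are the quartiles, so together with min and max they form the five-number summary.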

Take a moment to analyse the count row, which shows the number of non-missing values for each variable.
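To make the meaning of the count row concrete, here is a small sketch on made-up values (not the lab data). It derives the number of missing values per variable from the count row and checks it against isna().

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe with a few missing values (illustrative only).
data = pd.DataFrame({
    "pH": [7.1, 6.8, np.nan, 7.4, 7.0],
    "Ammonium": [0.3, np.nan, np.nan, 1.2, 0.8],
})

# The count row holds the number of non-missing values per variable.
counts = data.describe().loc["count"]

# Missing values are whatever is left over from the total number of records.
missing = len(data) - counts

# The same figures, computed directly.
missing_direct = data.isna().sum()

print(missing)
```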

The five-number summary alone gives the essential information, but it is not easy to interpret. A better way to understand the impact of such values is through the analysis of boxplots for each variable. Again, the DataFrame object provides several methods to explore the data, in particular the boxplot method, which plots all numeric variables in the same chart.
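A minimal sketch of the single-chart boxplot follows. It uses synthetic data with deliberately different scales; the names and values are assumptions, not the lab dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the lab data; names and values are illustrative only.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "pH": rng.normal(7.0, 0.5, 200),
    "Ammonium": rng.exponential(50.0, 200),
})

fig, ax = plt.subplots(figsize=(6, 4))
data.boxplot(ax=ax)  # one box per numeric column, all on the same y axis
ax.set_title("Boxplots for all numeric variables")
```

In a notebook the figure renders automatically; in a script, finish with plt.show().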

Despite the ability to compare the ranges of all variables, it is difficult to analyse each one in particular, due to their different scales.

To address this difference, we can plot an individual boxplot for each variable using the boxplot method from matplotlib. To get the best plots, we can use our get_variable_types function to select the variables of each type, and then explore each one alone.
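The per-variable boxplots can be sketched as below. Because get_variable_types is not shown in this section, the sketch selects numeric columns with pandas' select_dtypes as a stand-in; that substitution and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the lab data; names and values are illustrative only.
rng = np.random.default_rng(2)
data = pd.DataFrame({
    "pH": rng.normal(7.0, 0.5, 200),
    "Ammonium": rng.exponential(50.0, 200),
    "season": rng.choice(["spring", "summer"], 200),
})

# Stand-in for get_variable_types: keep only the numeric columns.
numeric_vars = data.select_dtypes(include="number").columns.tolist()

# One subplot per variable, so each boxplot gets its own y scale.
fig, axes = plt.subplots(1, len(numeric_vars),
                         figsize=(4 * len(numeric_vars), 4), squeeze=False)
for ax, var in zip(axes.flat, numeric_vars):
    ax.boxplot(data[var].dropna().values)
    ax.set_title(f"Boxplot for {var}")
```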

From the boxplots, it is clear that our variables have different ranges and scales and that there are several outliers, but we learn little else about the variables' distributions. To see them, the best option is to plot a histogram for each numeric variable, using the hist method.
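A minimal sketch of one histogram per numeric variable, using matplotlib's hist on synthetic data. The column names, values and the choice of 20 bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the lab data; names and values are illustrative only.
rng = np.random.default_rng(3)
data = pd.DataFrame({
    "pH": rng.normal(7.0, 0.5, 200),
    "Ammonium": rng.exponential(50.0, 200),
})
numeric_vars = data.select_dtypes(include="number").columns.tolist()

fig, axes = plt.subplots(1, len(numeric_vars),
                         figsize=(4 * len(numeric_vars), 4), squeeze=False)
for ax, var in zip(axes.flat, numeric_vars):
    ax.hist(data[var].dropna().values, bins=20)  # 20 equal-width bins
    ax.set_title(f"Histogram for {var}")
    ax.set_xlabel(var)
    ax.set_ylabel("nr records")
```

The number of bins changes how the distribution looks, so it is worth trying a few values before drawing conclusions.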

Histograms do give us insight into the distribution of each variable, but recognizing the distribution that best fits the data may be hard. Seaborn provides the distplot function (deprecated since version 0.11 in favour of histplot and displot), which overlays a smoothed density estimate on the histogram.

Despite the simplicity of this approach, it does not tell us how well standard distributions fit the data. To find out, we can fit different known distributions to it, using the scipy.stats package (norm, expon, skewnorm, etc.). Let's look at the histogram for the pH and Ammonium variables, and possible distributions...
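One possible way to compare candidate distributions is sketched below, on synthetic values rather than the real pH column. Each candidate is fitted by maximum likelihood and then ranked by its Kolmogorov-Smirnov statistic. Using that statistic as the comparison criterion is an assumption of this sketch, not something the lab prescribes.

```python
import numpy as np
from scipy import stats

# Synthetic values standing in for one numeric variable (illustrative only).
rng = np.random.default_rng(0)
values = rng.normal(loc=7.0, scale=0.5, size=500)

candidates = {
    "norm": stats.norm,
    "expon": stats.expon,
    "skewnorm": stats.skewnorm,
}

fitted = {}   # distribution name -> fitted parameters
ks_stats = {} # distribution name -> KS statistic (smaller means closer fit)
for name, dist in candidates.items():
    params = dist.fit(values)  # maximum-likelihood estimates
    fitted[name] = params
    ks_stats[name] = stats.kstest(values, name, args=params).statistic

for name in sorted(ks_stats, key=ks_stats.get):
    print(f"{name}: KS statistic = {ks_stats[name]:.3f}")
```

Because the parameters are estimated from the same data, the KS p-values are not reliable here, so only the statistic is used as a rough ranking. To see each fit, dist.pdf(x, *fitted[name]) can be drawn over a density-normalised histogram.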

Symbolic Variables

The exploration of symbolic variables is similar, but we cannot reuse all the functions seen so far. In particular, boxplots are not applicable to non-numeric variables.

To explore the distribution of our 3 symbolic variables, we have to build their histograms manually, since the hist method doesn't work with non-numeric variables.

Such histograms can be produced with bar charts, by counting the frequency of each value of each variable.
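A minimal sketch of this counting approach follows, on a made-up dataframe. The variable names and values are assumptions, and select_dtypes stands in for the lab's own helper for picking symbolic variables.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up symbolic data (illustrative only).
data = pd.DataFrame({
    "season": ["spring", "summer", "summer", "winter", "summer", "spring"],
    "size": ["small", "large", "small", "small", "medium", "large"],
    "depth": [1.2, 3.4, 2.2, 0.9, 1.8, 2.7],
})

# Keep the non-numeric columns only.
symbolic_vars = data.select_dtypes(exclude="number").columns.tolist()

fig, axes = plt.subplots(1, len(symbolic_vars),
                         figsize=(4 * len(symbolic_vars), 4), squeeze=False)
counts_by_var = {}
for ax, var in zip(axes.flat, symbolic_vars):
    counts = data[var].value_counts()  # frequency of each distinct value
    counts_by_var[var] = counts
    ax.bar(counts.index.astype(str).tolist(), counts.values)
    ax.set_title(f"Histogram for {var}")
    ax.set_ylabel("nr records")
```

Note that value_counts ignores missing values by default; pass dropna=False if they should appear as a bar of their own.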