Lab 1: Data Profiling

Data Dimensionality

Let's reload the algae data being explored. Note that the parameters parse_dates and infer_datetime_format are needed only if the data has some datetime variables.

In the presence of missing values, it is important to define the symbol used to represent them in the datafile, using the na_values parameter.
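A minimal sketch of such a load is shown next. The inline CSV text, the column names and the "XXX" missing-value symbol are illustrative assumptions, not the actual algae file. infer_datetime_format is left out because recent pandas versions deprecate it.

```python
import io
import pandas as pd

# Hypothetical stand-in for the algae datafile; "XXX" marks a missing value.
csv_text = """date,season,pH,Oxygen
2018-09-30,winter,8.0,9.8
2018-10-05,spring,XXX,8.0
2018-10-07,XXX,8.1,11.4
"""

data = pd.read_csv(
    io.StringIO(csv_text),  # in practice, the path to the datafile goes here
    index_col="date",       # use the date column as the index
    na_values="XXX",        # symbol that represents missing values in the file
    parse_dates=True,       # parse the index as dates
)
print(data)
```

With na_values set, the "XXX" entries are read as NaN, so pH remains a numeric column instead of being loaded as text.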

Next, we need to understand the ratio between the number of records and the number of variables.

Note that the function save_fig saves the figure in a png file named records_variables in the images folder, which must exist in the working directory.
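As a rough illustration, the two counts and their ratio can be read from the dataframe's shape. The small dataframe below is a made-up example, and the chart and the save_fig call are not reproduced here.

```python
import pandas as pd

# Made-up data: 4 records and 2 variables.
data = pd.DataFrame({
    "pH": [8.0, 8.35, 8.1, 8.07],
    "Oxygen": [9.8, 8.0, 11.4, 4.8],
})

# shape is a (rows, columns) tuple, i.e. (records, variables).
n_records, n_variables = data.shape
counts = {"nr records": n_records, "nr variables": n_variables}
ratio = n_records / n_variables

print(counts)
print(f"records per variable: {ratio:.1f}")
```

The counts dictionary is the kind of input a bar chart function could take, with the labels on one axis and the values on the other.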

Variable Types

As you know, the different types of variables require different treatments. dtypes returns the types of all variables in the dataframe.
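For instance, on a small made-up dataframe (the column names are assumptions), dtypes can be inspected like this:

```python
import pandas as pd

# Made-up data with one text, one real-valued and one integer variable.
data = pd.DataFrame({
    "season": ["winter", "spring", "autumn"],
    "pH": [8.0, 8.35, 8.1],
    "count": [1, 2, 3],
})

# dtypes is a Series indexed by column name, holding each column's type.
print(data.dtypes)
```

The exact name printed for the text column depends on the pandas version, so it is worth checking the output rather than assuming it.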

If we need to apply any function that only deals with symbolic variables, we need to transform object variables into category ones (the name for symbolic in pandas).
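One possible way to do this conversion is sketched below. The data is made up, and the text column is created with the object type explicitly so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Made-up data; "season" is stored explicitly as an object column.
data = pd.DataFrame({
    "season": pd.Series(["winter", "spring", "winter"], dtype="object"),
    "pH": [8.0, 8.35, 8.1],
})

# Collect the names of the object columns and convert each to category.
cat_vars = data.select_dtypes(include="object").columns
data = data.astype({var: "category" for var in cat_vars})

print(data.dtypes)
```

Only the object columns are converted; numeric columns such as pH keep their original type.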

But we can go further and write a function to collect the names of the variables of each type. It receives the dataframe to analyse and returns a dictionary with a list of column names for each of the types numeric, binary, symbolic and date.

Showing and saving the chart is just a plus.
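A function of the kind described above could be sketched as follows. The name get_variable_types, the rule that a variable with exactly two distinct values counts as binary, and the sample data are all assumptions made for illustration. The charting step is left out.

```python
import pandas as pd


def get_variable_types(df: pd.DataFrame) -> dict:
    """Group column names into numeric, binary, date and symbolic."""
    variable_types = {"numeric": [], "binary": [], "date": [], "symbolic": []}
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_datetime64_any_dtype(series):
            variable_types["date"].append(col)
        elif series.nunique(dropna=True) == 2:
            # Assumption: exactly two distinct non-missing values means binary.
            variable_types["binary"].append(col)
        elif pd.api.types.is_numeric_dtype(series):
            variable_types["numeric"].append(col)
        else:
            # Everything else (text, category) is treated as symbolic.
            variable_types["symbolic"].append(col)
    return variable_types


# Made-up data covering the four types.
data = pd.DataFrame({
    "date": pd.to_datetime(["2018-09-30", "2018-10-05", "2018-10-07"]),
    "season": ["winter", "spring", "autumn"],
    "pH": [8.0, 8.35, 8.1],
    "treated": [0, 1, 0],
})

variable_types = get_variable_types(data)
print(variable_types)
```

Checking for dates before checking for binary avoids classifying a date column that happens to hold two distinct values as binary.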

Missing values

With the previous function, it is then easy to see how many missing values there are for each variable.

Note that in the 5th line above we only collect the variables with missing values.
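A minimal sketch of this count, on made-up data, might look like the following. It is an illustration of the idea rather than the original code, so its line numbering does not match the note above.

```python
import pandas as pd

# Made-up data: season has 1 missing value, pH has 2, Oxygen has none.
data = pd.DataFrame({
    "season": ["winter", None, "autumn", "spring"],
    "pH": [8.0, None, None, 8.1],
    "Oxygen": [9.8, 8.0, 11.4, 4.8],
})

mv = {}
for var in data.columns:
    nr = int(data[var].isna().sum())  # number of missing values in this variable
    if nr > 0:                        # keep only variables with missing values
        mv[var] = nr

print(mv)
```

The resulting dictionary maps each variable that has missing values to its count, which is the kind of input a bar chart function could take.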