Data Profiling
Data Granularity
Despite the popularity of Python and sklearn for data science, the latter doesn't deal with non-numerical data. So, the first task in the preparation process has to be Variable Encoding.
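As a minimal sketch of such an encoding (the column name and categories here are invented for illustration), pandas' get_dummies turns a nominal variable into numeric one-hot columns that sklearn estimators can consume:

```python
from pandas import DataFrame, get_dummies

# hypothetical nominal variable with three categories
df = DataFrame({"river": ["small", "medium", "small", "large"]})

# one-hot encode the nominal variable, creating one binary column per category
encoded = get_dummies(df, columns=["river"])
print(sorted(encoded.columns.to_list()))
```

Note that one-hot encoding treats every category as equally distant from the others, which is exactly what we want for nominal variables with no order among their values.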
Granularity exploration is very helpful in this task, in particular for symbolic variables without any order among their values - the nominal variables. In this case, we can try to create a taxonomy for each variable, and then analyse the frequencies of the values at each level of granularity.
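As a small illustration of the idea (the taxonomy and variable names below are hypothetical, not from any of the datasets used in this chapter), a dictionary can map the fine-grained values of a nominal variable to a coarser level, after which the frequencies at both levels can be compared:

```python
from pandas import DataFrame

# hypothetical two-level taxonomy: species (fine) -> group (coarse)
taxonomy: dict[str, str] = {
    "trout": "fish",
    "salmon": "fish",
    "heron": "bird",
    "kingfisher": "bird",
}

data = DataFrame({"species": ["trout", "heron", "salmon", "trout", "kingfisher"]})
# derive the coarser granularity level from the taxonomy
data["group"] = data["species"].map(taxonomy)

# frequencies at each granularity level
print(data["species"].value_counts())
print(data["group"].value_counts())
```

The coarser level aggregates the counts of its children, so rare fine-grained values may only become visible as meaningful groups at a higher level of the taxonomy.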
Dates
One of the easiest types of variable to deal with, in terms of granularity, are dates. Years are composed of quarters, quarters of months, and months of days.
So, in the presence of a date variable, we just need to derive new variables from it and display the histograms for each one. The function derive_date_variables, available in dslabs_functions, does it.
from pandas import DataFrame


def derive_date_variables(df: DataFrame, date_vars: list[str]) -> DataFrame:
    for date in date_vars:
        df[date + "_year"] = df[date].dt.year
        df[date + "_quarter"] = df[date].dt.quarter
        df[date + "_month"] = df[date].dt.month
        df[date + "_day"] = df[date].dt.day
    return df
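As a quick sanity check (a self-contained sketch with a toy DataFrame; the column name "sampled" and the dates are invented, and the function body replicates derive_date_variables so the example runs without dslabs_functions):

```python
from pandas import DataFrame, to_datetime


def derive_date_variables(df: DataFrame, date_vars: list[str]) -> DataFrame:
    # same logic as in dslabs_functions: derive year/quarter/month/day columns
    for date in date_vars:
        df[date + "_year"] = df[date].dt.year
        df[date + "_quarter"] = df[date].dt.quarter
        df[date + "_month"] = df[date].dt.month
        df[date + "_day"] = df[date].dt.day
    return df


# toy data: the .dt accessor requires a true datetime column, hence to_datetime
toy = DataFrame({"sampled": to_datetime(["2023-01-15", "2023-07-01"])})
toy = derive_date_variables(toy, ["sampled"])
print(toy[["sampled_year", "sampled_quarter", "sampled_month", "sampled_day"]])
```

Note that the function mutates the DataFrame in place (and also returns it), so the original variable and the returned one refer to the same object.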
From the new variables, it is then possible to analyse the frequencies of each temporal granularity for a particular date variable.
from numpy import ndarray
from pandas import Series, read_csv
from matplotlib.figure import Figure
from matplotlib.pyplot import subplots, savefig, show
from dslabs_functions import get_variable_types, plot_bar_chart, HEIGHT


def analyse_date_granularity(data: DataFrame, var: str, levels: list[str]) -> ndarray:
    cols: int = len(levels)
    fig: Figure
    axs: ndarray
    fig, axs = subplots(1, cols, figsize=(cols * HEIGHT, HEIGHT), squeeze=False)
    fig.suptitle(f"Granularity study for {var}")
    for i in range(cols):
        counts: Series[int] = data[var + "_" + levels[i]].value_counts()
        plot_bar_chart(
            counts.index.to_list(),
            counts.to_list(),
            ax=axs[0, i],
            title=levels[i],
            xlabel=levels[i],
            ylabel="nr records",
            percentage=False,
        )
    return axs
file_tag = "algae"
data: DataFrame = read_csv(
    "data/algae.csv", index_col=None, parse_dates=True, dayfirst=True
)
variables_types: dict[str, list] = get_variable_types(data)
data_ext: DataFrame = derive_date_variables(data, variables_types["date"])

for v_date in variables_types["date"]:
    analyse_date_granularity(data, v_date, ["year", "quarter", "month", "day"])
    savefig(f"images/{file_tag}_granularity_{v_date}.png")
    show()
Locations
Spatial data is not so different! For administrative data, like streets, cities and countries, the solution is similar: continents are composed of countries, countries of states, states of counties, counties of cities, and cities of streets, etc.
When considering latitude and longitude, we can apply a procedure similar to the one applied to dates, by decomposing each coordinate into hemisphere, degrees, minutes and seconds.
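A minimal sketch of such a decomposition for latitude (the helper name and the example value are ours, not from the chapter's datasets):

```python
def decompose_latitude(lat: float) -> tuple[str, int, int, float]:
    """Decompose a decimal latitude into hemisphere, degrees, minutes, seconds."""
    hemisphere = "N" if lat >= 0 else "S"
    value = abs(lat)
    degrees = int(value)                                   # whole degrees
    minutes = int((value - degrees) * 60)                  # whole minutes
    seconds = round((value - degrees - minutes / 60) * 3600, 2)  # remaining seconds
    return hemisphere, degrees, minutes, seconds


print(decompose_latitude(38.7369))
print(decompose_latitude(-10.5))
```

The same decomposition applies to longitude, with E/W instead of N/S as the hemisphere labels; each component then becomes a variable at its own level of granularity.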
However, in the first situation, most of the time we need to collect those relations from external sources, or to consider different existing variables as several levels of granularity for a given property.
Consider, for example, the data in the Gender Development Index dataset, which contains the following location variables: hemisphere, continent and country. In this case, we can use the analyse_property_granularity function to analyse the granularity of the location property.
def analyse_property_granularity(
    data: DataFrame, property: str, vars: list[str]
) -> ndarray:
    cols: int = len(vars)
    fig: Figure
    axs: ndarray
    fig, axs = subplots(1, cols, figsize=(cols * HEIGHT, HEIGHT), squeeze=False)
    fig.suptitle(f"Granularity study for {property}")
    for i in range(cols):
        counts: Series[int] = data[vars[i]].value_counts()
        plot_bar_chart(
            counts.index.to_list(),
            counts.to_list(),
            ax=axs[0, i],
            title=vars[i],
            xlabel=vars[i],
            ylabel="nr records",
            percentage=False,
        )
    return axs
file_tag: str = "gdindex"
data: DataFrame = read_csv("data/gdindex.csv", index_col="ISO3")

analyse_property_granularity(data, "location", ["Hemisphere", "Continent", "Country"])
savefig(f"images/{file_tag}_granularity_location.png")
show()
From the several histograms, we are able to understand the number of different values each variable encompasses, and their corresponding frequencies.
While the temporal variables (except year) present an almost uniform distribution, the location variables show some differences, revealing that many more countries were considered in the northern hemisphere than in the southern one. Additionally, it is clear that all countries show the same frequency, with only one record per country, and the number of different countries is considerably high.