In this chapter, we cover the essentials for dealing with data.
First, we succinctly describe the main data structures for handling data in Python and pandas, and then we present some basic methods for plotting the most common charts to explore data. Usually, the pandas package is imported as pd.
The pandas read_csv function is one of the most useful ways to import data from external files, creating a DataFrame object to store the data.
Besides requiring the name of the data file, it can be parametrized to deal with different formats and contents.
Among the several optional parameters, the following ones are often required:

sep | the character used as column delimiter in the data file |
---|---|
index_col | a column name (or its index) that corresponds to a primary key for the data |
decimal | the character used as decimal point in the data file |
thousands | the character used as thousands separator in the data file |
For dealing with timestamped variables, the following ones are of particular importance:

parse_dates | if True, the index column will be parsed as a date; otherwise, the referenced columns will be parsed according to a set of specific rules |
---|---|
infer_datetime_format | if True and parse_dates is enabled, the engine will look for the best parser to read the dates |
See read_csv documentation for the complete list of parameters and details.
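As a sketch of how these parameters combine for other formats, the call below reads a hypothetical semicolon-delimited file with European number formatting; the file name and the assumption that it contains a date column are illustrative only.
import pandas as pd

# hypothetical file using European formatting: semicolons between columns,
# comma as decimal point and dot as thousands separator
eu_data = pd.read_csv('data/measurements_eu.csv',
                      sep=';',           # columns delimited by semicolons
                      decimal=',',       # 3,14 is read as 3.14
                      thousands='.',     # 1.000 is read as 1000
                      index_col='date',  # use the 'date' column as index
                      parse_dates=True)  # parse the index as dates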
Below, you can see the invocation of the read_csv method to load the data in the file algae.csv (available in __).
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
data = pd.read_csv('data/algae.csv', index_col='date', sep=',', decimal='.', parse_dates=True, infer_datetime_format=True)
data
date | pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | fluid_velocity | river_depth | season |
---|---|---|---|---|---|---|---|---|---|---|---|
2018-09-30 | 8.10 | 11.4 | 40.02 | 5.33 | 346.67 | 125.67 | 187.06 | 15.60 | medium | low | autumn |
2018-10-05 | 8.06 | 9.0 | 55.35 | 10.42 | 233.70 | 58.22 | 97.58 | 10.50 | medium | low | autumn |
2018-10-07 | 8.05 | 10.6 | 59.07 | 4.99 | 205.67 | 44.67 | 77.43 | 6.90 | high | low | autumn |
2018-10-09 | 7.55 | 11.5 | 4.70 | 1.32 | 14.75 | 4.25 | 98.25 | 1.10 | high | low | autumn |
2018-10-11 | 7.75 | 10.3 | 32.92 | 2.94 | 42.00 | 16.00 | 40.00 | 7.60 | high | low | autumn |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2019-09-09 | 8.30 | 10.0 | 3.86 | 0.87 | 32.00 | 6.00 | 16.00 | 2.86 | medium | high | summer |
2019-09-11 | 8.10 | 10.2 | 7.61 | 0.70 | 32.50 | 26.63 | 52.88 | 2.00 | high | high | summer |
2019-09-13 | 8.40 | 8.2 | 23.25 | 2.81 | 43.75 | 51.13 | 87.13 | 14.78 | low | high | summer |
2019-09-15 | 8.50 | 7.9 | 12.44 | 2.59 | 96.67 | 19.11 | 61.44 | 6.17 | medium | high | summer |
2019-09-17 | 8.50 | 6.7 | 82.85 | 2.80 | 27.07 | 64.00 | 140.52 | 18.30 | medium | high | summer |
200 rows × 11 columns
As you can see, the dataset has 200 records (rows), described by 11 variables (columns), and each record is indexed by its date.
Before introducing dataframes, we need to understand a basic pandas concept: the Series.
A Series is the pandas representation of a unidimensional array, aimed at storing a column from a data table.
Indeed, besides an array storing the data elements, a Series keeps another array with labels to refer to each element, called the index.
We can see a Series as a Python dictionary (dict) object, as illustrated in the figure below.
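As a minimal sketch of this analogy, the snippet below builds a Series directly from a dict: the keys become the index labels and the values become the data (the values shown are taken from the pH column above).
import pandas as pd

# a Series built from a dict: keys become the index, values become the data
ph_readings = {'2018-09-30': 8.10, '2018-10-05': 8.06, '2018-10-07': 8.05}
s = pd.Series(ph_readings)
print(s['2018-10-05'])  # access by label, just like a dict lookup -> 8.06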
Suppose we have a series called col, read from the dataset above, which corresponds to the pH values recorded in it. Let's look at the basic methods to deal with this structure.
col = data['pH']
col
date
2018-09-30    8.10
2018-10-05    8.06
2018-10-07    8.05
2018-10-09    7.55
2018-10-11    7.75
              ...
2019-09-09    8.30
2019-09-11    8.10
2019-09-13    8.40
2019-09-15    8.50
2019-09-17    8.50
Name: pH, Length: 200, dtype: float64
As you can see, the series contains 200 elements (Length), with each one corresponding to a real number (float64
).
Additionally, each element is indexed by its date.
In order to get the number of elements, we can use the len function:
len(col)
200
As in plain Python, we can select a single element through a numeric index, but we can also use one of the values in the data index, a date in our example. Besides that, we can select a subset of the values in a similar way. (Note the difference in behavior when selecting through an interval: with the usual Python slicing the end limit is not included, but when slicing based on the defined index it is included.)
print(col[-1])             # last element, by position
print(col['2019-09-17'])   # same element, by label
print(col[:4])             # first four elements (end excluded)
print(col[:'2018-10-09'])  # up to the given label (end included)
8.5
8.5
date
2018-09-30    8.10
2018-10-05    8.06
2018-10-07    8.05
2018-10-09    7.55
Name: pH, dtype: float64
date
2018-09-30    8.10
2018-10-05    8.06
2018-10-07    8.05
2018-10-09    7.55
Name: pH, dtype: float64
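The same selections can be written more explicitly with the Series loc and iloc properties, which make the distinction between label-based and position-based access clear; this is just an alternative sketch of the selections above.
print(col.iloc[-1])            # last element, by position
print(col.loc['2019-09-17'])   # same element, by label
print(col.iloc[:4])            # first four elements (end excluded)
print(col.loc[:'2018-10-09'])  # up to the given label (end included)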
The DataFrame is the main data structure to represent a multivariate dataset. It corresponds to a table, where columns represent attributes (also known as variables) and rows keep the different records (also known as instances).
The major advantage of pandas dataframes is their ability to access any element in the dataset, referenced either by row or column, through both its index number and its name.
Indeed, the DataFrame is even more powerful, since its indexes may be not just a single attribute, but a set of attributes organized hierarchically (as with the hierarchies of attributes in each dimension of a Data Warehouse). In this manner, DataFrames are able to represent pivot tables.
Like Series, DataFrames can be seen as an indexed table, but doubly indexed by two dictionaries: one indexing the records in the rows (as for Series) and another indexing the columns, as illustrated in the figure below.
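To make the double-dictionary analogy concrete, the sketch below builds a small DataFrame from a dict of columns, so that each value is reachable through a row label and a column label (the values are a subset of the dataset above).
import pandas as pd

# each dict key becomes a column; the index labels the rows
small = pd.DataFrame({'pH': [8.10, 8.06], 'Oxygen': [11.4, 9.0]},
                     index=['2018-09-30', '2018-10-05'])
print(small.loc['2018-10-05', 'Oxygen'])  # doubly-indexed access -> 9.0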
The dataframe head and tail methods provide an insight into the data itself, returning the first and last n records in the dataset, respectively (showing 5 whenever n is not given). Together, they give the means to check whether the loading process occurred without problems.
data.head()
date | pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | fluid_velocity | river_depth | season |
---|---|---|---|---|---|---|---|---|---|---|---|
2018-09-30 | 8.10 | 11.4 | 40.02 | 5.33 | 346.67 | 125.67 | 187.06 | 15.6 | medium | low | autumn |
2018-10-05 | 8.06 | 9.0 | 55.35 | 10.42 | 233.70 | 58.22 | 97.58 | 10.5 | medium | low | autumn |
2018-10-07 | 8.05 | 10.6 | 59.07 | 4.99 | 205.67 | 44.67 | 77.43 | 6.9 | high | low | autumn |
2018-10-09 | 7.55 | 11.5 | 4.70 | 1.32 | 14.75 | 4.25 | 98.25 | 1.1 | high | low | autumn |
2018-10-11 | 7.75 | 10.3 | 32.92 | 2.94 | 42.00 | 16.00 | 40.00 | 7.6 | high | low | autumn |
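Since only head is shown above, a complementary call to tail would look like the sketch below, returning the last records of the same dataframe.
data.tail(3)  # the last 3 records of the dataset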
In order to see the number of variables, you can use the columns property. Note that ndim does not return the expected number, since it corresponds to the number of dimensions of the dataframe (always two: rows and columns), not to the number of variables.
data.columns
Index(['pH', 'Oxygen', 'Chloride', 'Nitrates', 'Ammonium', 'Orthophosphate', 'Phosphate', 'Chlorophyll', 'fluid_velocity', 'river_depth', 'season'], dtype='object')
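To see the difference in practice, the short sketch below prints ndim alongside the number of columns.
print(data.ndim)          # 2: a dataframe always has two dimensions (rows and columns)
print(len(data.columns))  # 11: the actual number of variables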
In order to inspect the number of records (rows) in your data, you can apply the len function to the dataframe. Again, note that size does not return the same value, as shown below.
len(data)
200
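For comparison, the size property returns the total number of cells (rows times columns), as this short sketch shows.
print(len(data))  # 200: number of records (rows)
print(data.size)  # 2200: total number of cells (200 rows x 11 columns)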
Another way to inspect the dataset size is through the shape property, which returns the number of records and the number of variables, in the first and second positions of the tuple, respectively.
data.shape
(200, 11)
Despite their complex implementation, dataframes allow direct access to any element in the table through a set of predefined properties and methods.
Like any other complex structure in Python, and Series in particular, a dataframe allows access to any column or set of columns through the [ ] operator.
data['pH']
date
2018-09-30    8.10
2018-10-05    8.06
2018-10-07    8.05
2018-10-09    7.55
2018-10-11    7.75
              ...
2019-09-09    8.30
2019-09-11    8.10
2019-09-13    8.40
2019-09-15    8.50
2019-09-17    8.50
Name: pH, Length: 200, dtype: float64
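To select a set of columns instead of a single one, a list of column names can be passed to the same operator, as in the sketch below.
# selecting a subset of columns returns a new dataframe with just those variables
data[['pH', 'Oxygen', 'season']]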
To access rows, however, we need to use the loc property when indexing through labels, or iloc when indexing through traditional numeric positions.
data.loc['2018-09-30']
pH                       8.1
Oxygen                  11.4
Chloride               40.02
Nitrates                5.33
Ammonium              346.67
Orthophosphate        125.67
Phosphate             187.06
Chlorophyll             15.6
fluid_velocity        medium
river_depth              low
season                autumn
Name: 2018-09-30 00:00:00, dtype: object
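A positional counterpart of the call above, using iloc with integer indexes, would look like this sketch.
data.iloc[0]    # first record, by position (same row as data.loc['2018-09-30'])
data.iloc[0:3]  # first three records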
The access to a single cell in the table can be made through the at property when using labels as indexes, and through the iat property when using numbers, as usual to index arrays in other languages.
data.at['2018-09-30','pH']
8.1
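A sketch of the positional counterpart, iat, mentioned above:
data.iat[0, 0]  # row 0, column 0: the pH value of the first record -> 8.1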
Another very useful selection property is values, which returns the data table as a numpy.ndarray (the basic structure used in the scikit-learn package).
data.values
array([[8.1, 11.4, 40.02, ..., 'medium', 'low', 'autumn'],
       [8.06, 9.0, 55.35, ..., 'medium', 'low', 'autumn'],
       [8.05, 10.6, 59.07, ..., 'high', 'low', 'autumn'],
       ...,
       [8.4, 8.2, 23.25, ..., 'low', 'high', 'summer'],
       [8.5, 7.9, 12.44, ..., 'medium', 'high', 'summer'],
       [8.5, 6.7, 82.85, ..., 'medium', 'high', 'summer']], dtype=object)
Besides the properties summarized above, the DataFrame class also encompasses a large number of methods that make it easier to deal with data tables.
Among more than one hundred methods, we want to highlight some that can be categorized as follows:
Selection of DataFrame Methods

DataFrame Modifiers | |
---|---|
append | creates a copy of the dataframe, appending the rows from another one |
drop | changes the dataframe, by removing a given column or row, or sets of them |
insert | changes the dataframe, by inserting a given column at a specific position |
pop | returns a column or row, and removes it from the dataframe |

Operators | |
---|---|
corr / cov | creates a new dataframe, whose values correspond to the correlation/covariance matrix of the original data |
mean / std / mode / min / max | returns the mean/standard deviation/mode/minimum/maximum value for the given column or row |
add / div / mul / sub | computes the addition/division/product/subtraction between two dataframes, creating a new one |
eq / ge / gt / le / lt / ne | computes a new dataframe with boolean values, reflecting whether the values in the original data satisfy the given condition |
nlargest / nsmallest | returns the n rows with the largest/smallest values for the given column (similar to top/bottom in OLAP queries) |
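As a brief, hedged illustration of a few of these methods on the algae dataset, the sketch below removes the symbolic variables, computes a correlation matrix, and retrieves the records with the highest pH; it is only one possible usage pattern.
# drop returns a new dataframe without the given columns (the original is kept intact)
numeric = data.drop(columns=['fluid_velocity', 'river_depth', 'season'])

# correlation matrix between the numeric variables
corr_matrix = numeric.corr()

# the 3 records with the largest pH values (similar to a top-n query)
top_ph = data.nlargest(3, 'pH')

# element-wise comparison, producing boolean values
high_oxygen = numeric['Oxygen'].gt(10)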