In this chapter, we cover the essentials for dealing with data.
First, we succinctly describe the main data structures for handling data in Python and pandas, and then we present some basic methods for plotting the most common charts to explore data. Usually, the pandas package is imported as pd.
The pandas read_csv function is one of the most useful ways to import data from external files, creating a DataFrame object to store the data.
Besides requiring the name of the data file, it can be parametrized to deal with different formats and contents.
Among the several optional parameters, the following ones are often required:

sep | the character used as column delimiter in the data file |
---|---|
index_col | a column name (or its index) that corresponds to a primary key for the data |
decimal | the character used as decimal point in the data file |
thousands | the character used as thousands separator in the data file |
For dealing with timestamped variables, the following ones are of particular importance:

parse_dates | if True, the index column will be parsed as a date; otherwise, the referenced columns will be parsed according to a set of specific rules |
---|---|
infer_datetime_format | if True and parse_dates is enabled, the engine will look for the best parser to read the dates |
See read_csv documentation for the complete list of parameters and details.
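As a sketch of how these parameters combine for other formats, the call below reads a hypothetical semicolon-delimited file with European number formatting; the file name and the assumption that it contains a date column are illustrative only.
import pandas as pd

# hypothetical file using European formatting: semicolons between columns,
# comma as decimal point and dot as thousands separator
eu_data = pd.read_csv('data/measurements_eu.csv',
                      sep=';',           # columns delimited by semicolons
                      decimal=',',       # 3,14 is read as 3.14
                      thousands='.',     # 1.000 is read as 1000
                      index_col='date',  # use the 'date' column as index
                      parse_dates=True)  # parse the index as dates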
Below, you can see the invocation of the read_csv method to load the data in the file algae.csv (available in __).
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
data = pd.read_csv('data/algae.csv', index_col='date', sep=',', decimal='.', parse_dates=True, infer_datetime_format=True)
data
date | pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | fluid_velocity | river_depth | season |
---|---|---|---|---|---|---|---|---|---|---|---|
2018-09-30 | 8.10 | 11.4 | 40.02 | 5.33 | 346.67 | 125.67 | 187.06 | 15.60 | medium | low | autumn |
2018-10-05 | 8.06 | 9.0 | 55.35 | 10.42 | 233.70 | 58.22 | 97.58 | 10.50 | medium | low | autumn |
2018-10-07 | 8.05 | 10.6 | 59.07 | 4.99 | 205.67 | 44.67 | 77.43 | 6.90 | high | low | autumn |
2018-10-09 | 7.55 | 11.5 | 4.70 | 1.32 | 14.75 | 4.25 | 98.25 | 1.10 | high | low | autumn |
2018-10-11 | 7.75 | 10.3 | 32.92 | 2.94 | 42.00 | 16.00 | 40.00 | 7.60 | high | low | autumn |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2019-09-09 | 8.30 | 10.0 | 3.86 | 0.87 | 32.00 | 6.00 | 16.00 | 2.86 | medium | high | summer |
2019-09-11 | 8.10 | 10.2 | 7.61 | 0.70 | 32.50 | 26.63 | 52.88 | 2.00 | high | high | summer |
2019-09-13 | 8.40 | 8.2 | 23.25 | 2.81 | 43.75 | 51.13 | 87.13 | 14.78 | low | high | summer |
2019-09-15 | 8.50 | 7.9 | 12.44 | 2.59 | 96.67 | 19.11 | 61.44 | 6.17 | medium | high | summer |
2019-09-17 | 8.50 | 6.7 | 82.85 | 2.80 | 27.07 | 64.00 | 140.52 | 18.30 | medium | high | summer |
200 rows × 11 columns
As you can see, the dataset has 200 records (rows), described by 11 variables (columns), and each record is indexed by its date.
Before introducing dataframes, we need to understand a basic pandas concept: the Series.
A Series is the pandas representation of a unidimensional array, aimed at storing a column from a data table.
Indeed, besides an array storing the data elements, a Series keeps another array with labels to refer to each element, called the index.
We can see a Series as a Python dictionary (dict) object, as illustrated in the figure below.
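As a minimal sketch of this analogy, the snippet below builds a Series directly from a dict: the keys become the index labels and the values become the data (the values shown are taken from the pH column above).
import pandas as pd

# a Series built from a dict: keys become the index, values become the data
ph_readings = {'2018-09-30': 8.10, '2018-10-05': 8.06, '2018-10-07': 8.05}
s = pd.Series(ph_readings)
print(s['2018-10-05'])  # access by label, just like a dict lookup -> 8.06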
Suppose we have a series called col, read from the dataset above, which corresponds to the pH values recorded in it. Let's look at the basic methods to deal with this structure.
col = data['pH']
col
date
2018-09-30    8.10
2018-10-05    8.06
2018-10-07    8.05
2018-10-09    7.55
2018-10-11    7.75
              ...
2019-09-09    8.30
2019-09-11    8.10
2019-09-13    8.40
2019-09-15    8.50
2019-09-17    8.50
Name: pH, Length: 200, dtype: float64
As you can see, the series contains 200 elements (Length), with each one corresponding to a real number (float64
).
Additionally, each element is indexed by its date.
In order to get the number of elements, we can use the len function:
len(col)
200
As in plain Python, we can select a single element through a numeric index, but we can also use one of the values in the data index, a date in our example. Besides that, we can select a subset of the values in a similar way. (Note the difference in behavior when selecting through an interval: with the usual Python slicing the end limit is not included, but when slicing based on the defined index it is included.)
print(col[-1])             # last element, by position
print(col['2019-09-17'])   # same element, by label
print(col[:4])             # first four elements (end excluded)
print(col[:'2018-10-09'])  # up to the given label (end included)
8.5
8.5
date
2018-09-30    8.10
2018-10-05    8.06
2018-10-07    8.05
2018-10-09    7.55
Name: pH, dtype: float64
date
2018-09-30    8.10
2018-10-05    8.06
2018-10-07    8.05
2018-10-09    7.55
Name: pH, dtype: float64
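The same selections can be written more explicitly with the Series loc and iloc properties, which make the distinction between label-based and position-based access clear; this is just an alternative sketch of the selections above.
print(col.iloc[-1])            # last element, by position
print(col.loc['2019-09-17'])   # same element, by label
print(col.iloc[:4])            # first four elements (end excluded)
print(col.loc[:'2018-10-09'])  # up to the given label (end included)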
The DataFrame is the main data structure to represent a multivariate dataset. It corresponds to a table, where columns represent attributes (also known as variables) and rows keep the different records (also known as instances).
The major advantage of pandas dataframes is their ability to access any element in the dataset, referenced either by row or column, through both its index number and its name.
Indeed, the DataFrame is even more powerful, since its indexes may be not just a single attribute, but a set of attributes organized hierarchically (as with the hierarchies of attributes in each dimension of a Data Warehouse). In this manner, DataFrames are able to represent pivot tables.
Like Series, DataFrames can be seen as an indexed table, but doubly indexed by two dictionaries: one indexing the records in the rows (as for Series) and another indexing the columns, as illustrated in the figure below.
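To make the double-dictionary analogy concrete, the sketch below builds a small DataFrame from a dict of columns, so that each value is reachable through a row label and a column label (the values are a subset of the dataset above).
import pandas as pd

# each dict key becomes a column; the index labels the rows
small = pd.DataFrame({'pH': [8.10, 8.06], 'Oxygen': [11.4, 9.0]},
                     index=['2018-09-30', '2018-10-05'])
print(small.loc['2018-10-05', 'Oxygen'])  # doubly-indexed access -> 9.0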
The dataframe head and tail methods provide an insight into the data itself, returning the first and last n records in the dataset, respectively (showing 5 whenever n is not given). Together, they give the means to check whether the loading process occurred without problems.
data.head()
date | pH | Oxygen | Chloride | Nitrates | Ammonium | Orthophosphate | Phosphate | Chlorophyll | fluid_velocity | river_depth | season |
---|---|---|---|---|---|---|---|---|---|---|---|
2018-09-30 | 8.10 | 11.4 | 40.02 | 5.33 | 346.67 | 125.67 | 187.06 | 15.6 | medium | low | autumn |
2018-10-05 | 8.06 | 9.0 | 55.35 | 10.42 | 233.70 | 58.22 | 97.58 | 10.5 | medium | low | autumn |
2018-10-07 | 8.05 | 10.6 | 59.07 | 4.99 | 205.67 | 44.67 | 77.43 | 6.9 | high | low | autumn |
2018-10-09 | 7.55 | 11.5 | 4.70 | 1.32 | 14.75 | 4.25 | 98.25 | 1.1 | high | low | autumn |
2018-10-11 | 7.75 | 10.3 | 32.92 | 2.94 | 42.00 | 16.00 | 40.00 | 7.6 | high | low | autumn |
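Since only head is shown above, a complementary call to tail would look like the sketch below, returning the last records of the same dataframe.
data.tail(3)  # the last 3 records of the dataset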
In order to see the number of variables, you can use the columns property. Note that ndim does not return the expected number, since it corresponds to the number of dimensions of the dataframe (always two: rows and columns), not to the number of variables.
data.columns
Index(['pH', 'Oxygen', 'Chloride', 'Nitrates', 'Ammonium', 'Orthophosphate', 'Phosphate', 'Chlorophyll', 'fluid_velocity', 'river_depth', 'season'], dtype='object')
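To see the difference in practice, the short sketch below prints ndim alongside the number of columns.
print(data.ndim)          # 2: a dataframe always has two dimensions (rows and columns)
print(len(data.columns))  # 11: the actual number of variables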
In order to inspect the number of records (rows) in your data, you can apply the len function to the dataframe. Again, note that size does not return the same value, as shown below.
len(data)
200
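For comparison, the size property returns the total number of cells (rows times columns), as this short sketch shows.
print(len(data))  # 200: number of records (rows)
print(data.size)  # 2200: total number of cells (200 rows x 11 columns)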
Another way to inspect the dataset size is through the shape property, which returns the number of records and the number of variables, in the first and second positions of the tuple, respectively.
data.shape
(200, 11)
Despite their complex implementation, dataframes allow direct access to any element in the table through a set of predefined properties and methods.
Like any other complex structure in Python, and Series in particular, a dataframe allows access to any column or set of columns through the [ ] operator.
data['pH']
date
2018-09-30    8.10
2018-10-05    8.06
2018-10-07    8.05
2018-10-09    7.55
2018-10-11    7.75
              ...
2019-09-09    8.30
2019-09-11    8.10
2019-09-13    8.40
2019-09-15    8.50
2019-09-17    8.50
Name: pH, Length: 200, dtype: float64
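To select a set of columns instead of a single one, a list of column names can be passed to the same operator, as in the sketch below.
# selecting a subset of columns returns a new dataframe with just those variables
data[['pH', 'Oxygen', 'season']]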
To access rows, however, we need to use the loc property when indexing through labels, or iloc when indexing through traditional numeric positions.
data.loc['2018-09-30']
pH                       8.1
Oxygen                  11.4
Chloride               40.02
Nitrates                5.33
Ammonium              346.67
Orthophosphate        125.67
Phosphate             187.06
Chlorophyll             15.6
fluid_velocity        medium
river_depth              low
season                autumn
Name: 2018-09-30 00:00:00, dtype: object
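A positional counterpart of the call above, using iloc with integer indexes, would look like this sketch.
data.iloc[0]    # first record, by position (same row as data.loc['2018-09-30'])
data.iloc[0:3]  # first three records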
The access to a single cell in the table can be made through the at property when using labels as indexes, and through the iat property when using numbers, as usual to index arrays in other languages.
data.at['2018-09-30','pH']
8.1
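A sketch of the positional counterpart, iat, mentioned above:
data.iat[0, 0]  # row 0, column 0: the pH value of the first record -> 8.1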
Another very useful selection property is values, which returns the data table as a numpy.ndarray (the basic structure used in the scikit-learn package).
data.values
array([[8.1, 11.4, 40.02, ..., 'medium', 'low', 'autumn'],
       [8.06, 9.0, 55.35, ..., 'medium', 'low', 'autumn'],
       [8.05, 10.6, 59.07, ..., 'high', 'low', 'autumn'],
       ...,
       [8.4, 8.2, 23.25, ..., 'low', 'high', 'summer'],
       [8.5, 7.9, 12.44, ..., 'medium', 'high', 'summer'],
       [8.5, 6.7, 82.85, ..., 'medium', 'high', 'summer']], dtype=object)
Besides the properties summarized above, the DataFrame class also encompasses a large number of methods that make it easier to deal with data tables.
Among more than one hundred methods, we want to highlight some that can be categorized as follows:
Selection of DataFrame Methods

DataFrame Modifiers | |
---|---|
append | creates a copy of the dataframe, appending the rows from another one |
drop | changes the dataframe, by removing a given column or row, or sets of them |
insert | changes the dataframe, by inserting a given column at a specific position |
pop | returns a column or row, and removes it from the dataframe |

Operators | |
---|---|
corr / cov | creates a new dataframe, whose values correspond to the correlation/covariance matrix of the original data |
mean / std / mode / min / max | returns the mean/standard deviation/mode/minimum/maximum value for the given column or row |
add / div / mul / sub | computes the addition/division/product/subtraction between two dataframes, creating a new one |
eq / ge / gt / le / lt / ne | computes a new dataframe with boolean values, reflecting whether the values in the original data satisfy the given condition |
nlargest / nsmallest | returns the n rows with the largest/smallest values for the given column (similar to top/bottom in OLAP queries) |
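As a brief, hedged illustration of a few of these methods on the algae dataset, the sketch below removes the symbolic variables, computes a correlation matrix, and retrieves the records with the highest pH; it is only one possible usage pattern.
# drop returns a new dataframe without the given columns (the original is kept intact)
numeric = data.drop(columns=['fluid_velocity', 'river_depth', 'season'])

# correlation matrix between the numeric variables
corr_matrix = numeric.corr()

# the 3 records with the largest pH values (similar to a top-n query)
top_ph = data.nlargest(3, 'pH')

# element-wise comparison, producing boolean values
high_oxygen = numeric['Oxygen'].gt(10)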