Python basics for data science (cont.)

Basic charts with matplotlib.pyplot

matplotlib.pyplot is one of the best-known Python packages for plotting 2D charts. To draw them, the package is organised around the classes Figure, Subplot and Axes. It is usually imported as plt.

The Figure corresponds to the canvas on which the elements are drawn. It can be created with the figure function, and we can specify its number and size through the optional parameters num and figsize (width and height, in inches).

If these parameters are not given, default values are used: the figure receives the next available number, and its size is taken from the rcParams entry figure.figsize.
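As a minimal sketch of both cases, the snippet below creates one figure with explicit values and one with the defaults. The Agg backend is selected only so that the example runs without a display; the variable names are ours.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit it in a notebook
import matplotlib.pyplot as plt

# Explicit number and size (width, height), in inches.
fig = plt.figure(num=1, figsize=(8, 4))
print(fig.number, fig.get_size_inches())

# No arguments: next available number, size read from rcParams.
default_fig = plt.figure()
print(default_fig.number, plt.rcParams["figure.figsize"])
```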

The last figure created becomes the active one, and any pyplot command is applied to it, unless we call a method directly on a previously created figure.

The gcf function returns a reference to the currently active figure.
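The following sketch illustrates how the active figure changes; the figure names and the titles are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

first = plt.figure()
second = plt.figure()            # the newest figure becomes the active one
second_is_active = plt.gcf() is second

plt.title("on the active figure")      # pyplot call: applied to `second`
first.suptitle("on the first figure")  # method called on a specific figure

plt.figure(first.number)         # passing an existing number re-activates it
first_is_active = plt.gcf() is first
print(second_is_active, first_is_active)
```

Calling plt.figure with the number of an existing figure does not create a new one; it only makes that figure the active one again.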

Line Charts

Plotting data in the figure is done through the plot function, which receives the data to plot. After plotting the data, we just need to call the show function to display the figure.

In our case, we plot the pH data recorded over time in the algae dataset. As we can see, the figure shows the pH values, between 5 and 10, recorded from 2018-09-30 to 2019-09-17. By default, the data index (the date, in our example) provides the values on the abscissa axis, and the pH values are placed on the ordinate axis.
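A minimal sketch of such a plot follows. The readings are made up and only stand in for the pH column of the algae dataset. The dates are passed explicitly as the x values, which is what plot does implicitly with the index when it receives a pandas Series.

```python
import datetime as dt
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical daily readings, not taken from the dataset.
dates = [dt.date(2018, 9, 30) + dt.timedelta(days=i) for i in range(6)]
ph = [7.9, 8.1, 8.4, 8.0, 7.6, 8.2]

plt.figure(figsize=(6, 3))
(line,) = plt.plot(dates, ph)  # plot returns the list of lines it drew
plt.show()                     # opens a window with an interactive backend
```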

To change the range shown on each axis, we can set its limits with the xlim and ylim functions, giving them the lower and upper bounds of the interval. It is also possible to add a title to the plot and labels to the axes, through the title, xlabel and ylabel functions.
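The sketch below applies these functions to a small made-up series; the values and the label texts are illustrative only.

```python
import datetime as dt
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical daily readings, not taken from the dataset.
dates = [dt.date(2018, 9, 30) + dt.timedelta(days=i) for i in range(6)]
ph = [7.9, 8.1, 8.4, 8.0, 7.6, 8.2]

plt.figure(figsize=(6, 3))
plt.plot(dates, ph)
plt.xlim(dates[0], dates[-1])  # lower and upper bounds of the abscissa axis
plt.ylim(0, 14)                # full pH scale instead of the automatic range
plt.title("pH over time")
plt.xlabel("date")
plt.ylabel("pH")
ax = plt.gca()
```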

Naturally, we may want to plot more than one chart in a figure. To do that, we can split the figure with the subplots function.

This function receives the number of rows and columns into which the figure is split, plus the optional parameters sharex and sharey, which specify whether the subplots share the abscissa and ordinate axes, respectively.

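subplots returns the split figure and an array of Axes, one for each part of the figure. The array is two-dimensional when the grid has several rows and several columns, or whenever squeeze=False is passed; otherwise it is reduced to one dimension, or to a single Axes for a 1x1 grid. An Axes is the class that encompasses the majority of the elements in a figure, such as the title and the legend, but also the usual ones in charts, like the coordinate system, its labels, units, ticks, etc.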
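A short sketch of the call and of the shape of what it returns; the grid sizes are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# 2 rows x 3 columns, all subplots sharing both axes.
fig, axs = plt.subplots(2, 3, figsize=(9, 4), sharex=True, sharey=True)
print(axs.shape)

# A single row would be squeezed to one dimension; squeeze=False keeps
# the two-dimensional indexing axs[row, col] in every case.
fig_row, axs_row = plt.subplots(1, 3, squeeze=False)
print(axs_row.shape)
```

Passing squeeze=False is convenient when the number of rows and columns is computed at run time, since the same indexing then works for any grid.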

In this manner, to draw in a specific part of the figure, we call the counterparts of the previous functions on the corresponding Axes object: plot, set_title, set_xlabel, set_ylabel, set_xlim and set_ylim.
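As an illustration, the sketch below fills two subplots through their Axes objects. The two series are made up and only stand in for variables of the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical readings for seven consecutive days.
days = list(range(1, 8))
ph = [7.9, 8.1, 8.4, 8.0, 7.6, 8.2, 8.3]
oxygen = [9.8, 9.1, 8.7, 9.4, 10.2, 9.9, 9.5]

fig, axs = plt.subplots(1, 2, figsize=(8, 3), squeeze=False)

# Left part: the Axes methods mirror plt.title, plt.xlabel, plt.ylim, ...
axs[0, 0].plot(days, ph)
axs[0, 0].set_title("pH")
axs[0, 0].set_xlabel("day")
axs[0, 0].set_ylabel("pH")
axs[0, 0].set_ylim(0, 14)

# Right part, configured independently.
axs[0, 1].plot(days, oxygen)
axs[0, 1].set_title("Oxygen")
axs[0, 1].set_xlabel("day")
axs[0, 1].set_ylabel("mg/L")

fig.tight_layout()
```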

To make this configuration easier, let's define some auxiliary functions so that it is written just once.

The first one, choose_grid, determines the best number of columns to show a set of charts, as a function of the number of charts to show. The second configures the axes, defining their labels and scales. Finally, the third one deals with dates. Note the use of gca, which returns the current axes; its result is passed as a parameter to our functions.
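One possible shape for these three helpers is sketched below. Only the name choose_grid comes from the text; set_elements, set_date_locators, NR_COLUMNS, every parameter name and the grid rule itself (which returns rows as well as columns) are assumptions made for the illustration.

```python
import datetime as dt
import math
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

NR_COLUMNS = 3  # assumed maximum number of charts per row


def choose_grid(nr_charts, max_cols=NR_COLUMNS):
    """Return (rows, cols) for a grid able to hold nr_charts charts."""
    if nr_charts < 1:
        raise ValueError("at least one chart is needed")
    cols = min(nr_charts, max_cols)
    rows = math.ceil(nr_charts / cols)
    return rows, cols


def set_elements(ax, title="", xlabel="", ylabel="", percentage=False):
    """Set title and axis labels; fix the ordinate scale to [0, 1] for ratios."""
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    if percentage:
        ax.set_ylim(0.0, 1.0)
    return ax


def set_date_locators(ax, rotation=45):
    """Place date ticks automatically and format them as YYYY-MM-DD."""
    ax.xaxis.set_major_locator(mdates.AutoDateLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
    ax.tick_params(axis="x", labelrotation=rotation)
    return ax


# Usage on made-up data: gca gives the current axes, passed to the helpers.
dates = [dt.date(2018, 9, 30) + dt.timedelta(days=30 * i) for i in range(5)]
plt.figure(figsize=(6, 3))
plt.plot(dates, [0.2, 0.4, 0.35, 0.6, 0.5])
ax = plt.gca()
set_elements(ax, title="share over time", xlabel="date", ylabel="share", percentage=True)
set_date_locators(ax)
grid = choose_grid(7)  # seven charts, at most three per row
```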

With these functions, it is now simple to define functions to plot the usual charts in data science. The first of them plots a line chart.

The config file holds some configuration parameters, such as colors.
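A line chart function along these lines might look as follows. This is a sketch, not the original implementation: the name plot_line, its parameters and the LINE_COLOR constant (which stands in for a value read from the config file) are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

LINE_COLOR = "#0077b6"  # hypothetical value that would live in the config file


def plot_line(xvalues, yvalues, ax=None, title="", xlabel="", ylabel="",
              percentage=False):
    """Draw one series as a line on ax (the current axes by default)."""
    if ax is None:
        ax = plt.gca()
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    if percentage:
        ax.set_ylim(0.0, 1.0)  # fixed scale for ratios
    ax.plot(xvalues, yvalues, color=LINE_COLOR)
    return ax


# Usage with made-up values.
plt.figure(figsize=(6, 3))
ax = plot_line([1, 2, 3, 4], [0.2, 0.5, 0.4, 0.7], title="ratio per week",
               xlabel="week", ylabel="ratio", percentage=True)
```

Keeping the color in one place means every chart changes together when the configuration is edited.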

A similar approach is used to plot several series in a single chart. Our function multiple_line_chart exemplifies it. Note that the series must share the same index, and should have similar ranges for their values, since they are drawn against a single ordinate axis.

All the series in a dataframe satisfy the first constraint and, in our case study, Phosphate and Orthophosphate satisfy the second as well.
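The sketch below shows one way multiple_line_chart could be written, assuming that the series arrive as a dictionary from series name to values. The signature and the sample numbers are ours; the values are not taken from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def multiple_line_chart(xvalues, yvalues, ax=None, title="", xlabel="", ylabel=""):
    """Draw each entry of yvalues (name -> values) as a line over xvalues."""
    if ax is None:
        ax = plt.gca()
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    for name, values in yvalues.items():
        ax.plot(xvalues, values, label=name)  # all series share xvalues
    ax.legend()
    return ax


# Hypothetical readings with a common index and similar ranges.
days = [1, 2, 3, 4, 5]
series = {
    "Phosphate": [120.0, 135.5, 128.0, 140.2, 131.7],
    "Orthophosphate": [80.3, 92.1, 85.0, 97.4, 88.9],
}
plt.figure(figsize=(6, 3))
ax = multiple_line_chart(days, series, title="Phosphates", xlabel="day",
                         ylabel="concentration")
```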

Bar Charts

Bar charts are not so different from line charts. Indeed, the functions for plotting them are very similar to the previous ones.

In the next example, the function bar_chart is called to plot the frequency of each value for the 'season' variable in our dataset.
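A possible bar_chart function is sketched below. Its signature is an assumption, and the list of seasons is made up in place of the 'season' column of the dataset.

```python
from collections import Counter
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def bar_chart(xvalues, yvalues, ax=None, title="", xlabel="", ylabel=""):
    """Draw one bar per entry of xvalues, with the heights given by yvalues."""
    if ax is None:
        ax = plt.gca()
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.bar(xvalues, yvalues)
    return ax


# Hypothetical records; Counter gives the frequency of each value.
seasons = ["winter", "spring", "autumn", "spring", "summer", "winter", "spring"]
counts = Counter(seasons)

plt.figure(figsize=(5, 3))
ax = bar_chart(list(counts.keys()), list(counts.values()),
               title="season distribution", xlabel="season", ylabel="nr records")
```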

Similarly, the multiple_bar_chart function plots a grouped bar chart, with each series corresponding to an entry in the yvalues dictionary.

In our example, the frequencies of the fluid_velocity and river_depth values are plotted together, since both variables share the same range of values.
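A grouped bar chart can be obtained by shifting each series by a fraction of the group width. The sketch below is one way to do it; the signature, the category labels and the counts are assumptions, not values from the dataset.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def multiple_bar_chart(xvalues, yvalues, ax=None, title="", xlabel="", ylabel=""):
    """Draw one group of bars per entry of xvalues, one bar per series."""
    if not yvalues:
        raise ValueError("yvalues must contain at least one series")
    if ax is None:
        ax = plt.gca()
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    positions = np.arange(len(xvalues))
    width = 0.8 / len(yvalues)  # each group takes 80% of the slot
    for i, (name, values) in enumerate(yvalues.items()):
        offset = (i - (len(yvalues) - 1) / 2) * width  # centre the group
        ax.bar(positions + offset, values, width=width, label=name)
    ax.set_xticks(positions)
    ax.set_xticklabels(xvalues)
    ax.legend()
    return ax


# Hypothetical frequencies over a shared set of categories.
categories = ["low", "medium", "high"]
frequencies = {
    "fluid_velocity": [33, 83, 84],
    "river_depth": [71, 84, 45],
}
plt.figure(figsize=(6, 3))
ax = multiple_bar_chart(categories, frequencies, title="frequencies",
                        xlabel="value", ylabel="nr records")
```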