I/O Tools - Pandas

Data Selection

After creating a DataFrame object, it is now possible to access its components, like specific columns, rows, cells or even chunks of data.

In [ ]:

from pandas import read_csv, DataFrame

data: DataFrame = read_csv(
    "data/stroke.csv", index_col="id", sep=",", decimal=".", date_format="DD/MM/YYYY"
)
data

Out[ ]:

	gender	age	hypertension	heart_disease	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status	stroke
id
9046	Male	67.0	No	Yes	Yes	Private	Urban	228.69	36.6	formerly smoked	Yes
51676	Female	61.0	No	No	Yes	Self-employed	Rural	202.21	NaN	never smoked	Yes
31112	Male	80.0	No	Yes	Yes	Private	Rural	105.92	32.5	never smoked	Yes
60182	Female	49.0	No	No	Yes	Private	Urban	171.23	34.4	smokes	Yes
1665	Female	79.0	Yes	No	Yes	Self-employed	Rural	174.12	24.0	never smoked	Yes
...	...	...	...	...	...	...	...	...	...	...	...
18234	Female	80.0	Yes	No	Yes	Private	Urban	83.75	NaN	never smoked	No
44873	Female	81.0	No	No	Yes	Self-employed	Urban	125.20	40.0	never smoked	No
19723	Female	35.0	No	No	Yes	Self-employed	Rural	82.99	30.6	never smoked	No
37544	Male	51.0	No	No	Yes	Private	Rural	166.29	25.6	formerly smoked	No
44679	Female	44.0	No	No	Yes	Govt_job	Urban	85.28	26.2	NaN	No

5110 rows × 11 columns

We can use several methods to get its commponents. head and tail are useful for getting an insight over the read data, presenting the first or last n rows in the data. By default n assumes 5 as its value.

In [ ]:

data.head(n=3)

Out[ ]:

	gender	age	hypertension	heart_disease	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status	stroke
id
9046	Male	67.0	No	Yes	Yes	Private	Urban	228.69	36.6	formerly smoked	Yes
51676	Female	61.0	No	No	Yes	Self-employed	Rural	202.21	NaN	never smoked	Yes
31112	Male	80.0	No	Yes	Yes	Private	Rural	105.92	32.5	never smoked	Yes

Data Structure

Before seeing how to select those components, it is important to deal with the data structure, which can be accomplished through the use of some selectors such as columns to get the columns' names (note the absence of the index variable - date in this case),

In [ ]:

data.columns

Out[ ]:

Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

index to get the data used as index,

In [ ]:

data.index

Out[ ]:

Index([ 9046, 51676, 31112, 60182,  1665, 56669, 53882, 10434, 27419, 60491,
       ...
       68398, 36901, 45010, 22127, 14180, 18234, 44873, 19723, 37544, 44679],
      dtype='int64', name='id', length=5110)

or values to get the data recorded as a simple two-dimensional table, a numpy.ndarray. The same can be obtained using the to_numpy method.

In [ ]:

data.values

Out[ ]:

array([['Male', 67.0, 'No', ..., 36.6, 'formerly smoked', 'Yes'],
       ['Female', 61.0, 'No', ..., nan, 'never smoked', 'Yes'],
       ['Male', 80.0, 'No', ..., 32.5, 'never smoked', 'Yes'],
       ...,
       ['Female', 35.0, 'No', ..., 30.6, 'never smoked', 'No'],
       ['Male', 51.0, 'No', ..., 25.6, 'formerly smoked', 'No'],
       ['Female', 44.0, 'No', ..., 26.2, nan, 'No']], dtype=object)

Note that each selector returns a single component that cannot be used alone to acccess the data, but may be usefull to iterate over it.

Another important selectors are len and shape, that return the number of records in the data, and a pair revealing the data dimensionality - the pair (rows, columns).

In [ ]:

print("Nr of records = ", len(data))
print("Dimensionality = ", data.shape)

Nr of records =  5110
Dimensionality =  (5110, 11)

Data Access

It is now possible to assess the different elements in the data, which can be done through several different ways. The simplest one is to use square brackets ('[' and ']').

if applyed over the name of a column, it returns the values assumed by that variable for all the records;

In [ ]:

data["age"]

Out[ ]:

id
9046     67.0
51676    61.0
31112    80.0
60182    49.0
1665     79.0
         ... 
18234    80.0
44873    81.0
19723    35.0
37544    51.0
44679    44.0
Name: age, Length: 5110, dtype: float64

As you can see, it returns 5110 values (the total number of records) - a real number (float64) indexed by its id.

However, in order to access an entire row we have to use the loc selector, using an existing id as index.

In [ ]:

data.loc[9046]

Out[ ]:

gender                          Male
age                             67.0
hypertension                      No
heart_disease                    Yes
ever_married                     Yes
work_type                    Private
Residence_type                 Urban
avg_glucose_level             228.69
bmi                             36.6
smoking_status       formerly smoked
stroke                           Yes
Name: 9046, dtype: object

In the same manner, we can access the value recorded on that date for the pH variable.

In [ ]:

data.loc[9046, "age"]

Out[ ]:

67.0

Naturally we can access each component at a time, in a decomposition way, getting the same result. See the next instruction for example.

In [ ]:

data["age"].loc[9046]

Out[ ]:

67.0

or just by

In [ ]:

data["age"][9046]

Out[ ]:

67.0

Data Aggregations

Beside the properties to access different components, the DataFrame class also encompasses a large number of methods to make it easier to deal with data tables.

Among more then one hundred methods, we highlight some of the most used ones, namely the following aggregation functions.

Aggregation Functions
`mean/std/mode/min/max`	return the mean/standard deviation/mode/minimum/maximum value for the given column or row
`nlargest/nsmallest`	return the n rows with the largest/smallest value for the given column (similar to top/bottom in OLAP queries)

Of particular relevance is the sample function, wich returns a new DataFrame resulting from randomly select some of its records.

In [ ]:

data_subset: DataFrame = data.sample(frac=0.05, replace=False)
data_subset

Out[ ]:

	gender	age	hypertension	heart_disease	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status	stroke
id
42229	Female	68.0	No	No	Yes	Self-employed	Rural	93.61	24.9	never smoked	No
49130	Male	74.0	No	No	Yes	Private	Urban	98.55	25.6	NaN	Yes
1260	Male	59.0	No	No	Yes	Govt_job	Urban	101.24	26.5	never smoked	No
28559	Male	2.0	No	No	No	children	Urban	88.54	17.5	NaN	No
59915	Female	53.0	No	No	No	Private	Urban	129.43	29.6	never smoked	No
...	...	...	...	...	...	...	...	...	...	...	...
50053	Male	17.0	No	No	No	Private	Urban	62.37	41.3	never smoked	No
11544	Female	34.0	No	No	Yes	Private	Urban	71.37	32.9	never smoked	No
12376	Male	63.0	No	No	Yes	Govt_job	Urban	95.16	37.8	formerly smoked	No
16488	Female	57.0	Yes	No	Yes	Private	Urban	210.00	NaN	never smoked	No
65357	Female	5.0	No	No	No	children	Rural	84.59	17.7	NaN	No

256 rows × 11 columns

Don't forget the replace parameter in order to get a different object, without loosing the original one. Instead of using the frac parameter, which defines the percentage of rows to collect, we can use n to specify the absolute number of records.