I/O Tools - Pandas
Data Selection
After creating a DataFrame
object, it is now possible to access its components, like specific columns, rows, cells or even chunks of data.
from pandas import read_csv, DataFrame
data: DataFrame = read_csv(
"data/stroke.csv", index_col="id", sep=",", decimal=".", date_format="DD/MM/YYYY"
)
data
gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | |
---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||
9046 | Male | 67.0 | No | Yes | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | Yes |
51676 | Female | 61.0 | No | No | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | Yes |
31112 | Male | 80.0 | No | Yes | Yes | Private | Rural | 105.92 | 32.5 | never smoked | Yes |
60182 | Female | 49.0 | No | No | Yes | Private | Urban | 171.23 | 34.4 | smokes | Yes |
1665 | Female | 79.0 | Yes | No | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | Yes |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18234 | Female | 80.0 | Yes | No | Yes | Private | Urban | 83.75 | NaN | never smoked | No |
44873 | Female | 81.0 | No | No | Yes | Self-employed | Urban | 125.20 | 40.0 | never smoked | No |
19723 | Female | 35.0 | No | No | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | No |
37544 | Male | 51.0 | No | No | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | No |
44679 | Female | 44.0 | No | No | Yes | Govt_job | Urban | 85.28 | 26.2 | NaN | No |
5110 rows × 11 columns
We can use several methods to get its commponents. head
and tail
are useful for getting an insight over the read data, presenting the first or last n rows in the data. By default n assumes 5 as its value.
data.head(n=3)
gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | |
---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||
9046 | Male | 67.0 | No | Yes | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | Yes |
51676 | Female | 61.0 | No | No | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | Yes |
31112 | Male | 80.0 | No | Yes | Yes | Private | Rural | 105.92 | 32.5 | never smoked | Yes |
Data Structure
Before seeing how to select those components, it is important to deal with the data structure, which can be accomplished through the use of some selectors such ascolumns
to get the columns' names (note the absence of the index variable - date in this case),
data.columns
Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke'], dtype='object')
index
to get the data used as index,
data.index
Index([ 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434, 27419, 60491, ... 68398, 36901, 45010, 22127, 14180, 18234, 44873, 19723, 37544, 44679], dtype='int64', name='id', length=5110)
or values
to get the data recorded as a simple two-dimensional table, a numpy.ndarray
. The same can be obtained using the to_numpy
method.
data.values
array([['Male', 67.0, 'No', ..., 36.6, 'formerly smoked', 'Yes'], ['Female', 61.0, 'No', ..., nan, 'never smoked', 'Yes'], ['Male', 80.0, 'No', ..., 32.5, 'never smoked', 'Yes'], ..., ['Female', 35.0, 'No', ..., 30.6, 'never smoked', 'No'], ['Male', 51.0, 'No', ..., 25.6, 'formerly smoked', 'No'], ['Female', 44.0, 'No', ..., 26.2, nan, 'No']], dtype=object)
Note that each selector returns a single component that cannot be used alone to acccess the data, but may be usefull to iterate over it.
Another important selectors are len
and shape
, that return the number of records in the data, and a pair revealing the data dimensionality - the pair (rows, columns).
print("Nr of records = ", len(data))
print("Dimensionality = ", data.shape)
Nr of records = 5110 Dimensionality = (5110, 11)
Data Access
It is now possible to assess the different elements in the data, which can be done through several different ways. The simplest one is to use square brackets ('[' and ']').data["age"]
id 9046 67.0 51676 61.0 31112 80.0 60182 49.0 1665 79.0 ... 18234 80.0 44873 81.0 19723 35.0 37544 51.0 44679 44.0 Name: age, Length: 5110, dtype: float64
As you can see, it returns 5110 values (the total number of records) - a real number (float64
) indexed by its id.
However, in order to access an entire row we have to use the loc
selector, using an existing id as index.
data.loc[9046]
gender Male age 67.0 hypertension No heart_disease Yes ever_married Yes work_type Private Residence_type Urban avg_glucose_level 228.69 bmi 36.6 smoking_status formerly smoked stroke Yes Name: 9046, dtype: object
In the same manner, we can access the value recorded on that date for the pH variable.
data.loc[9046, "age"]
67.0
Naturally we can access each component at a time, in a decomposition way, getting the same result. See the next instruction for example.
data["age"].loc[9046]
67.0
or just by
data["age"][9046]
67.0
Data Aggregations
Beside the properties to access different components, the DataFrame class also encompasses a large number of methods to make it easier to deal with data tables.Among more then one hundred methods, we highlight some of the most used ones, namely the following aggregation functions.
Aggregation Functions | |
---|---|
mean/std/mode/min/max |
return the mean/standard deviation/mode/minimum/maximum value for the given column or row |
nlargest/nsmallest |
return the n rows with the largest/smallest value for the given column (similar to top/bottom in OLAP queries) |
Of particular relevance is the
sample
function, wich returns a new DataFrame
resulting from randomly select some of its records.
data_subset: DataFrame = data.sample(frac=0.05, replace=False)
data_subset
gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | |
---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||
42229 | Female | 68.0 | No | No | Yes | Self-employed | Rural | 93.61 | 24.9 | never smoked | No |
49130 | Male | 74.0 | No | No | Yes | Private | Urban | 98.55 | 25.6 | NaN | Yes |
1260 | Male | 59.0 | No | No | Yes | Govt_job | Urban | 101.24 | 26.5 | never smoked | No |
28559 | Male | 2.0 | No | No | No | children | Urban | 88.54 | 17.5 | NaN | No |
59915 | Female | 53.0 | No | No | No | Private | Urban | 129.43 | 29.6 | never smoked | No |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
50053 | Male | 17.0 | No | No | No | Private | Urban | 62.37 | 41.3 | never smoked | No |
11544 | Female | 34.0 | No | No | Yes | Private | Urban | 71.37 | 32.9 | never smoked | No |
12376 | Male | 63.0 | No | No | Yes | Govt_job | Urban | 95.16 | 37.8 | formerly smoked | No |
16488 | Female | 57.0 | Yes | No | Yes | Private | Urban | 210.00 | NaN | never smoked | No |
65357 | Female | 5.0 | No | No | No | children | Rural | 84.59 | 17.7 | NaN | No |
256 rows × 11 columns
Don't forget the replace
parameter in order to get a different object, without loosing the original one. Instead of using the frac
parameter, which defines the percentage of rows to collect, we can use n
to specify the absolute number of records.