I/O Tools - Pandas

Data Selection

After creating a DataFrame object, it is now possible to access its components, like specific columns, rows, cells or even chunks of data.

In [ ]:
from pandas import read_csv, DataFrame

data: DataFrame = read_csv(
    "data/stroke.csv", index_col="id", sep=",", decimal=".", date_format="DD/MM/YYYY"
)
data
Out[ ]:
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
9046 Male 67.0 No Yes Yes Private Urban 228.69 36.6 formerly smoked Yes
51676 Female 61.0 No No Yes Self-employed Rural 202.21 NaN never smoked Yes
31112 Male 80.0 No Yes Yes Private Rural 105.92 32.5 never smoked Yes
60182 Female 49.0 No No Yes Private Urban 171.23 34.4 smokes Yes
1665 Female 79.0 Yes No Yes Self-employed Rural 174.12 24.0 never smoked Yes
... ... ... ... ... ... ... ... ... ... ... ...
18234 Female 80.0 Yes No Yes Private Urban 83.75 NaN never smoked No
44873 Female 81.0 No No Yes Self-employed Urban 125.20 40.0 never smoked No
19723 Female 35.0 No No Yes Self-employed Rural 82.99 30.6 never smoked No
37544 Male 51.0 No No Yes Private Rural 166.29 25.6 formerly smoked No
44679 Female 44.0 No No Yes Govt_job Urban 85.28 26.2 NaN No

5110 rows × 11 columns

We can use several methods to get its commponents. head and tail are useful for getting an insight over the read data, presenting the first or last n rows in the data. By default n assumes 5 as its value.

In [ ]:
data.head(n=3)
Out[ ]:
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
9046 Male 67.0 No Yes Yes Private Urban 228.69 36.6 formerly smoked Yes
51676 Female 61.0 No No Yes Self-employed Rural 202.21 NaN never smoked Yes
31112 Male 80.0 No Yes Yes Private Rural 105.92 32.5 never smoked Yes

Data Structure

Before seeing how to select those components, it is important to deal with the data structure, which can be accomplished through the use of some selectors such as columns to get the columns' names (note the absence of the index variable - date in this case),
In [ ]:
data.columns
Out[ ]:
Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

index to get the data used as index,

In [ ]:
data.index
Out[ ]:
Index([ 9046, 51676, 31112, 60182,  1665, 56669, 53882, 10434, 27419, 60491,
       ...
       68398, 36901, 45010, 22127, 14180, 18234, 44873, 19723, 37544, 44679],
      dtype='int64', name='id', length=5110)

or values to get the data recorded as a simple two-dimensional table, a numpy.ndarray. The same can be obtained using the to_numpy method.

In [ ]:
data.values
Out[ ]:
array([['Male', 67.0, 'No', ..., 36.6, 'formerly smoked', 'Yes'],
       ['Female', 61.0, 'No', ..., nan, 'never smoked', 'Yes'],
       ['Male', 80.0, 'No', ..., 32.5, 'never smoked', 'Yes'],
       ...,
       ['Female', 35.0, 'No', ..., 30.6, 'never smoked', 'No'],
       ['Male', 51.0, 'No', ..., 25.6, 'formerly smoked', 'No'],
       ['Female', 44.0, 'No', ..., 26.2, nan, 'No']], dtype=object)

Note that each selector returns a single component that cannot be used alone to acccess the data, but may be usefull to iterate over it.

Another important selectors are len and shape, that return the number of records in the data, and a pair revealing the data dimensionality - the pair (rows, columns).

In [ ]:
print("Nr of records = ", len(data))
print("Dimensionality = ", data.shape)
Nr of records =  5110
Dimensionality =  (5110, 11)

Data Access

It is now possible to assess the different elements in the data, which can be done through several different ways. The simplest one is to use square brackets ('[' and ']').
  • if applyed over the name of a column, it returns the values assumed by that variable for all the records;
  • In [ ]:
    data["age"]
    
    Out[ ]:
    id
    9046     67.0
    51676    61.0
    31112    80.0
    60182    49.0
    1665     79.0
             ... 
    18234    80.0
    44873    81.0
    19723    35.0
    37544    51.0
    44679    44.0
    Name: age, Length: 5110, dtype: float64

    As you can see, it returns 5110 values (the total number of records) - a real number (float64) indexed by its id.

    However, in order to access an entire row we have to use the loc selector, using an existing id as index.

    In [ ]:
    data.loc[9046]
    
    Out[ ]:
    gender                          Male
    age                             67.0
    hypertension                      No
    heart_disease                    Yes
    ever_married                     Yes
    work_type                    Private
    Residence_type                 Urban
    avg_glucose_level             228.69
    bmi                             36.6
    smoking_status       formerly smoked
    stroke                           Yes
    Name: 9046, dtype: object

    In the same manner, we can access the value recorded on that date for the pH variable.

    In [ ]:
    data.loc[9046, "age"]
    
    Out[ ]:
    67.0

    Naturally we can access each component at a time, in a decomposition way, getting the same result. See the next instruction for example.

    In [ ]:
    data["age"].loc[9046]
    
    Out[ ]:
    67.0

    or just by

    In [ ]:
    data["age"][9046]
    
    Out[ ]:
    67.0

    Data Aggregations

    Beside the properties to access different components, the DataFrame class also encompasses a large number of methods to make it easier to deal with data tables.

    Among more then one hundred methods, we highlight some of the most used ones, namely the following aggregation functions.

    Aggregation Functions
    mean/std/mode/min/max return the mean/standard deviation/mode/minimum/maximum value for the given column or row
    nlargest/nsmallest return the n rows with the largest/smallest value for the given column (similar to top/bottom in OLAP queries)

    Of particular relevance is the sample function, wich returns a new DataFrame resulting from randomly select some of its records.
    In [ ]:
    data_subset: DataFrame = data.sample(frac=0.05, replace=False)
    data_subset
    
    Out[ ]:
    gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
    id
    42229 Female 68.0 No No Yes Self-employed Rural 93.61 24.9 never smoked No
    49130 Male 74.0 No No Yes Private Urban 98.55 25.6 NaN Yes
    1260 Male 59.0 No No Yes Govt_job Urban 101.24 26.5 never smoked No
    28559 Male 2.0 No No No children Urban 88.54 17.5 NaN No
    59915 Female 53.0 No No No Private Urban 129.43 29.6 never smoked No
    ... ... ... ... ... ... ... ... ... ... ... ...
    50053 Male 17.0 No No No Private Urban 62.37 41.3 never smoked No
    11544 Female 34.0 No No Yes Private Urban 71.37 32.9 never smoked No
    12376 Male 63.0 No No Yes Govt_job Urban 95.16 37.8 formerly smoked No
    16488 Female 57.0 Yes No Yes Private Urban 210.00 NaN never smoked No
    65357 Female 5.0 No No No children Rural 84.59 17.7 NaN No

    256 rows × 11 columns

    Don't forget the replace parameter in order to get a different object, without loosing the original one. Instead of using the frac parameter, which defines the percentage of rows to collect, we can use n to specify the absolute number of records.