Introduction to Pandas#

Webography

Code using pandas usually starts with the import statement

import numpy as np
import pandas as pd

Pandas

  • 2 data structures (Series, DataFrame) for data analysis

  • multiple methods for convenient data filtering.

  • toolkit utilities to perform input/output operations. It can read data from a variety of formats such as CSV, TSV, MS Excel, etc.

Pandas has two main data structures for data storage

  • Series

  • DataFrame

## Series structure
series1 = pd.Series([1, 2, 3, 4])
series1
0    1
1    2
2    3
3    4
dtype: int64
print(series1.sum())
print(series1.mean())
10
2.5
print(series1.to_csv())
,0
0,1
1,2
2,3
3,4
fruits = np.array(["kiwi", "orange", "mango", "apple"])
series2 = pd.Series(fruits)
series2
0      kiwi
1    orange
2     mango
3     apple
dtype: object

Dataframe#

A dictionnary of series where keys are column name

dataframe_type

How to create a data frame ?#

From scratch#

# Intialise data: dictionnary of lists.
data = {
    "Name": ["John", "Paul", "Debby", "Laura"],
    "Sex": ["Male", "Male", "Female", "Female"],
    "Age": [20, 40, 19, 30],
}

# Create DataFrame
df = pd.DataFrame(data)
df
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
type(df.Age)
pandas.core.series.Series
df.to_csv("person.txt")

From a file#

df_person = pd.read_csv("person.txt", sep=",", encoding="utf-8", header=0)
df_person
Unnamed: 0 Name Sex Age
0 0 John Male 20
1 1 Paul Male 40
2 2 Debby Female 19
3 3 Laura Female 30

By default, a new index is created

If you want use a field-based index, you have to specify it in the read_csv function:

df_person = pd.read_csv('person.txt', sep = ',', index_col=0, encoding="utf-8", header=0)
df_person
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30

Basic commands#

# display simple statistics
df_person.describe()
Age
count 4.000000
mean 27.250000
std 9.844626
min 19.000000
25% 19.750000
50% 25.000000
75% 32.500000
max 40.000000
# display the 5 first rows
df_person.head()
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
# display the 5 last rows
df_person.tail()
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
# display the dataframe columns
df_person.columns
Index(['Name', 'Sex', 'Age'], dtype='object')
# query one column
df_person["Age"]
0    20
1    40
2    19
3    30
Name: Age, dtype: int64
# another method to query one column
df_person.Age
0    20
1    40
2    19
3    30
Name: Age, dtype: int64
# query multiple columns
df_person[["Name", "Age"]]
Name Age
0 John 20
1 Paul 40
2 Debby 19
3 Laura 30
# display unique value of a column
df_person.Sex.unique()
array(['Male', 'Female'], dtype=object)
# display 2 first rows
df_person[:2]
Name Sex Age
0 John Male 20
1 Paul Male 40

iloc: Purely integer-location based indexing for selection by position.#

df_person.head()
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
df_person.iloc[2]
Name     Debby
Sex     Female
Age         19
Name: 2, dtype: object
df_person.iloc[2, 2]
19

loc: access a group of rows and columns by label(s) or a boolean array.#

df_person.head()
Name Sex Age
0 John Male 20
1 Paul Male 40
2 Debby Female 19
3 Laura Female 30
# one line
df_person.loc[2]
Name     Debby
Sex     Female
Age         19
Name: 2, dtype: object
# one value
df_person.loc[2, "Name"]
'Debby'

Basic operations on columns#

df_person.Age = df_person.Age + 2
df_person.Age
0    22
1    42
2    21
3    32
Name: Age, dtype: int64

Get or set a single value (fast)#

  • df_person.at: by labels

  • df_person.iat: “integer-location based indexing”

df_person.head()
Name Sex Age
0 John Male 22
1 Paul Male 42
2 Debby Female 21
3 Laura Female 32
df_person.at[2, "Name"]
'Debby'
df_person.iat[2, 0]
'Debby'

Add a row#

new_row = {"Name": "Glenn", "Sex": "Male", "Age": 10}
df_person = pd.concat([df_person, pd.DataFrame([new_row])], ignore_index=True)
df_person
Name Sex Age
0 John Male 22
1 Paul Male 42
2 Debby Female 21
3 Laura Female 32
4 Glenn Male 10

Add some rows#

data = {
    "Name": ["Marguerite", "Annie", "Stephen", "Ava"],
    "Sex": ["Female", "Female", "Male", "Female"],
    "Age": [34, 23, 49, 22],
}
df_person = pd.concat([df_person, pd.DataFrame(data)], ignore_index=True)
df_person
Name Sex Age
0 John Male 22
1 Paul Male 42
2 Debby Female 21
3 Laura Female 32
4 Glenn Male 10
5 Marguerite Female 34
6 Annie Female 23
7 Stephen Male 49
8 Ava Female 22

Add a column#

df_person["Nationality"] = "USA"
df_person
Name Sex Age Nationality
0 John Male 22 USA
1 Paul Male 42 USA
2 Debby Female 21 USA
3 Laura Female 32 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Stephen Male 49 USA
8 Ava Female 22 USA

Basic statistics#

type(df_person.Age)
pandas.core.series.Series
print(df_person.Age.mean())
print(df_person.Age.min())
print(df_person.Age.max())
print(df_person.Age.count())
28.333333333333332
10
49
9

How to sort data ?#

df_person_sorted = df_person.sort_values(["Age"], ascending=True)
df_person_sorted
Name Sex Age Nationality
4 Glenn Male 10 USA
2 Debby Female 21 USA
0 John Male 22 USA
8 Ava Female 22 USA
6 Annie Female 23 USA
3 Laura Female 32 USA
5 Marguerite Female 34 USA
1 Paul Male 42 USA
7 Stephen Male 49 USA

Selection#

# selection with one criterion
df_person[df_person["Sex"] == "Female"]
Name Sex Age Nationality
2 Debby Female 21 USA
3 Laura Female 32 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
8 Ava Female 22 USA
df_person[df_person["Age"] < 20]
Name Sex Age Nationality
4 Glenn Male 10 USA
# selection with 2 criteria
df_person[(df_person["Sex"] == "Male") & (df_person["Age"] > 30)]
Name Sex Age Nationality
1 Paul Male 42 USA
7 Stephen Male 49 USA

Update data#

# change one value by index
df_person.loc[7, "Name"] = "Stephane"
df_person
Name Sex Age Nationality
0 John Male 22 USA
1 Paul Male 42 USA
2 Debby Female 21 USA
3 Laura Female 32 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Stephane Male 49 USA
8 Ava Female 22 USA
# change one value after a selection
df_person.loc[df_person["Name"] == "Stephane", "Name"] = "Eric"
df_person
Name Sex Age Nationality
0 John Male 22 USA
1 Paul Male 42 USA
2 Debby Female 21 USA
3 Laura Female 32 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Eric Male 49 USA
8 Ava Female 22 USA
## Add a column
df_person["City"] = "City"
df_person
Name Sex Age Nationality City
0 John Male 22 USA City
1 Paul Male 42 USA City
2 Debby Female 21 USA City
3 Laura Female 32 USA City
4 Glenn Male 10 USA City
5 Marguerite Female 34 USA City
6 Annie Female 23 USA City
7 Eric Male 49 USA City
8 Ava Female 22 USA City
## Delete a column
df_person = df_person.drop("City", axis=1)
df_person
Name Sex Age Nationality
0 John Male 22 USA
1 Paul Male 42 USA
2 Debby Female 21 USA
3 Laura Female 32 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Eric Male 49 USA
8 Ava Female 22 USA

Concat#

concat-example
data = {
    "Name": ["Benedicte", "Bernard", "Nicolas", "Anne"],
    "Sex": ["Female", "Male", "Male", "Female"],
    "Age": [24, 34, 49, 42],
    "Nationality": ["FR", "FR", "FR", "FR"],
}
df_person_fr = pd.DataFrame(data)
list_person = [df_person, df_person_fr]
result = pd.concat(list_person)
result
Name Sex Age Nationality
0 John Male 22 USA
1 Paul Male 42 USA
2 Debby Female 21 USA
3 Laura Female 32 USA
4 Glenn Male 10 USA
5 Marguerite Female 34 USA
6 Annie Female 23 USA
7 Eric Male 49 USA
8 Ava Female 22 USA
0 Benedicte Female 24 FR
1 Bernard Male 34 FR
2 Nicolas Male 49 FR
3 Anne Female 42 FR

Join#

join-example
import random

data = {
    "id_Address": [0, 1, 2, 3],
    "Address": ["gordon street", "aqua boulevard", "st georges street", "5th street"],
    "City": ["Boston", "Chicago", "Charlotte", "San Francisco"],
}

# Create DataFrame
df_address = pd.DataFrame(data)
df_address
id_Address Address City
0 0 gordon street Boston
1 1 aqua boulevard Chicago
2 2 st georges street Charlotte
3 3 5th street San Francisco
df_person["id_Address"] = ""
nb_elements = df_person.Name.count()

cpt = 0
while cpt < nb_elements:
    df_person.loc[cpt, "id_Address"] = random.randint(0, 3)
    cpt = cpt + 1
df_person
Name Sex Age Nationality id_Address
0 John Male 22 USA 2
1 Paul Male 42 USA 1
2 Debby Female 21 USA 2
3 Laura Female 32 USA 1
4 Glenn Male 10 USA 3
5 Marguerite Female 34 USA 1
6 Annie Female 23 USA 0
7 Eric Male 49 USA 1
8 Ava Female 22 USA 0
result = pd.merge(df_person, df_address, how="left", on="id_Address")
result
Name Sex Age Nationality id_Address Address City
0 John Male 22 USA 2 st georges street Charlotte
1 Paul Male 42 USA 1 aqua boulevard Chicago
2 Debby Female 21 USA 2 st georges street Charlotte
3 Laura Female 32 USA 1 aqua boulevard Chicago
4 Glenn Male 10 USA 3 5th street San Francisco
5 Marguerite Female 34 USA 1 aqua boulevard Chicago
6 Annie Female 23 USA 0 gordon street Boston
7 Eric Male 49 USA 1 aqua boulevard Chicago
8 Ava Female 22 USA 0 gordon street Boston

Group By#

  • Splitting the data into groups based on some criteria.

  • Applying a function to each group independently.

  • Combining the results into a data structure.

groupby-example
df_person.groupby("Sex")["Sex"].count()
Sex
Female    5
Male      4
Name: Sex, dtype: int64
df_person.groupby("Sex")["Age"].mean()
Sex
Female    26.40
Male      30.75
Name: Age, dtype: float64

Export data#

export_csv = df_person.to_csv(r"./export_person.csv", index=None, header=True)

Plot data#

%matplotlib inline
df_person.groupby("Sex")["Sex"].count().plot.bar();
../_images/b643171695adad4a88db1e0fa5c572f941f5057dbee4bd8d24d1630c76a602dc.png