Lesson 6

Overview

  • Pandas
    1. Initializing a dataframe
       • From NumPy arrays
       • From a file
    2. Saving a dataframe
       • Filetype options
    3. pd.DataFrame versus pd.Series
       • Converting a Pandas series into a NumPy array
       • Accessing the list of a dataframe’s columns
    4. Series datatypes
    5. Working with strings
    6. Indexing
       • Accessing elements with .loc and .iloc
       • .reset_index
       • Multi-index
       • Using indexing to avoid merges
    7. Merging, concatenating, and appending columns
       • Merging
       • Concatenating
       • Appending columns
    8. Groupby
       • .transform
       • .apply
    9. Long and wide formats
    10. Miscellaneous functions and methods
       • .drop_duplicates
       • .sort_values
       • .unique
       • .value_counts

Pandas

Dataframes in Python

import numpy as np
import pandas as pd

1. Initializing a dataframe

1.1 From NumPy arrays

Initialize an instance of the pd.DataFrame class with a dictionary linking column names to associated NumPy arrays

rng = np.random.default_rng(1234)
i = np.arange(100)
y = rng.uniform(size=100)
df = pd.DataFrame({'i': i, 'y': y})
df
i y
0 0 0.976700
1 1 0.380196
2 2 0.923246
3 3 0.261692
4 4 0.319097
... ... ...
95 95 0.873824
96 96 0.350609
97 97 0.823521
98 98 0.947199
99 99 0.067041

100 rows × 2 columns

1.2 From a file

Use something like pd.read_csv(), pd.read_stata(), pd.read_excel(), pd.read_feather(), etc.

Source for sample data: here

df = pd.read_csv('/Users/adamoppenheimer/Library/CloudStorage/GoogleDrive-oppen040@umn.edu/My Drive/Econ Crash Course 2024/Data/twoway_sample_data.csv')
df
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.0 0.348756
1 0 122 -1.368881 1 2 6 0.0 0.348756
2 0 107 -1.915975 2 2 5 0.0 0.114185
3 0 107 0.587144 3 2 5 0.0 0.114185
4 0 107 -0.398403 4 2 5 0.0 0.114185
... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756
49999 9999 65 0.365200 4 2 3 0.0 -0.348756

50000 rows × 8 columns

2. Saving a dataframe

Use df.to_csv(), df.to_stata(), df.to_excel(), df.to_feather(), etc.

2.1 Filetype options

feather is recommended - see, for example, here
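As a minimal sketch (the file names here are hypothetical), saving and reloading the dataframe might look like:

# Hypothetical file names, used only for illustration
df.to_csv('twoway_sample_data_copy.csv', index=False)
# Feather requires the pyarrow package and a default RangeIndex
df.to_feather('twoway_sample_data_copy.feather')
df_reloaded = pd.read_feather('twoway_sample_data_copy.feather')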

3. pd.DataFrame versus pd.Series

A pd.DataFrame stores a collection of pd.Series objects, while a pd.Series is a single column. Each class has its own methods.

type(df)
pandas.core.frame.DataFrame
type(df.loc[:, 'i'])
pandas.core.series.Series

We can access a column but treat it as a DataFrame by wrapping the column name in brackets (['col_name'])

type(df.loc[:, ['i']])
pandas.core.frame.DataFrame

3.1 Converting a Pandas series into a NumPy array

Use .to_numpy()

df.loc[:, 'i'].to_numpy()
array([   0,    0,    0, ..., 9999, 9999, 9999])
type(df.loc[:, 'i'].to_numpy())
numpy.ndarray

3.2 Accessing the list of a dataframe’s columns

Use df.columns

df.columns
Index(['i', 'j', 'y', 't', 'l', 'k', 'alpha', 'psi'], dtype='object')
df.shape
(50000, 8)

4. Series datatypes

Essentially the same as NumPy datatypes, e.g. int, float, etc. (strings are stored with the object dtype)

Convert a column to a new type by using col.astype(dtype)

df.loc[:, 'i'].dtype
dtype('int64')
df.loc[:, 'i'].astype(float)
0           0.0
1           0.0
2           0.0
3           0.0
4           0.0
          ...  
49995    9999.0
49996    9999.0
49997    9999.0
49998    9999.0
49999    9999.0
Name: i, Length: 50000, dtype: float64
df.loc[:, 'i'].astype(float).dtype
dtype('float64')

5. Working with strings

Use .str.function() on a Series

df.loc[:, 'i'].astype(str).str.strip('9')
0        0
1        0
2        0
3        0
4        0
        ..
49995     
49996     
49997     
49998     
49999     
Name: i, Length: 50000, dtype: object

6. Indexing

Indexing in Pandas is unnecessarily complicated: each dataframe has its own index, which is not necessarily a consecutive sequence of integers, or even numeric (e.g. the index could be strings)

6.1 Accessing elements with .loc and .iloc

.loc indexes using the particular index assigned to the given dataframe

.iloc indexes by integer location

df
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.0 0.348756
1 0 122 -1.368881 1 2 6 0.0 0.348756
2 0 107 -1.915975 2 2 5 0.0 0.114185
3 0 107 0.587144 3 2 5 0.0 0.114185
4 0 107 -0.398403 4 2 5 0.0 0.114185
... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756
49999 9999 65 0.365200 4 2 3 0.0 -0.348756

50000 rows × 8 columns

df['j'].to_numpy() == 70
array([False, False, False, ..., False, False, False])
df.loc[df['j'].to_numpy() == 70, :]
i j y t l k alpha psi
500 100 70 -1.681804 0 1 3 -0.430727 -0.348756
501 100 70 -1.027119 1 1 3 -0.430727 -0.348756
502 100 70 -1.427879 2 1 3 -0.430727 -0.348756
1096 219 70 0.846744 1 3 3 0.430727 -0.348756
1097 219 70 -0.232732 2 3 3 0.430727 -0.348756
... ... ... ... ... ... ... ... ...
49832 9966 70 0.259503 2 4 3 0.967422 -0.348756
49833 9966 70 -0.295527 3 4 3 0.967422 -0.348756
49834 9966 70 1.635962 4 4 3 0.967422 -0.348756
49970 9994 70 0.672032 0 1 3 -0.430727 -0.348756
49971 9994 70 0.114335 1 1 3 -0.430727 -0.348756

245 rows × 8 columns

df.loc[:10, :]
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.000000 0.348756
1 0 122 -1.368881 1 2 6 0.000000 0.348756
2 0 107 -1.915975 2 2 5 0.000000 0.114185
3 0 107 0.587144 3 2 5 0.000000 0.114185
4 0 107 -0.398403 4 2 5 0.000000 0.114185
5 1 132 0.164539 0 0 6 -0.967422 0.348756
6 1 132 0.072192 1 0 6 -0.967422 0.348756
7 1 132 0.589733 2 0 6 -0.967422 0.348756
8 1 28 -2.535896 3 0 1 -0.967422 -0.908458
9 1 28 -0.232353 4 0 1 -0.967422 -0.908458
10 2 174 -0.004053 0 3 8 0.430727 0.908458
df.loc[:10, :].loc[:5, :]
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.000000 0.348756
1 0 122 -1.368881 1 2 6 0.000000 0.348756
2 0 107 -1.915975 2 2 5 0.000000 0.114185
3 0 107 0.587144 3 2 5 0.000000 0.114185
4 0 107 -0.398403 4 2 5 0.000000 0.114185
5 1 132 0.164539 0 0 6 -0.967422 0.348756
df.loc[df['j'].to_numpy() == 70, :].loc[:5, :]
i j y t l k alpha psi
df.loc[df['j'].to_numpy() == 70, :].iloc[:5, :]
i j y t l k alpha psi
500 100 70 -1.681804 0 1 3 -0.430727 -0.348756
501 100 70 -1.027119 1 1 3 -0.430727 -0.348756
502 100 70 -1.427879 2 1 3 -0.430727 -0.348756
1096 219 70 0.846744 1 3 3 0.430727 -0.348756
1097 219 70 -0.232732 2 3 3 0.430727 -0.348756

6.2 .reset_index

This allows you to reset the index to be a consecutive series of integers

Specify the option drop=False so that the current index becomes a new column in the given dataframe

df.loc[df['j'].to_numpy() == 70, :].reset_index(drop=True)
i j y t l k alpha psi
0 100 70 -1.681804 0 1 3 -0.430727 -0.348756
1 100 70 -1.027119 1 1 3 -0.430727 -0.348756
2 100 70 -1.427879 2 1 3 -0.430727 -0.348756
3 219 70 0.846744 1 3 3 0.430727 -0.348756
4 219 70 -0.232732 2 3 3 0.430727 -0.348756
... ... ... ... ... ... ... ... ...
240 9966 70 0.259503 2 4 3 0.967422 -0.348756
241 9966 70 -0.295527 3 4 3 0.967422 -0.348756
242 9966 70 1.635962 4 4 3 0.967422 -0.348756
243 9994 70 0.672032 0 1 3 -0.430727 -0.348756
244 9994 70 0.114335 1 1 3 -0.430727 -0.348756

245 rows × 8 columns

df.loc[df['j'].to_numpy() == 70, :].reset_index(drop=False)
index i j y t l k alpha psi
0 500 100 70 -1.681804 0 1 3 -0.430727 -0.348756
1 501 100 70 -1.027119 1 1 3 -0.430727 -0.348756
2 502 100 70 -1.427879 2 1 3 -0.430727 -0.348756
3 1096 219 70 0.846744 1 3 3 0.430727 -0.348756
4 1097 219 70 -0.232732 2 3 3 0.430727 -0.348756
... ... ... ... ... ... ... ... ... ...
240 49832 9966 70 0.259503 2 4 3 0.967422 -0.348756
241 49833 9966 70 -0.295527 3 4 3 0.967422 -0.348756
242 49834 9966 70 1.635962 4 4 3 0.967422 -0.348756
243 49970 9994 70 0.672032 0 1 3 -0.430727 -0.348756
244 49971 9994 70 0.114335 1 1 3 -0.430727 -0.348756

245 rows × 9 columns

6.3 Multi-index

It allows you to specify multiple columns as the index

Horrible, please don’t use it. I don’t know how to use it and have never encountered a task in Pandas that couldn’t be accomplished without it. (Of course, many people like it, see here for some benefits.)
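For reference only, a minimal sketch of what a multi-index looks like (not used anywhere else in these notes):

# Index by worker i and period t instead of by row number
df_mi = df.set_index(['i', 't'])
df_mi.loc[(0, 2)]   # the row for worker 0 in period 2
df_mi.loc[0]        # all rows for worker 0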

6.4 Using indexing to avoid merges

Instead of merging a lookup table onto the dataframe, you can build a Series indexed by the key and then index it with the key column directly. Here we build a Series mapping each value of k (temporarily shifted by 10, presumably so the index values look clearly different from row positions) to its psi:

# Shift 'k' so the resulting index (10-19) is clearly not just a row position
df['k'] = df['k'] + 10
# Series mapping each (shifted) k to the first psi observed for that k
k_to_psi = df.groupby(['k'])['psi'].first()
# Undo the shift
df['k'] = df['k'] - 10
k_to_psi
k
10   -1.335178
11   -0.908458
12   -0.604585
13   -0.348756
14   -0.114185
15    0.114185
16    0.348756
17    0.604585
18    0.908458
19    1.335178
Name: psi, dtype: float64
type(k_to_psi)
pandas.core.series.Series
k_to_psi.loc[10]
np.float64(-1.335177736118937)
k_to_psi.loc[df['k'] + 10]
k
16    0.348756
16    0.348756
15    0.114185
15    0.114185
15    0.114185
        ...   
16    0.348756
13   -0.348756
14   -0.114185
13   -0.348756
13   -0.348756
Name: psi, Length: 50000, dtype: float64
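For comparison, here is a rough sketch of the merge that this indexing pattern replaces (the names lookup, psi_first, and merged are made up for illustration):

# Turn the Series into a lookup table and merge it onto the dataframe on 'k'
lookup = k_to_psi.rename('psi_first').reset_index()
lookup['k'] = lookup['k'] - 10   # undo the temporary +10 shift so 'k' matches df['k'] again
merged = pd.merge(df, lookup, how='left', on='k')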

7. Merging, concatenating, and appending columns

7.1 Merging

Use pd.merge(a, b, how='inner', on='col_name') (see here for more details)

pd.merge(df.loc[:, ['i', 't', 'j']], df.loc[:, ['i', 't', 'y']], how='inner', on=['i', 't'])
i t j y
0 0 0 122 -0.588273
1 0 1 122 -1.368881
2 0 2 107 -1.915975
3 0 3 107 0.587144
4 0 4 107 -0.398403
... ... ... ... ...
49995 9999 0 121 -0.564833
49996 9999 1 67 -0.850231
49997 9999 2 96 -0.199708
49998 9999 3 65 -0.605951
49999 9999 4 65 0.365200

50000 rows × 4 columns

7.2 Concatenating

Using pd.concat([a, b, c, ...]) (see here for more details)

Specifying axis=0 concatenates rows

Specifying axis=1 concatenates columns

pd.concat([df.loc[: 99, :], df.loc[100: 999, :], df.loc[1000:, :]], axis=0)
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.0 0.348756
1 0 122 -1.368881 1 2 6 0.0 0.348756
2 0 107 -1.915975 2 2 5 0.0 0.114185
3 0 107 0.587144 3 2 5 0.0 0.114185
4 0 107 -0.398403 4 2 5 0.0 0.114185
... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756
49999 9999 65 0.365200 4 2 3 0.0 -0.348756

50000 rows × 8 columns

pd.concat([df.loc[:, ['i']], df.loc[:, ['j']], df.loc[:, ['y']]], axis=1)
i j y
0 0 122 -0.588273
1 0 122 -1.368881
2 0 107 -1.915975
3 0 107 0.587144
4 0 107 -0.398403
... ... ... ...
49995 9999 121 -0.564833
49996 9999 67 -0.850231
49997 9999 96 -0.199708
49998 9999 65 -0.605951
49999 9999 65 0.365200

50000 rows × 3 columns

7.3 Appending columns

Just define a new column by assigning to it, as if the dataframe were a dictionary

df.loc[:, 'n'] = rng.uniform(size=len(df))
df
i j y t l k alpha psi n
0 0 122 -0.588273 0 2 6 0.0 0.348756 0.297840
1 0 122 -1.368881 1 2 6 0.0 0.348756 0.619161
2 0 107 -1.915975 2 2 5 0.0 0.114185 0.301757
3 0 107 0.587144 3 2 5 0.0 0.114185 0.174846
4 0 107 -0.398403 4 2 5 0.0 0.114185 0.495505
... ... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756 0.156580
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756 0.782052
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185 0.678319
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756 0.004485
49999 9999 65 0.365200 4 2 3 0.0 -0.348756 0.416472

50000 rows × 9 columns

8. Groupby

First group observations into categories, then perform a particular operation within a given category

We will compute the mean and variance of income for each worker:

df.groupby('i')['y'].mean()
i
0      -0.736877
1      -0.388357
2       1.497247
3      -1.197781
4       1.056655
          ...   
9995    0.484574
9996   -0.821621
9997   -0.911059
9998    2.225247
9999   -0.371104
Name: y, Length: 10000, dtype: float64
df.groupby('i')['y'].var()
i
0       0.919845
1       1.527680
2       1.418687
3       2.724300
4       0.642759
          ...   
9995    1.498775
9996    0.319638
9997    0.156393
9998    1.845940
9999    0.223442
Name: y, Length: 10000, dtype: float64

If the function of interest depends only on the grouping itself (e.g. counting observations per group), you can call it directly on the groupby object

df.groupby('i').size()
i
0       5
1       5
2       5
3       5
4       5
       ..
9995    5
9996    5
9997    5
9998    5
9999    5
Length: 10000, dtype: int64

8.1 .transform

Expand the results of a groupby back to the size of the original dataframe

df.groupby('i')['y'].transform('mean')
0       -0.736877
1       -0.736877
2       -0.736877
3       -0.736877
4       -0.736877
           ...   
49995   -0.371104
49996   -0.371104
49997   -0.371104
49998   -0.371104
49999   -0.371104
Name: y, Length: 50000, dtype: float64

Now we could do something like compute the ratio of each wage to the worker’s average wage

df['y'] / df.groupby('i')['y'].transform('mean')
0        0.798332
1        1.857677
2        2.600127
3       -0.796800
4        0.540664
           ...   
49995    1.522032
49996    2.291081
49997    0.538146
49998    1.632831
49999   -0.984090
Name: y, Length: 50000, dtype: float64

8.2 .apply

Use a custom function with groupby

WARNING: .apply is notoriously slow

In this example, we will compute the second highest unique wage for each worker

def second_largest(col):
    # np.unique returns the sorted unique values, so the second-to-last entry is the second largest
    return np.unique(col)[-2]

df.groupby('i')['y'].apply(second_largest)
i
0      -0.398403
1       0.164539
2       1.927676
3      -0.437750
4       1.534960
          ...   
9995    1.352489
9996   -0.569682
9997   -0.883654
9998    3.068653
9999   -0.199708
Name: y, Length: 10000, dtype: float64

9. Long and wide formats

Usually use the long format: it stores a single column per variable and adds a new row for each observation-period combination

The wide format instead uses a single row per observation and puts each period in its own column

Can convert between wide and long using pd.wide_to_long(), pd.melt(), pd.pivot(), etc.

This is a fairly confusing process; in practice, whenever I’ve used it I’ve just played around until it gave the result I wanted

df
i j y t l k alpha psi n
0 0 122 -0.588273 0 2 6 0.0 0.348756 0.297840
1 0 122 -1.368881 1 2 6 0.0 0.348756 0.619161
2 0 107 -1.915975 2 2 5 0.0 0.114185 0.301757
3 0 107 0.587144 3 2 5 0.0 0.114185 0.174846
4 0 107 -0.398403 4 2 5 0.0 0.114185 0.495505
... ... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756 0.156580
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756 0.782052
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185 0.678319
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756 0.004485
49999 9999 65 0.365200 4 2 3 0.0 -0.348756 0.416472

50000 rows × 9 columns
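As a minimal sketch (the names wide and long_again are made up), converting the y column of this dataframe to wide format and back:

# One row per worker i, one column of y per period t
wide = df.pivot(index='i', columns='t', values='y')
# Back to long format: one row per (i, t) pair
long_again = wide.reset_index().melt(id_vars='i', var_name='t', value_name='y')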

10. Miscellaneous functions and methods

10.1 .drop_duplicates

Drop duplicate observations

dup_df = pd.concat([df, df.loc[: 100, :]])
dup_df.drop_duplicates()
i j y t l k alpha psi n
0 0 122 -0.588273 0 2 6 0.0 0.348756 0.297840
1 0 122 -1.368881 1 2 6 0.0 0.348756 0.619161
2 0 107 -1.915975 2 2 5 0.0 0.114185 0.301757
3 0 107 0.587144 3 2 5 0.0 0.114185 0.174846
4 0 107 -0.398403 4 2 5 0.0 0.114185 0.495505
... ... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756 0.156580
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756 0.782052
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185 0.678319
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756 0.004485
49999 9999 65 0.365200 4 2 3 0.0 -0.348756 0.416472

50000 rows × 9 columns

10.2 .sort_values

Sort the values of the dataframe based on particular columns

df
i j y t l k alpha psi n
0 0 122 -0.588273 0 2 6 0.0 0.348756 0.297840
1 0 122 -1.368881 1 2 6 0.0 0.348756 0.619161
2 0 107 -1.915975 2 2 5 0.0 0.114185 0.301757
3 0 107 0.587144 3 2 5 0.0 0.114185 0.174846
4 0 107 -0.398403 4 2 5 0.0 0.114185 0.495505
... ... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756 0.156580
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756 0.782052
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185 0.678319
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756 0.004485
49999 9999 65 0.365200 4 2 3 0.0 -0.348756 0.416472

50000 rows × 9 columns

df.sort_values(['j', 'y'])
i j y t l k alpha psi n
33263 6652 0 -5.064867 3 0 0 -0.967422 -1.335178 0.985021
9649 1929 0 -4.971061 4 0 0 -0.967422 -1.335178 0.800174
40830 8166 0 -4.758191 0 0 0 -0.967422 -1.335178 0.898404
18141 3628 0 -4.627821 1 0 0 -0.967422 -1.335178 0.311400
12029 2405 0 -4.166904 4 0 0 -0.967422 -1.335178 0.698459
... ... ... ... ... ... ... ... ... ...
33794 6758 198 4.063326 4 4 9 0.967422 1.335178 0.259600
21136 4227 198 4.117421 1 2 9 0.000000 1.335178 0.674148
48123 9624 198 4.356418 3 4 9 0.967422 1.335178 0.825097
29290 5858 198 4.447053 0 4 9 0.967422 1.335178 0.570896
2433 486 198 4.559297 3 4 9 0.967422 1.335178 0.797747

50000 rows × 9 columns

10.3 .unique

List the unique values in a particular column

WARNING: NumPy .unique returns a sorted array, whereas Pandas .unique returns an array ordered by appearance

df.loc[:, 'y'].unique()
array([-0.58827279, -1.36888053, -1.91597465, ..., -0.19970824,
       -0.60595076,  0.3652001 ])
np.unique(df.loc[:, 'y'])
array([-5.39994589, -5.39103478, -5.31532195, ...,  5.24649836,
        5.41834757,  6.03305639])

10.4 .value_counts

Count the number of occurrences of each unique value in a column

df.loc[:, 'i'].value_counts()
i
0       5
6670    5
6663    5
6664    5
6665    5
       ..
3333    5
3334    5
3335    5
3336    5
9999    5
Name: count, Length: 10000, dtype: int64