Lesson 6

Overview

  • Pandas
    1. Initializing a dataframe
       • From NumPy arrays
       • From a file
    2. Saving a dataframe
       • Filetype options
    3. pd.DataFrame versus pd.Series
       • Converting a Pandas series into a NumPy array
       • Accessing the list of a dataframe’s columns
    4. Series datatypes
    5. Working with strings
    6. Indexing
       • Accessing elements with .loc and .iloc
       • .reset_index
       • Multi-index
       • Using indexing to avoid merges
    7. Merging, concatenating, and appending columns
       • Merging
       • Concatenating
       • Appending columns
    8. Groupby
       • .transform
       • .apply
    9. Long and wide formats
    10. Miscellaneous functions and methods
       • .drop_duplicates
       • .sort_values
       • .unique
       • .value_counts

Pandas

Dataframes in Python

import numpy as np
import pandas as pd

1. Initializing a dataframe

1.1 From NumPy arrays

Initialize an instance of the pd.DataFrame class with a dictionary linking column names to associated NumPy arrays

rng = np.random.default_rng(1234)
i = np.arange(100)
y = rng.uniform(size=100)
df = pd.DataFrame({'i': i, 'y': y})
df
i y
0 0 0.976700
1 1 0.380196
2 2 0.923246
3 3 0.261692
4 4 0.319097
... ... ...
95 95 0.873824
96 96 0.350609
97 97 0.823521
98 98 0.947199
99 99 0.067041

100 rows × 2 columns

1.2 From a file

Use something like pd.read_csv(), pd.read_stata(), pd.read_excel(), pd.read_feather(), etc.

Source for sample data: here

df = pd.read_csv('/Users/adamoppenheimer/Library/CloudStorage/GoogleDrive-oppen040@umn.edu/My Drive/Econ Crash Course 2024/Data/twoway_sample_data.csv')
df
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.0 0.348756
1 0 122 -1.368881 1 2 6 0.0 0.348756
2 0 107 -1.915975 2 2 5 0.0 0.114185
3 0 107 0.587144 3 2 5 0.0 0.114185
4 0 107 -0.398403 4 2 5 0.0 0.114185
... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756
49999 9999 65 0.365200 4 2 3 0.0 -0.348756

50000 rows × 8 columns

2. Saving a dataframe

Use df.to_csv(), df.to_stata(), df.to_excel(), df.to_feather(), etc.

2.1 Filetype options

feather is recommended - see, for example, here
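As a minimal sketch (the file names here are hypothetical), saving and reloading the dataframe might look like:

# Hypothetical file names, used only for illustration
df.to_csv('twoway_sample_data_copy.csv', index=False)
# Feather requires the pyarrow package and a default RangeIndex
df.to_feather('twoway_sample_data_copy.feather')
df_reloaded = pd.read_feather('twoway_sample_data_copy.feather')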

3. pd.DataFrame versus pd.Series

A pd.DataFrame stores a collection of pd.Series objects, while a pd.Series is a single column. Each class has its own methods.

type(df)
pandas.core.frame.DataFrame
type(df.loc[:, 'i'])
pandas.core.series.Series

We can access a column but treat it as a DataFrame by wrapping the column name in brackets (['col_name'])

type(df.loc[:, ['i']])
pandas.core.frame.DataFrame

3.1 Converting a Pandas series into a NumPy array

Use .to_numpy()

df.loc[:, 'i'].to_numpy()
array([   0,    0,    0, ..., 9999, 9999, 9999])
type(df.loc[:, 'i'].to_numpy())
numpy.ndarray

3.2 Accessing the list of a dataframe’s columns

Use df.columns

df.columns
Index(['i', 'j', 'y', 't', 'l', 'k', 'alpha', 'psi'], dtype='object')
df.shape
(50000, 8)

4. Series datatypes

Essentially the same as NumPy datatypes, e.g. int, float, etc. (strings are stored with the object dtype)

Convert a column to a new type by using col.astype(dtype)

df.loc[:, 'i'].dtype
dtype('int64')
df.loc[:, 'i'].astype(float)
0           0.0
1           0.0
2           0.0
3           0.0
4           0.0
          ...  
49995    9999.0
49996    9999.0
49997    9999.0
49998    9999.0
49999    9999.0
Name: i, Length: 50000, dtype: float64
df.loc[:, 'i'].astype(float).dtype
dtype('float64')

5. Working with strings

Use .str.function() on a Series

df.loc[:, 'i'].astype(str).str.strip('9')
0        0
1        0
2        0
3        0
4        0
        ..
49995     
49996     
49997     
49998     
49999     
Name: i, Length: 50000, dtype: object

6. Indexing

Indexing in Pandas is unnecessarily complicated: each dataframe has its own index, which is not necessarily a consecutive sequence of integers, or even numeric (e.g. the index could be strings)

6.1 Accessing elements with .loc and .iloc

.loc indexes using the particular index assigned to the given dataframe

.iloc indexes by integer location

df
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.0 0.348756
1 0 122 -1.368881 1 2 6 0.0 0.348756
2 0 107 -1.915975 2 2 5 0.0 0.114185
3 0 107 0.587144 3 2 5 0.0 0.114185
4 0 107 -0.398403 4 2 5 0.0 0.114185
... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756
49999 9999 65 0.365200 4 2 3 0.0 -0.348756

50000 rows × 8 columns

df['j'].to_numpy() == 70
array([False, False, False, ..., False, False, False])
df.loc[df['j'].to_numpy() == 70, :]
i j y t l k alpha psi
500 100 70 -1.681804 0 1 3 -0.430727 -0.348756
501 100 70 -1.027119 1 1 3 -0.430727 -0.348756
502 100 70 -1.427879 2 1 3 -0.430727 -0.348756
1096 219 70 0.846744 1 3 3 0.430727 -0.348756
1097 219 70 -0.232732 2 3 3 0.430727 -0.348756
... ... ... ... ... ... ... ... ...
49832 9966 70 0.259503 2 4 3 0.967422 -0.348756
49833 9966 70 -0.295527 3 4 3 0.967422 -0.348756
49834 9966 70 1.635962 4 4 3 0.967422 -0.348756
49970 9994 70 0.672032 0 1 3 -0.430727 -0.348756
49971 9994 70 0.114335 1 1 3 -0.430727 -0.348756

245 rows × 8 columns

df.loc[:10, :]
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.000000 0.348756
1 0 122 -1.368881 1 2 6 0.000000 0.348756
2 0 107 -1.915975 2 2 5 0.000000 0.114185
3 0 107 0.587144 3 2 5 0.000000 0.114185
4 0 107 -0.398403 4 2 5 0.000000 0.114185
5 1 132 0.164539 0 0 6 -0.967422 0.348756
6 1 132 0.072192 1 0 6 -0.967422 0.348756
7 1 132 0.589733 2 0 6 -0.967422 0.348756
8 1 28 -2.535896 3 0 1 -0.967422 -0.908458
9 1 28 -0.232353 4 0 1 -0.967422 -0.908458
10 2 174 -0.004053 0 3 8 0.430727 0.908458
df.loc[:10, :].loc[:5, :]
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.000000 0.348756
1 0 122 -1.368881 1 2 6 0.000000 0.348756
2 0 107 -1.915975 2 2 5 0.000000 0.114185
3 0 107 0.587144 3 2 5 0.000000 0.114185
4 0 107 -0.398403 4 2 5 0.000000 0.114185
5 1 132 0.164539 0 0 6 -0.967422 0.348756
df.loc[df['j'].to_numpy() == 70, :].loc[:5, :]
i j y t l k alpha psi
df.loc[df['j'].to_numpy() == 70, :].iloc[:5, :]
i j y t l k alpha psi
500 100 70 -1.681804 0 1 3 -0.430727 -0.348756
501 100 70 -1.027119 1 1 3 -0.430727 -0.348756
502 100 70 -1.427879 2 1 3 -0.430727 -0.348756
1096 219 70 0.846744 1 3 3 0.430727 -0.348756
1097 219 70 -0.232732 2 3 3 0.430727 -0.348756

6.2 .reset_index

This allows you to reset the index to be a consecutive series of integers

Specify the option drop=False so that the current index becomes a new column in the given dataframe

df.loc[df['j'].to_numpy() == 70, :].reset_index(drop=True)
i j y t l k alpha psi
0 100 70 -1.681804 0 1 3 -0.430727 -0.348756
1 100 70 -1.027119 1 1 3 -0.430727 -0.348756
2 100 70 -1.427879 2 1 3 -0.430727 -0.348756
3 219 70 0.846744 1 3 3 0.430727 -0.348756
4 219 70 -0.232732 2 3 3 0.430727 -0.348756
... ... ... ... ... ... ... ... ...
240 9966 70 0.259503 2 4 3 0.967422 -0.348756
241 9966 70 -0.295527 3 4 3 0.967422 -0.348756
242 9966 70 1.635962 4 4 3 0.967422 -0.348756
243 9994 70 0.672032 0 1 3 -0.430727 -0.348756
244 9994 70 0.114335 1 1 3 -0.430727 -0.348756

245 rows × 8 columns

df.loc[df['j'].to_numpy() == 70, :].reset_index(drop=False)
index i j y t l k alpha psi
0 500 100 70 -1.681804 0 1 3 -0.430727 -0.348756
1 501 100 70 -1.027119 1 1 3 -0.430727 -0.348756
2 502 100 70 -1.427879 2 1 3 -0.430727 -0.348756
3 1096 219 70 0.846744 1 3 3 0.430727 -0.348756
4 1097 219 70 -0.232732 2 3 3 0.430727 -0.348756
... ... ... ... ... ... ... ... ... ...
240 49832 9966 70 0.259503 2 4 3 0.967422 -0.348756
241 49833 9966 70 -0.295527 3 4 3 0.967422 -0.348756
242 49834 9966 70 1.635962 4 4 3 0.967422 -0.348756
243 49970 9994 70 0.672032 0 1 3 -0.430727 -0.348756
244 49971 9994 70 0.114335 1 1 3 -0.430727 -0.348756

245 rows × 9 columns

6.3 Multi-index

It allows you to specify multiple columns as the index

Horrible, please don’t use it. I don’t know how to use it and have never encountered a task in Pandas that couldn’t be accomplished without it. (Of course, many people like it, see here for some benefits.)
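For reference only, a minimal sketch of what a multi-index looks like (not used anywhere else in these notes):

# Index by worker i and period t instead of by row number
df_mi = df.set_index(['i', 't'])
df_mi.loc[(0, 2)]   # the row for worker 0 in period 2
df_mi.loc[0]        # all rows for worker 0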

6.4 Using indexing to avoid merges

Instead of merging a lookup table onto the dataframe, you can build a Series indexed by the key and then index it with the key column directly. Here we build a Series mapping each value of k (temporarily shifted by 10, presumably so the index values look clearly different from row positions) to its psi:

# Shift 'k' so the resulting index (10-19) is clearly not just a row position
df['k'] = df['k'] + 10
# Series mapping each (shifted) k to the first psi observed for that k
k_to_psi = df.groupby(['k'])['psi'].first()
# Undo the shift
df['k'] = df['k'] - 10
k_to_psi
k
10   -1.335178
11   -0.908458
12   -0.604585
13   -0.348756
14   -0.114185
15    0.114185
16    0.348756
17    0.604585
18    0.908458
19    1.335178
Name: psi, dtype: float64
type(k_to_psi)
pandas.core.series.Series
k_to_psi.loc[10]
np.float64(-1.335177736118937)
k_to_psi.loc[df['k'] + 10]
k
16    0.348756
16    0.348756
15    0.114185
15    0.114185
15    0.114185
        ...   
16    0.348756
13   -0.348756
14   -0.114185
13   -0.348756
13   -0.348756
Name: psi, Length: 50000, dtype: float64
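For comparison, here is a rough sketch of the merge that this indexing pattern replaces (the names lookup, psi_first, and merged are made up for illustration):

# Turn the Series into a lookup table and merge it onto the dataframe on 'k'
lookup = k_to_psi.rename('psi_first').reset_index()
lookup['k'] = lookup['k'] - 10   # undo the temporary +10 shift so 'k' matches df['k'] again
merged = pd.merge(df, lookup, how='left', on='k')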

7. Merging, concatenating, and appending columns

7.1 Merging

Use pd.merge(a, b, how='inner', on='col_name') (see here for more details)

pd.merge(df.loc[:, ['i', 't', 'j']], df.loc[:, ['i', 't', 'y']], how='inner', on=['i', 't'])
i t j y
0 0 0 122 -0.588273
1 0 1 122 -1.368881
2 0 2 107 -1.915975
3 0 3 107 0.587144
4 0 4 107 -0.398403
... ... ... ... ...
49995 9999 0 121 -0.564833
49996 9999 1 67 -0.850231
49997 9999 2 96 -0.199708
49998 9999 3 65 -0.605951
49999 9999 4 65 0.365200

50000 rows × 4 columns

7.2 Concatenating

Using pd.concat([a, b, c, ...]) (see here for more details)

Specifying axis=0 concatenates rows

Specifying axis=1 concatenates columns

pd.concat([df.loc[: 99, :], df.loc[100: 999, :], df.loc[1000:, :]], axis=0)
i j y t l k alpha psi
0 0 122 -0.588273 0 2 6 0.0 0.348756
1 0 122 -1.368881 1 2 6 0.0 0.348756
2 0 107 -1.915975 2 2 5 0.0 0.114185
3 0 107 0.587144 3 2 5 0.0 0.114185
4 0 107 -0.398403 4 2 5 0.0 0.114185
... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756
49999 9999 65 0.365200 4 2 3 0.0 -0.348756

50000 rows × 8 columns

pd.concat([df.loc[:, ['i']], df.loc[:, ['j']], df.loc[:, ['y']]], axis=1)
i j y
0 0 122 -0.588273
1 0 122 -1.368881
2 0 107 -1.915975
3 0 107 0.587144
4 0 107 -0.398403
... ... ... ...
49995 9999 121 -0.564833
49996 9999 67 -0.850231
49997 9999 96 -0.199708
49998 9999 65 -0.605951
49999 9999 65 0.365200

50000 rows × 3 columns

7.3 Appending columns

Just define a new column by assigning to it, as if the dataframe were a dictionary

df.loc[:, 'n'] = rng.uniform(size=len(df))
df
i j y t l k alpha psi n
0 0 122 -0.588273 0 2 6 0.0 0.348756 0.297840
1 0 122 -1.368881 1 2 6 0.0 0.348756 0.619161
2 0 107 -1.915975 2 2 5 0.0 0.114185 0.301757
3 0 107 0.587144 3 2 5 0.0 0.114185 0.174846
4 0 107 -0.398403 4 2 5 0.0 0.114185 0.495505
... ... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756 0.156580
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756 0.782052
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185 0.678319
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756 0.004485
49999 9999 65 0.365200 4 2 3 0.0 -0.348756 0.416472

50000 rows × 9 columns

8. Groupby

First group observations into categories, then perform a particular operation within a given category

We will compute the mean and variance of income for each worker:

df.groupby('i')['y'].mean()
i
0      -0.736877
1      -0.388357
2       1.497247
3      -1.197781
4       1.056655
          ...   
9995    0.484574
9996   -0.821621
9997   -0.911059
9998    2.225247
9999   -0.371104
Name: y, Length: 10000, dtype: float64
df.groupby('i')['y'].var()
i
0       0.919845
1       1.527680
2       1.418687
3       2.724300
4       0.642759
          ...   
9995    1.498775
9996    0.319638
9997    0.156393
9998    1.845940
9999    0.223442
Name: y, Length: 10000, dtype: float64

If the function of interest depends only on the grouping itself (e.g. counting observations per group), you can call it directly on the groupby object

df.groupby('i').size()
i
0       5
1       5
2       5
3       5
4       5
       ..
9995    5
9996    5
9997    5
9998    5
9999    5
Length: 10000, dtype: int64

8.1 .transform

Expand the results of a groupby back to the size of the original dataframe

df.groupby('i')['y'].transform('mean')
0       -0.736877
1       -0.736877
2       -0.736877
3       -0.736877
4       -0.736877
           ...   
49995   -0.371104
49996   -0.371104
49997   -0.371104
49998   -0.371104
49999   -0.371104
Name: y, Length: 50000, dtype: float64

Now we could do something like compute the ratio of each wage to the worker’s average wage

df['y'] / df.groupby('i')['y'].transform('mean')
0        0.798332
1        1.857677
2        2.600127
3       -0.796800
4        0.540664
           ...   
49995    1.522032
49996    2.291081
49997    0.538146
49998    1.632831
49999   -0.984090
Name: y, Length: 50000, dtype: float64

8.2 .apply

Use a custom function with groupby

WARNING: .apply is notoriously slow

In this example, we will compute the second highest unique wage for each worker

def second_largest(col):
    # np.unique returns the sorted unique values, so the second-to-last entry is the second largest
    return np.unique(col)[-2]

df.groupby('i')['y'].apply(second_largest)
i
0      -0.398403
1       0.164539
2       1.927676
3      -0.437750
4       1.534960
          ...   
9995    1.352489
9996   -0.569682
9997   -0.883654
9998    3.068653
9999   -0.199708
Name: y, Length: 10000, dtype: float64

9. Long and wide formats

Usually use the long format: it stores a single column per variable and adds a new row for each observation-period combination

The wide format instead uses a single row per observation and puts each period in its own column

Can convert between wide and long using pd.wide_to_long(), pd.melt(), pd.pivot(), etc.

This is a fairly confusing process; in practice, whenever I’ve used it I’ve just played around until it gave the result I wanted

df
i j y t l k alpha psi n
0 0 122 -0.588273 0 2 6 0.0 0.348756 0.297840
1 0 122 -1.368881 1 2 6 0.0 0.348756 0.619161
2 0 107 -1.915975 2 2 5 0.0 0.114185 0.301757
3 0 107 0.587144 3 2 5 0.0 0.114185 0.174846
4 0 107 -0.398403 4 2 5 0.0 0.114185 0.495505
... ... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756 0.156580
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756 0.782052
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185 0.678319
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756 0.004485
49999 9999 65 0.365200 4 2 3 0.0 -0.348756 0.416472

50000 rows × 9 columns
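As a minimal sketch (the names wide and long_again are made up), converting the y column of this dataframe to wide format and back:

# One row per worker i, one column of y per period t
wide = df.pivot(index='i', columns='t', values='y')
# Back to long format: one row per (i, t) pair
long_again = wide.reset_index().melt(id_vars='i', var_name='t', value_name='y')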

10. Miscellaneous functions and methods

10.1 .drop_duplicates

Drop duplicate observations

dup_df = pd.concat([df, df.loc[: 100, :]])
dup_df.drop_duplicates()
i j y t l k alpha psi n
0 0 122 -0.588273 0 2 6 0.0 0.348756 0.297840
1 0 122 -1.368881 1 2 6 0.0 0.348756 0.619161
2 0 107 -1.915975 2 2 5 0.0 0.114185 0.301757
3 0 107 0.587144 3 2 5 0.0 0.114185 0.174846
4 0 107 -0.398403 4 2 5 0.0 0.114185 0.495505
... ... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756 0.156580
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756 0.782052
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185 0.678319
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756 0.004485
49999 9999 65 0.365200 4 2 3 0.0 -0.348756 0.416472

50000 rows × 9 columns

10.2 .sort_values

Sort the values of the dataframe based on particular columns

df
i j y t l k alpha psi n
0 0 122 -0.588273 0 2 6 0.0 0.348756 0.297840
1 0 122 -1.368881 1 2 6 0.0 0.348756 0.619161
2 0 107 -1.915975 2 2 5 0.0 0.114185 0.301757
3 0 107 0.587144 3 2 5 0.0 0.114185 0.174846
4 0 107 -0.398403 4 2 5 0.0 0.114185 0.495505
... ... ... ... ... ... ... ... ... ...
49995 9999 121 -0.564833 0 2 6 0.0 0.348756 0.156580
49996 9999 67 -0.850231 1 2 3 0.0 -0.348756 0.782052
49997 9999 96 -0.199708 2 2 4 0.0 -0.114185 0.678319
49998 9999 65 -0.605951 3 2 3 0.0 -0.348756 0.004485
49999 9999 65 0.365200 4 2 3 0.0 -0.348756 0.416472

50000 rows × 9 columns

df.sort_values(['j', 'y'])
i j y t l k alpha psi n
33263 6652 0 -5.064867 3 0 0 -0.967422 -1.335178 0.985021
9649 1929 0 -4.971061 4 0 0 -0.967422 -1.335178 0.800174
40830 8166 0 -4.758191 0 0 0 -0.967422 -1.335178 0.898404
18141 3628 0 -4.627821 1 0 0 -0.967422 -1.335178 0.311400
12029 2405 0 -4.166904 4 0 0 -0.967422 -1.335178 0.698459
... ... ... ... ... ... ... ... ... ...
33794 6758 198 4.063326 4 4 9 0.967422 1.335178 0.259600
21136 4227 198 4.117421 1 2 9 0.000000 1.335178 0.674148
48123 9624 198 4.356418 3 4 9 0.967422 1.335178 0.825097
29290 5858 198 4.447053 0 4 9 0.967422 1.335178 0.570896
2433 486 198 4.559297 3 4 9 0.967422 1.335178 0.797747

50000 rows × 9 columns

10.3 .unique

List the unique values in a particular column

WARNING: NumPy .unique returns a sorted array, whereas Pandas .unique returns an array ordered by appearance

df.loc[:, 'y'].unique()
array([-0.58827279, -1.36888053, -1.91597465, ..., -0.19970824,
       -0.60595076,  0.3652001 ])
np.unique(df.loc[:, 'y'])
array([-5.39994589, -5.39103478, -5.31532195, ...,  5.24649836,
        5.41834757,  6.03305639])

10.4 .value_counts

Count the number of occurrences of each unique value in a column

df.loc[:, 'i'].value_counts()
i
0       5
6670    5
6663    5
6664    5
6665    5
       ..
3333    5
3334    5
3335    5
3336    5
9999    5
Name: count, Length: 10000, dtype: int64