import numpy as np
import pandas as pd
Lesson 6
Overview
- Pandas
- Initializing a dataframe
- From NumPy arrays
- From a file
- Saving a dataframe
- Filetype options
- pd.DataFrame versus pd.Series
- Converting a Pandas series into a NumPy array
- Accessing the list of a dataframe’s columns
- Series datatypes
- Working with strings
- Indexing
- Accessing elements with .loc and .iloc
- .reset_index
- Multi-index
- Using indexing to avoid merges
- Merging, concatenating, and appending columns
- Merging
- Concatenating
- Appending columns
- Groupby
- .transform
- .apply
- Long and wide formats
- Miscellaneous functions and methods
- .drop_duplicates
- .sort_values
- .unique
- .value_counts
Pandas
Dataframes in Python
1. Initializing a dataframe
1.1 From NumPy arrays
Initialize an instance of the pd.DataFrame class with a dictionary linking column names to associated NumPy arrays
rng = np.random.default_rng(1234)
i = np.arange(100)
y = rng.uniform(size=100)
df = pd.DataFrame({'i': i, 'y': y})
df
| | i | y |
|---|---|---|
| 0 | 0 | 0.976700 |
| 1 | 1 | 0.380196 |
| 2 | 2 | 0.923246 |
| 3 | 3 | 0.261692 |
| 4 | 4 | 0.319097 |
| ... | ... | ... |
| 95 | 95 | 0.873824 |
| 96 | 96 | 0.350609 |
| 97 | 97 | 0.823521 |
| 98 | 98 | 0.947199 |
| 99 | 99 | 0.067041 |
100 rows × 2 columns
1.2 From a file
Use something like pd.read_csv(), pd.read_stata(), pd.read_excel(), pd.read_feather(), etc.
Source for sample data: here
df = pd.read_csv('/Users/adamoppenheimer/Library/CloudStorage/GoogleDrive-oppen040@umn.edu/My Drive/Econ Crash Course 2024/Data/twoway_sample_data.csv')
df
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 |
| 49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 |
| 49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 |
| 49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 |
| 49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 |
50000 rows × 8 columns
2. Saving a dataframe
Use df.to_csv(), df.to_stata(), df.to_excel(), df.to_feather(), etc.
2.1 Filetype options
feather is recommended - see, for example, here
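For example, a minimal sketch of saving and reloading the current dataframe as a feather file (the filename my_data.feather is just illustrative; .to_feather() and pd.read_feather() require the pyarrow package):
# Save the dataframe to a feather file, then read it back in
df.to_feather('my_data.feather')
df_reloaded = pd.read_feather('my_data.feather')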
3. pd.DataFrame versus pd.Series
A pd.DataFrame stores a collection of pd.Series, while a pd.Series is a single column. Each has its own class methods.
type(df)
pandas.core.frame.DataFrame
type(df.loc[:, 'i'])
pandas.core.series.Series
We can access a column but treat it as a DataFrame by wrapping the column name in brackets (['col_name'])
type(df.loc[:, ['i']])
pandas.core.frame.DataFrame
3.1 Converting a Pandas series into a NumPy array
Use .to_numpy()
df.loc[:, 'i'].to_numpy()
array([ 0, 0, 0, ..., 9999, 9999, 9999])
type(df.loc[:, 'i'].to_numpy())
numpy.ndarray
3.2 Accessing the list of a dataframe’s columns
Use df.columns (similarly, df.shape gives the dataframe's dimensions)
df.columns
Index(['i', 'j', 'y', 't', 'l', 'k', 'alpha', 'psi'], dtype='object')
df.shape
(50000, 8)
4. Series datatypes
Same as NumPy datatypes, e.g. int, float, str, etc.
Convert a column to a new type by using col.astype(dtype)
df.loc[:, 'i'].dtype
dtype('int64')
df.loc[:, 'i'].astype(float)
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
...
49995 9999.0
49996 9999.0
49997 9999.0
49998 9999.0
49999 9999.0
Name: i, Length: 50000, dtype: float64
df.loc[:, 'i'].astype(float).dtype
dtype('float64')
5. Working with strings
Use .str.function() on a Series
df.loc[:, 'i'].astype(str).str.strip('9')
0 0
1 0
2 0
3 0
4 0
..
49995
49996
49997
49998
49999
Name: i, Length: 50000, dtype: object
6. Indexing
Unnecessarily complicated - each dataframe has its own index, which isn’t necessarily a consecutive series of integers, or even numeric (e.g. the index could be strings)
6.1 Accessing elements with .loc and .iloc
.loc indexes using the particular index assigned to the given dataframe
.iloc indexes by integer location
df
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 |
| 49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 |
| 49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 |
| 49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 |
| 49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 |
50000 rows × 8 columns
df['j'].to_numpy() == 70
array([False, False, False, ..., False, False, False])
df.loc[df['j'].to_numpy() == 70, :]
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 500 | 100 | 70 | -1.681804 | 0 | 1 | 3 | -0.430727 | -0.348756 |
| 501 | 100 | 70 | -1.027119 | 1 | 1 | 3 | -0.430727 | -0.348756 |
| 502 | 100 | 70 | -1.427879 | 2 | 1 | 3 | -0.430727 | -0.348756 |
| 1096 | 219 | 70 | 0.846744 | 1 | 3 | 3 | 0.430727 | -0.348756 |
| 1097 | 219 | 70 | -0.232732 | 2 | 3 | 3 | 0.430727 | -0.348756 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49832 | 9966 | 70 | 0.259503 | 2 | 4 | 3 | 0.967422 | -0.348756 |
| 49833 | 9966 | 70 | -0.295527 | 3 | 4 | 3 | 0.967422 | -0.348756 |
| 49834 | 9966 | 70 | 1.635962 | 4 | 4 | 3 | 0.967422 | -0.348756 |
| 49970 | 9994 | 70 | 0.672032 | 0 | 1 | 3 | -0.430727 | -0.348756 |
| 49971 | 9994 | 70 | 0.114335 | 1 | 1 | 3 | -0.430727 | -0.348756 |
245 rows × 8 columns
df.loc[:10, :]
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.000000 | 0.348756 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.000000 | 0.348756 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.000000 | 0.114185 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.000000 | 0.114185 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.000000 | 0.114185 |
| 5 | 1 | 132 | 0.164539 | 0 | 0 | 6 | -0.967422 | 0.348756 |
| 6 | 1 | 132 | 0.072192 | 1 | 0 | 6 | -0.967422 | 0.348756 |
| 7 | 1 | 132 | 0.589733 | 2 | 0 | 6 | -0.967422 | 0.348756 |
| 8 | 1 | 28 | -2.535896 | 3 | 0 | 1 | -0.967422 | -0.908458 |
| 9 | 1 | 28 | -0.232353 | 4 | 0 | 1 | -0.967422 | -0.908458 |
| 10 | 2 | 174 | -0.004053 | 0 | 3 | 8 | 0.430727 | 0.908458 |
df.loc[:10, :].loc[:5, :]
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.000000 | 0.348756 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.000000 | 0.348756 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.000000 | 0.114185 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.000000 | 0.114185 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.000000 | 0.114185 |
| 5 | 1 | 132 | 0.164539 | 0 | 0 | 6 | -0.967422 | 0.348756 |
Note that chaining .loc[:5, :] after the boolean filter returns an empty dataframe, since none of the remaining index labels are less than or equal to 5:
df.loc[df['j'].to_numpy() == 70, :].loc[:5, :]
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
Using .iloc instead selects the first five rows by integer position:
df.loc[df['j'].to_numpy() == 70, :].iloc[:5, :]
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 500 | 100 | 70 | -1.681804 | 0 | 1 | 3 | -0.430727 | -0.348756 |
| 501 | 100 | 70 | -1.027119 | 1 | 1 | 3 | -0.430727 | -0.348756 |
| 502 | 100 | 70 | -1.427879 | 2 | 1 | 3 | -0.430727 | -0.348756 |
| 1096 | 219 | 70 | 0.846744 | 1 | 3 | 3 | 0.430727 | -0.348756 |
| 1097 | 219 | 70 | -0.232732 | 2 | 3 | 3 | 0.430727 | -0.348756 |
6.2 .reset_index
This allows you to reset the index to be a consecutive series of integers
Specify the option drop=False so that the current index becomes a new column in the given dataframe
df.loc[df['j'].to_numpy() == 70, :].reset_index(drop=True)
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 100 | 70 | -1.681804 | 0 | 1 | 3 | -0.430727 | -0.348756 |
| 1 | 100 | 70 | -1.027119 | 1 | 1 | 3 | -0.430727 | -0.348756 |
| 2 | 100 | 70 | -1.427879 | 2 | 1 | 3 | -0.430727 | -0.348756 |
| 3 | 219 | 70 | 0.846744 | 1 | 3 | 3 | 0.430727 | -0.348756 |
| 4 | 219 | 70 | -0.232732 | 2 | 3 | 3 | 0.430727 | -0.348756 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 240 | 9966 | 70 | 0.259503 | 2 | 4 | 3 | 0.967422 | -0.348756 |
| 241 | 9966 | 70 | -0.295527 | 3 | 4 | 3 | 0.967422 | -0.348756 |
| 242 | 9966 | 70 | 1.635962 | 4 | 4 | 3 | 0.967422 | -0.348756 |
| 243 | 9994 | 70 | 0.672032 | 0 | 1 | 3 | -0.430727 | -0.348756 |
| 244 | 9994 | 70 | 0.114335 | 1 | 1 | 3 | -0.430727 | -0.348756 |
245 rows × 8 columns
df.loc[df['j'].to_numpy() == 70, :].reset_index(drop=False)
| | index | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 500 | 100 | 70 | -1.681804 | 0 | 1 | 3 | -0.430727 | -0.348756 |
| 1 | 501 | 100 | 70 | -1.027119 | 1 | 1 | 3 | -0.430727 | -0.348756 |
| 2 | 502 | 100 | 70 | -1.427879 | 2 | 1 | 3 | -0.430727 | -0.348756 |
| 3 | 1096 | 219 | 70 | 0.846744 | 1 | 3 | 3 | 0.430727 | -0.348756 |
| 4 | 1097 | 219 | 70 | -0.232732 | 2 | 3 | 3 | 0.430727 | -0.348756 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 240 | 49832 | 9966 | 70 | 0.259503 | 2 | 4 | 3 | 0.967422 | -0.348756 |
| 241 | 49833 | 9966 | 70 | -0.295527 | 3 | 4 | 3 | 0.967422 | -0.348756 |
| 242 | 49834 | 9966 | 70 | 1.635962 | 4 | 4 | 3 | 0.967422 | -0.348756 |
| 243 | 49970 | 9994 | 70 | 0.672032 | 0 | 1 | 3 | -0.430727 | -0.348756 |
| 244 | 49971 | 9994 | 70 | 0.114335 | 1 | 1 | 3 | -0.430727 | -0.348756 |
245 rows × 9 columns
6.3 Multi-index
It allows you to specify multiple columns as the index
Horrible, please don’t use it. I don’t know how to use it and have never encountered a task in Pandas that couldn’t be accomplished without it. (Of course, many people like it, see here for some benefits.)
6.4 Using indexing to avoid merges
A Series indexed by a key column can act as a lookup table via .loc, which often lets you avoid a merge (here k is temporarily shifted by 10 so it is clear that .loc matches on index labels rather than positions):
df['k'] = df['k'] + 10
k_to_psi = df.groupby(['k'])['psi'].first()
df['k'] = df['k'] - 10
k_to_psi
k
10 -1.335178
11 -0.908458
12 -0.604585
13 -0.348756
14 -0.114185
15 0.114185
16 0.348756
17 0.604585
18 0.908458
19 1.335178
Name: psi, dtype: float64
type(k_to_psi)
pandas.core.series.Series
k_to_psi.loc[10]
np.float64(-1.335177736118937)
k_to_psi.loc[df['k'] + 10]
k
16 0.348756
16 0.348756
15 0.114185
15 0.114185
15 0.114185
...
16 0.348756
13 -0.348756
14 -0.114185
13 -0.348756
13 -0.348756
Name: psi, Length: 50000, dtype: float64
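As a minimal sketch, this lookup can also be used to attach psi to every row as a new column without a merge (the column name psi_from_k is just illustrative):
# Look up psi for each row's (shifted) k and attach it as a new column;
# .to_numpy() drops the lookup's index so values align by row position
df.loc[:, 'psi_from_k'] = k_to_psi.loc[df['k'] + 10].to_numpy()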
7. Merging, concatenating, and appending columns
7.1 Merging
Use pd.merge(a, b, how='inner', on='col_name') (see here for more details)
pd.merge(df.loc[:, ['i', 't', 'j']], df.loc[:, ['i', 't', 'y']], how='inner', on=['i', 't'])
| | i | t | j | y |
|---|---|---|---|---|
| 0 | 0 | 0 | 122 | -0.588273 |
| 1 | 0 | 1 | 122 | -1.368881 |
| 2 | 0 | 2 | 107 | -1.915975 |
| 3 | 0 | 3 | 107 | 0.587144 |
| 4 | 0 | 4 | 107 | -0.398403 |
| ... | ... | ... | ... | ... |
| 49995 | 9999 | 0 | 121 | -0.564833 |
| 49996 | 9999 | 1 | 67 | -0.850231 |
| 49997 | 9999 | 2 | 96 | -0.199708 |
| 49998 | 9999 | 3 | 65 | -0.605951 |
| 49999 | 9999 | 4 | 65 | 0.365200 |
50000 rows × 4 columns
7.2 Concatenating
Using pd.concat([a, b, c, ...]) (see here for more details)
Specifying axis=0 concatenates rows
Specifying axis=1 concatenates columns
pd.concat([df.loc[: 99, :], df.loc[100: 999, :], df.loc[1000:, :]], axis=0)
| | i | j | y | t | l | k | alpha | psi |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 |
| 49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 |
| 49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 |
| 49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 |
| 49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 |
50000 rows × 8 columns
pd.concat([df.loc[:, ['i']], df.loc[:, ['j']], df.loc[:, ['y']]], axis=1)
| | i | j | y |
|---|---|---|---|
| 0 | 0 | 122 | -0.588273 |
| 1 | 0 | 122 | -1.368881 |
| 2 | 0 | 107 | -1.915975 |
| 3 | 0 | 107 | 0.587144 |
| 4 | 0 | 107 | -0.398403 |
| ... | ... | ... | ... |
| 49995 | 9999 | 121 | -0.564833 |
| 49996 | 9999 | 67 | -0.850231 |
| 49997 | 9999 | 96 | -0.199708 |
| 49998 | 9999 | 65 | -0.605951 |
| 49999 | 9999 | 65 | 0.365200 |
50000 rows × 3 columns
7.3 Appending columns
Just define a new column in the dataframe as if it were a dictionary
df.loc[:, 'n'] = rng.uniform(size=len(df))
df
| | i | j | y | t | l | k | alpha | psi | n |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.297840 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 | 0.619161 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 | 0.301757 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 | 0.174846 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 | 0.495505 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.156580 |
| 49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 | 0.782052 |
| 49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 | 0.678319 |
| 49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 | 0.004485 |
| 49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 | 0.416472 |
50000 rows × 9 columns
8. Groupby
First group observations into categories, then perform a particular operation within a given category
We will compute the average income per worker:
df.groupby('i')['y'].mean()
i
0 -0.736877
1 -0.388357
2 1.497247
3 -1.197781
4 1.056655
...
9995 0.484574
9996 -0.821621
9997 -0.911059
9998 2.225247
9999 -0.371104
Name: y, Length: 10000, dtype: float64
df.groupby('i')['y'].var()
i
0 0.919845
1 1.527680
2 1.418687
3 2.724300
4 0.642759
...
9995 1.498775
9996 0.319638
9997 0.156393
9998 1.845940
9999 0.223442
Name: y, Length: 10000, dtype: float64
If the function of interest depends only on the grouping column itself (e.g. counting the number of observations per group), you can call it directly on the groupby object
df.groupby('i').size()
i
0 5
1 5
2 5
3 5
4 5
..
9995 5
9996 5
9997 5
9998 5
9999 5
Length: 10000, dtype: int64
8.1 .transform
Expand the results of a groupby back to the size of the original dataframe
df.groupby('i')['y'].transform('mean')
0 -0.736877
1 -0.736877
2 -0.736877
3 -0.736877
4 -0.736877
...
49995 -0.371104
49996 -0.371104
49997 -0.371104
49998 -0.371104
49999 -0.371104
Name: y, Length: 50000, dtype: float64
Now we could do something like compute the ratio of each wage to the worker’s average wage
df['y'] / df.groupby('i')['y'].transform('mean')
0 0.798332
1 1.857677
2 2.600127
3 -0.796800
4 0.540664
...
49995 1.522032
49996 2.291081
49997 0.538146
49998 1.632831
49999 -0.984090
Name: y, Length: 50000, dtype: float64
8.2 .apply
Use a custom function with groupby
WARNING: .apply is notoriously slow
In this example, we will compute the second highest wage for each worker
def second_largest(col):
    return np.unique(col)[-2]
df.groupby('i')['y'].apply(second_largest)
i
0 -0.398403
1 0.164539
2 1.927676
3 -0.437750
4 1.534960
...
9995 1.352489
9996 -0.569682
9997 -0.883654
9998 3.068653
9999 -0.199708
Name: y, Length: 10000, dtype: float64
9. Long and wide formats
Usually use long format: each variable gets a single column, and each observation-period gets its own row
Wide format puts each observation in a single row, with a separate column for each period
Can convert between wide and long using pd.wide_to_long(), pd.melt(), pd.pivot(), etc.
This is a very confusing process; whenever I've used it, I've ultimately just played around until it gave the result I wanted
df
| | i | j | y | t | l | k | alpha | psi | n |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.297840 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 | 0.619161 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 | 0.301757 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 | 0.174846 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 | 0.495505 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.156580 |
| 49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 | 0.782052 |
| 49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 | 0.678319 |
| 49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 | 0.004485 |
| 49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 | 0.416472 |
50000 rows × 9 columns
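For example, a minimal sketch of pivoting this long panel into wide format and melting it back (this assumes each (i, t) pair identifies a unique row, which holds here):
# Long -> wide: one row per worker i, one column of y per period t
wide = df.pivot(index='i', columns='t', values='y')
# Wide -> long: stack the period columns back into rows
long_again = wide.reset_index().melt(id_vars='i', var_name='t', value_name='y')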
10. Miscellaneous functions and methods
10.1 .drop_duplicates
Drop duplicate observations
dup_df = pd.concat([df, df.loc[: 100, :]])
dup_df.drop_duplicates()
| | i | j | y | t | l | k | alpha | psi | n |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.297840 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 | 0.619161 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 | 0.301757 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 | 0.174846 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 | 0.495505 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.156580 |
| 49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 | 0.782052 |
| 49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 | 0.678319 |
| 49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 | 0.004485 |
| 49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 | 0.416472 |
50000 rows × 9 columns
10.2 .sort_values
Sort the values of the dataframe based on particular columns
df
| | i | j | y | t | l | k | alpha | psi | n |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.297840 |
| 1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 | 0.619161 |
| 2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 | 0.301757 |
| 3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 | 0.174846 |
| 4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 | 0.495505 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.156580 |
| 49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 | 0.782052 |
| 49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 | 0.678319 |
| 49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 | 0.004485 |
| 49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 | 0.416472 |
50000 rows × 9 columns
df.sort_values(['j', 'y'])
| | i | j | y | t | l | k | alpha | psi | n |
|---|---|---|---|---|---|---|---|---|---|
| 33263 | 6652 | 0 | -5.064867 | 3 | 0 | 0 | -0.967422 | -1.335178 | 0.985021 |
| 9649 | 1929 | 0 | -4.971061 | 4 | 0 | 0 | -0.967422 | -1.335178 | 0.800174 |
| 40830 | 8166 | 0 | -4.758191 | 0 | 0 | 0 | -0.967422 | -1.335178 | 0.898404 |
| 18141 | 3628 | 0 | -4.627821 | 1 | 0 | 0 | -0.967422 | -1.335178 | 0.311400 |
| 12029 | 2405 | 0 | -4.166904 | 4 | 0 | 0 | -0.967422 | -1.335178 | 0.698459 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33794 | 6758 | 198 | 4.063326 | 4 | 4 | 9 | 0.967422 | 1.335178 | 0.259600 |
| 21136 | 4227 | 198 | 4.117421 | 1 | 2 | 9 | 0.000000 | 1.335178 | 0.674148 |
| 48123 | 9624 | 198 | 4.356418 | 3 | 4 | 9 | 0.967422 | 1.335178 | 0.825097 |
| 29290 | 5858 | 198 | 4.447053 | 0 | 4 | 9 | 0.967422 | 1.335178 | 0.570896 |
| 2433 | 486 | 198 | 4.559297 | 3 | 4 | 9 | 0.967422 | 1.335178 | 0.797747 |
50000 rows × 9 columns
10.3 .unique
List the unique values in a particular column
WARNING: NumPy .unique returns a sorted array, whereas Pandas .unique returns an array ordered by appearance
df.loc[:, 'y'].unique()
array([-0.58827279, -1.36888053, -1.91597465, ..., -0.19970824,
-0.60595076, 0.3652001 ])
np.unique(df.loc[:, 'y'])
array([-5.39994589, -5.39103478, -5.31532195, ..., 5.24649836,
5.41834757, 6.03305639])
10.4 .value_counts
Count the number of occurrences of each unique value in a column
df.loc[:, 'i'].value_counts()
i
0 5
6670 5
6663 5
6664 5
6665 5
..
3333 5
3334 5
3335 5
3336 5
9999 5
Name: count, Length: 10000, dtype: int64