import numpy as np
import pandas as pd
Lesson 6
Overview
- Pandas
- Initializing a dataframe
- From NumPy arrays
- From a file
- Saving a dataframe
- Filetype options
pd.DataFrame
versuspd.Series
- Converting a Pandas series into a NumPy array
- Accessing the list of a dataframe’s columns
- Series datatypes
- Working with strings
- Indexing
- Accessing elements with
.loc
and.iloc
.reset_index
- Multi-index
- Using indexing to avoid merges
- Merging, concatenating, and appending columns
- Merging
- Concatenating
- Appending columns
- Groupby
.transform
.apply
- Long and wide formats
- Miscellaneous functions and methods
.drop_duplicates
.sort_values
.unique
.value_counts
Pandas
Dataframes in Python
1. Initializing a dataframe
1.1 From NumPy arrays
Initialize an instance of the pd.DataFrame
class with a dictionary linking column names to associated NumPy arrays
= np.random.default_rng(1234)
rng = np.arange(100)
i = rng.uniform(size=100)
y = pd.DataFrame({'i': i, 'y': y}) df
df
i | y | |
---|---|---|
0 | 0 | 0.976700 |
1 | 1 | 0.380196 |
2 | 2 | 0.923246 |
3 | 3 | 0.261692 |
4 | 4 | 0.319097 |
... | ... | ... |
95 | 95 | 0.873824 |
96 | 96 | 0.350609 |
97 | 97 | 0.823521 |
98 | 98 | 0.947199 |
99 | 99 | 0.067041 |
100 rows × 2 columns
1.2 From a file
Use something like pd.read_csv()
, pd.read_stata()
, pd.read_excel()
, pd.read_feather()
, etc.
Source for sample data: here
= pd.read_csv('/Users/adamoppenheimer/Library/CloudStorage/GoogleDrive-oppen040@umn.edu/My Drive/Econ Crash Course 2024/Data/twoway_sample_data.csv') df
df
i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 |
49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 |
49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 |
49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 |
49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 |
50000 rows × 8 columns
2. Saving a dataframe
Use df.to_csv()
, df.to_stata()
, df.to_excel()
, df.to_feather()
, etc.
2.1 Filetype options
feather is recommended - see, for example, here
3. pd.DataFrame
versus pd.Series
pd.DataFrame
stores a list of pd.Series
, pd.Series
is a single column. They each have their own class methods.
type(df)
pandas.core.frame.DataFrame
type(df.loc[:, 'i'])
pandas.core.series.Series
We can access a column but treat it as a DataFrame by wrapping the column name in brackets (['col_name']
)
type(df.loc[:, ['i']])
pandas.core.frame.DataFrame
3.1 Converting a Pandas series into a NumPy array
Use .to_numpy()
'i'].to_numpy() df.loc[:,
array([ 0, 0, 0, ..., 9999, 9999, 9999])
type(df.loc[:, 'i'].to_numpy())
numpy.ndarray
3.2 Accessing the list of a dataframe’s columns
Use df.columns
df.columns
Index(['i', 'j', 'y', 't', 'l', 'k', 'alpha', 'psi'], dtype='object')
df.shape
(50000, 8)
4. Series datatypes
Same as NumPy datatypes, e.g. int
, float
, str
, etc.
Convert a column to a new type by using col.astype(dtype)
'i'].dtype df.loc[:,
dtype('int64')
'i'].astype(float) df.loc[:,
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
...
49995 9999.0
49996 9999.0
49997 9999.0
49998 9999.0
49999 9999.0
Name: i, Length: 50000, dtype: float64
'i'].astype(float).dtype df.loc[:,
dtype('float64')
5. Working with strings
Use .str.function()
on a Series
'i'].astype(str).str.strip('9') df.loc[:,
0 0
1 0
2 0
3 0
4 0
..
49995
49996
49997
49998
49999
Name: i, Length: 50000, dtype: object
6. Indexing
Unnecessarily complicated - each dataframe has its own index, which isn’t necessarily a consecutive series of integers, or even numeric (e.g. the index could be strings)
6.1 Accessing elements with .loc
and .iloc
.loc
indexes using the particular index assigned to the given dataframe
.iloc
indexes by integer location
df
i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 |
49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 |
49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 |
49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 |
49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 |
50000 rows × 8 columns
'j'].to_numpy() == 70 df[
array([False, False, False, ..., False, False, False])
'j'].to_numpy() == 70, :] df.loc[df[
i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|
500 | 100 | 70 | -1.681804 | 0 | 1 | 3 | -0.430727 | -0.348756 |
501 | 100 | 70 | -1.027119 | 1 | 1 | 3 | -0.430727 | -0.348756 |
502 | 100 | 70 | -1.427879 | 2 | 1 | 3 | -0.430727 | -0.348756 |
1096 | 219 | 70 | 0.846744 | 1 | 3 | 3 | 0.430727 | -0.348756 |
1097 | 219 | 70 | -0.232732 | 2 | 3 | 3 | 0.430727 | -0.348756 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
49832 | 9966 | 70 | 0.259503 | 2 | 4 | 3 | 0.967422 | -0.348756 |
49833 | 9966 | 70 | -0.295527 | 3 | 4 | 3 | 0.967422 | -0.348756 |
49834 | 9966 | 70 | 1.635962 | 4 | 4 | 3 | 0.967422 | -0.348756 |
49970 | 9994 | 70 | 0.672032 | 0 | 1 | 3 | -0.430727 | -0.348756 |
49971 | 9994 | 70 | 0.114335 | 1 | 1 | 3 | -0.430727 | -0.348756 |
245 rows × 8 columns
10, :] df.loc[:
i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.000000 | 0.348756 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.000000 | 0.348756 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.000000 | 0.114185 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.000000 | 0.114185 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.000000 | 0.114185 |
5 | 1 | 132 | 0.164539 | 0 | 0 | 6 | -0.967422 | 0.348756 |
6 | 1 | 132 | 0.072192 | 1 | 0 | 6 | -0.967422 | 0.348756 |
7 | 1 | 132 | 0.589733 | 2 | 0 | 6 | -0.967422 | 0.348756 |
8 | 1 | 28 | -2.535896 | 3 | 0 | 1 | -0.967422 | -0.908458 |
9 | 1 | 28 | -0.232353 | 4 | 0 | 1 | -0.967422 | -0.908458 |
10 | 2 | 174 | -0.004053 | 0 | 3 | 8 | 0.430727 | 0.908458 |
10, :].loc[:5, :] df.loc[:
i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.000000 | 0.348756 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.000000 | 0.348756 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.000000 | 0.114185 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.000000 | 0.114185 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.000000 | 0.114185 |
5 | 1 | 132 | 0.164539 | 0 | 0 | 6 | -0.967422 | 0.348756 |
'j'].to_numpy() == 70, :].loc[:5, :] df.loc[df[
i | j | y | t | l | k | alpha | psi |
---|
'j'].to_numpy() == 70, :].iloc[:5, :] df.loc[df[
i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|
500 | 100 | 70 | -1.681804 | 0 | 1 | 3 | -0.430727 | -0.348756 |
501 | 100 | 70 | -1.027119 | 1 | 1 | 3 | -0.430727 | -0.348756 |
502 | 100 | 70 | -1.427879 | 2 | 1 | 3 | -0.430727 | -0.348756 |
1096 | 219 | 70 | 0.846744 | 1 | 3 | 3 | 0.430727 | -0.348756 |
1097 | 219 | 70 | -0.232732 | 2 | 3 | 3 | 0.430727 | -0.348756 |
6.2 .reset_index
This allows you to reset the index to be a consecutive series of integers
Specify the option drop=False
so that the current index becomes a new column in the given dataframe
'j'].to_numpy() == 70, :].reset_index(drop=True) df.loc[df[
i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|
0 | 100 | 70 | -1.681804 | 0 | 1 | 3 | -0.430727 | -0.348756 |
1 | 100 | 70 | -1.027119 | 1 | 1 | 3 | -0.430727 | -0.348756 |
2 | 100 | 70 | -1.427879 | 2 | 1 | 3 | -0.430727 | -0.348756 |
3 | 219 | 70 | 0.846744 | 1 | 3 | 3 | 0.430727 | -0.348756 |
4 | 219 | 70 | -0.232732 | 2 | 3 | 3 | 0.430727 | -0.348756 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
240 | 9966 | 70 | 0.259503 | 2 | 4 | 3 | 0.967422 | -0.348756 |
241 | 9966 | 70 | -0.295527 | 3 | 4 | 3 | 0.967422 | -0.348756 |
242 | 9966 | 70 | 1.635962 | 4 | 4 | 3 | 0.967422 | -0.348756 |
243 | 9994 | 70 | 0.672032 | 0 | 1 | 3 | -0.430727 | -0.348756 |
244 | 9994 | 70 | 0.114335 | 1 | 1 | 3 | -0.430727 | -0.348756 |
245 rows × 8 columns
'j'].to_numpy() == 70, :].reset_index(drop=False) df.loc[df[
index | i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|---|
0 | 500 | 100 | 70 | -1.681804 | 0 | 1 | 3 | -0.430727 | -0.348756 |
1 | 501 | 100 | 70 | -1.027119 | 1 | 1 | 3 | -0.430727 | -0.348756 |
2 | 502 | 100 | 70 | -1.427879 | 2 | 1 | 3 | -0.430727 | -0.348756 |
3 | 1096 | 219 | 70 | 0.846744 | 1 | 3 | 3 | 0.430727 | -0.348756 |
4 | 1097 | 219 | 70 | -0.232732 | 2 | 3 | 3 | 0.430727 | -0.348756 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
240 | 49832 | 9966 | 70 | 0.259503 | 2 | 4 | 3 | 0.967422 | -0.348756 |
241 | 49833 | 9966 | 70 | -0.295527 | 3 | 4 | 3 | 0.967422 | -0.348756 |
242 | 49834 | 9966 | 70 | 1.635962 | 4 | 4 | 3 | 0.967422 | -0.348756 |
243 | 49970 | 9994 | 70 | 0.672032 | 0 | 1 | 3 | -0.430727 | -0.348756 |
244 | 49971 | 9994 | 70 | 0.114335 | 1 | 1 | 3 | -0.430727 | -0.348756 |
245 rows × 9 columns
6.3 Multi-index
It allows you to specify multiple columns as the index
Horrible, please don’t use it. I don’t know how to use it and have never encountered a task in Pandas that couldn’t be accomplished without it. (Of course, many people like it, see here for some benefits.)
6.4 Using indexing to avoid merges
'k'] = df['k'] + 10
df[= df.groupby(['k'])['psi'].first()
k_to_psi 'k'] = df['k'] - 10 df[
k_to_psi
k
10 -1.335178
11 -0.908458
12 -0.604585
13 -0.348756
14 -0.114185
15 0.114185
16 0.348756
17 0.604585
18 0.908458
19 1.335178
Name: psi, dtype: float64
type(k_to_psi)
pandas.core.series.Series
10] k_to_psi.loc[
np.float64(-1.335177736118937)
'k'] + 10] k_to_psi.loc[df[
k
16 0.348756
16 0.348756
15 0.114185
15 0.114185
15 0.114185
...
16 0.348756
13 -0.348756
14 -0.114185
13 -0.348756
13 -0.348756
Name: psi, Length: 50000, dtype: float64
7. Merging, concatenating, and appending columns
7.1 Merging
Use pd.merge(a, b, how='inner', on='col_name')
(see here for more details)
'i', 't', 'j']], df.loc[:, ['i', 't', 'y']], how='inner', on=['i', 't']) pd.merge(df.loc[:, [
i | t | j | y | |
---|---|---|---|---|
0 | 0 | 0 | 122 | -0.588273 |
1 | 0 | 1 | 122 | -1.368881 |
2 | 0 | 2 | 107 | -1.915975 |
3 | 0 | 3 | 107 | 0.587144 |
4 | 0 | 4 | 107 | -0.398403 |
... | ... | ... | ... | ... |
49995 | 9999 | 0 | 121 | -0.564833 |
49996 | 9999 | 1 | 67 | -0.850231 |
49997 | 9999 | 2 | 96 | -0.199708 |
49998 | 9999 | 3 | 65 | -0.605951 |
49999 | 9999 | 4 | 65 | 0.365200 |
50000 rows × 4 columns
7.2 Concatenating
Using pd.concat([a, b, c, ...])
(see here for more details)
Specifying axis=0
concatenates rows
Specifying axis=1
concatenates columns
99, :], df.loc[100: 999, :], df.loc[1000:, :]], axis=0) pd.concat([df.loc[:
i | j | y | t | l | k | alpha | psi | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 |
49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 |
49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 |
49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 |
49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 |
50000 rows × 8 columns
'i']], df.loc[:, ['j']], df.loc[:, ['y']]], axis=1) pd.concat([df.loc[:, [
i | j | y | |
---|---|---|---|
0 | 0 | 122 | -0.588273 |
1 | 0 | 122 | -1.368881 |
2 | 0 | 107 | -1.915975 |
3 | 0 | 107 | 0.587144 |
4 | 0 | 107 | -0.398403 |
... | ... | ... | ... |
49995 | 9999 | 121 | -0.564833 |
49996 | 9999 | 67 | -0.850231 |
49997 | 9999 | 96 | -0.199708 |
49998 | 9999 | 65 | -0.605951 |
49999 | 9999 | 65 | 0.365200 |
50000 rows × 3 columns
7.3 Appending columns
Just define a new column in the dataframe as if it is a dictionary
'n'] = rng.uniform(size=len(df)) df.loc[:,
df
i | j | y | t | l | k | alpha | psi | n | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.297840 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 | 0.619161 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 | 0.301757 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 | 0.174846 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 | 0.495505 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.156580 |
49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 | 0.782052 |
49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 | 0.678319 |
49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 | 0.004485 |
49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 | 0.416472 |
50000 rows × 9 columns
8. Groupby
First group observations into categories, then perform a particular operation within a given category
We will compute the average income per worker:
'i')['y'].mean() df.groupby(
i
0 -0.736877
1 -0.388357
2 1.497247
3 -1.197781
4 1.056655
...
9995 0.484574
9996 -0.821621
9997 -0.911059
9998 2.225247
9999 -0.371104
Name: y, Length: 10000, dtype: float64
'i')['y'].var() df.groupby(
i
0 0.919845
1 1.527680
2 1.418687
3 2.724300
4 0.642759
...
9995 1.498775
9996 0.319638
9997 0.156393
9998 1.845940
9999 0.223442
Name: y, Length: 10000, dtype: float64
If the function of interest depends only on the column you are grouping by, you can call the function directly on the groupby object
'i').size() df.groupby(
i
0 5
1 5
2 5
3 5
4 5
..
9995 5
9996 5
9997 5
9998 5
9999 5
Length: 10000, dtype: int64
8.1 .transform
Expand the results of a groupby back to the size of the original dataframe
'i')['y'].transform('mean') df.groupby(
0 -0.736877
1 -0.736877
2 -0.736877
3 -0.736877
4 -0.736877
...
49995 -0.371104
49996 -0.371104
49997 -0.371104
49998 -0.371104
49999 -0.371104
Name: y, Length: 50000, dtype: float64
Now we could do something like compute the ratio of each wage to the worker’s average wage
'y'] / df.groupby('i')['y'].transform('mean') df[
0 0.798332
1 1.857677
2 2.600127
3 -0.796800
4 0.540664
...
49995 1.522032
49996 2.291081
49997 0.538146
49998 1.632831
49999 -0.984090
Name: y, Length: 50000, dtype: float64
8.2 .apply
Use a custom function with groupby
WARNING: .apply
is notoriously slow
In this example, we will compute the second highest wage for each worker
def second_largest(col):
return np.unique(col)[-2]
'i')['y'].apply(second_largest) df.groupby(
i
0 -0.398403
1 0.164539
2 1.927676
3 -0.437750
4 1.534960
...
9995 1.352489
9996 -0.569682
9997 -0.883654
9998 3.068653
9999 -0.199708
Name: y, Length: 10000, dtype: float64
9. Long and wide formats
Usually use long, it stores a single column for a variable then within an observation and over time, it adds new rows
Wide takes a single observation as a single row, and puts each year in a new column
Can convert between wide and long using pd.wide_to_long()
, pd.melt()
, pd.pivot()
, etc.
This is a very confusing process and ultimately whenever I’ve used it I just play around until it gives the result I want
df
i | j | y | t | l | k | alpha | psi | n | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.297840 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 | 0.619161 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 | 0.301757 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 | 0.174846 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 | 0.495505 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.156580 |
49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 | 0.782052 |
49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 | 0.678319 |
49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 | 0.004485 |
49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 | 0.416472 |
50000 rows × 9 columns
10. Miscellaneous functions and methods
10.1 .drop_duplicates
Drop duplicate observations
= pd.concat([df, df.loc[: 100, :]]) dup_df
dup_df.drop_duplicates()
i | j | y | t | l | k | alpha | psi | n | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.297840 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 | 0.619161 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 | 0.301757 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 | 0.174846 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 | 0.495505 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.156580 |
49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 | 0.782052 |
49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 | 0.678319 |
49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 | 0.004485 |
49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 | 0.416472 |
50000 rows × 9 columns
10.2 .sort_values
Sort the values of the dataframe based on particular columns
df
i | j | y | t | l | k | alpha | psi | n | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 122 | -0.588273 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.297840 |
1 | 0 | 122 | -1.368881 | 1 | 2 | 6 | 0.0 | 0.348756 | 0.619161 |
2 | 0 | 107 | -1.915975 | 2 | 2 | 5 | 0.0 | 0.114185 | 0.301757 |
3 | 0 | 107 | 0.587144 | 3 | 2 | 5 | 0.0 | 0.114185 | 0.174846 |
4 | 0 | 107 | -0.398403 | 4 | 2 | 5 | 0.0 | 0.114185 | 0.495505 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 9999 | 121 | -0.564833 | 0 | 2 | 6 | 0.0 | 0.348756 | 0.156580 |
49996 | 9999 | 67 | -0.850231 | 1 | 2 | 3 | 0.0 | -0.348756 | 0.782052 |
49997 | 9999 | 96 | -0.199708 | 2 | 2 | 4 | 0.0 | -0.114185 | 0.678319 |
49998 | 9999 | 65 | -0.605951 | 3 | 2 | 3 | 0.0 | -0.348756 | 0.004485 |
49999 | 9999 | 65 | 0.365200 | 4 | 2 | 3 | 0.0 | -0.348756 | 0.416472 |
50000 rows × 9 columns
'j', 'y']) df.sort_values([
i | j | y | t | l | k | alpha | psi | n | |
---|---|---|---|---|---|---|---|---|---|
33263 | 6652 | 0 | -5.064867 | 3 | 0 | 0 | -0.967422 | -1.335178 | 0.985021 |
9649 | 1929 | 0 | -4.971061 | 4 | 0 | 0 | -0.967422 | -1.335178 | 0.800174 |
40830 | 8166 | 0 | -4.758191 | 0 | 0 | 0 | -0.967422 | -1.335178 | 0.898404 |
18141 | 3628 | 0 | -4.627821 | 1 | 0 | 0 | -0.967422 | -1.335178 | 0.311400 |
12029 | 2405 | 0 | -4.166904 | 4 | 0 | 0 | -0.967422 | -1.335178 | 0.698459 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33794 | 6758 | 198 | 4.063326 | 4 | 4 | 9 | 0.967422 | 1.335178 | 0.259600 |
21136 | 4227 | 198 | 4.117421 | 1 | 2 | 9 | 0.000000 | 1.335178 | 0.674148 |
48123 | 9624 | 198 | 4.356418 | 3 | 4 | 9 | 0.967422 | 1.335178 | 0.825097 |
29290 | 5858 | 198 | 4.447053 | 0 | 4 | 9 | 0.967422 | 1.335178 | 0.570896 |
2433 | 486 | 198 | 4.559297 | 3 | 4 | 9 | 0.967422 | 1.335178 | 0.797747 |
50000 rows × 9 columns
10.3 .unique
List the unique values in a particular column
WARNING: NumPy .unique
returns a sorted array, whereas Pandas .unique
returns an array ordered by appearance
'y'].unique() df.loc[:,
array([-0.58827279, -1.36888053, -1.91597465, ..., -0.19970824,
-0.60595076, 0.3652001 ])
'y']) np.unique(df.loc[:,
array([-5.39994589, -5.39103478, -5.31532195, ..., 5.24649836,
5.41834757, 6.03305639])
10.4 .value_counts
Count the number of occurrences of each unique value in a column
'i'].value_counts() df.loc[:,
i
0 5
6670 5
6663 5
6664 5
6665 5
..
3333 5
3334 5
3335 5
3336 5
9999 5
Name: count, Length: 10000, dtype: int64