pandas I: pandas is well designed, actually!
Keywords: understanding datatypes in pandas including pandas.array, pandas.Series, pandas.DataFrame; nullable ints, pandas.Categorical; resampling, masking
| Presenter | James Powell james@dutc.io |
| Date | Wednesday, November 18, 2020 |
| Time | 3:30 PM EST |
print('Good afternoon!')
# GIVEN: a DataFrame that looks like this…
#    a    b   c   d   e     # a       - a label, drawn from "aa" ~ "zz"
#    zz   1  .2  .4  .1     # b       - drawn randomly from [0, 1, 2],
#    zz   0  .3  .2  .3     #           with a 50% chance of 1 and a 25% chance each of 0 or 2
#    yb   2  .1  .7  .6     # c, d, e - each drawn from a normal
#                           #           distribution with a mean of 0.5
#                           #           and a standard deviation of 2.5
from pandas import DataFrame
from numpy import arange
from numpy.random import choice, normal
from random import randint
from string import ascii_lowercase
from numpy import column_stack
SIZE = 1_000
df = DataFrame({
    # draw two random letters per row, then view each pair as one 2-character string
    'a': choice([*ascii_lowercase], size=(SIZE, 2)).view('<U2').ravel(),
    'b': choice(arange(3), p=[.25, .5, .25], size=SIZE),
    'c': normal(0.5, 2.5, size=SIZE),
    'd': normal(0.5, 2.5, size=SIZE),
    'e': normal(0.5, 2.5, size=SIZE),
})
print(df)
# TASK: convert a into a Categorical value
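One possible approach, sketched on a small stand-in frame (the six rows below are made up for illustration and are not the 1,000-row frame built above): `astype('category')` turns the column of strings into a Categorical, which stores one integer code per row plus a table of the distinct labels.

```python
from pandas import DataFrame

# Hypothetical stand-in data; the real exercise uses the random `df` above.
df = DataFrame({
    'a': ['zz', 'zz', 'yb', 'ab', 'yb', 'zz'],
    'b': [1, 0, 2, 1, 1, 2],
})

# Convert the label column in place to a Categorical dtype.
df['a'] = df['a'].astype('category')

print(df['a'].dtype)           # the column's dtype is now categorical
print(df['a'].cat.categories)  # the distinct labels, in sorted order
print(df['a'].cat.codes)       # one integer code per row
```

The `.cat` accessor only exists on categorical columns, so it doubles as a quick check that the conversion took effect.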
# TASK: print the number of times the various a-labels appear.
#       e.g., xx - 10 times; yy - 8 times; zz - 2 times
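A minimal sketch of the counting task, again on made-up stand-in rows: `Series.value_counts` tallies each distinct label, and the loop formats the result in the "label - n times" style of the task description.

```python
from pandas import DataFrame

# Hypothetical stand-in data; the real exercise uses the random `df` above.
df = DataFrame({'a': ['zz', 'zz', 'yb', 'ab', 'yb', 'zz']})

# value_counts returns a Series indexed by label, most frequent first.
counts = df['a'].value_counts()

for label, n in counts.items():
    print(f'{label} - {n} times')
```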
# TASK: (i)   print the average of column c and the variance of column d
#       (iia) print the average of column c and the variance of column d
#             for rows where a is after "gg"
#       (iib) print the average of column c and the variance of column d
#             for rows where c <= d < e
#       (iii) print the average of column c and the variance of column d…
#             … PER value of a
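The four parts can be sketched with boolean masks and one `groupby`, shown here on made-up stand-in rows. Two assumptions worth flagging: `a` is still a plain string column (an unordered Categorical rejects `>`), and "after" is read as lexicographic string order. Note also that `var` in pandas defaults to the sample variance (`ddof=1`).

```python
from pandas import DataFrame

# Hypothetical stand-in data; the real exercise uses the random `df` above.
df = DataFrame({
    'a': ['zz', 'zz', 'yb', 'ab', 'yb', 'zz'],
    'b': [1, 0, 2, 1, 1, 2],
    'c': [.2, .3, .1, .5, .9, .4],
    'd': [.4, .2, .7, .6, .3, .8],
    'e': [.1, .3, .6, .9, .5, .9],
})

# (i) whole-column statistics
print(df['c'].mean(), df['d'].var())

# (iia) mask: labels that sort after "gg"
after_gg = df['a'] > 'gg'
print(df.loc[after_gg, 'c'].mean(), df.loc[after_gg, 'd'].var())

# (iib) mask: c <= d < e; a chained comparison does not work on Series,
# so the two halves are combined with &
ordered = (df['c'] <= df['d']) & (df['d'] < df['e'])
print(df.loc[ordered, 'c'].mean(), df.loc[ordered, 'd'].var())

# (iii) the same two statistics, per value of a
per_a = df.groupby('a').agg({'c': 'mean', 'd': 'var'})
print(per_a)
```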
# TASK: group the above by the `a` value WITHOUT making it the index
#       and WITHOUT sorting it
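A sketch of one way to do this on made-up stand-in rows: `groupby` accepts `as_index=False`, which keeps `a` as an ordinary column, and `sort=False`, which leaves the groups in order of first appearance.

```python
from pandas import DataFrame

# Hypothetical stand-in data; the real exercise uses the random `df` above.
df = DataFrame({
    'a': ['zz', 'zz', 'yb', 'ab', 'yb', 'zz'],
    'c': [.2, .3, .1, .5, .9, .4],
    'd': [.4, .2, .7, .6, .3, .8],
})

# `a` stays a regular column and groups keep their first-seen order.
out = df.groupby('a', as_index=False, sort=False).agg({'c': 'mean', 'd': 'var'})
print(out)
```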
# TASK: compute a pivot table whose row index is the values of `a`,
#       tabulated against the values of `b` as columns, where the entries
#       are the means of the difference of `c` and `d`
#            0     1     2
#       aa
#       ab
#       ac
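One way to sketch the pivot, on made-up stand-in rows: compute the difference as a temporary column, then let `pivot_table` average it per `(a, b)` cell. The column name `c_minus_d` is a placeholder chosen for this example.

```python
from pandas import DataFrame

# Hypothetical stand-in data; the real exercise uses the random `df` above.
df = DataFrame({
    'a': ['zz', 'zz', 'yb', 'ab', 'yb', 'zz'],
    'b': [1, 0, 2, 1, 1, 2],
    'c': [.2, .3, .1, .5, .9, .4],
    'd': [.4, .2, .7, .6, .3, .8],
})

# Rows are the `a` labels, columns are the `b` values,
# and each cell holds the mean of c - d for that combination.
pt = (
    df.assign(c_minus_d=df['c'] - df['d'])
      .pivot_table(index='a', columns='b', values='c_minus_d', aggfunc='mean')
)
print(pt)
```

Combinations of `a` and `b` that never occur in the data show up as NaN.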
# TASK: group by `a` and `b` and show the sum of the `c` values only
#       pivot by `a` and `b` and show the sum of the `c` values
#       make these two DataFrames look equivalent
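A possible way to line the two results up, sketched on made-up stand-in rows: the grouped sum is a long Series indexed by `(a, b)`, while the pivot is a wide frame with `b` across the columns, so unstacking the `b` level of the grouped result gives it the pivot's shape.

```python
from pandas import DataFrame

# Hypothetical stand-in data; the real exercise uses the random `df` above.
df = DataFrame({
    'a': ['zz', 'zz', 'yb', 'ab', 'yb', 'zz'],
    'b': [1, 0, 2, 1, 1, 2],
    'c': [.2, .3, .1, .5, .9, .4],
})

# Long form: one sum per (a, b) pair, indexed by both keys.
grouped = df.groupby(['a', 'b'])['c'].sum()

# Wide form: `a` down the rows, `b` across the columns.
pivoted = df.pivot_table(index='a', columns='b', values='c', aggfunc='sum')

# Move the `b` level of the grouped result into the columns.
wide = grouped.unstack('b')

print(wide)
print(pivoted)
print(wide.equals(pivoted))
```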
# TASK: investigate .groupby.apply and .groupby.transform
# TASK: group by `a` and apply a transformation to compute the mean
#       of each of `c`, `d`, and `e` and the sum of `b`
#       use `groupby(…).transform`
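A sketch of the `transform` version on made-up stand-in rows. The point to notice is that `transform` broadcasts each group's statistic back onto every row of that group, so the result keeps the shape and index of the original frame.

```python
from pandas import DataFrame

# Hypothetical stand-in data; the real exercise uses the random `df` above.
df = DataFrame({
    'a': ['zz', 'zz', 'yb', 'ab', 'yb', 'zz'],
    'b': [1, 0, 2, 1, 1, 2],
    'c': [.2, .3, .1, .5, .9, .4],
    'd': [.4, .2, .7, .6, .3, .8],
    'e': [.1, .3, .6, .9, .5, .9],
})

groups = df.groupby('a')

# Each row receives the statistic of the group it belongs to.
out = df.assign(
    b=groups['b'].transform('sum'),
    c=groups['c'].transform('mean'),
    d=groups['d'].transform('mean'),
    e=groups['e'].transform('mean'),
)
print(out)
```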
# TASK: try to repeat the above using `.groupby(…).apply`
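For contrast, a sketch of the `apply` version on the same made-up stand-in rows. Here the helper `summarize` (a name invented for this example) receives each group as a DataFrame and returns one Series, so the result has one row per `a` label rather than one row per input row. The value columns are selected explicitly so the helper never sees the grouping column.

```python
from pandas import DataFrame, Series

# Hypothetical stand-in data; the real exercise uses the random `df` above.
df = DataFrame({
    'a': ['zz', 'zz', 'yb', 'ab', 'yb', 'zz'],
    'b': [1, 0, 2, 1, 1, 2],
    'c': [.2, .3, .1, .5, .9, .4],
    'd': [.4, .2, .7, .6, .3, .8],
    'e': [.1, .3, .6, .9, .5, .9],
})

def summarize(group):
    # One summary row per group: sum of b, means of c, d, and e.
    return Series({
        'b': group['b'].sum(),
        'c': group['c'].mean(),
        'd': group['d'].mean(),
        'e': group['e'].mean(),
    })

out = df.groupby('a')[['b', 'c', 'd', 'e']].apply(summarize)
print(out)
```

`apply` calls the Python function once per group, which is flexible but typically slower than the named aggregations used with `transform` or `agg`.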