ts-python

applied pandas I: pandas is well designed, actually!

Discussion (Wed Nov 18, 2020; 3:30 PM EST)

Keywords: understanding datatypes in pandas including pandas.array, pandas.Series, pandas.DataFrame; nullable ints, pandas.Categorical; resampling, masking

Presenter: James Powell (james@dutc.io)
Date: Wednesday, November 18, 2020
Time: 3:30 PM EST
print('Good afternoon!')
# GIVEN: a DataFrame that looks like this…

#     a   b  c  d  e  # a - a two-letter label, drawn from "aa" ~ "zz"
#     zz  1 .2 .4 .1  # b - drawn randomly from [0, 1, 2], with a 50% chance
#     zz  0 .3 .2 .3  #     of a 1 and a 25% chance each of a 0 or a 2
#     yb  2 .1 .7 .6  # c, d, e - each drawn from a normal
#                     #           distribution with a mean of 0.5
#                     #           and a std of 2.5

from pandas import DataFrame
from numpy import arange
from numpy.random import choice, normal
from string import ascii_lowercase

SIZE = 1_000
df = DataFrame({
    # draw SIZE pairs of single letters, then reinterpret each pair as one 2-character label
    'a': choice([*ascii_lowercase], size=(SIZE, 2)).view('<U2').ravel(),
    'b': choice(arange(3), p=[.25, .5, .25], size=SIZE),
    'c': normal(0.5, 2.5, size=SIZE),
    'd': normal(0.5, 2.5, size=SIZE),
    'e': normal(0.5, 2.5, size=SIZE),
})
print(df)
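
The `.view('<U2').ravel()` step used to build column `a` can look opaque. The following is a minimal sketch of what it does, using a tiny hand-written array in place of the random draw (the names `pairs` and `labels` are illustrative only).

```python
from numpy import array

# two rows of single-character strings (dtype '<U1'), shaped like the output of `choice`
pairs = array([['z', 'z'], ['y', 'b']])

# reinterpret each row's two 1-character cells as a single 2-character string;
# this relies on the array being C-contiguous, so each pair is adjacent in memory
labels = pairs.view('<U2').ravel()
print(labels)
```

No characters are copied or joined here: the same bytes are simply read with a wider string dtype.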

# TASK: convert a into a Categorical value
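
One possible approach, sketched on a small hand-written frame rather than the random one above: `astype('category')` infers the categories from the distinct labels and stores each row as an integer code.

```python
from pandas import DataFrame

# small stand-in for the seminar DataFrame (illustrative data only)
df = DataFrame({'a': ['zz', 'zz', 'yb', 'aa'], 'b': [1, 0, 2, 1]})

# the categories are inferred from the distinct labels, in sorted order
df['a'] = df['a'].astype('category')

print(df['a'].dtype)
print(df['a'].cat.categories)      # the distinct labels
print(df['a'].cat.codes.tolist())  # per-row integer codes into the categories
```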

# TASK: print the number of times the various a-labels appear.
#       e.g., xx - 10 times; yy - 8 times; zz - 2 times
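
A minimal sketch of one way to do this, again on illustrative data: `Series.value_counts` tallies each label and returns the counts in descending order.

```python
from pandas import DataFrame

# illustrative data only
df = DataFrame({'a': ['zz', 'zz', 'yb', 'aa', 'zz', 'yb']})

# one entry per distinct label, most frequent first
counts = df['a'].value_counts()

for label, n in counts.items():
    print(f'{label} - {n} times')
```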

# TASK: (i) print the average of column c and the variance of column d
#       (iia) print the average of column c and the variance of column d
#             for rows where a is after "gg"
#       (iib) print the average of column c and the variance of column d
#             for rows where c <= d < e
#       (iii) print the average of column c and the variance of column d…
#             … PER value of a
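
The four parts above might be sketched as follows, assuming `a` still holds plain strings. If `a` has already been converted to an unordered Categorical, the `>` comparison raises a `TypeError`, so part (iia) would need the original strings or an ordered categorical. The data and the names `after_gg`, `ordered`, and `per_a` are illustrative.

```python
from pandas import DataFrame

# illustrative data only
df = DataFrame({
    'a': ['aa', 'aa', 'zz', 'zz'],
    'c': [1.0, 2.0, 3.0, 5.0],
    'd': [2.0, 4.0, 1.0, 6.0],
    'e': [3.0, 3.0, 0.0, 7.0],
})

# (i) whole-column statistics; .var() defaults to the sample variance (ddof=1)
print(df['c'].mean(), df['d'].var())

# (iia) string comparison is lexicographic, so "after gg" is a boolean mask
after_gg = df['a'] > 'gg'
print(df.loc[after_gg, 'c'].mean(), df.loc[after_gg, 'd'].var())

# (iib) `c <= d < e` cannot be chained on Series; combine two masks with &
ordered = (df['c'] <= df['d']) & (df['d'] < df['e'])
print(df.loc[ordered, 'c'].mean(), df.loc[ordered, 'd'].var())

# (iii) named aggregation yields one row per value of `a`
per_a = df.groupby('a').agg(c_mean=('c', 'mean'), d_var=('d', 'var'))
print(per_a)
```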

# TASK: group the above by the `a` value WITHOUT making it the index
#       and WITHOUT sorting it
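
A sketch of one option: `groupby` accepts `as_index=False`, which keeps the group key as an ordinary column, and `sort=False`, which keeps groups in order of first appearance. The data is illustrative.

```python
from pandas import DataFrame

# illustrative data only; note 'zz' appears before 'aa'
df = DataFrame({
    'a': ['zz', 'aa', 'zz', 'aa'],
    'c': [1.0, 2.0, 3.0, 4.0],
    'd': [2.0, 4.0, 1.0, 6.0],
})

out = (
    df.groupby('a', as_index=False, sort=False)
      .agg(c_mean=('c', 'mean'), d_var=('d', 'var'))
)
print(out)
```

An alternative is to group normally and call `.reset_index()` afterwards, though that does not by itself preserve the original order.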

# TASK: compute a pivot table with the values of `a` as the row index and
#       the values of `b` as the columns, where each entry is the mean of
#       the difference between `c` and `d`

#      col     col     col
# a
# b
# c
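
One way to sketch the pivot, on illustrative data: compute the difference as a temporary column (named `c_minus_d` here, a hypothetical name), then hand it to `pivot_table`. Combinations of `a` and `b` that never occur come out as NaN.

```python
from pandas import DataFrame

# illustrative data only
df = DataFrame({
    'a': ['aa', 'aa', 'aa', 'zz', 'zz'],
    'b': [0, 0, 1, 1, 2],
    'c': [1.0, 3.0, 5.0, 2.0, 4.0],
    'd': [0.5, 0.5, 1.0, 4.0, 1.0],
})

table = (
    df.assign(c_minus_d=df['c'] - df['d'])  # temporary column, original df untouched
      .pivot_table(index='a', columns='b', values='c_minus_d', aggfunc='mean')
)
print(table)
```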

# TASK: group by `a` and `b` and show the sum of the `c` values only
#       pivot by `a` and `b` and show the sum of the `c` values
#       make these two DataFrames look equivalent
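
A hedged sketch of how the two results relate, on illustrative data: the grouped sum is a long Series with an `(a, b)` MultiIndex, the pivot is a wide table, and `unstack` moves the `b` level into the columns so the two line up.

```python
from pandas import DataFrame

# illustrative data only
df = DataFrame({
    'a': ['aa', 'aa', 'aa', 'zz', 'zz'],
    'b': [0, 0, 1, 1, 2],
    'c': [1.0, 3.0, 5.0, 2.0, 4.0],
})

# long form: one row per observed (a, b) pair
grouped = df.groupby(['a', 'b'])['c'].sum()

# wide form: `a` down the side, `b` across the top
pivoted = df.pivot_table(index='a', columns='b', values='c', aggfunc='sum')

# move the inner index level (b) into the columns to match the pivot layout
wide = grouped.unstack('b')
print(wide)
print(pivoted)
```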

# TASK: investigate .groupby.apply and .groupby.transform
# TASK: group by `a` and apply a transformation to compute the mean
#       of each of `c`, `d`, and `e` and the sum of `b`
#       use `groupby(…).transform`
# TASK: try to repeat the above using `.groupby(…).apply`
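
The contrast between the two might be sketched like this, on illustrative data. `transform` returns a result aligned to the original rows, while `apply` with a reducing function returns one row per group. The helper `summarize` is a hypothetical name, and the columns are selected explicitly so the function never sees the grouping column.

```python
from pandas import DataFrame, Series

# illustrative data only
df = DataFrame({
    'a': ['aa', 'aa', 'zz'],
    'b': [1, 2, 0],
    'c': [1.0, 3.0, 5.0],
    'd': [2.0, 4.0, 6.0],
    'e': [0.0, 1.0, 2.0],
})
g = df.groupby('a')

# transform: each row receives its own group's statistic (same length as df)
means = g[['c', 'd', 'e']].transform('mean')
same_shape = means.assign(b=g['b'].transform('sum'))
print(same_shape)

# apply: the function sees one sub-frame per group and reduces it to a Series
def summarize(group):
    return Series({
        'b': group['b'].sum(),
        'c': group['c'].mean(),
        'd': group['d'].mean(),
        'e': group['e'].mean(),
    })

reduced = g[['b', 'c', 'd', 'e']].apply(summarize)
print(reduced)
```

Note that `transform` needed two calls because it applies a single function across the selected columns, whereas `apply` can mix aggregations freely at the cost of running Python code per group.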