ts-python

Seminar X: “Shrek” (“holiday stocking stuffer”; bizarre Python language warts)

Donkey (Eddie Murphy): Hey, what’s your problem, Shrek, what you got against the whole world anyway, huh?

Shrek (Mike Myers): Look, I’m not the one with the problem, okay? It’s the world that seems to have a problem with ME! People take one look at me and go “Aargh! Help! Run! A big stupid ugly ogre!” They judge me before they even know me - that’s why I’m better off alone…

Donkey (Eddie Murphy): You know, Shrek… when we first met, I didn’t think you were a big, stupid, ugly ogre.

Shrek (Mike Myers): Yeah, I know.

— Shrek (2001)

Date	Time	Track	Meeting Link
Fri, Dec 3, 2021	9:30 AM EDT	Python fundmentals & misc.	https://primetime.bluejeans.com/a2m/live-event/xcsugkge

Audience

These sessions are designed for a broad audience of modelers and software programmers of all backgrounds and skill levels.

Our expected audience should comprise attendees…

who are looking to have some holiday fun!
who are interested in the reason behind design decisions and limitations in common analysis libraries like pandas.
who are interested in design warts encountered when building large systems in Python.

During this session, we will endeavour to guide our audience to developing…

familiarity with design issues in Python that affect the development of libraries and large systems.
familiarity with limitations in the core design Python, and how these can affect every day applications.

…and we will share additional tips, tricks, and in-depth guidance on all of these topics!

Abstract

In this special holiday episode, we’re going to have some fun! We’ll discuss design mistakes (“warts”) made in the development of Python and common analysis libraries like pandas. We’ll cover why these exist, how to come to terms with them, how they affect the design and implementation of our own systems and analytical efforts, and what these tell us more broadly about Python and programming.

To Be Continued…

Did you enjoy this episode? Did you learn something new that will help you as you continue or begin to use window methods in your work?

If so, stay tuned for future episodes, which may…

discuss upcoming syntax in Python ≥3.10 and how it interacts with current sytnax and features.
discuss the differences between how Python code was written in the 2.0~2.6 era, the 3.0~3.6 era, and the modern era.
discuss other interesting limitations with Python and common tools like pandas and xarray, how these may affect your work, and common workarounds used to address these.

Notes

Let’s get started!

print("Let's get started!")

Topic I: stop `str`inging me along!

from random import choice as py_choice
from numpy.random import choice as np_choice
from string import ascii_lowercase

print(
    ascii_lowercase,
    py_choice(ascii_lowercase),
    #  np_choice(ascii_lowercase),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Why does this matter?

subjectivity of data/datum
restricted computation domain

from pandas import Series

xs = array(dtype='int64')

print(
    xs,
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

from pandas import Series
from operator import or_

s1 = Series([{1, 2}, {2, 3, 4}, {3, 4, 5}])
s2 = Series([{2, 3}, {1, 2, 4}, {1, 4, 5}])

print(
    s1.combine(s2, or_),
    #  s1.explode().append(s2.explode()),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

What else can you show me?

from pandas import Series, DataFrame
from numpy.random import default_rng

Topic II: `str`ing up trouble!

from numpy import array
from pandas import Series

xs = array(['abc', 'def', 'wxyz'])
s = Series(['abc', 'def', 'wxyz'])

print(
    xs, f'{xs.dtype = }',
    s,  f'{s.dtype = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Why does this matter?

from pandas import Series

s = Series(['abc', 'def', 'wxyz']).astype('category')

print(
    s,  f'{s.dtype = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

from contextlib import contextmanager
from time import perf_counter
@contextmanager
def timed(msg):
    start = perf_counter()
    try:
        yield
    finally:
        stop = perf_counter()
        print(f'{msg:<16} \N{mathematical bold capital delta}t: {stop - start:.6f}s')

from pandas import Series
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

s1 = Series(
    rng.choice([*ascii_lowercase], size=(100_000, (width := 4))).view(f'<U{width}').ravel()
)
s2 = s1.astype('category')

print(
    s1,  f'{s1.dtype = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

#  with timed('object'):
#      '.' == s1

#  with timed('category'):
#      '.' == s2

Topic III: did I forget about the .index?

from pandas import DataFrame
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

df = DataFrame({
    'a': rng.integers(10, size=(size := 4)),
    'b': rng.random(size=size).round(2),
    'c': rng.choice([*ascii_lowercase], size=size),
})

print(
    df,
    0   in df['a'],
    'n' in df['c'],
    0   in df['c'],
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Why does this matter?

You forgot about the .index!

Rules of pandas:

Everyone thinks they know how to use pandas.
No one does (even though the core concept is pretty obvious.)
They get really mad at you when you talk about the index on Twitter.

Rules of pandas:

Stay within the restricted computation domain.
All operations are index-aware; pay attention to (and make use of!) index-alignment.
Try to lean into syntax that is amenable to fast-paths.

from pandas import Series, MultiIndex, date_range, IndexSlice
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

dates = date_range('2000-01-01', periods=100)
assets = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
underlyings = rng.choice([*ascii_lowercase], size=(16, 2)).view('<U2').ravel()

idx = MultiIndex.from_product([
    dates, assets, underlyings
], names=['date', 'asset', 'underlying'])
s = (
    Series(rng.normal(size=len(idx)).round(2), index=idx)
    .sample(frac=.20, random_state=rng.bit_generator)
    .sort_index()
)

print(
    s.loc[IndexSlice[(dts := date_range('2000-01-01', '2000-01-31')), :]]
    .groupby('asset').agg(lambda g:
        ((g[g > 0]).groupby('date').size() > 0)
        .reindex(dates, fill_value=False)
        .pipe(lambda s: s.groupby((~s).cumsum()).cumcount())
        .max()
    ),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Topic IV: one past the end… of my patience!

from pandas import Series, date_range
from numpy.random import default_rng

rng = default_rng(0)

idx = date_range('2000-01-01', periods=150, freq='1D')
s = Series(rng.normal(size=len(idx)).round(2), index=idx)

print(
    s.iloc[0:20],
    s.iloc[0:20].reset_index(drop=True),
    s.loc['2000-01-01':'2000-01-31'],
    s.loc['2000-01-01':'2000-01-31'].iloc[:-1],
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Why does this matter?

The .index, did you forget about it?

from pandas import Series, date_range
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

idx = [*ascii_lowercase]
#  idx = [*ascii_lowercase[::2]]
s = Series(rng.normal(size=len(idx)).round(2), index=idx)

print(
    s.iloc[:5],
    s.loc[:ascii_lowercase[5]],
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

So what do I do?

from pandas import Series, date_range, cut, IntervalIndex, to_datetime
from numpy.random import default_rng
from numpy import iinfo, int64

rng = default_rng(0)

idx = date_range('2000-01-01', periods=150, freq='1D').rename('date')
s = Series(rng.normal(size=len(idx)).round(2), index=idx)

bins = IntervalIndex.from_breaks([
    #  to_datetime(iinfo(int64).min+1, unit='ns'),
    #  to_datetime('2000-01-01'),
    #  to_datetime('2000-02-01'),
    #  to_datetime(iinfo(int64).max, unit='ns'),
], closed='left')

print(
    #  bins,
    #  *s.groupby(cut(s.index.get_level_values('date'), bins)).pipe(
    #      lambda gb: [x for _, x in gb]
    #  ),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Topic V: ‘ceteris paribus’—all else being equal

print(
    f'{(a := 0)     is (b := 0) = }',
    f'{(a := 10)    is (b := 10) = }',
    f'{(a := 1_000) is (b := eval("1_000")) = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 50),
)

from numpy import nan, inf

print(
    f'{(a := +0) is (b := -0) = }',
    f'{(a := +0.0) is (b := -0.0) = }',
    f'{inf is inf = }',
    f'{nan is nan = }',
    f'{inf == inf = }',
    f'{nan == nan = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

class T:
    def __eq__(self, _):
        return False

x = object()

print(
    f'{x is x = }',
    f'{x == x = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

from numpy import array, nan

xs = [nan, nan]
ys = array([nan, nan])

print(
    #  f'{xs[0] != xs[0] = }',
    #  f'{xs == xs = }',
    #  f'{ys == ys = }',

    #  f'{all(x0 == x1 for x0, x1 in zip(xs, xs[1:])) = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Why does this matter?

What is a Python list? What is a numpy.ndarray?
Restricted computation domain.

from numpy import array

# `list`
# opaque collection of elements
# - fixed shape
# - dynamic size

xs = [1, 2, 3]
for x in xs: pass

# `numpy.ndarray`
# manager object serving as a “restricted computation domain”
# with broadcasting (→ ‘tensor’/‘vector’/‘matrix’)
# - dynamic shape
# - fixed size

ys = array([1, 2, 3])
ys.sum()

print(
    #  f'{xs + xs = }',
    #  f'{xs *  3 = }',
    #  f'{ys + ys = }',
    #  f'{ys *  3 = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Topic VI: False is not a number!

print(
    f'{True  == 1 = }',
    f'{False == 0 = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

from statistics import mean

signals = [True, False, False, True, True]
print(f'{mean(signals) = }')

from string import ascii_lowercase
from random import choice, seed; seed(0)

word = ''.join(choice(ascii_lowercase) for _ in range(10))

print(
    f'{word = }',
    f'{"".join(x for x in word if x not in "aeiou") = }',
    f'{sum(1 for x in word if x not in "aeiou") = }',
    f'{sum(x not in "aeiou" for x in word) = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Why does this matter?

Masking!

from pandas import Series
from numpy.random import default_rng

rng = default_rng(0)

s = Series(rng.integers(-10, 10, size=100))

print(
    s[(s > 0) & (s % 2 == 1)],
    sep='\n{}\n'.format(
        '\N{box drawings light horizontal}' * 40,
    ),
)

from contextlib import contextmanager
from time import perf_counter
@contextmanager
def timed(msg):
    start = perf_counter()
    try:
        yield
    finally:
        stop = perf_counter()
        print(f'{msg:<16} \N{mathematical bold capital delta}t: {stop - start:.6f}s')

from pandas import Series
from numpy.random import default_rng

rng = default_rng(0)

s = Series(rng.integers(-10, 10, size=100_000))

print(
    #  s,
    sep='\n{}\n'.format(
        '\N{box drawings light horizontal}' * 40,
    ),
)

#  with timed('indexing'):
#      s[(s > 0) & (s % 2 == 0)] *= 100
#      s[(s < 0) & (s % 2 == 0)] /= 100

#  with timed('masking'):
#      masks = m0, m1 = (
#          (s > 0) & (s % 2 == 0),
#          (s < 0) & (s % 2 == 0),
#      )
#      (s * 100 * m0) + (s / 100 * m1) + s

ts-python

Seminar X: “Shrek” (“holiday stocking stuffer”; bizarre Python language warts)

Audience

Abstract

To Be Continued…

Notes

Topic I: stop stringing me along!

Topic II: string up trouble!

Topic III: did I forget about the .index?

Topic IV: one past the end… of my patience!

Topic V: ‘ceteris paribus’—all else being equal

Topic VI: False is not a number!

Your questions?

Topic I: stop `str`inging me along!

Topic II: `str`ing up trouble!