Materials:
Lt. Cmdr. Philip Francis Queeg (Humphrey Bogart): Aboard my ship, excellent performance is standard, standard performance is sub-standard, and sub-standard performance is not permitted to exist. That, I warn you.
— The Caine Mutiny (1954)
| Date | Time | Track | Meeting Link |
|---|---|---|---|
| July 16, 2021 | 9:30 AM EST | Improving Use of Common Libraries | Seminar IV: “The Caine Mutiny” |
These sessions are designed for a broad audience of modelers and software programmers of all backgrounds and skill-levels.
Our expected audience should comprise attendees with a…
pandas, and similar tools… or greater!
During this session, we will endeavour to guide our audience to developing…
… and we will share additional tips, tricks, and in-depth guidance on all of these topics!
Let’s turn our attention to the Python standard library and how it provides a set of first-approximation tools for helping us accomplish common scripting tasks. These scripting tasks typically “surround” our analytical use-cases—they may not be intimately or directly related to the analysis itself, but they provide functionality that supports some data modeling, data cleaning, or other automation-related capability.
In this episode, we’ll look at the Python standard library, focusing on libraries such as pathlib, collections, tempfile, functools, textwrap, itertools, argparse, and others.
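As one illustration of such a "surrounding" task, a command-line entry point for an analysis script might be sketched with argparse and pathlib. The option names (`--input`, `--top`), the default file name, and the `parse_args` helper are hypothetical, chosen only for this example.

```python
from argparse import ArgumentParser
from pathlib import Path

def parse_args(argv=None):
    # hypothetical options for an imaginary analysis script
    parser = ArgumentParser(description='Summarise a data file.')
    parser.add_argument('--input', type=Path, default=Path('data.csv'),
                        help='file to analyse')
    parser.add_argument('--top', type=int, default=5,
                        help='number of rows to report')
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)

args = parse_args(['--input', 'prices.csv', '--top', '3'])
print(f'{args.input = }')
print(f'{args.top = }')
```

Passing an explicit list to `parse_args` keeps the sketch easy to try in a notebook, where `sys.argv` usually belongs to the kernel.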
Did you enjoy this episode? Did you learn something new that will help you as you continue to add scripting, structuring, and automation surrounding the analytical tasks in your work?
If so, stay tuned for future episodes, which may…
collections or itertools
If there are other related topics you’d like to see covered, please reach out to Diego Torres Quintanilla.
- typing.Literal
- enum.Enum
- collections.namedtuple
- json vs simplejson
- collections
- dataclasses.dataclass
- functools.total_ordering
- functools
- inspect.signature
- inspect
- time.perf_counter, time.perf_counter_ns
- contextlib.contextmanager
- abc.ABC
- collections.abc
- textwrap
- pathlib, tempfile
- itertools

print("Let's get started!")
“Batteries included”?
We will be talking about the Python Standard Library: the libraries that are packaged directly with most distributions of Python. Some of these packages are considered inextricable from Python (in that they are used internally by the interpreter), and others are generally assumed to always be available to end-user code.
Separate from this is the unofficial notion of a PyData Standard Toolkit or
“ecosystem”, a very informal designation comprising major, mature
tools like numpy, scipy, pandas, and their direct dependencies.
The majority of these tools are NumFOCUS sponsored or affiliated.
Separate from this is an unofficial notion of “what we usually got,” which
consists of other major, mature, common tools that are not direct dependencies
of the above, but are so useful that they are almost always available in most
internal distributions of Python, e.g., requests or httpx, xlrd and
xlwt, &c.
The Python Standard Library serves two purposes…
An example of ①:
from graphlib import TopologicalSorter
graph = {
'b': {'c'},
'c': {'d'},
'd': {'e'},
}
ts = TopologicalSorter(graph)
ts.add('a', 'b', 'c')
print(f'{[*ts.static_order()] = }')
from networkx import DiGraph
from networkx.algorithms.dag import topological_sort
g = DiGraph()
g.add_edge('a', 'b')
g.add_edge('b', 'c')
g.add_edge('c', 'd')
g.add_edge('d', 'e')
print(f"{[*topological_sort(g)] = }")
An example of ②:
from math import sin, exp, pi
from statistics import mean, median, pvariance
print(f'{sin(pi) = :.2f}')
print(f'{exp(1) = :.2f}')
print(f'{exp(pi * 1j) = }')
xs = [1, 2, 3, 4]
print(f'{mean(xs) = }')
print(f'{median(xs) = }')
print(f'{pvariance(xs) = }')
from numpy import array, sin, exp, pi
print(f'{sin(pi) = :.2f}')
print(f'{exp(1) = :.2f}')
print(f'{exp(pi * 1j) = :.2f}')
xs = array([1, 2, 3, 4], dtype='float64')
print(f'{xs.mean() = }')
print(f'{xs.var() = }')
from sympy import exp, pi, I
print(f'{exp(pi * I).simplify() = }')
from array import array
from struct import pack
xs = array('d')
xs.extend([1, 2, 3, 4])
print(f'{xs = }')
print(f'{xs + xs = }')
ys = pack('4d', 1, 2, 3, 4)
# print(f'{ys = }')
Other examples
- re vs re2
- json vs ujson, simplejson
- xml vs lxml
- urllib vs requests, httpx, urllib3
- asyncio vs curio, trio

We will not be able to survey the entire standard library. There are useful parts that are likely to be outside of our immediate use-cases that we will largely skip:
from decimal import Decimal, localcontext
with localcontext() as ctx:
    ctx.prec = 10
    x, y = Decimal('1'), Decimal('3')
    print(f'{x / y = }')
x, y = Decimal('1'), Decimal('10')
print(f'{x / y = }')
from ipaddress import IPv4Address, IPv4Network
ip = IPv4Address('192.168.1.100')
net = IPv4Network('192.168.1.0/24')
print(f'{ip.is_loopback = }')
print(f'{ip in net = }')
Gratuitous reminder:
from pandas import Series, DataFrame
from enum import Enum
from numpy.random import default_rng
from functools import total_ordering
from dataclasses import dataclass
rng = default_rng(0)
@dataclass
class Name:
    value : str
    __hash__ = lambda s: hash(s.value)

@total_ordering
class OrderedEnum(Enum):
    # compare against the member's own enum type, so this works for Stars and Assets alike
    __eq__ = lambda s, o: isinstance(o, type(s)) and s.value == o.value
    __lt__ = lambda s, o: isinstance(o, type(s)) and s.value < o.value
    __hash__ = lambda s: hash(s.value)
Stars = Enum('Stars', 'Sol Sirius Epsilon Wolf', type=OrderedEnum)
Assets = Enum('Assets', 'Medicine Software Uranium StarGems Credits', type=OrderedEnum)
Assets.Tradeable = {*Assets} - {Assets.Credits}
market = DataFrame({
asset: Series(
rng.random(size=(sz := rng.integers(2, len(Stars) // 2 + 1))) * 1_000,
index=rng.choice([*Stars], size=sz, replace=False)
)
for asset in Assets.Tradeable
}).round(2)
market[Assets.Credits] = 1
market = market.sort_index()
market.index.name, market.columns.name = Name(Stars), Name(Assets)
inventory = Series(
rng.integers(10, 1_000, size=len(Assets.Tradeable)),
index=[*Assets.Tradeable],
)
inventory[Assets.Credits] = 1_000
inventory = inventory.sort_index()
inventory.index.name = Name(Assets)
print(
# market,
# inventory,
# market * inventory,
# (market * inventory).sum(axis='columns'),
# (market * inventory).sum(axis='columns').idxmax(),
)
typing.Literal
def f(mode=True):
    pass

def f(mode : str = 'up'):
    ''' mode can be "up", "down", "left", or "right" '''
    pass

from typing import Literal
Mode = Literal['up', 'down', 'left', 'right']
def f(mode : Mode = 'up'):
    pass
print(f'{Mode = }')
print(f'{Mode.__args__ = }')
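A `Literal` annotation is a hint for type checkers and is not enforced when the function is called. If a runtime check is wanted, one possible sketch uses `typing.get_args` (the public counterpart of `__args__`, available from Python 3.8) to recover the allowed values. The validation inside `f` is an addition for illustration, not part of the original example.

```python
from typing import Literal, get_args

Mode = Literal['up', 'down', 'left', 'right']

def f(mode : Mode = 'up'):
    # Literal is only a hint; this explicit check is our own addition
    if mode not in get_args(Mode):
        raise ValueError(f'mode must be one of {get_args(Mode)}, not {mode!r}')
    return mode

print(f'{f("left") = }')
try:
    f('sideways')
except ValueError as e:
    print(f'{e = }')
```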
enum.Enum
from enum import Enum
from random import choice
Mode = Enum('Mode', 'Up Down Left Right')
print(f"{Mode['Up'] = }")
m = Mode['Down']
print(f'{m is Mode.Down = }')
print(f'{[*Mode] = }')
print(f'{choice([*Mode]) = }')

from enum import Enum, auto
class Mode(Enum):
    Up = auto()
    Down = auto()
    Left = auto()
    Right = auto()
print(f'{[*Mode] = }')

from enum import Enum, auto
from numpy import array
class Mode(Enum):
    Up = [+1, 0]
    Down = [-1, 0]
    Left = [ 0, +1]
    Right = [ 0, -1]
    def __add__(self, other):
        if isinstance(other, Mode):
            return self.value + other.value
        return self.value + other
    __radd__ = __add__
print(f'{[*Mode] = }')
pos = array([0, 0])
moves = [Mode.Up, Mode.Down, Mode.Left, Mode.Left]
print(f'{sum(array(m.value) for m in moves) = }')
print(f'{pos + sum(array(m.value) for m in moves) = }')

from enum import Enum, auto
from numpy import array
class Mode(Enum):
    Up = array([+1, 0])
    Down = array([-1, 0])
    Left = array([ 0, +1])
    Right = array([ 0, -1])
collections.namedtuple
from collections import namedtuple
objs = [
    ('xyz', 123, {'a', 'b', 'c'}),
    ('def', 456, {'a', 'd'}),
]
for name, score, choices in objs:
    print(f'{name = } has choices {choices = }')
for obj in objs:
    print(f'{obj[0] = } has choices {obj[-1] = }')

from collections import namedtuple
Entrant = namedtuple('Entrant', 'name score choices')
objs = [
    Entrant('xyz', 123, {'a', 'b', 'c'}),
    Entrant('def', 456, {'a', 'd'}),
]
for name, score, choices in objs:
    print(f'{name = } has choices {choices = }')
for obj in objs:
    print(f'{obj[0] = } has choices {obj[-1] = }')
for obj in objs:
    print(f'{obj.name = } has choices {obj.choices = }')

from collections import namedtuple
class Entrant(namedtuple('Entrant', 'name score choices')):
    def __new__(cls, name, score, choices=set()):
        if score < 0:
            raise ValueError('score must be positive')
        return super().__new__(cls, name, score, choices)
objs = [
    Entrant('xyz', 123, {'a', 'b', 'c'}),
    Entrant('def', 456),
]
for obj in objs:
    print(f'{obj = }')
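When the only reason for subclassing is to supply a default, the `defaults=` argument of `namedtuple` (Python 3.7+) may be enough. The sketch below assumes an immutable `frozenset()` is an acceptable default for `choices`, which sidesteps sharing one mutable `set()` between instances; it does not reproduce the score validation from the subclass above.

```python
from collections import namedtuple

# defaults apply to the rightmost fields; frozenset() avoids a shared mutable default
Entrant = namedtuple('Entrant', 'name score choices', defaults=[frozenset()])

a = Entrant('xyz', 123, frozenset({'a', 'b'}))
b = Entrant('def', 456)
c = b._replace(score=789)   # namedtuples are immutable; _replace returns a new one

print(f'{b = }')
print(f'{c = }')
print(f'{c._asdict() = }')
```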
json vs simplejson
from collections import namedtuple
from json import dumps
Entrant = namedtuple('Entrant', 'name score choices')
objs = [
    Entrant('xyz', 123, {'a', 'b', 'c'}),
    Entrant('def', 456, {'a', 'd'}),
]
for obj in objs:
    print(f'{dumps(obj) = }')

from collections import namedtuple
from json import dumps
Entrant = namedtuple('Entrant', 'name score choices')
objs = [
    Entrant('xyz', 123, {'a', 'b', 'c'}),
    Entrant('def', 456, {'a', 'd'}),
]
def default(obj):
    if isinstance(obj, set):
        return [*obj]
    if isinstance(obj, Entrant):
        return obj._asdict()
for obj in objs:
    print(f'{dumps(obj, default=default) = }')

from collections import namedtuple
from simplejson import dumps
Entrant = namedtuple('Entrant', 'name score choices')
objs = [
    Entrant('xyz', 123, {'a', 'b', 'c'}),
    Entrant('def', 456, {'a', 'd'}),
]
def default(obj):
    if isinstance(obj, set):
        return [*obj]
for obj in objs:
    print(f'{dumps(obj, default=default) = }')
collections
from collections import deque
xs = deque([1, 2, 3, 4, 5])
xs.append(6)
xs.append(7)
while xs:
    print(f'{xs.popleft() = }')
from collections import deque
xs = deque(maxlen=3)
xs.append(1)
xs.append(2)
xs.append(3)
xs.append(4)
print(f'{xs = }')
from collections import defaultdict
d = defaultdict(int)
print(f"{d['abc'] = }")
print(f"{d['def'] = }")
print(f'{d = }')
class passthru(dict):
    def __missing__(self, key):
        return key
d = passthru({
'abc': 'ABC',
})
print(f"{d['abc'] = }")
print(f"{d['ABC'] = }")
from collections import ChainMap
layer0 = {'abc': 123, }
layer1 = { 'def': 456}
layer2 = {'abc': 789, }
cm = ChainMap(layer2, layer1, layer0)
print(f"{cm['abc'] = }")
print(f"{cm['def'] = }")
print(f'{cm.maps = }')
from collections import ChainMap, deque
layer0 = {'abc': 123, }
layer1 = { 'def': 456}
layer2 = {'abc': 789, }
cm = ChainMap()
cm.maps = deque()
cm.maps.extend([layer2, layer1, layer0])
print(f"{cm['abc'] = }")
cm.maps.popleft()
print(f"{cm['abc'] = }")
from collections import OrderedDict
od1 = OrderedDict({'a': 1, 'b': 2, 'c': 3})
od2 = OrderedDict({'c': 3, 'b': 2, 'a': 1})
print(f'{od1 == od2 = }')
print(f'{od1.popitem() = }')
from collections import Counter
c = Counter('aaabbccddddd')
print(f'{c = }')
from collections import Counter
c1 = Counter('abc')
c2 = Counter('bcd')
print(f'{c1 + c2 = }')
print(f'{c1 & c2 = }')
print(f'{c1 | c2 = }')
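Two further `Counter` operations that often come up alongside the ones above are `most_common` and subtraction. This is a small sketch on made-up strings.

```python
from collections import Counter

c = Counter('aaabbccddddd')
print(f'{c.most_common(2) = }')   # the two highest counts, largest first

c1, c2 = Counter('aab'), Counter('abc')
print(f'{c1 - c2 = }')            # subtraction drops zero and negative counts
```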
from collections import UserDict
class mydict(UserDict):
    def __setitem__(self, key, value):
        super().__setitem__(key.upper(), value)
d = mydict({'abc': 123})
d.update({'def': 456})
d['xyz'] = 789
print(f'{d = }')

from collections.abc import MutableMapping
class mydict(MutableMapping):
    def __init__(self, value={}):
        self._d = {}
        for k, v in value.items():
            self[k] = v
    def __getitem__(self, key):
        return self._d[key]
    def __setitem__(self, key, value):
        self._d[key.upper()] = value
    def __delitem__(self, key):
        del self._d[key]
    def __iter__(self):
        return iter(self._d)
    def __len__(self):
        return len(self._d)
    def __repr__(self):
        return f'mydict({self._d!r})'
d = mydict({'abc': 123})
d.update({'def': 456})
d['xyz'] = 789
print(f'{d = }')
dataclasses.dataclass
from dataclasses import dataclass
@dataclass
class Entrant:
    name : str
    score : int
    choices : set = ()
    def __post_init__(self):
        if self.score < 0:
            raise ValueError('score must be positive')
        self.choices = {*self.choices}
objs = [
    Entrant('xyz', 123, {'a', 'b', 'c'}),
    Entrant('def', 456, {}),
]
for obj in objs:
    print(f'{obj = }')

from attr import attrs, attrib, validators
from collections.abc import Iterable
@attrs(order=True)
class Entrant:
    name : str = attrib(validator=validators.instance_of(str))
    score : int = attrib(validator=validators.instance_of(int))
    choices : set = attrib(validator=validators.instance_of(Iterable),
                           factory=set, converter=set, kw_only=True)
objs = [
    Entrant('xyz', 123, choices=['a', 'b', 'c']),
    Entrant('def', 456),
]
for obj in objs:
    print(f'{obj = }')

from collections.abc import Iterable
class Entrant:
    def __init__(self, name, score, choices=()):
        if score < 0:
            raise ValueError('score must be positive')
        self.name, self.score = name, score
        self.choices = {*choices}
    def __repr__(self):
        return f'Entrant({self.name!r}, {self.score!r}, {self.choices!r})'
    def __eq__(self, other):
        return self.name == other.name and self.score == other.score and self.choices == other.choices
objs = [
    Entrant('xyz', 123, choices=['a', 'b', 'c']),
    Entrant('def', 456),
]
for obj in objs:
    print(f'{obj = }')
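The dataclass version above uses an empty tuple as the default for `choices` and converts it in `__post_init__`. An alternative worth knowing is `dataclasses.field(default_factory=...)`, which builds a fresh default per instance. The following is a sketch of that variant, not a replacement for the original.

```python
from dataclasses import dataclass, field

@dataclass
class Entrant:
    name : str
    score : int
    # default_factory builds a fresh set for every instance
    choices : set = field(default_factory=set)

    def __post_init__(self):
        if self.score < 0:
            raise ValueError('score must be non-negative')
        self.choices = set(self.choices)   # accept any iterable

a, b = Entrant('xyz', 123, ['a', 'b']), Entrant('def', 456)
a.choices.add('c')
print(f'{a = }')
print(f'{b = }')
```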
functools.total_ordering
from enum import Enum, auto
from functools import total_ordering
@total_ordering
class Hands(Enum):
    Straight = auto()
    Flush = auto()
    StraightFlush = auto()
    RoyalFlush = auto()
    def __lt__(self, other):
        return self.value < other.value
    def __eq__(self, other):
        return self.value == other.value
print(f'{Hands.Straight < Hands.RoyalFlush = }')
print(f'{Hands.StraightFlush >= Hands.RoyalFlush = }')
functools
functools.reduce
from functools import reduce
from operator import add, mul
xs = [1, 2, 3, 4, 5]
print(f'{reduce(add, xs) = }')
print(f'{reduce(mul, xs) = }')

from numpy import array, prod
from pandas import Series
xs = array([1, 2, 3, 4, 5])
print(f'{xs.sum() = }')
print(f'{xs.prod() = }')
s = Series([1, 2, 3, 4, 5])
print(
    s.expanding().sum(),
    s.expanding().apply(prod),
    sep='\n\n'
)
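For the running (expanding) totals shown with pandas, the standard library's closest analogue is `itertools.accumulate`, which yields every intermediate result of the reduction rather than only the last one. A minimal sketch:

```python
from itertools import accumulate
from operator import mul

xs = [1, 2, 3, 4, 5]
print(f'{[*accumulate(xs)] = }')        # running sum (addition is the default)
print(f'{[*accumulate(xs, mul)] = }')   # running product
```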
functools.partial
from functools import partial
from functools import wraps
def f(a, b, c):
    return a + b - c
g = partial(f, c=0)
print(f'{g(1, 2) = }')
help(g)
g = wraps(f)(lambda *a, **kw: f(*a, **kw, c=0))
g.__doc__ = '\n'.join([g.__doc__ or '', 'Fixing c = 0'])
print(f'{g(1, 2) = }')
help(g)
functools.wraps
from functools import wraps
def dec(f):
    @wraps(f)
    def inner(*args, **kwargs):
        return f(*args, **kwargs) + 1
    return inner
@dec
def f(x, y):
    ''' adds x and y '''
    return x + y
help(f)
functools.lru_cache
from functools import lru_cache
from time import sleep, perf_counter
@lru_cache
def f(x, y):
    sleep(1)
    return x + y
start = perf_counter()
print(f'{f(1, 1) = }')
print(f'{f(1, 1) = }')
print(f'{f(1, 1) = }')
print(f'{f(x=1, y=1) = }')
print(f'{f(1, y=1) = }')
stop = perf_counter()
print(f'\N{mathematical bold capital delta}t: {stop - start:.2f}s')
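The timing above hints that not every call is served from the cache. One way to see this directly, without sleeping, is `cache_info()`. The `functools` documentation notes that distinct argument patterns may be treated as distinct calls with separate cache entries, so `f(1, 1)`, `f(x=1, y=1)`, and `f(1, y=1)` are not guaranteed to share an entry. A sketch of the same call sequence:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def f(x, y):
    return x + y

f(1, 1)
f(1, 1)
f(1, 1)
f(x=1, y=1)   # same values, different argument pattern
f(1, y=1)     # and another pattern again
print(f'{f.cache_info() = }')
```

The `memoise` class built on `inspect.signature` in the next section is one way to normalise these patterns into a single key.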
inspect.signature
from inspect import signature
from time import sleep, perf_counter
class memoise(dict):
    def __init__(self, f):
        self.f, self.sig = f, signature(f)
    def __call__(self, *args, **kwargs):
        key = self.sig.bind(*args, **kwargs)
        return self[key.args, frozenset(key.kwargs.items())]
    def __missing__(self, key):
        args, kwargs = key
        self[key] = self.f(*args, **dict(kwargs))
        return self[key]
@memoise
def f(x, y):
    sleep(1)
    return x + y
start = perf_counter()
print(f'{f(1, 1) = }')
print(f'{f(1, 1) = }')
print(f'{f(1, 1) = }')
print(f'{f(x=1, y=1) = }')
print(f'{f(1, y=1) = }')
stop = perf_counter()
print(f'\N{mathematical bold capital delta}t: {stop - start:.2f}s')
inspect
def dec(f):
    pass
@dec
def f():
    pass

def dec(arg):
    def inner_dec(f):
        pass
    return inner_dec
@dec(...)
def f():
    pass

from inspect import getsource
from ast import parse
def dec(f):
    # print(f'{getsource(f).splitlines()[0] = }')
    print(f'{parse(getsource(f)).body[0].decorator_list = }')
    return f
@dec
def f():
    pass

from inspect import signature
from itertools import chain
print(f'{signature(len) = }')
print(f'{signature(chain) = }')

from inspect import getsource
from json import loads
from itertools import chain
# print(f'{getsource(chain) = }')
time.perf_counter, time.perf_counter_ns
from time import perf_counter, perf_counter_ns, sleep
start = perf_counter()
sleep(1)
stop = perf_counter()
print(f'\N{mathematical bold capital delta}t: {stop - start:.2f}s')
start = perf_counter_ns()
sleep(1)
stop = perf_counter_ns()
print(f'\N{mathematical bold capital delta}t: {stop - start:.0f}ns')
contextlib.contextmanager
from contextlib import contextmanager
from time import sleep, perf_counter
@contextmanager
def timed(msg=''):
    start = perf_counter()
    try:
        yield
    finally:
        stop = perf_counter()
        print(f'{msg} \N{mathematical bold capital delta}t: {stop - start:.2f}s')
with timed('one-second nap'):
    sleep(1)
abc.ABC
from abc import ABC, abstractmethod
class Interface(ABC):
    @abstractmethod
    def f(self, a, b):
        pass
class BadImplementation(Interface):
    pass
# obj = BadImplementation()
class BadImplementation(Interface):
    def f(self):
        pass
obj = BadImplementation()
from inspect import signature
from collections.abc import Callable
def abstractmethod(f):
    f.abstract = True
    return f
class Interface:
    @abstractmethod
    def f(self, a, b):
        pass
    def __init_subclass__(cls):
        methods = {name: meth for name in dir(Interface)
                   if isinstance(meth := getattr(Interface, name), Callable)
                   and getattr(meth, 'abstract', False)}
        for name, meth in methods.items():
            if not hasattr(cls, name) or getattr(cls, name) is meth:
                raise TypeError(f'{cls} missing method {name}')
            if signature(getattr(cls, name)) != signature(meth):
                raise TypeError(f'{cls} mismatched signature on {name}')
try:
    class BadImplementation(Interface):
        pass
except Exception as e:
    print(f'{e = }')
try:
    class BadImplementation(Interface):
        def f(self):
            pass
except Exception as e:
    print(f'{e = }')
class GoodImplementation(Interface):
    def f(self, a, b):
        pass
from inspect import signature
def make_interface(methods):
    def interface(cls):
        for name, meth in methods.items():
            if not hasattr(cls, name) or getattr(cls, name) is meth:
                raise TypeError(f'{cls} missing method {name}')
            if signature(getattr(cls, name)) != signature(meth):
                raise TypeError(f'{cls} mismatched signature on {name}')
        return cls
    return interface
interface = make_interface({
    'f': lambda self, a, b: None
})
try:
    @interface
    class BadImplementation:
        pass
except Exception as e:
    print(f'{e = }')
try:
    @interface
    class BadImplementation:
        def f(self):
            pass
except Exception as e:
    print(f'{e = }')
@interface
class GoodImplementation:
    def f(self, a, b):
        pass
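For contrast with the hand-rolled checks above, here is a sketch of what `abc.ABC` itself enforces. It refuses to instantiate a subclass that leaves an abstract method unimplemented, but it checks at instantiation time rather than at class definition, and it does not compare signatures. The class names `Missing` and `Mismatched` are made up for this example.

```python
from abc import ABC, abstractmethod

class Interface(ABC):
    @abstractmethod
    def f(self, a, b):
        pass

class Missing(Interface):       # defining the class is fine ...
    pass

class Mismatched(Interface):
    def f(self):                # wrong signature, but abc does not compare signatures
        pass

try:
    Missing()                   # ... the TypeError only appears on instantiation
except TypeError as e:
    print(f'{e = }')

obj = Mismatched()              # instantiates without complaint
print(f'{obj = }')
```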
collections.abc
from collections.abc import Callable, Iterable, Container, Sized
from pandas import Series
from numpy import array
def f(): pass
class g: pass
h = lambda: None
print(
f'{isinstance(f, Callable) = }',
f'{isinstance(g, Callable) = }',
f'{isinstance(h, Callable) = }',
sep='\n',
)
xs = [1, 2, 3]
ys = {1, 2, 3}
zs = Series([1, 2, 3])
print(
f'{isinstance(xs, Iterable) = }',
f'{isinstance(ys, Iterable) = }',
f'{isinstance(zs, Iterable) = }',
sep='\n',
)
s = 'abc'
print(
# f'{isinstance(xs, Iterable) = }',
# f'{isinstance(xs, Iterable) and not isinstance(xs, str) = }',
sep='\n',
)
xs = array([1, 2, 3])
print(
f'{isinstance(xs, Container) = }',
f'{isinstance(xs, Sized) = }',
f'{len(xs) = }',
f'{bool(xs) = }',
sep='\n',
)
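A common wrinkle with `Iterable` is that `str` passes the check too, which is rarely what a "list-like" test wants. One possible helper is sketched below; the name `is_nonstring_iterable` is hypothetical, and excluding `bytes` as well is an assumption of this sketch.

```python
from collections.abc import Iterable

def is_nonstring_iterable(obj):
    # hypothetical helper: str and bytes are Iterable, but we rarely want to loop over them
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))

for obj in [[1, 2, 3], {1, 2, 3}, 'abc', 123]:
    print(f'{obj!r:<12} {is_nonstring_iterable(obj) = }')
```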
You’ll see this in the pandas codebase:
xs = [1, 2, 3]
try:
    iter(xs)
    len(xs)
except Exception as e:
    pass
from enum import Enum
from pandas import DataFrame
from numpy import zeros
en = Enum('Enum', 'a b c')
df = DataFrame(zeros((len(en), len(en))))
df.index = [*en]
df.columns = [*en]
df.index.name = df.columns.name = en
print(
df,
)
textwrap
class A:
    values = ['a', 'b', 'c', 'd', 'e']
print(f'{A.values = }')

class A:
    values = '''
        a b c d e
    '''.split()
print(f'{A.values = }')

from textwrap import dedent, indent, shorten
class A:
    message = dedent('''
        Some
        Message
    ''').strip()
print(f'{A.message = }')
print(
    indent(A.message, '----'),
)
print(
    shorten('Some long message', 10, placeholder='...'),
)
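Alongside `dedent`, `indent`, and `shorten`, the same module offers `fill` for re-wrapping a long string to a fixed width. A short sketch with a made-up message:

```python
from textwrap import fill

message = 'The quick brown fox jumps over the lazy dog. ' * 3
# wrap to 40 columns, prefixing every line as in a quoted e-mail reply
wrapped = fill(message, width=40, initial_indent='> ', subsequent_indent='> ')
print(wrapped)
```

The prefix counts toward `width`, so every output line stays within 40 characters.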
pathlib, tempfile
from pathlib import Path
curdir = Path('.')
for path in curdir.iterdir():
    if path.is_file():
        print(f'{path = }')
        print(f'{path.suffix = }')
        print(f'{path.with_stem(path.stem.upper()) = }')
        print(f"{path.with_suffix(path.suffix + '.gz') = }")
        # print(f'{path.stat() = }')
        break
datadir = curdir / 'a/b/c'
datadir.mkdir(parents=True, exist_ok=True)  # mkdir returns None, so keep the Path in datadir

from tempfile import TemporaryFile, NamedTemporaryFile
with TemporaryFile(mode='w+t') as f:
    f.write('abc')
    f.seek(0)
    print(f'{f.read() = }')
with NamedTemporaryFile(mode='w+t') as f:
    f.write('abc')
    print(f'{f.name = }')

from tempfile import TemporaryDirectory
from pathlib import Path
with TemporaryDirectory(prefix='test-') as d:
    d = Path(d)
    print(f'{d = }')
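Putting the two modules together, a scratch workspace that cleans up after itself might look like the following sketch. The subdirectory and file names are hypothetical.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory(prefix='test-') as d:
    datadir = Path(d) / 'a' / 'b'
    datadir.mkdir(parents=True)          # create intermediate directories as needed
    target = datadir / 'scratch.txt'     # hypothetical file name
    target.write_text('abc')
    contents = target.read_text()
    found = sorted(p.name for p in datadir.iterdir())
    print(f'{contents = }')
    print(f'{found = }')

print(f'{target.exists() = }')           # the whole tree is removed on exit
```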
itertools
A generalisation of iteration helpers:
A tool for building simple, powerful iteration helpers!
from itertools import tee, islice, zip_longest, repeat, chain
nwise = lambda g, n=2: zip(*(islice(g, idx, None) for idx, g in enumerate(tee(g, n))))
nwise_longest = lambda g, n=2, fv=object: zip_longest(*(islice(g, idx, None) for idx, g in enumerate(tee(g, n))), fillvalue=fv)
first = lambda g, n=1: zip(g, chain(repeat(True, n), repeat(False)))
last = lambda g, m=1, s=object(): ((x, y[-1] is s) for x, *y in nwise_longest(g, m+1, fv=s))
for x, y in nwise('abcd'):
    print(f'{x, y = }')
print()
for x, y in nwise_longest('abcd'):
    print(f'{x, y = }')
print()
for x, is_first in first('abcd'):
    print(f'{x, is_first = }')
print()
for x, is_last in last('abcd'):
    print(f'{x, is_last = }')
print()
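The one-line `nwise` lambda packs several steps together. Written out as a function with the same behaviour, the idea may be easier to follow: copy the iterator `n` times with `tee`, offset each copy by its position with `islice`, and zip the copies back together. This is an unpacked sketch of the lambda, not a different algorithm.

```python
from itertools import islice, tee

def nwise(iterable, n=2):
    # make n independent iterators over the same data
    iterators = tee(iterable, n)
    # advance the idx-th iterator by idx steps
    shifted = (islice(it, idx, None) for idx, it in enumerate(iterators))
    # zip stops at the shortest input, so only complete windows are produced
    return zip(*shifted)

print(f'{[*nwise("abcd")] = }')
print(f'{[*nwise(range(5), n=3)] = }')
```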
Next time!