ts-python

Seminar XI: Out With the Old and In With the New: The Evolution of Python’s Standard Library

HOMER SIMPSON: “I’ve gone back to the time when dinosaurs weren’t just confined to zoos.”

— “Treehouse of Horror V” (S06E06)

Abstract

Ever wonder about the design decisions behind Python’s standard library? Why do old “included batteries” sit around collecting dust instead of being deprecated?

Do you…

Then come join us for a session on Python’s standard library!

Python is touted as a “batteries-included” programming language. But these batteries have changed and evolved over time, often leaving users with redundant entry points for overlapping tasks. For example, to launch a subprocess, should one use os.system, os.popen, os.posix_spawn, subprocess.Popen, subprocess.check_output, subprocess.call, or subprocess.run? How do you choose among these?

In this episode we’ll discuss the most widely used and most-changed modules in the Python standard library, laying out why upgrades or replacements occurred and what the currently recommended approaches are. Additionally, we’ll discuss which third-party packages can be used as drop-in replacements for parts of the standard library and why some of these third-party packages are preferred.

Keywords

Notes

Premise: Scoping & Framing

What is the Python standard library exactly?

print("Let's take a look!")

It refers to the packages that are included in a community-supported Python distribution.

It refers to the packages that we can assume are included in any community-supported Python distribution.

It refers to packages without which we would not have a “complete” or “correct” Python installation.

Some of these packages are necessary for the operation of the interpreter itself.

Consider…

from sys import modules

print(modules.keys())

And the same under `python -S` (which skips `site` initialization, so fewer modules are preloaded):

```python -S
from sys import modules

print(modules.keys())
```

But as a practical matter, in a managed corporate environment, we (end users)
don’t generally think in terms of what the community will support.

Our framing is generally…
- actual Python standard library
- standard PyData ecosystem (e.g., `numpy`, `matplotlib`, maybe `pandas`)
- institutional core libraries (e.g., data accessors, foundational technologies)

A quiz: which are in the Python standard library?

```python
import arrow        # .feather, .parquet
import csv          # .csv
import h5py         # .hdf5
import html         # .html
import configparser # .ini
import json         # .json
import mailbox      # .mbox
import xarray       # .nc
import plistlib     # .plist
import tomllib      # .toml
import lxml         # .xml
import yaml         # .yml
```

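Answer key: csv, html, configparser, json, mailbox, plistlib, and tomllib (3.11+) are in the standard library; arrow, h5py, xarray, lxml, and yaml (PyYAML) are third-party. (Note, too, that the Feather/Parquet bindings actually live in pyarrow; the arrow package on PyPI is a datetime library.)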
“Silly, old stuff…”

Final:

Withdrawn:

from pkgutil import iter_modules
from collections import defaultdict

paths = defaultdict(set)
for x in iter_modules():
    paths[x.module_finder.path].add(x)

print(*paths.keys(), sep='\n')

from pkgutil import iter_modules
from collections import defaultdict
from itertools import chain
from pathlib import Path

paths = defaultdict(set)
for x in iter_modules():
    paths[x.module_finder.path].add(x)

stdlib = {
    *chain.from_iterable(
        paths[loc]
        for loc
        in paths.keys() - {x for x in paths if Path(x).name == 'site-packages'}
    )
}

print(
    *sorted((x for x in stdlib if not x.name.startswith('_')), key=lambda x: x.name),
    sep='\n',
)
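On Python 3.10+, we don’t have to reconstruct this from filesystem paths at all; the interpreter ships the answer as a frozenset, which makes a handy cross-check:

```python
from sys import stdlib_module_names  # Python ≥3.10

print(*sorted(
    name for name in stdlib_module_names if not name.startswith('_')
), sep='\n')
```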

# the `name := False` gates mean these blocks never execute; they are just an
# organized inventory (most of the “superseded” list was deprecated by PEP 594
# and removed in Python 3.12–3.13); note plain `import` statements do not allow
# parenthesized lists, so the long groups are split across lines
if gimmicks := False:
    import antigravity, this
if superseded := False:
    import aifc, asynchat, asyncore, audioop, cgi, cgitb, chunk, crypt
    import imghdr, imp, mailcap, nis, nntplib, optparse, ossaudiodev
    import pipes, smtpd, sndhdr, spwd, sunau, telnetlib, uu, xdrlib
    import getopt
if instructional := False:
    import idlelib, turtle, turtledemo, xxlimited, xxlimited_35
if file_formats := False:
    import binascii, bz2, configparser, csv, dbm, email, gzip, html, json
    import lzma, mailbox, netrc, plistlib, quopri, sqlite3, tarfile, wave
    import wsgiref, xml
if protocols := False:
    import ftplib, http, imaplib, poplib, smtplib, socketserver, urllib, webbrowser, xmlrpc
if platform_specific := False:
    import curses, fcntl, grp, pty, resource, syslog, termios, tty
if mechanical_or_technical := False:
    import ast, code, codeop, colorsys, compileall, copyreg, distutils
    import ensurepip, filecmp, gettext, keyword, lib2to3, mimetypes, mmap
    import modulefinder, pickletools, pkgutil, profile, pstats, py_compile
    import pyclbr, pydoc, select, selectors, signal, site, ssl, stringprep
    import sysconfig, tabnanny, token, tokenize, venv, zipapp, zipfile, zipimport

if not_very_interesting := False:
    import abc, calendar, cmd, fileinput, graphlib, linecache, statistics, tkinter, unittest

if obvious := False:
    import argparse, cProfile, datetime, decimal, fractions, math
    import multiprocessing, os, random, threading

...

You Don’t Know What You Don’t Know

print("Let's take a look!")

from sys import argv
from pandas import read_csv, concat

if __name__ == '__main__':
    # output_filename = argv[1]
    output_filename = 'data/file.parquet'

    # input_filenames = argv[2:]
    input_filenames = ['data/file0.csv', 'data/file1.csv']

    raw_data = {}
    for fn in input_filenames:
        raw_data[fn] = (
            read_csv(fn, parse_dates=['date'], index_col=['date', 'entity'])
                # .assign(source=fn)
                .squeeze(axis='columns')
        )

    data = concat(raw_data.values())
    data.to_frame().to_parquet(output_filename)

from argparse import ArgumentParser
from pandas import read_csv, concat

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename')
parser.add_argument('filenames', nargs='+', help='input filenames')

if __name__ == '__main__':
    # args = parser.parse_args()
    args = parser.parse_args([
        'data/file0.csv', 'data/file1.csv',
        '-o', 'data/file.parquet',
    ])

    raw_data = {}
    for fn in args.filenames:
        raw_data[fn] = (
            read_csv(fn, parse_dates=['date'], index_col=['date', 'entity'])
                .squeeze(axis='columns')
        )

    data = concat(raw_data.values())
    data.to_frame().to_parquet(args.output)

from argparse import ArgumentParser
from pandas import read_csv, concat
from os import system

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename')
parser.add_argument('filenames', nargs='+', help='input filenames')

if __name__ == '__main__':
    # args = parser.parse_args()
    args = parser.parse_args([
        'data/file0.csv.gz', 'data/file1.csv.gz',
        '-o', 'data/file.parquet',
    ])

    raw_data = {}
    for fn in args.filenames:
        if fn.endswith('.gz'):
            system(f'gunzip -k {fn}')
        raw_data[fn] = (
            read_csv(fn, parse_dates=['date'], index_col=['date', 'entity'])
                .squeeze(axis='columns')
        )

    data = concat(raw_data.values())
    data.to_frame().to_parquet(args.output)
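An aside before improving this: pandas can already decompress gzip on its own. compression='infer' is the default for read_csv and keys off the .gz suffix, so the shell-out to gunzip above is unnecessary:

```python
from pandas import read_csv

# compression='infer' (the default) detects gzip from the `.gz` suffix,
# so no external `gunzip` call is needed
df = read_csv('data/file0.csv.gz', parse_dates=['date'],
              index_col=['date', 'entity'])
```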

from os import system
system('touch /tmp/some-file')
system('file /tmp/some-file')

from os import system

filename = '/tmp/some-file'
system(f'touch {filename}')
system(f'file {filename}')

from os import system
from shlex import quote

filename = '/tmp/some file'
cmd = f'touch {quote(filename)}'
print(cmd)
system(cmd)  # quote() makes the embedded space safe for the shell
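shlex also works in the other direction: shlex.split parses a POSIX-quoted command string back into the argument list that subprocess expects.

```python
from shlex import split

cmd = "touch '/tmp/some file'"
print(split(cmd))  # ['touch', '/tmp/some file']
```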

from subprocess import run
# no shlex.quote needed: an argument list bypasses the shell entirely

filename = '/tmp/some file'
run(['touch', filename])
run(['file', filename])

from subprocess import check_call

filename = '/tmpx/some-file'  # note: /tmpx does not exist
check_call(['touch', filename])  # touch exits nonzero, so check_call raises CalledProcessError
check_call(['file', filename])

from subprocess import check_call, DEVNULL

filename = '/tmp/some-file'
check_call(['touch', filename], stdout=DEVNULL, stderr=DEVNULL)
check_call(['file', filename], stdout=DEVNULL, stderr=DEVNULL)
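All of these check_* helpers are thin wrappers; since Python 3.5, the single recommended entry point is subprocess.run. check=True reproduces check_call’s raise-on-failure behavior, and capture_output=True captures stdout/stderr instead of discarding them. A minimal sketch:

```python
from subprocess import run

filename = '/tmp/some-file'
run(['touch', filename], check=True)

# check=True raises CalledProcessError on a nonzero exit status;
# capture_output=True collects output rather than throwing it away
res = run(['file', filename], capture_output=True, text=True, check=True)
print(res.stdout, end='')
```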

from gzip import open as gz_open

with gz_open('data/file0.csv.gz') as f:
    for line in f:
        print(f'{line = }')

from argparse import ArgumentParser
from pandas import read_csv, concat
from gzip import open as gz_open

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename')
parser.add_argument('filenames', nargs='+', help='input filenames')

if __name__ == '__main__':
    # args = parser.parse_args()
    args = parser.parse_args([
        'data/file0.csv.gz', 'data/file1.csv.gz',
        '-o', 'data/file.parquet',
    ])

    raw_data = {}
    for fn in args.filenames:
        with (open if not fn.endswith('.gz') else gz_open)(fn) as f:
            raw_data[fn] = (
                read_csv(f, parse_dates=['date'], index_col=['date', 'entity'])
                    .squeeze(axis='columns')
            )

    data = concat(raw_data.values())
    data.to_frame().to_parquet(args.output)

from argparse import ArgumentParser
from pandas import read_csv, concat, DataFrame
from gzip import open as gz_open
from dataclasses import dataclass

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename')
parser.add_argument('filenames', nargs='+', help='input filenames')

# from dataclasses import dataclass
# from collections import namedtuple
# from enum import Enum
# from functools import total_ordering
# from contextlib import contextmanager
@dataclass
class Dataset:
    data : DataFrame

    @classmethod
    def from_files(cls, filenames):
        raw_data = {}
        for fn in filenames:
            with (open if not fn.endswith('.gz') else gz_open)(fn) as f:
                raw_data[fn] = (
                    read_csv(f, parse_dates=['date'], index_col=['date', 'entity'])
                        .squeeze(axis='columns')
                )
        return cls(concat(raw_data.values()))

if __name__ == '__main__':
    # args = parser.parse_args()
    args = parser.parse_args([
        'data/file0.csv.gz', 'data/file1.csv.gz',
        '-o', 'data/file.parquet',
    ])

    ds = Dataset.from_files(args.filenames)
    ds.data.to_frame().to_parquet(args.output)

from argparse import ArgumentParser
from pandas import read_csv, concat, DataFrame
from gzip import open as gz_open
from dataclasses import dataclass

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename')
parser.add_argument('filenames', nargs='+', help='input filenames')

@dataclass
class Dataset:
    data : DataFrame

    @classmethod
    def from_files(cls, filenames):
        raw_data = {}
        for fn in filenames:
            with (open if not fn.endswith('.gz') else gz_open)(fn) as f:
                raw_data[fn] = (
                    read_csv(f, parse_dates=['date'], index_col=['date', 'entity'])
                        .squeeze(axis='columns')
                )
        return cls(concat(raw_data.values()))

def test_dataset_processor():
    filenames = ['data/file0.csv.gz', 'data/file1.csv.gz']
    output_filename = 'data/file.parquet'
    ds = Dataset.from_files(filenames)
    ds.data.to_frame().to_parquet(output_filename)

    with open(output_filename) as f:
        ...

from argparse import ArgumentParser
from pandas import read_csv, concat, DataFrame
from gzip import open as gz_open
from dataclasses import dataclass
from io import StringIO, IOBase, BytesIO

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename')
parser.add_argument('filenames', nargs='+', help='input filenames')

@dataclass
class Dataset:
    data : DataFrame

    @classmethod
    def from_files(cls, filenames):
        raw_data = {}
        for fn in filenames:
            if not isinstance(fn, IOBase):
                with (open if not fn.endswith('.gz') else gz_open)(fn) as f:
                    raw_data[fn] = (
                        read_csv(f, parse_dates=['date'], index_col=['date', 'entity'])
                            .squeeze(axis='columns')
                    )
            else:
                raw_data[fn] = (
                    read_csv(fn, parse_dates=['date'], index_col=['date', 'entity'])
                        .squeeze(axis='columns')
                )
        return cls(concat(raw_data.values()))

def test_dataset_processor():
    data0 = StringIO('''
        date,entity,value
        2020-01-01,abc,123
        2020-01-02,def,345
    ''')
    data1 = StringIO('''
        date,entity,value
        2020-01-03,xyz,879
    ''')
    ds = Dataset.from_files([data0, data1])
    out = BytesIO()
    ds.data.to_frame().to_parquet(out)

if __name__ == '__main__':
    test_dataset_processor()

from io import StringIO

class T:
    def f(self):
        match ...:
            case _ if ...:
                for _ in {...}:
                    data = StringIO('''
                        abc=123
                        def=456
                    ''')
        return data

print(
    # T().f().getvalue(),
    f'{T().f().getvalue() = }',
    sep='\n',
)

from io import StringIO
from textwrap import dedent

class T:
    def f(self):
        match ...:
            case _ if ...:
                for _ in {...}:
                    data = StringIO(dedent('''
                        abc=123
                            def=456
                        abc=123
                    ''').strip())
        return data

print(
    T().f().getvalue(),
    # f'{T().f().getvalue() = }',
    sep='\n',
)

from argparse import ArgumentParser
from pandas import read_csv, concat, DataFrame
from gzip import open as gz_open
from dataclasses import dataclass
from io import StringIO, IOBase, BytesIO
from textwrap import dedent

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename')
parser.add_argument('filenames', nargs='+', help='input filenames')

@dataclass
class Dataset:
    data : DataFrame

    @classmethod
    def from_files(cls, filenames):
        raw_data = {}
        for fn in filenames:
            if not isinstance(fn, IOBase):
                with (open if not fn.endswith('.gz') else gz_open)(fn) as f:
                    raw_data[fn] = (
                        read_csv(f, parse_dates=['date'], index_col=['date', 'entity'])
                            .squeeze(axis='columns')
                    )
            else:
                raw_data[fn] = (
                    read_csv(fn, parse_dates=['date'], index_col=['date', 'entity'])
                        .squeeze(axis='columns')
                )
        return cls(concat(raw_data.values()))

def test_dataset_processor():
    data0 = StringIO(dedent('''
        date,entity,value
        2020-01-01,abc,123
        2020-01-02,def,345
    ''').strip())
    data1 = StringIO(dedent('''
        date,entity,value
        2020-01-03,xyz,879
    ''').strip())
    ds = Dataset.from_files([data0, data1])
    out = BytesIO()
    ds.data.to_frame().to_parquet(out)

if __name__ == '__main__':
    test_dataset_processor()

def test_dataset_processor():
    pass

from io import StringIO
from pandas import read_csv, concat
from textwrap import dedent

def processor(*files):
    return (
        concat(
            read_csv(f, parse_dates=['date'], index_col=['date', 'entity'])
                .squeeze(axis='columns')
            for f in files
        )
        .sort_index()
        .groupby(['date', 'entity']).max()
    )

def test_dataset_processor():
    data0 = StringIO(dedent('''
        date,entity,value
        2020-01-01,abc,123
        2020-01-02,def,345
    ''').strip())
    data1 = StringIO(dedent('''
        date,entity,value
        2020-01-03,xyz,879
        2020-01-01,abc,999
    ''').strip())
    res = processor(data0, data1)
    assert not res.index.has_duplicates
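A test in this shape needs no runner boilerplate of its own: assuming pytest is installed, `python -m pytest path/to/the_script.py` will collect any test_* function and report the assertion’s outcome.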

from io import StringIO
from pandas import read_csv, concat
from textwrap import dedent

def processor(*files):
    """
    >>> data0 = StringIO(dedent('''
    ...    date,entity,value
    ...    2020-01-01,abc,123
    ...    2020-01-02,def,345
    ... ''').strip())
    >>> data1 = StringIO(dedent('''
    ...    date,entity,value
    ...    2020-01-03,xyz,879
    ...    2020-01-01,abc,999
    ... ''').strip())
    >>> res = processor(data0, data1)
    >>> res.index.has_duplicates
    False
    """
    return (
        concat(
            read_csv(f, parse_dates=['date'], index_col=['date', 'entity'])
                .squeeze(axis='columns')
            for f in files
        )
        .sort_index()
        .groupby(['date', 'entity']).max()
    )

from doctest import testmod
testmod()
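The same docstring examples can also be exercised without the explicit testmod() call: `python -m doctest path/to/the_script.py -v` runs every example in the module, with -v printing passing examples as well as failures.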

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename', default='/dev/stderr')
parser.add_argument('filenames', nargs='+', help='input filenames')

if __name__ == '__main__':
    args = parser.parse_args(['data/file0.csv', 'data/file1.csv'])
    print(f'{args = }')

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename', default=None)
parser.add_argument('filenames', nargs='+', help='input filenames')

if __name__ == '__main__':
    args = parser.parse_args(['data/file0.csv', 'data/file1.csv'])
    if args.output is None:
        args.output = '/tmp/some-file.parquet'
    print(f'{args = }')

from argparse import ArgumentParser
from os import mkdir, makedirs
from uuid import uuid4

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename', default=None)
parser.add_argument('filenames', nargs='+', help='input filenames')

if __name__ == '__main__':
    args = parser.parse_args(['data/file0.csv', 'data/file1.csv'])
    if args.output is None:
        dirname = f'/tmp/my-script/{uuid4()}'
        mkdir(dirname)  # fails if /tmp/my-script does not already exist
        # makedirs(dirname, exist_ok=True)  # creates intermediate directories too
        args.output = f'{dirname}/some-file.parquet'
    print(f'{args = }')

from argparse import ArgumentParser
from tempfile import TemporaryDirectory

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename', default=None)
parser.add_argument('filenames', nargs='+', help='input filenames')

if __name__ == '__main__':
    args = parser.parse_args(['data/file0.csv', 'data/file1.csv'])
    if args.output is None:
        path = TemporaryDirectory(prefix=f'my-proj.{__name__}')
        args.output = f'{path.name}/some-file.parquet'
    print(f'{args = }')

from tempfile import TemporaryDirectory
from os.path import exists

with TemporaryDirectory() as d:
    print(f'{d         = }')
    print(f'{exists(d) = }')
print(f'{exists(d) = }')
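For a single managed file rather than a whole directory, tempfile.NamedTemporaryFile follows the same pattern. A small sketch (the prefix and suffix here are illustrative); delete=False keeps the file on disk after the handle closes so a downstream step can pick it up by name:

```python
from tempfile import NamedTemporaryFile
from os.path import exists

with NamedTemporaryFile(prefix='my-proj.', suffix='.parquet', delete=False) as f:
    print(f'{f.name         = }')
print(f'{exists(f.name) = }')  # still there: delete=False
```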

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('--cache-dir', help='cache directory', default=None)
parser.add_argument('filenames', nargs='+', help='input filenames')

if __name__ == '__main__':
    args = parser.parse_args(['...', '...', '--cache-dir', '.tmp/'])
    cache_file = f'{args.cache_dir}/some-file.cache'

    print(f'{args = }', f'{cache_file = }', sep='\n')

from os.path import abspath, realpath
from tempfile import TemporaryDirectory
from subprocess import check_output

with TemporaryDirectory() as d:
    check_output(['touch', f'{d}/file'])
    check_output(['ln', '-s', f'{d}/file', f'{d}/link'])
    print(
        f'{abspath(f"{d}/link")  = }',
        f'{realpath(f"{d}/link") = }',
        sep='\n',
    )

from pathlib import Path
from tempfile import TemporaryDirectory
from subprocess import check_output

with TemporaryDirectory() as d:
    d = Path(d)
    check_output(['touch', d / 'file'])
    check_output(['ln', '-s', d / 'file', d / 'link'])
    print(
        f'{(d / "file").absolute() = }',
        f'{(d / "link").resolve()  = }',
        sep='\n',
    )
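In fact, both shell-outs can be dropped: pathlib has native equivalents for touch and ln -s, which sidesteps subprocess entirely for simple filesystem operations:

```python
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as d:
    d = Path(d)
    (d / 'file').touch()                 # replaces: touch <path>
    (d / 'link').symlink_to(d / 'file')  # replaces: ln -s <target> <link>
    print(
        f'{(d / "file").absolute() = }',
        f'{(d / "link").resolve()  = }',
        sep='\n',
    )
```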

from contextlib import contextmanager

@contextmanager
def ctx():
    yield

with ctx():
    pass

from contextlib import contextmanager

@contextmanager
def ctx():
    yield ...

def f(obj): pass

if True:
    with ctx() as obj:
        f(obj)
else:
    f(...)

from contextlib import ExitStack, contextmanager

@contextmanager
def ctx():
    yield ...

def f(obj):
    print(f'{obj = }')

with ExitStack() as stack:
    if True:
        obj = stack.enter_context(ctx())
    else:
        obj = ...
    f(obj)
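Since Python 3.7, contextlib.nullcontext covers this “maybe enter a context” case directly: it is a no-op context manager whose __enter__ returns whatever it was given, so both branches can share a single with statement:

```python
from contextlib import contextmanager, nullcontext

@contextmanager
def ctx():
    yield ...

def f(obj):
    print(f'{obj = }')

use_ctx = True
# nullcontext(x) enters immediately and yields x unchanged
with (ctx() if use_ctx else nullcontext(...)) as obj:
    f(obj)
```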

from argparse import ArgumentParser
from pandas import read_csv, concat, DataFrame
from gzip import open as gz_open
from dataclasses import dataclass
from contextlib import ExitStack
from io import IOBase
from pathlib import Path

parser = ArgumentParser()
parser.add_argument('-o', '--output', help='output filename', type=Path)
parser.add_argument('filenames', nargs='+', help='input filenames', type=Path)

@dataclass
class Dataset:
    data : DataFrame

    @classmethod
    def from_files(cls, files):
        raw_data = {}
        for f in files:
            with ExitStack() as stack:
                sources = [
                    f if isinstance(f, IOBase) else
                    stack.enter_context(
                        (open if '.gz' not in f.suffixes else gz_open)(f)
                    )
                ]
                for src in sources:
                    raw_data[f] = (
                        read_csv(src, parse_dates=['date'], index_col=['date', 'entity'])
                            .squeeze(axis='columns')
                    )
        return cls(concat(raw_data.values()))

if __name__ == '__main__':
    # args = parser.parse_args()
    args = parser.parse_args([
        'data/file0.csv.gz', 'data/file1.csv.gz',
        '-o', 'data/file.parquet',
    ])

    ds = Dataset.from_files(args.filenames)
    ds.data.to_frame().to_parquet(args.output)
    print(ds.data)

import itertools
import functools
import dataclasses
import enum
import pathlib
import argparse
import contextlib
import asyncio, threading, multiprocessing, concurrent.futures
import subprocess

import textwrap
import string