ts-python

Seminar XII: “The Boss Baby” (time & memory performance in pandas)

Tim (Miles Bakshi): Even though I never went to business school I did learn to share in kindergarten. And if there isn’t enough love for the two of us then I wanna give you all of mine.

The Boss Baby (2017)

Date Time Track Meeting Link
Fri Feb 11, 2022 9:30 AM EST pandas https://primetime.bluejeans.com/a2m/live-event/gtcykzxf

Audience

These sessions are designed for a broad audience of modelers and software programmers of all backgrounds and skill levels.

Our expected audience should comprise attendees with a…

During this session, we will endeavour to guide our audience to developing…

…and we will share additional tips, tricks, and in-depth guidance on all of these topics!

Abstract

In a previous episode, we discussed the design of the pandas API and the performance and optimisation limitations arising therefrom. In this episode, we’ll take a close look at common performance problems in pandas code, including code that uses too much memory and code that performs computational tasks inefficiently.

We’ll discuss how to spot common sources of performance (time or memory) problems, how to avoid them, and cases where you may need to employ more sophisticated techniques (such as tools like cython or numba) to improve the speed or reduce the memory usage of your code.

To Be Continued…

Did you enjoy this episode? Did you learn something new that will help you as you write faster, more memory-efficient pandas code in your work?

If so, stay tuned for future episodes, which may…

Contents

Notes

print("Let's get started!")

Question: “How do I work with you & not against you?” (Writing pandas code that will be amenable to optimisation.)

Follow-up:

print("Let's take a look!")

Is Python slow?

from _util import timed
from numpy.random import randint as np_randint
from random import randint as py_randint
from numpy import dot as np_dot

py_dot = lambda xs, ys: sum(x * y for x, y in zip(xs, ys))

with timed('py_xs = …'):
    py_xs = [py_randint(0, 1_000) for _ in range(100_000)]
    py_ys = [py_randint(0, 1_000) for _ in range(100_000)]

with timed('np_xs = …'):
    np_xs = np_randint(0, 1_000, size=100_000)
    np_ys = np_randint(0, 1_000, size=100_000)

with timed('py_dot(py_xs, py_ys)'):
    py_dot(py_xs, py_ys)

with timed('np_dot(np_xs, np_ys)'):
    np_dot(np_xs, np_ys)

with timed('np_dot(py_xs, py_ys)'):  # NumPy must first convert the Python lists into arrays
    np_dot(py_xs, py_ys)

with timed('py_dot(np_xs, np_ys)'):  # each element access boxes a NumPy scalar back into a Python object
    py_dot(np_xs, np_ys)

NumPy & pandas follow the ‘restricted computation domain’ pattern.

# “unrestricted”: plain Python objects, operated on one element at a time
xs = [..., ..., ...]
result = [op(x) for x in xs]

# “restricted computation domain”: hand the data over to a manager object once,
# then ask it to perform whole operations in bulk over its contiguous buffer
class Manager:
    def __init__(self, data):
        ...
    def op(self):
        ...

xs = Manager([..., ..., ...])
result = xs.op()
from numpy import array

xs = array([0, 1, 2])
print(
    f'{xs.__array_interface__["data"][0] = :#_x}',
    f'{xs.dtype = :}',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Separate pure Python “programme structuring” from NumPy/pandas “computation.”

from _util import timed
from pandas import DataFrame
from numpy.random import default_rng
from numpy import where

rng = default_rng(0)

df = DataFrame({
    'a': rng.uniform(-10, 10, size := 300_000),
    'b': rng.random(size=size),
    'c': rng.normal(size=size),
})

with timed('.apply'):
    df['d'] = df['c'].apply(lambda x: x**2 if x > 0 else x**3)

with timed('where'):
    df['d'] = where(df['c'] > 0, df['c'] ** 2, df['c'] ** 3)
from _util import timed
from pandas import DataFrame
from numpy.random import default_rng

rng = default_rng(0)

df = DataFrame({
    'a': rng.uniform(-10, 10, size := 300_000),
    'b': rng.random(size=size),
    'c': rng.normal(size=size),
})

raw_data = rng.normal(size=df.shape)

with timed('for-loop'):
    for name, col in df.items():
        df[name, 'abs'] = col.abs()

with timed('raw abs'):
    abs(raw_data)
from pandas import DataFrame, MultiIndex, date_range, Grouper
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=90)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

print(
    # df.head(5),
    df
        # .groupby([Grouper(level='date', freq='M'), 'asset']).mean()
        .groupby([df.index.get_level_values('date').to_period('M'), 'asset']).mean()
        .head(10)
    ,
    '\n'.join([
        f'{df.index.is_monotonic   = }',
        f'{df.index.has_duplicates = }',
    ]),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Question: “Should I keep on .rolling?” (Performance considerations in window operations, .groupby, .rolling, .expanding.)

Follow-up:

print("Let's take a look!")

There are (broadly) five types of operations you can do on a .groupby:

Given a DataFrame in which each group has N rows:

Therefore:

But note:

The .aggregate vs .transform distinction shows that the grouping of the operation is distinct from the grouping of the result.

operation   | operates on                 | indexing of result            | shape of operation
.aggregate  | individual columns (Series) | same as grouping              | N → 1
.transform  | individual columns (Series) | same as input data            | N → N
.apply      | entire group (DataFrame)    | grouping + … (anything goes)  | N → …
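
To make the table concrete, here is a minimal sketch on a tiny toy frame (invented purely for illustration, not part of the pricing data below) showing how the three verbs differ in the shape and indexing of their results:

from pandas import DataFrame

toy = DataFrame({'group': ['a', 'a', 'b', 'b'], 'value': [1, 2, 3, 4]})
g = toy.groupby('group')['value']

print(
    g.agg('sum'),                   # N → 1: one row per group, indexed by the grouping
    g.transform('sum'),             # N → N: broadcast back onto the original index
    g.apply(lambda s: s.tolist()),  # N → …: whatever the callable returns, one result per group
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)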
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, Grouper
from numpy import mean, cumsum
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

print(
    df.head(3),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

# with timed('.groupby(…).mean'):
#     df.groupby(Grouper(level='date', freq='M'))['price'].mean()

# with timed('.groupby(…).agg("mean")'):
#     df.groupby(Grouper(level='date', freq='M'))['price'].agg('mean')

# with timed('.groupby(…).agg(numpy.mean)'):
#     df.groupby(Grouper(level='date', freq='M'))['price'].agg(mean)

# with timed('.groupby(…).apply(numpy.mean)'):
#     df.groupby(Grouper(level='date', freq='M'))['price'].apply(mean)

with timed('.groupby(…).cumsum'):
    df.groupby('asset')['volume'].cumsum()

with timed('.groupby(…).transform("cumsum")'):
    df.groupby('asset')['volume'].transform('cumsum')

with timed('.groupby(…).transform(numpy.cumsum)'):
    df.groupby('asset')['volume'].transform(cumsum)

with timed('.groupby(…).transform(λ g: g.expanding().sum())'):
    df.groupby('asset')['volume'].transform(lambda g: g.expanding().sum())

with timed('.groupby(…).apply(numpy.cumsum)'):
    df.groupby('asset')['volume'].apply(cumsum)
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import skew

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

s = df.loc[IndexSlice[:, assets[0], :]]['price']

with timed('.rolling(…).mean()'):
    s.rolling(3).mean()

with timed('.rolling(…).agg(numpy.mean)'):
    s.rolling(3).agg(mean)

with timed('.rolling(…).agg(scipy.stats.skew)'):
    s.rolling(3).agg(skew)

with timed('.rolling(…).agg(λ g: g.mean())'):
    s.rolling(3).agg(lambda g: g.mean())
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import zscore

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

s = df.loc[IndexSlice[:, assets[0], :]]['price']

with timed('.expanding(…).mean()'):
    s.expanding().mean()

with timed('.expanding(…).agg(numpy.mean)'):
    s.expanding().agg(mean)

# with timed('.expanding(…).agg(zscore)'):
#     s.expanding().agg(zscore)

with timed('.expanding(…).agg(λ g: zscore(g)[-1])'):
    s.expanding().agg(lambda g: zscore(g)[-1])
from numpy import arange
from numpy.lib.stride_tricks import as_strided
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import skew

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

s = df.loc[IndexSlice[:, assets[0], :]]['price']
xs = s.array._ndarray

def rolling(arr, size):
    # overlapping, zero-copy windows via stride tricks; as_strided performs no bounds
    # checking, so the shape/stride arithmetic must be exact
    # (numpy.lib.stride_tricks.sliding_window_view is a safer alternative on NumPy >= 1.20)
    return as_strided(arr, shape=(arr.shape[0] - size + 1, size, *arr.shape[1:]),
                           strides=(arr.strides[0], *arr.strides))

with timed('.rolling(…).agg(skew)'):
    s.rolling(3).agg(skew)

with timed('rolling()'):
    skew(rolling(xs, 3), axis=-1)

Question: “So what can I do about it?” (Dropping down to numpy, using numba/numexpr/cython)

Follow-up:

print("Let's take a look!")
from _util import timed
from numpy import array, vectorize
from numpy.random import default_rng

rng = default_rng(0)

xs = rng.normal(size=1_000_000).round(2)

with timed('python'):
    ys = array([x + 1 for x in xs])

op = vectorize(lambda x: x + 1)  # note: numpy.vectorize is a convenience wrapper (a Python-level loop), not a compiled ufunc

with timed('numpy.vectorize'):
    zs = op(xs)

print(
    # f'{xs[:5] = }',
    # f'{ys[:5] = }',
    # f'{zs[:5] = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from numexpr import evaluate
from _util import timed
from numpy import array, polyval
from numpy.random import default_rng

rng = default_rng(0)

xs = rng.normal(size=1_000_000).round(2)

with timed('python'):
    ys = array([3 * x**2 + 2 * x + 1 for x in xs])

with timed('numpy'):
    ys = 3 * xs**2 + 2 * xs + 1

with timed('numpy.polyval'):
    ys = polyval([3, 2, 1], xs)

with timed('numexpr'):
    zs = evaluate('3 * xs**2 + 2 * xs + 1')

print(
    # f'{xs[:5] = }',
    # f'{ys[:5] = }',
    # f'{zs[:5] = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from numba import jit
from _util import timed
from numpy import array, empty_like
from numpy.random import default_rng

rng = default_rng(0)

xs = rng.normal(size=10_000_000).round(2)

def f(xs):
    ys = empty_like(xs)
    for idx, x in enumerate(xs):
        ys[idx] = x + 1
    return ys

with timed('Python'):
    f(xs)

f = jit(nopython=True)(f)
with timed('Numba'):  # note: this first call also includes the JIT-compilation time
    f(xs)
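
The question also names cython. As a rough sketch only (assuming the cython package is installed; the file name add_one_cy.py and the build command are illustrative, not part of the timed examples above), the same “+ 1” loop can be written in Cython’s “pure Python” mode — uncompiled it runs as ordinary, slow Python, while cythonize -i add_one_cy.py produces a C-compiled version in which the typed loop runs at native speed:

# add_one_cy.py — illustrative Cython “pure Python mode” sketch
import cython
from numpy import empty_like

@cython.boundscheck(False)
@cython.wraparound(False)
def add_one(xs: cython.double[:]):
    ys = empty_like(xs)
    out: cython.double[:] = ys  # typed memoryview onto the output buffer (a plain ndarray when uncompiled)
    i: cython.Py_ssize_t
    for i in range(xs.shape[0]):
        out[i] = xs[i] + 1.0
    return ys

Like the numba example above, the payoff only appears once the function is compiled; for one-off expressions, numexpr or plain NumPy is usually the simpler choice.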

Question: “What’s your type?” (Careful choice of pandas column types.)

Follow-up:

print("Let's take a look!")
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, Categorical, Series
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

# assets = Categorical(assets)
idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

print(
    # df.head(3),
    # df.memory_usage(),
    # df.memory_usage(deep=True),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

s = Series(
    index=(idx := rng.choice([*ascii_lowercase], size=(50_000, 4)).view('<U4').ravel()),
    data=rng.random(len(idx)),
)

s1, s2 = s.copy(), s.copy()
s1.index = Categorical(s1.index)

with timed('== Categorical'):
    s1.index == '.'

with timed('== str'):
    s2.index == '.'
from pandas import Series, to_numeric
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

s = Series(
    index=(idx := range(50_000)),
    # index=(idx := rng.choice([*ascii_lowercase], size=(50_000, 4)).view('<U4').ravel()),
    data=rng.integers(0, 10, size=len(idx)),
)

s1 = s.astype('int64')
s2 = s.astype('int32')
s3 = s.astype('int8')
s4 = to_numeric(s, downcast='integer')

print(
    f'{s1.memory_usage(deep=True) = :,}',
    f'{s2.memory_usage(deep=True) = :,}',
    f'{s3.memory_usage(deep=True) = :,}',
    f'{s4.memory_usage(deep=True) = :,}',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import Series, to_numeric
from numpy import arange, nan
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx := range(50_000)),
    data=rng.random(size=len(idx)),
)

s.iloc[rng.choice(arange(len(s)), size=len(s)*1//100)] = nan

s1 = s.copy()
s2 = s.astype('Sparse[float64, nan]')

print(
    f'{s1.memory_usage(deep=True) = :,}',
    f'{s2.memory_usage(deep=True) = :,}',
    f'{s2.sparse.density          = }',
    f'{s2.sparse.fill_value       = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import Series, to_numeric
from numpy import arange, nan
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

s = Series(
    index=(idx := range(50_000)),
    # data=rng.choice([*ascii_lowercase], size=(len(idx), 4)).view('<U4').ravel(),
    data=rng.choice([*ascii_lowercase], size=(len(idx), 40)).view('<U40').ravel(),
)

s1 = s.copy()
s2 = s.astype('category')
s3 = s.astype('string[pyarrow]')

print(
    f'{s1.memory_usage(deep=True) = :,}',
    f'{s2.memory_usage(deep=True) = :,}',
    f'{s3.memory_usage(deep=True) = :,}',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Question: “Do you get that reference?” (Avoiding copies & unneeded references.)

Follow-up:

print("Let's take a look!")
from pandas import read_csv
from pathlib import Path

data_dir = Path('data')
df = read_csv(data_dir / 'data.csv')
print(
    df.memory_usage()
)
from pandas import read_csv
from pathlib import Path

data_dir = Path('data')
for df in read_csv(data_dir / 'data.csv', chunksize=10_000, iterator=True):
    print(
        df.memory_usage()
    )
    break
from pandas import read_sql
from sqlite3 import connect
from pathlib import Path

data_dir = Path('data')
with connect(data_dir / 'data.sqlite') as conn:
    for df in read_sql('select * from data', conn, chunksize=10_000):
        print(
            df.memory_usage()
        )
        break
from pandas import read_csv
from pathlib import Path
from gc import get_referrers

data_dir = Path('data')
prev_df = None
for df in read_csv(data_dir / 'data.csv', chunksize=10_000, iterator=True):
    if prev_df is None:
        prev_df = df
    break

print(
    get_referrers(df)
)

del df

del & %xdel

Question: “Does this look normal to you?” (Data normalisation & pandas.)

Follow-up:

print("Let's take a look!")
  1. Primary key & atomic columns (→ 1st normal form)
  2. “Every non-trivial functional dependency either does not begin with a proper subset of a candidate key or ends with a prime attribute” (→ 2nd normal form)
  3. “Every non-trivial functional dependency either begins with a superkey or ends with a prime attribute” (→ 3rd normal form)
  4. “Every non-trivial functional dependency begins with a superkey” (→ Boyce-Codd normal form)
from pandas import DataFrame, date_range, Series, MultiIndex, Index, Categorical
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

dates = date_range('2000-01-01', freq='15T', periods=1_000)
assets = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
industries = rng.choice(['healthcare', 'energy', 'tech'], size=len(assets))
exchanges = rng.choice(['NYSE', 'CME'], size=len(assets))

df = DataFrame(
    index=(idx := MultiIndex.from_product([dates, assets], names=['date', 'asset'])),
    data={
        'price': (
            rng.uniform(50, 200, size=len(assets))
          * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
        ).ravel(),
        'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
    },
).join(
    Series(industries, index=Index(assets, name='asset'), name='industry'),
).join(
    Series(exchanges, index=Index(assets, name='asset'), name='exchange'),
)

df['industry'] = Categorical(df['industry'])
df['exchange'] = Categorical(df['exchange'])

print(
    df.head(3),
    df.memory_usage(deep=True),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import DataFrame, date_range, Series, MultiIndex, Index, Categorical, IndexSlice
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

dates = date_range('2000-01-01', freq='15T', periods=1_000)
assets = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
industries = rng.choice(['healthcare', 'energy', 'tech'], size=len(assets))
exchanges = rng.choice(['NYSE', 'CME'], size=len(assets))

df = DataFrame(
    index=(idx := MultiIndex.from_product([dates, assets], names=['date', 'asset'])),
    data={
        'price': (
            rng.uniform(50, 200, size=len(assets))
          * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
        ).ravel(),
        'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
    },
)

industry = Series(industries, index=Index(assets, name='asset'), name='industry')
exchange = Series(exchanges, index=Index(assets, name='asset'), name='exchange')

# industry = industry.astype('category')
# exchange = exchange.astype('category')

print(
    df.describe().T.convert_dtypes(),
    # df.memory_usage(deep=True),
    # industry.memory_usage(deep=True),
    # exchange.memory_usage(deep=True),

    # groups := [g.index for _, g in industry.to_frame().groupby('industry')],
    # DataFrame({k: df.loc[IndexSlice[:, v.index, :]].mean() for k, v in industry.to_frame().groupby('industry')}).T,
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)