ts-python

Seminar XII: “The Boss Baby” (time & memory performance in pandas)

Tim (Miles Bakshi): Even though I never went to business school I did learn to share in kindergarten. And if there isn’t enough love for the two of us then I wanna give you all of mine.

The Boss Baby (2017)

Date Time Track Meeting Link
Fri Feb 11, 2022 9:30 AM EST pandas https://primetime.bluejeans.com/a2m/live-event/gtcykzxf

Audience

These sessions are designed for a broad audience of modelers and software programmers of all backgrounds and skill levels.

Our expected audience should comprise attendees with a…

During this session, we will endeavour to guide our audience to developing…

…and we will share additional tips, tricks, and in-depth guidance on all of these topics!

Abstract

In a previous episode, we discussed the design of the pandas API and the performance and optimisation limitations arising therefrom. In this episode, we’ll take a close look at common performance problems in pandas code, including code that uses too much memory and code that performs computational tasks inefficiently.

We’ll discuss how to spot common sources of performance (time or memory) problems, how to avoid them, and cases where you may need to employ more sophisticated techniques (such as tools like cython or numba) to improve the speed or reduce the memory usage of your code.

To Be Continued…

Did you enjoy this episode? Did you learn something new that will help you as you write faster, more memory-efficient pandas code in your work?

If so, stay tuned for future episodes, which may…

Contents

Notes

print("Let's get started!")

Question: “How do I work with you & not against you?” (Writing pandas code that will be amenable to optimisation.)

Follow-up:

print("Let's take a look!")

Is Python slow?

from _util import timed
from numpy.random import randint as np_randint
from random import randint as py_randint
from numpy import dot as np_dot

py_dot = lambda xs, ys: sum(x * y for x, y in zip(xs, ys))

with timed('py_xs = …'):
    py_xs = [py_randint(0, 1_000) for _ in range(100_000)]
    py_ys = [py_randint(0, 1_000) for _ in range(100_000)]

with timed('np_xs = …'):
    np_xs = np_randint(0, 1_000, size=100_000)
    np_ys = np_randint(0, 1_000, size=100_000)

with timed('py_dot(py_xs, py_ys)'):
    py_dot(py_xs, py_ys)

with timed('np_dot(np_xs, np_ys)'):
    np_dot(np_xs, np_ys)

with timed('np_dot(py_xs, py_ys)'):  # NumPy must first convert the Python lists into arrays
    np_dot(py_xs, py_ys)

with timed('py_dot(np_xs, np_ys)'):  # each element access boxes a NumPy scalar back into a Python object
    py_dot(np_xs, np_ys)

NumPy & pandas follow the ‘restricted computation domain’ pattern.

# “unrestricted”: plain Python objects, operated on one element at a time
xs = [..., ..., ...]
result = [op(x) for x in xs]

# “restricted computation domain”: hand the data over to a manager object once,
# then ask it to perform whole operations in bulk over its contiguous buffer
class Manager:
    def __init__(self, data):
        ...
    def op(self):
        ...

xs = Manager([..., ..., ...])
result = xs.op()
from numpy import array

xs = array([0, 1, 2])
print(
    f'{xs.__array_interface__["data"][0] = :#_x}',
    f'{xs.dtype = :}',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Separate pure Python “programme structuring” from NumPy/pandas “computation.”

from _util import timed
from pandas import DataFrame
from numpy.random import default_rng
from numpy import where

rng = default_rng(0)

df = DataFrame({
    'a': rng.uniform(-10, 10, size := 300_000),
    'b': rng.random(size=size),
    'c': rng.normal(size=size),
})

with timed('.apply'):
    df['d'] = df['c'].apply(lambda x: x**2 if x > 0 else x**3)

with timed('where'):
    df['d'] = where(df['c'] > 0, df['c'] ** 2, df['c'] ** 3)
from _util import timed
from pandas import DataFrame
from numpy.random import default_rng

rng = default_rng(0)

df = DataFrame({
    'a': rng.uniform(-10, 10, size := 300_000),
    'b': rng.random(size=size),
    'c': rng.normal(size=size),
})

raw_data = rng.normal(size=df.shape)

with timed('for-loop'):
    for name, col in df.items():
        df[name, 'abs'] = col.abs()

with timed('raw abs'):
    abs(raw_data)
from pandas import DataFrame, MultiIndex, date_range, Grouper
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=90)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

print(
    # df.head(5),
    df
        # .groupby([Grouper(level='date', freq='M'), 'asset']).mean()
        .groupby([df.index.get_level_values('date').to_period('M'), 'asset']).mean()
        .head(10)
    ,
    '\n'.join([
        f'{df.index.is_monotonic   = }',
        f'{df.index.has_duplicates = }',
    ]),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Question: “Should I keep on .rolling?” (Performance considerations in window operations, .groupby, .rolling, .expanding.)

Follow-up:

print("Let's take a look!")

There are (broadly) five types of operations you can do on a .groupby:

Given a DataFrame in which each group has N rows:

Therefore:

But note:

The .aggregate vs .transform distinction shows that the grouping of the operation is distinct from the grouping of the result.

operation   | operates on                 | indexing of result            | shape of operation
.aggregate  | individual columns (Series) | same as grouping              | N → 1
.transform  | individual columns (Series) | same as input data            | N → N
.apply      | entire group (DataFrame)    | grouping + … (anything goes)  | N → …
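
To make the table concrete, here is a minimal sketch on a tiny toy frame (invented purely for illustration, not part of the pricing data below) showing how the three verbs differ in the shape and indexing of their results:

from pandas import DataFrame

toy = DataFrame({'group': ['a', 'a', 'b', 'b'], 'value': [1, 2, 3, 4]})
g = toy.groupby('group')['value']

print(
    g.agg('sum'),                   # N → 1: one row per group, indexed by the grouping
    g.transform('sum'),             # N → N: broadcast back onto the original index
    g.apply(lambda s: s.tolist()),  # N → …: whatever the callable returns, one result per group
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)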
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, Grouper
from numpy import mean, cumsum
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

print(
    df.head(3),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

# with timed('.groupby(…).mean'):
#     df.groupby(Grouper(level='date', freq='M'))['price'].mean()

# with timed('.groupby(…).agg("mean")'):
#     df.groupby(Grouper(level='date', freq='M'))['price'].agg('mean')

# with timed('.groupby(…).agg(numpy.mean)'):
#     df.groupby(Grouper(level='date', freq='M'))['price'].agg(mean)

# with timed('.groupby(…).apply(numpy.mean)'):
#     df.groupby(Grouper(level='date', freq='M'))['price'].apply(mean)

with timed('.groupby(…).cumsum'):
    df.groupby('asset')['volume'].cumsum()

with timed('.groupby(…).transform("cumsum")'):
    df.groupby('asset')['volume'].transform('cumsum')

with timed('.groupby(…).transform(numpy.cumsum)'):
    df.groupby('asset')['volume'].transform(cumsum)

with timed('.groupby(…).transform(λ g: g.expanding().sum())'):
    df.groupby('asset')['volume'].transform(lambda g: g.expanding().sum())

with timed('.groupby(…).apply(numpy.cumsum)'):
    df.groupby('asset')['volume'].apply(cumsum)
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import skew

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

s = df.loc[IndexSlice[:, assets[0], :]]['price']

with timed('.rolling(…).mean()'):
    s.rolling(3).mean()

with timed('.rolling(…).agg(numpy.mean)'):
    s.rolling(3).agg(mean)

with timed('.rolling(…).agg(scipy.stats.skew)'):
    s.rolling(3).agg(skew)

with timed('.rolling(…).agg(λ g: g.mean())'):
    s.rolling(3).agg(lambda g: g.mean())
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import zscore

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

s = df.loc[IndexSlice[:, assets[0], :]]['price']

with timed('.expanding(…).mean()'):
    s.expanding().mean()

with timed('.expanding(…).agg(numpy.mean)'):
    s.expanding().agg(mean)

# with timed('.expanding(…).agg(zscore)'):
#     s.expanding().agg(zscore)

with timed('.expanding(…).agg(λ g: zscore(g)[-1])'):
    s.expanding().agg(lambda g: zscore(g)[-1])
from numpy import arange
from numpy.lib.stride_tricks import as_strided
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import skew

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

s = df.loc[IndexSlice[:, assets[0], :]]['price']
xs = s.array._ndarray

def rolling(arr, size):
    # overlapping, zero-copy windows via stride tricks; as_strided performs no bounds
    # checking, so the shape/stride arithmetic must be exact
    # (numpy.lib.stride_tricks.sliding_window_view is a safer alternative on NumPy >= 1.20)
    return as_strided(arr, shape=(arr.shape[0] - size + 1, size, *arr.shape[1:]),
                           strides=(arr.strides[0], *arr.strides))

with timed('.rolling(…).agg(skew)'):
    s.rolling(3).agg(skew)

with timed('rolling()'):
    skew(rolling(xs, 3), axis=-1)

Question: “So what can I do about it?” (Dropping down to numpy, using numba/numexpr/cython)

Follow-up:

print("Let's take a look!")
from _util import timed
from numpy import array, vectorize
from numpy.random import default_rng

rng = default_rng(0)

xs = rng.normal(size=1_000_000).round(2)

with timed('python'):
    ys = array([x + 1 for x in xs])

op = vectorize(lambda x: x + 1)  # note: numpy.vectorize is a convenience wrapper (a Python-level loop), not a compiled ufunc

with timed('numpy.vectorize'):
    zs = op(xs)

print(
    # f'{xs[:5] = }',
    # f'{ys[:5] = }',
    # f'{zs[:5] = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from numexpr import evaluate
from _util import timed
from numpy import array, polyval
from numpy.random import default_rng

rng = default_rng(0)

xs = rng.normal(size=1_000_000).round(2)

with timed('python'):
    ys = array([3 * x**2 + 2 * x + 1 for x in xs])

with timed('numpy'):
    ys = 3 * xs**2 + 2 * xs + 1

with timed('numpy.polyval'):
    ys = polyval([3, 2, 1], xs)

with timed('numexpr'):
    zs = evaluate('3 * xs**2 + 2 * xs + 1')

print(
    # f'{xs[:5] = }',
    # f'{ys[:5] = }',
    # f'{zs[:5] = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from numba import jit
from _util import timed
from numpy import array, empty_like
from numpy.random import default_rng

rng = default_rng(0)

xs = rng.normal(size=10_000_000).round(2)

def f(xs):
    ys = empty_like(xs)
    for idx, x in enumerate(xs):
        ys[idx] = x + 1
    return ys

with timed('Python'):
    f(xs)

f = jit(nopython=True)(f)
with timed('Numba'):  # note: this first call also includes the JIT-compilation time
    f(xs)
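
The question also names cython. As a rough sketch only (assuming the cython package is installed; the file name add_one_cy.py and the build command are illustrative, not part of the timed examples above), the same “+ 1” loop can be written in Cython’s “pure Python” mode — uncompiled it runs as ordinary, slow Python, while cythonize -i add_one_cy.py produces a C-compiled version in which the typed loop runs at native speed:

# add_one_cy.py — illustrative Cython “pure Python mode” sketch
import cython
from numpy import empty_like

@cython.boundscheck(False)
@cython.wraparound(False)
def add_one(xs: cython.double[:]):
    ys = empty_like(xs)
    out: cython.double[:] = ys  # typed memoryview onto the output buffer (a plain ndarray when uncompiled)
    i: cython.Py_ssize_t
    for i in range(xs.shape[0]):
        out[i] = xs[i] + 1.0
    return ys

Like the numba example above, the payoff only appears once the function is compiled; for one-off expressions, numexpr or plain NumPy is usually the simpler choice.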

Question: “What’s your type?” (Careful choice of pandas column types.)

Follow-up:

print("Let's take a look!")
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, Categorical, Series
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)

# assets = Categorical(assets)
idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])

df = DataFrame({
    'price': (
        rng.uniform(50, 200, size=len(assets))
      * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
    ).ravel(),
    'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()

print(
    # df.head(3),
    # df.memory_usage(),
    # df.memory_usage(deep=True),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

s = Series(
    index=(idx := rng.choice([*ascii_lowercase], size=(50_000, 4)).view('<U4').ravel()),
    data=rng.random(len(idx)),
)

s1, s2 = s.copy(), s.copy()
s1.index = Categorical(s1.index)

with timed('== Categorical'):
    s1.index == '.'

with timed('== str'):
    s2.index == '.'
from pandas import Series, to_numeric
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

s = Series(
    index=(idx := range(50_000)),
    # index=(idx := rng.choice([*ascii_lowercase], size=(50_000, 4)).view('<U4').ravel()),
    data=rng.integers(0, 10, size=len(idx)),
)

s1 = s.astype('int64')
s2 = s.astype('int32')
s3 = s.astype('int8')
s4 = to_numeric(s, downcast='integer')

print(
    f'{s1.memory_usage(deep=True) = :,}',
    f'{s2.memory_usage(deep=True) = :,}',
    f'{s3.memory_usage(deep=True) = :,}',
    f'{s4.memory_usage(deep=True) = :,}',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import Series, to_numeric
from numpy import arange, nan
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx := range(50_000)),
    data=rng.random(size=len(idx)),
)

s.iloc[rng.choice(arange(len(s)), size=len(s)*1//100)] = nan

s1 = s.copy()
s2 = s.astype('Sparse[float64, nan]')

print(
    f'{s1.memory_usage(deep=True) = :,}',
    f'{s2.memory_usage(deep=True) = :,}',
    f'{s2.sparse.density          = }',
    f'{s2.sparse.fill_value       = }',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import Series, to_numeric
from numpy import arange, nan
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

s = Series(
    index=(idx := range(50_000)),
    # data=rng.choice([*ascii_lowercase], size=(len(idx), 4)).view('<U4').ravel(),
    data=rng.choice([*ascii_lowercase], size=(len(idx), 40)).view('<U40').ravel(),
)

s1 = s.copy()
s2 = s.astype('category')
s3 = s.astype('string[pyarrow]')

print(
    f'{s1.memory_usage(deep=True) = :,}',
    f'{s2.memory_usage(deep=True) = :,}',
    f'{s3.memory_usage(deep=True) = :,}',
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)

Question: “Do you get that reference?” (Avoiding copies & unneeded references.)

Follow-up:

print("Let's take a look!")
from pandas import read_csv
from pathlib import Path

data_dir = Path('data')
df = read_csv(data_dir / 'data.csv')
print(
    df.memory_usage()
)
from pandas import read_csv
from pathlib import Path

data_dir = Path('data')
for df in read_csv(data_dir / 'data.csv', chunksize=10_000, iterator=True):
    print(
        df.memory_usage()
    )
    break
from pandas import read_sql
from sqlite3 import connect
from pathlib import Path

data_dir = Path('data')
with connect(data_dir / 'data.sqlite') as conn:
    for df in read_sql('select * from data', conn, chunksize=10_000):
        print(
            df.memory_usage()
        )
        break
from pandas import read_csv
from pathlib import Path
from gc import get_referrers

data_dir = Path('data')
prev_df = None
for df in read_csv(data_dir / 'data.csv', chunksize=10_000, iterator=True):
    if prev_df is None:
        prev_df = df
    break

print(
    get_referrers(df)
)

del df

del & %xdel

Question: “Does this look normal to you?” (Data normalisation & pandas.)

Follow-up:

print("Let's take a look!")
  1. Primary key & atomic columns (→ 1st normal form)
  2. “Every non-trivial functional dependency either does not begin with a proper subset of a candidate key or ends with a prime attribute” (→ 2nd normal form)
  3. “Every non-trivial functional dependency either begins with a superkey or ends with a prime attribute” (→ 3rd normal form)
  4. “Every non-trivial functional dependency begins with a superkey” (→ Boyce-Codd normal form)
from pandas import DataFrame, date_range, Series, MultiIndex, Index, Categorical
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

dates = date_range('2000-01-01', freq='15T', periods=1_000)
assets = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
industries = rng.choice(['healthcare', 'energy', 'tech'], size=len(assets))
exchanges = rng.choice(['NYSE', 'CME'], size=len(assets))

df = DataFrame(
    index=(idx := MultiIndex.from_product([dates, assets], names=['date', 'asset'])),
    data={
        'price': (
            rng.uniform(50, 200, size=len(assets))
          * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
        ).ravel(),
        'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
    },
).join(
    Series(industries, index=Index(assets, name='asset'), name='industry'),
).join(
    Series(exchanges, index=Index(assets, name='asset'), name='exchange'),
)

df['industry'] = Categorical(df['industry'])
df['exchange'] = Categorical(df['exchange'])

print(
    df.head(3),
    df.memory_usage(deep=True),
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import DataFrame, date_range, Series, MultiIndex, Index, Categorical, IndexSlice
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

dates = date_range('2000-01-01', freq='15T', periods=1_000)
assets = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
industries = rng.choice(['healthcare', 'energy', 'tech'], size=len(assets))
exchanges = rng.choice(['NYSE', 'CME'], size=len(assets))

df = DataFrame(
    index=(idx := MultiIndex.from_product([dates, assets], names=['date', 'asset'])),
    data={
        'price': (
            rng.uniform(50, 200, size=len(assets))
          * rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
        ).ravel(),
        'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
    },
)

industry = Series(industries, index=Index(assets, name='asset'), name='industry')
exchange = Series(exchanges, index=Index(assets, name='asset'), name='exchange')

# industry = industry.astype('category')
# exchange = exchange.astype('category')

print(
    df.describe().T.convert_dtypes(),
    # df.memory_usage(deep=True),
    # industry.memory_usage(deep=True),
    # exchange.memory_usage(deep=True),

    # groups := [g.index for _, g in industry.to_frame().groupby('industry')],
    # DataFrame({k: df.loc[IndexSlice[:, v.index, :]].mean() for k, v in industry.to_frame().groupby('industry')}).T,
    sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)