pandas)Tim (Miles Bakshi): Even though I never went to business school I did learn to share in kindergarten. And if there isn’t enough love for the two of us then I wanna give you all of mine.
— The Boss Baby (2017)
| Date | Time | Track | Meeting Link |
|---|---|---|---|
| Fri Feb 11, 2022 | 9:30 AM EDT | pandas |
https://primetime.bluejeans.com/a2m/live-event/gtcykzxf |
These sessions are designed for a broad audience of modelers and software programmers of all backgrounds and skill levels.
Our expected audience should comprise attendees with a…
pandas.pandas analytical code.During this session, we will endeavour to guide our audience to developing…
pandas performance issues and strategies for addressing them.pandas and numpy.…and we will share additional tips, tricks, and in-depth guidance on all of these topics!
In a previous episode, we’ve discussed the design of the pandas API, and
performance and optimization limitations arising therefrom. In this episode,
we’ll take a close look at common performance problems in pandas code,
including problems where our code uses too much memory or problems where our
code performs computational tasks inefficiently.
We’ll discuss how to spot common sources of performance (time or memory)
problems, how to avoid them, and cases where you may need to employ more
sophisticated techniques (such as the use of tools like cython or numba) to
improve the speed or reduce the memory-usage of your code.
Did you enjoy this episode? Did you learn something new that will help you as you continue or begin to use window methods in your work?
If so, stay tuned for future episodes, which may…
pandas.pandas (e.g., pandas.Index) to work within the structure pandas provides to address common performance issues (without losing expressivity.)numpy or raw-C to address performance issues in pandas where the design of pandas may prevent us from working at that level.pandas)
pandas code that will be amenable to optimisation.).apply, .iteritems, .items, .iterrows).rolling?” (Performance considerations in window operations, .groupby, .rolling, .expanding.)pandas column types.)pandas.)print("Let's get started!")
pandas code that will be amenable to optimisation.)Question: “How do I work with you & not against you?” (Writing pandas code that will be amenable to optimisation.)
.apply, .iteritems, .items, or .iterrows?pandas?pandas? How do I write code that can later be assessed for performance?Follow-up:
Index types?pandas.api.extensions.ExtensionArrays or pandas.api.extensions.ExtensionDtype?pandas alternatives?print("Let's take a look!")
Is Python slow?
from _util import timed
from numpy.random import randint as np_randint
from random import randint as py_randint
from numpy import dot as np_dot
py_dot = lambda xs, ys: sum(x * y for x, y in zip(xs, ys))
with timed('py_xs = …'):
py_xs = [py_randint(0, 1_000) for _ in range(100_000)]
py_ys = [py_randint(0, 1_000) for _ in range(100_000)]
with timed('np_xs = …'):
np_xs = np_randint(0, 1_000, size=100_000)
np_ys = np_randint(0, 1_000, size=100_000)
with timed('py_dot(py_xs, py_ys)'):
py_dot(py_xs, py_ys)
with timed('np_dot(np_xs, np_ys)'):
np_dot(np_xs, np_ys)
with timed('np_dot(py_xs, py_ys)'):
np_dot(py_xs, py_ys)
with timed('py_dot(np_xs, np_ys)'):
py_dot(np_xs, np_ys)
NumPy & pandas follow the ‘restricted computation domain’ pattern.
xs = [..., ..., ...]
result = [op(x) for x in xs]
class Manager:
def __init__(self, data):
pass
def op(self):
...
xs = Manager([..., ..., ...])
result = xs.op()
from numpy import array
xs = array([0, 1, 2])
print(
f'{xs.__array_interface__["data"][0] = :#_x}',
f'{xs.dtype = :}',
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
Separate pure Python “programme structuring” from NumPy/pandas “computation.”
from _util import timed
from pandas import DataFrame
from numpy.random import default_rng
from numpy import where
rng = default_rng(0)
df = DataFrame({
'a': rng.uniform(-10, 10, size := 300_000),
'b': rng.random(size=size),
'c': rng.normal(size=size),
})
with timed('.apply'):
df['d'] = df['c'].apply(lambda x: x**2 if x > 0 else x**3)
with timed('where'):
df['d'] = where(df['c'] > 0, df['c'] ** 2, df['c'] ** 3)
from _util import timed
from pandas import DataFrame
from numpy.random import default_rng
rng = default_rng(0)
df = DataFrame({
'a': rng.uniform(-10, 10, size := 300_000),
'b': rng.random(size=size),
'c': rng.normal(size=size),
})
raw_data = rng.normal(size=df.shape)
with timed('for-loop'):
for name, col in df.items():
df[name, 'abs'] = col.abs()
with timed('raw abs'):
abs(raw_data)
from pandas import DataFrame, MultiIndex, date_range, Grouper
from numpy.random import default_rng
from string import ascii_lowercase
rng = default_rng(0)
assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=90)
idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])
df = DataFrame({
'price': (
rng.uniform(50, 200, size=len(assets))
* rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
).ravel(),
'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()
print(
# df.head(5),
df
# .groupby([Grouper(level='date', freq='M'), 'asset']).mean()
.groupby([df.index.get_level_values('date').to_period('M'), 'asset']).mean()
.head(10)
,
'\n'.join([
f'{df.index.is_monotonic = }',
f'{df.index.has_duplicates = }',
]),
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
.rolling?” (Performance considerations in window operations, .groupby, .rolling, .expanding.)Question: “Should I keep on .rolling?” (Performance considerations in window operations, .groupby, .rolling, .expanding.)
.groupby(…).aggregate, .groupby(…).transform, and .groupby(…).apply?.rolling and .expanding operations?numpy?Follow-up:
.groupby(…).aggregate in cases where we need to operate on multiple columns?.rolling and .expanding?print("Let's take a look!")
There are (broadly) five types operations you can do on a .groupby:
.sum, .mean, .median, &c: perform a (built-in) computation (usu. an aggregation).filter: filter on the groups with a user-defined function.aggregate: aggregate groups into a single value with a user-defined function.transform: transform groups with a user-defined function.apply: perform an arbitrary computation on groups with a user-defined functionGiven a DataFrame in which each group has N rows:
.aggregate requires a function that does an N → 1 operation.transform requires a function that does a N → N operation.apply requires a function that does a N → … operation (the result can have any number of rows)Therefore:
.aggregate will change the indexing of the input data, to match the specified grouping.transform will preserve the indexing of the input data.apply will concatenate the indexing of the specified grouping to whatever indexing the user-defined function returnsBut note:
.aggregate requires a function that operates on each column individually (i.e., takes a Series).transform requires a function that operates on each column individually (i.e., takes a Series).apply requires a function that operates on each group as a whole (i.e., takes a DataFrame)The .aggregate vs .transform distinction shows that the grouping of the operation is distinct from the grouping of the result.
| operation | operates on | indexing of result | shape of operation |
|---|---|---|---|
.aggregate |
individual columns (Series) |
same as grouping | N → 1 |
.transform |
individual columns (Series) |
same as input data | N → N |
.apply |
entire group (DataFrame) |
grouping + … (anything goes) | N → … |
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, Grouper
from numpy import mean, cumsum
from numpy.random import default_rng
from string import ascii_lowercase
rng = default_rng(0)
assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)
idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])
df = DataFrame({
'price': (
rng.uniform(50, 200, size=len(assets))
* rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
).ravel(),
'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()
print(
df.head(3),
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
# with timed('.groupby(…).mean'):
# df.groupby(Grouper(level='date', freq='M'))['price'].mean()
# with timed('.groupby(…).agg("mean")'):
# df.groupby(Grouper(level='date', freq='M'))['price'].agg('mean')
# with timed('.groupby(…).agg(numpy.mean)'):
# df.groupby(Grouper(level='date', freq='M'))['price'].agg(mean)
# with timed('.groupby(…).apply(numpy.mean)'):
# df.groupby(Grouper(level='date', freq='M'))['price'].apply(mean)
with timed('.groupby(…).cumsum'):
df.groupby('asset')['volume'].cumsum()
with timed('.groupby(…).transform("cumsum")'):
df.groupby('asset')['volume'].transform('cumsum')
with timed('.groupby(…).transform(numpy.cumsum)'):
df.groupby('asset')['volume'].transform(cumsum)
with timed('.groupby(…).transform(λ g: g.exanding().sum())'):
df.groupby('asset')['volume'].transform(lambda g: g.expanding().sum())
with timed('.groupby(…).apply(numpy.cumsum)'):
df.groupby('asset')['volume'].apply(cumsum)
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import skew
rng = default_rng(0)
assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)
idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])
df = DataFrame({
'price': (
rng.uniform(50, 200, size=len(assets))
* rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
).ravel(),
'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()
s = df.loc[IndexSlice[:, assets[0], :]]['price']
with timed('.rolling(…).mean()'):
s.rolling(3).mean()
with timed('.rolling(…).agg(numpy.mean)'):
s.rolling(3).agg(mean)
with timed('.rolling(…).agg(scipy.stats.skew)'):
s.rolling(3).agg(skew)
with timed('.rolling(…).agg(λ g: g.mean())'):
s.rolling(3).agg(lambda g: g.mean())
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import zscore
rng = default_rng(0)
assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)
idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])
df = DataFrame({
'price': (
rng.uniform(50, 200, size=len(assets))
* rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
).ravel(),
'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()
s = df.loc[IndexSlice[:, assets[0], :]]['price']
with timed('.expanding(…).mean()'):
s.expanding().mean()
with timed('.expanding(…).agg(numpy.mean)'):
s.expanding().agg(mean)
# with timed('.expanding(…).agg(zscore)'):
# s.expanding().agg(zscore)
with timed('.expanding(…).agg(λ g: zscore(g)[-1])'):
s.expanding().agg(lambda g: zscore(g)[-1])
from numpy import arange
from numpy.lib.stride_tricks import as_strided
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, IndexSlice
from numpy import mean
from numpy.random import default_rng
from string import ascii_lowercase
from scipy.stats import skew
rng = default_rng(0)
assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)
idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])
df = DataFrame({
'price': (
rng.uniform(50, 200, size=len(assets))
* rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
).ravel(),
'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()
s = df.loc[IndexSlice[:, assets[0], :]]['price']
xs = s.array._ndarray
def rolling(arr, size):
return as_strided(arr, shape=(arr.shape[0] - size + 1, size, *arr.shape[1:]),
strides=(arr.strides[0], *arr.strides))
with timed('.rolling(…).agg(skew)'):
s.rolling(3).agg(skew)
with timed('rolling()'):
skew(rolling(xs, 3), axis=-1)
numpy, using numba/numexpr/cython)Question: “So what can I do about it?” (Dropping down to numpy, using numba/numexpr/cython)
numpy.vectorize do? How is it useful?numba fit in? How might numba help?numexpr fit in? How might numexpr help?cython fit in? How might cython help?Follow-up:
numba, numexpr, or cython?print("Let's take a look!")
from _util import timed
from numpy import array, vectorize
from numpy.random import default_rng
rng = default_rng(0)
xs = rng.normal(size=1_000_000).round(2)
with timed('python'):
ys = array([x + 1 for x in xs])
op = vectorize(lambda x: x + 1)
with timed('numpy.vectorize'):
zs = op(xs)
print(
# f'{xs[:5] = }',
# f'{ys[:5] = }',
# f'{zs[:5] = }',
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from numexpr import evaluate
from _util import timed
from numpy import array, polyval
from numpy.random import default_rng
rng = default_rng(0)
xs = rng.normal(size=1_000_000).round(2)
with timed('python'):
ys = array([3 * x**2 + 2 * x + x for x in xs])
with timed('numpy'):
ys = 3 * xs**2 + 2 * xs + 1
with timed('numpy.polyval'):
ys = polyval([3, 2, 1], xs)
with timed('numexpr'):
zs = evaluate('3 * xs**2 + 2 * xs + 1')
print(
# f'{xs[:5] = }',
# f'{ys[:5] = }',
# f'{zs[:5] = }',
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from numba import jit
from _util import timed
from numpy import array, empty_like
from numpy.random import default_rng
rng = default_rng(0)
xs = rng.normal(size=10_000_000).round(2)
def f(xs):
ys = empty_like(xs)
for idx, x in enumerate(xs):
ys[idx] = x + 1
return ys
with timed('Python'):
f(xs)
f = jit(nopython=True)(f)
with timed('Numba'):
f(xs)
pandas column types.)Question: “What’s your type?” (Careful choice of pandas column types.)
pandas.Categorical, and how is it useful?str types? What is str[pyarrow]?Sparse[…] types, and how are they useful?pandas automatically downcast my types to save memory?Follow-up:
vaex and how can it be helpful in cases of very large data sets?dask and how can it be helpful in cases of very large data sets?print("Let's take a look!")
from _util import timed
from pandas import DataFrame, MultiIndex, date_range, Categorical, Series
from numpy.random import default_rng
from string import ascii_lowercase
rng = default_rng(0)
assets = rng.choice([*ascii_lowercase], size=(5, 4)).view('<U4').ravel()
dates = date_range('2000-01-01', periods=9_000)
# assets = Categorical(assets)
idx = MultiIndex.from_product([dates, assets], names=['date', 'asset'])
df = DataFrame({
'price': (
rng.uniform(50, 200, size=len(assets))
* rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
).ravel(),
'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
}, index=idx).sort_index()
print(
# df.head(3),
# df.memory_usage(),
# df.memory_usage(deep=True),
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
s = Series(
index=(idx := rng.choice([*ascii_lowercase], size=(50_000, 4)).view('<U4').ravel()),
data=rng.random(len(idx)),
)
s1, s2 = s.copy(), s.copy()
s1.index = Categorical(s1.index)
with timed('== Categorical'):
s1.index == '.'
with timed('== str'):
s2.index == '.'
from pandas import Series, to_numeric
from numpy.random import default_rng
from string import ascii_lowercase
rng = default_rng(0)
s = Series(
index=(idx := range(50_000)),
# index=(idx := rng.choice([*ascii_lowercase], size=(50_000, 4)).view('<U4').ravel()),
data=rng.integers(0, 10, size=len(idx)),
)
s1 = s.astype('int64')
s2 = s.astype('int32')
s3 = s.astype('int8')
s4 = to_numeric(s, downcast='integer')
print(
f'{s1.memory_usage(deep=True) = :,}',
f'{s2.memory_usage(deep=True) = :,}',
f'{s3.memory_usage(deep=True) = :,}',
f'{s4.memory_usage(deep=True) = :,}',
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import Series, to_numeric
from numpy import arange, nan
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
index=(idx := range(50_000)),
data=rng.random(size=len(idx)),
)
s.iloc[rng.choice(arange(len(s)), size=len(s)*1//100)] = nan
s1 = s.copy()
s2 = s.astype('Sparse[float64, nan]')
print(
f'{s1.memory_usage(deep=True) = :,}',
f'{s2.memory_usage(deep=True) = :,}',
f'{s2.sparse.density = }',
f'{s2.sparse.fill_value = }',
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import Series, to_numeric
from numpy import arange, nan
from numpy.random import default_rng
from string import ascii_lowercase
rng = default_rng(0)
s = Series(
index=(idx := range(50_000)),
# data=rng.choice([*ascii_lowercase], size=(len(idx), 4)).view('<U4').ravel(),
data=rng.choice([*ascii_lowercase], size=(len(idx), 40)).view('<U40').ravel(),
)
s1 = s.copy()
s2 = s.astype('category')
s3 = s.astype('string[pyarrow]')
print(
f'{s1.memory_usage(deep=True) = :,}',
f'{s2.memory_usage(deep=True) = :,}',
f'{s3.memory_usage(deep=True) = :,}',
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
Question: “Do you get that reference?” (Avoiding copies & unneeded references.)
Follow-up:
print("Let's take a look!")
from pandas import read_csv
from pathlib import Path
data_dir = Path('data')
df = read_csv(data_dir / 'data.csv')
print(
df.memory_usage()
)
from pandas import read_csv
from pathlib import Path
data_dir = Path('data')
for df in read_csv(data_dir / 'data.csv', chunksize=10_000, iterator=True):
print(
df.memory_usage()
)
break
from pandas import read_sql
from sqlite3 import connect
from pathlib import Path
data_dir = Path('data')
with connect(data_dir / 'data.sqlite') as conn:
for df in read_sql('select * from data', conn, chunksize=10_000):
print(
df.memory_usage()
)
break
from pandas import read_csv
from pathlib import Path
from gc import get_referrers
data_dir = Path('data')
prev_df = None
for df in read_csv(data_dir / 'data.csv', chunksize=10_000, iterator=True):
if prev_df is None:
prev_df = df
break
print(
get_referrers(df)
)
del df
del df
del df
del df
del df
del & %xdel
pandas.)Question: “Does this look normal to you?” (Data normalisation & pandas.)
pandas data?pandas data?Follow-up:
pandas.{DataFrame,Series}.join and pandas.merge operations?print("Let's take a look!")
from pandas import DataFrame, date_range, Series, MultiIndex, Index, Categorical
from numpy.random import default_rng
from string import ascii_lowercase
rng = default_rng(0)
dates = date_range('2000-01-01', freq='15T', periods=1_000)
assets = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
industries = rng.choice(['healthcare', 'energy', 'tech'], size=len(assets))
exchanges = rng.choice(['NYSE', 'CME'], size=len(assets))
df = DataFrame(
index=(idx := MultiIndex.from_product([dates, assets], names=['date', 'asset'])),
data={
'price': (
rng.uniform(50, 200, size=len(assets))
* rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
).ravel(),
'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
},
).join(
Series(industries, index=Index(assets, name='asset'), name='industry'),
).join(
Series(exchanges, index=Index(assets, name='asset'), name='exchange'),
)
df['industry'] = Categorical(df['industry'])
df['exchange'] = Categorical(df['exchange'])
print(
df.head(3),
df.memory_usage(deep=True),
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)
from pandas import DataFrame, date_range, Series, MultiIndex, Index, Categorical, IndexSlice
from numpy.random import default_rng
from string import ascii_lowercase
rng = default_rng(0)
dates = date_range('2000-01-01', freq='15T', periods=1_000)
assets = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
industries = rng.choice(['healthcare', 'energy', 'tech'], size=len(assets))
exchanges = rng.choice(['NYSE', 'CME'], size=len(assets))
df = DataFrame(
index=(idx := MultiIndex.from_product([dates, assets], names=['date', 'asset'])),
data={
'price': (
rng.uniform(50, 200, size=len(assets))
* rng.normal(loc=1, scale=0.01, size=(len(dates), len(assets))).cumprod(axis=0)
).ravel(),
'volume': rng.integers(1_000, 10_000, size=len(idx)).round(-2),
},
)
industry = Series(industries, index=Index(assets, name='asset'), name='industry')
exchange = Series(exchanges, index=Index(assets, name='asset'), name='exchange')
# industry = industry.astype('category')
# exchange = industry.astype('exchange')
print(
df.describe().T.convert_dtypes(),
# df.memory_usage(deep=True),
# industry.memory_usage(deep=True),
# exchange.memory_usage(deep=True),
# groups := [g.index for _, g in industry.to_frame().groupby('industry')],
# DataFrame({k: df.loc[IndexSlice[:, v.index, :]].mean() for k, v in industry.to_frame().groupby('industry')}).T,
sep='\n{}\n'.format('\N{box drawings light horizontal}' * 40),
)