Topics: pandas, code review, mistakes
“Why do my code reviews keep pointing out pandas issues?”
If you’ve ever had your code sent back with feedback on pandas usage, this seminar is for you. We’ll dive into the most common mistakes that modelers make with pandas during code review. From mishandling apply functions to inefficient joins and unclear indexing, you’ll learn to spot and fix these pitfalls before your reviewers do.
But it’s not just about avoiding mistakes—it’s about developing an intuition for pandas. We’ll discuss the “why” behind best practices, helping you understand the trade-offs in speed, readability, and maintainability. By the end, you’ll have a stronger grasp of pandas workflows and be better equipped to write production-ready code that won’t raise eyebrows during reviews.
python -m pip install pandas==2.0.3 matplotlib==3.9.* duckdb==1.1.3 "numpy<2"
print("Let's take a look!")
When working with SQL databases and Pandas, it’s common to filter data. However, placing the filter on the wrong side—such as in Pandas instead of SQL—can significantly impact performance. Filtering in SQL leverages the database engine’s optimizations, reducing the data transferred into Python memory. A misplaced filter in Pandas means loading the entire dataset first, which is inefficient and can strain system resources. Always consider whether your filtering logic belongs in the SQL query or in your DataFrame manipulations.
from pathlib import Path
from pandas import DataFrame, date_range, read_sql
from numpy.random import default_rng
from sqlite3 import connect
from contextlib import closing
rng = default_rng(0)
df = DataFrame(
    data={
        'timestamp': date_range('2000-01', freq='s', periods=(n := 8_000_000)),
        'group': rng.choice([*'abcdefghijklmnop'], size=n),
        'values': rng.normal(100, 5, size=n),
    },
)
data_dir = Path('data')
data_dir.mkdir(exist_ok=True)
with closing(connect(data_dir / 'some.db')) as conn:
    df.to_sql('mytable', conn, if_exists='replace', index=False)
print(df.head())
pandas is often referred to as a “tail-end” analytical tool. For large data pipelines it may not be the best tool for working at scale, but at some point you are going to want the vast flexibility that it offers to perform your work.
from pandas import read_sql
from sqlite3 import connect
from contextlib import closing
# with closing(connect('data/some.db')) as conn:  # Review
#     df = read_sql('select * from mytable', con=conn)
#     df = df.loc[df['group'] == 'c']
#     print(df)
with closing(connect('data/some.db')) as conn:  # Review
    df = read_sql("select * from mytable where [group] = 'c'", con=conn)
    print(df)
aside: using prepared statements during interop
from pandas import read_sql
from sqlite3 import connect
from contextlib import closing
subgroup = "c" # Review
with closing(connect('data/some.db')) as conn:
    # df = read_sql(
    #     f'''
    #     select * from mytable
    #     where ([group] = {subgroup!r})
    #     ''',
    #     con=conn,
    # )
    df = read_sql(
        '''
        select * from mytable
        where ([group] = :subgroup)
        ''',
        con=conn,
        params={'subgroup': subgroup},
    )
print(df)
Using prepared statements is not just about security—it also improves maintainability and clarity in code. By parameterizing queries instead of embedding variables directly, you safeguard against SQL injection vulnerabilities and ensure the code handles special characters correctly. This approach is particularly valuable when dealing with user-supplied inputs or dynamic query conditions.
left, right, inner, outer, semi, anti, cross: joins are easy, right?
from pandas import DataFrame, Timestamp, to_datetime
signal_source = DataFrame({
    'timestamp': to_datetime([
        '2000-01-01 09:00:00', '2000-01-01 09:03:00', '2000-01-01 09:04:30'
    ]),
    'signal value': [100, 121, 141],
    'signal type': ['A', 'B', 'A'],
})
# print(signal_source)
# Review → find closest signal to target_timestamp
target_timestamp = Timestamp('2000-01-01 09:03:10')
target_delta = target_timestamp - signal_source['timestamp']
print(
    signal_source.loc[target_delta.abs().idxmin()]
)
But we often don’t have just one target timestamp; we have many…
from pandas import DataFrame, date_range, to_datetime, concat
signal_source = DataFrame({
    'timestamp': to_datetime([
        '2000-01-01 09:00:00', '2000-01-01 09:03:00', '2000-01-01 09:04:30'
    ]),
    'signal value': [100, 121, 141],
    'signal type': ['A', 'B', 'A'],
})
trades = DataFrame({
    'timestamp': to_datetime([
        '2000-01-01 09:01', '2000-01-01 09:02', '2000-01-01 09:04', '2000-01-01 09:10'
    ]),
    'trade_volume': [10, 20, 15, 14]
})
# parts = []  # Review
# for idx, row in trades.iterrows():
#     target_deltas = (row['timestamp'] - signal_source['timestamp'])
#     parts.append(
#         signal_source.loc[target_deltas.abs().idxmin()]
#     )
# print(DataFrame(parts))
from pandas import merge_asof
print(
    merge_asof(trades, signal_source, on='timestamp', direction='backward')
)
For operations like inequality joins—matching rows where a value falls within a range—pandas can become cumbersome and slow. Tools like DuckDB handle such cases more efficiently, especially with large datasets: its SQL syntax simplifies expressing these joins, and its performance is typically superior, reducing both execution time and memory usage. A short DuckDB sketch follows the pandas cross-join example below.
Instead of finding the closest signal when provided some timestamps, let’s find ALL windows of signals that each timestamp fits into.
from pandas import DataFrame, date_range, to_datetime
signal_source = DataFrame({
    'start_time': to_datetime([
        '2000-01-01 09:00:00', '2000-01-01 09:03:00', '2000-01-01 09:04:30'
    ]),
    'end_time': to_datetime([
        '2000-01-01 09:03:30', '2000-01-01 09:05:00', '2000-01-01 09:06:00'
    ]),
    'signal value': [100, 121, 141],
    'signal type': ['A', 'B', 'A'],
})
trades = DataFrame({
    'timestamp': to_datetime([
        '2000-01-01 09:03:00', '2000-01-01 09:04:00', '2000-01-01 09:05:00',
    ]),
    'trade_volume': [10, 15, 20],
    'price': [90, 92, 93],
})
# print(
#     'Price Data:',
#     signal_source,
#     '',
#     'Trade Data:',
#     trades,
#     sep='\n',
#     end='\n'*3,
# )
# Review — align each trade against its corresponding signal given their start/end times
# from pandas import merge_asof
# print(
#     merge_asof(trades, signal_source, left_on='timestamp', right_on='start_time')
# )
# print(
#     trades.merge(signal_source, how='cross')
#     .loc[lambda d: d['timestamp'].between(d['start_time'], d['end_time'])]
# )
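For comparison, here is a minimal DuckDB sketch of the same range join (not part of the original example), reusing the trades and signal_source frames defined above; duckdb.sql can reference in-scope DataFrames by name, and .df() converts the result back to pandas.
import duckdb
print(
    duckdb.sql('''
        select t.*, s.*
        from trades t
        join signal_source s
            on t.timestamp between s.start_time and s.end_time
    ''').df()
)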
print("Let's take a look!")
Pandas’ SettingWithCopyWarning often confuses users. This warning occurs when you attempt to modify a view of a DataFrame rather than its copy, potentially leading to unpredictable results. Defensive programming techniques, like using .copy() or applying transformations across entire columns, can help you avoid this trap. For instance, instead of modifying a subset, create a new DataFrame for safe manipulation or use masks for row-wise operations.
from pandas import DataFrame, date_range
from numpy.random import default_rng
rng = default_rng(0)
df = DataFrame(
    index=date_range('2000-01', freq='D', periods=(n := 100)),
    data={
        'group': rng.choice(['a', 'b', 'c', 'd'], size=n),
        'values': rng.normal(100, 5, size=n),
    },
)
# Review
# subset = df.loc[df['group'] == 'c']
# subset['values'] = -subset['values']
# defensive copy → create independent subset
# subset = df.loc[df['group'] == 'c'].copy()
# subset['values'] = -subset['values']
# operate along the whole column → parent new column
# df['values'] = df['values'].mask(df['group'] == 'c', -df['values'])
# reuse & combine masks → parent in place update
# mask = df['group'] == 'c'
# df.loc[mask, 'values'] *= -1
print(
    ' Parent '.center(40, '\N{box drawings double horizontal}'),
    df.head(),
    # ' Subset '.center(40, '\N{box drawings double horizontal}'),
    # subset.head(),
    sep='\n'
)
While the inplace=True parameter is convenient, it’s often better to avoid it. Modifying objects in place can make debugging harder, as it obscures the flow of data transformations. Explicit assignments, though slightly more verbose, lead to clearer and more maintainable code.
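As a small illustration (a throwaway frame, not part of the example below): rename accepts inplace=True, but the explicit assignment keeps the transformation visible and chainable.
from pandas import DataFrame
tmp = DataFrame({'Price Range': ['100-120', '121-140']})
# in place: mutates tmp and returns None, which silently breaks method chaining
# tmp.rename(columns={'Price Range': 'price_range'}, inplace=True)
# explicit assignment: every step returns a new object we can inspect or chain
tmp = tmp.rename(columns={'Price Range': 'price_range'})
print(tmp)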
from pandas import DataFrame, date_range, NA
from numpy.random import default_rng
rng = default_rng(0)
df = DataFrame(
    index=date_range('2000-01', freq='D', periods=(n := 3)),
    data={
        'price_range': ['100-120', '121-140', '141-160']
    },
)
df['low'] = (  # Review
    df['price_range'].str.split('-', expand=True)
    .drop(columns=[1])
)
print(df)
Wide tables with many columns can be unwieldy, especially if columns represent repeated patterns (e.g., monthly sales). Stacking data into a tall format with fewer columns and more rows often simplifies analysis and supports advanced operations like grouping or pivoting.
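As a minimal sketch with made-up monthly sales columns, melt turns one column per month into one row per (region, month) pair:
from pandas import DataFrame
wide = DataFrame({
    'region': ['east', 'west'],
    'sales_jan': [100, 90],
    'sales_feb': [110, 95],
})
tall = wide.melt(id_vars='region', var_name='month', value_name='sales')
print(tall)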
from pandas import DataFrame, date_range
from numpy.random import default_rng
from time import sleep
rng = default_rng(0)
df = DataFrame({
    'timestamp': date_range(start='2000-01-03', periods=10, freq='6H', tz='US/Eastern'),
    'price': rng.uniform(100, 200, size=10).round(2)
})
sleep(5)  # the data takes a while to load
df = df.assign(  # Review
    price_normed=lambda d: d['price'] / d.groupby(d['timestamp'].dt.floor('D'))['price'].transform('max')
)
print(df)
# the original price values, not the normalized ones
Broadly speaking, we want to ensure our functions do not produce side effects on the objects they operate on. This is especially true when working with data, where shared state is touched in multiple places within a given pipeline. We can help enforce this pattern by not assuming column names within a function and by ensuring our functions operate on Series and return Series instead of DataFrames.
from pandas import DataFrame, date_range
from numpy.random import default_rng
rng = default_rng(0)
df = DataFrame({
'timestamp': date_range(start='2000-01-03', periods=10, freq='6H', tz='US/Eastern'),
'price': rng.uniform(100, 200, size=10).round(2)
})
# Review
def daily_averages(df):
    """
    if you want to use this function, name your datetime column "timestamp"
    """
    df['price'] = df.resample('D', on='timestamp')['price'].transform('mean')
    return df
def daily_averages(df, on, values):
    return df.resample('D', on=on)[values].transform('mean')
print(
    # df.assign(
    #     avg_daily_price=lambda d: daily_averages(d, on='timestamp', values=['price', 'price'])
    # )
    df.join(
        daily_averages(df, on='timestamp', values='price')
        .rename('avg_daily_price')
    )
)
print("Let's take a look!")
When it comes to .groupby operations, we need to keep two very important things in mind: choose the right verb for the job (agg, transform, or apply), and avoid the catch-all .apply if possible.
from pandas import MultiIndex
from numpy import ascontiguousarray
from numpy.random import default_rng
from string import ascii_lowercase
rng = default_rng(0)
pool = (
    ascontiguousarray(
        MultiIndex.from_product([list(ascii_lowercase)] * 4)
        .to_frame()
        .to_numpy()
    )
    .astype('<U1')
    .view('<U4')
    .ravel()
)
from pandas import MultiIndex, DataFrame, date_range
cardinality = 10_000
categories = rng.choice(pool, size=cardinality, replace=False)
dates = date_range('2000-01-01', freq='h', periods=2_000)
index = MultiIndex.from_product([categories, dates], names=['groupid', 'date'])
df = (
    DataFrame(
        index=index,
        data={
            'xs': (xs := rng.normal(5_000, 200, size=len(index))),
            'ys': 5 * xs + 2_000 + rng.normal(0, 400, size=len(index)),
            'weights': rng.uniform(0, 2, size=len(index)),
        }
    )
    .sample(frac=.7, random_state=rng)
    .reset_index()
)
from pathlib import Path
data_dir = Path('data')
data_dir.mkdir(exist_ok=True)
df.to_pickle(data_dir / 'large-groups.pkl')
print(df.head())
verbs: agg, transform, apply
from pandas import read_pickle
from utils import timed
df = read_pickle('data/large-groups.pkl')
def udf(data):
    return data.mean()
# Review
# with timed('apply'):
#     res = df.groupby('groupid')['ys'].apply(udf)
# with timed('agg'):
#     res = df.groupby('groupid')['ys'].agg(udf)
with timed('agg'):
    res = df.groupby('groupid')['ys'].agg('mean')
# aggregate → moving from many rows per group to 1 row per group
# transform → many rows per group to same number of rows per group
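To make the shape difference concrete, a tiny sketch on a toy frame (not the benchmark data):
from pandas import DataFrame
toy = DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})
print(toy.groupby('g')['x'].agg('mean'))        # one row per group
print(toy.groupby('g')['x'].transform('mean'))  # one row per original row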
from pandas import read_pickle
from utils import timed
df = read_pickle('data/large-groups.pkl').sample(frac=.1, random_state=0)
def udf(data):
    return (data['ys'] * data['weights']).sum() / data['weights'].sum()
# Review
with timed('apply'):
    res = df.groupby('groupid').apply(udf)
with timed('agg + intermediate'):
    res = (
        df.assign(
            tmp=lambda d: d['ys'] * d['weights']
        )
        .groupby('groupid')
        .agg({'tmp': 'sum', 'weights': 'sum'})
        .pipe(lambda d: d['tmp'] / d['weights'])
    )
print(res)
print("Let's take a look!")
pandas has many data types; most notably, there is the 'object' dtype. This is the datatype we want to avoid as much as possible, because it means all of the operations we want to perform are going to be no more performant than using a simple Python list.
from pandas import Series
from numpy.random import default_rng
from string import ascii_lowercase, ascii_uppercase
from utils import timed
rng = default_rng(0)
pool = [*ascii_lowercase, *ascii_uppercase]
s = Series(
    rng.choice(pool, size=(1_000_000, 10)).view('<U10').ravel()
)
s = s.astype('string')
# Review
with timed('chained str'):
    s.str.strip().str.lower().str.slice(0, 4)
with timed('.apply'):
    s.apply(lambda v: v.strip().lower()[:4])
Categoricals are a powerful feature in pandas, allowing one to encode memory-inefficient data in a memory-efficient representation. The trade-off is that the performance benefit of Categoricals is highly dependent on the cardinality of the encoded column, which means it is in our best interest to be able to identify the columns that should be treated as Categorical data.
from pandas import read_csv, DataFrame
from io import StringIO
from textwrap import dedent
from numpy.random import default_rng
from utils import simple_buffer
rng = default_rng(0)
data = {}
for i in range(100):
    df = DataFrame({
        'col1': rng.uniform(90, 110, size=1_000),
        'col2': rng.normal(200, 20, size=1_000)
    })
    with simple_buffer() as buffer:
        df.to_csv(buffer, index=False)
    data[f'data{i}.csv'] = buffer
#-------
# Review
from pandas import concat
# parts = []
# for path, fileobj in data.items():
#     parts.append(
#         read_csv(fileobj).assign(origin=path)
#     )
# print(concat(parts).memory_usage(deep=True))
# parts = {}
# for path, fileobj in data.items():
#     parts[path] = read_csv(fileobj)
# print(
#     concat(parts, names=['origin', 'row_no'])  # .memory_usage(deep=True)
# )
from pandas import concat, Categorical
parts = []
for i, fileobj in enumerate(data.values()):
    parts.append(
        read_csv(fileobj).assign(origin=i)
    )
print(
    concat(parts)
    .assign(
        origin=lambda d: Categorical.from_codes(d['origin'], categories=data.keys())
    )
    .memory_usage(deep=True)
)
In fact, Categoricals turn up more often than you might think: we can also express common datetime components as Categoricals to improve both the performance and readability of our code.
from numpy.random import default_rng
from pandas import DataFrame, date_range, CategoricalDtype, Categorical
from utils import timed
from calendar import day_name
rng = default_rng(0)
df = (
    DataFrame({
        'timestamp': date_range('2000-01-01', freq='10min', periods=500_000),
        'volume': rng.uniform(1, 300, size=500_000),
        'desk': rng.choice(['Alice', 'Bob', 'Charlie', 'Dana'], size=500_000)
    })
    # Review
    .astype({'desk': 'category'})
    # .assign(day_of_week=lambda d: d['timestamp'].dt.strftime('%A'))  # %A ⇒ full weekday name
    .assign(
        day_of_week=lambda d: Categorical.from_codes(
            d['timestamp'].dt.day_of_week, categories=day_name[:]
        )
    )
)
print(
    # df.loc[lambda d: ~d['day_of_week'].isin(['Saturday', 'Sunday']), 'volume'].mean(),
    df.pivot_table(index='desk', columns='day_of_week', values='volume', aggfunc='sum'),
    sep='\n' * 2
)
While these tricks make for nice blog posts, does anyone actually minimize numeric dtypes by hand?
from textwrap import dedent
from pandas import read_table, to_numeric
from io import StringIO
buffer = StringIO(
    dedent('''
    a     b
    1     1.1
    2     2.2
    3     3.3
    4     4.4
    5     5.5
    ''')
)
df = (  # Review
    read_table(buffer, sep=r'\s{2,}', engine='python')
    .astype({
        'a': 'int8',
        'b': 'float32',
    })
)
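An alternative sketch (reusing the buffer and imports above; raw and shrunk are just illustrative names): let to_numeric pick the smallest dtype via its downcast argument instead of hand-writing the astype map.
buffer.seek(0)  # rewind the StringIO so the same text can be parsed again
raw = read_table(buffer, sep=r'\s{2,}', engine='python')
shrunk = raw.assign(
    a=to_numeric(raw['a'], downcast='integer'),  # int64 → int8 for these values
    b=to_numeric(raw['b'], downcast='float'),    # float64 → float32
)
print(shrunk.dtypes)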
int64 vs Int64? What’s the big idea here? Null data has a long history in pandas and NumPy, and is one of the major points where these two tools diverged from one another. But when do I really want to use a nullable datatype in pandas? What’s wrong with numpy.nan?
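A minimal sketch of the difference (the reindex is just a quick way to introduce a missing value):
from pandas import Series
s_np = Series([1, 2, 3], dtype='int64')
print(s_np.reindex(range(4)).dtype)        # float64: NaN forces an upcast away from int64
s_nullable = Series([1, 2, 3], dtype='Int64')
print(s_nullable.reindex(range(4)).dtype)  # Int64: the missing value is <NA>, ints preserved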
from pandas import DataFrame, date_range
from numpy.random import default_rng
from calendar import month_name
rng = default_rng(0)
prices = DataFrame({
    'desk': rng.choice(['Alice', 'Bob', 'Charlie', 'Dana'], size=100),
    'timestamp': date_range(start='2000-01-01', periods=100, freq='W', tz='US/Eastern'),
    'volume': rng.uniform(100, 200, size=100).astype('int'),
}).assign(
    week=lambda d: d['timestamp'].dt.month_name()
).sample(frac=.5)
print(  # Review
    prices.pivot_table(index='week', columns='desk', values='volume', aggfunc='sum')
    .loc[month_name[1:]]
)