ts-python

pandas Mistakes Everyone Makes

Topics: pandas, code review, mistakes

“Why do my code reviews keep pointing out pandas issues?”

If you’ve ever had your code sent back with feedback on pandas usage, this seminar is for you. We’ll dive into the most common mistakes that modelers make with pandas during code review. From mishandling apply functions to inefficient joins and unclear indexing, you’ll learn to spot and fix these pitfalls before your reviewers do.

But it’s not just about avoiding mistakes—it’s about developing an intuition for pandas. We’ll discuss the “why” behind best practices, helping you understand the trade-offs in speed, readability, and maintainability. By the end, you’ll have a stronger grasp of pandas workflows and be better equipped to write production-ready code that won’t raise eyebrows during reviews.

Notes

python -m pip install pandas==2.0.3 matplotlib==3.9.* duckdb==1.1.3 "numpy<2"

The Right Tool For the Job

print("Let's take a look!")

Predicate Pushdown

When working with SQL databases and pandas, it’s common to filter data. However, placing the filter on the wrong side (in pandas instead of SQL) can significantly impact performance. Filtering in SQL leverages the database engine’s optimizations and reduces the amount of data transferred into Python memory. A misplaced filter in pandas means loading the entire dataset first, which is inefficient and can strain system resources. Always consider whether your filtering logic belongs in the SQL query or in your DataFrame manipulations.

from pathlib import Path
from pandas import DataFrame, date_range, read_sql
from numpy.random import default_rng
from sqlite3 import connect
from contextlib import closing

rng = default_rng(0)

df = DataFrame(
    data={
        'timestamp': date_range('2000-01', freq='s', periods=(n := 8_000_000)),
        'group':     rng.choice([*'abcdefghijklmnop'], size=n),
        'values':     rng.normal(100, 5, size=n),
    },
)

data_dir = Path('data')
data_dir.mkdir(exist_ok=True)

with closing(connect(data_dir / 'some.db')) as conn:
    df.to_sql('mytable', conn, if_exists='replace', index=False)
print(df.head())

pandas is often referred to as a “tail-end” analytical tool: it may not be the best choice for large-scale data pipelines, but at some point you are going to want the vast flexibility it offers to perform your work.

from pandas import read_sql
from sqlite3 import connect
from contextlib import closing

# with closing(connect('data/some.db')) as conn: # Review
#     df = read_sql('select * from mytable', con=conn)
# df = df.loc[df['group'] == 'c']
# print(df)

with closing(connect('data/some.db')) as conn: # Review
    df = read_sql('select * from mytable where [group] = "c"', con=conn)
print(df)

aside: using prepared statements during interop

from pandas import read_sql
from sqlite3 import connect
from contextlib import closing

subgroup = "c" # Review
with closing(connect('data/some.db')) as conn:
    # df = read_sql(
    #     f'''
    #         select * from mytable
    #         where ([group] = {subgroup!r})
    #     ''',
    #     con=conn,
    # )

    df = read_sql(
        '''
            select * from mytable
            where ([group] = :subgroup)
        ''',
        con=conn,
        params={'subgroup': subgroup}
    )
print(df)

Using prepared statements is not just about security—it also improves maintainability and clarity in code. By parameterizing queries instead of embedding variables directly, you safeguard against SQL injection vulnerabilities and ensure the code handles special characters correctly. This approach is particularly valuable when dealing with user-supplied inputs or dynamic query conditions.

Using the Correct Joins

left, right, inner, outer, semi, anti, cross: joins are easy, right?

from pandas import DataFrame, Timestamp, to_datetime

signal_source = DataFrame({
    'timestamp': to_datetime([
        '2000-01-01 09:00:00', '2000-01-01 09:03:00', '2000-01-01 09:04:30'
    ]),
    'signal value': [100, 121, 141],
    'signal type':  ['A', 'B', 'A'],
})

# print(signal_source)

# Review → find closest signal to target_timestamp
target_timestamp = Timestamp('2000-01-01 09:03:10')
target_delta = target_timestamp - signal_source['timestamp']

print(
    signal_source.loc[target_delta.abs().idxmin()]
)

But we don’t often just have one target timestamp, we have many…

from pandas import DataFrame, date_range, to_datetime, concat

signal_source = DataFrame({
    'timestamp': to_datetime([
        '2000-01-01 09:00:00', '2000-01-01 09:03:00', '2000-01-01 09:04:30'
    ]),
    'signal value': [100, 121, 141],
    'signal type':  ['A', 'B', 'A'],
})

trades = DataFrame({
    'timestamp': to_datetime([
        '2000-01-01 09:01', '2000-01-01 09:02', '2000-01-01 09:04', '2000-01-01 09:10'
    ]),
    'trade_volume': [10, 20, 15, 14]
})

# parts = [] # Review
# for idx, row in trades.iterrows():
#     target_deltas = (row['timestamp'] - signal_source['timestamp'])
#     parts.append(
#         signal_source.loc[target_deltas.abs().idxmin()]
#     )
# print(DataFrame(parts))

from pandas import merge_asof
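# direction='backward' takes the most recent signal at or before each trade;
# direction='nearest' would mirror the closest-signal (abs().idxmin()) logic above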
print(
    merge_asof(trades, signal_source, on='timestamp', direction='backward')
)

pandas has no inequality join!

For operations like inequality joins (matching rows where a value falls within a range), pandas can become cumbersome and slow. Tools like DuckDB handle such cases more efficiently, especially with large datasets: the SQL syntax makes these joins straightforward to express, and performance is typically far better, reducing both execution time and memory usage. (A DuckDB sketch follows the pandas cross-join below.)

Instead of finding the closest signal when provided some timestamps, let’s find ALL windows of signals that each timestamp fits into.

from pandas import DataFrame, date_range, to_datetime

signal_source = DataFrame({
    'start_time': to_datetime([
        '2000-01-01 09:00:00', '2000-01-01 09:03:00', '2000-01-01 09:04:30'
    ]),
    'end_time': to_datetime([
        '2000-01-01 09:03:30', '2000-01-01 09:05:00', '2000-01-01 09:06:00'
    ]),
    'signal value': [100, 121, 141],
    'signal type':  ['A', 'B', 'A'],
})

trades = DataFrame({
    'timestamp': to_datetime([
        '2000-01-01 09:03:00', '2000-01-01 09:04:00', '2000-01-01 09:05:00',
    ]),
    'trade_volume': [10, 15, 20],
    'price':        [90, 92, 93],
})

# print(
#     'Price Data:',
#     signal_source,
#     '',
#     'Trade Data:',
#     trades,
#     sep='\n',
#     end='\n'*3,
# )

# Review — align each trade against its corresponding signal given their start/end times
# from pandas import merge_asof
# print(
#     merge_asof(trades, signal_source, left_on='timestamp', right_on='start_time')
# )

# print(
#     trades.merge(signal_source, how='cross')
#     .loc[lambda d: d['timestamp'].between(d['start_time'], d['end_time'])]
# )
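
Since DuckDB was mentioned above, here is a minimal sketch of the same window join expressed in SQL. It assumes the trades and signal_source DataFrames from the block above are still in scope; DuckDB can scan them by variable name.

import duckdb

# inequality join: match each trade to every signal window it falls inside
print(
    duckdb.sql('''
        select t.timestamp, t.trade_volume, t.price,
               s."signal value", s."signal type"
        from trades t
        join signal_source s
            on t.timestamp between s.start_time and s.end_time
    ''').df()
)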

The Gotchas

print("Let's take a look!")

Chained Assignment & SettingWithCopyWarning

pandas’ SettingWithCopyWarning often confuses users. This warning occurs when you attempt to modify a view of a DataFrame rather than a copy, potentially leading to unpredictable results. Defensive programming techniques, like taking an explicit .copy() or applying transformations across entire columns, can help you avoid this trap. For instance, instead of modifying a subset, create an independent copy for safe manipulation, or use boolean masks for row-wise updates on the parent.

from pandas import DataFrame, date_range
from numpy.random import default_rng

rng = default_rng(0)

df = DataFrame(
    index=date_range('2000-01', freq='D', periods=(n := 100)),
    data={
        'group': rng.choice(['a', 'b', 'c', 'd'], size=100),
        'values': rng.normal(100, 5, size=n),
    },
)

# Review
# subset = df.loc[df['group'] == 'c']
# subset['values'] = -subset['values']

# defensive copy → create independent subset
# subset = df.loc[df['group'] == 'c'].copy()
# subset['values'] = -subset['values']

# operate along the whole column → parent new column
# df['values'] = df['values'].mask(df['group'] == 'c', -df['values'])

# reuse & combine masks → parent in place update
# mask = df['group'] == 'c'
# df.loc[mask, 'values'] *= -1

print(
    ' Parent '.center(40, '\N{box drawings double horizontal}'),
    df.head(),
    # ' Subset '.center(40, '\N{box drawings double horizontal}'),
    # subset.head(),
    sep='\n'
)

inplace=?

While the inplace=True parameter is convenient, it’s often better to avoid it. Modifying objects in place can make debugging harder, as it obscures the flow of data transformations. Explicit assignments, though slightly more verbose, lead to clearer and more maintainable code.
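
As a quick illustration, here is a minimal sketch (with a made-up rename step, not code from this seminar) contrasting the two styles:

from pandas import DataFrame

df = DataFrame({'Price Range': ['100-120', '121-140']})

# in-place mutation: later readers must track side effects to know the schema
# df.rename(columns={'Price Range': 'price_range'}, inplace=True)

# explicit reassignment: the transformation stays visible and chainable
df = df.rename(columns={'Price Range': 'price_range'})
print(df.columns.tolist())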

from pandas import DataFrame, date_range, NA
from numpy.random import default_rng

rng = default_rng(0)

df = DataFrame(
    index=date_range('2000-01', freq='D', periods=(n := 3)),
    data={
        'price_range': ['100-120', '121-140', '141-160']
    },
)

df['low'] = ( # Review
    df['price_range'].str.split('-', expand=True)
    .drop(columns=[1])
)

print(df)

Treat Columns as if they are Immutable

Wide tables with many columns can be unwieldy, especially if columns represent repeated patterns (e.g., monthly sales). Stacking data into a tall format with fewer columns and more rows often simplifies analysis and supports advanced operations like grouping or pivoting.
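
For example, a minimal sketch (using a made-up monthly sales table) of reshaping a wide layout into a tall one with .melt:

from pandas import DataFrame

wide = DataFrame({
    'store':     ['north', 'south'],
    'jan_sales': [100, 90],
    'feb_sales': [110, 95],
    'mar_sales': [120, 99],
})

# one row per store/month: far easier to group, pivot, or extend with new months
tall = wide.melt(id_vars='store', var_name='month', value_name='sales')
print(tall)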

from pandas import DataFrame, date_range
from numpy.random import default_rng
from time import sleep

rng = default_rng(0)


df = DataFrame({
    'timestamp': date_range(start='2000-01-03', periods=10, freq='6H', tz='US/Eastern'),
    'price': rng.uniform(100, 200, size=10).round(2)
})
sleep(5) # data takes a while to load

df = df.assign( # Review
    price_normed=lambda d: d['price'] / d.groupby(d['timestamp'].dt.floor('D'))['price'].transform('max')
)
print(df)


# the original price values, not the normalized ones

Functional Purity

Broadly speaking, we want to ensure our functions do not produce side effects on the objects on which they operate. This is especially true when working with data, where the same state is shared across multiple places in a given pipeline. We can help enforce this pattern by not assuming column names within a function and by trying to ensure our functions operate on Series and return Series instead of DataFrames.

from pandas import DataFrame, date_range
from numpy.random import default_rng

rng = default_rng(0)

df = DataFrame({
    'timestamp': date_range(start='2000-01-03', periods=10, freq='6H', tz='US/Eastern'),
    'price': rng.uniform(100, 200, size=10).round(2)
})

# Review
def daily_averages(df):
    """
    if you want to use this function, name your datetime column "timestamp"
    """
    df['price'] = df.resample('D', on='timestamp')['price'].transform('mean')
    return df

def daily_averages(df, on, values):
    return df.resample('D', on=on)[values].transform('mean')

print(
    # df.assign(
    #     avg_daily_price=lambda d: daily_averages(d, on='timestamp', values='price')
    # )

    df.join(
        daily_averages(df, on='timestamp', values=['price'])
        .add_prefix('avg_daily_')
    )
)

Groupby Operations

print("Let's take a look!")

When it comes to .groupby operations, we need to keep in mind two very important things:

  1. Use the most specific groupby verb (agg, transform, apply) for the job; in practice this often means avoiding .apply.
  2. Avoid passing user-defined functions into any groupby verb.

from pandas import MultiIndex
from numpy import ascontiguousarray
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

pool = (
    ascontiguousarray(
        MultiIndex.from_product([list(ascii_lowercase)] * 4)
        .to_frame()
        .to_numpy()
    )
    .astype('<U1')
    .view('<U4')
    .ravel()
)

from pandas import MultiIndex, DataFrame, date_range

cardinality = 10_000
categories = rng.choice(pool, size=cardinality, replace=False)
dates = date_range('2000-01-01', freq='h', periods=2_000)


index = MultiIndex.from_product([categories, dates], names=['groupid', 'date'])
df = (
    DataFrame(
        index=index,
        data={
            'xs':  (xs := rng.normal(5_000, 200, size=len(index))),
            'ys':  5 * xs + 2_000 + rng.normal(0, 400, size=len(index)),
            'weights': rng.uniform(0, 2, size=len(index)),
        }
    )
    .sample(frac=.7, random_state=rng)
    .reset_index()
)

from pathlib import Path
data_dir = Path('data')
data_dir.mkdir(exist_ok=True)

df.to_pickle(data_dir / 'large-groups.pkl')
print(df.head())

verbs: agg, transform, apply

from pandas import read_pickle
from utils import timed

df = read_pickle('data/large-groups.pkl')

def udf(data):
    return data.mean()

# Review
# with timed('apply'):
#     res = df.groupby('groupid')['ys'].apply(udf)

# with timed('agg'):
#     res = df.groupby('groupid')['ys'].agg(udf)

with timed('agg'):
    res = df.groupby('groupid')['ys'].agg('mean')

# aggregate → moving from many rows per group to 1 row per group
# transform → many rows per group to same number of rows per group

from pandas import read_pickle
from utils import timed

df = read_pickle('data/large-groups.pkl').sample(frac=.1, random_state=0)

def udf(data):
    return (data['ys'] * data['weights']).sum() / data['weights'].sum()

# Review
with timed('apply'):
    res = df.groupby('groupid').apply(udf)

with timed('agg + intermediate'):
    res = (
        df.assign(
            tmp=lambda d: d['ys'] * d['weights']
        )
        .groupby('groupid')
        .agg({'tmp': 'sum', 'weights': 'sum'})
        .pipe(lambda d: d['tmp'] / d['weights'])
    )
print(res)

Optimizing Data Types

print("Let's take a look!")

Avoid Objects

pandas has many data types; the most notable is the 'object' dtype. This is the dtype we want to avoid as much as possible, because it means that every operation we perform will be no more performant than looping over a plain Python list.
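
To see that cost, a quick sketch (reusing the timed helper from utils) comparing the same numbers held as float64 versus object:

from pandas import Series
from numpy.random import default_rng
from utils import timed

rng = default_rng(0)
values = rng.normal(100, 5, size=1_000_000)

# identical numbers; only the backing dtype differs
fast = Series(values)                   # float64
slow = Series(values, dtype='object')   # values boxed as Python objects

with timed('float64 sum'):
    fast.sum()

with timed('object sum'):
    slow.sum()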

from pandas import Series
from numpy.random import default_rng
from string import ascii_lowercase, ascii_uppercase
from utils import timed

rng = default_rng(0)

pool = [*ascii_lowercase, *ascii_uppercase]
s = Series(
    rng.choice(pool, size=(1_000_000, 10)).view('<U10').ravel()
)
s = s.astype('string')

# Review
with timed('chained str'):
    s.str.strip().str.lower().str.slice(0, 4)

with timed('.apply'):
    s.apply(lambda v: v.strip().lower()[:4])

Embrace Categorical

Categoricals are a powerful pandas feature that lets us encode memory-inefficient values (such as repeated strings) as memory-efficient integer codes. The catch is that the benefit is highly dependent on the cardinality of the encoded column, so it is in our best interest to be able to identify which columns should be treated as Categorical data.
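
As a quick check of the memory claim, a minimal sketch comparing an object-backed column of repeated labels against its categorical encoding:

from pandas import Series
from numpy.random import default_rng

rng = default_rng(0)

# low cardinality: three distinct labels repeated a million times
s = Series(rng.choice(['buy', 'sell', 'hold'], size=1_000_000))

print(
    s.memory_usage(deep=True),                     # every row holds its own string object
    s.astype('category').memory_usage(deep=True),  # small integer codes + 3 categories
    sep='\n',
)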

from pandas import read_csv, DataFrame
from io import StringIO
from textwrap import dedent
from numpy.random import default_rng
from utils import simple_buffer

rng = default_rng(0)
data = {}
for i in range(100):
    df = DataFrame({
        'col1': rng.uniform(90, 110, size=1_000),
        'col2': rng.normal(200, 20, size=1_000)
    })

    with simple_buffer() as buffer:
        df.to_csv(buffer, index=False)
    data[f'data{i}.csv'] = buffer
#-------

# Review
from pandas import concat
# parts = []
# for path, fileobj in data.items():
#     parts.append(
#         read_csv(fileobj).assign(origin=path)
#     )
# print(concat(parts).memory_usage(deep=True))

# parts = {}
# for path, fileobj in data.items():
#     parts[path] = read_csv(fileobj)
# print(
#     concat(parts, names=['origin', 'row_no']) #.memory_usage(deep=True)
# )

from pandas import concat, Categorical
parts = []
for i, fileobj in enumerate(data.values()):
    parts.append(
        read_csv(fileobj).assign(origin=i)
    )
print(
    concat(parts)
    .assign(
        origin=lambda d: Categorical.from_codes(d['origin'], categories=data.keys())
    )
    .memory_usage(deep=True)
)

In fact, Categoricals turn up more often than you might think: we can also express common datetime components as Categoricals to improve both the performance and readability of our code.

from numpy.random import default_rng
from pandas import DataFrame, date_range, CategoricalDtype, Categorical
from utils import timed
from calendar import day_name

rng = default_rng(0)

df = (
    DataFrame({
        'timestamp': date_range('2000-01-01', freq='10min', periods=500_000),
        'volume': rng.uniform(1, 300, size=500_000),
        'desk': rng.choice(['Alice', 'Bob', 'Charlie', 'Dana'], size=500_000)
    })

    # Review
    .astype({'desk': 'category'})
    # .assign(day_of_week=lambda d: d['timestamp'].dt.strftime('%A')) # %A ⇒ full weekday name
    .assign(day_of_week=lambda d:
        Categorical.from_codes(d['timestamp'].dt.day_of_week, categories=day_name[:])
    )
)

print(
    # df.loc[lambda d: ~d['day_of_week'].isin(['Saturday', 'Sunday']), 'volume'].mean(),

    df.pivot_table(index='desk', columns='day_of_week', values='volume', aggfunc='sum'),
    sep='\n' * 2
)

Minimizing Numeric Dtypes

While these tricks make for nice blog posts, does anyone actually minimize numeric dtypes by hand?

from textwrap import dedent
from pandas import read_table, to_numeric
from io import StringIO

buffer = StringIO(
    dedent('''
       a      b
       1    1.1
       2    2.2
       3    3.3
       4    4.4
       5    5.5
    ''')
)

df = ( # Review
    read_table(buffer, sep=r'\s{2,}', engine='python')
    .astype({
        'a': 'int8',
        'b': 'float32',
    })
)
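
For comparison, continuing from the block above (rewinding the same buffer and reusing its imports), a minimal sketch of letting to_numeric choose the smallest dtype instead:

buffer.seek(0)  # rewind the StringIO consumed by the read above

df = (
    read_table(buffer, sep=r'\s{2,}', engine='python')
    .assign(
        a=lambda d: to_numeric(d['a'], downcast='integer'),  # int8
        b=lambda d: to_numeric(d['b'], downcast='float'),    # float32
    )
)
print(df.dtypes)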

Nullable Data Types

int64 vs Int64? What’s the big idea here? Null data has a long history in pandas and NumPy, and it is one of the bigger points where the two tools diverged from one another. But when do I really want to use a nullable dtype in pandas? What’s wrong with numpy.nan?
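
A minimal sketch of the difference: introducing a missing value into a NumPy-backed integer column silently upcasts it to float64, while the nullable extension dtype keeps the integers and marks the gap as <NA>.

from pandas import Series

s = Series([1, 2, 3], dtype='int64')
print(s.reindex([0, 1, 2, 3]).dtype)   # float64: the missing row forces an upcast

s = Series([1, 2, 3], dtype='Int64')
print(s.reindex([0, 1, 2, 3]))         # stays Int64; the missing row is <NA>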

from pandas import DataFrame, date_range
from numpy.random import default_rng
from calendar import month_name

rng = default_rng(0)

prices = DataFrame({
    'desk': rng.choice(['Alice', 'Bob', 'Charlie', 'Dana'], size=100),
    'timestamp': date_range(start='2000-01-01', periods=100, freq='W', tz='US/Eastern'),
    'volume': rng.uniform(100, 200, size=100).astype('int'),
}).assign(
    month=lambda d: d['timestamp'].dt.month_name()
).sample(frac=.5)

print(  # Review
    prices.pivot_table(index='month', columns='desk', values='volume', aggfunc='sum')
    .loc[month_name[1:]]
)
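
One answer, as a hedged sketch (assuming pivot_table carries the nullable dtype through the aggregation): cast volume to Int64 first, so missing desk/month cells stay <NA> instead of forcing the whole table to float.

# same pivot, but with a nullable integer volume column
print(
    prices.astype({'volume': 'Int64'})
    .pivot_table(index='month', columns='desk', values='volume', aggfunc='sum')
    .loc[month_name[1:]]
)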