ts-python

Seminar V: “The Abyss” (a deep dive into a `pandas` problem)

Lindsey Brigman (Mary Elizabeth Mastrantonio): I got over four years invested in this project.

Virgil “Bud” Brigman (Ed Harris): Yeah, you only had three years invested in me.

Lindsey Brigman: Well, you have to have priorities.

— The Abyss (1989)

Date	Time	Theme	Meeting Link
Fri Aug 6, 2021	9:30 AM EDT	a deep dive into a `pandas` problem	Seminar V: “The Abyss”

Seminar V: “The Abyss” (a deep dive into a pandas problem)

Audience

These sessions are designed for a broad audience of modelers and software programmers of all backgrounds and skill levels.

Our expected audience should comprise attendees with at least a…

basic proficiency with pandas

During this session, we will endeavour to guide our audience to developing…

(complete) proficiency with fluent use of the pandas API.
(basic) proficiency with dropping down into numpy for solving pandas problems.

Abstract

In previous episodes, we have looked at many problems involving time-series or columnar-data analysis using pandas. Our goal has been to present these examples in order to understand some broader conceptual point, such as the structure of the pandas.Series and pandas.DataFrame, better use of indices, or better use of windowing operations.

By popular demand, this episode will set aside our goal for deeper thematic or conceptual understanding and instead provide us an opportunity to look at a couple of problems in-depth!

Join us for this episode to look at a number of in-depth pandas examples, drawn from our previous discussions, where we will go line-by-line and look at what constitutes fluent, precise Python and pandas code!

To Be Continued…

Did you enjoy this episode? Did you learn something new that will help you as you try to improve the fluency or the precision of your code?

If so, stay tuned for future episodes, which may…

take a deep dive into sample problems that involve the combined use of pandas, numpy, xarray, and the Python standard library.
take a deep dive into sample problems that require writing small libraries around analytical tools like pandas or xarray.
contrast various different approaches to solving a problem, and discuss how we can get the best “bang-for-our-buck” when employing greater technical sophistication.

If there are other related topics you’d like to see covered, please reach out to Diego Torres Quintanilla.

Notes

print("Let's get started!")

import sys; sys.breakpointhook = sys.exit
from pandas import (
    Timestamp, DataFrame, Series, Interval, 
    to_datetime, to_timedelta, IndexSlice,
)
from numpy import repeat, tile, inf, sign
from numpy.random import default_rng
from string import ascii_lowercase

seed = Timestamp('2020-08-06').asm8.astype('uint32')
rng = default_rng(seed)

tickers = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
dates   = to_datetime('2021-08-06') + to_timedelta(
    rng.integers(0, 60 * 60, size=50).cumsum(), unit='s'
)
prices = (
    100 * rng.random(size=len(tickers))
  * rng.normal(loc=1, scale=0.01, size=(len(dates), len(tickers))).cumprod(axis=0)
).ravel()
volumes = rng.integers(-50_000, +50_000, size=len(dates) * len(tickers)).round(-2)

trades = DataFrame({
    'date':   repeat(dates, len(tickers)),
    'ticker': tile(tickers, len(dates)),
    'price':  prices,
    'volume': volumes,
}).set_index(['date', 'ticker']).pipe(
    lambda df: df.sample(frac=.2, random_state=seed)
).sort_index()

taxes = Series({
    Interval(     0, 10_000, closed='left'): .050,
    Interval(10_000, 20_000, closed='left'): .030,
    Interval(20_000, 40_000, closed='left'): .015,
    Interval(40_000, 50_001, closed='left'): .005,
})
fees = Series({
    Interval( 0,   5, closed='left'): .015,
    Interval( 5,  15, closed='left'): .025,
    Interval(15, inf, closed='left'): .05,
})

trades = trades.assign(count=1).assign(count=lambda df:
    df['count'].groupby([
        df.index.get_level_values('date').date,
        'ticker'
    ]).transform(lambda s: sum(s)) 
)
trades = trades.assign(total_volume=lambda df:
    df.groupby('ticker')['volume'].transform(
        lambda s: s.expanding(min_periods=1).sum()
    )
)
trades = trades.assign(sign=lambda df: sign(df['total_volume'])).assign(
    new=lambda df:
        df['sign'].groupby('ticker').transform(
            lambda s: s.rolling(2).sum() == 0
        )
)
trades = trades.assign(position=lambda df:
    df.groupby('ticker')['new'].transform(lambda s: s.cumsum())
)
trades = trades.assign(weight=lambda df:
    df.groupby(['ticker', 'position'])['volume'].transform(
        lambda s: s.abs() / s.abs().sum()
    )
)
trades = trades.assign(weighted_price=lambda df: df['weight'] * df['price'])
trades = trades.assign(average_price=lambda df:
    df.groupby(['ticker', 'position'])['weighted_price'].transform(
        lambda s: s.sum()
    )
)
trades['cash'] = -(trades['price'] * trades['volume'])
trades['fee'] = fees.loc[trades['count']].values
trades['tax'] = taxes.loc[trades['volume'].abs()].values
trades['diff_lowest_price'] = (
    trades['average_price'] - 
    trades
        .reset_index('date').set_index('position', append=True)
        .groupby('ticker')['average_price'].min()
).abs()

trades = (
    trades
    .reset_index('date')
    .set_index('position', append=True)
    .sort_index()
)
print(
    trades.loc[
        trades.groupby(['ticker', 'position'])['diff_lowest_price'].head(1)
        .groupby(['ticker']).nsmallest(2).reset_index(0).index,
    ][['cash', 'tax', 'fee']],
    sep='\n',
)

from matplotlib.pyplot import show
from pandas import read_csv, DataFrame, Series, Interval
from numpy import multiply, meshgrid, inf
from io import StringIO
data = StringIO('''
WD          4   89.00
WD          6  109.00
WD          8  149.00
WD         10  218.00
WD         12  239.00
WD         14  279.00
WD         16  329.00
WD         18  409.00
Seagate     4   79.00
Seagate     6  119.00
Seagate     8  159.00
Seagate    10  228.00
Seagate    12  232.00
Seagate    14  299.00
Seagate    16  309.00
Seagate    18  425.00
''')
prices = read_csv(
    data, sep='\s+', header=None,
    names='manufacturer size price'.split()
).set_index(['manufacturer', 'size']).sort_index()

#  taxes = Series({
#      Interval( 0,  10, closed='left'):   0,
#      Interval(10, inf, closed='left'): .10,
#  })

#  prices = prices.loc['WD']['price']

print(
    #  prices,
    #  prices.unstack(),
    #  prices.unstack('manufacturer'),
    #  taxes.loc[prices.index],
    #  (DataFrame(
    #      multiply(*meshgrid((prices / prices.index), prices.index)).T,
    #      index=prices.index,
    #      columns=prices.index,
    #  ) - prices).round(2),
    sep='\n',
)

#  prices.unstack('manufacturer').plot()
#  show()