ts-python

Seminar V: “The Abyss” (a deep dive into a pandas problem)

Lindsey Brigman (Mary Elizabeth Mastrantonio): I got over four years invested in this project.

Virgil “Bud” Brigman (Ed Harris): Yeah, you only had three years invested in me.

Lindsey Brigman: Well, you have to have priorities.

The Abyss (1989)

Date Time Theme Meeting Link
Fri Aug 6, 2021 9:30 AM EDT a deep dive into a pandas problem Seminar V: “The Abyss”

Contents

Audience

These sessions are designed for a broad audience of modelers and software programmers of all backgrounds and skill levels.

Our expected audience should comprise attendees with at least a…

During this session, we will endeavour to guide our audience to developing…

Abstract

In previous episodes, we have looked at many problems involving time-series or columnar-data analysis using pandas. Our goal has been to present these examples in order to understand some broader conceptual point, such as the structure of the pandas.Series and pandas.DataFrame, better use of indices, or better use of windowing operations.

By popular demand, this episode will set aside our goal for deeper thematic or conceptual understanding and instead provide us an opportunity to look at a couple of problems in-depth!

Join us for this episode to look at a number of in-depth pandas examples, drawn from our previous discussions, where we will go line-by-line and look at what constitutes fluent, precise Python and pandas code!

To Be Continued…

Did you enjoy this episode? Did you learn something new that will help you as you try to improve the fluency or the precision of your code?

If so, stay tuned for future episodes, which may…

If there are other related topics you’d like to see covered, please reach out to Diego Torres Quintanilla.

Notes

print("Let's get started!")
import sys; sys.breakpointhook = sys.exit
from pandas import (
    Timestamp, DataFrame, Series, Interval, 
    to_datetime, to_timedelta, IndexSlice,
)
from numpy import repeat, tile, inf, sign
from numpy.random import default_rng
from string import ascii_lowercase

seed = Timestamp('2020-08-06').asm8.astype('uint32')
rng = default_rng(seed)

tickers = rng.choice([*ascii_lowercase], size=(10, 4)).view('<U4').ravel()
dates   = to_datetime('2021-08-06') + to_timedelta(
    rng.integers(0, 60 * 60, size=50).cumsum(), unit='s'
)
prices = (
    100 * rng.random(size=len(tickers))
  * rng.normal(loc=1, scale=0.01, size=(len(dates), len(tickers))).cumprod(axis=0)
).ravel()
volumes = rng.integers(-50_000, +50_000, size=len(dates) * len(tickers)).round(-2)

trades = DataFrame({
    'date':   repeat(dates, len(tickers)),
    'ticker': tile(tickers, len(dates)),
    'price':  prices,
    'volume': volumes,
}).set_index(['date', 'ticker']).pipe(
    lambda df: df.sample(frac=.2, random_state=seed)
).sort_index()

taxes = Series({
    Interval(     0, 10_000, closed='left'): .050,
    Interval(10_000, 20_000, closed='left'): .030,
    Interval(20_000, 40_000, closed='left'): .015,
    Interval(40_000, 50_001, closed='left'): .005,
})
fees = Series({
    Interval( 0,   5, closed='left'): .015,
    Interval( 5,  15, closed='left'): .025,
    Interval(15, inf, closed='left'): .05,
})

trades = trades.assign(count=1).assign(count=lambda df:
    df['count'].groupby([
        df.index.get_level_values('date').date,
        'ticker'
    ]).transform(lambda s: sum(s)) 
)
trades = trades.assign(total_volume=lambda df:
    df.groupby('ticker')['volume'].transform(
        lambda s: s.expanding(min_periods=1).sum()
    )
)
trades = trades.assign(sign=lambda df: sign(df['total_volume'])).assign(
    new=lambda df:
        df['sign'].groupby('ticker').transform(
            lambda s: s.rolling(2).sum() == 0
        )
)
trades = trades.assign(position=lambda df:
    df.groupby('ticker')['new'].transform(lambda s: s.cumsum())
)
trades = trades.assign(weight=lambda df:
    df.groupby(['ticker', 'position'])['volume'].transform(
        lambda s: s.abs() / s.abs().sum()
    )
)
trades = trades.assign(weighted_price=lambda df: df['weight'] * df['price'])
trades = trades.assign(average_price=lambda df:
    df.groupby(['ticker', 'position'])['weighted_price'].transform(
        lambda s: s.sum()
    )
)
trades['cash'] = -(trades['price'] * trades['volume'])
trades['fee'] = fees.loc[trades['count']].values
trades['tax'] = taxes.loc[trades['volume'].abs()].values
trades['diff_lowest_price'] = (
    trades['average_price'] - 
    trades
        .reset_index('date').set_index('position', append=True)
        .groupby('ticker')['average_price'].min()
).abs()

trades = (
    trades
    .reset_index('date')
    .set_index('position', append=True)
    .sort_index()
)
print(
    trades.loc[
        trades.groupby(['ticker', 'position'])['diff_lowest_price'].head(1)
        .groupby(['ticker']).nsmallest(2).reset_index(0).index,
    ][['cash', 'tax', 'fee']],
    sep='\n',
)
from matplotlib.pyplot import show
from pandas import read_csv, DataFrame, Series, Interval
from numpy import multiply, meshgrid, inf
from io import StringIO
data = StringIO('''
WD          4   89.00
WD          6  109.00
WD          8  149.00
WD         10  218.00
WD         12  239.00
WD         14  279.00
WD         16  329.00
WD         18  409.00
Seagate     4   79.00
Seagate     6  119.00
Seagate     8  159.00
Seagate    10  228.00
Seagate    12  232.00
Seagate    14  299.00
Seagate    16  309.00
Seagate    18  425.00
''')
prices = read_csv(
    data, sep='\s+', header=None,
    names='manufacturer size price'.split()
).set_index(['manufacturer', 'size']).sort_index()

#  taxes = Series({
#      Interval( 0,  10, closed='left'):   0,
#      Interval(10, inf, closed='left'): .10,
#  })

#  prices = prices.loc['WD']['price']

print(
    #  prices,
    #  prices.unstack(),
    #  prices.unstack('manufacturer'),
    #  taxes.loc[prices.index],
    #  (DataFrame(
    #      multiply(*meshgrid((prices / prices.index), prices.index)).T,
    #      index=prices.index,
    #      columns=prices.index,
    #  ) - prices).round(2),
    sep='\n',
)

#  prices.unstack('manufacturer').plot()
#  show()