ts-python

No .index, No pandas

Date: Friday, July 26, 2024 at 09:30 AM US/Eastern

Topics: pandas

Understanding the .index will change your pandas code forever.

Pandas has seven unique built-in Index types, including the CategoricalIndex, DatetimeIndex, TimedeltaIndex, PeriodIndex, IntervalIndex, and more! Do you know which to reach for in a given problem?

In this upcoming seminar, we will focus on the importance of understanding the index in pandas. Not only will we explore each of the standard Index types that pandas offers, but we will also develop an intuition around when you would want to use one over the others. Additionally, we will provide guidance on how the Index is used across every single pandas operation and how a better understanding of it can help you make sense of your results. This session is a great opportunity to enhance your skills and gain a deeper understanding of the most widely used DataFrame tool in Python.

pip install pandas numpy scipy matplotlib

Notes

Premise

print("Let's take a look!")

When you see pandas code that we share, it may look slightly unusual.

Our imports look different.

You may typically see people import pandas as pd or import numpy as np

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

print(df)

… but we from pandas import DataFrame. This is mostly done for didactic purposes so that you can have a “heads-up” on what imports the following code contains & what it is trying to accomplish. While these import styles do differ (one is early-bound and the other is late-bound,) there is no difference in performance or memory-use, and, for our purposes, this is a largely superficial choice.

from pandas import DataFrame

df = DataFrame({'x': [1, 2, 3]})

print(df)

You will see that we’ll often use the pandas.Series. This is a much more meaningful difference. It’s not uncommon to see pandas code that uses only pandas.DataFrame and totally ignores the pandas.Series. However, the pandas.Series is a very useful entity for representing a single (one-dimensional) dataset. Its alignment rules are very straightforward, and (often) it uses less memory than a sparsely-aligned pandas.DataFrame.

from pandas import Series

s = Series([1, 2, 3])

print(s)

We’ll rarely use the pandas.array directly (though we’ll show it to you when we’re trying to make a point about how pandas works internally.) The pandas.array provides us with a necessary indirection over storage formats like numpy.ndarray.

A numpy.ndarray cannot directly represent missing integer data, since there can be no unambiguous in-band encoding for the int64 dtype (unlike float64, which reserves float('nan') values to represent missing data.) A numpy.ma masked array can represent missing data out-of-band (by storing the .data together with a separate .mask.) A pandas.array is a similar indirection.

Altogether, the pandas.array is interesting to know about but not that interesting in day-to-day use.

from pandas import array

xs = array([1, 2, 3])

print(xs)
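As a quick illustration, if we hand pandas.array integer data containing missing values, it infers the nullable Int64 dtype, exactly the indirection a raw numpy.ndarray cannot provide:

from pandas import array

# None becomes <NA>; the inferred dtype is the nullable Int64,
# which stores a mask out-of-band rather than an in-band sentinel
xs = array([1, None, 3])

print(xs)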

Most critically, you’ll not only see us specify indices… but we’ll even specify them first. Our analyses will often start with a discussion of the desired indexing and “shape” of the data… and often proceed along these lines ignoring details of the data itself.

In fact, we’ll often use synthetic or dummy data for our examples, and we’ll use dummy data even when we’re working on actual analyses for our project work!

We’ll use dummy data, because it creates a useful, fixed target for writing our code, and, if we write our code in terms of the structure of the pandas.Series and pandas.DataFrame—in other words, in terms of the index—we’ll find that when we swap our dummy data with real data, everything works perfectly!

from pandas import Series, date_range
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx := date_range('2020-01-01', freq='QE', periods=4)),
    data=rng.integers(-100, +100, size=len(idx)),
)

print(s)

And you will rarely see us .reset_index or .values!

That isn’t to say that .reset_index or .values are never necessary. There are cases where pandas will do a costly index alignment even if the two sides of the operation are already perfectly aligned, there are cases where pandas will try to do a spurious index alignment, and there are cases where pandas will spuriously fail to allow an operation like a .rolling with a date window. In these cases, we may use .reset_index, .values, or .droplevel to coërce correct behavior from pandas.

In some of these cases, newer releases of pandas have made this use of .reset_index, .values, or .droplevel unnecessary.

However, altogether, our use of .reset_index or .values is fairly minimal—certainly not the reflexive .groupby(...).reset_index() that we see in user code!

Mechanics

print("Let's take a look!")

the pandas.array

The pandas.array is how we represent just the data from a single, one-dimensional dataset. Altogether, it’s not something we will see used in day-to-day code.

from pandas import array

xs = array(data=[1, 2, 3])

print(
    xs,
)

the pandas.Series

A pandas.Series is the data from a one-dimensional measurement together with a human labelling—the .index. The .index is metadata that allows us to address and refer to parts of our data.

from pandas import array, Series, date_range

s = Series(
    index=date_range('2020-01-01', periods=3),
    data=array([1, 2, 3])
)

print(
    s,
)

Whereas .iloc operations are rarely human-meaningful (except, perhaps, .iloc[0] or .iloc[-1] for the first or last captured measurement,) .loc operations are intended to be human-meaningful and human-readable.

A common example is the use of a DatetimeIndex for timeseries data. It’s definitely the case that the date could be made a column of a DataFrame, but then instead of .loc['2020-01-01'] we’ll have to write ugly .query("date == ...") or masking [df['date'] == ...] code.

from pandas import array, Series, date_range

s = Series(
    index=date_range('2020-01-01', periods=3),
    data=array([1, 2, 3])
)

print(
    s,
    s.loc['2020-01-03'],
    s.iloc[2],
    sep='\n',
)
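For contrast, here is a quick sketch (our own illustrative construction) of what the same lookup looks like when the date is demoted to an ordinary column:

from pandas import DataFrame, date_range

df = DataFrame({
    'date': date_range('2020-01-01', periods=3),
    'x': [1, 2, 3],
})

print(
    df.query("date == '2020-01-03'"),
    df[df['date'] == '2020-01-03'],
    sep='\n',
)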

If we do not specify the index for a pandas.Series, that doesn’t mean that it has no .index. It means it gets the simplest kind of index—a RangeIndex.

We often call the RangeIndex a “degenerate” index, because it rarely has human meaning. That doesn’t mean that RangeIndex are never useful indices—sometimes we do want to track measurements in the strict order in which they were captured—but it means that most uses of RangeIndex are part of a flawed attempt to avoid dealing with indices and index-alignment.

Using .reset_index to get a degenerate RangeIndex will not successfully allow you to avoid dealing with indices and index-alignment, because the moment you subsequently filter or .sort_values on the data, the RangeIndex will become what was previously an Int64Index (and what is now an Index(..., dtype='int64')) with jumbled, discontinuous numbers that will not meaningfully align against anything else!

from pandas import array, Series

s = Series(
    data=array([1, 2, 3])
)

print(
    s,
    sep='\n',
)
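To see the degeneration for ourselves, a minimal sketch: a single .sort_values later, the RangeIndex is gone, replaced by jumbled int64 labels that will not meaningfully align against anything:

from pandas import Series

s = Series([3, 1, 2])

print(
    s.index,                # RangeIndex(start=0, stop=3, step=1)
    s.sort_values().index,  # Index([1, 2, 0], dtype='int64')
    sep='\n',
)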

index types

There are many types of indices that we see in use. DatetimeIndex and CategoricalIndex are two very common and very useful index types.

We often create a DatetimeIndex via a helper function like date_range or to_datetime. Its usefulness should be fairly obvious in the case of timeseries data.

A CategoricalIndex is a very useful index for data that we may otherwise store in a plain object-dtype (string) Index. If we have textual data that is restricted to an enumerable, finite set of possible values, then we’re actually dealing with categorical data. The use of a CategoricalIndex will save us memory and significantly speed up filtering and alignment operations!

from pandas import (
    Series, RangeIndex, DatetimeIndex, CategoricalIndex, Timestamp,
    date_range, to_datetime,
)
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    data=rng.integers(-10, +10, size=3),
    # index=RangeIndex(0, 3),
    # index=RangeIndex(3, 6),
    # index=DatetimeIndex([Timestamp('2020-01-01'), Timestamp('2020-01-02'), Timestamp('2020-01-03')]),
    # index=date_range('2020-01-01', periods=3),
    # index=to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
    # index=CategoricalIndex(['a', 'b', 'c']),
)#.sample(n=1, random_state=1)

print(
    s,
    s.index,
    sep='\n',
)
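To make the memory claim concrete, here is a rough, illustrative comparison of an object-dtype Index against a CategoricalIndex built from the same repeated values (exact numbers will vary by platform and pandas version):

from pandas import Index, CategoricalIndex

idx = Index(['small', 'medium', 'large'] * 100_000)
cat = CategoricalIndex(idx)

print(
    idx.memory_usage(deep=True),  # one string object per entry
    cat.memory_usage(deep=True),  # small integer codes + one copy of the categories
    sep='\n',
)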

The IntervalIndex is a very useful index if you want to have inexact, ranged lookups:

from pandas import Series, IntervalIndex
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx := IntervalIndex.from_breaks([0, 10, 20, 30], closed='left')),
    data=rng.integers(-10, +10, size=len(idx)),
)

print(
    s,
    s.loc[4],
    sep='\n',
)

A PeriodIndex is like an IntervalIndex for dates. It’s a very nice way to represent a period of time (a day, a month, a year, a financial quarter,) more precisely and less ambiguously than representing the period with its start or end date.

Unfortunately, the PeriodIndex can be somewhat clumsy to use in practice, since there’s very little supported machinery to align it with a DatetimeIndex or another PeriodIndex.

from pandas import Series, period_range
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx := period_range('2020', periods=4, freq='Q')),
    data=rng.integers(-10, +10, size=len(idx)),
)

print(
    s,
    # s.loc['2020-01-01'],
    s.loc['2020-01-01':'2020-01-03'],
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
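One common workaround, sketched below, is to drop down to a DatetimeIndex via .to_timestamp before aligning against timestamp-indexed data:

from pandas import period_range

idx = period_range('2020', periods=4, freq='Q')

print(
    idx.to_timestamp(how='start'),  # a DatetimeIndex of period starts
    idx.to_timestamp(how='end'),    # … or of period ends
    sep='\n',
)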

A MultiIndex is one of the most useful index types. A lot of data that we may mistakenly think of as two-dimensional is actually hierarchically-indexed data.

When first introduced into pandas, the MultiIndex was not particularly well supported. If you had a bad experience with it from five years ago, know that MultiIndex support in pandas is much, much better today. There are, of course, still some gaps. Some operations performed on MultiIndex are slower than they should be, there are some cases where methods that should exist don’t exist yet, and there are still some bugs. But pandas is a tool under active development, and we should anticipate that MultiIndex support will only get better over time.

The use of MultiIndex also introduces the use of IndexSlice. IndexSlice is an important helper for getting cross-sectional slices of our pandas.DataFrame. In fact, we rarely use .xs in our pandas code, since IndexSlice can accomplish everything .xs does and more (albeit with slightly clumsier syntax.)

from pandas import Series, MultiIndex, period_range, IndexSlice
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx :=
        MultiIndex.from_product([
            ['abc', 'def'],
            period_range('2020', periods=2, freq='Q'),
        ])
    ),
    data=rng.integers(-10, +10, size=len(idx)),
)

print(
    s,
    s.loc['abc'],
    s.loc[IndexSlice[:, '2020-01-01':'2020-01-05']],
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
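For reference, here is a sketch of the nearest .xs equivalent of the cross-sectional lookup above; note that .xs takes a single key per level, whereas IndexSlice also handles ranges:

from pandas import Series, MultiIndex, Period, period_range
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx :=
        MultiIndex.from_product([
            ['abc', 'def'],
            period_range('2020', periods=2, freq='Q'),
        ])
    ),
    data=rng.integers(-10, +10, size=len(idx)),
)

print(
    s.xs(Period('2020Q1', freq='Q'), level=1),  # one key, one level
)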

index alignment

All operations performed between pandas.Series or pandas.DataFrame are subject to the rules of index alignment. This includes assignment operations such as df[...] = ... or df.assign(...)!
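As a small sketch of this, assigning a Series into a DataFrame column aligns on the index first; reordered or missing labels do not land positionally:

from pandas import DataFrame, Series

df = DataFrame({'x': [1, 2, 3]}, index=[*'abc'])

# values align by label: 'c' gets 30, 'a' gets 10,
# and 'b' (absent from the right-hand side) gets NaN
df['y'] = Series({'c': 30, 'a': 10})

print(df)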

The simplest way to start thinking about index alignment is by considering the most common alignment modality—exact against exact:

from pandas import Series
from numpy.random import default_rng

rng = default_rng(0)

s0 = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s1 = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)

print(
    s0,
    s1,
    s0 + s1,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)

With indexes of ['a', 'b', 'c'] and ['a', 'b', 'c'] we match values against other values exactly based on matching index entries. If one of the indices has the same contents but in a different order, this exact against exact matching still occurs:

from pandas import Series
from numpy.random import default_rng

rng = default_rng(0)

s0 = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s1 = Series(
    index=(idx := [*'cba']),
    data=rng.integers(-10, +10, size=len(idx)),
)

print(
    s0,
    s1,
    s0 + s1,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)

If one of our indices .has_duplicates or is not sorted (.is_monotonic_increasing/.is_monotonic_decreasing in pandas parlance,) we may see other index-alignment modalities triggered.
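These properties are cheap to check up front:

from pandas import Index

idx = Index([*'abab'])

print(
    idx.has_duplicates,           # True
    idx.is_monotonic_increasing,  # False
    sep='\n',
)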

It’s not uncommon, in the presence of duplicates, to see a Cartesian product index-alignment modality triggered, where every occurrence of a label in one index is matched against every occurrence of that same label in the other.

from pandas import Series
from numpy.random import default_rng

rng = default_rng(0)

s0 = Series(
    index=(idx := [*'abab']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s1 = Series(
    index=(idx := [*'aabb']),
    data=rng.integers(-10, +10, size=len(idx)),
)

print(
    s0,
    s1,
    s0 + s1,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)

However, note that, in practice, we often have to deal only with the exact against exact alignment modality. It’s common, immediately after loading our data, to consider what would happen if we had duplicates in our data.

We may do .groupby(obj.index).agg(...) or a .sort_index() defensively after data-loading to ensure that our indices are both free of duplicates and monotonic, thus leaving us to worry only about the exact against exact alignment modality.

from pandas import Series
from numpy.random import default_rng

rng = default_rng(0)

s0 = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s1 = Series(
    index=(idx := [*'aabb']),
    data=rng.integers(-10, +10, size=len(idx)),
)

s0 = s0.pipe(lambda s: s.groupby(s.index).sum().sort_index())
s1 = s1.pipe(lambda s: s.groupby(s.index).sum().sort_index())

print(
    s0,
    s1,
    s0 + s1,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)

Note that when we align pandas.Series against pandas.DataFrame, we align the .index of the pandas.Series against the .columns of our pandas.DataFrame!

from pandas import Series, DataFrame, date_range
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
df = DataFrame(
    index=date_range('2020-01-01', periods=3),
    data={
        'a': rng.integers(-10, +10, size=len(idx)),
        'b': rng.integers(-10, +10, size=len(idx)),
        'c': rng.integers(-10, +10, size=len(idx)),
    }
)

print(
    s,
    df,
    s * df,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
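If we instead want to align the pandas.Series against the row .index of the pandas.DataFrame, we have to reach for the method form of the operator and pass axis=0; a small sketch:

from pandas import Series, DataFrame, date_range
from numpy.random import default_rng

rng = default_rng(0)

df = DataFrame(
    index=(idx := date_range('2020-01-01', periods=3)),
    data={
        'a': rng.integers(-10, +10, size=len(idx)),
        'b': rng.integers(-10, +10, size=len(idx)),
    }
)
s = Series(index=idx, data=rng.integers(-10, +10, size=len(idx)))

print(
    df.mul(s, axis=0),  # align against the row .index instead of .columns
)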

Here is an example where index alignment can be quite useful. If we have prices of assets across multiple dates and the total held positions of the assets across multiple portfolios, we can determine what the price of each holding is per portfolio.

Unfortunately, the below fails, because in order for pandas to align these MultiIndex, it needs to understand which levels represent the same things.

from pandas import Series, MultiIndex, date_range
from numpy.random import default_rng

rng = default_rng(0)

assets = 'ABC DEF XYZ'.split()
prices = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', periods=3),
        assets,
    ])),
    data=rng.normal(loc=100, scale=5, size=len(idx)),
).round(2)

positions = Series({
    ('Alice', 'ABC'): +100,
    ('Bob',   'ABC'): +200,
    ('Bob',   'DEF'): -100,
})

print(
    prices,
    positions,
    prices * positions,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)

Therefore, we need to name the levels of the indices. When we do this, we see a useful result from our index alignment!

from pandas import Series, MultiIndex, date_range
from numpy.random import default_rng

rng = default_rng(0)

assets = 'ABC DEF XYZ'.split()
prices = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', periods=3),
        assets,
    ], names='date asset'.split())),
    data=rng.normal(loc=100, scale=5, size=len(idx)),
).round(2)

positions = Series({
    ('Alice', 'ABC'): +100,
    ('Bob',   'ABC'): +200,
    ('Bob',   'DEF'): -100,
}).rename_axis(index=['book', 'asset'])

print(
    prices,
    positions,
    (prices * positions).dropna(),
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)

.groupby

Note that all .groupby modalities are defined in terms of what they do to our index: .agg collapses each group to a single entry indexed by the group key, .transform returns a result aligned to the original index, and .apply may introduce the group keys as a new outer index level (unless we pass group_keys=False.)

from pandas import Series, MultiIndex, date_range, concat
from numpy.random import default_rng

rng = default_rng(0)

assets = 'ABC DEF XYZ'.split()
prices = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', periods=3),
        assets,
    ], names='date asset'.split())),
    data=rng.normal(loc=100, scale=5, size=len(idx)),
).round(2)

print(
    prices,
    # prices.groupby('asset').agg('max'),
    # prices.groupby('asset').transform('cummax'),
    # prices.groupby('asset').apply(lambda g: concat([g, g])),
    # prices.groupby('asset', group_keys=False).apply(lambda g: concat([g, g])),
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)

Example

print("Let's take a look!")

Pricing Data

Let’s start with some simple pricing data. Let’s look at how we construct it.

We’ll start by thinking about the dates we want to operate over. We’ll look at minute-frequency data:

from pandas import date_range

idx = date_range('2010', '2020', freq='min', name='date')

print(idx)

Let’s restrict ourselves to just market hours: 9:30 AM to 4:00 PM.

Unfortunately, we do not have a .between_time method on Index like we do on pandas.Series and pandas.DataFrame. It’s not uncommon that useful methods don’t (yet!) exist on your index.

from pandas import date_range

dates = date_range('2010', '2020', freq='min', name='date')
dates = dates[dates.indexer_between_time('09:30', '16:00')]

print(dates)

Let’s create some tickers by randomly generating 4-letter strings:

from string import ascii_lowercase

from numpy.random import default_rng
from numpy import unique

rng = default_rng(0)

currencies = {'USD', 'EUR'}
tickers = unique(
    rng.choice([*ascii_lowercase], size=(200, 4)).view('<U4').ravel()
)

print(
    currencies,
    tickers,
    sep='\n',
)

Let’s construct a CategoricalIndex for our assets, since they aren’t arbitrary string data. Let’s allow ourselves to distinguish between tradeable assets (tickers) and fungible assets (currencies.)

from string import ascii_lowercase

from pandas import CategoricalIndex
from numpy.random import default_rng
from numpy import unique

rng = default_rng(0)

currencies = {'USD', 'EUR'}
tickers = unique(
    rng.choice([*ascii_lowercase], size=(20, 4)).view('<U4').ravel()
)

assets = CategoricalIndex([*tickers, *sorted(currencies)], name='asset')

currencies = CategoricalIndex([*sorted(currencies)], dtype=assets.dtype, name='currency')
tickers = CategoricalIndex([*sorted(tickers)], dtype=assets.dtype, name='ticker')

print(assets)

Let’s construct a MultiIndex from the Cartesian product of each ticker for each timestamp:

from string import ascii_lowercase

from pandas import CategoricalIndex, date_range, DataFrame, MultiIndex
from numpy.random import default_rng
from numpy import unique

rng = default_rng(0)

dates = date_range('2010', '2020', freq='min', name='date')
dates = dates[dates.indexer_between_time('09:30', '16:00')]

currencies = {'USD', 'EUR'}
tickers = unique(
    rng.choice([*ascii_lowercase], size=(20, 4)).view('<U4').ravel()
)
assets = CategoricalIndex([*tickers, *sorted(currencies)], name='asset')
currencies = CategoricalIndex([*sorted(currencies)], dtype=assets.dtype, name='currency')
tickers = CategoricalIndex([*sorted(tickers)], dtype=assets.dtype, name='ticker')

idx = MultiIndex.from_product([
    dates,
    tickers,
])

print(idx)

Let’s construct our prices as a DataFrame with a buy side and a sell side:

from logging import getLogger, basicConfig, INFO
from pathlib import Path
from pickle import dump
from string import ascii_lowercase

from pandas import CategoricalIndex, date_range, DataFrame, MultiIndex, IndexSlice
from numpy.random import default_rng
from numpy import unique

logger = getLogger(__name__); basicConfig(level=INFO)

rng = default_rng(0)

dates = date_range('2010', '2020', freq='min', name='date')
dates = dates[dates.indexer_between_time('09:30', '16:00')]

currencies = {'USD', 'EUR'}
tickers = unique(
    rng.choice([*ascii_lowercase], size=(20, 4)).view('<U4').ravel()
)
assets = CategoricalIndex([*tickers, *sorted(currencies)], name='asset')
currencies = CategoricalIndex([*sorted(currencies)], dtype=assets.dtype, name='currency')
tickers = CategoricalIndex([*sorted(tickers)], dtype=assets.dtype, name='ticker')

prices = (
    DataFrame(
        index=(idx := MultiIndex.from_product([
            dates,
            tickers,
        ])),
        data={
            'buy': (
                rng.normal(loc=100, scale=20, size=len(tickers)).clip(10, 200)
              * rng.normal(loc=1, scale=.0001, size=(len(dates), len(tickers))).cumprod(axis=0)
            ).ravel()
        },
    )
    .assign(
        sell=lambda df: df['buy'] * (1 - abs(rng.normal(loc=0, scale=.05, size=len(df))))
    )
    .round(4)
    .rename_axis(columns='direction')
)

print(
    prices.head(3),
    # prices.loc[IndexSlice[:, 'arnq', :]],
    sep='\n',
)

data_dir = Path('data')
data_dir.mkdir(exist_ok=True, parents=True)

prices.to_pickle(filename := (data_dir / 'prices.pkl'))
logger.info('Wrote %s', filename)

with open(filename := (data_dir / 'dtypes.pkl'), mode='wb') as f:
    dump({'currencies': currencies, 'tickers': tickers, 'assets': assets, 'dates': dates}, f)
logger.info('Wrote %s', filename)

We’ll store the pricing data to a Pickle file (which is fine, because we are not distributing this data—we’re moving it merely from our “left-hand” to our “right-hand.”)

Trading Data

Let’s construct our trades by first thinking about the indexing. We want to have ten chances to trade under every name at every timestamp. We will sample this down to 5% of these chances. (It would be nice if .sample(...) existed on Index, but it doesn’t.)

from pathlib import Path
from pickle import load

from numpy import arange
from numpy.random import default_rng
from pandas import Series, MultiIndex

rng = default_rng(0)

data_dir = Path('data')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']
# globals().update(dtypes)

idx = MultiIndex.from_product([
    dates,
    tickers,
    range(10),
], names=[dates.name, tickers.name, 'trade'])
idx = idx[rng.choice(arange(len(idx)), size=int(len(idx) * .05), replace=True)]

print(
    idx,
    sep='\n',
)

We’ll randomly decide to trade between 1,000 and 10,000 units in each instance (rounded to the nearest hundred):

from pathlib import Path
from pickle import load

from numpy import arange
from numpy.random import default_rng
from pandas import read_pickle, Series, MultiIndex

rng = default_rng(0)

data_dir = Path('data')
prices = read_pickle(data_dir / 'prices.pkl')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']
# globals().update(dtypes)

idx = MultiIndex.from_product([
    dates,
    tickers,
    range(10),
], names=[dates.name, tickers.name, 'trade'])
idx = idx[rng.choice(arange(len(idx)), size=int(len(idx) * .05), replace=True)]
idx = idx.droplevel('trade')

trades = (
    Series(
        index=idx,
        data=(
            rng.choice([-1, +1], size=len(idx))
          * rng.integers(1_000, 10_000, size=len(idx))
        )
    )
    .round(-2)
    .sort_index()
)

print(
    trades,
    sep='\n',
)

It’s unfortunate that a fair amount of functionality is missing from our Index, but pandas is a tool under active development, so it takes time for all of the useful methods to be implemented.

from pandas import Index

print(dir(Index))

In the meantime, here is a great trick we like to use. Use pandas.api.extensions.register_index_accessor to make an ._ext namespace for your indices. This should be a name that is easy to grep for in your code.

Under that namespace, implement the missing functionality, striving to implement it in such a way that it looks like existing functionality. e.g., to implement the missing .updatelevel or .addlevel method, try to copy how .droplevel or .swaplevel looks!

While we’re waiting for functionality to be added to pandas, we can use our own functionality in the ._ext namespace. When the functionality is eventually added, we can grep for instances of ._ext.updatelevel and replace them with the official functionality!

from collections.abc import Callable
from dataclasses import dataclass
from pathlib import Path

from pandas import Index, MultiIndex
from pandas.api.extensions import register_index_accessor

@register_index_accessor('_ext')
@dataclass
class _ext:
    obj : Index

    def addlevel(self, **levels):
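        # each level value may be given directly or as a callable computed from the index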
        levels = {
            k: v if not isinstance(v, Callable) else v(self.obj)
            for k, v in levels.items()
        }
        new_obj = self.obj.copy(deep=False)
        if not isinstance(new_obj, MultiIndex):
            new_obj = MultiIndex.from_arrays([
                new_obj
            ])
        names = new_obj.names
        new_obj.names = [None] * len(names)
        return MultiIndex.from_arrays([
            *(
                new_obj.get_level_values(idx)
                for idx in range(len(names))
            ),
            *levels.values(),
        ], names=[*names, *levels.keys()])

    def updatelevel(self, **levels):
        levels = {
            k: v if not isinstance(v, Callable) else v(self.obj)
            for k, v in levels.items()
        }
        new_obj = self.obj.copy(deep=False)
        if not isinstance(new_obj, MultiIndex):
            new_obj = MultiIndex.from_arrays([
                new_obj
            ])
        names = new_obj.names
        new_obj.names = [None] * len(names)
        return MultiIndex.from_arrays([
            levels[n]
            if n in levels else
            new_obj.get_level_values(idx)
            for idx, n in enumerate(names)
        ], names=names)

if __name__ == '__main__':
    from logging import getLogger, basicConfig, INFO

    from pandas import date_range

    logger = getLogger(__name__); basicConfig(level=INFO)

    idx = Index([*'abc'])
    print(
        idx._ext.addlevel(date=date_range('2020-01-01', periods=3))
    )

    lib_dir = Path('_lib')
    lib_dir.mkdir(exist_ok=True, parents=True)

    with open(filename := (lib_dir / '__init__.py'), mode='wt') as dst:
        with open(__file__) as src:
            print(src.read(), file=dst)

    logger.info('Wrote %s', filename)

The above dramatically simplifies working with MultiIndex. Let’s complete our trade data by adding noise to the time measurements and adding a trade identifier.

from logging import getLogger, basicConfig, INFO
from pathlib import Path
from pickle import load

from numpy import arange
from numpy.random import default_rng
from pandas import read_pickle, Series, MultiIndex, to_timedelta, IndexSlice

import _lib

logger = getLogger(__name__); basicConfig(level=INFO)

rng = default_rng(0)

data_dir = Path('data')
prices = read_pickle(data_dir / 'prices.pkl')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']

idx = MultiIndex.from_product([
    dates,
    tickers,
    range(10),
], names=[dates.name, 'asset', 'trade'])
idx = idx[rng.choice(arange(len(idx)), size=int(len(idx) * .005), replace=True)]
idx = idx.droplevel('trade')

trades = (
    Series(
        index=idx,
        data=(
            rng.choice([-1, +1], size=len(idx))
          * rng.integers(1_000, 100_000, size=len(idx))
        )
    )
    .round(-2)
    .pipe(lambda s: s.set_axis(s.index
        ._ext.updatelevel(date=lambda idx:
            idx.get_level_values('date')
          + to_timedelta(abs(rng.normal(loc=0, scale=10, size=len(idx)).round()), unit='s')
        )
    ))
    .sort_index()
    .loc[
        prices.index.get_level_values('date').min()
       :prices.index.get_level_values('date').max()
    ]
    .pipe(lambda s: s.set_axis(s.index
        ._ext.addlevel(trade=range(len(s)))
    ))
)

print(
    trades,
    sep='\n',
)

data_dir = Path('data')
data_dir.mkdir(exist_ok=True, parents=True)

trades.to_pickle(filename := (data_dir / 'trades.pkl'))
logger.info('Wrote %s', filename)

Finally, we’re ready for analysis. Given our pricing information and our trades, we can execute our trades. When we execute our trades, we create a cashflow entry, denominated in USD, for the cash that we paid or received to buy or sell the volume. We implement a liquidate function for representing what happens when we want to liquidate our entire portfolio—i.e., sell everything as of some date.

Together, these give us all we need to compute P&L, market value, and other interesting portfolio calculations:

from collections import namedtuple
from logging import getLogger, basicConfig, INFO
from pathlib import Path
from pickle import load, dump

from numpy import where
from pandas import read_pickle, MultiIndex, period_range, IndexSlice

import _lib

logger = getLogger(__name__); basicConfig(level=INFO)

data_dir = Path('data')
prices = read_pickle(data_dir / 'prices.pkl')
trades = read_pickle(data_dir / 'trades.pkl')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']

def execute(volumes, prices):
    traded_prices = prices.loc[
        MultiIndex.from_arrays([
            volumes.index.get_level_values('date').floor('min'),
            volumes.index.get_level_values('asset'),
        ])
    ].pipe(lambda df: where(volumes > 0, df['buy'], df['sell']))
    return (
        (traded_prices * -volumes)
        .pipe(lambda s: s.set_axis(s.index._ext.updatelevel(asset=['USD'] * len(volumes))))
    )

Liquidation = namedtuple('Liquidation', 'trades cashflows')
def liquidate(volumes, prices, *, date=None):
    dates = [prices.index.get_level_values('date').max() if date is None else date] * len(volumes)
    trades = (
        (-volumes)
        .pipe(lambda s: s.set_axis(s.index._ext.updatelevel(date=dates)))
    )
    return Liquidation(trades=trades, cashflows=execute(trades, prices))

cashflows = execute(trades, prices)
liquidations = {
    yr: liquidate(
        trades.loc[IndexSlice[:yr.end_time, :]],
        prices,
        date=prices.index.get_level_values('date').unique().asof(yr.end_time),
    )
    for yr in period_range(
        prices.index.get_level_values('date').min(),
        prices.index.get_level_values('date').max(),
        freq='Y',
    )
}

print(
    trades,
    cashflows,
    *liquidations[max(liquidations)],

    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)

data_dir = Path('data')
data_dir.mkdir(exist_ok=True, parents=True)

cashflows.to_pickle(filename := (data_dir / 'cashflows.pkl'))
logger.info('Wrote %s', filename)

with open(filename := (data_dir / 'liquidations.pkl'), mode='wb') as f:
    dump(liquidations, f)
logger.info('Wrote %s', filename)

Analysis

Let’s compute our P&L, our market value, our cash position, and other useful calculations using our data & our execute and liquidate functions:

from collections import namedtuple
from itertools import chain
from logging import getLogger, basicConfig, INFO
from pathlib import Path
from pickle import load

from pandas import read_pickle, IndexSlice, concat, DataFrame

import _lib

logger = getLogger(__name__); basicConfig(level=INFO)

data_dir = Path('data')
prices = read_pickle(data_dir / 'prices.pkl')
trades = read_pickle(data_dir / 'trades.pkl')
cashflows = read_pickle(data_dir / 'cashflows.pkl')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']
Liquidation = namedtuple('Liquidation', 'trades cashflows') # reconstruct
with open(data_dir / 'liquidations.pkl', mode='rb') as f:
    liquidations = load(f)

by_year = DataFrame(
    index=(idx := liquidations),
    data={
        'cash': (
            cashflows.loc[IndexSlice[:yr.end_time, :]].groupby('asset', observed=True).sum().loc['USD']
            for yr in idx
        ),
        'market value': (
            liquidations[yr].cashflows.groupby('asset', observed=True).sum().loc['USD']
            for yr in idx
        ),
    }
)

print(
    # trades.groupby('asset', observed=True).sum(),
    # trades.groupby('asset', observed=True).cumsum(),

    # cashflows.groupby('asset', observed=True).sum(),
    # liquidations[max(liquidations)].cashflows.groupby('asset', observed=True).sum(),

    by_year,

    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
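If we take total P&L per liquidation year to be the realized cash plus the proceeds of liquidating the residual positions (one reasonable definition; the pnl name below is our own), it is one .assign away:

# continuing from the script above: by_year already has 'cash' and 'market value'
print(
    by_year.assign(pnl=lambda df: df['cash'] + df['market value']),
)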