No .index, No pandas
Date: Friday, July 26, 2024 at 09:30 AM US/Eastern
Topics: pandas
Understanding the .index will change your pandas code forever.
Pandas has seven unique built-in Index types, including the CategoricalIndex,
DatetimeIndex, TimedeltaIndex, PeriodIndex, IntervalIndex, and more! Do
you know which to reach for in a given problem?
In this upcoming seminar, we will focus on the importance of understanding
the index in pandas. Not only will we explore each of the standard Index
types that pandas offers, but we will also develop an intuition around
when you would want to use one over the others. Additionally, we will provide
guidance on how the Index is used across every single pandas operation and
how a better understanding of it can help you make sense of your results.
This session is a great opportunity to enhance your skills and gain a
deeper understanding of the most widely used DataFrame tool in Python.
pip install pandas numpy scipy matplotlib
print("Let's take a look!")
When you see pandas code that we share, it may look slightly unusual.
Our imports look different.
You may typically see people import pandas as pd or import numpy as np…
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3]})
print(df)
… but we from pandas import DataFrame. This is mostly done for didactic
purposes so that you can have a “heads-up” on what imports the following
code contains & what it is trying to accomplish. While these import styles
differ (one is early-bound and the other is late-bound,) there is no
difference in performance or memory-use, and, for our purposes, this is
a largely superficial choice.
from pandas import DataFrame
df = DataFrame({'x': [1, 2, 3]})
print(df)
You will see that we’ll often use the pandas.Series. This is a much more
meaningful difference. It’s not uncommon to see pandas code that uses only
pandas.DataFrame and totally ignores the pandas.Series. However, the
pandas.Series is a very useful entity for representing a single
(one-dimensional) dataset. Its alignment rules are very straightforward,
and (often) it uses less memory than a sparsely-aligned pandas.DataFrame.
from pandas import Series
s = Series([1, 2, 3])
print(s)
We’ll rarely use the pandas.array directly (though we’ll show it to you when
we’re trying to make a point about how pandas works internally.) The pandas.array
provides us with a necessary indirection over storage formats like numpy.ndarray.
A numpy.ndarray cannot directly represent missing integer data, since there
can be no unambiguous in-band encoding for the int64 dtype (unlike
float64 which reserves float('nan') values to represent missing data.) A
numpy.ma masked array can represent missing data out-of-band (by storing
both the .data together with a separate .mask.) A pandas.array is a similar
indirection.
Altogether, the pandas.array is interesting to know about but not that
interesting in day-to-day use.
from pandas import array
xs = array([1, 2, 3])
print(xs)
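To make the missing-data point concrete, here is a minimal sketch (our own example) contrasting the nullable Int64 extension dtype with what a numpy.ndarray can hold:
from pandas import array
# numpy.array([1, None, 3], dtype='int64') raises a TypeError—there is
# no in-band encoding for a missing int64 value…
xs = array([1, None, 3], dtype='Int64')
# …but the pandas.array stores the missing entry out-of-band as <NA>
print(xs)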
Most critically, you’ll not only see us specify indices… but we’ll even specify them first. Our analyses will often start with a discussion of the desired indexing and “shape” of the data… and often proceed along these lines ignoring details of the data itself.
In fact, we’ll often use synthetic or dummy data for our examples, and we’ll use dummy data even when we’re working on actual analyses for our project work!
We’ll use dummy data, because it creates a useful, fixed target for writing
our code, and, if we write our code in terms of the structure of the pandas.Series
and pandas.DataFrame—in other words, in terms of the index—we’ll find that
when we swap our dummy data for real data, everything works perfectly!
from pandas import Series, date_range
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
    index=(idx := date_range('2020-01-01', freq='QE', periods=4)),
    data=rng.integers(-100, +100, size=len(idx)),
)
print(s)
And you will rarely see us .reset_index or .values!
That isn’t to say that .reset_index or .values are never necessary. There
are cases where pandas will do a costly index alignment even if the two
sides of the operation are already perfectly aligned, there are cases where
pandas will try to do a spurious index alignment, and there are cases where
pandas will spuriously fail to allow an operation like a .rolling with a
date window. In these cases, we may use .reset_index, .values or
.droplevel to coërce correct behavior from pandas.
In some of these cases, newer releases of pandas have made this use of
.reset_index, .values, or .droplevel unnecessary.
However, altogether, our use of .reset_index or .values is fairly
minimal—certainly not the reflexive .groupby(...)….reset_index() that we
see in user code!
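For contrast, here is a small sketch (our own example) of that reflexive style next to the index-preserving alternative:
from pandas import DataFrame
df = DataFrame({'key': [*'aab'], 'value': [1, 2, 3]})
# the reflexive style discards the group keys from the index…
flat = df.groupby('key').sum().reset_index()
# …whereas keeping them in the index lets later operations align on them
kept = df.groupby('key').sum()
print(flat, kept, sep='\n')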
print("Let's take a look!")
pandas.array
The pandas.array is how we represent just the data from a single,
one-dimensional dataset. Altogether, it’s not something we will
see used in day-to-day code.
from pandas import array
xs = array(data=[1, 2, 3])
print(
    xs,
)
pandas.Series
A pandas.Series is the data from a one-dimensional measurement together
with a human labelling—the .index. The .index is metadata that allows
us to address and refer to parts of our data.
from pandas import array, Series, date_range
s = Series(
    index=date_range('2020-01-01', periods=3),
    data=array([1, 2, 3])
)
print(
    s,
)
Whereas .iloc operations are rarely human-meaningful (except, perhaps,
.iloc[0] or .iloc[-1] for the first or last captured measurement,) .loc
operations are intended to be human-meaningful and human-readable.
A common example is the use of a DatetimeIndex for timeseries data. It’s
definitely the case that the date could be made a column of a DataFrame, but
then instead of .loc['2020-01-01'] we’ll have to write ugly .query("date ==
...") or masking [df['date'] == ...] code.
from pandas import array, Series, date_range
s = Series(
    index=date_range('2020-01-01', periods=3),
    data=array([1, 2, 3])
)
print(
    s,
    s.loc['2020-01-03'],
    s.iloc[2],
    sep='\n',
)
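For contrast, here is (our own sketch of) what the same lookup looks like when the date is just a column:
from pandas import DataFrame, date_range
df = DataFrame({
    'date': date_range('2020-01-01', periods=3),
    'value': [1, 2, 3],
})
# without a DatetimeIndex, we are stuck with masking or .query
print(
    df[df['date'] == '2020-01-03'],
    df.query("date == '2020-01-03'"),
    sep='\n',
)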
If we do not specify the index for a pandas.Series, that doesn’t mean that
it has no .index. It means it gets the simplest kind of index—a RangeIndex.
We often call the RangeIndex a “degenerate” index, because it rarely has
human meaning. That doesn’t mean that RangeIndex are never useful
indices—sometimes we do want to track measurements in the strict order in
which they were captured—but it means that most uses of RangeIndex are
part of a flawed attempt to avoid dealing with indices and index-alignment.
Using .reset_index to get a degenerate RangeIndex will not successfully
allow you to avoid dealing with indices and index-alignment, because the moment
you subsequently filter or .sort_values on the data, the RangeIndex will
become what was previously an Int64Index (and what is now an Index(...,
dtype='int64')) with jumbled, discontinuous numbers that will not meaningfully
align against anything else!
from pandas import array, Series
s = Series(
    data=array([1, 2, 3])
)
print(
    s,
    sep='\n',
)
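To see the failure mode described above, here is a small sketch (our own example): after filtering, the degenerate index no longer aligns meaningfully.
from pandas import Series
s = Series([5, -1, 7]).loc[lambda s: s > 0]  # default RangeIndex…
print(
    s,                     # …survives filtering as entries 0 and 2
    s + Series([10, 10]),  # aligns against 0 and 1 → NaN at 1 and 2
    sep='\n',
)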
There are many types of indices that we see in use. DatetimeIndex and
CategoricalIndex are two very common and very useful index types.
We often create DatetimeIndex via a helper function like date_range or
to_datetime. Its usefulness should be fairly obvious in the case of
timeseries data.
A CategoricalIndex is a very useful index for data that we may otherwise
store in what was previously a StringIndex. If we have textual data
that is restricted to an enumerable, finite set of possible values, then
we’re actually dealing with categorical data. The use of a CategoricalIndex
will save us memory and significantly speed up filtering and alignment operations!
from pandas import (
    Series, RangeIndex, DatetimeIndex, CategoricalIndex, Timestamp,
    date_range, to_datetime,
)
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
    data=rng.integers(-10, +10, size=3),
    # index=RangeIndex(0, 3),
    # index=RangeIndex(3, 6),
    # index=DatetimeIndex([Timestamp('2020-01-01'), Timestamp('2020-01-02'), Timestamp('2020-01-03')]),
    # index=date_range('2020-01-01', periods=3),
    # index=to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
    # index=CategoricalIndex(['a', 'b', 'c']),
)#.sample(n=1, random_state=1)
print(
    s,
    s.index,
    sep='\n',
)
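As a rough check on the memory claim (a sketch of our own; exact numbers vary by pandas version):
from pandas import Index, CategoricalIndex
from numpy.random import default_rng
rng = default_rng(0)
words = rng.choice(['alpha', 'beta', 'gamma'], size=100_000)
print(
    Index(words).memory_usage(deep=True),             # one object per entry
    CategoricalIndex(words).memory_usage(deep=True),  # 3 categories + small integer codes
    sep='\n',
)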
The IntervalIndex is a very useful index if you want to have inexact, ranged
lookups:
from pandas import Series, IntervalIndex
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
    index=(idx := IntervalIndex.from_breaks([0, 10, 20, 30], closed='left')),
    data=rng.integers(-10, +10, size=len(idx)),
)
print(
    s,
    s.loc[4],
    sep='\n',
)
A PeriodIndex is like an IntervalIndex for dates. It’s a very nice way to
represent a period of time (a day, a month, a year, a financial quarter,) more
precisely and less ambiguously than representing the period with its start or
end date.
Unfortunately, the PeriodIndex can be somewhat clumsy to use in practice,
since there’s very little supported mechanism to align it with DatetimeIndex
or other PeriodIndex.
from pandas import Series, period_range
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
    index=(idx := period_range('2020', periods=4, freq='Q')),
    data=rng.integers(-10, +10, size=len(idx)),
)
print(
    s,
    # s.loc['2020-01-01'],
    s.loc['2020-01-01':'2020-01-03'],
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
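One workaround we sometimes reach for (our own habit, not an official recommendation) is to convert the PeriodIndex explicitly before aligning:
from pandas import Series, period_range
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
    index=(idx := period_range('2020', periods=4, freq='Q')),
    data=rng.integers(-10, +10, size=len(idx)),
)
# convert to a DatetimeIndex of period-start timestamps so that it
# can align against ordinary timeseries data
print(s.set_axis(s.index.to_timestamp(how='start')))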
A MultiIndex is one of the most useful index types. A lot of data that we may
mistakenly think is two-dimensional is actually hierarchically-indexed data.
When first introduced into pandas, the MultiIndex was not particularly well
supported. If you had a bad experience with it from five years ago, know that
MultiIndex support in pandas is much, much better today. There are, of
course, still some gaps. Some operations performed on MultiIndex are slower
than they should be, there are some cases where methods that should exist don’t
exist yet, and there are still some bugs. But pandas is a tool under active
development, and we should anticipate that MultiIndex support will only
get better over time.
The use of MultiIndex also introduces the use of IndexSlice. IndexSlice
is an important helper for getting cross-sectional slices of our pandas.DataFrame.
In fact, we rarely use .xs in our pandas code, since IndexSlice can accomplish
everything .xs does and more (albeit with slightly clumsier syntax.)
from pandas import Series, MultiIndex, period_range, IndexSlice
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
    index=(idx := MultiIndex.from_product([
        ['abc', 'def'],
        period_range('2020', periods=2, freq='Q'),
    ])),
    data=rng.integers(-10, +10, size=len(idx)),
)
print(
    s,
    s.loc['abc'],
    s.loc[IndexSlice[:, '2020-01-01':'2020-01-05']],
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
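To see the overlap with .xs (a small sketch of our own):
from pandas import Series, MultiIndex, IndexSlice
s = Series(
    index=MultiIndex.from_product([[*'ab'], [10, 20]]),
    data=[1, 2, 3, 4],
)
print(
    s.xs(10, level=1),         # selects a cross-section, dropping the level
    s.loc[IndexSlice[:, 10]],  # same selection; keeps the level, allows ranges
    sep='\n',
)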
All operations performed between pandas.Series or pandas.DataFrame are
subject to the rules of index alignment. This includes assignment operations
such as df[...] = ... or df.assign(...)!
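For instance, even a plain column assignment aligns on the index (a small sketch of our own):
from pandas import DataFrame, Series
df = DataFrame({'x': [1, 2, 3]}, index=[*'abc'])
s = Series([10, 30], index=[*'ac'])
df['y'] = s  # aligned by index label: 'b' has no match → NaN
print(df)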
The simplest way to start thinking about index alignment is by considering the most common alignment modality—exact against exact:
from pandas import Series
from numpy.random import default_rng
rng = default_rng(0)
s0 = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s1 = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
print(
    s0,
    s1,
    s0 + s1,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
With indexes of ['a', 'b', 'c'] and ['a', 'b', 'c'] we match values against
other values exactly based on matching index entries. If one of the indices
has the same contents but in different order, this exact against exact
matching still occurs:
from pandas import Series
from numpy.random import default_rng
rng = default_rng(0)
s0 = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s1 = Series(
    index=(idx := [*'cba']),
    data=rng.integers(-10, +10, size=len(idx)),
)
print(
    s0,
    s1,
    s0 + s1,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
If one of our indices .has_duplicates or is not sorted
(.is_monotonic_increasing/.is_monotonic_decreasing in pandas parlance,)
we may see other index-alignment modalities triggered.
It’s not uncommon, in the presence of duplicates, to see a Cartesian product index-alignment modality triggered, where every occurrence of a key in one index is matched against every occurrence of that key in the other.
from pandas import Series
from numpy.random import default_rng
rng = default_rng(0)
s0 = Series(
    index=(idx := [*'abab']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s1 = Series(
    index=(idx := [*'aabb']),
    data=rng.integers(-10, +10, size=len(idx)),
)
print(
    s0,
    s1,
    s0 + s1,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
However, note that, in practice, we often have to deal only with the exact against exact alignment modality. It’s common, immediately after loading our data, to consider what would happen if we had duplicates in our data.
We may do .groupby(obj.index).agg(...) or a .sort_index() defensively after
data-loading to ensure that our indices are both free of duplicates and
monotonic, thus leaving us to worry only about the exact against exact
alignment modality.
from pandas import Series
from numpy.random import default_rng
rng = default_rng(0)
s0 = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s1 = Series(
    index=(idx := [*'aabb']),
    data=rng.integers(-10, +10, size=len(idx)),
)
s0 = s0.pipe(lambda s: s.groupby(s.index).sum().sort_index())
s1 = s1.pipe(lambda s: s.groupby(s.index).sum().sort_index())
print(
    s0,
    s1,
    s0 + s1,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
Note that when we align pandas.Series against pandas.DataFrame, we align
the .index of the pandas.Series against the .columns of our
pandas.DataFrame!
from pandas import Series, DataFrame, date_range
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
    index=(idx := [*'abc']),
    data=rng.integers(-10, +10, size=len(idx)),
)
df = DataFrame(
    index=date_range('2020-01-01', periods=3),
    data={
        'a': rng.integers(-10, +10, size=len(idx)),
        'b': rng.integers(-10, +10, size=len(idx)),
        'c': rng.integers(-10, +10, size=len(idx)),
    }
)
print(
    s,
    df,
    s * df,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
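If we instead want to align a pandas.Series against the rows of a pandas.DataFrame, the binary-operator methods accept an axis= argument (a short sketch of our own):
from pandas import Series, DataFrame, date_range
df = DataFrame(
    index=(idx := date_range('2020-01-01', periods=3)),
    data={'a': [1, 2, 3], 'b': [4, 5, 6]},
)
s = Series(index=idx, data=[10, 100, 1000])
# df * s would align s.index against df.columns; .mul(..., axis=0)
# aligns it against df.index instead
print(df.mul(s, axis=0))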
Here is an example where index alignment can be quite useful. If we have prices of assets across multiple dates and the total held positions of the assets across multiple portfolios, we can determine what the price of each holding is per portfolio.
Unfortunately, the code below fails, because in order for pandas to align these
MultiIndex, it needs to understand which levels represent the same things.
from pandas import Series, MultiIndex, date_range
from numpy.random import default_rng
rng = default_rng(0)
assets = 'ABC DEF XYZ'.split()
prices = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', periods=3),
        assets,
    ])),
    data=rng.normal(loc=100, scale=5, size=len(idx)),
).round(2)
positions = Series({
    ('Alice', 'ABC'): +100,
    ('Bob', 'ABC'): +200,
    ('Bob', 'DEF'): -100,
})
print(
    prices,
    positions,
    prices * positions,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
Therefore, we need to name the levels of the indices. When we do this, we see a useful result from our index alignment!
from pandas import Series, MultiIndex, date_range
from numpy.random import default_rng
rng = default_rng(0)
assets = 'ABC DEF XYZ'.split()
prices = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', periods=3),
        assets,
    ], names='date asset'.split())),
    data=rng.normal(loc=100, scale=5, size=len(idx)),
).round(2)
positions = Series({
    ('Alice', 'ABC'): +100,
    ('Bob', 'ABC'): +200,
    ('Bob', 'DEF'): -100,
}).rename_axis(index=['book', 'asset'])
print(
    prices,
    positions,
    (prices * positions).dropna(),
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
.groupby
Note that all .groupby modalities are defined in terms of what they do to our index. The .agg modality will change our index to one with an entry per group key; the .transform modality will preserve our original index; and the .apply modality will stack another level onto our index with our group key (unless we set group_keys=False).
from pandas import Series, MultiIndex, date_range, concat
from numpy.random import default_rng
rng = default_rng(0)
assets = 'ABC DEF XYZ'.split()
prices = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', periods=3),
        assets,
    ], names='date asset'.split())),
    data=rng.normal(loc=100, scale=5, size=len(idx)),
).round(2)
print(
    prices,
    # prices.groupby('asset').agg('max'),
    # prices.groupby('asset').transform('cummax'),
    # prices.groupby('asset').apply(lambda g: concat([g, g])),
    # prices.groupby('asset', group_keys=False).apply(lambda g: concat([g, g])),
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
print("Let's take a look!")
Let’s start with some simple pricing data. Let’s look at how we construct it.
We’ll start by thinking about the dates we want to operate over. We’ll look at minute-frequency data:
from pandas import date_range
idx = date_range('2010', '2020', freq='min', name='date')
print(idx)
Let’s restrict ourselves to just market hours: 9:30 AM to 4:00 PM.
Unfortunately, we do not have a .between_time method on Index like we do on
pandas.Series and pandas.DataFrame. It’s not uncommon that useful
methods don’t (yet!) exist on your index.
from pandas import date_range
dates = date_range('2010', '2020', freq='min', name='date')
dates = dates[dates.indexer_between_time('09:30', '16:00')]
print(dates)
Let’s create some tickers by randomly generating 4-letter strings:
from string import ascii_lowercase
from numpy.random import default_rng
from numpy import unique
rng = default_rng(0)
currencies = {'USD', 'EUR'}
tickers = unique(
    rng.choice([*ascii_lowercase], size=(200, 4)).view('<U4').ravel()
)
print(
    currencies,
    tickers,
    sep='\n',
)
Let’s construct a CategoricalIndex for our assets, since they aren’t
arbitrary string data. Let’s allow ourselves to distinguish between
tradeable assets (tickers) and fungible assets (currencies.)
from string import ascii_lowercase
from pandas import CategoricalIndex
from numpy.random import default_rng
from numpy import unique
rng = default_rng(0)
currencies = {'USD', 'EUR'}
tickers = unique(
    rng.choice([*ascii_lowercase], size=(20, 4)).view('<U4').ravel()
)
assets = CategoricalIndex([*tickers, *sorted(currencies)], name='asset')
currencies = CategoricalIndex([*sorted(currencies)], dtype=assets.dtype, name='currency')
tickers = CategoricalIndex([*sorted(tickers)], dtype=assets.dtype, name='ticker')
print(assets)
Let’s construct a MultiIndex from the Cartesian product of each ticker for
each timestamp:
from string import ascii_lowercase
from pandas import CategoricalIndex, date_range, DataFrame, MultiIndex
from numpy.random import default_rng
from numpy import unique
rng = default_rng(0)
dates = date_range('2010', '2020', freq='min', name='date')
dates = dates[dates.indexer_between_time('09:30', '16:00')]
currencies = {'USD', 'EUR'}
tickers = unique(
    rng.choice([*ascii_lowercase], size=(20, 4)).view('<U4').ravel()
)
assets = CategoricalIndex([*tickers, *sorted(currencies)], name='asset')
currencies = CategoricalIndex([*sorted(currencies)], dtype=assets.dtype, name='currency')
tickers = CategoricalIndex([*sorted(tickers)], dtype=assets.dtype, name='ticker')
idx = MultiIndex.from_product([
    dates,
    tickers,
])
print(idx)
Let’s construct our prices as a DataFrame with a buy side and a sell side:
from logging import getLogger, basicConfig, INFO
from pathlib import Path
from pickle import dump
from string import ascii_lowercase
from pandas import CategoricalIndex, date_range, DataFrame, MultiIndex, IndexSlice
from numpy.random import default_rng
from numpy import unique
logger = getLogger(__name__); basicConfig(level=INFO)
rng = default_rng(0)
dates = date_range('2010', '2020', freq='min', name='date')
dates = dates[dates.indexer_between_time('09:30', '16:00')]
currencies = {'USD', 'EUR'}
tickers = unique(
    rng.choice([*ascii_lowercase], size=(20, 4)).view('<U4').ravel()
)
assets = CategoricalIndex([*tickers, *sorted(currencies)], name='asset')
currencies = CategoricalIndex([*sorted(currencies)], dtype=assets.dtype, name='currency')
tickers = CategoricalIndex([*sorted(tickers)], dtype=assets.dtype, name='ticker')
prices = (
    DataFrame(
        index=(idx := MultiIndex.from_product([
            dates,
            tickers,
        ])),
        data={
            'buy': (
                rng.normal(loc=100, scale=20, size=len(tickers)).clip(10, 200)
                * rng.normal(loc=1, scale=.0001, size=(len(dates), len(tickers))).cumprod(axis=0)
            ).ravel()
        },
    )
    .assign(
        sell=lambda df: df['buy'] * (1 - abs(rng.normal(loc=0, scale=.05, size=len(df))))
    )
    .round(4)
    .rename_axis(columns='direction')
)
print(
    prices.head(3),
    # prices.loc[IndexSlice[:, 'arnq', :]],
    sep='\n',
)
data_dir = Path('data')
data_dir.mkdir(exist_ok=True, parents=True)
prices.to_pickle(filename := (data_dir / 'prices.pkl'))
logger.info('Wrote %s', filename)
with open(filename := (data_dir / 'dtypes.pkl'), mode='wb') as f:
    dump({'currencies': currencies, 'tickers': tickers, 'assets': assets, 'dates': dates}, f)
logger.info('Wrote %s', filename)
We’ll store the pricing data to a Pickle file (which is fine, because we are not distributing this data—we’re moving it merely from our “left-hand” to our “right-hand.”)
Let’s construct our trades by first thinking about the indexing. We want to have
ten chances to trade under every name, for every timestamp for every date. We will
sample this down to 5% of these chances. (It would be nice if .sample(...) existed
on Index, but it doesn’t.)
from pathlib import Path
from pickle import load
from numpy import arange
from numpy.random import default_rng
from pandas import Series, MultiIndex
rng = default_rng(0)
data_dir = Path('data')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']
# globals().update(dtypes)
idx = MultiIndex.from_product([
    dates,
    tickers,
    range(10),
], names=[dates.name, tickers.name, 'trade'])
idx = idx[rng.choice(arange(len(idx)), size=int(len(idx) * .05), replace=True)]
print(
    idx,
    sep='\n',
)
We’ll randomly decide to trade between 1,000 and 10,000 units in each instance (rounded to the nearest hundred):
from pathlib import Path
from pickle import load
from numpy import arange
from numpy.random import default_rng
from pandas import read_pickle, Series, MultiIndex
rng = default_rng(0)
data_dir = Path('data')
prices = read_pickle(data_dir / 'prices.pkl')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']
# globals().update(dtypes)
idx = MultiIndex.from_product([
    dates,
    tickers,
    range(10),
], names=[dates.name, tickers.name, 'trade'])
idx = idx[rng.choice(arange(len(idx)), size=int(len(idx) * .05), replace=True)]
idx = idx.droplevel('trade')
trades = (
    Series(
        index=idx,
        data=(
            rng.choice([-1, +1], size=len(idx))
            * rng.integers(1_000, 10_000, size=len(idx))
        )
    )
    .round(-2)
    .sort_index()
)
print(
    trades,
    sep='\n',
)
It’s unfortunate that we have a lot of missing functionality on our Index, but
pandas is a tool under active development, so it takes time for all of the
useful methods to be implemented.
from pandas import Index
print(dir(Index))
In the meantime, here is a great trick we like to use. Use
pandas.api.extensions.register_index_accessor to make an ._ext namespace
for your indices. This should be a name that is easy to grep for in your code.
Under that namespace, implement the missing functionality, striving to implement
it in such a way that it looks like existing functionality. e.g., to implement
the missing .updatelevel or .addlevel method, try to copy how .droplevel
or .swaplevel looks!
While we’re waiting for functionality to be added to pandas, we can use our
own functionality in the ._ext namespace. When the functionality is eventually
added, we can grep for instances of ._ext.updatelevel and replace them
with the official functionality!
from collections.abc import Callable
from dataclasses import dataclass
from pathlib import Path
from pandas import Index, MultiIndex
from pandas.api.extensions import register_index_accessor
@register_index_accessor('_ext')
@dataclass
class _ext:
    obj : Index
    def addlevel(self, **levels):
        # allow levels to be passed as callables evaluated against the Index
        levels = {
            k: v if not isinstance(v, Callable) else v(self.obj)
            for k, v in levels.items()
        }
        new_obj = self.obj.copy(deep=False)
        if not isinstance(new_obj, MultiIndex):
            new_obj = MultiIndex.from_arrays([
                new_obj
            ])
        names = new_obj.names
        new_obj.names = [None] * len(names)
        # rebuild with the existing levels followed by the new ones
        return MultiIndex.from_arrays([
            *(
                new_obj.get_level_values(idx)
                for idx in range(len(names))
            ),
            *levels.values(),
        ], names=[*names, *levels.keys()])
    def updatelevel(self, **levels):
        levels = {
            k: v if not isinstance(v, Callable) else v(self.obj)
            for k, v in levels.items()
        }
        new_obj = self.obj.copy(deep=False)
        if not isinstance(new_obj, MultiIndex):
            new_obj = MultiIndex.from_arrays([
                new_obj
            ])
        names = new_obj.names
        new_obj.names = [None] * len(names)
        # rebuild, substituting the replacement values for any named level
        return MultiIndex.from_arrays([
            levels[n]
            if n in levels else
            new_obj.get_level_values(idx)
            for idx, n in enumerate(names)
        ], names=names)
if __name__ == '__main__':
    from logging import getLogger, basicConfig, INFO
    from pandas import date_range
    logger = getLogger(__name__); basicConfig(level=INFO)
    idx = Index([*'abc'])
    print(
        idx._ext.addlevel(date=date_range('2020-01-01', periods=3))
    )
    lib_dir = Path('_lib')
    lib_dir.mkdir(exist_ok=True, parents=True)
    with open(filename := (lib_dir / '__init__.py'), mode='wt') as dst:
        with open(__file__) as src:
            print(src.read(), file=dst)
    logger.info('Wrote %s', filename)
The above dramatically simplifies working with MultiIndex. Let’s complete our
trade data by adding noise to the time measurements and adding a trade identifier.
from logging import getLogger, basicConfig, INFO
from pathlib import Path
from pickle import load
from numpy import arange
from numpy.random import default_rng
from pandas import read_pickle, Series, MultiIndex, to_timedelta, IndexSlice
import _lib
logger = getLogger(__name__); basicConfig(level=INFO)
rng = default_rng(0)
data_dir = Path('data')
prices = read_pickle(data_dir / 'prices.pkl')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']
idx = MultiIndex.from_product([
    dates,
    tickers,
    range(10),
], names=[dates.name, 'asset', 'trade'])
idx = idx[rng.choice(arange(len(idx)), size=int(len(idx) * .005), replace=True)]
idx = idx.droplevel('trade')
trades = (
    Series(
        index=idx,
        data=(
            rng.choice([-1, +1], size=len(idx))
            * rng.integers(1_000, 100_000, size=len(idx))
        )
    )
    .round(-2)
    .pipe(lambda s: s.set_axis(s.index
        ._ext.updatelevel(date=lambda idx:
            idx.get_level_values('date')
            + to_timedelta(abs(rng.normal(loc=0, scale=10, size=len(idx)).round()), unit='s')
        )
    ))
    .sort_index()
    .loc[
        prices.index.get_level_values('date').min()
        :prices.index.get_level_values('date').max()
    ]
    .pipe(lambda s: s.set_axis(s.index
        ._ext.addlevel(trade=range(len(s)))
    ))
)
print(
    trades,
    sep='\n',
)
data_dir = Path('data')
data_dir.mkdir(exist_ok=True, parents=True)
trades.to_pickle(filename := (data_dir / 'trades.pkl'))
logger.info('Wrote %s', filename)
Finally, we’re ready for analysis. Given our pricing information and our
trades, we can execute our trades. When we execute our trades, we create a
cashflow entry, denominated in USD, for the cash that we paid or received to
buy or sell the volume. We implement a liquidate function for representing
what happens when we want to liquidate our entire portfolio—i.e., sell
everything as of some date.
Together, these give us all we need to compute P&L, market value, and other interesting portfolio calculations:
from collections import namedtuple
from logging import getLogger, basicConfig, INFO
from pathlib import Path
from pickle import load, dump
from numpy import where
from pandas import read_pickle, MultiIndex, period_range, IndexSlice
import _lib
logger = getLogger(__name__); basicConfig(level=INFO)
data_dir = Path('data')
prices = read_pickle(data_dir / 'prices.pkl')
trades = read_pickle(data_dir / 'trades.pkl')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']
def execute(volumes, prices):
    # look up the prevailing quote for each trade's (minute, asset)…
    traded_prices = prices.loc[
        MultiIndex.from_arrays([
            volumes.index.get_level_values('date').floor('min'),
            volumes.index.get_level_values('asset'),
        ])
    ].pipe(lambda df: where(volumes > 0, df['buy'], df['sell']))
    # …and book the offsetting cashflow under the 'USD' asset
    return (
        (traded_prices * -volumes)
        .pipe(lambda s: s.set_axis(s.index._ext.updatelevel(asset=['USD'] * len(volumes))))
    )
Liquidation = namedtuple('Liquidation', 'trades cashflows')
def liquidate(volumes, prices, *, date=None):
    dates = [prices.index.get_level_values('date').max() if date is None else date] * len(volumes)
    # sell everything back as of the liquidation date…
    trades = (
        (-volumes)
        .pipe(lambda s: s.set_axis(s.index._ext.updatelevel(date=dates)))
    )
    # …and execute those closing trades at the prevailing prices
    return Liquidation(trades=trades, cashflows=execute(trades, prices))
cashflows = execute(trades, prices)
liquidations = {
    yr: liquidate(
        trades.loc[IndexSlice[:yr.end_time, :]],
        prices,
        date=prices.index.get_level_values('date').unique().asof(yr.end_time),
    )
    for yr in period_range(
        prices.index.get_level_values('date').min(),
        prices.index.get_level_values('date').max(),
        freq='Y',
    )
}
print(
    trades,
    cashflows,
    *liquidations[max(liquidations)],
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)
data_dir = Path('data')
data_dir.mkdir(exist_ok=True, parents=True)
cashflows.to_pickle(filename := (data_dir / 'cashflows.pkl'))
logger.info('Wrote %s', filename)
with open(filename := (data_dir / 'liquidations.pkl'), mode='wb') as f:
    dump(liquidations, f)
logger.info('Wrote %s', filename)
Let’s compute our P&L, our market value, our cash position, and other useful
calculations using our data & our execute and liquidate functions:
from collections import namedtuple
from itertools import chain
from logging import getLogger, basicConfig, INFO
from pathlib import Path
from pickle import load
from pandas import read_pickle, IndexSlice, concat, DataFrame
import _lib
logger = getLogger(__name__); basicConfig(level=INFO)
data_dir = Path('data')
prices = read_pickle(data_dir / 'prices.pkl')
trades = read_pickle(data_dir / 'trades.pkl')
cashflows = read_pickle(data_dir / 'cashflows.pkl')
with open(data_dir / 'dtypes.pkl', mode='rb') as f:
    dtypes = load(f)
currencies, tickers, assets, dates = dtypes['currencies'], dtypes['tickers'], dtypes['assets'], dtypes['dates']
Liquidation = namedtuple('Liquidation', 'trades cashflows') # reconstruct
with open(data_dir / 'liquidations.pkl', mode='rb') as f:
    liquidations = load(f)
by_year = DataFrame(
    index=(idx := [*liquidations]),
    data={
        'cash': [
            cashflows.loc[IndexSlice[:yr.end_time, :]].groupby('asset', observed=True).sum().loc['USD']
            for yr in idx
        ],
        'market value': [
            liquidations[yr].cashflows.groupby('asset', observed=True).sum().loc['USD']
            for yr in idx
        ],
    },
)
print(
    # trades.groupby('asset', observed=True).sum(),
    # trades.groupby('asset', observed=True).cumsum(),
    # cashflows.groupby('asset', observed=True).sum(),
    # liquidations[max(liquidations)].cashflows.groupby('asset', observed=True).sum(),
    by_year,
    sep='\n{}\n'.format(40*'\N{box drawings light horizontal}'),
)