JIMMY APOLLO: “Well, folks, when you’re right 52% of the time, you’re wrong 48% of the time.”
— “Lisa the Greek” (S03E14)
Why does pandas need so much additional datetime funcationality compared to
numpy? How do you make sense of data collected across different timezones?
How do I know if I need a PeriodIndex or a DatetimeIndex? And importantly,
how do I keep track of all of these features?
Do you…
Then come join us for a session on a time-series analysis in pandas!
Pandas originated in the finance industry. This means that as the library has grown, working with datetimes has been remained in the forefront of supported features. But amongst the vast API that pandas offers, it can be difficult to ensure your time-series analyses are running efficiently.
In this episode, we will walk through a complex time-series analysis problems using pandas. We will demonstrate how you can perform scenario analysis in an extensible and flexible manner, and how you can make the best computational use of pandas datetime interfaces.
Keywords
Our goal is to work through a time series problem using pandas.
This has certain implications.
First, this means we are looking at data that is not “evenly sampled,” otherwise we could simply use NumPy. In fact, we may prefer to use NumPy over pandas, given the relative simpliciy of understanding NumPy performance and memory use.
print("Let's take a look!")
from IPython.display import display
from numpy.random import default_rng
rng = default_rng(0)
xs = rng.normal(loc=1, scale=.05, size=100).cumprod().round(4)
display(xs)
This does not mean that we cannot rely on NumPy at all.
For example, if the data itself is the dates, then NumPy may be wholly sufficient for our use.
from IPython.display import display
from numpy import timedelta64, datetime64, median
from numpy.random import default_rng
rng = default_rng(0)
xs = datetime64('2020-01-01') + rng.integers(1, 3, size=10).cumsum().astype(timedelta64)
display(
xs,
# xs.mean(),
(xs[1:] - xs[:-1]).mean(),
)
In fact, except for dealing with timezones, NumPy’s datetime64 and
timedelta64 types are quite flexible and quite useful.
For some use-cases, they may even be superior to panda’s datetime types.
from IPython.display import display
from numpy import datetime64, int64, iinfo, busday_offset
display(
# iinfo(int64).max,
# datetime64(),
# iinfo(int64).min,
# datetime64().astype(int64),
# datetime64(10_000, 'D'),
# datetime64(10_000, 'M'),
# datetime64(10_000, 'Y'),
# datetime64(10_000, '10Y'),
# datetime64(10_000, '[Y/4]'),
busday_offset(datetime64(10_000, 'D'), 20, roll='following'),
# busday_offset(datetime64(10_000, 'D'), 20, roll='preceding'),
# busday_offset(datetime64(10_000, 'D'), 20, roll='modifiedfollowing'),
)
Since we are interested in looking at time series problems in pandas, we necessarily are looking at problems where NumPy is insufficient.
This means problems which…
from IPython.display import display
from pandas import Timestamp, array, Series
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
cal = CustomBusinessDay(calendar=USFederalHolidayCalendar())
display(
# Timestamp('2020-01-01'),
# Timestamp('2020-01-01').asm8,
# Timestamp('2020-01-01').asm8.dtype,
# array([
# Timestamp('2020-01-01'),
# Timestamp('2020-01-02'),
# Timestamp('2020-01-04'),
# ]),
# array([
# Timestamp('2020-01-01'),
# Timestamp('2020-01-02'),
# Timestamp('2020-01-04'),
# ]).tz_localize('US/Eastern'),
Series(
data=0,
index=
array([
Timestamp('2020-01-01'),
Timestamp('2020-01-02'),
Timestamp('2020-01-04'),
]).tz_localize('US/Eastern'),
).resample(cal).mean(),
)
The below data contains daily exchange rates for selected currencies, scraped from the Federal Reserve website.
print("Let's take a look!")
from IPython.display import display
from pathlib import Path
from pandas import read_pickle
data_dir = Path('data')
currencies = {
'AUD': read_pickle(data_dir / 'AUD.pkl').loc['2020'],
'CAD': read_pickle(data_dir / 'CAD.pkl').loc['2020'],
'GBP': read_pickle(data_dir / 'GBP.pkl').loc['2020'],
}
display(
*currencies.values(),
)
The below data contains trades performed by various trading desks on related assets. The pricing information is in the local currency. The dates are in the local timezone.
from IPython.display import display
from pathlib import Path
from pandas import read_csv
data_dir = Path('data')
desks = {'London', 'Los Angeles', 'New York', 'Sydney', 'Toronto',}
trades = {
desk: read_csv(data_dir / f'{desk}.csv')
for desk in sorted(desks)
}
display(
trades,
)
Using the above data, answer the following questions:
from IPython.display import display
from pathlib import Path
from pandas import read_csv, read_pickle, to_datetime
data_dir = Path('data')
desks = {'New York'}
trades = {
desk: read_pickle(data_dir / f'{desk}.pkl')
# desk: read_csv(data_dir / f'{desk}.csv')
for desk in sorted(desks)
}['New York']
display(
# trades.groupby('asset')['volume'].sum()
# trades.set_index(['date', 'asset', 'desk']).sort_index(),
# .dt.tz_localize # take a naïve data → timezone aware
# .dt.tz_convert # convert from one timezone to another
# to_datetime(trades['date'].str[:-6], format='%Y-%m-%d %H:%M:%S').dt.tz_localize('US/Eastern', ambiguous='infer'),
# trades['date'].dt,
# trades.index.get_level_values('date'),
)
from IPython.display import display
from pathlib import Path
from pandas import read_pickle, MultiIndex, concat
data_dir = Path('data')
currencies = {
'AUD': read_pickle(data_dir / 'AUD.pkl').loc['2020'],
'CAD': read_pickle(data_dir / 'CAD.pkl').loc['2020'],
'GBP': read_pickle(data_dir / 'GBP.pkl').loc['2020'],
}
desks = {'London', 'Los Angeles', 'New York', 'Sydney', 'Toronto',}
trades = {
desk: read_pickle(data_dir / f'{desk}.pkl').pipe(
lambda df: df.set_axis(
MultiIndex.from_arrays([
df.index.get_level_values('date').tz_convert('UTC'),
df.index.get_level_values('asset'),
df.index.get_level_values('desk'),
], names=df.index.names)
)
)
for desk in sorted(desks)
}
display(
# concat(trades.values()),
# currencies['GBP'],
currencies['GBP'].asof(
concat(trades.values()).index.get_level_values('date').floor('D').tz_localize(None)
).bfill()
)
Try to discuss:
Timestamp, Timedelta, PeriodIndex, &c.).resample, .asof, .groupby).tz_localize, .tz_convert)from IPython.display import display
from pandas import Timestamp, to_datetime, date_range
from dateutil.parser import parse
from numpy.random import default_rng
rng = default_rng(0)
print(
# Timestamp('2020/01/04 09:04:23'),
# to_datetime(['2020-01-04', '2020/03/05', 'Apr-04 2020']),
# to_datetime(['2020-01-04', '2020-03-05', '2020-04-04'], format='%Y-%m-%d'),
# date_range('2020-01-01', '2020-03-31', freq='14D'),
)
from IPython.display import display
from pandas import Timedelta, to_timedelta, timedelta_range
from numpy.random import default_rng
rng = default_rng(0)
from IPython.display import display
from pandas import IntervalIndex, Series
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
data=rng.integers(-10, +10, size=4),
index=IntervalIndex.from_breaks([0, 10, 20, 50, 100]),
)
display(
s.loc[30:100],
)
from IPython.display import display
from pandas import PeriodIndex, period_range, Series
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
data=rng.integers(-10, +10, size=4),
index=period_range('2020-01', periods=4, freq='M'),
)
print(
s.loc['2020-01-03':'2020-04-03'],
)
from IPython.display import display
from pandas import Series, to_datetime, to_timedelta, Grouper
from numpy.random import default_rng
rng = default_rng(0)
s = Series(
index=(idx :=
to_datetime('2020-01-01') + to_timedelta(rng.integers(12, 48, size=100).cumsum(), unit='H')
),
data=rng.random(size=len(idx)).round(2),
).rename_axis('date').rename('signal')
display(
# s.groupby(Grouper(level='date', freq='D')).mean().head(3).index,
# s.groupby(Grouper(level='date', freq='D')).mean().loc['2020-01-01':'2020-01-01'],
# s.groupby(Grouper(level='date', freq='D')).mean().head(),
# s.groupby(s.index.to_period('D')).mean().head(),
s.resample('D').mean(),
)
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
busday = CustomBusinessDay(calendar=USFederalHolidayCalendar())