ts-python

Seminar X: Deep Dive into a Time-Series Problem

JIMMY APOLLO: “Well, folks, when you’re right 52% of the time, you’re wrong 48% of the time.”

— “Lisa the Greek” (S03E14)

Abstract

Why does pandas need so much additional datetime funcationality compared to numpy? How do you make sense of data collected across different timezones? How do I know if I need a PeriodIndex or a DatetimeIndex? And importantly, how do I keep track of all of these features?

Do you…

… need to unify various timezones in your data?
… need to work around specific dates, such as holidays and weekends, in your analyses?
… need to produce multiple forecasts subject to a variable number of scenarios and constraints?

Then come join us for a session on a time-series analysis in pandas!

Pandas originated in the finance industry. This means that as the library has grown, working with datetimes has been remained in the forefront of supported features. But amongst the vast API that pandas offers, it can be difficult to ensure your time-series analyses are running efficiently.

In this episode, we will walk through a complex time-series analysis problems using pandas. We will demonstrate how you can perform scenario analysis in an extensible and flexible manner, and how you can make the best computational use of pandas datetime interfaces.

Keywords

time-series, pandas, Python, datetime, timezone

Notes

Premise

Our goal is to work through a time series problem using pandas.

This has certain implications.

First, this means we are looking at data that is not “evenly sampled,” otherwise we could simply use NumPy. In fact, we may prefer to use NumPy over pandas, given the relative simpliciy of understanding NumPy performance and memory use.

print("Let's take a look!")

from IPython.display import display
from numpy.random import default_rng

rng = default_rng(0)

xs = rng.normal(loc=1, scale=.05, size=100).cumprod().round(4)

display(xs)

This does not mean that we cannot rely on NumPy at all.

For example, if the data itself is the dates, then NumPy may be wholly sufficient for our use.

from IPython.display import display
from numpy import timedelta64, datetime64, median
from numpy.random import default_rng

rng = default_rng(0)

xs = datetime64('2020-01-01') + rng.integers(1, 3, size=10).cumsum().astype(timedelta64)

display(
    xs,
    # xs.mean(),
    (xs[1:] - xs[:-1]).mean(),
)

In fact, except for dealing with timezones, NumPy’s datetime64 and timedelta64 types are quite flexible and quite useful.

For some use-cases, they may even be superior to panda’s datetime types.

from IPython.display import display
from numpy import datetime64, int64, iinfo, busday_offset

display(
    # iinfo(int64).max,
    # datetime64(),
    # iinfo(int64).min,
    # datetime64().astype(int64),
    # datetime64(10_000, 'D'),
    # datetime64(10_000, 'M'),
    # datetime64(10_000, 'Y'),
    # datetime64(10_000, '10Y'),
    # datetime64(10_000, '[Y/4]'),
    busday_offset(datetime64(10_000, 'D'), 20, roll='following'),
    # busday_offset(datetime64(10_000, 'D'), 20, roll='preceding'),
    # busday_offset(datetime64(10_000, 'D'), 20, roll='modifiedfollowing'),
)

Since we are interested in looking at time series problems in pandas, we necessarily are looking at problems where NumPy is insufficient.

This means problems which…

are willing to work within the constraints of pandas datetime types
where the date information is metadata not data
deal with timezones
need to work with holiday calendars
need some of the conveniences available in pandas

from IPython.display import display
from pandas import Timestamp, array, Series

from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

cal = CustomBusinessDay(calendar=USFederalHolidayCalendar())

display(
    # Timestamp('2020-01-01'),
    # Timestamp('2020-01-01').asm8,
    # Timestamp('2020-01-01').asm8.dtype,

    # array([
    #     Timestamp('2020-01-01'),
    #     Timestamp('2020-01-02'),
    #     Timestamp('2020-01-04'),
    # ]),

    # array([
    #     Timestamp('2020-01-01'),
    #     Timestamp('2020-01-02'),
    #     Timestamp('2020-01-04'),
    # ]).tz_localize('US/Eastern'),

    Series(
        data=0,
        index=
            array([
                Timestamp('2020-01-01'),
                Timestamp('2020-01-02'),
                Timestamp('2020-01-04'),
            ]).tz_localize('US/Eastern'),
    ).resample(cal).mean(),
)

Problem

The below data contains daily exchange rates for selected currencies, scraped from the Federal Reserve website.

print("Let's take a look!")

from IPython.display import display
from pathlib import Path
from pandas import read_pickle

data_dir = Path('data')

currencies = {
    'AUD': read_pickle(data_dir / 'AUD.pkl').loc['2020'],
    'CAD': read_pickle(data_dir / 'CAD.pkl').loc['2020'],
    'GBP': read_pickle(data_dir / 'GBP.pkl').loc['2020'],
}

display(
    *currencies.values(),
)

The below data contains trades performed by various trading desks on related assets. The pricing information is in the local currency. The dates are in the local timezone.

from IPython.display import display
from pathlib import Path
from pandas import read_csv

data_dir = Path('data')

desks = {'London', 'Los Angeles', 'New York', 'Sydney', 'Toronto',}
trades = {
    desk: read_csv(data_dir / f'{desk}.csv')
    for desk in sorted(desks)
}

display(
    trades,
)

Using the above data, answer the following questions:

how many trades were done on each name in each month across all locations?
what was the average time between trades on a specific name?
what was the traded spread per day?

from IPython.display import display
from pathlib import Path
from pandas import read_csv, read_pickle, to_datetime

data_dir = Path('data')

desks = {'New York'}
trades = {
    desk: read_pickle(data_dir / f'{desk}.pkl')
    # desk: read_csv(data_dir / f'{desk}.csv')
    for desk in sorted(desks)
}['New York']

display(
    # trades.groupby('asset')['volume'].sum()
    # trades.set_index(['date', 'asset', 'desk']).sort_index(),

    # .dt.tz_localize # take a naïve data → timezone aware
    # .dt.tz_convert  # convert from one timezone to another

    # to_datetime(trades['date'].str[:-6], format='%Y-%m-%d %H:%M:%S').dt.tz_localize('US/Eastern', ambiguous='infer'),
    # trades['date'].dt,
    # trades.index.get_level_values('date'),
)

from IPython.display import display
from pathlib import Path
from pandas import read_pickle, MultiIndex, concat

data_dir = Path('data')

currencies = {
    'AUD': read_pickle(data_dir / 'AUD.pkl').loc['2020'],
    'CAD': read_pickle(data_dir / 'CAD.pkl').loc['2020'],
    'GBP': read_pickle(data_dir / 'GBP.pkl').loc['2020'],
}

desks = {'London', 'Los Angeles', 'New York', 'Sydney', 'Toronto',}
trades = {
    desk: read_pickle(data_dir / f'{desk}.pkl').pipe(
        lambda df: df.set_axis(
            MultiIndex.from_arrays([
                df.index.get_level_values('date').tz_convert('UTC'),
                df.index.get_level_values('asset'),
                df.index.get_level_values('desk'),
            ], names=df.index.names)
        )
    )
    for desk in sorted(desks)
}

display(
    # concat(trades.values()),
    # currencies['GBP'],
    currencies['GBP'].asof(
        concat(trades.values()).index.get_level_values('date').floor('D').tz_localize(None)
    ).bfill()
)

Try to discuss:

what are the pandas types for working with time series data? (e.g., Timestamp, Timedelta, PeriodIndex, &c.)
what are pandas operations for working with time series data? (e.g., .resample, .asof, .groupby)
what are pandas operations for working with time zone data? (e.g., .tz_localize, .tz_convert)
what are pandas operations for working with holiday calendars?

from IPython.display import display
from pandas import Timestamp, to_datetime, date_range
from dateutil.parser import parse
from numpy.random import default_rng

rng = default_rng(0)

print(
    # Timestamp('2020/01/04 09:04:23'),
    # to_datetime(['2020-01-04', '2020/03/05', 'Apr-04 2020']),
    # to_datetime(['2020-01-04', '2020-03-05', '2020-04-04'], format='%Y-%m-%d'),
    # date_range('2020-01-01', '2020-03-31', freq='14D'),
)

from IPython.display import display
from pandas import Timedelta, to_timedelta, timedelta_range
from numpy.random import default_rng

rng = default_rng(0)

from IPython.display import display
from pandas import IntervalIndex, Series
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    data=rng.integers(-10, +10, size=4),
    index=IntervalIndex.from_breaks([0, 10, 20, 50, 100]),
)

display(
    s.loc[30:100],
)

from IPython.display import display
from pandas import PeriodIndex, period_range, Series
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    data=rng.integers(-10, +10, size=4),
    index=period_range('2020-01', periods=4, freq='M'),
)

print(
    s.loc['2020-01-03':'2020-04-03'],
)

from IPython.display import display
from pandas import Series, to_datetime, to_timedelta, Grouper
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx :=
        to_datetime('2020-01-01') + to_timedelta(rng.integers(12, 48, size=100).cumsum(), unit='H')
    ),
    data=rng.random(size=len(idx)).round(2),
).rename_axis('date').rename('signal')

display(
    # s.groupby(Grouper(level='date', freq='D')).mean().head(3).index,
    # s.groupby(Grouper(level='date', freq='D')).mean().loc['2020-01-01':'2020-01-01'],
    # s.groupby(Grouper(level='date', freq='D')).mean().head(),
    # s.groupby(s.index.to_period('D')).mean().head(),
    s.resample('D').mean(),
)

from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

busday = CustomBusinessDay(calendar=USFederalHolidayCalendar())