ts-python

What Date/Time is It?

Date: Friday, Feb 28, 2025 at 09:30 AM US/Eastern

Dates and datetimes are deceptively tricky in Python and pandas. Whether you’re working with timestamps in financial data, scheduling events, or aligning time series, small mistakes can lead to major errors.

In this seminar, we’ll break down the core datetime implementations in Python and pandas, showing how to parse, manipulate, and analyze date-based data effectively. But more importantly, we’ll explore the hidden pitfalls—handling time zones, ambiguous/nonexistent dates, subtle indexing issues, and more—that can cause silent failures in your analysis.

By the end of this session, you’ll walk away with:

If you’ve ever been burned by a timezone bug or an off-by-one-day error, this seminar is for you!

Notes

python -m pip install numpy pandas pyarrow python-dateutil

Foundational Theory

print("Let's take a look!")

Let’s start from the very beginning. Say we have some measurement that we have captured over time.

We could record this data in a pandas.Series for the purposes of performing analyses.

from pandas import Series
from numpy import tile, arange, repeat
from numpy.random import default_rng

rng = default_rng(0)

entities = ['abc', 'def', 'xyz']

s = Series(
    index=(idx := tile(entities, 3)),
    data=rng.random(size=len(idx)),
).rename_axis('entity')

print(
    s,
    # s.loc['abc'],
    # s.loc['abc'].diff(1),
    s
        .to_frame('value')
        .assign(num=repeat(arange(len(entities)), 3))
        .set_index('num', append=True)
        .unstack('entity')['value']
    ,
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

If we look at how the evolving values are tagged, we see the first indication of a “time” value. We are presenting the various time samples as numerical values 0…2. If the actual measurement was captured at an even frequency, then this numerical value could represent the number of those units since an “epoch.”

Indeed, that’s precisely how much of our timeseries data may be represented.

from numpy import array

timestamps = [
    1_577_836_800_000_000_000,
    1_577_836_801_000_000_000,
    1_577_836_802_000_000_000,
]
xs = array(timestamps).astype('datetime64[ns]')

print(f'{xs = }')

Typically, the “epoch” that is selected is 1970-01-01. The “frequency” can vary based on the desired fidelity of our measurement.

from numpy import array

timestamps = [
    1_577_836_800,
    1_577_836_801,
    1_577_836_802,
]

xs = array(timestamps)

print(
    xs.astype('datetime64[s]'),
    (xs * 1_000).astype('datetime64[ms]'),
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

This choice of units will also affect the maximum value we can represent.

from numpy import array

print(
    array([2 ** (64-1) - 1]).astype('datetime64[ns]'),
    array([2 ** (64-1) - 1]).astype('datetime64[s]'),
    sep='\n',
)

The reason we may want to represent this data in numpy as dtype=datetime64[ns] is that we want convenient datetime operations on it, similar to what is afforded to us in pure Python with the datetime module.

from datetime import datetime, timedelta

x = datetime(2020, 1, 1, 9, 30, 0)

print(
    f'{x                     = }',
    f'{x.year                = }',
    f'{x.weekday()           = }',
    f'{x + timedelta(days=3) = }',
    sep='\n',
)

from numpy import array

x = array(1_577_836_800, dtype='datetime64[s]')[()]
y = array(            3, dtype='timedelta64[D]')[()]

print(
    f'{x     = }',
    f'{x + y = }',
    sep='\n',
)

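There are two general kinds of time that we may capture in our code: “wall clock” time, which tracks the system’s notion of the current date and time and may be adjusted, and “monotonic” time, which only ever moves forward and is suited to measuring elapsed durations.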

These are reflected quite well in our choices of functions in the Python standard library.

from time import time, perf_counter, sleep

before = time() # “wall clock”
sleep(1)
after = time()
assert after > before, 'Time went backwards!'

before = perf_counter() # “monotonic”
sleep(1)
after = perf_counter()
assert after > before, 'Time went backwards!'

But it’s important to note that datetimes are a very special type of measurement, given their intimate relationship to the human legal, social, and political world.

As a consequence of people wanting to wake up at 9:00 according to their local “wall clock” no matter where they live on earth, we have time zones.

These time zones are decided by regulatory bodies. Sometimes a country will have time zones roughly aligned with the longitudes; sometimes a country will have a single time zone, despite spanning many longitudes.

Additionally, these time zones change over time, since the regulatory bodies that govern them change over time. Furthermore, there may be offsets applied to these time zones (e.g., “daylight saving” time) to accomplish various economic or social objectives.

And that isn’t even taking into account that the earth’s orbit around the sun does not take an even 365 days × 24 hours/day × 60 minutes/hour × 60 seconds/minute. Sometimes we have to insert leap days into our calendar to keep it aligned with the seasons, and leap seconds to keep our clocks aligned with the earth’s rotation.

Time zones may seem very complicated, because the political mechanisms behind them are complicated. But, in essence, a time zone is a very simple idea.

Every time we collect a measurement, we want to collect up to two additional pieces of data associated with that measurement: how that measurement is calibrated and the reason that that calibration is selected. Anything less than this is a simplification that is throwing away information.

The calibration is the time zone. The reasoning for that calibration can be, for example, the geographic coördinates of a physical entity or the governing body which regulates a social entity.

Here is how we represent a datetime with a timezone in pure Python.

from datetime import datetime
from zoneinfo import ZoneInfo

# ts = datetime(2020, 1, 1, 9, 30) # timezone-naïve
ts = datetime(2020, 1, 1, 9, 30, tzinfo=ZoneInfo('US/Eastern')) # timezone-aware

print(
    f'{ts                                    = :%Y-%m-%d %H:%M:%S}',
    f'{ts.astimezone(ZoneInfo("US/Eastern")) = :%Y-%m-%d %H:%M:%S}',
    f'{ts.astimezone(ZoneInfo("US/Pacific")) = :%Y-%m-%d %H:%M:%S}',
    sep='\n',
)

Note, of course, that we are providing a timezone and not a timezone offset. You may be familiar with UTC (Coördinated Universal Time), which is often represented by the letter Z or called “Zulu time,” a reference to the nautical time zone at zero offset (GMT).

A consequence is that…

from datetime import datetime
from zoneinfo import ZoneInfo

tss = [
    datetime(2020,  x, 1, tzinfo=ZoneInfo('US/Eastern'))
    for x in range(1, 12+1)
]

for ts in tss:
    print(f'{ts:%a %d %b, %Y (%Z)} ({ts.utcoffset()})')

from datetime import datetime, timedelta
from itertools import groupby, pairwise
from zoneinfo import ZoneInfo

tss = [
    datetime(2020,  1, 1, tzinfo=ZoneInfo('US/Eastern')) + timedelta(days=x, hours=y)
    for x in range(366+1)
    for y in range(24+1)
]

for (x, xs), (y, ys) in pairwise(groupby(tss, lambda ts: ts.utcoffset())):
    print(f'-{-x} → -{-y} on {next(ys):%a %b %d, %Y @ %H:%M (%Z)}')

This is a single, coördinated, global reference point that we can “convert” a local measurement into. Given our timezone and some rules, we can derive a timezone offset which will typically be the offset from this reference point.

Storing UTC time instead of a timezone-aware timestamp is a very common thing to do; it’s a loss of information, but we might argue that we threw away information that might not have been strictly necessary for our use-case.
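To make that trade-off concrete, here is a minimal sketch (the names local and as_utc are illustrative only) showing that converting to UTC preserves the instant but discards the name of the zone in which the measurement was taken.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timezone-aware measurement captured in US/Eastern during daylight saving time.
local = datetime(2020, 7, 1, 9, 30, tzinfo=ZoneInfo('US/Eastern'))

# Convert to the single, global reference point.
as_utc = local.astimezone(timezone.utc)

print(
    f'{local.isoformat()  = }',
    f'{as_utc.isoformat() = }',
    f'{as_utc == local    = }',  # both values describe the same instant
    f'{as_utc.tzinfo      = }',  # but the original zone name is gone
    sep='\n',
)
```

If downstream code needs local wall-clock behaviour again, the zone name has to be supplied from somewhere else.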

In Python

In the time module, the most useful thing is perf_counter or monotonic.

Don’t use time.time for measuring timings; it isn’t monotonic. Represent timestamps using datetime for better human readability.

from time import time, monotonic, perf_counter

print(
    f'{time()         = }',
    f'{monotonic()    = }',
    f'{perf_counter() = }',
    sep='\n',
)
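As a minimal sketch of the advice above, a duration can be measured by differencing two perf_counter readings. The summation is only a stand-in workload chosen for illustration.

```python
from time import perf_counter

start = perf_counter()
total = sum(range(1_000_000))  # stand-in workload to be timed
elapsed = perf_counter() - start

# Only the difference between two readings is meaningful;
# the absolute value of perf_counter() has no defined reference point.
print(f'{total = :,} computed in {elapsed:.6f}s')
```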

In datetime, we have date, datetime, timedelta, and time.

datetime represents the “wall-clock” date and time and, by default, is timezone naïve.

date represents just the date; it can be thought of as a datetime value with day-level fidelity. There is no way to represent a timezone-aware, date-only value in Python. Note that the Python date cannot represent dates before 1 AD.

timedelta represents a fixed delta between dates or datetimes.

time represents a time by itself.

from datetime import date, datetime, timedelta, time

print(
    f'{date(2020, 1, 1)            = }',
    f'{datetime(2020, 1, 1, 9, 30) = }',
    f'{timedelta(days=3)           = }',
    f'{time(9, 30)                 = }',
    sep='\n',
)
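The limits on date mentioned above can be inspected directly. This short sketch prints the supported range and shows what happens when we step outside it; the name err is illustrative.

```python
from datetime import MAXYEAR, MINYEAR, date

print(
    f'{MINYEAR, MAXYEAR = }',
    f'{date.min = }',
    f'{date.max = }',
    sep='\n',
)

err = None
try:
    date(0, 1, 1)  # year 0 is outside the supported range
except ValueError as e:
    err = e
    print(f'{err = }')
```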

We can represent a timezone-aware datetime in Python by using zoneinfo.

from zoneinfo import ZoneInfo
from datetime import datetime

dt = datetime(2020, 1, 1, tzinfo=ZoneInfo('US/Eastern')) # explicit timezone
dt = datetime(2020, 1, 1).astimezone()                   # system-local timezone (fixed offset only)

print(
    f'{dt = }',
)

Determining the timezone (the Olson TZ name) in a portable manner may be tricky. Here is how I may do it using my system’s configuration.

from pathlib import Path

print(
    Path('/etc/localtime').resolve().relative_to('/usr/share/zoneinfo')
)

In NumPy

In NumPy, we have a parameterised datetime64[…] dtype that can be used to store datetime values in an int64 with flexible units, using the Unix Epoch.

from numpy import datetime64

w = datetime64('2020-01-01 09:30:00')
x = datetime64('2020-01-01 09:30:00', 's')
# cast explicitly: the constructor may refuse to parse a time-of-day string into a date unit
y = w.astype('datetime64[D]')
z = w.astype('datetime64[Y]')

print(
    f'{w.astype(int) = :<16,} {w = }',
    f'{x.astype(int) = :<16,} {x = }',
    f'{y.astype(int) = :<16,} {y = }',
    f'{z.astype(int) = :<16,} {z = }',
    sep='\n',
)

from numpy import datetime64

for unit in ['Y', 'W', '4Y', '3M']:
    x = datetime64('2020-01-01 09:30:00').astype(f'datetime64[{unit}]')
    print(f'{x.astype(int) = :<16,} {x = }')

There is a timedelta64[…] type as well.

from numpy import datetime64, timedelta64

x = datetime64('2020-01-01')
for unit in ['s', 'D', '7D', 'W']:
    y = timedelta64('1', unit)
    print(f'{x + y = }')

We can’t really do that much with a NumPy datetime64[ns].

from numpy import array

xs = array(['2020-01-01', '2020-01-02'], dtype='datetime64[s]')
ys = array(['2020-01-01', '2020-01-03'], dtype='datetime64[s]')

print(
    f'{xs = }',
    f'{ys = }',
    f'{xs - ys = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

NumPy does not handle timezones!
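One common workaround, sketched here under the assumption that we are happy to store UTC, is to normalise a timezone-aware value to UTC and drop the tzinfo before handing it to NumPy. The names aware and naive_utc are illustrative.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

from numpy import datetime64

aware = datetime(2020, 1, 1, 9, 30, tzinfo=ZoneInfo('US/Eastern'))

# Normalise to UTC, then drop the tzinfo so that NumPy receives a naïve value.
naive_utc = aware.astimezone(timezone.utc).replace(tzinfo=None)
x = datetime64(naive_utc)

print(
    f'{aware = }',
    f'{x     = }',
    sep='\n',
)
```

The knowledge that these values are UTC now lives only in our heads (or in our documentation), not in the array.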

In pandas

In pandas, we have a Timestamp type to represent a single timestamp. It extends the Python datetime.datetime type.

from datetime import datetime
from pandas import Timestamp

dt = datetime(2020, 1, 1, 9, 30)
ts = Timestamp(2020, 1, 1, 9, 30)

print(
    f'{dt = }',
    f'{ts = }',
    f'{({*dir(ts)} ^ {*dir(dt)}) = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

We can see that it adds only a little bit on top of a datetime.datetime.

from pandas import Timestamp

ts = Timestamp(2020, 1, 1, 9, 30)

print(
    f'{ts = }',
    f'{ts.to_numpy() = }',
    f'{ts.to_period("Y") = }',
    f'{ts.is_month_start = }',
    f'{ts.is_quarter_start = }',
    f'{ts.tz_localize("US/Eastern") = }',
    f'{ts.tz_localize("US/Eastern").tz_convert("US/Pacific") = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

pandas also provides a Timedelta type that extends the datetime.timedelta type.

from pandas import Timedelta

td = Timedelta('3d')

print(
    f'{td = }',
)

When representing single scalar values, where pandas really diverges from what is available in pure Python is the Period type. A Period represents an interval of time.

Here’s an interesting question: is a “date” a Timestamp or a Period?

from pandas import Period, Timestamp

p = Period('2020-01-01')

print(
    f'{p = }',
    f'{p.start_time = }',
    f'{p.end_time   = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

Of course, in pandas, we are most interested in containers such as pandas.array subtypes and pandas.Index subtypes.

The popular pandas.date_range gives us a DatetimeIndex.

from pandas import date_range

idx = date_range('2020-01-01', '2020-01-14')

print(
    # f'{idx = }',
    f'{idx.astype("datetime64[ms]") = }',
    f'{idx.astype("datetime64[s]") = }',
    # f'{idx.astype("datetime64[D]") = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

We also have a PeriodIndex which can be quite useful!

from pandas import period_range, Series, Timestamp
from numpy.random import default_rng

rng = default_rng(0)

idx = period_range('2020-01-01', '2020-01-14', freq='d')
s = Series(index=idx, data=rng.normal(size=len(idx)))

print(
    s,
    # s.loc['2020-01-01'],
    # s.loc['2020-01-01':'2020-01-03'],
    # s.loc['2020-01-01 09:00:00'],
    # s.loc['2020-01-01 09:00:00':'2020-01-03 12:00:00'],
    s.loc[lambda s: Timestamp('2020-01-01 09:00:00').to_period(s.index.freq)],
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

Finally, we have datetime, timedelta, and period array types.

from pandas import array, date_range, period_range, timedelta_range

xs = array(date_range('2020-01-01', periods=3))
ys = array(period_range('2020-01-01', periods=3))
zs = array(timedelta_range('1d', periods=3))

print(
    xs,
    ys,
    zs,
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

pandas.Series has been extended to call methods on the underlying DatetimeArray via the registered .dt accessor. Furthermore, pandas.Series and pandas.DataFrame have been extended to support useful operations involving a DatetimeIndex.

from pandas import Series, date_range
from numpy.random import default_rng

rng = default_rng(0)

s = Series(
    index=(idx := date_range('2020-01-01', freq='h', periods=48)),
    data=rng.random(size=len(idx)),
)

print(
    # s,
    # s.between_time('09:00', '17:00'),
    # s.index.between_time('09:00', '17:00'),
    # s.index.indexer_between_time('09:00', '17:00'),
    # s[s.index.indexer_between_time('09:00', '17:00')],
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

from pandas import merge_asof, Series, date_range, to_timedelta
from numpy.random import default_rng

rng = default_rng(0)

s0 = Series(
    index=(idx := date_range('2020-01-01', freq='h', periods=8, name='timestamp')),
    data=rng.random(size=len(idx)),
    name='s0',
)
s1 = Series(
    index=(idx := s0.index + to_timedelta(rng.uniform(0, 60*60), unit='s')),
    data=rng.random(size=len(idx)),
    name='s1',
)

print(
    s0.head(),
    s1.head(),
    merge_asof(s0, s1, left_index=True, right_index=True),
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

from pandas import Series, date_range

s = Series(
    data=date_range('2020-01-01', periods=4)
)

print(
    s,
    f'{s.dt = }',
    f'{s.dt.year = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

from string import ascii_lowercase

from numpy.random import default_rng
from pandas import Series, Categorical, MultiIndex, to_datetime, to_timedelta

rng = default_rng(0)

s = Series(
    index=(idx := MultiIndex.from_product([
        rng.choice([*ascii_lowercase], size=(3, 4)).view('<U4').ravel(),
        Categorical('available ready active'.split()),
    ], names=['entity', 'state'])),
    data=(
        to_datetime('2020-01-01')
        + to_timedelta(
            rng.integers(14, size=(
                idx.get_level_values('entity').nunique(),
                idx.get_level_values('state').nunique(),
            )).cumsum(-1).ravel(),
            unit='d',
        )
    ),
)

print(
    s.head(),
    s
    .groupby(['entity', 'state'], observed=True).agg(
        lambda g: {
            'available': lambda g: g.head(1),
            'ready':     lambda g: g.tail(1),
            'active':    lambda g: g.tail(1),
        }[g.index.get_level_values('state')[0]](g)
    )
    .groupby('entity', observed=True).agg(
        lambda g: g.droplevel('entity').loc['active'] - g.droplevel('entity').loc['available']
    )
    .mean()
    ,
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

Unfortunately, there are still some gaps in pandas datetime functionality.

from string import ascii_lowercase

from numpy.random import default_rng
from pandas import Series, MultiIndex, date_range

rng = default_rng(0)

s = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', periods=90),
        rng.choice([*ascii_lowercase], size=(3, 4)).view('<U4').ravel(),
    ], names=['timestamp', 'entity'])),
    data=(
        rng.normal(loc=1, scale=0.01, size=(
            idx.get_level_values('timestamp').nunique(),
            idx.get_level_values('entity').nunique(),
        )).cumprod(-1).ravel()
    )
)

print(
    s.head(),
    # s.groupby('entity', observed=True).max(),
    # s.groupby('entity', observed=True).cummax(),
    # s.groupby('entity', observed=True).idxmax(),
    # s.groupby('entity', observed=True).agg(lambda g: g.droplevel('entity').idxmax()),
    # s.groupby('entity', observed=True).cumidxmax(),
    # s.groupby('entity', observed=True).transform(
    #     lambda g: g.expanding().agg(lambda x: x.max())
    # ),
    s.groupby('entity', observed=True).transform(
        lambda g: g.expanding().agg(lambda x: x.idxmax())
    ),
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

In Arrow

Arrow supports many of the same types as pandas, and it also supports separate date and time types. In Arrow, we have a pyarrow.timestamp and pyarrow.duration corresponding to our pandas.Timestamp and pandas.Timedelta.

from pyarrow import timestamp, duration, date64, time64

print(
    f'{timestamp("ms") = }',
    f'{timestamp("ns") = }',
    f'{duration("s") = }',
    f'{duration("ns") = }',
    f'{date64() = }',
    f'{time64("us") = }',
    sep='\n',
)

from pandas import array as pd_array, date_range, period_range, timedelta_range
from pyarrow import array as pa_array

xs = pd_array(
    date_range('2020-01-01', periods=4)
)
ys = pd_array(
    period_range('2020-01-01', periods=4)
)
zs = pd_array(
    timedelta_range('1d', periods=4)
)

print(
    f'{pa_array(xs) = }',
    f'{pa_array(xs).type = }',
    f'{pa_array(ys) = }',
    f'{pa_array(ys).type = }',
    f'{pa_array(zs) = }',
    f'{pa_array(zs).type = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

from pandas import array, date_range, period_range, timedelta_range

xs = array(date_range('2020-01-01', periods=4), dtype='timestamp[s][pyarrow]')
ys = array(timedelta_range('1d', periods=4), dtype='duration[s][pyarrow]')

print(
    xs,
    ys,
    # f'{xs.day = }',
    # f'{xs.astype("datetime64[s]").day = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

from datetime import date, time
from pandas import array

xs = array([date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)], dtype='date64[pyarrow]')
ys = array([time(9, 15), time(9, 30), time(9, 45)], dtype='time64[us][pyarrow]')

print(
    xs,
    ys,
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

In Files

Let’s look at how datetimes operate in common file formats.

Pickle

Pickle should only ever be used to move data from your “left hand” to your “right hand.” In those cases, it’s a great choice, and it can perfectly maintain and represent any arbitrarily complex datetime value.

from datetime import datetime
from pickle import dumps, loads
from zoneinfo import ZoneInfo

dt = datetime(2020, 1, 1, 9, 30, tzinfo=ZoneInfo('US/Eastern'))

print(
    f'{dt = }',
    f'{loads(dumps(dt)) = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

CSV

CSV is an extremely common data format, but it is purely textual. Therefore, common CSV readers and writers need to establish a way to represent datetime values. pandas.Series.to_csv and pandas.read_csv choose ISO-8601/RFC-3339 for representing these.

In pure Python, this looks like the example below. Notice that we lose the timezone and keep only the timezone offset.

from datetime import datetime
from zoneinfo import ZoneInfo

dt = datetime(2020, 1, 1, 9, 30, tzinfo=ZoneInfo('US/Eastern'))

print(
    f'{dt = }',
    f'{dt.isoformat() = }',
    f'{datetime.fromisoformat(dt.isoformat()) = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

This ends up getting us in trouble when using pandas.

from itertools import islice
from pathlib import Path
from string import ascii_lowercase
from tempfile import TemporaryDirectory

from numpy.random import default_rng
from pandas import MultiIndex, Series, date_range, read_csv

rng = default_rng(0)

s = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', '2020-12-31', freq='h'),
        rng.choice([*ascii_lowercase], size=(3, 4)).view('<U4').ravel(),
        ], names=['timestamp', 'entity'],
    )),
    data=rng.normal(size=len(idx)),
).sort_index()

s = s.pipe(lambda s: s
    .set_axis(MultiIndex.from_arrays([
        # s.index.get_level_values('timestamp').tz_localize('US/Eastern'),
        s.index.get_level_values('timestamp').tz_localize('UTC').tz_convert('US/Eastern'),
        s.index.get_level_values('entity'),
    ], names=s.index.names))
)

with TemporaryDirectory() as d:
    d = Path(d)
    s.to_csv(filename := (d / 's.csv'))
    with open(filename) as f:
        for ln in islice(f, 3):
            print(f'{ln = }')

    s = read_csv(
        filename,
        parse_dates=['timestamp'],
        index_col=['timestamp', 'entity'],
    ).squeeze(axis='columns')

    print(
        # s,
        s.index.get_level_values('timestamp'),
        s.pipe(lambda s: s
            .set_axis(
                MultiIndex.from_arrays([
                    [x.tz_convert('US/Eastern') for x in s.index.get_level_values('timestamp')],
                    s.index.get_level_values('entity'),
                ], names=s.index.names)
            )
        ),
        sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
    )

Parquet & Feather

from itertools import islice
from pathlib import Path
from string import ascii_lowercase
from tempfile import TemporaryDirectory

from numpy.random import default_rng
from pandas import MultiIndex, Series, date_range, read_feather, read_parquet

rng = default_rng(0)

s = Series(
    index=(idx := MultiIndex.from_product([
        date_range('2020-01-01', '2020-12-31', freq='h'),
        rng.choice([*ascii_lowercase], size=(3, 4)).view('<U4').ravel(),
        ], names=['timestamp', 'entity'],
    )),
    data=rng.normal(size=len(idx)),
).sort_index()

s = s.pipe(lambda s: s
    .set_axis(MultiIndex.from_arrays([
        s.index.get_level_values('timestamp').tz_localize('UTC').tz_convert('US/Eastern'),
        s.index.get_level_values('entity'),
    ], names=s.index.names))
)

with TemporaryDirectory() as d:
    d = Path(d)

    s.to_frame().to_feather(filename := (d / 's.feather'))
    s = read_feather(
        filename,
    ).squeeze(axis='columns')

    print(
        s,
        s.index.get_level_values('timestamp'),
        sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
    )

    s.to_frame().to_parquet(filename := (d / 's.parquet'))
    s = read_parquet(
        filename,
    ).squeeze(axis='columns')

    print(
        s,
        s.index.get_level_values('timestamp'),
        sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
    )

In Our Lives

pytz

Notice that all previous examples avoided use of pytz. Generally, with zoneinfo in the Python standard library and with timezone functionality accessed via pandas .tz_convert and .tz_localize, there is not a strong reason to use pytz.
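As a sketch of what these two tools cover without pytz, the example below handles an ambiguous and a nonexistent wall-clock time in US/Eastern. The dates are the 2020 daylight saving transitions; the variable names are illustrative.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

from pandas import Timestamp

tz = ZoneInfo('US/Eastern')

# 01:30 on 2020-11-01 happens twice in US/Eastern; `fold` selects which one.
first  = datetime(2020, 11, 1, 1, 30, tzinfo=tz)          # fold=0: before clocks go back
second = datetime(2020, 11, 1, 1, 30, fold=1, tzinfo=tz)  # fold=1: after clocks go back

# pandas asks us to state our intent rather than guessing.
ambiguous = Timestamp('2020-11-01 01:30').tz_localize(
    'US/Eastern', ambiguous=True,  # True: treat as daylight saving time
)
nonexistent = Timestamp('2020-03-08 02:30').tz_localize(
    'US/Eastern', nonexistent='shift_forward',  # 02:30 was skipped that day
)

print(
    f'{first.utcoffset()  = !s}',
    f'{second.utcoffset() = !s}',
    f'{ambiguous   = }',
    f'{nonexistent = }',
    sep='\n',
)
```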

Nanoseconds or Microseconds

Especially with raw numerical datetime values, be aware of the distinction between microsecond and nanosecond precision.

from datetime import datetime
from pandas import Timestamp

# microsecond precision
dt = datetime(2020, 1, 1, microsecond=1)
# nanosecond precision
ts = Timestamp(2020, 1, 1, nanosecond=1)

print(
    f'{dt = }',
    f'{ts = }',
    f'{Timestamp(ts.to_pydatetime()) == ts = }',
    sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
)

from pathlib import Path
from tempfile import TemporaryDirectory

from pandas import Series, date_range, to_timedelta, read_parquet

s0 = Series(date_range('2020-01-01', periods=4) + to_timedelta('1ns'), name='s').astype('datetime64[ns]')

with TemporaryDirectory() as d:
    d = Path(d)

    s0.to_frame().to_parquet(filename := (d / 's.parquet'))
    s1 = read_parquet(filename).squeeze('columns')

    print(
        s0,
        s1,
        s0 == s1,
        sep=f'\n{"\N{box drawings light horizontal}"*40}\n',
    )
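The same caution applies to raw integers. In this small sketch, one integer (an illustrative value holding microseconds since the Unix epoch) yields very different timestamps depending on the unit we claim for it.

```python
from pandas import Timestamp, to_datetime

raw = 1_577_836_800_000_000  # microseconds since the Unix epoch

as_us = to_datetime(raw, unit='us')  # interpreted correctly
as_ns = to_datetime(raw, unit='ns')  # interpreted as nanoseconds by mistake

print(
    f'{as_us = }',
    f'{as_ns = }',
    sep='\n',
)
```

Nothing fails in the second case; the result simply lands in January 1970.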

Unusual Timezones

We may see an unusual timezone “GMT+2:00” referenced as a (legacy) global timezone. Unfortunately, “GMT+2:00” does not exist as a database identifier and is not guaranteed to be understood by Python tooling. Confusingly, we may have to use “Etc/GMT-2” to represent this in our code, because the Etc/ identifiers follow the POSIX convention in which the sign is inverted.

from string import ascii_lowercase

from numpy.random import default_rng
from pandas import Series, MultiIndex, date_range

rng = default_rng(0)

s = Series(
    index=(idx := MultiIndex.from_product([
        # date_range('2020-01-01', periods=3).tz_localize('GMT+2:00'), # “global timezone”
        date_range('2020-01-01', periods=3).tz_localize('Etc/GMT-2'), # “global timezone”
        rng.choice([*ascii_lowercase], size=(3, 4)).view('<U4').ravel(),
    ], names='timestamp entity'.split())),
    data=rng.normal(size=len(idx)),
)

print(
    s,
    f'{s.index.get_level_values("timestamp")[0].utcoffset() = !s}',
)
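The inverted sign can be checked in pure Python as well. This sketch compares the database identifier against a fixed-offset timezone built by hand; the label passed to timezone is only a display name chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# “Etc/GMT-2” is two hours *ahead* of UTC, despite the minus sign.
ts = datetime(2020, 1, 1, 12, tzinfo=ZoneInfo('Etc/GMT-2'))

# A fixed offset carrying the legacy label as its display name.
fixed = datetime(2020, 1, 1, 12, tzinfo=timezone(timedelta(hours=2), 'GMT+2:00'))

print(
    f'{ts.utcoffset()    = !s}',
    f'{fixed.utcoffset() = !s}',
    f'{ts == fixed       = }',
    sep='\n',
)
```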

Getting a Date

Sometimes we want to represent dates, independent of times.

There are approximately five ways to do this in pandas.

Which do we choose?

from datetime import date

from numpy.random import default_rng
from pandas import Series, date_range, period_range

rng = default_rng(0)

s = Series(
    # index=(idx := [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)]),
    # index=(idx := period_range('2020-01-01', periods=3)),
    # index=(idx := date_range('2020-01-01', periods=3)),
    # index=(idx := date_range('2020-01-01', periods=3).tz_localize('UTC')),
    index=(idx := date_range('2020-01-01', periods=3).tz_localize('US/Eastern')),
    data=rng.normal(size=len(idx)),
)

print(
    s,
)