Dealing with datetime data is a common task in data analysis and manipulation. Python, along with the powerful pandas library, provides a robust set of tools for handling datetime objects.
OverflowError: value too large?
In pandas, timestamps are represented as nanoseconds since the Unix epoch. At this default resolution, a 64-bit integer can represent a span of approximately 584 years, so conversions with pd.to_datetime() can result in an OverflowError.
In []: pd.to_datetime("1650-09-28")
# OverflowError: value too large
# OutOfBoundsDatetime: Out of bounds nanosecond timestamp:
# 1650-09-28, at position 0
By default, pandas timestamps range from September 21st, 1677 to April 11th, 2262. This limit stems from the underlying NumPy datetime64[ns] representation. It is possible to suppress the error by forcing the conversion, which results in NaT (Not a Time) values.
In [1]: pd.Timestamp.min
Out[1]: Timestamp('1677-09-21 00:12:43.145224193')
In [2]: pd.Timestamp.max
Out[2]: Timestamp('2262-04-11 23:47:16.854775807')
In [3]: pd.to_datetime("1650–09–28", errors="coerce")
Out[3]: NaT
As a workaround, it is possible to use NumPy datetime objects with lower precision, e.g. timestamps with second instead of nanosecond resolution, or the core date and datetime objects from Python's datetime module.
In [1]: import datetime
In [2]: import numpy as np
In [3]: datetime.datetime(1650, 9, 28)
Out[3]: datetime.datetime(1650, 9, 28, 0, 0)
In [4]: datetime.date(1650, 9, 28)
Out[4]: datetime.date(1650, 9, 28)
In [5]: np.datetime64("1650-09-28", "s")
Out[5]: numpy.datetime64('1650-09-28T00:00:00')
# So if this fails with OutOfBoundsDatetime
pd.to_datetime(df["col"])
# try converting through NumPy instead; for out-of-bounds dates
# the result is an object column of datetime64 scalars
df["col"].apply(np.datetime64)
Data type confusion
ValueError: You are trying to merge on datetime64[ns] and object columns. If you wish to proceed you should use pd.concat
No, you probably should not use pd.concat here. This error usually occurs when you are trying to merge dates with timestamps. Grouping by date returns a result whose index holds datetime.date objects (dtype object), which is incompatible with a pandas DatetimeIndex.
In []: df.groupby(df.index.date)['col'].mean().index.dtype
Out[]: dtype('O')
With a pandas DatetimeIndex, resampling is a better way to aggregate hourly observations to daily data. Resampling returns a seemingly weird datatype, <M8[ns] or >M8[ns]. These are machine-specific representations of the general datetime64[ns] type; which one you get depends on the endianness of the machine. This datatype is compatible with pandas timestamps.
In []: df.resample("D")['col'].mean().index.dtype
Out[]: dtype('<M8[ns]')
A Timestamp is a point in time that consists of a date and a time. Pandas timestamps are compatible with datetime from the datetime module and datetime64 from NumPy, but not with date from the datetime module, so any merge between timestamps and dates results in a ValueError.
DT = "2023-09-28"
pd.to_datetime(DT) == datetime.date(2023,9,28) # False
pd.to_datetime(DT) == np.datetime64(DT) # True
pd.to_datetime(DT) == datetime.datetime(2023,9,28,0,0,0,0) # True
# If this fails due to
# ValueError: You are trying to merge on datetime64[ns] and object columns.
pd.merge(df1, df2, left_index=True, right_index=True)
# Check dtype of both indices
print(df1.index.dtype)
print(df2.index.dtype)
# Convert the index with dtype('O') to datetime using either this
df1.index = df1.index.astype('datetime64[ns]')
# or this
df1.index = pd.to_datetime(df1.index)
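Putting it together, a minimal sketch of the failure and the fix, using made-up hourly data:
import pandas as pd
# Hypothetical hourly observations
idx = pd.date_range("2023-09-28", periods=48, freq="h")
df = pd.DataFrame({"col": range(48)}, index=idx)
by_date = df.groupby(df.index.date)["col"].mean()  # object index
by_day = df.resample("D")["col"].mean()            # DatetimeIndex
# Merging these directly raises the ValueError above
by_date.index = pd.to_datetime(by_date.index)      # fix the object index
merged = pd.merge(by_date, by_day, left_index=True, right_index=True)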
Days first?
UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning
Pandas knows that pd.to_datetime("31/12/2023") is probably December 31st, 2023 and presents you with a warning. However, pd.to_datetime("11/12/2023") returns November 12th, 2023 unless dayfirst=True is specified. Pandas defaults to the US convention, but Australia, India, French-speaking parts of Canada, and most of Europe put the day first.
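To make the intent explicit, either pass dayfirst=True or, better, give the exact format:
# Both parse to Timestamp('2023-12-11 00:00:00')
pd.to_datetime("11/12/2023", dayfirst=True)
pd.to_datetime("11/12/2023", format="%d/%m/%Y")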
Daylight saving time
In the U.S., daylight saving time starts on the second Sunday in March and ends on the first Sunday in November. In March, local time jumps from 1:59 a.m. to 3:00 a.m. In November, local time falls back from 1:59 a.m. to 1:00 a.m. If a time series is observed in local time but modelled in pandas without timezone information, a duplicate hour occurs in the fall and a missing hour occurs in the spring. To fix it, use timezone-aware timestamps.
# Localizing a timestamp
timestamp = pd.Timestamp('2023-06-01 12:00:00', tz='America/Los_Angeles')
The hour over which local time jumps in March does not exist, and trying to use it results in an error.
NonExistentTimeError: 2023-03-12 02:00:00
# This timestamp does not exist.
pd.Timestamp("2023-03-12 02:00:00", tz='America/Los_Angeles')
In the fall, the same hour is observed twice in local time, so using it without additional information results in an error.
AmbiguousTimeError: Cannot infer dst time from 2023-11-05 01:00:00, try using the 'ambiguous' argument
# In fall there are two hours at 1 a.m.
pd.Timestamp('2023-11-05 01:00:00', tz='America/Los_Angeles') # Error
# Define which one it is by using fold=0 (first) or fold=1 (second).
pd.Timestamp('2023-11-05 01:00:00', tz='America/Los_Angeles', fold=0) # OK
pd.Timestamp('2023-11-05 01:00:00', tz='America/Los_Angeles', fold=1) # OK
# Or infer from the order of the series
In []: pd.to_datetime(
           pd.Series(['2023-11-05 01:00:00', '2023-11-05 01:00:00'])
       ).dt.tz_localize('America/New_York', ambiguous='infer')
Out[]: 0 2023-11-05 01:00:00-04:00
1 2023-11-05 01:00:00-05:00
dtype: datetime64[ns, America/New_York]
Coordinated Universal Time
Coordinated Universal Time (UTC) is a uniform time standard that does not observe daylight saving time: it remains constant throughout the year and never shifts backward or forward in response to changes in local timekeeping practices. For this reason, many datasets are given in UTC.
When analyzing natural day-night cycles and related phenomena like changes in temperature by time of day or sales at a local coffee shop, using local time instead of UTC often makes more sense. Local time is directly relevant to the local context and is what people and businesses typically use in their daily activities. For instance, a coffee shop’s sales data is most meaningful when analyzed in the context of its operating hours, which are determined by the local time zone.
Furthermore, using local time makes the analysis more interpretable for stakeholders who are familiar with the local time zone. It avoids the need for users to mentally convert UTC times to their local time zone, which can introduce errors and confusion.
# Localizing a timestamp to UTC
timestamp = pd.Timestamp('2023-06-01 12:00:00')
utc_timestamp = timestamp.tz_localize('UTC')
# Converting to local time
ny_timestamp = utc_timestamp.tz_convert('America/New_York')
# With a dataframe
df['utc_timestamp'].dt.tz_convert('America/New_York')
# With a datetime index
df.index.tz_convert('America/New_York')
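A minimal end-to-end sketch with made-up sales timestamps logged in UTC:
import pandas as pd
# Hypothetical sales events recorded in UTC
df = pd.DataFrame({"utc_timestamp": pd.to_datetime(
    ["2023-06-01 12:30:00", "2023-06-01 19:45:00"], utc=True)})
# Convert to the shop's local time for analysis by hour of day
df["local_time"] = df["utc_timestamp"].dt.tz_convert("America/New_York")
df["hour"] = df["local_time"].dt.hour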
Converting from Time to Duration
Sometimes reading an Excel file or similar results in a datetime.time column for duration instead of the desired pd.Timedelta datatype. This often happens when the maximum duration is less than 24 hours. Timedeltas integrate better with pandas and provide useful attributes under the .dt accessor. It is possible to correct this misinterpretation by treating the time as a duration since midnight.
from datetime import datetime
dt_midnight = datetime(2023, 1, 1)
df.Duration.apply(lambda x: datetime.combine(dt_midnight, x) - dt_midnight)
The resulting series is by default of type timedelta64[ns]. Calling .dt.seconds on this series returns the seconds component of each timedelta as an integer, which for durations under 24 hours equals the full duration in whole seconds. Calling .astype(int) is similar but returns nanoseconds as integers.
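If the column holds datetime.time values, a shortcut sketch is to go through strings, since pd.to_timedelta parses "HH:MM:SS" strings; .dt.total_seconds() then returns the full duration as a float:
import datetime
import pandas as pd
# Hypothetical duration column read from Excel as datetime.time
df = pd.DataFrame({"Duration": [datetime.time(1, 30), datetime.time(0, 45)]})
# datetime.time renders as "HH:MM:SS", which pd.to_timedelta can parse
durations = pd.to_timedelta(df["Duration"].astype(str))
durations.dt.total_seconds()  # full duration in seconds as float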