5.4 KiB
categories | date | draft | tags | title | type | |||
---|---|---|---|---|---|---|---|---|
|
2023-11-24 09:19:20 | false |
|
Medieval Buzzfeed - Debugging Dodgy Datetimes in Pandas and Parquet | posts |
I was recently attempting to cache the results of a long-running SQL query to a local parquet file using SQL via a workflow like this:
import os import pandas as pd import sqlalchemy env = os.environ engine = sqlalchemy.create_engine(f"mysql+pymysql://{env['SQL_USER']}:{env['SQL_PASSWORD']}@{env['SQL_HOST']}/{env['SQL_DB']}") connection = engine.connect() with engine.connect() as conn: df = pd.read_sql("SELECT * FROM articles", connection) df.to_parquet("articles.parquet")
This ended up yielding the following slightly cryptic error message:
ValueError: Can't infer object conversion type: 0 2023-03-23 11:31:30
1 2023-03-20 09:37:35
2 2023-02-27 10:46:47
3 2023-02-24 10:34:42
4 2023-02-23 08:51:11
...
908601 2023-11-09 14:30:00
908602 2023-11-08 14:30:00
908603 2023-11-07 14:30:00
908604 2023-11-06 14:30:00
908605 2023-11-02 13:30:00
Name: published_at, Length: 908606, dtype: object
So obviously there is an issue with my published_at
timestamp column. Googling didn't help me very much, lots of people suggesting that because there are maybe some nan
values in the column, Pandas can't infer the correct data type before serializing to parquet.
I tried doing df.fillna(0, inplace=True)
on my dataframe, hoping that pandas would be able to coerce the value into a zeroed out unix epoch but I noticed I was still getting the issue.
A quick inspection of df.published_at.dtype
returned 'O'. That's pandas' catchall "I don't know what this is" object data type.
I tried to force the data type to a date with pd.to_datetime(df.published_at)
but I got another error :
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1201-11-01 12:00:00, at position 154228
Sure enough if I inspect the record at row 154228
the datestamp is in the year of our lord 1201. I don't /think/ the article would have been published approximately 780 years before the internet was invented. Aside from the fact that this is obviously wrong, the error essentially tells us that the date was so long ago that it's not possible to represent it in terms of how many nanoseconds it was before the unix epoch (1 Jan 1970) without the data structure running out of memory.
We now need to do some clean up and make some assumptions about the data.
We can be pretty confident that none of the news articles from before the unix epoch matter. In this use case, I'm actually only interested in news from the last couple of years so I could probably be even more cut throat than that. I check how many articles are older than that:
import datetime EPOCH = datetime.datetime.fromtimestamp(0) df[df.published_at < EPOCH]
The only result - our article from the dark ages. I'm going to treat the unix epoch as a sort of nan
value and set all articles with dates older than this (thankfully only the one) to have that value:
df.loc[df.published_at < EPOCH, 'published_at'] = EPOCH
Now when I re-run my to_datetime
conversion it works! We can overwrite the column on our dataframe and write it out to disk!
df.published_at = pd.to_datetime(df.published_at) df.to_parquet("test.parquet")