brainsteam.co.uk/brainsteam/content/posts/2023/11/24/Medieval Buzzfeed - Debuggi...

5.5 KiB

categories date draft preview tags title type url
Data Science
2023-11-24 09:19:20 false /social/28472124cfd8d16c2775f165f45bb9c6759112ee1d541b587d34ec216a5ce15d.png
pandas
python
Medieval Buzzfeed - Debugging Dodgy Datetimes in Pandas and Parquet posts /2023/11/24/medieval-buzzfeed-debugging-dodgy-datetimes-in-pandas-and-parquet/

I was recently attempting to cache the results of a long-running SQL query to a local parquet file using SQL via a workflow like this:

import os
import pandas as pd
import sqlalchemy

env = os.environ

engine = sqlalchemy.create_engine(f"mysql+pymysql://{env['SQL_USER']}:{env['SQL_PASSWORD']}@{env['SQL_HOST']}/{env['SQL_DB']}")

connection = engine.connect()
with engine.connect() as conn:
    df = pd.read_sql("SELECT * FROM articles", connection)


df.to_parquet("articles.parquet")

This ended up yielding the following slightly cryptic error message:

ValueError: Can't infer object conversion type: 0         2023-03-23 11:31:30
1         2023-03-20 09:37:35
2         2023-02-27 10:46:47
3         2023-02-24 10:34:42
4         2023-02-23 08:51:11
                 ...         
908601    2023-11-09 14:30:00
908602    2023-11-08 14:30:00
908603    2023-11-07 14:30:00
908604    2023-11-06 14:30:00
908605    2023-11-02 13:30:00
Name: published_at, Length: 908606, dtype: object

So obviously there is an issue with my published_at timestamp column. Googling didn't help me very much, lots of people suggesting that because there are maybe some nan values in the column, Pandas can't infer the correct data type before serializing to parquet.

I tried doing df.fillna(0, inplace=True) on my dataframe, hoping that pandas would be able to coerce the value into a zeroed out unix epoch but I noticed I was still getting the issue.

A quick inspection of df.published_at.dtype returned 'O'. That's pandas' catchall "I don't know what this is" object data type.

I tried to force the data type to a date with pd.to_datetime(df.published_at) but I got another error :

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1201-11-01 12:00:00, at position 154228

Sure enough if I inspect the record at row 154228 the datestamp is in the year of our lord 1201. I don't /think/ the article would have been published approximately 780 years before the internet was invented. Aside from the fact that this is obviously wrong, the error essentially tells us that the date was so long ago that it's not possible to represent it in terms of how many nanoseconds it was before the unix epoch (1 Jan 1970) without the data structure running out of memory.

We now need to do some clean up and make some assumptions about the data.

We can be pretty confident that none of the news articles from before the unix epoch matter. In this use case, I'm actually only interested in news from the last couple of years so I could probably be even more cut throat than that. I check how many articles are older than that:

import datetime

EPOCH =  datetime.datetime.fromtimestamp(0)

df[df.published_at < EPOCH]

The only result - our article from the dark ages. I'm going to treat the unix epoch as a sort of nan value and set all articles with dates older than this (thankfully only the one) to have that value:

df.loc[df.published_at < EPOCH, 'published_at'] = EPOCH

Now when I re-run my to_datetime conversion it works! We can overwrite the column on our dataframe and write it out to disk!

df.published_at = pd.to_datetime(df.published_at)

df.to_parquet("test.parquet")