---
categories:
- Data Science
date: '2023-11-24 09:19:20'
draft: false
tags:
- pandas
- python
title: Medieval Buzzfeed - Debugging Dodgy Datetimes in Pandas and Parquet
type: posts
---
<!-- wp:paragraph -->
<p>I was recently attempting to cache the results of a long-running SQL query to a local parquet file using pandas, via a workflow like this:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os
import pandas as pd
import sqlalchemy

env = os.environ
engine = sqlalchemy.create_engine(f"mysql+pymysql://{env['SQL_USER']}:{env['SQL_PASSWORD']}@{env['SQL_HOST']}/{env['SQL_DB']}")

# Run the slow query once; the connection is closed when the block exits
with engine.connect() as conn:
    df = pd.read_sql("SELECT * FROM articles", conn)

# Cache the result locally
df.to_parquet("articles.parquet")</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>This ended up yielding the following slightly cryptic error message:</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code>ValueError: Can't infer object conversion type: 0 2023-03-23 11:31:30
1 2023-03-20 09:37:35
2 2023-02-27 10:46:47
3 2023-02-24 10:34:42
4 2023-02-23 08:51:11
...
908601 2023-11-09 14:30:00
908602 2023-11-08 14:30:00
908603 2023-11-07 14:30:00
908604 2023-11-06 14:30:00
908605 2023-11-02 13:30:00
Name: published_at, Length: 908606, dtype: object</code></pre>
<!-- /wp:code -->
<!-- wp:paragraph -->
<p>So obviously there is an issue with my <code>published_at</code> timestamp column. Googling didn't help me very much: lots of people suggested that there might be some <code>nan</code> values in the column, which would stop Pandas from inferring the correct data type before serializing to parquet.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I tried doing <code>df.fillna(0, inplace=True)</code> on my dataframe, hoping that pandas would be able to coerce the value into a zeroed-out Unix epoch, but I noticed I was still getting the issue.</p>
<!-- /wp:paragraph -->
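<!-- wp:paragraph -->
<p>With hindsight, it makes sense that this didn't help. Here's a minimal sketch, using a made-up two-row series rather than my real data, showing that <code>fillna(0)</code> just swaps the missing value for an integer and leaves the column as <code>object</code>:</p>
<!-- /wp:paragraph -->

```python
import datetime
import pandas as pd

# Hypothetical column with one missing timestamp (not the real data)
s = pd.Series([datetime.datetime(2023, 3, 23, 11, 31, 30), None], dtype=object)

# The gap is filled with a plain integer, not a timestamp
filled = s.fillna(0)

print(filled.dtype)
print(filled.tolist())
```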
<!-- wp:paragraph -->
<p>A quick inspection of <code>df.published_at.dtype</code> returned 'O'. That's pandas' catch-all object data type, which it falls back to when it can't settle on anything more specific.</p>
<!-- /wp:paragraph -->
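<!-- wp:paragraph -->
<p>One way to see what is actually hiding inside an <code>object</code> column is to count the Python types of its values. This is a small sketch on a made-up stand-in for <code>published_at</code>, not something I ran against the real table:</p>
<!-- /wp:paragraph -->

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the real published_at column
published_at = pd.Series(
    [
        datetime.datetime(2023, 3, 23, 11, 31, 30),
        datetime.datetime(1201, 11, 1, 12, 0),
    ],
    dtype=object,
    name="published_at",
)

# Count the Python types stored behind the object dtype
type_counts = published_at.map(type).value_counts()

print(published_at.dtype)
print(type_counts)
```

<!-- wp:paragraph -->
<p>If everything turns out to be a plain <code>datetime.datetime</code>, the values themselves are the likely culprit rather than stray strings or missing data.</p>
<!-- /wp:paragraph -->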
<!-- wp:paragraph -->
<p>I tried to force the data type to a date with <code>pd.to_datetime(df.published_at)</code> but I got another error:</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code>OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1201-11-01 12:00:00, at position 154228</code></pre>
<!-- /wp:code -->
<!-- wp:paragraph -->
<p>Sure enough, if I inspect the record at row <code>154228</code>, the datestamp is in the year of our lord 1201. I don't <em>think</em> the article would have been published approximately 780 years before the internet was invented. Aside from the fact that this is obviously wrong, the error essentially tells us that the date is so long ago that pandas can't represent it as a count of nanoseconds before the Unix epoch (1 Jan 1970). Pandas' default nanosecond timestamps are stored as 64-bit integers, and that count overflows for dates earlier than 1677 (or later than 2262).</p>
<!-- /wp:paragraph -->
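<!-- wp:paragraph -->
<p>The arithmetic behind that limit is easy to check. This is a rough sketch that assumes an average year of 365.25 days; <code>NS_PER_YEAR</code> and <code>INT64_MAX</code> are just names I've made up for the illustration:</p>
<!-- /wp:paragraph -->

```python
import pandas as pd

# Nanoseconds in an average (365.25 day) year
NS_PER_YEAR = 10**9 * 60 * 60 * 24 * 365.25
# Largest value a signed 64-bit integer can hold
INT64_MAX = 2**63 - 1

# How many years of nanoseconds fit either side of 1970
span_years = INT64_MAX / NS_PER_YEAR
print(f"about {span_years:.0f} years either side of 1970")

# The bounds pandas reports for nanosecond timestamps
print(pd.Timestamp.min)
print(pd.Timestamp.max)
```

<!-- wp:paragraph -->
<p>A little over 292 years in each direction gets you back to 1677 and forward to 2262, so 1201 never stood a chance.</p>
<!-- /wp:paragraph -->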
<!-- wp:paragraph -->
<p>We now need to do some clean up and make some assumptions about the data. </p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>We can be pretty confident that none of the news articles from before the Unix epoch matter. In this use case I'm actually only interested in news from the last couple of years, so I could probably be even more cutthroat than that. I check how many articles are dated before the epoch:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import datetime

# Build the epoch explicitly: fromtimestamp(0) returns local time, which isn't always midnight on 1 Jan 1970
EPOCH = datetime.datetime(1970, 1, 1)
df[df.published_at < EPOCH]</pre>
<!-- /wp:enlighter/codeblock -->
<!-- wp:paragraph -->
<p>The only result is our article from the dark ages. I'm going to treat the Unix epoch as a sort of <code>nan</code> value and set all articles with dates older than this (thankfully only the one) to have that value:</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">
df.loc[df.published_at < EPOCH, 'published_at'] = EPOCH</pre>
<!-- /wp:enlighter/codeblock -->
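<!-- wp:paragraph -->
<p>A sentinel epoch date does look like a real timestamp, so if that worries you, an alternative I could have used is pandas' own missing-value marker, <code>pd.NaT</code>. This is a sketch on a made-up two-row dataframe, not the approach I actually took:</p>
<!-- /wp:paragraph -->

```python
import datetime
import pandas as pd

EPOCH = datetime.datetime(1970, 1, 1)

# Hypothetical two-row stand-in for the articles table
df = pd.DataFrame({
    "published_at": pd.Series(
        [
            datetime.datetime(1201, 11, 1, 12, 0),
            datetime.datetime(2023, 3, 23, 11, 31, 30),
        ],
        dtype=object,
    )
})

# Mark implausibly old dates as missing instead of clamping them
too_old = df["published_at"] < EPOCH
df.loc[too_old, "published_at"] = pd.NaT

# The remaining values now fit, so the conversion goes through
df["published_at"] = pd.to_datetime(df["published_at"])
print(df)
```

<!-- wp:paragraph -->
<p>The trade-off is that anything reading the data later has to cope with missing dates, whereas the epoch sentinel keeps every row populated.</p>
<!-- /wp:paragraph -->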
<!-- wp:paragraph -->
<p>Now when I re-run my <code>to_datetime</code> conversion it works! We can overwrite the column on our dataframe and write it out to disk!</p>
<!-- /wp:paragraph -->
<!-- wp:enlighter/codeblock {"language":"python"} -->
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df.published_at = pd.to_datetime(df.published_at)
df.to_parquet("test.parquet")</pre>
<!-- /wp:enlighter/codeblock -->
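<!-- wp:paragraph -->
<p>If this crops up again, the clean-up steps could be wrapped in a small function. <code>clamp_and_convert</code> is a hypothetical helper name of my own, and the sample dataframe is made up; treat it as a sketch of the steps in this post rather than a battle-tested utility:</p>
<!-- /wp:paragraph -->

```python
import datetime
import pandas as pd

EPOCH = datetime.datetime(1970, 1, 1)


def clamp_and_convert(df, column, floor=EPOCH):
    """Hypothetical helper: clamp dates older than `floor`, then convert the column."""
    out = df.copy()  # leave the caller's dataframe untouched
    out.loc[out[column] < floor, column] = floor
    out[column] = pd.to_datetime(out[column])
    return out


# Made-up sample standing in for the articles table
articles = pd.DataFrame({
    "published_at": pd.Series(
        [
            datetime.datetime(1201, 11, 1, 12, 0),
            datetime.datetime(2023, 3, 23, 11, 31, 30),
        ],
        dtype=object,
    )
})

cleaned = clamp_and_convert(articles, "published_at")
print(cleaned.dtypes)
print(cleaned)
```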