<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import os</pre>
<p>So obviously there is an issue with my <code>published_at</code> timestamp column. Googling didn't help me very much. Lots of people suggested that <code>NaN</code> values in the column might be stopping pandas from inferring the correct data type before serializing to Parquet.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>I tried doing <code>df.fillna(0, inplace=True)</code> on my dataframe, hoping that pandas would be able to coerce the value into a zeroed-out Unix epoch, but I was still getting the same issue.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>A quick inspection of <code>df.published_at.dtype</code> returned <code>'O'</code>. That's pandas' catch-all "I don't know what this is" object data type.</p>
<!-- /wp:paragraph -->
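<p>To see what is actually hiding in an object column, it can help to count the Python types of the values. Here's a minimal sketch on a made-up series (the values are invented for illustration, not taken from the real data), which also shows why <code>fillna(0)</code> leaves the dtype untouched:</p>

```python
from datetime import datetime

import pandas as pd

# Hypothetical stand-in for df.published_at: a datetime, a string and a missing value.
mixed = pd.Series([datetime(2019, 5, 1, 9, 30), "2019-05-02", None])

# Mixed Python types make pandas fall back to the generic object dtype.
print(mixed.dtype)

# Count which Python types are stored in the column.
type_counts = mixed.map(lambda value: type(value).__name__).value_counts()
print(type_counts)

# Filling the gap with 0 just adds an int to the mix; the dtype stays object.
filled = mixed.fillna(0)
print(filled.dtype)
```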
<!-- wp:paragraph -->
<p>I tried to force the data type to a date with <code>pd.to_datetime(df.published_at)</code> but I got another error:</p>
<!-- /wp:paragraph -->
<!-- wp:code -->
<pre class="wp-block-code"><code>OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1201-11-01 12:00:00, at position 154228</code></pre>
<!-- /wp:code -->
<!-- wp:paragraph -->
<p>Sure enough, if I inspect the record at row <code>154228</code>, the datestamp is in the year of our lord 1201. I don't <em>think</em> the article would have been published approximately 780 years before the internet was invented. Aside from the fact that this is obviously wrong, the error essentially tells us that the date is too long ago to represent: by default pandas stores a timestamp as a signed 64-bit count of nanoseconds relative to the Unix epoch (1 Jan 1970), and that count only reaches back as far as the year 1677 (and forward to 2262).</p>
<!-- /wp:paragraph -->
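<p>The limits fall out of some quick arithmetic. This is a rough sketch rather than anything pandas-internal: it works out how far a signed 64-bit nanosecond counter can stretch either side of the epoch and compares that with the limits pandas reports. The names <code>EPOCH</code>, <code>earliest</code>, <code>latest</code> and <code>suspicious</code> are just mine for illustration.</p>

```python
from datetime import datetime, timedelta

import pandas as pd

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold

# Python datetimes only go down to microseconds, so drop the last three digits.
max_offset = timedelta(microseconds=INT64_MAX // 1000)
earliest = EPOCH - max_offset
latest = EPOCH + max_offset
print(earliest, latest)

# pandas exposes the same nanosecond limits directly.
print(pd.Timestamp.min, pd.Timestamp.max)

# The date from the error message sits well outside that window.
suspicious = datetime(1201, 11, 1, 12, 0)
print(suspicious < earliest)
```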
<!-- wp:paragraph -->
<p>We now need to do some cleanup and make some assumptions about the data.</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>We can be pretty confident that none of the news articles from before the Unix epoch matter. In this use case, I'm actually only interested in news from the last couple of years, so I could probably be even more cutthroat than that. I check how many articles are older than the epoch:</p>
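<p>A check along these lines could look something like the sketch below. It assumes the column holds naive Python <code>datetime</code> objects (which would explain the object dtype); the tiny dataframe and the <code>is_before_epoch</code> helper are made up for illustration.</p>

```python
from datetime import datetime

import pandas as pd

EPOCH = datetime(1970, 1, 1)

# Hypothetical miniature of the articles dataframe, kept as object dtype on purpose.
df = pd.DataFrame({
    "title": ["recent story", "medieval story", "undated story", "another story"],
    "published_at": pd.Series(
        [datetime(2019, 3, 14, 8, 0), datetime(1201, 11, 1, 12, 0), None, datetime(2020, 1, 2, 17, 45)],
        dtype=object,
    ),
})


def is_before_epoch(value):
    """True only for real datetimes that fall before 1 Jan 1970."""
    return isinstance(value, datetime) and value < EPOCH


# Compare value by value in plain Python, so out-of-range dates never hit pd.to_datetime.
too_old = df[df["published_at"].map(is_before_epoch)]
print(len(too_old))
print(too_old)
```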
<p>The only result is our article from the dark ages. I'm going to treat the Unix epoch as a sort of <code>NaN</code> value and set all articles with dates older than this (thankfully only the one) to have that value:</p>
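<p>One way to do that replacement, sketched on a made-up dataframe and again assuming the column holds naive Python <code>datetime</code> objects. The <code>clamp_to_epoch</code> helper is hypothetical. Once the out-of-range value is gone, <code>pd.to_datetime</code> should be able to convert the column:</p>

```python
from datetime import datetime

import pandas as pd

EPOCH = datetime(1970, 1, 1)

# Hypothetical miniature of the articles dataframe, kept as object dtype on purpose.
df = pd.DataFrame({
    "title": ["recent story", "medieval story", "undated story"],
    "published_at": pd.Series(
        [datetime(2019, 3, 14, 8, 0), datetime(1201, 11, 1, 12, 0), None],
        dtype=object,
    ),
})


def clamp_to_epoch(value):
    """Swap any datetime before 1 Jan 1970 for the epoch itself; leave the rest alone."""
    if isinstance(value, datetime) and value < EPOCH:
        return EPOCH
    return value


cleaned = df["published_at"].map(clamp_to_epoch)

# With the impossible date replaced, the column converts to a proper datetime dtype.
df["published_at"] = pd.to_datetime(cleaned)
print(df.dtypes)
print(df)
```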