Convert to Timestamp (DateTime) pandas time column with dates sliced in-between as headers

Hey guys.

Being pretty average at Pandas, yesterday I stumbled upon a formatting challenge. I download some datasheets from the web for machine learning from time to time. This time I got some weird Time & date formatting which might’ve been good for regular use with Excel but unsuitable when it comes to Neural Networks:

Time & Date                 Country  ...  Consensus  Forecast
15                  4:00 PM      DE  ...        NaN       NaN
16    Tuesday April 02 2019     NaN  ...  Consensus  Forecast
17                  7:00 AM      EA  ...        NaN       NaN
18                  7:00 AM      ES  ...     -33.3K    -38.6K
19                  8:30 AM      GB  ...       49.8      49.1
20                  9:00 AM      CY  ...        NaN     8.90%
21                  9:40 AM      RO  ...      2.50%     2.50%
22                 10:00 AM      IE  ...        NaN     5.50%
23                  5:30 PM      DE  ...        NaN       NaN
24  Wednesday April 03 2019     NaN  ...  Consensus  Forecast
25                  7:15 AM      ES  ...         55      52.5
26                  7:45 AM      IT  ...       50.8      50.1
27                  7:50 AM      FR  ...       48.7      48.7
28                  7:55 AM      DE  ...       54.9      54.9
29                  8:00 AM      EA  ...       52.7      52.7
30                  8:30 AM      GB  ...       50.9      50.5
31                  9:00 AM      EA  ...      0.20%     0.40%
32                  9:00 AM      EA  ...      2.30%     1.80%
33                 11:25 AM      PL  ...      1.50%     1.50%
34   Thursday April 04 2019     NaN  ...  Consensus  Forecast
35                  4:30 AM      NL  ...        NaN     2.60%
36                  6:00 AM      DE  ...      0.30%     0.50%
37                  7:30 AM      DE  ...        NaN        54
38                 11:30 AM      EA  ...        NaN       NaN
39     Friday April 05 2019     NaN  ...  Consensus  Forecast
40                  6:00 AM      DE  ...      0.50%     0.70%
41                  6:45 AM      FR  ...     €-4.7B    €-4.7B
42                  7:30 AM      GB  ...     -2.40%    -2.20%
43                  7:30 AM      GB  ...      2.30%     1.50%
44                 11:30 AM      ES  ...        NaN      92.5

I have uploaded the dataset to Github, you can find it here:

https://raw.githubusercontent.com/irmscher9/datasets/master/EconomicalCalendar2019-2020-2.csv

You can upload csv to pandas dataframe like this:

df = pd.read_csv('https://raw.githubusercontent.com/irmscher9/datasets/master/EconomicalCalendar2019-2020-2.csv')

So it has dates as headers followed by AM & PM times underneath in the same column. Now we need to convert it to ‘datetime’.

I sat and thought for a bit. I smelled lambda functions right away which I still need to improve my skill with. It was late evening after having a complex supper. So I instantly surrendered and went to StackOverflow to post a question.

Next day I received an answer. And the guy did answer my question with a few clauses that required a fix. There we go:

df['Date'] = df['Time & Date'].apply(lambda x: np.nan if (('AM' in str(x))|('PM' in str(x))) else x).ffill()

(('AM' in str(x))|('PM' in str(x))) need to be wrapped by str(), otherwise we’ll get an error:

TypeError: argument of type 'float' is not iterable

.ffill() is a Forward fill FYI, and will propagate last valid observation forward.

Since we only have time left in the first column, shall we rename it:

df.rename(columns={'Time & Date': 'Time'}, inplace=True)

It looks like we have quite a few null values, so we need to filter them out:

Print(df[df['Time'].isnull()])
df = df[~df['Time'].isnull()]

So now we have a separate column ‘Date’ with the date and ‘Time’ column with the time. We have to combine those two together into a DateTime column (pandas format). With that said we must get rid of the old Time & Date column slices first:

df = df[df['Time'].str.contains('AM|PM', regex=True)]

And now we are ready to reset index! But don’t forget to pass ‘Drop=True’ argument, otherwise it will keep both old and new index.

df.reset_index(inplace=True, drop=True)

Ok, we are almost done here. Final steps! Let’s concatenate Time and Date columns together, leaving one whitespace between them:

df['Datetime'] = df.Date + ' ' + df.Time

It’s only left to cast Panda’s to_datetime on our new column:

df['Datetime'] = pd.to_datetime(df.Datetime)

And we are done! Congratulations, now we have a proper Date&Time column in our dataframe that we can later use for our Machine Learning needs.

And looks like my first blog post is also finished, thanks for reading. From now on I plan to take notes of my coding adventures, and there’s going to be lots of them in the nearest future. So stay tuned!

Cheers =)

PS: Full piece of code is over here in cace you need it:

import pandas as pd
import numpy as np


df=pd.read_csv('https://raw.githubusercontent.com/irmscher9/datasets/master/EconomicalCalendar2019-2020-2.csv')

print(df[15:45])

df['Date'] = df['Time & Date'].apply(lambda x: np.nan if (('AM' in str(x))|('PM' in str(x))) else x).ffill()
print(df[15:45])

df.rename(columns={'Time & Date': 'Time'}, inplace=True)

print(df[df['Time'].isnull()])
df = df[~df['Time'].isnull()]

df = df[df['Time'].str.contains('AM|PM', regex=True)]
print(df[15:45])

df.reset_index(inplace=True, drop=True)
df['Datetime'] = df.Date + ' ' + df.Time
df['Datetime'] = pd.to_datetime(df.Datetime)

df.drop('Date', axis=1, inplace=True)
print(df[15:45])