Hey guys.
Being pretty average at Pandas, yesterday I stumbled upon a formatting challenge. I download some datasheets from the web for machine learning from time to time. This time I got some weird Time & date formatting which might’ve been good for regular use with Excel but unsuitable when it comes to Neural Networks:
Time & Date Country ... Consensus Forecast
15 4:00 PM DE ... NaN NaN
16 Tuesday April 02 2019 NaN ... Consensus Forecast
17 7:00 AM EA ... NaN NaN
18 7:00 AM ES ... -33.3K -38.6K
19 8:30 AM GB ... 49.8 49.1
20 9:00 AM CY ... NaN 8.90%
21 9:40 AM RO ... 2.50% 2.50%
22 10:00 AM IE ... NaN 5.50%
23 5:30 PM DE ... NaN NaN
24 Wednesday April 03 2019 NaN ... Consensus Forecast
25 7:15 AM ES ... 55 52.5
26 7:45 AM IT ... 50.8 50.1
27 7:50 AM FR ... 48.7 48.7
28 7:55 AM DE ... 54.9 54.9
29 8:00 AM EA ... 52.7 52.7
30 8:30 AM GB ... 50.9 50.5
31 9:00 AM EA ... 0.20% 0.40%
32 9:00 AM EA ... 2.30% 1.80%
33 11:25 AM PL ... 1.50% 1.50%
34 Thursday April 04 2019 NaN ... Consensus Forecast
35 4:30 AM NL ... NaN 2.60%
36 6:00 AM DE ... 0.30% 0.50%
37 7:30 AM DE ... NaN 54
38 11:30 AM EA ... NaN NaN
39 Friday April 05 2019 NaN ... Consensus Forecast
40 6:00 AM DE ... 0.50% 0.70%
41 6:45 AM FR ... €-4.7B €-4.7B
42 7:30 AM GB ... -2.40% -2.20%
43 7:30 AM GB ... 2.30% 1.50%
44 11:30 AM ES ... NaN 92.5
I have uploaded the dataset to Github, you can find it here:
https://raw.githubusercontent.com/irmscher9/datasets/master/EconomicalCalendar2019-2020-2.csv
You can upload csv to pandas dataframe like this:
df = pd.read_csv('https://raw.githubusercontent.com/irmscher9/datasets/master/EconomicalCalendar2019-2020-2.csv')
So it has dates as headers followed by AM & PM times underneath in the same column. Now we need to convert it to ‘datetime’.
I sat and thought for a bit. I smelled lambda functions right away which I still need to improve my skill with. It was late evening after having a complex supper. So I instantly surrendered and went to StackOverflow to post a question.
Next day I received an answer. And the guy did answer my question with a few clauses that required a fix. There we go:
df['Date'] = df['Time & Date'].apply(lambda x: np.nan if (('AM' in str(x))|('PM' in str(x))) else x).ffill()
(('AM' in str(x))|('PM' in str(x)))
need to be wrapped by str(), otherwise we’ll get an error:
TypeError: argument of type 'float' is not iterable
.ffill() is a Forward fill FYI, and will propagate last valid observation forward.
Since we only have time left in the first column, shall we rename it:
df.rename(columns={'Time & Date': 'Time'}, inplace=True)
It looks like we have quite a few null values, so we need to filter them out:
Print(df[df['Time'].isnull()])
df = df[~df['Time'].isnull()]
So now we have a separate column ‘Date’ with the date and ‘Time’ column with the time. We have to combine those two together into a DateTime column (pandas format). With that said we must get rid of the old Time & Date column slices first:
df = df[df['Time'].str.contains('AM|PM', regex=True)]
And now we are ready to reset index! But don’t forget to pass ‘Drop=True’ argument, otherwise it will keep both old and new index.
df.reset_index(inplace=True, drop=True)
Ok, we are almost done here. Final steps! Let’s concatenate Time and Date columns together, leaving one whitespace between them:
df['Datetime'] = df.Date + ' ' + df.Time
It’s only left to cast Panda’s to_datetime on our new column:
df['Datetime'] = pd.to_datetime(df.Datetime)
And we are done! Congratulations, now we have a proper Date&Time column in our dataframe that we can later use for our Machine Learning needs.
And looks like my first blog post is also finished, thanks for reading. From now on I plan to take notes of my coding adventures, and there’s going to be lots of them in the nearest future. So stay tuned!
Cheers =)
PS: Full piece of code is over here in cace you need it:
import pandas as pd
import numpy as np
df=pd.read_csv('https://raw.githubusercontent.com/irmscher9/datasets/master/EconomicalCalendar2019-2020-2.csv')
print(df[15:45])
df['Date'] = df['Time & Date'].apply(lambda x: np.nan if (('AM' in str(x))|('PM' in str(x))) else x).ffill()
print(df[15:45])
df.rename(columns={'Time & Date': 'Time'}, inplace=True)
print(df[df['Time'].isnull()])
df = df[~df['Time'].isnull()]
df = df[df['Time'].str.contains('AM|PM', regex=True)]
print(df[15:45])
df.reset_index(inplace=True, drop=True)
df['Datetime'] = df.Date + ' ' + df.Time
df['Datetime'] = pd.to_datetime(df.Datetime)
df.drop('Date', axis=1, inplace=True)
print(df[15:45])