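I am working on a problem that requires me to fill in rows for the missing dates (i.e. the dates between two date columns in a pandas DataFrame). Please see the example below. I am using Pandas for my current approach (mentioned below).
Sample input data (about 25,000 rows):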
A | B | C | Date1 | Date2
a1 | b1 | c1 | 1Jan1990 | 15Aug1990 <- this row should be repeated for all dates between the two dates
.......................
a3 | b3 | c3 | 11May1986 | 11May1986 <- this row should NOT be repeated. Just 1 entry since both dates are same.
.......................
a5 | b5 | c5 | 1Dec1984 | 31Dec2017 <- this row should be repeated for all dates between the two dates
..........................
..........................
Expected output:
A | B | C | Month | Year
a1 | b1 | c1 | 1 | 1990 <- Since date 1 column for this row was Jan 1990
a1 | b1 | c1 | 2 | 1990
.......................
.......................
a1 | b1 | c1 | 7 | 1990
a1 | b1 | c1 | 8 | 1990 <- Since date 2 column for this row was Aug 1990
..........................
a3 | b3 | c3 | 5 | 1986 <- only 1 row since two dates in input dataframe were same for this row.
...........................
a5 | b5 | c5 | 12 | 1984 <- since date 1 column for this row was Dec 1984
a5 | b5 | c5 | 1 | 1985
..........................
..........................
a5 | b5 | c5 | 11 | 2017
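a5 | b5 | c5 | 12 | 2017 <- Since date 2 column for this row was Dec 2017
I know a more conventional way to achieve this (my current approach):
Since the input data has about 25,000 rows, I expect the output to be very large, so I am looking for a more Pythonic way to achieve this (if possible, and faster than the iterative approach)!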
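To get a feel for how large the output would be, the row count can be estimated before expanding anything. This is only a minimal sketch: it assumes the dates parse with the format `%d%b%Y` and that every calendar month touched by a range counts once. The names `start`, `end` and `months_per_row` are illustrative and not part of the original post.

```python
import pandas as pd

# Toy frame mirroring the three sample rows shown in the question.
df = pd.DataFrame(
    [("a1", "b1", "c1", "1Jan1990", "15Aug1990"),
     ("a3", "b3", "c3", "11May1986", "11May1986"),
     ("a5", "b5", "c5", "1Dec1984", "31Dec2017")],
    columns=["A", "B", "C", "Date1", "Date2"],
)

# Assumption: all dates follow the day/abbreviated-month/year pattern.
start = pd.to_datetime(df["Date1"], format="%d%b%Y")
end = pd.to_datetime(df["Date2"], format="%d%b%Y")

# Inclusive number of calendar months covered by each [Date1, Date2] range.
months_per_row = (
    (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month) + 1
)

print(months_per_row.tolist())    # [8, 1, 397]
print(int(months_per_row.sum()))  # 406 output rows for these three inputs
```

Summing the same quantity over the real 25,000 rows would give the size of the expanded frame before any memory is spent building it.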
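Answered 2018-12-14 13:33:36
In my opinion, the best tool to use here is a PeriodIndex (to generate the months and years between the dates).
However, a PeriodIndex can only be built for one row at a time. So if we are going to use a PeriodIndex, each row has to be processed individually. Unfortunately, that means iterating over the rows of the DataFrame: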
import pandas as pd

df = pd.DataFrame([('a1','b1','c1','1Jan1990','15Aug1990'),
                   ('a3','b3','c3','11May1986','11May1986'),
                   ('a5','b5','c5','1Dec1984','31Dec2017')],
                  columns=['A','B','C','Date1','Date2'])
result = []
for tup in df.itertuples():
    # pd.period_range returns a PeriodIndex; the start=/end= arguments of the
    # PeriodIndex constructor itself were removed in later pandas versions.
    index = pd.period_range(start=tup.Date1, end=tup.Date2, freq='M')
    # Repeat the row's values once per month in the range.
    new_df = pd.DataFrame([(tup.A, tup.B, tup.C)] * len(index), index=index)
    new_df['Month'] = new_df.index.month
    new_df['Year'] = new_df.index.year
    result.append(new_df)
result = pd.concat(result, axis=0)
print(result)
yields
0 1 2 Month Year
1990-01 a1 b1 c1 1 1990 <--- Beginning of row 1
1990-02 a1 b1 c1 2 1990
1990-03 a1 b1 c1 3 1990
1990-04 a1 b1 c1 4 1990
1990-05 a1 b1 c1 5 1990
1990-06 a1 b1 c1 6 1990
1990-07 a1 b1 c1 7 1990
1990-08 a1 b1 c1 8 1990 <--- End of row 1
1986-05 a3 b3 c3 5 1986 <--- Beginning and End of row 2
1984-12 a5 b5 c5 12 1984 <--- Beginning row 3
1985-01 a5 b5 c5 1 1985
1985-02 a5 b5 c5 2 1985
1985-03 a5 b5 c5 3 1985
1985-04 a5 b5 c5 4 1985
... .. .. .. ... ...
2017-09 a5 b5 c5 9 2017
2017-10 a5 b5 c5 10 2017
2017-11 a5 b5 c5 11 2017
2017-12 a5 b5 c5 12 2017 <--- End of row 3
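[406 rows x 5 columns]
Note that you may not really need to define the Month and Year columns
new_df['Month'] = new_df.index.month
new_df['Year'] = new_df.index.year
because you already have a PeriodIndex, which makes it very easy to compute the month and year.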
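As a small illustration of that last point, the month and year can be read straight off a PeriodIndex whenever they are needed. This is a sketch using a standalone index for the first sample row's range; the name `idx` is illustrative.

```python
import pandas as pd

# Monthly periods covering 1 Jan 1990 through 15 Aug 1990 (first sample row).
idx = pd.period_range(start="1990-01-01", end="1990-08-15", freq="M")

# The calendar fields are attributes of the index; no extra columns required.
months = idx.month.tolist()
years = idx.year.tolist()

print(months)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(years)   # [1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990]
```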
Answered 2018-12-14 14:55:01
Here is another approach using list comprehensions and numpy.repeat:
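To make the role of numpy.repeat concrete before the full code, here is a toy sketch of the building block: each row of a 2-D array is repeated a different number of times along axis 0. The array contents and counts are made up for illustration.

```python
import numpy as np

# Two "rows" of values and how many times each should appear in the output.
rows = np.array([["a1", "b1"], ["a3", "b3"]])
counts = [3, 1]

# axis=0 repeats whole rows; counts[i] applies to rows[i].
expanded = np.repeat(rows, counts, axis=0)

print(expanded.tolist())
# [['a1', 'b1'], ['a1', 'b1'], ['a1', 'b1'], ['a3', 'b3']]
```

The answer's code applies the same call to `df.values`, with the per-row month counts as the repeat counts.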
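import numpy as np
import pandas as pd

# df is the input DataFrame from the question (columns A, B, C, Date1, Date2).
start = pd.to_datetime(df['Date1'])
end = pd.to_datetime(df['Date2'])
# Inclusive number of calendar months per row. Dividing a timedelta by
# np.timedelta64(1, 'M') is approximate and rejected by newer pandas versions.
repeats = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month) + 1
periods = np.concatenate([pd.period_range(start=d, periods=r, freq='M')
                          for d, r in zip(start, repeats)])
new_df = (pd.DataFrame(
              np.repeat(df.values, repeats, 0),
              columns=df.columns,
              index=periods)
          .assign(month=[x.month for x in periods],
                  year=[x.year for x in periods])
          .drop(['Date1', 'Date2'], axis=1))
print(new_df)
[out]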
A B C month year
1990-01 a1 b1 c1 1 1990
1990-02 a1 b1 c1 2 1990
1990-03 a1 b1 c1 3 1990
1990-04 a1 b1 c1 4 1990
1990-05 a1 b1 c1 5 1990
1990-06 a1 b1 c1 6 1990
1990-07 a1 b1 c1 7 1990
1990-08 a1 b1 c1 8 1990
1986-05 a3 b3 c3 5 1986
1984-12 a5 b5 c5 12 1984
1985-01 a5 b5 c5 1 1985
1985-02 a5 b5 c5 2 1985
1985-03 a5 b5 c5 3 1985
1985-04 a5 b5 c5 4 1985
1985-05 a5 b5 c5 5 1985
1985-06 a5 b5 c5 6 1985
1985-07 a5 b5 c5 7 1985
1985-08 a5 b5 c5 8 1985
1985-09 a5 b5 c5 9 1985
1985-10 a5 b5 c5 10 1985
1985-11 a5 b5 c5 11 1985
1985-12 a5 b5 c5 12 1985
...
Answered 2018-12-14 14:43:58
Given the sample data
df = pd.DataFrame({'Date1': ["1Jan1990", "11May1986", "1Dec1984"],
                   'Date2': ["5Jul1990", "11May1986", "7Apr1985"],
                   'A': ['a1', 'a3', 'a5'],
                   'B': ['b1', 'b3', 'b5'],
                   'C': ['c1', 'c3', 'c5'],})
here is a solution without explicit iteration
# Convert to pandas datetime
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
# Split and stack by dates (drop() no longer accepts a positional axis argument)
df = pd.concat([df.drop(columns='Date2').rename(columns={'Date1': 'Date'}),
                df.drop(columns='Date1').rename(columns={'Date2': 'Date'})])
df = df.drop_duplicates().set_index('Date')
# Break down by dates
df = (df.groupby(['A', 'B', 'C'], as_index=False)
        .resample('M')  # month-end interval; pandas 2.2+ spells this alias 'ME'
        .ffill()  # propagating everything else forward
        .reset_index(level=0, drop=True))  # getting rid of auxiliary index
# Get the year and the month
df['Year'] = df.index.year
df['Month'] = df.index.month
The result is
A B C Year Month
Date
1990-01-31 a1 b1 c1 1990 1
1990-02-28 a1 b1 c1 1990 2
1990-03-31 a1 b1 c1 1990 3
1990-04-30 a1 b1 c1 1990 4
1990-05-31 a1 b1 c1 1990 5
1990-06-30 a1 b1 c1 1990 6
1990-07-31 a1 b1 c1 1990 7
1986-05-31 a3 b3 c3 1986 5
1984-12-31 a5 b5 c5 1984 12
1985-01-31 a5 b5 c5 1985 1
1985-02-28 a5 b5 c5 1985 2
1985-03-31 a5 b5 c5 1985 3
1985-04-30 a5 b5 c5 1985 4
https://stackoverflow.com/questions/53780270