如何在对数据进行采样的同时分割一个持续时间,如下面的示例所示。我可以用例如group_by函数替换for循环吗?
我想用熊猫来转换这样的数据:
activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34这方面:
time started Bedtime videos Commute
2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06
...我走到这一步:
import pandas as pd
df=pd.DataFrame({'activity name':['Bedtime','videos','Commute'],'time started':["2021-10-25 00:00:00","2021-10-25 08:25:42","2021-10-25 08:51:54"],'time ended':["2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34"]})
# converting strings to datetime
df['time ended']=pd.to_datetime(df['time ended'])
df['time started']=pd.to_datetime(df['time started'])
# calclating the duration
df['duration']=df['time ended']-df['time started']
# changeing index
df.index=df['time started']
df=df.drop(columns=['time started','time ended'])
for a in df['activity name'].unique():
df[a]=(df['activity name']==a)*df['duration']
df=df.drop(columns=['activity name','duration'])
df.resample('H').first()time started
2021-10-25 00:00:00 0 days 08:25:42 0 days 00:00:00 0 days
2021-10-25 01:00:00 NaT NaT NaT
2021-10-25 02:00:00 NaT NaT NaT
2021-10-25 03:00:00 NaT NaT NaT
2021-10-25 04:00:00 NaT NaT NaT
2021-10-25 05:00:00 NaT NaT NaT
2021-10-25 06:00:00 NaT NaT NaT
2021-10-25 07:00:00 NaT NaT NaT
2021-10-25 08:00:00 0 days 00:00:00 0 days 00:26:12 0 days发布于 2021-12-16 19:57:27
试试这个:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34"""
)
df = pd.read_csv(txtfile, sep="\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
pd.to_datetime
)
df_e = df.assign(
date=[
pd.date_range(s, e, freq="s")
for s, e in zip(df["time started"], df["time ended"])
]
).explode("date")
df_out = (
df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
.count()
.unstack(0)
.apply(pd.to_timedelta, unit="s")
)
print(df_out)输出:
activity name Bedtime Commute videos
date
2021-10-25 00:00:00 0 days 01:00:00 NaT NaT
2021-10-25 01:00:00 0 days 01:00:00 NaT NaT
2021-10-25 02:00:00 0 days 01:00:00 NaT NaT
2021-10-25 03:00:00 0 days 01:00:00 NaT NaT
2021-10-25 04:00:00 0 days 01:00:00 NaT NaT
2021-10-25 05:00:00 0 days 01:00:00 NaT NaT
2021-10-25 06:00:00 0 days 01:00:00 NaT NaT
2021-10-25 07:00:00 0 days 01:00:00 NaT NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00 NaT 0 days 00:29:35 NaT地址@DerekO注释:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Bedtime 2021-10-25 11:00:00 2021-10-25 13:04:31"""
)
df = pd.read_csv(txtfile, sep="\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
pd.to_datetime
)
df_e = df.assign(
date=[
pd.date_range(s, e, freq="s")
for s, e in zip(df["time started"], df["time ended"])
]
).explode("date")
df_out = (
df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
.count()
.unstack(0)
.apply(pd.to_timedelta, unit="s")
.sort_index()
)
print(df_out)输出:
activity name Bedtime Commute videos
date
2021-10-25 00:00:00 0 days 01:00:00 NaT NaT
2021-10-25 01:00:00 0 days 01:00:00 NaT NaT
2021-10-25 02:00:00 0 days 01:00:00 NaT NaT
2021-10-25 03:00:00 0 days 01:00:00 NaT NaT
2021-10-25 04:00:00 0 days 01:00:00 NaT NaT
2021-10-25 05:00:00 0 days 01:00:00 NaT NaT
2021-10-25 06:00:00 0 days 01:00:00 NaT NaT
2021-10-25 07:00:00 0 days 01:00:00 NaT NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00 NaT 0 days 00:29:35 NaT
2021-10-25 11:00:00 0 days 01:00:00 NaT NaT
2021-10-25 12:00:00 0 days 01:00:00 NaT NaT
2021-10-25 13:00:00 0 days 00:04:32 NaT NaT发布于 2021-12-16 07:08:34
虽然我同意使用groupby和resample是最好的,但我不能让这样的解决方案发挥作用。相反,您可以为原始DataFrame的每一行创建一个新的DataFrame,并将它们连接在一起,从而强行解决这个问题。
它的工作方式是使用pd.date_range在开始和结束时间的地板之间创建一个DatetimeIndex,并且开始和结束时间也被插入到DatetimeIndex中。然后,这个DatetimeIndex中所有日期时间之间的差异是新DataFrame的值。
为了使我的解决方案尽可能健壮,我在原始的DataFrame中添加了两个额外的行,并重复了一个类别,并测试了以下情况:开始时间恰好落在小时上,而不是在小时之前。
import pandas as pd
from pandas._libs.tslibs.timedeltas import Timedelta
df=pd.DataFrame({
'activity name':['Bedtime','videos','Commute','Work','Commute'],
'time started':["2021-10-25 00:00:00","2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34","2021-10-25 17:00:00"],
'time ended':["2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34","2021-10-25 17:00:00","2021-10-25 18:01:00"]})
# converting strings to datetime
df['time ended']=pd.to_datetime(df['time ended'])
df['time started']=pd.to_datetime(df['time started'])
## column names with spaces can't be accessed by name when using iterruples to iterate through the df
df.columns = [col.replace(" ","_") for col in df.columns]启动df:
>>> df
activity_name time_started time_ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Work 2021-10-25 09:29:34 2021-10-25 17:00:00
4 Commute 2021-10-25 17:00:00 2021-10-25 18:01:00## we use the start and end times to determine what daterange we create
start_time = df['time_started'].min().floor('h')
end_time = df['time_started'].max().ceil('h')
## setup an empty DataFrame to hold the final result
new_columns = list(df.activity_name.unique())
df_new = pd.DataFrame(columns=new_columns)
for row in df.itertuples(index=True):
new_row = {}
daterange_start = row.time_started.floor('1h')
daterange_end = row.time_ended.floor('1h')
datetimes_index = pd.date_range(daterange_start, daterange_end, freq='1h')
all_datetimes = datetimes_index.union([row.time_started, row.time_ended])
## take the difference and shift by -1 to drop the first NaT
new_row[row.activity_name] = all_datetimes.to_series().diff().shift(-1)
## if the first row starts in the middle of an hour, we don't want the difference between the beginning of the hour and the time in that row
if (row.Index == 0) & (row.time_started > daterange_start):
df_new = df_new.append(pd.DataFrame(new_row))[1:]
else:
df_new = df_new.append(pd.DataFrame(new_row))
df_new.index.name = 'time_started'
df_new.reset_index(inplace=True)结果:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
9 2021-10-25 08:25:42 NaT NaT NaT NaT
10 2021-10-25 08:00:00 NaT 0 days 00:25:42 NaT NaT
11 2021-10-25 08:25:42 NaT 0 days 00:26:12 NaT NaT
12 2021-10-25 08:51:54 NaT NaT NaT NaT
13 2021-10-25 08:00:00 NaT NaT 0 days 00:51:54 NaT
14 2021-10-25 08:51:54 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
16 2021-10-25 09:29:34 NaT NaT NaT NaT
17 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:29:34
18 2021-10-25 09:29:34 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
26 2021-10-25 17:00:00 NaT NaT NaT NaT
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
29 2021-10-25 18:01:00 NaT NaT NaT NaT对于每个活动,我们创建了一个新的DataFrame,以获得与all_datetimes.to_series().diff().shift(-1)之间的时间差异,这意味着活动中的每个更改之间都存在NaT。这些都是无用的,所以我们将删除所有活动都是NaT的任何行。
然后,我们在time_started列中放置重复的时间戳,并保留这些副本的第一个值,并在time_started列中显示所有时间戳:
df_new = df_new.dropna(subset=new_columns, how='all').drop_duplicates(subset=['time_started'], keep='first')
df_new['time_started'] = df_new['time_started'].apply(lambda x: x.floor('1h'))结果:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
11 2021-10-25 08:00:00 NaT 0 days 00:26:12 NaT NaT
14 2021-10-25 08:00:00 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
18 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT现在我们用pd.Timedelta("0s")填充所有的pd.Timedelta("0s"),然后我们可以根据time_started列中的值进行分组,并将它们加在一起:
df_new = df_new.fillna(pd.Timedelta(0)).groupby("time_started").sum().reset_index()最终结果:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
1 2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
2 2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
3 2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
4 2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
5 2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
6 2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
7 2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
8 2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06 0 days 00:00:00
9 2021-10-25 09:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:29:34 0 days 00:30:26
10 2021-10-25 10:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
11 2021-10-25 11:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
12 2021-10-25 12:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
13 2021-10-25 13:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
14 2021-10-25 14:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
15 2021-10-25 15:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
16 2021-10-25 16:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
17 2021-10-25 17:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00 0 days 00:00:00
18 2021-10-25 18:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:01:00 0 days 00:00:00https://stackoverflow.com/questions/70319624
复制相似问题