首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >在对数据进行过采样时拆分持续时间

在对数据进行过采样时拆分持续时间
EN

Stack Overflow用户
提问于 2021-12-11 22:48:11
回答 2查看 76关注 0票数 3

如何在对数据进行采样的同时分割一个持续时间,如下面的示例所示。我可以用例如group_by函数替换for循环吗?

我想用熊猫来转换这样的数据:

代码语言:javascript
复制
  activity name         time started           time ended
0       Bedtime  2021-10-25 00:00:00  2021-10-25 08:25:42
1        videos  2021-10-25 08:25:42  2021-10-25 08:51:54
2       Commute  2021-10-25 08:51:54  2021-10-25 09:29:34

这方面:

代码语言:javascript
复制
time started        Bedtime         videos           Commute                   
2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12  0 days 00:08:06
...

我走到这一步:

代码语言:javascript
复制
import pandas as pd
df=pd.DataFrame({'activity name':['Bedtime','videos','Commute'],'time started':["2021-10-25 00:00:00","2021-10-25 08:25:42","2021-10-25 08:51:54"],'time ended':["2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34"]})
# converting strings to datetime
df['time ended']=pd.to_datetime(df['time ended'])
df['time started']=pd.to_datetime(df['time started'])

# calclating the duration
df['duration']=df['time ended']-df['time started']

# changeing index
df.index=df['time started']
df=df.drop(columns=['time started','time ended'])

for a in df['activity name'].unique():
    df[a]=(df['activity name']==a)*df['duration']

df=df.drop(columns=['activity name','duration'])
df.resample('H').first()
代码语言:javascript
复制
time started                                               
2021-10-25 00:00:00 0 days 08:25:42 0 days 00:00:00  0 days
2021-10-25 01:00:00             NaT             NaT     NaT
2021-10-25 02:00:00             NaT             NaT     NaT
2021-10-25 03:00:00             NaT             NaT     NaT
2021-10-25 04:00:00             NaT             NaT     NaT
2021-10-25 05:00:00             NaT             NaT     NaT
2021-10-25 06:00:00             NaT             NaT     NaT
2021-10-25 07:00:00             NaT             NaT     NaT
2021-10-25 08:00:00 0 days 00:00:00 0 days 00:26:12  0 days
EN

回答 2

Stack Overflow用户

发布于 2021-12-16 19:57:27

试试这个:

代码语言:javascript
复制
import pandas as pd
from io import StringIO

txtfile = StringIO(
    """  activity name         time started           time ended
0       Bedtime  2021-10-25 00:00:00  2021-10-25 08:25:42
1        videos  2021-10-25 08:25:42  2021-10-25 08:51:54
2       Commute  2021-10-25 08:51:54  2021-10-25 09:29:34"""
)

df = pd.read_csv(txtfile, sep="\s\s+", engine="python")

df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
    pd.to_datetime
)
df_e = df.assign(
    date=[
        pd.date_range(s, e, freq="s")
        for s, e in zip(df["time started"], df["time ended"])
    ]
).explode("date")

df_out = (
    df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
    .count()
    .unstack(0)
    .apply(pd.to_timedelta, unit="s")
)

print(df_out)

输出:

代码语言:javascript
复制
activity name               Bedtime         Commute          videos
date                                                               
2021-10-25 00:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 01:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 02:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 03:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 04:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 05:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 06:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 07:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00             NaT 0 days 00:29:35             NaT

地址@DerekO注释:

代码语言:javascript
复制
import pandas as pd
from io import StringIO

txtfile = StringIO(
    """  activity name         time started           time ended
0       Bedtime  2021-10-25 00:00:00  2021-10-25 08:25:42
1        videos  2021-10-25 08:25:42  2021-10-25 08:51:54
2       Commute  2021-10-25 08:51:54  2021-10-25 09:29:34
3       Bedtime  2021-10-25 11:00:00  2021-10-25 13:04:31"""
)

df = pd.read_csv(txtfile, sep="\s\s+", engine="python")

df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
    pd.to_datetime
)
df_e = df.assign(
    date=[
        pd.date_range(s, e, freq="s")
        for s, e in zip(df["time started"], df["time ended"])
    ]
).explode("date")

df_out = (
    df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
    .count()
    .unstack(0)
    .apply(pd.to_timedelta, unit="s")
    .sort_index()
)

print(df_out)

输出:

代码语言:javascript
复制
activity name               Bedtime         Commute          videos
date                                                               
2021-10-25 00:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 01:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 02:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 03:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 04:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 05:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 06:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 07:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00             NaT 0 days 00:29:35             NaT
2021-10-25 11:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 12:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 13:00:00 0 days 00:04:32             NaT             NaT
票数 1
EN

Stack Overflow用户

发布于 2021-12-16 07:08:34

虽然我同意使用groupbyresample是最好的,但我不能让这样的解决方案发挥作用。相反,您可以为原始DataFrame的每一行创建一个新的DataFrame,并将它们连接在一起,从而强行解决这个问题。

它的工作方式是使用pd.date_range在开始和结束时间的地板之间创建一个DatetimeIndex,并且开始和结束时间也被插入到DatetimeIndex中。然后,这个DatetimeIndex中所有日期时间之间的差异是新DataFrame的值。

为了使我的解决方案尽可能健壮,我在原始的DataFrame中添加了两个额外的行,并重复了一个类别,并测试了以下情况:开始时间恰好落在小时上,而不是在小时之前。

代码语言:javascript
复制
import pandas as pd
from pandas._libs.tslibs.timedeltas import Timedelta

df=pd.DataFrame({
    'activity name':['Bedtime','videos','Commute','Work','Commute'],
    'time started':["2021-10-25 00:00:00","2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34","2021-10-25 17:00:00"],
    'time ended':["2021-10-25 08:25:42","2021-10-25 08:51:54","2021-10-25 09:29:34","2021-10-25 17:00:00","2021-10-25 18:01:00"]})

# converting strings to datetime
df['time ended']=pd.to_datetime(df['time ended'])
df['time started']=pd.to_datetime(df['time started'])

## column names with spaces can't be accessed by name when using iterruples to iterate through the df
df.columns = [col.replace(" ","_") for col in df.columns]

启动df:

代码语言:javascript
复制
>>> df
  activity_name        time_started          time_ended
0       Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1        videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2       Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3          Work 2021-10-25 09:29:34 2021-10-25 17:00:00
4       Commute 2021-10-25 17:00:00 2021-10-25 18:01:00

代码语言:javascript
复制
## we use the start and end times to determine what daterange we create
start_time = df['time_started'].min().floor('h')
end_time = df['time_started'].max().ceil('h')

## setup an empty DataFrame to hold the final result
new_columns = list(df.activity_name.unique())
df_new = pd.DataFrame(columns=new_columns)

for row in df.itertuples(index=True):
    new_row = {}
    daterange_start = row.time_started.floor('1h')
    daterange_end = row.time_ended.floor('1h')
    datetimes_index = pd.date_range(daterange_start, daterange_end, freq='1h')

    all_datetimes = datetimes_index.union([row.time_started, row.time_ended])

    ## take the difference and shift by -1 to drop the first NaT
    new_row[row.activity_name] = all_datetimes.to_series().diff().shift(-1)
    
    ## if the first row starts in the middle of an hour, we don't want the difference between the beginning of the hour and the time in that row
    if (row.Index == 0) & (row.time_started > daterange_start):
        df_new = df_new.append(pd.DataFrame(new_row))[1:]
    else:
        df_new = df_new.append(pd.DataFrame(new_row))

df_new.index.name = 'time_started'
df_new.reset_index(inplace=True)

结果:

代码语言:javascript
复制
>>> df_new
          time_started         Bedtime          videos         Commute            Work
0  2021-10-25 00:00:00 0 days 01:00:00             NaT             NaT             NaT
1  2021-10-25 01:00:00 0 days 01:00:00             NaT             NaT             NaT
2  2021-10-25 02:00:00 0 days 01:00:00             NaT             NaT             NaT
3  2021-10-25 03:00:00 0 days 01:00:00             NaT             NaT             NaT
4  2021-10-25 04:00:00 0 days 01:00:00             NaT             NaT             NaT
5  2021-10-25 05:00:00 0 days 01:00:00             NaT             NaT             NaT
6  2021-10-25 06:00:00 0 days 01:00:00             NaT             NaT             NaT
7  2021-10-25 07:00:00 0 days 01:00:00             NaT             NaT             NaT
8  2021-10-25 08:00:00 0 days 00:25:42             NaT             NaT             NaT
9  2021-10-25 08:25:42             NaT             NaT             NaT             NaT
10 2021-10-25 08:00:00             NaT 0 days 00:25:42             NaT             NaT
11 2021-10-25 08:25:42             NaT 0 days 00:26:12             NaT             NaT
12 2021-10-25 08:51:54             NaT             NaT             NaT             NaT
13 2021-10-25 08:00:00             NaT             NaT 0 days 00:51:54             NaT
14 2021-10-25 08:51:54             NaT             NaT 0 days 00:08:06             NaT
15 2021-10-25 09:00:00             NaT             NaT 0 days 00:29:34             NaT
16 2021-10-25 09:29:34             NaT             NaT             NaT             NaT
17 2021-10-25 09:00:00             NaT             NaT             NaT 0 days 00:29:34
18 2021-10-25 09:29:34             NaT             NaT             NaT 0 days 00:30:26
19 2021-10-25 10:00:00             NaT             NaT             NaT 0 days 01:00:00
20 2021-10-25 11:00:00             NaT             NaT             NaT 0 days 01:00:00
21 2021-10-25 12:00:00             NaT             NaT             NaT 0 days 01:00:00
22 2021-10-25 13:00:00             NaT             NaT             NaT 0 days 01:00:00
23 2021-10-25 14:00:00             NaT             NaT             NaT 0 days 01:00:00
24 2021-10-25 15:00:00             NaT             NaT             NaT 0 days 01:00:00
25 2021-10-25 16:00:00             NaT             NaT             NaT 0 days 01:00:00
26 2021-10-25 17:00:00             NaT             NaT             NaT             NaT
27 2021-10-25 17:00:00             NaT             NaT 0 days 01:00:00             NaT
28 2021-10-25 18:00:00             NaT             NaT 0 days 00:01:00             NaT
29 2021-10-25 18:01:00             NaT             NaT             NaT             NaT

对于每个活动,我们创建了一个新的DataFrame,以获得与all_datetimes.to_series().diff().shift(-1)之间的时间差异,这意味着活动中的每个更改之间都存在NaT。这些都是无用的,所以我们将删除所有活动都是NaT的任何行。

然后,我们在time_started列中放置重复的时间戳,并保留这些副本的第一个值,并在time_started列中显示所有时间戳:

代码语言:javascript
复制
df_new = df_new.dropna(subset=new_columns, how='all').drop_duplicates(subset=['time_started'], keep='first')
df_new['time_started'] = df_new['time_started'].apply(lambda x: x.floor('1h'))

结果:

代码语言:javascript
复制
>>> df_new
          time_started         Bedtime          videos         Commute            Work
0  2021-10-25 00:00:00 0 days 01:00:00             NaT             NaT             NaT
1  2021-10-25 01:00:00 0 days 01:00:00             NaT             NaT             NaT
2  2021-10-25 02:00:00 0 days 01:00:00             NaT             NaT             NaT
3  2021-10-25 03:00:00 0 days 01:00:00             NaT             NaT             NaT
4  2021-10-25 04:00:00 0 days 01:00:00             NaT             NaT             NaT
5  2021-10-25 05:00:00 0 days 01:00:00             NaT             NaT             NaT
6  2021-10-25 06:00:00 0 days 01:00:00             NaT             NaT             NaT
7  2021-10-25 07:00:00 0 days 01:00:00             NaT             NaT             NaT
8  2021-10-25 08:00:00 0 days 00:25:42             NaT             NaT             NaT
11 2021-10-25 08:00:00             NaT 0 days 00:26:12             NaT             NaT
14 2021-10-25 08:00:00             NaT             NaT 0 days 00:08:06             NaT
15 2021-10-25 09:00:00             NaT             NaT 0 days 00:29:34             NaT
18 2021-10-25 09:00:00             NaT             NaT             NaT 0 days 00:30:26
19 2021-10-25 10:00:00             NaT             NaT             NaT 0 days 01:00:00
20 2021-10-25 11:00:00             NaT             NaT             NaT 0 days 01:00:00
21 2021-10-25 12:00:00             NaT             NaT             NaT 0 days 01:00:00
22 2021-10-25 13:00:00             NaT             NaT             NaT 0 days 01:00:00
23 2021-10-25 14:00:00             NaT             NaT             NaT 0 days 01:00:00
24 2021-10-25 15:00:00             NaT             NaT             NaT 0 days 01:00:00
25 2021-10-25 16:00:00             NaT             NaT             NaT 0 days 01:00:00
27 2021-10-25 17:00:00             NaT             NaT 0 days 01:00:00             NaT
28 2021-10-25 18:00:00             NaT             NaT 0 days 00:01:00             NaT

现在我们用pd.Timedelta("0s")填充所有的pd.Timedelta("0s"),然后我们可以根据time_started列中的值进行分组,并将它们加在一起:

代码语言:javascript
复制
df_new = df_new.fillna(pd.Timedelta(0)).groupby("time_started").sum().reset_index()

最终结果:

代码语言:javascript
复制
>>> df_new
          time_started         Bedtime          videos         Commute            Work
0  2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
1  2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
2  2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
3  2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
4  2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
5  2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
6  2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
7  2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
8  2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06 0 days 00:00:00
9  2021-10-25 09:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:29:34 0 days 00:30:26
10 2021-10-25 10:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
11 2021-10-25 11:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
12 2021-10-25 12:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
13 2021-10-25 13:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
14 2021-10-25 14:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
15 2021-10-25 15:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
16 2021-10-25 16:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
17 2021-10-25 17:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00 0 days 00:00:00
18 2021-10-25 18:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:01:00 0 days 00:00:00
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/70319624

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档