我有以下数据:
date, industry, symbol, roc
25-02-2015, Health, abc, 200
25-02-2015, Health, xyz, 150
25-02-2015, Mining, tyr, 45
25-02-2015, Mining, ujk, 70
26-02-2015, Health, abc, 60
26-02-2015, Health, xyz, 310
26-02-2015, Mining, tyr, 65
26-02-2015, Mining, ujk, 23
我需要确定平均'roc',最大'roc',最小'roc‘以及每个date+industry存在多少个符号。换句话说,我需要按日期和行业分组,然后确定各种平均值、最大值/分钟等。
到目前为止,我正在做以下工作,但似乎很慢,效率很低:
sector_df = primary_df.groupby(['date', 'industry'], sort=True).mean()
tmp_max_df = primary_df.groupby(['date', 'industry'], sort=True).max()
tmp_min_df = primary_df.groupby(['date', 'industry'], sort=True).min()
tmp_count_df = primary_df.groupby(['date', 'industry'], sort=True).count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
上面的代码工作,得到了date+industry索引的数据,向我展示了每个date+industry的min/max 'roc‘是多少,以及每个date+industry存在多少符号。
我基本上是做了一个完整的小组多次(以确定平均,最大,最小,计数的‘机器人’)。这很慢,因为它一遍又一遍地做着同样的事情。
有没有办法只做一次。然后对该对象执行均值、最大值等操作,并将结果分配给sector_df?
发布于 2015-06-01 00:42:38
您希望使用agg
执行聚合。
In [72]:
df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
Out[72]:
roc
mean max min count
date industry
2015-02-25 Health 175.0 200 150 2
Mining 57.5 70 45 2
2015-02-26 Health 185.0 310 60 2
Mining 44.0 65 23 2
这允许您传递要执行的函数的可迭代(在本例中为列表)。
编辑
要访问单个结果,需要为每个轴传递一个元组:
In [78]:
gp.loc[('2015-02-25','Health'),('roc','mean')]
Out[78]:
175.0
其中gp = df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
发布于 2015-06-01 00:49:14
您只需将groupby部件保存到一个变量中,如下所示:
primary_df = pd.DataFrame([['25-02-2015', 'Health', 'abc', 200],
['25-02-2015', 'Health', 'xyz', 150],
['25-02-2015', 'Mining', 'tyr', 45],
['25-02-2015', 'Mining', 'ujk', 70],
['26-02-2015', 'Health', 'abc', 60],
['26-02-2015', 'Health', 'xyz', 310],
['26-02-2015', 'Mining', 'tyr', 65],
['26-02-2015', 'Mining', 'ujk', 23]],
columns='date industry symbol roc'.split())
grouped = primary_df.groupby(['date', 'industry'], sort=True)
sector_df = grouped.mean()
tmp_max_df = grouped.max()
tmp_min_df = grouped.min()
tmp_count_df = grouped.count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
https://stackoverflow.com/questions/30568995
复制相似问题