Q: Pandas: counting the elements of a group, excluding the focal element

Stack Overflow user

Asked 2021-09-19 19:10:23

Answers: 1 | Views: 183 | Followers: 0 | Votes: 5

My DataFrame looks like this:

company   tool   category         year  month
Amazon    A      productivity     2014     9
Amazon    B      productivity     2014     8
Apple     A      productivity     2014     6
Apple     C      CRM              2015     4 
Apple     D      CRM              2015     3
Google    C      CRM              2015     6
Google    E      HR               2014     9 
Google    F      productivity     2014     11
Google    G      productivity     2014     12

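The first column is the company that bought the tool, tool is the tool's name, category is what the tool is used for, and year and month give the purchase date.

For each tool, I would like to build the following data: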

tool   monthlydate    cumulative_sales no_companies_comp year month
A      2014/06              1                 0          2014  6
A      2014/07              1                 0          2014  7
A      2014/08              1                 1          2014  8
A      2014/09              2                 1          2014  9
A      2014/10              2                 1          2014  10
A      2014/11              2                 2          2014  11
A      2014/12              2                 2          2014  12

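Here cumulative_sales is the cumulative number of sales of the tool in question up to the given year and month, and no_companies_comp is the cumulative number of companies that have bought a competing tool up to the given year and month (note that a company may have bought several competing tools, but only its first purchase counts, because we are interested in the number of companies). How can I do this?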

python

pandas

1 Answer

Stack Overflow user

Answered 2021-09-25 13:16:19

With a simple groupby we can get, for each tool and month, the number of sales and the number of buying companies:

>>> sales = df.groupby(['tool', 'year', 'month']).size()
>>> sales
tool  year  month
A     2014  6        1
            9        1
B     2014  8        1
C     2015  4        1
            6        1
D     2015  3        1
E     2014  9        1
F     2014  11       1
G     2014  12       1
dtype: int64
>>> companies = df.groupby(['tool', 'year', 'month'])['company'].nunique()
>>> companies
tool  year  month
A     2014  6        1
            9        1
B     2014  8        1
C     2015  4        1
            6        1
D     2015  3        1
E     2014  9        1
F     2014  11       1
G     2014  12       1
Name: company, dtype: int64
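
The session above assumes a DataFrame named df already exists. To reproduce it, the question's table can be rebuilt by hand; the following is a minimal sketch with the values copied from the post, and the tuple-based construction is just one convenient way to do it.

```python
import pandas as pd

# Rows copied from the table in the question.
df = pd.DataFrame(
    [
        ("Amazon", "A", "productivity", 2014, 9),
        ("Amazon", "B", "productivity", 2014, 8),
        ("Apple", "A", "productivity", 2014, 6),
        ("Apple", "C", "CRM", 2015, 4),
        ("Apple", "D", "CRM", 2015, 3),
        ("Google", "C", "CRM", 2015, 6),
        ("Google", "E", "HR", 2014, 9),
        ("Google", "F", "productivity", 2014, 11),
        ("Google", "G", "productivity", 2014, 12),
    ],
    columns=["company", "tool", "category", "year", "month"],
)

# Purchases per (tool, year, month), and distinct buyers per (tool, year, month).
sales = df.groupby(["tool", "year", "month"]).size()
companies = df.groupby(["tool", "year", "month"])["company"].nunique()
```

With this sample every (tool, year, month) group holds exactly one purchase, which is why both results show 1 in every row.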

From there, cumulative sales are easy:

>>> sales.groupby('tool').cumsum()
tool  year  month
A     2014  6        1
            9        2
B     2014  8        1
C     2015  4        1
            6        2
D     2015  3        1
E     2014  9        1
F     2014  11       1
G     2014  12       1
dtype: int64

Note that a few months are missing, so we should reindex:

>>> # (year, month) pairs from 2014/06 through 2015/06
>>> dates = [(2014 + (n - 1) // 12, (n - 1) % 12 + 1) for n in range(6, 19)]
>>> idx = pd.MultiIndex.from_tuples([
...     (tool, year, month) for tool in df['tool'].unique() for year, month in dates
... ], names=['tool', 'year', 'month'])
>>> cum_sales = sales.reindex(idx, fill_value=0).groupby('tool').cumsum()
>>> cum_sales.unstack('tool')
tool        A  B  C  D  E  F  G
year month                     
2014 6      1  0  0  0  0  0  0
     7      1  0  0  0  0  0  0
     8      1  1  0  0  0  0  0
     9      2  1  0  0  1  0  0
     10     2  1  0  0  1  0  0
     11     2  1  0  0  1  1  0
     12     2  1  0  0  1  1  1
2015 1      2  1  0  0  1  1  1
     2      2  1  0  0  1  1  1
     3      2  1  0  1  1  1  1
     4      2  1  1  1  1  1  1
     5      2  1  1  1  1  1  1
     6      2  1  2  1  1  1  1

Of course, you can change the date range as needed.

The number of companies that bought competing tools is the number of companies that bought any tool minus the number that bought each tool. We can do this with transform, but as above, we need to reindex first:

>>> companies = companies.reindex(idx, fill_value=0)
>>> total_companies = companies.groupby(['year', 'month']).transform('sum')
>>> cum_compet_companies = (total_companies - companies).groupby('tool').cumsum()
>>> cum_compet_companies.unstack('tool')
tool        A  B  C  D  E  F  G
year month                     
2014 6      0  1  1  1  1  1  1
     7      0  1  1  1  1  1  1
     8      1  1  2  2  2  2  2
     9      2  3  4  4  3  4  4
     10     2  3  4  4  3  4  4
     11     3  4  5  5  4  4  5
     12     4  5  6  6  5  5  5
2015 1      4  5  6  6  5  5  5
     2      4  5  6  6  5  5  5
     3      5  6  7  6  6  6  6
     4      6  7  7  7  7  7  7
     5      6  7  7  7  7  7  7
     6      7  8  7  8  8  8  8

What is left is simply joining the data and adding monthlydate, possibly playing around with the index:

>>> res = cum_sales.to_frame('cumulative_sales').join(
...     cum_compet_companies.to_frame('no_companies_comp')
... ).reset_index()
>>> res['monthlydate'] = res['year'].combine(res['month'], lambda y, m: f'{y}/{m:02}')
>>> res.set_index(['tool', 'monthlydate']).loc['A']  # just tool A
             year  month  cumulative_sales  no_companies_comp
monthlydate                                                  
2014/06      2014      6                 1                  0
2014/07      2014      7                 1                  0
2014/08      2014      8                 1                  1
2014/09      2014      9                 2                  2
2014/10      2014     10                 2                  2
2014/11      2014     11                 2                  3
2014/12      2014     12                 2                  4
2015/01      2015      1                 2                  4
2015/02      2015      2                 2                  4
2015/03      2015      3                 2                  5
2015/04      2015      4                 2                  6
2015/05      2015      5                 2                  6
2015/06      2015      6                 2                  7
>>> res.set_index(['tool', 'monthlydate'])  # all tools
                  year  month  cumulative_sales  no_companies_comp
tool monthlydate                                                  
A    2014/06      2014      6                 1                  0
     2014/07      2014      7                 1                  0
     2014/08      2014      8                 1                  1
     2014/09      2014      9                 2                  2
     2014/10      2014     10                 2                  2
...                ...    ...               ...                ...
G    2015/02      2015      2                 1                  5
     2015/03      2015      3                 1                  6
     2015/04      2015      4                 1                  7
     2015/05      2015      5                 1                  7
     2015/06      2015      6                 1                  8

[91 rows x 4 columns]
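
The no_companies_comp values above treat every other tool as a competitor and add a company again each month it buys one. The expected output in the question (0, 0, 1, 1, 1, 2, 2 for tool A) appears instead to count each company only once, at its first purchase of another tool in the same category. The following is a minimal sketch under that reading. The same-category rule is an assumption inferred from the expected table, and tool_timeline is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Rows copied from the table in the question.
df = pd.DataFrame(
    [
        ("Amazon", "A", "productivity", 2014, 9),
        ("Amazon", "B", "productivity", 2014, 8),
        ("Apple", "A", "productivity", 2014, 6),
        ("Apple", "C", "CRM", 2015, 4),
        ("Apple", "D", "CRM", 2015, 3),
        ("Google", "C", "CRM", 2015, 6),
        ("Google", "E", "HR", 2014, 9),
        ("Google", "F", "productivity", 2014, 11),
        ("Google", "G", "productivity", 2014, 12),
    ],
    columns=["company", "tool", "category", "year", "month"],
)
# One monthly Period per purchase, so that months compare chronologically.
df["period"] = pd.to_datetime(df[["year", "month"]].assign(day=1)).dt.to_period("M")


def tool_timeline(df, tool, months):
    """Hypothetical helper: cumulative sales and distinct rival buyers per month."""
    own = df.loc[df["tool"] == tool, "period"]
    # Assumption: competitors are the other tools in the same category.
    category = df.loc[df["tool"] == tool, "category"].iloc[0]
    rivals = df[(df["category"] == category) & (df["tool"] != tool)]
    # Each company counts once, from the month of its first rival purchase.
    first_rival_purchase = rivals.groupby("company")["period"].min()
    return pd.DataFrame({
        "tool": tool,
        "monthlydate": list(months.strftime("%Y/%m")),
        "cumulative_sales": [int((own <= m).sum()) for m in months],
        "no_companies_comp": [int((first_rival_purchase <= m).sum()) for m in months],
    })


months = pd.period_range("2014-06", "2014-12", freq="M")
timeline_a = tool_timeline(df, "A", months)
print(timeline_a)
```

Building the month grid with pd.period_range avoids hand-written year and month arithmetic. To cover all tools, the per-tool frames could be combined with pd.concat over df["tool"].unique().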

Votes: 6

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/69246340
