groupby is one of the most important functions in pandas, used mainly for data aggregation and group-wise computation. Its underlying idea is "split-apply-combine".
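groupby splits the data by some attribute (a column) and returns a grouped object; a function is then applied to that object, for example with apply(function).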
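To make the three steps concrete, here is a minimal sketch that performs split, apply and combine by hand and compares the result with a single groupby call. The toy DataFrame and the column names key and value are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the article.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-DataFrame per distinct key.
pieces = {k: df[df["key"] == k] for k in df["key"].unique()}

# Apply: compute the sum of "value" for each piece.
sums = {k: piece["value"].sum() for k, piece in pieces.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(sums, name="value").sort_index()

# groupby performs the same three steps in one call.
auto = df.groupby("key")["value"].sum()

print(manual)
print(auto)
```

Both Series should hold the same per-key totals; the groupby version simply hides the intermediate pieces.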
DataFrame.groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)
by and as_index are the most commonly used parameters.
Returns: DataFrameGroupBy or SeriesGroupBy, depending on the calling object; the returned groupby object contains information about the groups.
Pass groupby the name of the column to group on (a single column):
In [1]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8),
   ...:                    'D': np.random.randn(8)})
   ...:
In [2]: df
Out[2]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
In [3]: df.groupby('A').sum()  # group by A, then apply sum() to each group
Out[3]:
            C        D
A
bar -2.802588  2.42611
foo  3.146492 -0.63958
In [4]: df.groupby(['A', 'B']).sum()  # pass several columns as a list; the result has a hierarchical index
Out[4]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
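A result with a hierarchical index can be read in several ways. The sketch below uses a small hypothetical DataFrame (fixed numbers instead of the random values above) to show selecting one (A, B) combination, pivoting the inner level into columns with unstack, and flattening the index with reset_index.

```python
import pandas as pd

# Hypothetical data with fixed values so the results are easy to follow.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Grouping on two columns yields a Series with a two-level MultiIndex.
result = df.groupby(["A", "B"])["C"].sum()

foo_one = result.loc[("foo", "one")]  # select one (A, B) combination
wide = result.unstack("B")            # move level B into the columns
flat = result.reset_index()           # turn both levels back into columns

print(wide)
print(flat)
```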
Importing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read the CSV data; the fields are separated by |
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
df = pd.read_csv(url, sep="|")
df.head()  # show the first 5 rows
Characteristics of the data
Problem to solve
groupby
mean()
sort_values, which sorts in ascending order (asc) by default
df.column
df.groupby("occupation").age.mean().sort_values(ascending=False)  # the default order is ascending
# df.groupby(df["occupation"]).age.mean().sort_values(ascending=False)
# df.groupby(by="occupation").age.mean().sort_values(ascending=False)  # the by keyword can be omitted
occupation
retired 63.071429
doctor 43.571429
educator 42.010526
healthcare 41.562500
librarian 40.000000
administrator 38.746835
executive 38.718750
marketing 37.615385
......
Name: age, dtype: float64
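The same chain can be tried on a few hand-made rows. This is a minimal sketch with hypothetical data (not the u.user file) showing that sort_values sorts in ascending order unless ascending=False is passed, and that attribute access (.age) and bracket access (["age"]) select the same column.

```python
import pandas as pd

# Hypothetical rows that mimic the occupation and age columns.
df = pd.DataFrame({
    "occupation": ["doctor", "doctor", "student", "student", "retired"],
    "age": [40, 46, 20, 22, 63],
})

mean_age = df.groupby("occupation")["age"].mean()
same_mean = df.groupby("occupation").age.mean()  # attribute access, same column

ascending = mean_age.sort_values()                 # default: smallest first
descending = mean_age.sort_values(ascending=False) # largest first

print(descending)
```

Bracket access is the safer habit when a column name contains spaces or clashes with an existing method name.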
df is split into several parts, one for each occupation.
The mean age is computed for each occupation.
The results are combined into a DataFrame or Series.
The result of groupby is an object; it only becomes a Series or DataFrame once a function (here, mean) is applied to it.
type(df.groupby("occupation"))
# output
pandas.core.groupby.groupby.DataFrameGroupBy
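A small sketch of this distinction, using hypothetical data: the object returned by groupby, and the one obtained by selecting a column from it, are GroupBy objects; only the aggregated results are a Series or DataFrame. The class names are checked by name because the module path printed above can differ between pandas versions.

```python
import pandas as pd

# Hypothetical data with one grouping column and one numeric column.
df = pd.DataFrame({
    "occupation": ["doctor", "doctor", "student"],
    "age": [40, 46, 20],
})

grouped = df.groupby("occupation")  # nothing is aggregated yet
col_grouped = grouped["age"]        # still a GroupBy object

series_result = col_grouped.mean()  # aggregation produces a Series
frame_result = grouped.mean()       # aggregation produces a DataFrame

print(type(grouped).__name__, type(col_grouped).__name__)
print(type(series_result).__name__, type(frame_result).__name__)
```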
The size function counts the rows in each group:
df.groupby(['occupation','gender']).size()
# Output
occupation gender
administrator F 36
M 43
artist F 13
M 15
doctor M 7
educator F 26
M 69
......
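Since size is easy to confuse with count and sum, the following sketch contrasts the three on hypothetical data that contains missing ages: size counts every row in a group, count counts only non-null values, and sum adds the values up.

```python
import numpy as np
import pandas as pd

# Hypothetical data; two ages are missing on purpose.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "age": [30, np.nan, 25, 40, np.nan],
})

sizes = df.groupby("gender").size()            # rows per group, NaN included
counts = df.groupby("gender")["age"].count()   # non-null ages per group
totals = df.groupby("gender")["age"].sum()     # an actual sum of the ages

print(sizes)
print(counts)
print(totals)
```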
df.groupby(['occupation','gender']).age.mean()
# Output
occupation gender
administrator F 40.638889
M 37.162791
artist F 30.307692
M 32.333333
doctor M 43.571429
educator F 39.115385
M 43.101449
engineer F 29.500000
M 36.600000
groupby in detail
by: can be a column name, or a Series with the same number of rows as df
as_index: whether to use the groupby column as the index; the default is True
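The two parameters can be tried out on a few rows. This is a sketch on hypothetical data; the helper Series over_30 is an invented name used only to show that the grouping key does not have to be a column of df, as long as it lines up with df's rows.

```python
import pandas as pd

# Hypothetical data with only a key column and a numeric column.
df = pd.DataFrame({"gender": ["M", "F", "M", "F"], "age": [24, 53, 23, 33]})

# by as a column name and by as a Series give the same grouping.
by_name = df.groupby("gender")["age"].mean()
by_series = df.groupby(df["gender"])["age"].mean()

# A Series that is not a column of df (hypothetical helper).
over_30 = (df["age"] >= 30).rename("over_30")
by_flag = df.groupby(over_30)["age"].count()

# as_index=True (default): the key becomes the index.
with_index = df.groupby("gender").mean()
# as_index=False: the key stays an ordinary column.
without_index = df.groupby("gender", as_index=False).mean()

print(with_index)
print(without_index)
```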
demo = df[:5]
demo.groupby("gender").apply(lambda x: print(x))
# result
user_id age gender occupation zip_code
1 2 53 F other 94043
4 5 33 F other 15213
user_id age gender occupation zip_code
1 2 53 F other 94043
4 5 33 F other 15213
user_id age gender occupation zip_code
0 1 24 M technician 85711
2 3 23 M writer 32067
3 4 24 M technician 43537
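The data of one DataFrame is printed twice; see Stack Overflow for the explanation. (In older pandas versions, apply calls the function twice on the first group to decide which code path to use.)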
# Iterate over the groups
grouped = df.groupby(["gender", "age"])
for name, group in grouped:
    print("name: {}".format(name))
    print("group: {}".format(group))
    print("--------------")
# Select a single group
grouped = df.groupby("gender")
grouped.get_group("M")
df.groupby(["gender", "age"]).get_group(("M", 18))
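get_group raises a KeyError when the requested key does not exist, so it can help to inspect the available keys first. A minimal sketch on hypothetical rows; the groups attribute maps each group key to the row labels that belong to it.

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration.
df = pd.DataFrame({"gender": ["M", "F", "M", "F"], "age": [24, 53, 24, 33]})

grouped = df.groupby("gender")
keys = list(grouped.groups)        # available group keys
males = grouped.get_group("M")     # all rows where gender == "M"

# With several grouping columns the key is a tuple.
pair = df.groupby(["gender", "age"]).get_group(("M", 24))

print(keys)
print(males)
print(pair)
```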
# Aggregate after grouping (mean, max/min, count, sum, etc.) by calling agg()
grouped = df.groupby("gender")
grouped["age"].agg(len)
grouped["age"].agg(['mean','std','count','max'])  # several aggregation functions can be passed at once
grouped["age"].agg("max")
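When several columns need different aggregations, agg also accepts keyword arguments of the form new_name=(column, function). The sketch below assumes pandas 0.25 or later (where this named aggregation is available) and uses hypothetical data; mean_age and n_users are invented output names.

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration.
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "age": [24, 53, 23, 33],
    "user_id": [1, 2, 3, 4],
})

# A list of function names gives one output column per function.
stats = df.groupby("gender")["age"].agg(["mean", "max", "count"])

# Named aggregation: choose the output column names explicitly.
named = df.groupby("gender").agg(
    mean_age=("age", "mean"),
    n_users=("user_id", "count"),
)

print(stats)
print(named)
```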
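Two ways to keep the grouping keys as ordinary columns instead of the index: reset_index() and as_index=False
# 1
res = grouped.agg(len)  # similar to grouped.count()
res.reset_index()  # reset the index
# 2
grouped = df.groupby(["gender", "age"], as_index=False)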
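The two approaches should give the same table. A minimal sketch on hypothetical data, counting user_id values per (gender, age) pair once with reset_index() and once with as_index=False:

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration.
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "age": [24, 53, 24, 33],
    "user_id": [1, 2, 3, 4],
})

# 1: aggregate first, then move the index levels back into columns.
via_reset = df.groupby(["gender", "age"])["user_id"].count().reset_index()

# 2: ask groupby not to build the index in the first place.
via_flag = df.groupby(["gender", "age"], as_index=False)["user_id"].count()

print(via_reset)
print(via_flag)
```

as_index=False saves a step when the result feeds straight into a merge or a plot; reset_index() is handy when the indexed form is needed first.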