I have a data file that looks like this:
data.head()
Out[2]:
Area Area Id Variable Name Variable Id Year \
0 Argentina 9 Conservation agriculture area 4454 1982
1 Argentina 9 Conservation agriculture area 4454 1987
2 Argentina 9 Conservation agriculture area 4454 1992
3 Argentina 9 Conservation agriculture area 4454 1997
4 Argentina 9 Conservation agriculture area 4454 2002
Value Symbol Md
0 2.0
1 6.0
2 500.0
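I want to reshape it so that Variable Name becomes the columns, Area and Year form the index, and Value supplies the values. The most intuitive approach to me is: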
data.pivot(index=['Area', 'Year'], columns='Variable Name', values='Value')
However, I get this error:
Traceback (most recent call last):
File "C:\Users\patri\Miniconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-4-4c786386b703>", line 1, in <module>
pd.concat(data_list).pivot(index=['Area', 'Year'], columns='Variable Name', values='Value')
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\frame.py", line 3853, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 377, in pivot
index=MultiIndex.from_arrays([index, self[columns]]))
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\series.py", line 250, in __init__
data = SingleBlockManager(data, index, fastpath=True)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\internals.py", line 4117, in __init__
fastpath=True)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\internals.py", line 2719, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\internals.py", line 1844, in __init__
placement=placement, **kwargs)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\internals.py", line 115, in __init__
len(self.mgr_locs)))
ValueError: Wrong number of items passed 119611, placement implies 2
How should I interpret this? I also tried another approach:
data.set_index(['Area', 'Variable Name', 'Year']).loc[:, 'Value'].unstack('Variable Name')
to get the same result, but it fails with the following error:
Traceback (most recent call last):
File "C:\Users\patri\Miniconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-222325ea01e1>", line 1, in <module>
pd.concat(data_list).set_index(['Area', 'Variable Name', 'Year']).loc[:, 'Value'].unstack('Variable Name')
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\series.py", line 2028, in unstack
return unstack(self, level, fill_value)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 458, in unstack
fill_value=fill_value)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 110, in __init__
self._make_selectors()
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 148, in _make_selectors
raise ValueError('Index contains duplicate entries, '
ValueError: Index contains duplicate entries, cannot reshape
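Is something wrong with the data? I have confirmed that no combination of Area, Variable Name and Year is repeated in any row of the dataframe, so I don't think there should be any duplicate entries, but I could be wrong. Given that neither approach works, how can I convert from long format to wide format? I have checked the answers here and here, but both cases involve some type of aggregation.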
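One way to double-check the no-duplicates claim is sketched below on a small made-up frame (the sample rows are assumptions for illustration, not the real data). `DataFrame.duplicated` with `subset` and `keep=False` flags every row whose key combination occurs more than once.

```python
import pandas as pd

# Hypothetical sample in the same long format; the values are made up.
sample = pd.DataFrame({
    'Area': ['Argentina', 'Argentina', 'Argentina', 'Brazil'],
    'Variable Name': ['Conservation agriculture area'] * 4,
    'Year': [1982, 1987, 1987, 1982],
    'Value': ['2.0', '6.0', '6.5', '1.0'],
})

keys = ['Area', 'Variable Name', 'Year']

# keep=False marks all members of each duplicated group, not just the repeats.
dupes = sample[sample.duplicated(subset=keys, keep=False)]
n_dupes = len(dupes)
print(dupes)
```

If `n_dupes` is zero on the real data, the key combinations are unique; otherwise the printed rows are the ones that prevent `unstack` from reshaping.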
I also tried using pivot_table like this:
data.pivot_table(index=['Area', 'Year'], columns='Variable Name', values='Value')
However, I think some kind of aggregation is taking place, and the dataset has many missing values, which leads to this error:
Traceback (most recent call last):
File "C:\Users\patri\Miniconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-7-77b28d2f0dbb>", line 1, in <module>
pd.concat(data_list).pivot_table(index=['Area', 'Year'], columns='Variable Name', values='Value')
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\reshape\pivot.py", line 136, in pivot_table
agged = grouped.agg(aggfunc)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 4036, in aggregate
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 3468, in aggregate
result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\base.py", line 435, in _aggregate
**kwargs), None
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\base.py", line 391, in _try_aggregate_string_function
return f(*args, **kwargs)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 1037, in mean
return self._cython_agg_general('mean', **kwargs)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 3354, in _cython_agg_general
how, alt=alt, numeric_only=numeric_only)
File "C:\Users\patri\Miniconda3\lib\site-packages\pandas\core\groupby.py", line 3425, in _cython_agg_blocks
raise DataError('No numeric types to aggregate')
pandas.core.base.DataError: No numeric types to aggregate
Posted on 2017-11-08 03:46:41
I think you need to convert the Value column to numeric first, and then use pivot_table with its default aggregation function, mean.
# If all values are floats stored as strings, a direct cast is enough
data['Value'] = data['Value'].astype(float)
# If some entries are not numeric and the cast above raises ValueError,
# coerce them to NaN instead
data['Value'] = pd.to_numeric(data['Value'], errors='coerce')
data.pivot_table(index=['Area', 'Year'], columns='Variable Name', values='Value')
Or:
data.groupby(['Area', 'Variable Name', 'Year'])['Value'].mean().unstack('Variable Name')
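As a minimal, self-contained sketch of both approaches, here they are on a made-up frame. The sample rows, the 'Total area' variable and the '..' placeholder for a non-numeric entry are all assumptions for illustration, not taken from the real data.

```python
import pandas as pd

# Hypothetical long-format sample; Value is stored as strings,
# and '..' stands in for a non-numeric entry (an assumption).
sample = pd.DataFrame({
    'Area': ['Argentina', 'Argentina', 'Argentina', 'Argentina'],
    'Variable Name': ['Conservation agriculture area', 'Total area',
                      'Conservation agriculture area', 'Total area'],
    'Year': [1982, 1982, 1987, 1987],
    'Value': ['2.0', '100.0', '6.0', '..'],
})

# Non-numeric strings become NaN instead of raising.
sample['Value'] = pd.to_numeric(sample['Value'], errors='coerce')

# Approach 1: pivot_table with its default aggfunc (mean).
wide = sample.pivot_table(index=['Area', 'Year'],
                          columns='Variable Name', values='Value')

# Approach 2: explicit groupby + mean, then unstack the variable level.
wide_alt = (sample.groupby(['Area', 'Variable Name', 'Year'])['Value']
                  .mean()
                  .unstack('Variable Name'))

print(wide)
```

When every (Area, Variable Name, Year) combination is unique, the mean of a single value is that value itself, so no real aggregation takes place; the mean only matters if duplicate keys exist.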
https://stackoverflow.com/questions/47178861