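I'm stuck on something that I think should be easy to solve with NumPy, but I just can't see it. Let's define a sample array containing some missing values: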
import numpy as np
input_data = np.array([[1,3,5,8,6],[3,np.nan,np.nan,5,6],[np.nan,6,7,np.nan,2]])
Out[530]: [[1, 3, 5, 8, 6], [3, nan, nan, 5, 6], [nan, 6, 7, nan, 2]]
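What I'm looking for is an array that gives, for each element, the distance to the previous valid value in its row. For the example above, that would look like this: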
delta_valid = [[nan, 1, 1, 1, 1], [nan, 1, 2, 3, 1], [nan, nan, 1, 1, 2]]
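The first element of each row is always NaN, since there is no previous value (I'm not sure whether there is a better way to define this value).
Can anyone help me get this result in NumPy? Thanks a lot!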
Posted on 2017-05-04 03:19:04
You are basically counting up (1, 2, 3, ...) until the next non-NaN. To handle that, we can use some diff + cumsum magic on each row, like so:
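def closest_distance_per_row(a):
    # Per-element increments; a cumulative sum over these yields the distances.
    m0 = np.ones(a.shape,dtype=int)
    mask = ~np.isnan(a)
    for i,item in enumerate(a):
        idx = np.flatnonzero(mask[i])  # column positions of valid values in row i
        if len(idx)>0:
            m0[i,:idx[0]] = 0  # nothing to count before the first valid value
            # At each later valid position, cancel the accumulated gap so the count restarts at 1
            m0[i,idx[1:]] = idx[:-1] - idx[1:] +1

    out = np.full(a.shape,np.nan,dtype=float)
    out[:,1:] = m0[:,:-1].cumsum(1)  # shift right by one column: distance to the previous value
    out[out==0] = np.nan  # positions with no earlier valid value
    out[~mask.any(1)] = np.nan  # rows that contain no valid value at all
    return out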
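To make the diff + cumsum step easier to follow, here is a small trace of the same idea on a single row from the question. It is only an illustrative sketch: the names `row`, `steps` and `dist` are made up for this example and are not part of the function above.

```python
import numpy as np

# One row from the example: valid values at positions 0, 3 and 4.
row = np.array([3, np.nan, np.nan, 5, 6])
idx = np.flatnonzero(~np.isnan(row))   # array([0, 3, 4])

# Start with steps of +1, so a running sum counts 1, 2, 3, ...
steps = np.ones(row.size, dtype=int)
steps[:idx[0]] = 0                     # nothing to count before the first valid value
# At each later valid position, subtract the gap to the previous valid one,
# so the running sum falls back to 1 right after it.
steps[idx[1:]] = idx[:-1] - idx[1:] + 1
# steps is now [1, 1, 1, -2, 0]

dist = np.full(row.size, np.nan)
dist[1:] = steps[:-1].cumsum()         # shift by one: distance to the *previous* valid value
dist[dist == 0] = np.nan               # a zero means no earlier valid value exists
# dist is now [nan, 1, 2, 3, 1]
```

The running sum grows by one per column and is pulled back down at each valid position, which is why the value right after a valid entry is always 1.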
Sample run:
In [353]: a
Out[353]:
array([[ 1., 3., 5., 8., 6.],
[ 3., nan, nan, 5., 6.],
[ nan, 6., 7., nan, 2.]])
In [354]: closest_distance_per_row(a)
Out[354]:
array([[ nan, 1., 1., 1., 1.],
[ nan, 1., 2., 3., 1.],
[ nan, nan, 1., 1., 2.]])
In [343]: a
Out[343]:
array([[ nan, nan, nan, nan, nan, nan, 4., nan, 3., 1.],
[ nan, nan, 6., nan, nan, nan, nan, nan, nan, nan],
[ 0., nan, 2., nan, 1., nan, 0., nan, nan, nan],
[ 3., nan, 2., nan, 8., 6., nan, 4., 2., nan],
[ nan, 0., nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 2., nan, 0., nan, nan, 1., nan, nan]])
In [344]: closest_distance_per_row(a)
Out[344]:
array([[ nan, nan, nan, nan, nan, nan, nan, 1., 2., 1.],
[ nan, nan, nan, 1., 2., 3., 4., 5., 6., 7.],
[ nan, 1., 2., 1., 2., 1., 2., 1., 2., 3.],
[ nan, 1., 2., 1., 2., 1., 1., 2., 1., 1.],
[ nan, nan, 1., 2., 3., 4., 5., 6., 7., 8.],
[ nan, nan, nan, 1., 2., 1., 2., 3., 1., 2.]])
Runtime test:
In [4]: a = np.random.randint(0,9,(5000,5000)).astype(float)
In [5]: a.ravel()[np.random.choice(a.size, int(a.size*0.5), replace=0)] = np.nan
In [6]: %timeit two_loops(a)
1 loops, best of 3: 16.7 s per loop
In [7]: %timeit closest_distance_per_row(a)
1 loops, best of 3: 339 ms per loop
In [8]: 16700/339.0 # Speedup with one loop (proposed in this post) over two loops
Out[8]: 49.26253687315634
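For comparison, the per-row loop can probably be avoided entirely with a running maximum over column indices. This is a different technique from the one in this answer, sketched here under the assumption that the input is a 2D float array; the name `distance_to_previous_valid` is hypothetical, and it has not been timed against the functions above.

```python
import numpy as np

def distance_to_previous_valid(a):
    """Loop-free sketch: distance from each element to the previous non-NaN in its row."""
    a = np.asarray(a, dtype=float)
    cols = np.arange(a.shape[1])
    # Column index where the value is valid, -1 where it is NaN.
    valid_idx = np.where(~np.isnan(a), cols, -1)
    # Running maximum = index of the latest valid value at or before each column.
    last_valid = np.maximum.accumulate(valid_idx, axis=1)
    # Shift right by one column so each position only sees strictly earlier values.
    prev_valid = np.full(a.shape, -1)
    prev_valid[:, 1:] = last_valid[:, :-1]
    out = (cols - prev_valid).astype(float)
    out[prev_valid < 0] = np.nan       # no earlier valid value in this row
    return out

input_data = np.array([[1, 3, 5, 8, 6],
                       [3, np.nan, np.nan, 5, 6],
                       [np.nan, 6, 7, np.nan, 2]])
result = distance_to_previous_valid(input_data)
# result rows: [nan, 1, 1, 1, 1], [nan, 1, 2, 3, 1], [nan, nan, 1, 1, 2]
```

Rows made up entirely of NaN come out as all NaN, matching the behaviour of `closest_distance_per_row`.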
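Posted on 2017-05-04 01:28:35
Here is one way to solve your problem. It may not be optimal, since I could probably do something fancier with map and/or list comprehensions, but at least it solves your immediate problem: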
import numpy as np
input_data = np.array([[1,3,5,8,6],[3,np.nan,np.nan,5,6],[np.nan,6,7,np.nan,2]])
def distance(vector):
    dist = np.nan
    dists = []
    for a in vector:
        dists.append(dist)
        dist = dist + 1 if np.isnan(a) else 1
    return np.array(dists)

dists = np.empty(input_data.shape)
for row_num, row in enumerate(input_data):
    dists[row_num, :] = distance(row)
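It currently also only works for 2D arrays, but it could probably be generalized easily.
Also, the code above is not very optimized. For a fairer comparison with the accepted answer, here is a more optimized version, with no extra function calls and no list building: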
def two_loops(input_data):
    dists = np.empty(input_data.shape)
    for row_num, row in enumerate(input_data):
        dist = np.nan
        for col_num, value in enumerate(row):
            dists[row_num, col_num] = dist
            dist = dist + 1 if np.isnan(value) else 1
    return dists
This brings the execution times much closer together. When I measured, my solution took roughly twice as long to execute.
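On the remark that this only handles 2D arrays: one possible way to generalize is to apply the 1D routine along a chosen axis with `np.apply_along_axis`. The sketch below assumes that approach; `distance_1d` and `distance_along_axis` are hypothetical names, and since it still loops in Python it should not be expected to be any faster.

```python
import numpy as np

def distance_1d(vector):
    # Same running-counter logic as above, for a single 1-D vector.
    dist = np.nan
    dists = np.empty(len(vector))
    for i, value in enumerate(vector):
        dists[i] = dist
        # NaN + 1 stays NaN, so the counter only starts after the first valid value.
        dist = dist + 1 if np.isnan(value) else 1
    return dists

def distance_along_axis(data, axis=-1):
    # Hypothetical helper: run the 1-D routine along one axis of an N-d array.
    return np.apply_along_axis(distance_1d, axis, np.asarray(data, dtype=float))

input_data = np.array([[1, 3, 5, 8, 6],
                       [3, np.nan, np.nan, 5, 6],
                       [np.nan, 6, 7, np.nan, 2]])
cube = np.stack([input_data, input_data])   # shape (2, 3, 5)
dists_3d = distance_along_axis(cube, axis=-1)
```

With `axis=-1`, each 2D slice of `dists_3d` holds the same per-row distances as the 2D version.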
https://stackoverflow.com/questions/43778450