I'm reading a .pksc file that contains the coordinates and velocities of a large number of astronomical objects. I'm reading it like this:
import numpy as np
f = open('halos_10x10.pksc')
data = np.fromfile(f, count=N*10, dtype=np.float32)
The file can be found here. It is very large, and I would like to skip the first m objects (or, if the file had rows, the first m rows corresponding to those objects). How can I do that? I don't see any option for skipping. It would also be nice to be able to skip the last k objects in the file. Thanks!
Posted on 2022-02-11 10:36:29
The first thing to note is that your PKSC file is binary: it is one continuous string of bytes, with no obvious breaks in the data.
A text file, on the other hand, has lines explicitly delimited by some line-break character, so it's very easy to read one line at a time, ignore the first M of them, and then read the remaining lines you care about: REMAINING_LINES = ALL_LINES - M_LINES - K_LINES.
np.fromfile() reads a binary file one item at a time. To do that, it needs the dtype= parameter to tell it how big an item is. For a PKSC file, we'll represent the items as 32-bit integers, np.int32.
I searched and searched but could not find a spec for the file. Fortunately, the link you provided includes a sample Python script for reading the file; I also found a well-documented Python program for processing these kinds of files (websky.py, linked below).
From those I learned that a PKSC file starts with 3 int32 header items, that the first header item is the total number of halo records that follow, and that each record is made up of 10 items.
np.fromfile() also takes a count= parameter as an instruction for how many items to read.
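As a quick illustration of how dtype= and count= interact (this uses a tiny synthetic file of my own invention, not the real PKSC layout):

```python
import numpy as np

# Write a small synthetic binary file: 6 int32 items (an assumption for
# illustration only -- not the real PKSC layout).
np.array([10, 20, 30, 40, 50, 60], dtype=np.int32).tofile('demo.bin')

with open('demo.bin', 'rb') as f:
    first = np.fromfile(f, dtype=np.int32, count=2)   # reads items 10, 20
    rest = np.fromfile(f, dtype=np.int32, count=-1)   # count=-1: read the remainder

print(first)  # [10 20]
print(rest)   # [30 40 50 60]
```

Each successive np.fromfile() call picks up where the previous one left off, which is what makes the header-then-records reading below work.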
Here's how to read the 3 header items, get the total number of Halo records that follow, and read the first two records (10 items each):
Nitems_per_record = 10
f = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f, dtype=np.int32, count=3)
print(f'Headers: {headers}')
print(f'This file contains {headers[0]} records with Halo data')
record1 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
print(f'First record:\n{record1}')
record2 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
print(f'Second record:\n{record2}')
Headers: [2079516 2079516 2079516]
This file contains 2079516 records with Halo data
First record:
[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585 1086444250 1170064687 -1011110709 -1006510502]
Second record:
[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]
According to websky.py, the second and third header items also carry relevant values; maybe you care about those too? I synthesized the following from that code:
RTHMAXin = headers[1]
redshiftbox = headers[2]
Reading multiple records at a time requires reshaping the data. To read 3 records:
f = open('halos_10x10.pksc', 'rb')
np.fromfile(f, dtype=np.int32, count=3)  # reading, but ignoring, the header items
three_records = np.fromfile(f, dtype=np.int32, count=3*Nitems_per_record)
print(f'Initial:\n{three_records}')
reshaped_records = np.reshape(three_records, (3, Nitems_per_record))
print(f'Re-shaped:\n{reshaped_records}')
Initial:
[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585
1086444250 1170064687 -1011110709 -1006510502 1170083367 -1013908122
-1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354
-1013985376 -1006510502 1169622353 -1009409432 -1006678295 -1045415727
-1017794908 -1051267742 1084874393 1169623221 -1009509109 -1006675510]
Re-shaped:
[[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585 1086444250 1170064687 -1011110709 -1006510502]
[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]
[ 1169622353 -1009409432 -1006678295 -1045415727 -1017794908 -1051267742 1084874393 1169623221 -1009509109 -1006675510]]
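As an aside (my addition, not part of the original answer): np.reshape accepts -1 for one dimension and infers it from the array's size, so you don't have to hard-code the record count:

```python
import numpy as np

Nitems_per_record = 10
flat = np.arange(30, dtype=np.int32)           # stand-in for 3 records of 10 items
records = flat.reshape(-1, Nitems_per_record)  # record count inferred automatically
print(records.shape)  # (3, 10)
```

That's handy when you've read "however many items were left" and don't know the count up front.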
So, what about skipping?
Just trim the reshaped data
The easiest thing to do is read all the data, then trim off what you don't want from the front and the back:
m = 1
k = 1 * -1
trimmed_records = reshaped_records[m:k]
print(f'Trimmed:\n{trimmed_records}')
Trimmed:
[[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]]
I don't know your reasons for skipping, but this is the easiest to understand and implement. If memory is a concern, read on.
Discard the first M records, read K fewer records
The next option, as I see it, is this: out of all A records, you want to skip the first M records and stop K records before the end, so work out how many remaining records R need to be read: R = A - M - K.
Ignoring the first M records only saves a little memory; that data still gets read and interpreted. Never reading the last K records, though, definitely saves memory:
f = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f, dtype=np.int32, count=3)
Arecords = headers[0]
Mrecords = 1_000_000
Krecords = 1_000_000
Nitems = Mrecords * Nitems_per_record
np.fromfile(f, dtype=np.int32, count=Nitems)
Rrecords = Arecords - Mrecords - Krecords # Remaining records to read
Nitems = Rrecords * Nitems_per_record
data = np.fromfile(f, dtype=np.int32, count=Nitems)
data = np.reshape(data, (Rrecords, Nitems_per_record))
print(f'From {Arecords} to {Rrecords} records:\n{data.shape}')
From 2079516 to 79516 records:
(79516, 10)
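A possible refinement, beyond what the snippet above does: instead of reading and discarding the first M records, seek past them, so those bytes are never read or interpreted at all. A sketch against a tiny synthetic file with the same assumed layout (3 int32 headers, 10 int32 items per record; the filename is illustrative):

```python
import numpy as np

Nitems_per_record = 10
itemsize = np.dtype(np.int32).itemsize  # 4 bytes per item

# Synthetic file mimicking the assumed PKSC layout: 3 int32 headers, then records.
Arecords = 5
with open('demo_halos.pksc', 'wb') as f:
    np.array([Arecords, 0, 0], dtype=np.int32).tofile(f)
    np.arange(Arecords * Nitems_per_record, dtype=np.int32).tofile(f)

Mrecords, Krecords = 2, 1
with open('demo_halos.pksc', 'rb') as f:
    hdr = np.fromfile(f, dtype=np.int32, count=3)
    # Jump past the first M records without reading them (whence=1: relative seek)
    f.seek(Mrecords * Nitems_per_record * itemsize, 1)
    Rrecords = int(hdr[0]) - Mrecords - Krecords
    data = np.fromfile(f, dtype=np.int32, count=Rrecords * Nitems_per_record)

data = data.reshape(Rrecords, Nitems_per_record)
print(data.shape)  # (2, 10)
```

The seek only moves the file position, so skipping M records costs nothing no matter how large M is.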
Posted on 2022-02-11 15:36:34
If you just need to break the big file up into smaller files, you can then operate on those independently:
import numpy as np

Nrecords_per_chunk = 100_000
Nitems_per_record = 10

f_in = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f_in, dtype=np.int32, count=3)

Nitems = Nrecords_per_chunk * Nitems_per_record
fnumber = 1
while True:
    items = np.fromfile(f_in, dtype=np.int32, count=Nitems)

    # Because at the end of the file, we're very likely to get less back than we asked for
    Nrecords_read = int(items.shape[0] / Nitems_per_record)

    # At End Of File: Weird luck, chunk_size was a perfect multiple of number of records
    if Nrecords_read == 0:
        break

    records = np.reshape(items, (Nrecords_read, Nitems_per_record))

    with open(f'halos_{fnumber}.pksc', 'wb') as f_out:
        # Keep same format by having 3 "header" items, each item's value is the record count
        new_headers = np.array([Nrecords_read]*3, dtype=np.int32)
        new_headers.tofile(f_out)
        records.tofile(f_out)

    # At End Of File
    if Nrecords_read < Nrecords_per_chunk:
        break

    fnumber += 1

f_in.close()
# Test that first 100_000 records from the main file match the records from the first chunked file
f_in = open('halos_10x10.pksc', 'rb')
np.fromfile(f_in, dtype=np.int32, count=3)
Nitems = Nrecords_per_chunk * Nitems_per_record
items = np.fromfile(f_in, dtype=np.int32, count=Nitems)
records_orig = np.reshape(items, (Nrecords_per_chunk, Nitems_per_record))
f_in.close()
f_in = open('halos_1.pksc', 'rb')
np.fromfile(f_in, dtype=np.int32, count=3)
Nitems = Nrecords_per_chunk * Nitems_per_record
items = np.fromfile(f_in, dtype=np.int32, count=Nitems)
records_chunked = np.reshape(items, (Nrecords_per_chunk, Nitems_per_record))
f_in.close()
assert np.array_equal(records_orig, records_chunked)
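One more option worth knowing about (my addition, not from the answers above): np.memmap maps the file into memory lazily, so trimming the first m and last k records never actually touches the skipped bytes. A sketch under the same assumed layout (3 int32 headers, 10 int32 items per record), with an illustrative filename:

```python
import numpy as np

Nitems_per_record = 10

# Synthetic file with the assumed PKSC layout: 3 int32 headers, then records.
A = 4
with open('demo_mm.pksc', 'wb') as f:
    np.array([A, 0, 0], dtype=np.int32).tofile(f)
    np.arange(A * Nitems_per_record, dtype=np.int32).tofile(f)

headers = np.fromfile('demo_mm.pksc', dtype=np.int32, count=3)
A = int(headers[0])

# offset skips the 3 header items (3 * 4 bytes); nothing is read until sliced
mm = np.memmap('demo_mm.pksc', dtype=np.int32, mode='r',
               offset=3 * 4, shape=(A, Nitems_per_record))

m, k = 1, 1
trimmed = mm[m:A - k]  # records m .. A-k-1; the skipped records are never loaded
print(trimmed.shape)  # (2, 10)
```

mode='r' keeps the mapping read-only, so there's no risk of accidentally modifying the catalog file.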
https://stackoverflow.com/questions/71083386