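My goal is to fetch some data from a website and keep it in memory (rather than downloading it to disk) so that I can do further processing on it. Here is my Python code: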
import pandas as pd
import requests
from requests.auth import HTTPBasicAuth
year = 2019
month_str = 'Jan'
date = 2
month = 1
user = XXXX
password = XXXX
response = requests.get('http_some_url/%i/%s/%02d/%i%02d%02d.gz' % (year,month_str,date,year,month,date), auth = HTTPBasicAuth(user, password))
x = pd.read_csv(response.text, compression='gzip', sep = '|')
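print(x.head())
The data is stored under the folders "year" => "month_str" => "date", and the file is named "year+month+date.gz". When I run this code, it raises
"ValueError: embedded null byte". What is the correct way to do this?
Update:
print(response)
<Response [200]>
When I print the response it shows 200, which means the request did return something.
Update: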
response = requests.get('http_some_url/%i/%s/%02d/%i%02d%02d.gz' % (year,month_str,date,year,month,date), auth = HTTPBasicAuth(user, password))
print(response)
x = pd.read_csv(response.content, compression='gzip', sep = '|')
print(x)
After replacing response.text with response.content and printing, it raises:
AttributeError: 'bytes' object has no attribute 'read'
Here are some sample lines from the gzip file:
093013399690000|310001|C|A|59.85|73.15|A||
093030000913000|353701|C|A|59.85|73.15|B||
093100000411000|460501|C|A|59.85|73.15|B||
093130000630000|697401|C|A|59.85|73.15|B||
093200000464000|841501|C|A|59.85|73.15|B||
093230000508000|1013801|C|A|59.85|73.15|B||
093300000550000|1148701|C|A|59.85|73.15|B||
093330000394000|1313701|C|A|59.85|73.15|B||
093400000590000|1485801|C|A|59.85|73.15|B||
093430000495000|1652601|C|A|59.85|73.15|B||
093500000593000|1856201|C|A|59.85|73.15|B||
Answered 2019-11-04 03:26:15
It looks like your string formatting is wrong.
f'http_some_url/{year}/{month_str}/{date}/{year}{month}{date}.gz'
Answered 2019-11-04 03:47:07
You only need pandas:
import pandas as pd
year = 2019
month_str = 'Jan'
date = 2
month = 1
user = XXXX
password = XXXX
gzip_url = f'http://{user}:{password}@some_url/{year}/{month_str}/{date:02d}/{year}{month:02d}{date:02d}.gz'
x = pd.read_csv(gzip_url, compression='gzip', sep = '|')
print(x.head())
Here is a proof of concept:
Python 3.7.5 (default, Oct 17 2019, 12:16:48)
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> gzip_file = 'http://127.0.0.1:8000/testfile.gz'
>>> df = pd.read_csv(gzip_file, compression='gzip', sep='|')
>>> df.head()
   093013399690000   310001  C  A  59.85  73.15  A.1  Unnamed: 7  Unnamed: 8
0   93030000913000   353701  C  A  59.85  73.15    B         NaN         NaN
1   93100000411000   460501  C  A  59.85  73.15    B         NaN         NaN
2   93130000630000   697401  C  A  59.85  73.15    B         NaN         NaN
3   93200000464000   841501  C  A  59.85  73.15    B         NaN         NaN
4   93230000508000  1013801  C  A  59.85  73.15    B         NaN         NaN
>>>
As we discussed in chat, here is an alternative that uses requests:
import pandas as pd
import requests
from requests.auth import HTTPBasicAuth
from gzip import decompress
from io import StringIO
year = 2019
month_str = 'Jan'
date = 2
month = 1
user = XXXX
password = XXXX
gzip_url = f'http://some_url/{year}/{month_str}/{date:02d}/{year}{month:02d}{date:02d}.gz'
with requests.get(gzip_url, auth=HTTPBasicAuth(user, password)) as request:
    if request.ok:
        df = pd.read_csv(StringIO(decompress(request.content).decode('utf8')), sep='|')
        print(df.head())
https://stackoverflow.com/questions/58683725
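As a variation on the requests-based answer above: `pd.read_csv` expects a path, a URL, or a file-like object. That is consistent with the two errors in the question, where the `str` from `response.text` was treated as a file path and the raw `bytes` from `response.content` had no `.read` method. The following is a minimal sketch, assuming the payload is a gzip-compressed, pipe-delimited file with no header row. `read_gzip_bytes` is a hypothetical helper name, and the payload is built locally from two of the sample lines rather than downloaded.

```python
import gzip
import io

import pandas as pd


def read_gzip_bytes(payload: bytes) -> pd.DataFrame:
    """Parse gzip-compressed, pipe-delimited bytes without touching disk."""
    raw = gzip.decompress(payload)  # still bytes, now uncompressed
    # BytesIO gives read_csv the file-like object it expects.
    # header=None is an assumption: the sample lines show no header row.
    # dtype={0: str} keeps the leading zero of the first field.
    return pd.read_csv(io.BytesIO(raw), sep="|", header=None, dtype={0: str})


# Stand-in for response.content, built from two sample lines of the question.
sample = (
    b"093013399690000|310001|C|A|59.85|73.15|A||\n"
    b"093030000913000|353701|C|A|59.85|73.15|B||\n"
)
payload = gzip.compress(sample)
df = read_gzip_bytes(payload)
print(df)
```

With a real download, `read_gzip_bytes(response.content)` would take the place of the stand-in payload. Skipping the `decode('utf8')` step is a design choice here, since `read_csv` can handle the byte stream itself.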
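The answers above build the URL with f-strings, and only some of them keep the zero padding that the `%02d` placeholders in the question applied. The sketch below is one way to check that an f-string with `:02d` format specs gives the same path as the `%`-style version, while the unpadded form does not. `build_url` is a hypothetical helper and `http://some_url` is the placeholder host from the answers, not a real endpoint.

```python
def build_url(base: str, year: int, month_str: str, month: int, day: int) -> str:
    """Build <base>/<year>/<month_str>/<dd>/<yyyymmdd>.gz with zero padding."""
    return f"{base}/{year}/{month_str}/{day:02d}/{year}{month:02d}{day:02d}.gz"


base = "http://some_url"  # placeholder host, as in the answers
year, month_str, month, day = 2019, "Jan", 1, 2

# f-string with explicit width and zero padding
new_style = build_url(base, year, month_str, month, day)

# %-style formatting, following the pattern used in the question
old_style = "%s/%i/%s/%02d/%i%02d%02d.gz" % (base, year, month_str, day, year, month, day)

# f-string without format specs: the padding is lost
unpadded = f"{base}/{year}/{month_str}/{day}/{year}{month}{day}.gz"

print(new_style)
print(unpadded)
```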