I am trying to download a CSV from
https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.html
or, failing that, to scrape the HTML table output found at the link below into a data frame.
[https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]](https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z%29:1:(2019-02-09T12:00:00Z%29][(-6.975%29:1:(42.025%29][(179.025%29:1:(238.025%29],analysis_error[(2019-02-09T12:00:00Z%29:1:(2019-02-09T12:00:00Z%29][(-6.975%29:1:(42.025%29][(179.025%29:1:(238.025%29],mask[(2019-02-09T12:00:00Z%29:1:(2019-02-09T12:00:00Z%29][(-6.975%29:1:(42.025%29][(179.025%29:1:(238.025%29],sea_ice_fraction[(2019-02-09T12:00:00Z%29:1:(2019-02-09T12:00:00Z%29][(-6.975%29:1:(42.025%29][(179.025%29:1:(238.025%29])
I have tried scraping the data with the following:
library(rvest)
url <- read_html("https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-7):1:(42)][(179):1:(238)]")
test <- url %>%
  html_nodes(xpath='table.erd.commonBGColor.nowrap') %>%
  html_text()
I also tried to download the CSV with the following command:
download.file(url, destfile = "~/Documents/test.csv", mode = 'wb')
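Neither approach works. The download.file call produced a CSV containing only a node description. The rvest approach gave me one giant string on my MacBook and an empty data frame on my Windows machine. I also tried SelectorGadget (the Chrome extension) to pick out the data I need, but SelectorGadget does not seem to work on the htmlTable page.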
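Posted on 2019-02-11 04:14:25
I managed to find a solution using the htmltab package, though I am not sure it is optimal: the page holds a very large table, so it takes a while to load into a data frame. The selector //table[2] picks out the actual data table, because the link you gave contains two HTML tables.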
url1 <- "https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN.htmlTable?analysed_sst[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],analysis_error[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],mask[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)],sea_ice_fraction[(2019-02-09T12:00:00Z):1:(2019-02-09T12:00:00Z)][(-6.975):1:(42.025)][(179.025):1:(238.025)]"
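library(htmltab)
# "//table[2]" is an XPath that selects the second table on the page (the data table)
tbls <- htmltab(url1, which = "//table[2]")
rdf <- as.data.frame(tbls)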
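For comparison, here is a minimal rvest sketch of the same table selection. In the question, `table.erd.commonBGColor.nowrap` is a CSS selector but was passed as `xpath=`, and `html_text()` flattens the table to text rather than parsing it. The HTML below is a made-up stand-in for the real page (two tables, the second carrying the class from the question), so treat the structure and values as assumptions rather than the actual ERDDAP output.

```r
library(rvest)

# Made-up stand-in for the ERDDAP page: a layout table first, then the data table.
# The values are illustrative only and are not real SST data.
page <- read_html('
<html><body>
  <table><tr><td>banner</td></tr></table>
  <table class="erd commonBGColor nowrap">
    <tr><th>time</th><th>latitude</th><th>longitude</th><th>analysed_sst</th></tr>
    <tr><td>2019-02-09T12:00:00Z</td><td>-6.975</td><td>179.025</td><td>1.5</td></tr>
    <tr><td>2019-02-09T12:00:00Z</td><td>-6.975</td><td>179.075</td><td>2.5</td></tr>
  </table>
</body></html>')

# Pass the selector as css =, not xpath =, because it is CSS syntax
nodes <- html_nodes(page, css = "table.erd.commonBGColor.nowrap")

# html_table() parses rows and cells into a data frame; it returns one per matched table
sst <- html_table(nodes)[[1]]
```

On the real page the header may span more than one row (names and units), so the first data row might need to be dropped after parsing.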
Let me know if this helps.
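On the CSV attempt from the question: `download.file()` was handed the object returned by `read_html()` rather than a URL string, which would explain the file of node descriptions. ERDDAP griddap also serves the same query as a plain CSV when the file type in the URL is `.csv` instead of `.htmlTable`, with column names on line 1 and units on line 2. Below is a rough sketch under those assumptions; `erddap_query_url()` and `read_erddap_csv()` are hypothetical helper names I made up, and I have not checked whether this server wants the brackets percent-encoded.

```r
# Hypothetical helper: build a griddap URL where every variable shares
# the same [time][latitude][longitude] constraint.
erddap_query_url <- function(dataset_url, file_type, variables, time, lat, lon) {
  rng <- function(lo, hi) sprintf("[(%s):1:(%s)]", lo, hi)
  constraint <- paste0(rng(time, time), rng(lat[1], lat[2]), rng(lon[1], lon[2]))
  paste0(dataset_url, ".", file_type, "?",
         paste0(variables, constraint, collapse = ","))
}

# Hypothetical helper: read an ERDDAP .csv file (line 1 = names, line 2 = units).
read_erddap_csv <- function(path) {
  col_names <- names(read.csv(path, nrows = 1))   # take names from the header line
  read.csv(path, skip = 2, header = FALSE, col.names = col_names)  # skip names + units
}

csv_url <- erddap_query_url(
  "https://oceanwatch.pifsc.noaa.gov/erddap/griddap/goes-poes-1d-ghrsst-RAN",
  file_type = "csv",
  variables = c("analysed_sst", "analysis_error", "mask", "sea_ice_fraction"),
  time = "2019-02-09T12:00:00Z",
  lat = c(-6.975, 42.025),
  lon = c(179.025, 238.025)
)

# Not run here (needs network access and may be slow for this many grid cells):
# dest <- tempfile(fileext = ".csv")
# download.file(csv_url, destfile = dest, mode = "wb")  # a character URL, not a parsed page
# sst <- read_erddap_csv(dest)
```

If the download is rejected, percent-encoding the part of the URL after the `?` would be the first thing I would try.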
https://stackoverflow.com/questions/54622580