
Extracting text from a web page body with RSelenium

Stack Overflow user
Asked 2021-10-24 01:57:37

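I need to extract text from a batch of web pages that are rendered with JavaScript.

The code below usually works for me. The result is just text and line breaks, which is exactly what I want.

On some pages, however, it does not work.

How can I use RSelenium to extract the body text of the page marked "URL Fails"?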

library("tidyverse")
library("rvest")
library("RSelenium")

remDr <- remoteDriver(port = 4445L)
remDr$open()

# URL Works
url <- "https://www.td.com/ca/en/personal-banking/products/credit-cards/travel-rewards/rewards-visa-card/"

# URL Fails
# url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"

remDr$navigate(url)

pg <-  
  remDr$getPageSource()[[1]] %>% 
  read_html(encoding = "UTF-8") %>% 
  html_node(xpath = "//body") %>%
  as.character() %>% 
  htm2txt::htm2txt()

remDr$close()

Solution suggested by @NadPat:

url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()

My result:

Selenium message:a is null
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: 'fe72a1de69e7', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-84-generic', java.version: '1.8.0_91'
Driver info: driver.version: unknown

Error:   Summary: UnknownError
     Detail: An unknown server-side error occurred while processing the command.
     class: org.openqa.selenium.WebDriverException
     Further Details: run errorDetails method

For the failing URL, something is being read, because remDr$getPageSource()[[1]] returns:

[1] "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><script>\n\nsitePrefix = 'BMO';\nvar pageNameMapping = {};\n\n//channelDemo\npageNameMapping[\"atm_en\"]=\"channelDemo\";\npageNameMapping[\"atm_fr\"]=\"channelDemo\";\n\n//Every Day Banking\npageNameMapping[\"Personal\"]=\"PERS\";\npageNameMapping[\"Bank Accounts\"]=\"Bank-Accounts\";\npageNameMapping[\"Daily savings account\"]=\"Premium-Rate-Savings\";\npageNameMapping[\"High Interest Savings Account\"]=\"Smart-Saver\";\npageNameMapping[\"Chequing account\"]=\"Primary-Chequing\";\npageNameMapping[\"Business Premium Rate Savings\"]=\"Business Premium Rate Account\";\n\n//Cards\npageNameMapping[\"Credit Cards\"]=\"CC\";\n\n\n//Mortgages\npageNameMapping[\"Mortgages\"]=\"MTG\";\npageNameMapping[\"Special Offers\"]=\"Special-Offers\";\n\n//Wealth Management\npageNameMapping[\"Wealth Management\"]=\"Wealth\";\npageNameMapping[\"AdviceDirect\"]=\"Advicedirect\";\n\n//Online Investing\npageNameMapping[\"Online Investing\"]=\"ONL-INVS\";\npageNameMapping...

Is there something wrong with the way I set up RSelenium with Docker?

===

Update: I pulled the latest version of the standalone-firefox image from Docker, and @NadPat's solution now works for me.

docker pull selenium/standalone-firefox:latest
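
The error output in the question reports a Selenium build of 2.53.1 dated 2016, which fits the update: the old image was the problem. A minimal sketch for checking the server version from R is shown here. `server_build_version` is a hypothetical helper, not part of RSelenium, and it assumes the status reply contains a `build$version` entry, which newer servers may not provide (it returns `NA` in that case).

```r
# Sketch: report the Selenium server version behind an RSelenium client.
# `server_build_version` is a hypothetical helper, not an RSelenium function.
server_build_version <- function(remDr) {
  status <- remDr$getStatus()            # asks the server for its status
  if (!is.list(status)) return(NA_character_)
  version <- status$build$version        # assumed layout; may be absent
  if (is.null(version)) NA_character_ else as.character(version)
}

# Usage (assumes the container from the question listens on port 4445):
# library(RSelenium)
# remDr <- remoteDriver(port = 4445L)
# remDr$open()
# server_build_version(remDr)
```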

1 Answer

Stack Overflow user

Accepted answer

Posted 2021-10-24 16:08:50

Start the browser:

library(RSelenium)
driver <- rsDriver(
  port = 4841L,
  browser = "firefox"
)

remDr <- driver[["client"]]

url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"

First approach: take the text of the whole document.

remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
[[1]]
[1] "Skip navigation\nPersonal\nPrivate Wealth\nBusiness\nCommercial\nCapital Markets\nSearch\nFind us\nSupport\nEN\nLogin\nBank Accounts\nCredit Cards\nMortgages\nLoans & Lines of Credit\nInvestments\nFinancial Planning\nInsurance\nWays to Bank\nAbout BMO\nPersonal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional groce
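
As the output shows, `getElementText()` returns a list holding one newline-separated string. If a vector of lines is easier to work with, a small base-R sketch like the following could split it. `text_lines` is a hypothetical helper name, and dropping blank lines is an assumption about what is wanted.

```r
# Sketch: turn the list returned by getElementText() into a vector of lines.
# `text_lines` is a hypothetical helper, not part of RSelenium.
text_lines <- function(element_text) {
  raw <- element_text[[1]]                          # the single text string
  lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]   # split on line breaks
  lines <- trimws(lines)                            # strip surrounding spaces
  lines[nzchar(lines)]                              # drop empty lines
}

# Usage with the element from the first approach:
# text_lines(text$getElementText())
```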

Second approach: take only the text of the element with id "main".

text <- remDr$findElement(using = 'xpath', value = '//*[@id="main"]')
text$getElementText()
[[1]]
[1] "Personal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional grocery rewards earn rate on cash back credit cards with no annual fee as of June 1, 2021.\nWelcome offer\nGet up to 5% cash back in your first 3 months‡‡ and a 1.99% introductory interest rate on balance transfers for 9 months with a 1% transfer fee.§§\nAPPL
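
Because these pages are rendered with JavaScript, the target element may not have any text immediately after `navigate()` returns. That was not the cause in the question (updating the Docker image fixed it), but polling is a common precaution. The sketch below retries until the element has text; `wait_for_text` and its `timeout` and `poll` parameters are hypothetical names chosen for illustration.

```r
# Sketch: poll until an element matched by `xpath` has non-empty text.
# `wait_for_text` is a hypothetical helper, not part of RSelenium.
wait_for_text <- function(remDr, xpath, timeout = 10, poll = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    text <- tryCatch({
      # findElements() returns an empty list when nothing matches
      elems <- remDr$findElements(using = "xpath", value = xpath)
      if (length(elems) > 0) elems[[1]]$getElementText()[[1]] else ""
    }, error = function(e) "")            # treat driver errors as "not ready"
    if (length(text) == 1 && nzchar(trimws(text))) return(text)
    if (Sys.time() >= deadline) {
      stop("No text found at ", xpath, " within ", timeout, " seconds")
    }
    Sys.sleep(poll)                       # wait before the next attempt
  }
}

# Usage:
# remDr$navigate(url)
# wait_for_text(remDr, '//*[@id="main"]')
```

Swallowing errors inside the loop keeps transient driver failures from aborting early, at the cost of hiding the underlying message until the timeout is reached.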
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/69693343
