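I am trying to extract a table from a URL and keep its hyperlinks. My current code saves the table to Excel, but the hyperlinks are lost. I know this happens because pd.read_html extracts the cell contents as plain text. How can I extract the table together with its hyperlinks?
Current code: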
from io import StringIO

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://eservices.bcsc.bc.ca/eder/formsearch.aspx")
# After you press "Search" in the browser, the table will be displayed.
input("Press ENTER to continue...")  # Come back to the terminal and press ENTER to continue execution
output = driver.find_element(By.XPATH, "/html/body/form/table/tbody/tr/td[2]/div/div/div[2]").get_attribute('outerHTML')
# Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings.
dfs = pd.read_html(StringIO(output))
# The context manager saves and closes the workbook (ExcelWriter.save() was removed in pandas 2.0).
with pd.ExcelWriter('testreport.xlsx', engine='xlsxwriter') as xlWriter:
    for i, df in enumerate(dfs):
        df.to_excel(xlWriter, sheet_name='Sheet{}'.format(i))
Answered on 2022-06-03 14:47:12
I found the answer. To get all links from the table (not only the "Report" ones):
## store links from the table into urls
urls = [x.get_attribute("href") for x in driver.find_elements(By.XPATH, "//table/tbody/tr/td[2]/div/div/div[2]/table/descendant::a[@href]")]
To save the links to a file together with the rest of the table:
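## extend table with links and store it under a new file
df = pd.read_excel("./testreport.xlsx")  # reads the first sheet only
df['Link'] = urls  # requires exactly one link per table row, otherwise pandas raises a length-mismatch ValueError
df.to_excel("./testreportwithlinks.xlsx", index=False)
Answered on 2022-06-03 04:20:30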
urls = [x.get_attribute("href") for x in driver.find_elements(By.XPATH, "//a[@href and text()='Report']")]
To get all 78 href values of the links whose text is "Report", you can do the above.
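Both snippets above collect the links as one flat list, so they only line up with the DataFrame rows if every row has exactly one link. If that is not guaranteed, one option is to collect the links row by row. Below is a minimal sketch using only the standard library's html.parser. The names RowLinkParser and rows_with_links and the sample table are made up for illustration; they are not from the original post. It assumes you feed it the outerHTML of a single, non-nested table (the page in the question nests tables, so you would pass the innermost one).

```python
from html.parser import HTMLParser


class RowLinkParser(HTMLParser):
    """Collect, for every <tr>, the cell texts and the hrefs found in that row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # one (cells, links) tuple per <tr>
        self._cells = None     # cell texts of the row currently being parsed
        self._links = None     # hrefs of the row currently being parsed
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._links = [], []
        elif tag in ("td", "th") and self._cells is not None:
            self._cells.append("")
            self._in_cell = True
        elif tag == "a" and self._links is not None:
            href = dict(attrs).get("href")
            if href:
                self._links.append(href)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append((self._cells, self._links))
            self._cells = self._links = None

    def handle_data(self, data):
        # Append text to the current cell; text outside cells is ignored.
        if self._in_cell and self._cells:
            self._cells[-1] += data.strip()


def rows_with_links(table_html):
    """Return a list of (cell_texts, hrefs) tuples, one per table row."""
    parser = RowLinkParser()
    parser.feed(table_html)
    return parser.rows


# Hypothetical sample table standing in for the real page's outerHTML.
sample = """
<table>
  <tr><th>Name</th><th>Document</th></tr>
  <tr><td>Alpha</td><td><a href="https://example.com/a.pdf">Report</a></td></tr>
  <tr><td>Beta</td><td>none</td></tr>
</table>
"""

for cells, links in rows_with_links(sample):
    print(cells, links)
```

With Selenium you would pass the table element's get_attribute('outerHTML') instead of sample. Rows without a link simply get an empty list, so the links stay aligned with their rows. If your pandas is 1.5 or newer, it may also be worth checking the extract_links parameter of pd.read_html, which returns each cell as a (text, href) tuple.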
https://stackoverflow.com/questions/72484101