Scraping and downloading PDF files under different names with BeautifulSoup in Python

Stack Overflow user
Asked 2021-05-19 23:03:40
1 answer · 41 views · 0 followers · 0 votes

I want to download the files listed at https://www.archives.gov/research/pentagon-papers

```python
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.archives.gov/research/pentagon-papers"

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

# Downloading the files
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
```

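However, I want the files to be named after their descriptions rather than their original file names. For example, I want the third file in the table to be named `[Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954.pdf` instead of `Pentagon-Papers-Part-II.pdf`.

In the `link` element of the for loop, this is stored as `contents`, but I don't know how to extract it.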

1 Answer

Stack Overflow user

Accepted answer

Posted 2021-05-19 23:10:47

How about using the text inside the `<a>` tag as the name, as you wanted?
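As a small illustration of reading that text, the sketch below parses a hypothetical HTML fragment (written for this example, not copied from the archives.gov page) and contrasts the last segment of `href` with the visible link text returned by `get_text()`:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one table cell; the href path is made up.
html = (
    '<td><a href="/example/Pentagon-Papers-Part-II.pdf">'
    "[Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954 "
    "</a></td>"
)

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("a[href$='.pdf']")

# The attribute holds the URL; its last path segment is the original file name.
href_name = link["href"].split("/")[-1]

# get_text() returns the visible text of the tag; strip=True trims whitespace.
text = link.get_text(strip=True)

print(href_name)
print(text)
```

Here `link["href"]` gives the URL while `link.get_text()` gives the description, so the two can be combined: download from the former, name the file after the latter.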

Here's how:

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://www.archives.gov/research/pentagon-papers"

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Downloading the files
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(
        folder_location,
        (
            link.getText()
            .rstrip()
            .replace(" ", "_")
            .replace(",", "")
            .replace(".", "")
        ),
    )
    with open(f"{filename}.pdf", 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
```

This should produce files named after their descriptions:

```
E:\webscraping/Index
E:\webscraping/[Part_I]_Vietnam_and_the_US_1940-1950
E:\webscraping/[Part_II]_US_Involvement_in_the_Franco-Viet_Minh_War_1950-1954
E:\webscraping/[Part_III]_The_Geneva_Accords
E:\webscraping/[Part_IV_A_1]_Evolution_of_the_War_NATO_and_SEATO:_A_Comparison
E:\webscraping/[Part_IV_A_2]_Evolution_of_the_War_Aid_for_France_in_Indochina_1950-54
E:\webscraping/[Part_IV_A_3]_Evolution_of_the_War_US_and_France's_Withdrawal_from_Vietnam_1954-56
E:\webscraping/[Part_IV_A_4]_Evolution_of_the_War_US_Training_of_Vietnamese_National_Army_1954-59
E:\webscraping/[Part_IV_A_5]_Evolution_of_the_War_Origins_of_the_Insurgency
E:\webscraping/[Part_IV_B_1]_Evolution_of_the_War_Counterinsurgency:_The_Kennedy_Commitments_and_Programs_1961
```

and more ...
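
Some of the names listed above contain `:`, which Windows does not allow in file names, and the underscores differ from the name requested in the question. One possible alternative is sketched below, assuming a hypothetical helper `safe_pdf_name` (not part of the original answer) that keeps spaces and punctuation and removes only the characters Windows rejects. The second sample title is illustrative, not the page's exact link text.

```python
import re

# Characters Windows does not allow in file names, plus control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_pdf_name(link_text: str) -> str:
    """Turn link text into a .pdf file name that is valid on Windows."""
    # Collapse runs of whitespace (including newlines inside the tag) into single spaces.
    name = " ".join(link_text.split())
    # Drop the reserved characters.
    name = _INVALID.sub("", name)
    # Windows also rejects names that end in a dot or a space.
    name = name.rstrip(". ")
    # Fall back to a placeholder if nothing usable is left.
    return f"{name or 'untitled'}.pdf"


print(safe_pdf_name("[Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954 "))
print(safe_pdf_name("NATO and SEATO: A Comparison"))
```

In the download loop, `os.path.join(folder_location, safe_pdf_name(link.get_text()))` could then replace the chained `.replace()` calls, with the resulting path opened directly instead of appending `.pdf`.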
Votes: 1

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/67605853
