Question: Read a CSV and, if a field matches a file name, open the matching HTML file and copy its text into the CSV.

Stack Overflow user

Asked 2019-11-13 11:25:04

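OK, I think I'm just missing the piece that connects everything; I'm very new to Python.

Goal: read the CSV

Read all the file names in a directory

If a row's value at index (x) equals a file name in the directory, then

open the HTML file and replace the text at index (x) with the text from the HTML file

Code so far: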

import fileinput
import csv
import os
import sys
import glob
from bs4 import BeautifulSoup

htmlfiles_path = "c:\\somedirectory\\" #path to directory containing the html files
filename_search = glob.glob("c:\\somedirectory\\*.HTM") #get list of filenames

#open csv

with open ('content.csv', mode='rt') as content_file:
    reader = csv.reader (content_file, delimiter=',')
    for row in reader:
        for field in row:
            if filename_search(some matching logic i am stuck on):
                for htmlcontentfile in glob.glob(os.path.join(path, ".HTM")):
                    markup(htmlcontentfile)
                    soup = BeatifulSoup(open(markup, "r").read())
                        content_file.write(soup.get_text())
                #i think something else goes here

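I have the csv reader working, and glob pulls the list of file names, but I'm having trouble connecting the two. Any help would be great.

I looked through other questions, and some of this code is based on them, but I haven't found anything that addresses this challenge in Python. If there is one, point me in the right direction!

EDIT1: In my actual code I open the csv with "wt". But that's not where it gets stuck.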

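I have a folder full of files. Example:

content/d100.htm

content/d101q.htm

content/d102s.htm

And the CSV (sample):

Title | Name | Location

President | California | d100.html

Goal: open the csv and look for a match, under Location, against any file in the "content" folder.

If a match is found, open the corresponding HTM file and parse out only the text.

Replace the field in the csv with the text content of the file.

Does that make sense?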
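
The steps listed above can be sketched roughly as follows. This is only an illustration under several assumptions, not the asker's or the answerer's code: the function names (`html_to_text`, `index_html_files`, `fill_in_content`) are made up for the example, file names are matched on their stem, case-insensitively, so that `d100.html` in the CSV finds `d100.htm` on disk, and the result is written to a second CSV instead of back into the file being read. To keep the sketch free of third-party dependencies it uses the standard library's `html.parser` as a stand-in for BeautifulSoup's `get_text()`; unlike a tuned BeautifulSoup call, it also keeps any text inside `<script>` or `<style>` tags.

```python
import csv
import os
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document (stdlib stand-in for soup.get_text())."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(path):
    """Return the text of an HTML file with runs of whitespace collapsed."""
    parser = TextOnly()
    with open(path, encoding="utf-8", errors="replace") as f:
        parser.feed(f.read())
    parser.close()
    return " ".join("".join(parser.parts).split())


def index_html_files(folder):
    """Map lowercased file stem -> full path, e.g. 'd100' -> 'content/d100.htm'."""
    index = {}
    for name in os.listdir(folder):
        stem, ext = os.path.splitext(name)
        if ext.lower() in (".htm", ".html"):
            index[stem.lower()] = os.path.join(folder, name)
    return index


def fill_in_content(csv_in, csv_out, folder):
    """Copy csv_in to csv_out, replacing fields that name an HTML file with that file's text."""
    index = index_html_files(folder)
    with open(csv_in, newline="", encoding="utf-8") as src, \
         open(csv_out, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            new_row = []
            for field in row:
                stem, ext = os.path.splitext(field.strip())
                # Only fields that look like an HTML file name are candidates.
                is_html_name = ext.lower() in (".htm", ".html")
                path = index.get(stem.lower()) if is_html_name else None
                new_row.append(html_to_text(path) if path else field)
            writer.writerow(new_row)
```

Calling `fill_in_content("content.csv", "content_filled.csv", "content")` would leave unmatched fields untouched, so rows that reference a missing file are easy to spot afterwards.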

python

beautifulsoup

glob

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-12-02 07:19:57

Answer:

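@barny If I hadn't run the code, I wouldn't be posting here. I apologize for the misunderstanding about what I was looking for.

Anyway, I solved this by slightly changing the problem statement and using Excel to finish the job.

Original requirement:

CSV

Text | Answer | Content of target file

Some text | reference to file 001.htm | some

Some other text | reference to file 002.htm | some

Find the file and parse its content into the column next to it.

Slightly changed problem:

Parse all the htm files into a csv, listing each file's name alongside its content. Then use Excel to match up the content.

Rather than making BSoup or Python do the matching, Excel already has a function, index(match()), that handles the second part of my request. So I had Python and BSoup open each HTML file and put its content into a CSV, with the file name carried along in another column. Like this:

Files:

content/001.htm

content/002.htm

content/003.htm

Expected CSV output format:

Content of the HTML file | File name

Code: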

import csv
import glob
import os
from bs4 import BeautifulSoup

path = "<the path>"  # folder containing the html files


def main():
    # open the output CSV once, in append mode, inside the same folder
    with open(os.path.join(path, 'htmlpages.csv'), 'a', encoding='utf-8', newline='') as content_file:
        writer = csv.writer(content_file, delimiter=',')  # start writer for file content to CSV
        for filepath in glob.glob(os.path.join(path, '*.HTM')):  # find each html file in the folder
            with open(filepath) as f:
                contentstuff = f.read()  # read the html file
            soup = BeautifulSoup(contentstuff, "html.parser")  # parse the html out
            fp = filepath[-12:]  # trim the file name to necessary name (last 12 characters of the path)
            for body_tag in soup.find_all('body'):
                bodye = body_tag.text.replace("\t", "").replace("\n", "")  # deal with necessary formatting between Bsoup and Excel
                print(bodye)  # show me the work
                writer.writerow([bodye, fp])  # do the actual writing


if __name__ == "__main__":
    main()
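
For readers who would rather stay in Python for the lookup step that the answer hands to Excel's index(match()), a dictionary can play the same role. The following is a minimal sketch, not part of the original answer: the function names `load_lookup` and `add_content_column`, the output file, and the `name_col` parameter are hypothetical, and it assumes the file-name column of `htmlpages.csv` holds exactly the same string (ignoring case and surrounding spaces) that appears in the source CSV.

```python
import csv


def load_lookup(htmlpages_csv):
    """Build {file name: content} from rows shaped like [content, file name]."""
    lookup = {}
    with open(htmlpages_csv, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                content, name = row[0], row[1]
                lookup[name.strip().lower()] = content
    return lookup


def add_content_column(source_csv, out_csv, lookup, name_col):
    """Append the matched content to each row; the new cell is empty when nothing matches."""
    with open(source_csv, newline="", encoding="utf-8") as src, \
         open(out_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # name_col is the zero-based index of the column holding the file name.
            key = row[name_col].strip().lower() if len(row) > name_col else ""
            writer.writerow(row + [lookup.get(key, "")])
```

A call such as `add_content_column("source.csv", "merged.csv", load_lookup("htmlpages.csv"), name_col=1)` would then write each source row followed by the text of the file it references. Because `filepath[-12:]` in the code above keeps a fixed number of characters, the stored names only line up if every file name has the expected length; storing `os.path.basename(filepath)` instead is one way to make the keys predictable.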