Loop through 2 files with BeautifulSoup and print the output to a single pandas DataFrame

【Question title】: Loop through 2 files with BeautifulSoup and print output to a single pandas DataFrame
【Posted】: 2019-11-26 13:09:42
【Question description】:

I have 2 HTML files as follows:

table - 3.html

    <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
        <tr>
            <td width='41%' align='left'>Title</td>
            <td width='10%' align='left'>Year</td>
                <table width='99%' border='0' cellpadding='1' class="normal">
        <tr>
            <td width='41%' align='left'><strong>Quatermass 2</strong></td>
            <td width='10%' align='left'>1957</td>


    <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
        <tr>
            <td width='41%' align='left'>Title</td>
            <td width='10%' align='left'>Year</td>
                <table width='99%' border='0' cellpadding='1' class="normal">
        <tr>
            <td width='41%' align='left'><strong>Ghostbusters</strong></td>
            <td width='10%' align='left'>1985</td>

table - 4.html

    <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
        <tr>
            <td width='41%' align='left'>Title</td>
            <td width='10%' align='left'>Year</td>
                <table width='99%' border='0' cellpadding='1' class="normal">
        <tr>
            <td width='41%' align='left'><strong>Life of Brian</strong></td>
            <td width='10%' align='left'>1985</td>

I want to extract the following pandas DataFrame from the files: rows 0 and 1 come from table - 3.html, and row 2 comes from table - 4.html.

           Title  Year
0   Quatermass 2  1957
1   Ghostbusters  1985
2  Life of Brian  1985

My Python code is shown below. The input file (html_files.txt) lists the names of my 2 files.

import re
import pandas as pd
from bs4 import BeautifulSoup

#input results
inputfilename = 'html_files.txt'


#read input postcodes
inputfile = open(inputfilename, 'rb')   #rb = read binary
html_pages = inputfile.readlines()

for page in html_pages:

    soup = BeautifulSoup(page, "lxml")
    titles = soup.find_all("td", {"width": "41%"}, string=re.compile(r'^(?!Title$)'))

    titles_list = [each.text for each in titles ]

    #df = pd.DataFrame(titles_list, columns=['Title'])

    years = soup.find_all("td", {"width": "10%"}, string=re.compile(r'^\d{4}$'))
    year_list = [each.text for each in years ]

    d = {'Title':titles_list, 'Year':year_list}

    df = pd.DataFrame(data=d)
    df.to_csv('output.csv', index=False)

    print(df)

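The part below the loop is where I call the Beautiful Soup functions; on its own it works for extracting the data from a single file into the desired DataFrame. However, when I add the loop and indent all of that code under it, the result is an empty DataFrame, as shown below: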

Empty DataFrame
Columns: [Title, Year]
Index: []

Can anyone help me loop through the two files and collect the data into a single DataFrame?

【Question comments】:

Tags: python pandas dataframe beautifulsoup

【Solution 1】:

from bs4 import BeautifulSoup
import pandas as pd
import re


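filenames = ['one.txt', 'two.txt']  # paths of the HTML files to parse

titles = []
years = []
for file_name in filenames:
    # Open each file and parse its contents, not its name.
    with open(file_name) as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
        # Title cells are 41% wide; the lookahead skips the "Title" header cell.
        for item in soup.find_all('td', attrs={'width': '41%'}, string=re.compile(r'^(?!Title$)')):
            titles.append(item.text)
        # Year cells are 10% wide and hold exactly four digits.
        for item in soup.find_all('td', attrs={'width': '10%'}, string=re.compile(r'^\d{4}$')):
            years.append(item.text)

# Pair each title with its year and build one DataFrame from all files.
df = pd.DataFrame(list(zip(titles, years)), columns=['Title', 'Year'])
print(df)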

Output:

           Title  Year
0   Quatermass 2  1957
1   Ghostbusters  1985
2  Life of Brian  1985
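
The empty DataFrame in the question most likely comes from the loop handing BeautifulSoup each line of html_files.txt, which is a filename rather than that file's HTML, so no matching td cells are found. The following is a minimal sketch, assuming the list file holds one path per line, that opens each named file and combines the per-file frames with pd.concat. The helper names extract_rows and load_all are made up for illustration and are not part of either library.

```python
import re
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return a Title/Year DataFrame for the cells found in one HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    # Same filters as the question: skip the "Title" header, keep 4-digit years.
    titles = soup.find_all("td", {"width": "41%"}, string=re.compile(r"^(?!Title$)"))
    years = soup.find_all("td", {"width": "10%"}, string=re.compile(r"^\d{4}$"))
    return pd.DataFrame({
        "Title": [td.text for td in titles],
        "Year": [td.text for td in years],
    })


def load_all(list_file):
    """Read one filename per line from list_file and combine the rows of every file."""
    lines = Path(list_file).read_text().splitlines()
    # Strip whitespace and drop blank lines so each entry is a usable path.
    names = [line.strip() for line in lines if line.strip()]
    frames = [extract_rows(Path(name).read_text()) for name in names]
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat(frames, ignore_index=True)
```

Calling load_all('html_files.txt') would then return one DataFrame covering every listed file. Note that pd.concat raises an error when the list of frames is empty, so an empty list file needs separate handling.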


【Comments】:

To save it, use df.to_csv('output.csv').
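
As a small illustration of the saving step, the sketch below writes a frame to an in-memory buffer instead of a real file so the result can be inspected directly; the sample rows are copied from the expected output above. By default to_csv also writes the row index as an unnamed first column, and passing index=False (as the question's own code does) leaves it out. In the question's loop the to_csv call sits inside the loop, so each iteration would overwrite output.csv; writing once after the combined frame is built avoids that.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Title": ["Quatermass 2", "Ghostbusters", "Life of Brian"],
    "Year": ["1957", "1985", "1985"],
})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the Title and Year columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
```

Replacing the buffer with a path such as 'output.csv' writes the same text to disk.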