【Question title】: Loop through 2 files with BeautifulSoup and print output to a single pandas DataFrame
【Posted】: 2019-11-26 13:09:42
【Question】:

I have 2 HTML files, as follows.

Table - 3.html

    <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
        <tr>
            <td width='41%' align='left'>Title</td>
            <td width='10%' align='left'>Year</td>
                <table width='99%' border='0' cellpadding='1' class="normal">
        <tr>
            <td width='41%' align='left'><strong>Quatermass 2</strong></td>
            <td width='10%' align='left'>1957</td>


    <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
        <tr>
            <td width='41%' align='left'>Title</td>
            <td width='10%' align='left'>Year</td>
                <table width='99%' border='0' cellpadding='1' class="normal">
        <tr>
            <td width='41%' align='left'><strong>Ghostbusters</strong></td>
            <td width='10%' align='left'>1985</td>

Table - 4.html

    <table width='100%' border='0' cellpadding='0' class='blackbg textheadtitle'>
        <tr>
            <td width='41%' align='left'>Title</td>
            <td width='10%' align='left'>Year</td>
                <table width='99%' border='0' cellpadding='1' class="normal">
        <tr>
            <td width='41%' align='left'><strong>Life of Brian</strong></td>
            <td width='10%' align='left'>1985</td>

I want to extract the following pandas DataFrame from the files: rows 0 and 1 come from Table - 3.html, and row 2 comes from Table - 4.html.

           Title  Year
0   Quatermass 2  1957
1   Ghostbusters  1985
2  Life of Brian  1985

My Python code is shown below. The input file (html_files.txt) lists my 2 files.

    import re
    import pandas as pd
    from bs4 import BeautifulSoup

    #input results
    inputfilename = 'html_files.txt'


    #read input postcodes
    inputfile = open(inputfilename, 'rb')   #rb = read binary
    html_pages = inputfile.readlines()

    for page in html_pages:

        soup = BeautifulSoup(page, "lxml")
        titles = soup.find_all("td", {"width": "41%"}, string=re.compile(r'^(?!Title$)'))

        titles_list = [each.text for each in titles ]

        #df = pd.DataFrame(titles_list, columns=['Title'])

        years = soup.find_all("td", {"width": "10%"}, string=re.compile(r'^\d{4}$'))
        year_list = [each.text for each in years ]

        d = {'Title':titles_list, 'Year':year_list}

        df = pd.DataFrame(data=d)
        df.to_csv('output.csv', index=False)

        print(df)

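The part below the loop is where I call the BeautifulSoup functions; on its own, it extracts the data from a single file into the desired DataFrame. However, once I add the loop and indent all of that code under it, the result is an empty DataFrame, as shown below.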

    Empty DataFrame
    Columns: [Title, Year]
    Index: []

Can anyone help me loop through both files and collect the data into a single DataFrame?

【Comments】:

    Tags: python pandas dataframe beautifulsoup


    【Solution 1】:
    import re

    import pandas as pd
    from bs4 import BeautifulSoup

    filenames = ['one.txt', 'two.txt']

    titles = []
    years = []
    for file_name in filenames:
        # Open each file and parse its contents (not its name)
        with open(file_name) as html_file:
            soup = BeautifulSoup(html_file, 'html.parser')
        # Title cells: width 41%, skipping the "Title" header cell
        for item in soup.find_all('td', attrs={'width': '41%'}, string=re.compile(r'^(?!Title$)')):
            titles.append(item.text)
        # Year cells: width 10%, containing exactly four digits
        for item in soup.find_all('td', attrs={'width': '10%'}, string=re.compile(r'^\d{4}$')):
            years.append(item.text)

    # Build the DataFrame once, after every file has been read
    df = pd.DataFrame(list(zip(titles, years)), columns=['Title', 'Year'])
    print(df)
    

    Output:

               Title  Year
    0   Quatermass 2  1957
    1   Ghostbusters  1985
    2  Life of Brian  1985
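
A likely reason the question's loop came back empty: each `page` is a line of html_files.txt, that is, a file name, so BeautifulSoup parsed the name rather than the file's contents. In addition, `df` was rebuilt on every iteration. The sketch below keeps the question's setup, assuming html_files.txt holds one path per line. The helper names `extract_rows` and `load_all` are hypothetical, and titles are paired with years by position, which assumes every title cell has a matching year cell.

```python
import re
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return (title, year) pairs found in one HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    # Title cells: width 41%, skipping the "Title" header cell
    titles = soup.find_all("td", {"width": "41%"}, string=re.compile(r"^(?!Title$)"))
    # Year cells: width 10%, containing exactly four digits
    years = soup.find_all("td", {"width": "10%"}, string=re.compile(r"^\d{4}$"))
    # Pair by position (assumes one year cell per title cell)
    return [(t.get_text(strip=True), y.get_text(strip=True)) for t, y in zip(titles, years)]


def load_all(list_file):
    """Read the file names listed in list_file and combine all rows."""
    rows = []
    for line in Path(list_file).read_text().splitlines():
        name = line.strip()  # drop the trailing newline and stray spaces
        if not name:
            continue  # ignore blank lines in the list
        rows.extend(extract_rows(Path(name).read_text()))
    # Build the DataFrame once, after all files have been processed
    return pd.DataFrame(rows, columns=["Title", "Year"])
```

Calling `load_all('html_files.txt')` would then return one DataFrame covering every listed file.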
    

    【Comments】:

    • To save it, use df.to_csv('output.csv')
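
One detail worth noting about that call: by default `to_csv` writes the row index as an extra first column, whereas the question's code passed `index=False` to leave it out. A small sketch using the rows from the output above, writing to an in-memory buffer instead of output.csv so the result can be inspected:

```python
import io

import pandas as pd

# Same rows as in the output above
df = pd.DataFrame(
    {
        "Title": ["Quatermass 2", "Ghostbusters", "Life of Brian"],
        "Year": ["1957", "1985", "1985"],
    }
)

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the 0, 1, 2 index column
csv_text = buffer.getvalue()
print(csv_text)
```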