【Question Title】:Beautiful Soup - Extract data that is only inside td tags (with no wrapping tags such as div, id, class, ...)
【Posted on】:2020-04-30 06:23:41
【Question Description】:

I'm new to Beautiful Soup, and I have data like the following, containing three sets of user data (in this case).

I want to extract all of the information for each USER_ID and save it to a database:

  • USER_ID
  • Title
  • Content
  • PID (not every user has this row)
  • Date
  • URL
<table align="center" border="0" style="width:550px">
    <tbody>
        <tr>
            <td colspan="2">USER_ID 11111</td>
        </tr>
        <tr>
            <td colspan="2">string_a</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: aaa</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 22222</td>
        </tr>
        <tr>
            <td colspan="2">string_b</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: bbb</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 33333</td>
        </tr>
        <tr>
            <td colspan="2">string_c</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: ccc</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://ccc.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
    </tbody>
</table>

My problem is that all of the data sits directly inside td tags; there is no div, id, or class to tell the rows apart, so I can't split them into three groups.

I tried the code below. It finds all of the USER_ID cells, but I don't know how to get the other data belonging to each USER_ID:

soup = BeautifulSoup(content, 'html.parser')
p = soup.find_all('td', text=re.compile("^USER_ID"))
for item in p:
   title = item.find_next_siblings('td')  # <--- returns an empty list
   ...
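As an aside, `find_next_siblings('td')` comes back empty here because siblings are searched within the same parent, and each `<td>` in this markup is the only `<td>` inside its `<tr>`. A minimal standalone check (a two-row stand-in for the real data, not the original document) illustrates the difference from `find_all_next()`:

```python
from bs4 import BeautifulSoup

snippet = "<table><tr><td>USER_ID 1</td></tr><tr><td>title</td></tr></table>"
soup = BeautifulSoup(snippet, "html.parser")
first = soup.find("td")

# Siblings live under the same <tr>, which holds only one <td>:
print(first.find_next_siblings("td"))               # []
# find_all_next() walks the rest of the document, crossing row boundaries:
print([t.text for t in first.find_all_next("td")])  # ['title']
```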

I'm using:
Python 3.6
Django 2.0.2

【Question Comments】:

  • Check out the answer below :)

Tags: python beautifulsoup


【Solution 1】:
from bs4 import BeautifulSoup
import re
from more_itertools import split_when

data = """<table align="center" border="0" style="width:550px">
    <tbody>
        <tr>
            <td colspan="2">USER_ID 11111</td>
        </tr>
        <tr>
            <td colspan="2">string_a</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: aaa</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 22222</td>
        </tr>
        <tr>
            <td colspan="2">string_b</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: bbb</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 33333</td>
        </tr>
        <tr>
            <td colspan="2">string_c</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: ccc</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://ccc.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
    </tbody>
</table>"""

soup = BeautifulSoup(data, 'html.parser')

target = soup.find("table", align="center")

goal = [item.text for item in target.select("td", text=re.compile("^USER_ID"))
        if item.text.strip() != '']


final = list(split_when(goal, lambda _, y: y.startswith("USER")))

print(final)  # list of lists

for x in final:  # or loop
    print(x)

Output

[['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']]

And the loop prints

['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']
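Since the end goal is a database row per user, each grouped list can then be mapped onto named fields. The sketch below is my addition, not part of the answer; `to_record` and the field names are assumptions. It relies on the observed row order (USER_ID first, title second) and splits every remaining row on its first colon, so the colons inside the date values survive:

```python
def to_record(group):
    """Map one grouped list of row texts onto named fields.

    The first row carries the USER_ID, the second the title; the
    remaining rows are 'key:value' strings (content, date, PID, URL).
    """
    record = {"user_id": group[0].replace("USER_ID", "").strip(),
              "title": group[1]}
    for row in group[2:]:
        key, _, value = row.partition(":")  # split on the first ':' only
        record[key.strip().lower()] = value.strip()
    return record

group = ['USER_ID 33333', 'string_c', 'content: ccc',
         'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59',
         'PID:ABCDE', 'URL:https://ccc.com']
print(to_record(group))
```

Users without a PID row simply produce a dict without a `pid` key, which a nullable database column can absorb.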

【Discussion】:

  • I got this error: target = [item.text for item in target.select("td", text=re.compile("^USER_ID")) if item.text.strip() != ''] TypeError: select() got an unexpected keyword argument 'text'
  • @user3114168 Which bs4 version are you running? It looks like you are on an older version that does not support CSS SELECTORS; you can change select to findAll
  • I'm using "bs4 version 4.6.0" and changed "select" to "findAll". Now print(final) only returns the USER_IDs, like this: [['USER_ID 11111'], ['USER_ID 22222'], ['USER_ID 33333']] Am I doing something wrong?
  • @user3114168 You need to upgrade to the latest version, 4.9.0: pip install bs4 --upgrade
【Solution 2】:

Try the code below: it uses find_all_next('td') and an if condition with break to split the dataset.

import re
from bs4 import BeautifulSoup

html='''<table align="center" border="0" style="width:550px">
    <tbody>
        <tr>
            <td colspan="2">USER_ID 11111</td>
        </tr>
        <tr>
            <td colspan="2">string_a</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: aaa</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 22222</td>
        </tr>
        <tr>
            <td colspan="2">string_b</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: bbb</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 33333</td>
        </tr>
        <tr>
            <td colspan="2">string_c</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: ccc</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://ccc.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
    </tbody>
</table>'''

soup=BeautifulSoup(html,'html.parser')

final_list=[]
for item in soup.find_all('td',text=re.compile("USER_ID")):
    row_list=[]
    row_list.append(item.text.strip())
    siblings=item.find_all_next('td')
    for sibling in siblings:
        if "USER_ID" in sibling.text:
            break
        else:
            if sibling.text.strip()!='':
               row_list.append(sibling.text.strip())
    final_list.append(row_list)

print(final_list)

Output

[['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']]

If you want to print each list instead, try this:

soup=BeautifulSoup(html,'html.parser')

for item in soup.find_all('td',text=re.compile("USER_ID")):
    row_list=[]
    row_list.append(item.text.strip())
    siblings=item.find_all_next('td')
    for sibling in siblings:
        if "USER_ID" in sibling.text:
            break
        else:
            if sibling.text.strip()!='':
               row_list.append(sibling.text.strip())
    print(row_list)

Output

['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']

【Discussion】:

【Solution 3】:

You can simply use soup.select('table tr')

Example

    from bs4 import BeautifulSoup
    
    html = '<table align="center" border="0" style="width:550px"><tbody>' \
           '<tr><td colspan="2">USER_ID 11111</td></tr>' \
            '<tr><td colspan="2">string_a</td></tr>' \
            '<tr><td colspan="2"><strong>content: aaa</strong></td></tr>' \
            '<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
            '<tr><td colspan="2"><strong>URL:https://aaa.com</strong></td></tr>' \
            '<tr><td colspan="2">&nbsp;</td></tr>' \
            '<tr><td colspan="2">&nbsp;</td></tr>' \
            '<tr><td colspan="2">USER_ID 22222</td></tr>' \
            '<tr><td colspan="2">string_b</td></tr>' \
            '<tr><td colspan="2"><strong>content: bbb</strong></td></tr>' \
            '<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
            '<tr><td colspan="2"><strong>URL:https://aaa.com</strong></td></tr>' \
            '<tr><td colspan="2">&nbsp;</td></tr>' \
            '<tr><td colspan="2">&nbsp;</td></tr>' \
            '<tr><td colspan="2">USER_ID 33333</td></tr>' \
            '<tr><td colspan="2">string_c</td></tr>' \
            '<tr><td colspan="2"><strong>content: ccc</strong></td></tr>' \
            '<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
            '<tr><td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td></tr>' \
            '<tr><td colspan="2"><strong>URL:https://ccc.com</strong></td></tr>' \
            '<tr><td colspan="2">&nbsp;</td></tr>' \
            '<tr><td colspan="2">&nbsp;</td></tr></tbody></table>'
    
    soup = BeautifulSoup(html, features="lxml")
    elements = soup.select('table tr')
    print(elements)
    
    for element in elements:
        print(element.text)
    

This prints

    USER_ID 11111
    string_a
    content: aaa
    date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
    URL:https://aaa.com
     
     
    USER_ID 22222
    string_b
    content: bbb
    date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
    URL:https://aaa.com
     
     
    USER_ID 33333
    string_c
    content: ccc
    date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
    PID:ABCDE
    URL:https://ccc.com
    

【Discussion】:

• But this doesn't separate the three sets of data. I want to store each user's data in the database with distinct fields (USER_ID, title, content, URL, ...)
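For what it's worth, the flat rows from this answer can still be split into per-user groups by treating the blank (&nbsp;) separator rows as delimiters. The sketch below is my addition, not part of the answer; the literal strings stand in for the element.text values (`'\xa0'` is what &nbsp; decodes to):

```python
from itertools import groupby

# Stand-ins for [element.text for element in elements]:
rows = ["USER_ID 11111", "string_a", "content: aaa", "URL:https://aaa.com",
        "\xa0", "\xa0",
        "USER_ID 22222", "string_b", "content: bbb", "URL:https://aaa.com",
        "\xa0", "\xa0"]

def is_blank(row):
    # True for rows that are empty or contain only spaces / &nbsp;
    return row.strip("\xa0 ") == ""

# groupby yields runs of blank and non-blank rows; keep the non-blank runs.
groups = [list(g) for blank, g in groupby(rows, key=is_blank) if not blank]
print(groups)
```

Each resulting sub-list then starts with a USER_ID row, like the grouped output of the other answers.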