BeautifulSoup text outside of tags: answers

【Question title】: BeautifulSoup text outside of tags
【Posted】: 2016-10-17 09:21:18
【Question】:

I'm trying to extract all of Kramer's lines from every episode of Seinfeld on this website:

http://www.imsdb.com/TV/Seinfeld.html

I have already extracted the list of episode names into a file I named episode-list.txt

I'm now trying to parse only the lines that follow KRAMER, but they appear to sit outside of any tag, and that is where I'm stumped. See here --> http://www.imsdb.com/transcripts/Seinfeld-Good-News,-Bad-News.html

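Below is the code I'm trying to run with BeautifulSoup. Any leads would be much appreciated. Also, unsolicited advice is hereby solicited, haha. If you see anything I'm doing that strikes you as clumsy or crude, I'd be glad to get the feedback.

Cheers!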

from BeautifulSoup import BeautifulSoup
import requests

text = open ("episode-list.txt","r")


for line in text.readlines():
    url = "http://www.imsdb.com/transcripts/Seinfeld-" + line.strip('\n').replace(" ", "-") + ".html"
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    for tag in soup:
            print soup.findAll('???')

【Comments】:

Tags: python parsing beautifulsoup screen-scraping

【Answer 1】:

Here is a code snippet to get you started...

import re
from bs4 import BeautifulSoup

html = """
<b>                             KRAMER
</b>               (enters) Are you up?

<b>               
</b><b>                             JERRY
</b>               (To Kramer) Yeah...(in the phone) Yeah, 
               people do move! Have you ever seen the 
               big trucks out on the street? Yeah, 
               no problem (hangs up the phone).
<b> 
</b><b>               
</b><b>                             KRAMER
</b>               Boy, the Mets blew it tonight, huh?
"""

soup = BeautifulSoup(html, 'html.parser')
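# Find every <b> tag whose text is KRAMER (padded with whitespace), then read
# the bare text node that follows it. `string=` is the current name of the
# older `text=` argument (bs4 4.4.0 and later).
for kramer in soup.find_all('b', string=re.compile(r"\s+KRAMER\s+")):
    print(kramer.next_sibling.strip())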

The output will be...

(enters) Are you up?
Boy, the Mets blew it tonight, huh?
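
If it helps, here is a rough sketch of how the same idea might be wired into the loop from the question. It assumes every transcript page uses the layout shown in the sample above (speaker name in a `<b>` tag, dialogue as a bare text node right after it), which I have not checked for all episodes. The names `kramer_lines`, `episode_url` and `all_kramer_lines` are made up for illustration. It also collapses multi-line speeches into one line and skips a speaker tag that is not followed by text.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Matches a <b> tag whose whole text is the speaker name, ignoring padding.
SPEAKER = re.compile(r"^\s*KRAMER\s*$")


def kramer_lines(html):
    """Return Kramer's speeches from one transcript page, one string each."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tag in soup.find_all("b", string=SPEAKER):
        sibling = tag.next_sibling
        # The dialogue is a bare text node right after </b>; skip anything else
        # (for example another <b> tag, or nothing at all).
        if not isinstance(sibling, NavigableString):
            continue
        # Collapse the transcript's indentation and line breaks to single spaces.
        speech = " ".join(sibling.split())
        if speech:
            lines.append(speech)
    return lines


def episode_url(title):
    """Build a transcript URL following the pattern used in the question."""
    slug = title.strip().replace(" ", "-")
    return "http://www.imsdb.com/transcripts/Seinfeld-" + slug + ".html"


def all_kramer_lines(list_path="episode-list.txt"):
    """Fetch each episode named in list_path and yield (title, lines) pairs."""
    import requests  # imported here so kramer_lines() works without it installed

    with open(list_path) as episodes:
        for title in episodes:
            if not title.strip():
                continue  # ignore blank lines in the episode list
            response = requests.get(episode_url(title))
            yield title.strip(), kramer_lines(response.text)
```

You would then loop with something like `for title, lines in all_kramer_lines(): ...`. Opening the file in a `with` block also closes it for you, which the original loop never does.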

【Comments】: