【Question title】: BeautifulSoup text outside of tags
【Posted】: 2016-10-17 09:21:18
【Question description】:

I'm trying to extract all of Kramer's lines from every episode of Seinfeld listed on this site:

http://www.imsdb.com/TV/Seinfeld.html

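I've already extracted the list of episode titles into a file I named episode-list.txt.

I'm now trying to parse only the lines that follow KRAMER, but they appear to sit outside of any tag, and that's where I'm stumped. See here --> http://www.imsdb.com/transcripts/Seinfeld-Good-News,-Bad-News.html

Below is the code I'm trying to run with BeautifulSoup. Any leads would be much appreciated. Also, unsolicited advice is hereby solicited, haha. If anything I've done strikes you as clumsy or crude, I'd be glad to get feedback.

Cheers!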

from bs4 import BeautifulSoup
import requests

# episode-list.txt holds one episode title per line
with open("episode-list.txt", "r") as text:
    for line in text:
        url = "http://www.imsdb.com/transcripts/Seinfeld-" + line.strip('\n').replace(" ", "-") + ".html"
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        # '???' marks the part I can't work out: which tag to search for
        for tag in soup:
            print(soup.find_all('???'))

【Comments】:

    Tags: python parsing beautifulsoup screen-scraping


    【Solution 1】:

    Here is a code snippet you can use as a starting point...

    import re
    from bs4 import BeautifulSoup
    
    html = """
    <b>                             KRAMER
    </b>               (enters) Are you up?
    
    <b>               
    </b><b>                             JERRY
    </b>               (To Kramer) Yeah...(in the phone) Yeah, 
                   people do move! Have you ever seen the 
                   big trucks out on the street? Yeah, 
                   no problem (hangs up the phone).
    <b> 
    </b><b>               
    </b><b>                             KRAMER
    </b>               Boy, the Mets blew it tonight, huh?
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    # Find each <b> tag naming KRAMER; his line is the bare text node right after it.
    for kramer in soup.find_all('b', string=re.compile(r"\s+KRAMER\s+")):
        print(kramer.next_sibling.strip())
    

    The output will be...

    (enters) Are you up?
    Boy, the Mets blew it tonight, huh?
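
To connect this back to the loop in the question, here is a rough sketch of how the same next-sibling idea might be wrapped up and applied per episode. The helper names (`episode_url`, `speaker_lines`, `kramer_lines_for_all_episodes`) are made up for illustration, and it assumes every transcript page uses the same `<b>NAME</b>` followed by bare text layout as the sample above, which I have not checked across the whole site. It also joins speeches that wrap over several lines into one string.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# URL pattern copied from the question; everything else here is illustrative.
BASE_URL = "http://www.imsdb.com/transcripts/Seinfeld-"


def episode_url(title):
    """Build a transcript URL from one line of episode-list.txt."""
    return BASE_URL + title.strip().replace(" ", "-") + ".html"


def speaker_lines(html, speaker):
    """Return the text following each <b>SPEAKER</b> tag, one string per speech."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the <b> tag holds only the speaker's name plus padding.
    name = re.compile(r"^\s*" + re.escape(speaker) + r"\s*$")
    lines = []
    for tag in soup.find_all("b", string=name):
        sibling = tag.next_sibling
        # The dialogue is a bare text node right after the tag; skip anything else.
        if isinstance(sibling, NavigableString):
            text = " ".join(sibling.split())  # collapse wrapped lines and padding
            if text:
                lines.append(text)
    return lines


def kramer_lines_for_all_episodes(list_path="episode-list.txt"):
    """Yield (title, lines) for each episode in the list. Needs network access."""
    import requests  # imported here so the parsing helpers work without it

    with open(list_path) as episode_file:
        for title in episode_file:
            if not title.strip():
                continue  # ignore blank lines in the list
            response = requests.get(episode_url(title))
            response.raise_for_status()
            yield title.strip(), speaker_lines(response.text, "KRAMER")
```

Something like `for title, lines in kramer_lines_for_all_episodes(): print(title, len(lines))` would then walk the whole list. Keeping the parsing in `speaker_lines` separate from the downloading means it can be tried on a pasted HTML string first, before hitting the site.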
    

    【Discussion】:
