[Title]: Beautiful Soup, fetching table data from Wikipedia
[Posted]: 2021-02-24 05:34:13
[Question]:

I am working through the book "Practical Web Scraping for Data Science: Best Practices and Examples with Python" by Seppe vanden Broucke and Bart Baesens.

The following code is supposed to fetch the list of Game of Thrones episodes from Wikipedia:

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all(['th','td']):
                values.append(col.text)
                if values:
                    episode_dict = {headers[i]: values[i] for i in
                                    range(len(values))}
                    episodes.append(episode_dict)
                    for episode in episodes:
                        print(episode)

But running the code produces the following error:

{'No.overall': '1'}

IndexError Traceback (most recent call last)

<ipython-input-8-d2e64c7e0540> in <module>
     20                 if values:
     21                     episode_dict = {headers[i]: values[i] for i in
---> 22                                     range(len(values))}
     23                     episodes.append(episode_dict)
     24                     for episode in episodes:

<ipython-input-8-d2e64c7e0540> in <dictcomp>(.0)
     19                 values.append(col.text)
     20                 if values:
---> 21                     episode_dict = {headers[i]: values[i] for i in
     22                                     range(len(values))}
     23                     episodes.append(episode_dict)

IndexError: list index out of range

Can anyone tell me why this happens?

[Discussion]:

    Tags: python web-scraping beautifulsoup web-crawler


    [Solution 1]:

    The problem is not the code itself but its indentation. The third for loop should sit at the same level as the second one, not nested inside it. That is how the book shows it:

    import requests
    from bs4 import BeautifulSoup
    url = 'https://en.wikipedia.org/w/index.php' + \
    '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
    r = requests.get(url)
    html_contents = r.text
    html_soup = BeautifulSoup(html_contents, 'html.parser')
    # We'll use a list to store our episode list
    episodes = []
    ep_tables = html_soup.find_all('table', class_='wikitable plainrowheaders wikiepisodetable')
    for table in ep_tables:
        headers = []
        rows = table.find_all('tr')
        # Start by fetching the header cells from the first row to determine
        # the field names
        for header in table.find('tr').find_all('th'):
            headers.append(header.text)
        # Then go through all the rows except the first one
        for row in table.find_all('tr')[1:]:
            values = []
            # And get the column cells, the first one being inside a th-tag
            for col in row.find_all(['th','td']):
                values.append(col.text)
            if values:
                episode_dict = {headers[i]: values[i] for i in
                                range(len(values))}
                episodes.append(episode_dict)
    # Show the results
    for episode in episodes:
        print(episode)
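    Why the original nesting fails: the dict comprehension runs inside the header loop, so on its first pass `headers` contains only `'No.overall'` while `values` can already hold several cells; indexing `headers[1]` then raises `IndexError`. That also explains why exactly one dict, `{'No.overall': '1'}`, is printed before the crash. A minimal sketch of the failure (the cell values here are illustrative):

```python
# headers as collected after only ONE pass of the header loop
headers = ['No.overall']
# a full row of cells, as gathered by the inner column loop
values = ['1', '1', '"Winter Is Coming"']

# the first index works: i == 0 is a valid index into headers
print({headers[0]: values[0]})  # {'No.overall': '1'}

# the comprehension then asks for headers[1], which does not exist
try:
    episode_dict = {headers[i]: values[i] for i in range(len(values))}
except IndexError as exc:
    print('IndexError:', exc)  # IndexError: list index out of range
```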
    

    [Comments]:

    • Aha. The original code just wasn't copied over correctly. Okay. :)
    • Awesome Ananth, you're right, I'll be more careful with indentation. I'm new to Python, and the book even says indentation is one of the most common beginner mistakes. This error taught me to double-check indentation before looking for other answers; I think I learned more this way. Thanks for the help, Ananth and @karlcow, both answers helped me better understand what was going on.
    [Solution 2]:

    Your traceback is:

    {'No.overall': '1'}
    Traceback (most recent call last):
      File "/Users/karl/code/deleteme/foo.py", line 20, in <module>
        episode_dict = {headers[i]: values[i] for i in
      File "/Users/karl/code/deleteme/foo.py", line 20, in <dictcomp>
        episode_dict = {headers[i]: values[i] for i in
    IndexError: list index out of range
    

    The code is probably over-indented, and the variable names make it a bit hard to read. It would help to know exactly what you are trying to extract. The list of episodes? The table structure may also have changed since the book was published.

    If so, each relevant episode title has this shape:

    <td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td>
    
    import requests
    from bs4 import BeautifulSoup
    url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
    r = requests.get(url)
    html_contents = r.text
    html_soup = BeautifulSoup(html_contents, 'html.parser')
    # We'll use a list to store our episode list
    episodes = []
    ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
    for table in ep_tables:
        headers = []
        rows = table.find_all('tr')
        for header in table.find('tr').find_all('th'):
            headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all('td', class_='summary'):
                print(col.text)
    

    [Comments]:

    • Thanks a lot, karlcow, I'll dig more into the class names for further practice. About what I'm trying to extract: the book states, and I quote: "Now let's try to work out the following use case. You'll notice that our Game of Thrones Wikipedia page has a number of well-maintained tables listing the episodes along with their directors, writers, air dates, and viewer counts. Let's try to fetch all of this data." I see your reply effectively fetches the episode titles; I'll try to fetch all the data the book asks for.
    • What kind of data structure are you trying to build?
    • A list, but as you mentioned, it was just code that wasn't copied over properly. When I study I usually type the code rather than copy it, to understand it better and get familiar with the syntax. This time I won't forget the consequences of bad indentation.
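    For the "fetch all of this data" goal mentioned in the comment above: once the loops are flattened as in Solution 1, a slightly more defensive way to pair headers with row cells is `dict(zip(...))`, which stops at the shorter of the two lists instead of indexing out of range. A sketch with illustrative header names and values:

```python
# column names and one row of cell texts, as the flattened loops would collect them
headers = ['No.overall', 'No. inseason', 'Title']
values = ['1', '1', '"Winter Is Coming"']

# zip pairs elements one-by-one and stops at the shorter list,
# so a ragged row can never raise IndexError
episode_dict = dict(zip(headers, values))
print(episode_dict)
# {'No.overall': '1', 'No. inseason': '1', 'Title': '"Winter Is Coming"'}
```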