Getting only the last row when using Beautiful Soup

【Question title】: Obtaining just the last row when using Beautiful Soup
【Posted】: 2021-01-24 05:20:59
【Question description】:

I have the following code:

from bs4 import BeautifulSoup

import requests

import pandas as pd

def Get_Top_List_BR(url):
    
    
    response = requests.get(url)

    page = response.text

    soup = BeautifulSoup(page)

    table = soup.find(id='table')
   
    rows = [row for row in table.find_all('tr')]
    

    movies = {}

    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        #split url string into unique movie serial number
        url = url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]
        #set serial number as key to avoid duplication in any other category-especially title
        movies[url] = [url_string] +[i.text for i in items]
   
    movie_page = pd.DataFrame(movies).T  #transpose
    movie_page.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time','Gross',
                    'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']

    return movie_page

df_test_BR = Get_Top_List_BR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')

df_test_BR.head(10)

Problem: I only get the last row. Question: how can I fix it so that all rows are returned?

【Question comments】:

Tags: python pandas web-scraping beautifulsoup

【Solution 1】:

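First, I'm not sure which Python version you are using, but the way you construct the BeautifulSoup object is not right, at least on my version. The BeautifulSoup documentation strongly recommends passing a parser explicitly. Your code: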

response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)
table = soup.find(id='table')

should be:

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find(id='table')

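Your actual problem is how url is defined inside the for loop. I managed to iterate over all the elements, so the loop itself is fine; the way you derive url inside the loop returns a blank (empty) string.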

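That is why you see only the last item. On every iteration the loop computes url, but url is always blank, so the key already exists in movies and the existing data is overwritten. By the time the loop reaches the last item, only that row is left.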
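
The overwrite can be reproduced without any network access. The following is a minimal sketch, assuming the links in the table look like /release/rl0000000001/?ref_=bo_yld_table_1. The hrefs and the helper name original_key are made up for illustration and are not copied from the page:

```python
# Hypothetical hrefs, shaped like the release links this answer assumes.
hrefs = [
    "/release/rl0000000001/?ref_=bo_yld_table_1",
    "/release/rl0000000002/?ref_=bo_yld_table_2",
    "/release/rl0000000003/?ref_=bo_yld_table_3",
]


def original_key(url_string):
    # The key expression from the question, unchanged.
    return url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]


movies = {}
for position, href in enumerate(hrefs, start=1):
    # Same pattern as the question: the derived key indexes the dict.
    movies[original_key(href)] = [href, position]

keys = [original_key(href) for href in hrefs]
print(keys)         # the same empty-string key for every href
print(len(movies))  # a single entry remains, holding the last row
```

With hrefs of this shape the path contains no 't', so split('t', 4) leaves it intact, and the final split('/', 1)[0] takes whatever precedes the leading slash, which is nothing.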

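I'm not sure how you want to define url, but the code below does what you want: it collects all the movies with their names and href values, and returns the first 10. The only difference should be in how url and movies[url] are defined; take care not to overwrite url again.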

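Also, since you redefine url inside the for loop to hold a unique ID, the name should reflect that: call it unique_id (or uid, as in this example). I also included a print statement to show that the code goes through the whole loop before the first 10 rows are printed.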

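from bs4 import BeautifulSoup
import pandas as pd
import requests

def Get_Top_List_GR(url):
    response = requests.get(url)
    # name the parser explicitly instead of letting BeautifulSoup guess
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find(id='table')
    rows = table.find_all('tr')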

    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        uid = url_string.split("/")[-2]
        print("{0} - {1} - {2}".format(url, title, uid))
        # set serial number as key to avoid duplication in any other category, especially title
        movies[uid] = [url_string] + [i.text for i in items]
    movie_page = pd.DataFrame(movies).T  # transpose
    return movie_page

df_test_ = Get_Top_List_GR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
print(df_test_.head(10))
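
The split("/")[-2] step above depends on the href ending with a trailing slash followed by a query string. A slightly more defensive variant could be sketched with the standard library's urllib.parse. The helper name extract_release_id and the sample href are hypothetical, and the sketch assumes the ID is always the last non-empty path segment:

```python
from urllib.parse import urlsplit


def extract_release_id(url_string):
    """Return the last non-empty path segment of an href, ignoring the query string."""
    path = urlsplit(url_string).path  # drops "?ref_=..." and any fragment
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        # Fail loudly instead of silently producing an empty dict key.
        raise ValueError(f"no path segment found in {url_string!r}")
    return segments[-1]


# Hypothetical href in the shape assumed by this answer.
sample = "/release/rl0000000001/?ref_=bo_yld_table_1"
print(extract_release_id(sample))
```

Inside the loop this would replace the uid line with uid = extract_release_id(url_string). Raising on an empty result means a change in the link format shows up as an error rather than as rows overwriting each other.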

【Comments】: