【Question Title】: How can I loop through all <th> tags within my script for web scraping?
【Posted】: 2019-12-29 17:56:19
【Question】:

As of now, I only get ['1'] as the output of what my current code below prints. I want to grab 1-54 from the Rk column of the team batting table on the site https://www.baseball-reference.com/teams/NYY/2019.shtml.

How would I modify colNum so that it prints 1-54 from the Rk column? I'm pointing at the colNum line because I feel the problem is there, but I could be wrong.

import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')

items = week.find("thead").get_text() # grabs table headers
th = week.find("th").get_text() # grabs Rk only.

tbody = week.find("tbody")
tr = tbody.find("tr")

thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]
print(colNum)
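(For reference, the value thtwo holds here is the string "1" taken from the first row's <th>, so the comprehension only iterates over that one character. A minimal sketch of the effect:)

```python
# thtwo holds the text of the first row's <th>, i.e. the string "1".
thtwo = "1"

# Iterating over a string yields its characters, so this comprehension
# produces a one-element list -- which is why the output is ['1'].
colNum = [c for c in thtwo]
print(colNum)  # ['1']
```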

【Comments】:

    Tags: python python-3.x debugging web-scraping beautifulsoup


    【Solution 1】:

    Your error is in the last few lines you mentioned. If I understand correctly, you want a list of all the values in the "Rk" column. To get all the rows, you have to use the find_all() function. I tweaked your code slightly so that it grabs the text of the first cell in each row:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
    soup = BeautifulSoup(page.content, 'html.parser')  # parse the page source as HTML
    week = soup.find(class_='table_outer_container')
    
    items = week.find("thead").get_text()
    th = week.find("th").get_text()
    
    tbody = week.find("tbody")
    tr = tbody.find_all("tr")
    colnum = [row.find("th").get_text() for row in tr]
    
    print(colnum)
    
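    The difference between find() and find_all() can be seen on a minimal standalone table (hypothetical sample data below, not the real Baseball-Reference page):

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the structure of the batting table:
# each row's first cell is a <th> holding the Rk value.
html = """
<table>
  <thead><tr><th>Rk</th><th>Name</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Judge</td></tr>
    <tr><th>2</th><td>Torres</td></tr>
    <tr><th>3</th><td>Sanchez</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
tbody = soup.find("tbody")

# find("tr") returns only the first row; find_all("tr") returns every row.
rows = tbody.find_all("tr")
rk_values = [row.find("th").get_text() for row in rows]
print(rk_values)  # ['1', '2', '3']
```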

    【Discussion】:
