【Question Title】: How can I loop through all <th> tags within my script for web scraping?
【Posted】: 2019-12-29 17:56:19
【Question】:

As of now, I only get ['1'] as the output of what my current code below prints. I want to grab 1-54 from the Rk column of the team batting table on the site https://www.baseball-reference.com/teams/NYY/2019.shtml.

How would I modify colNum so that it prints 1-54 from the Rk column? I'm pointing at the colNum line because I feel the problem is there, but I could be wrong.

import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')

items = week.find("thead").get_text() # grabs table headers
th = week.find("th").get_text() # grabs Rk only.

tbody = week.find("tbody")
tr = tbody.find("tr")

thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]
print(colNum)
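(For reference, the value thtwo holds here is the string "1" taken from the first row's <th>, so the comprehension only iterates over that one character. A minimal sketch of the effect:)

```python
# thtwo holds the text of the first row's <th>, i.e. the string "1".
thtwo = "1"

# Iterating over a string yields its characters, so this comprehension
# produces a one-element list -- which is why the output is ['1'].
colNum = [c for c in thtwo]
print(colNum)  # ['1']
```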

【Comments】:

    Tags: python python-3.x debugging web-scraping beautifulsoup


    【Solution 1】:

    Your error is in the last few lines you mentioned. If I understand correctly, you want a list of all the values in the "Rk" column. To get all the rows, you have to use the find_all() function. I tweaked your code slightly so that it grabs the text of the first cell in each row:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
    soup = BeautifulSoup(page.content, 'html.parser')  # parse the page source as HTML
    week = soup.find(class_='table_outer_container')
    
    items = week.find("thead").get_text()
    th = week.find("th").get_text()
    
    tbody = week.find("tbody")
    tr = tbody.find_all("tr")
    colnum = [row.find("th").get_text() for row in tr]
    
    print(colnum)
    
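    The difference between find() and find_all() can be seen on a minimal standalone table (hypothetical sample data below, not the real Baseball-Reference page):

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the structure of the batting table:
# each row's first cell is a <th> holding the Rk value.
html = """
<table>
  <thead><tr><th>Rk</th><th>Name</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Judge</td></tr>
    <tr><th>2</th><td>Torres</td></tr>
    <tr><th>3</th><td>Sanchez</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
tbody = soup.find("tbody")

# find("tr") returns only the first row; find_all("tr") returns every row.
rows = tbody.find_all("tr")
rk_values = [row.find("th").get_text() for row in rows]
print(rk_values)  # ['1', '2', '3']
```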

    【Discussion】:
