BeautifulSoup soup.select() returns an empty list for a CSS selector

[Question title]: BeautifulSoup soup.select() returning empty for CSS selector
[Posted]: 2020-02-16 14:01:49
[Question]:

I am trying to parse some links from this site: https://news.ycombinator.com/

I want to select one specific table:

document.querySelector("#hnmain > tbody > tr:nth-child(3) > td > table")

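I know bs4 has limitations with CSS selectors. The problem is that I can't even select something as simple as #hnmain > tbody: soup.select('#hnmain > tbody') returns an empty list.

With the code below I cannot get the tbody, whereas I can with JS (screenshot).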

from bs4 import BeautifulSoup
import requests
print("-"*100)
print("Hackernews parser")
print("-"*100)
url="https://news.ycombinator.com/"
res=requests.get(url)
html=res.content
soup=BeautifulSoup(html)
table=soup.select('#hnmain > tbody')
print(table)

Output:

soup=BeautifulSoup(html)
[]

[Question discussion]:

Tags: python web-scraping beautifulsoup python-3.7

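[Solution 1]:

The HTML the server sends has no tbody tag: I don't see one from BeautifulSoup or from a curl script (the browser inserts tbody itself when it builds the DOM). This means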

soup.select('tbody')

returns an empty list, which is the same reason you are getting an empty list.
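To illustrate, here is a minimal sketch on hypothetical inline markup shaped like the raw response (the table contents and the "itemlist" class are assumptions for the example, not the real page). It shows that the selector fails only because of the tbody step, and that dropping that step reaches the table from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: like the raw response, it contains no <tbody>.
html = """
<table id="hnmain">
  <tr><td>header</td></tr>
  <tr><td>spacer</td></tr>
  <tr><td><table class="itemlist"><tr class="athing"><td>story</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# html.parser keeps the markup as written, so there is no tbody to match.
with_tbody = soup.select("#hnmain > tbody")

# Without the tbody step, the rows are direct children of the table.
rows = soup.select("#hnmain > tr")

# The table from the question, addressed without tbody.
inner = soup.select_one("#hnmain > tr:nth-child(3) > td > table")

print(with_tbody)
print(len(rows))
print(inner["class"])
```

A browser-copied selector describes the browser's DOM, so it may need the tbody step removed before it matches the HTML that requests downloads.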

Simply extracting the links you are looking for is enough:

soup.select("a.storylink")

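It fetches the links you want from the site.

[Discussion]:

Thanks!! Your approach is useful for getting the main story links, but I need more attributes from each post, such as upvote_count, comment_count, url_to_right_of_main_url and posted_ago, where these items are span/a elements without a specific class. Please help!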

[Solution 2]:

Why not go straight to the links instead of going through the body and the table? I tested this and it works fine:

# find_all() accepts an attribute dict; select() takes only a CSS selector string
links = soup.find_all('a', {'class': 'storylink'})
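For reference, select() takes a single CSS selector string, while an attribute dictionary is the calling style of find_all(). A minimal sketch on hypothetical inline markup (the URLs are placeholders) shows the two spellings that filter links by class:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two story links and one unrelated link.
html = """
<a class="storylink" href="https://example.com/a">Story A</a>
<a href="https://example.com/login">login</a>
<a class="storylink" href="https://example.com/b">Story B</a>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector form: tag name plus class.
by_selector = soup.select("a.storylink")

# find_all form: tag name plus a dict of attribute filters.
by_attrs = soup.find_all("a", {"class": "storylink"})

hrefs = [a["href"] for a in by_selector]
print(hrefs)
```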

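If you want the table, since there is only one per page, you don't need to walk through the other elements either; you can go straight to it.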

table = soup.select('table')
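Note that select('table') returns a list of every table in the document, nested ones included. If a page has more than one, a class or id can narrow the match. The sketch below uses hypothetical inline markup; the "nav" id and "itemlist" class are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an outer table wrapping two inner tables.
html = """
<table id="hnmain">
  <tr><td><table id="nav"><tr><td>menu</td></tr></table></td></tr>
  <tr><td><table class="itemlist"><tr><td>stories</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every match, in document order.
all_tables = soup.select("table")

# select_one() with a class returns just the table of interest (or None).
stories = soup.select_one("table.itemlist")

print(len(all_tables))
print(stories.get_text(strip=True))
```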

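[Discussion]:

The page has 3 tables in total.
Oh yes, my bad. In that case you can parse by class, ID or something else, but if I were you I wouldn't walk the element hierarchy.
Your approach is useful for getting the main story links, but I need more attributes from each post, such as upvote_count, comment_count and site_url, where these items are span/a elements without a specific class. Please help.
There is an element with the class "subtext" that holds a lot of the post information. The span elements also have IDs, depending on the information they store. Look closely at the HTML and find the pattern for the information you want to extract.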

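[Solution 3]:

The data is laid out in groups of 3 rows, where the third row is an empty spacer row. Loop over the top rows and use next_sibling to get the associated second row each time. Requires bs4 4.7.1+.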

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://news.ycombinator.com/')
soup = bs(r.content, 'lxml')
# The first row of each 3-row group has the class "athing"
top_rows = soup.select('.athing')

for row in top_rows:
    title = row.select_one('.storylink')
    print(title.text)      # story title
    print(title['href'])   # story URL
    # Link next to the title, made absolute
    print('https://news.ycombinator.com/' + row.select_one('.sitebit a')['href'])
    # The second row of the group holds score, user and age
    next_row = row.next_sibling
    print(next_row.select_one('.score').text)
    print(next_row.select_one('.hnuser').text)
    print(next_row.select_one('.age a').text)
    # The link that is the 6th child of its parent in the second row
    print(next_row.select_one('a:nth-child(6)').text)
    print(100*'-')
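The loop above assumes every post has every field; select_one() returns None when an element is missing, and reading .text from None raises an error. Below is a hedged sketch of a more defensive variant, run on hypothetical inline markup that reuses the class names from the code above. The helper text_or_none is made up for this example, and find_next_sibling('tr') is swapped in for next_sibling so that whitespace between rows is skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two 3-row groups; the second post has no score or user.
html = """
<table class="itemlist">
  <tr class="athing">
    <td><a class="storylink" href="https://example.com/a">Story A</a></td>
  </tr>
  <tr>
    <td class="subtext">
      <span class="score">42 points</span>
      <a class="hnuser">alice</a>
      <span class="age"><a>2 hours ago</a></span>
    </td>
  </tr>
  <tr class="spacer"></tr>
  <tr class="athing">
    <td><a class="storylink" href="https://example.com/job">Job posting</a></td>
  </tr>
  <tr>
    <td class="subtext"><span class="age"><a>1 hour ago</a></span></td>
  </tr>
  <tr class="spacer"></tr>
</table>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector) if parent is not None else None
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(html, "html.parser")
posts = []
for row in soup.select(".athing"):
    title = row.select_one(".storylink")
    # Next <tr> sibling, skipping any text nodes between the rows.
    details = row.find_next_sibling("tr")
    posts.append({
        "title": title.get_text(strip=True),
        "url": title["href"],
        "score": text_or_none(details, ".score"),
        "user": text_or_none(details, ".hnuser"),
        "age": text_or_none(details, ".age a"),
    })

for post in posts:
    print(post)
```

Collecting each post into a dict keeps missing fields visible as None instead of stopping the whole run at the first incomplete post.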

[Discussion]: