Parsing data using BeautifulSoup

【Question title】: Parsing data using BeautifulSoup
【Posted】: 2014-08-14 03:21:09
【Question description】:

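I'm learning BS4 and trying to scrape a few tables, lists, and so on from popular sites to get familiar with the syntax. I'm having a hard time getting a list out in the right format. Here is the code: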

from bs4 import BeautifulSoup
import urllib2
import requests

headers = {
  'Connection': 'keep-alive',
  'Cache-Control': 'no-cache',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
  'Pragma': 'no-cache',
  'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
  'Accept-Language': 'en-US,en;q=0.8'
}

url = 'https://www.yahoo.com'

req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
terms = soup.find('ol').get_text()
print terms

This prints the following list:

1Amanda Knox2Meagan Good3Dog the Bounty Hunter4Adrienne Bailon5Powerball winner6Gillian Anderson7Catherine Zeta-Jones8Mickey Rourke9Halle Berry10Lake Tahoe hotels

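The terms themselves are correct, but they are run together with their rank numbers, which means extra work to parse the output so that it reads like "Amanda Knox", "Meagan Good", etc.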

As I'm not very familiar with BS4, is there a way to grab the terms that follow the "title=" attribute in my terms definition?

【Question comments】:

Tags: python html html-parsing beautifulsoup

【Solution 1】:

This happens because the ol tag contains multiple elements, and get_text() joins the text of every tag inside it into a single string.
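A minimal sketch of that behaviour, run against a made-up HTML fragment rather than the live page. The markup and the variable names are assumptions for illustration, and the code targets Python 3, unlike the Python 2 snippets in this thread:

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on a ranked list; not Yahoo's real HTML.
html = """
<ol>
  <li><span>1</span><a href="#">Amanda Knox</a></li>
  <li><span>2</span><a href="#">Meagan Good</a></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
ol = soup.find("ol")

# get_text() concatenates every text node under <ol>, rank numbers included.
joined = ol.get_text(strip=True)
print(joined)

# A separator keeps the pieces apart, but the rank numbers are still present.
separated = ol.get_text("|", strip=True)
print(separated)
```

Even with a separator, the ranks and the terms come back mixed, so selecting only the elements that hold the terms is usually the cleaner route.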

Instead, you can use a CSS selector to get the actual terms:

for li in soup.select('ol.trendingnow_trend-list > li > a'):
    print li.get_text()

This prints:

Hope Solo
Christy Mack
Dog the Bounty Hunter
Adrienne Bailon
Powerball winner
Catherine Zeta-Jones
Mickey Rourke
Valerie Velardi
Halle Berry
Lake Tahoe hotels

The ol.trendingnow_trend-list > li > a CSS selector matches every a tag that is a direct child of an li, which is in turn a direct child of an ol tag with the trendingnow_trend-list class.
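To see the selector in isolation, here is a small Python 3 sketch against a hypothetical fragment. Only the trendingnow_trend-list class name comes from the answer. The rest of the markup, including the title attributes on the links, is assumed for illustration and may not match the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class name is taken from the answer above.
html = """
<ol class="trendingnow_trend-list">
  <li><span>1</span><a href="#" title="Hope Solo">Hope Solo</a></li>
  <li><span>2</span><a href="#" title="Christy Mack">Christy Mack</a></li>
</ol>
<ol class="other-list">
  <li><a href="#">Not a trend</a></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# The child combinator (>) limits matches to <a> tags directly inside an <li>
# that is directly inside the <ol> carrying the given class.
links = soup.select("ol.trendingnow_trend-list > li > a")
terms = [a.get_text(strip=True) for a in links]
print(terms)

# If the links carry a title attribute (an assumption here), Tag.get() reads it
# and returns None for any link that lacks one.
titles = [a.get("title") for a in links]
print(titles)
```

Because the selector targets the a tags directly, the rank numbers in the sibling span elements never enter the result, and the second list is skipped entirely.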

FYI, this gets the list of Trending Now terms from the block in the top-right corner of the page.

【Comments】:

Great explanation. Thank you very much!