【Question title】: Website scraping with Python 2.7 and BeautifulSoup 4
【Posted】: 2017-04-13 11:28:23
【Question】:

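I am stuck scraping the website "http://www.queensbronxba.com/directory/" with BeautifulSoup. I am almost done: the only thing still missing is the company name of each entry in the list, which sits in a paragraph tag. The problem is that each div contains several paragraph tags, but I only need the first one, because that is the one holding the company name. So I need the first paragraph of every one of the following divs, not just the first paragraph on the page. This is the code I use to scrape: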

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.queensbronxba.com/directory/")
soup = BeautifulSoup(page.content, 'html.parser')  
company = soup.find(class_="boardMemberWrap")  
contact = company.find_all(class_="boardMember")  
info = contact[0]
print(info.prettify())

name_tags = company.select("h4")  
names = [nt.get_text() for nt in name_tags]
names

company_tags = company.select("p")  # need help here: I only want the first <p> of each div container
companies = [ct.get_text() for ct in company_tags]  
companies

phone_tags = company.select('a[href^="tel"]')  
phones = [pt.get_text() for pt in phone_tags]  
phones

email_tags = company.select('a[href^="mailto"]')  
emails = [et.get_text() for et in email_tags]  
emails

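【Comments】:

  • Be specific about your problem. What exactly are you still stuck on?
  • There is a comment on the company_tags line saying that is where I need help.
  • You should describe the problem in the question itself, outside the code, and state it explicitly. If you only want one paragraph out of all the text you get, parse the text, perhaps by splitting it on \n

Tags: python beautifulsoup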


【Solution 1】:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.queensbronxba.com/directory/")
soup = BeautifulSoup(page.content, 'html.parser')  
company = soup.find(class_="boardMemberWrap")  
contact = company.find_all(class_="boardMemberInfo")
info = contact[0]
print(info.prettify())


name_tags = company.select("h4")
names = [nt.get_text() for nt in name_tags]
print(names)


# For each member container, print only its first <p> (the company name)
for member in company.find_all(class_="boardMember"):
    for p in member.find_all('p')[:1]:
        print(p.text)


phone_tags = company.select('a[href^="tel"]')  
phones = [pt.get_text() for pt in phone_tags]  
print(phones)


email_tags = company.select('a[href^="mailto"]')  
emails = [et.get_text() for et in email_tags]  
print(emails)
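
The loop above can also be written so that the company names end up in a list, like the other fields. The following is only a minimal sketch: the HTML sample is made up to mimic the structure described in the question (the real page may differ), and `first_paragraphs` is a hypothetical helper name, not something from the original code. It uses `find("p")`, which returns the first matching tag inside each container or `None` if there is none.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the layout described in the question;
# the class names are taken from the thread, the contents are invented.
SAMPLE_HTML = """
<div class="boardMemberWrap">
  <div class="boardMember">
    <h4>Jane Doe</h4>
    <p>Acme Law Group</p>
    <p>123 Main Street</p>
  </div>
  <div class="boardMember">
    <h4>John Roe</h4>
    <p>Roe Partners</p>
    <p>456 Side Avenue</p>
  </div>
</div>
"""


def first_paragraphs(html):
    """Return the text of the first <p> in each boardMember container."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for member in soup.find_all(class_="boardMember"):
        first_p = member.find("p")  # first <p> descendant, or None
        results.append(first_p.get_text(strip=True) if first_p else None)
    return results


companies = first_paragraphs(SAMPLE_HTML)
print(companies)
```

Keeping a `None` placeholder for containers without a paragraph means the resulting list stays aligned with the list of containers, which may help if the fields are later combined per member.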

【Comments】:
