【Question title】: Trouble returning web scraping output as dictionary
【Posted】: 2020-01-22 19:50:54
【Question】:

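I'm trying to scrape a website's staff roster, and I want the end result to be a dictionary in the form {staff: position}. Right now I'm stuck with every staff name and every position coming back as a separate string. The output is hard to post clearly, but it essentially runs down the list of names and then the list of positions; the first name should be paired with the first position, and so on. I've determined that each name and position is a bs4.element.Tag. I believe I need to collect the names and the positions into two lists and then use zip to combine them into a dictionary. I've tried implementing this, but nothing has worked so far. Using the class_ parameter, the closest I can get to the text I need is the individual p contained in each div. I'm still inexperienced with Python and new to web scraping, but I'm fairly comfortable with HTML and CSS. Any help would be greatly appreciated.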

# Simple script attempting to scrape 
# the staff roster off of the 
# Greenville Drive website

import requests
from bs4 import BeautifulSoup

URL = 'https://www.milb.com/greenville/ballpark/frontoffice'

page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

staff = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3')

for member in staff:
    data = member.find('p')
    if data:
        print(data.text.strip())

position = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6')

for entry in position:
    data = entry.find('p')
    if data:
        print(data.text.strip())

# This code so far provides the needed data, but need it in a dict()

【Comments】:

    Tags: python-3.x dictionary web-scraping beautifulsoup python-requests


    【Solution 1】:

    BeautifulSoup has find_next(), which returns the next tag in the document that matches the given filters. Find each "staff" div, then use find_next() to get the adjacent "position" div.

    import requests
    from bs4 import BeautifulSoup
    
    URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    staff_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3'
    position_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6'
    result = {}
    
    for staff in soup.find_all('div', class_=staff_class):
        data = staff.find('p')
        if data:
            staff_name = data.text.strip()
            # The position div is the next matching div after this staff div.
            position_div = staff.find_next('div', class_=position_class)
            position_name = position_div.text.strip()
            result[staff_name] = position_name
    
    print(result)
    

    Output:

    {'Craig Brown': 'Owner/Team President', 'Eric Jarinko': 'General Manager', 'Nate Lipscomb': 'Special Advisor to the President', 'Phil Bargardi': 'Vice President of Sales', 'Jeff Brown': 'Vice President of Marketing', 'Greg Burgess, CSFM': 'Vice President of Operations/Grounds', 'Jordan Smith': 'Vice President of Finance', 'Ned Kennedy': 'Director of Inside Sales', 'Patrick Innes': 'Director of Ticket Operations', 'Micah Gold': 'Senior Account Executive', 'Molly Mains': 'Senior Account Executive', 'Houghton Flanagan': 'Account Executive', 'Jeb Maloney': 'Account Executive', 'Olivia Adams': 'Inside Sales Representative', 'Tyler Melson': 'Inside Sales Representative', 'Toby Sandblom': 'Inside Sales Representative', 'Katie Batista': 'Director of Sponsorships and Community Engagement', 'Matthew Tezza': 'Sponsor Services and Activations Manager', 'Melissa Welch': 'Sponsorship and Community Events Manager', 'Beth Rusch': 'Director of West End Events', 'Kristin Kipper': 'Events Manager', 'Grant Witham': 'Events Manager', 'Alex Guest': 'Director of Game Entertainment & Production', 'Lance Fowler': 'Director of Video Production', 'Davis Simpson': 'Director of Media and Creative Services', 'Cameron White': 'Media Relations Manager', 'Ed Jenson': 'Broadcaster', 'Adam Baird': 'Accountant', 'Mike Agostino': 'Director of Food and Beverage', 'Roger Campana': 'Assistant Director of Food and Beverage', 'Wilbert Sauceda': 'Executive Chef', 'Elise Parish': 'Premium Services Manager', 'Timmy Hinds': 'Director of Facility Operations', 'Zack Pagans': 'Assistant Groundskeeper', 'Amanda Medlin': 'Business and Team Operations Manager', 'Allison Roedell': 'Office Manager'}
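
The same find_next() pairing can be tried offline on a small inline snippet, which helps when the live page changes or is unavailable. This is only a sketch: the markup, the class names "name-col" and "role-col", the staff entries, and the helper pair_with_find_next are all made up for illustration and are not taken from the real site. It also skips a name div when no following position div is found, an assumption added here so the loop does not fail on a missing match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the class names
# "name-col" and "role-col" and the entries are invented for this sketch.
HTML = """
<div class="row">
  <div class="name-col"><p><b>Ann Lee</b></p></div>
  <div class="role-col"><p>General Manager</p></div>
</div>
<div class="row">
  <div class="name-col"></div>
  <div class="role-col"><p>Vacant</p></div>
</div>
<div class="row">
  <div class="name-col"><p><b>Bob Ray </b></p></div>
  <div class="role-col"><p>Broadcaster</p></div>
</div>
"""


def pair_with_find_next(html, name_class, role_class):
    """Map each name to the text of the next role div that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for name_div in soup.find_all("div", class_=name_class):
        name_tag = name_div.find("p")
        if name_tag is None:
            continue  # name column without a <p>: nothing to pair
        role_div = name_div.find_next("div", class_=role_class)
        if role_div is None:
            continue  # no role div after this name: skip instead of failing
        # get_text(strip=True) trims surrounding whitespace from the text
        result[name_tag.get_text(strip=True)] = role_div.get_text(strip=True)
    return result


roster = pair_with_find_next(HTML, "name-col", "role-col")
print(roster)
```

In this snippet the second row has an empty name column. Because each position is looked up relative to its own name div, the third name still lines up with its own position, whereas zipping two separately collected lists would pair it with the second row's position.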
    

    【Comments】:

    • Thanks, this is exactly what I was looking for!
    【Solution 2】:

    A solution using CSS selectors and zip():

    from pprint import pprint
    
    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.milb.com/greenville/ballpark/frontoffice'
    
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    # Names are the <b> tags and positions the <p> tags matched by the two
    # selectors; zip() pairs them in document order.
    out = {}
    for name, position in zip(soup.select('div:has(+ div p) b'),
                              soup.select('div:has(> div b) + div p')):
        out[name.text] = position.text
    
    pprint(out)
    
    

    Prints:

    {'Adam Baird': 'Accountant',
     'Alex Guest': 'Director of Game Entertainment & Production',
     'Allison Roedell': 'Office Manager',
     'Amanda Medlin': 'Business and Team Operations Manager',
     'Beth Rusch': 'Director of West End Events',
     'Brady Andrews': 'Assistant Director of Facility Operations',
     'Brooks Henderson': 'Merchandise Manager',
     'Bryan Jones': 'Facilities Cleanliness Manager',
     'Cameron White': 'Media Relations Manager',
     'Craig Brown': 'Owner/Team President',
     'Davis Simpson': 'Director of Media and Creative Services',
     'Ed Jenson': 'Broadcaster',
     'Elise Parish': 'Premium Services Manager',
     'Eric Jarinko': 'General Manager',
     'Grant Witham': 'Events Manager',
     'Greg Burgess, CSFM': 'Vice President of Operations/Grounds',
     'Houghton Flanagan': 'Account Executive',
     'Jeb Maloney': 'Account Executive',
     'Jeff Brown': 'Vice President of Marketing',
     'Jenny Burgdorfer': 'Director of Merchandise',
     'Jordan Smith ': 'Vice President of Finance',
     'Katie Batista': 'Director of Sponsorships and Community Engagement',
     'Kristin Kipper': 'Events Manager',
     'Lance Fowler': 'Director of Video Production',
     'Matthew Tezza': 'Sponsor Services and Activations Manager',
     'Melissa Welch': 'Sponsorship and Community Events Manager',
     'Micah Gold': 'Senior Account Executive',
     'Mike Agostino': 'Director of Food and Beverage',
     'Molly Mains': 'Senior Account Executive',
     'Nate Lipscomb': 'Special Advisor to the President',
     'Ned Kennedy': 'Director of Inside Sales',
     'Olivia Adams': 'Inside Sales Representative',
     'Patrick Innes': 'Director of Ticket Operations',
     'Phil Bargardi': 'Vice President of Sales',
     'Roger Campana': 'Assistant Director of Food and Beverage',
     'Steve Seman': 'Merchandise / Ticketing Advisor',
     'Timmy Hinds': 'Director of Facility Operations',
     'Toby Sandblom': 'Inside Sales Representative',
     'Tyler Melson': 'Inside Sales Representative',
     'Wilbert Sauceda': 'Executive Chef',
     'Zack Pagans': 'Assistant Groundskeeper'}
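
The pairing step on its own can be sketched without any scraping. Two details of the zip() approach are worth handling: the printed output above contains the key 'Jordan Smith ' with a trailing space, because the text is not stripped, and zip() silently stops at the shorter input if the two lists differ in length. The helper name build_roster and the length check are assumptions added for this sketch, not part of the answer; the sample strings are copied from the output above.

```python
def build_roster(names, positions):
    """Pair names with positions by index and return a dict.

    Raises ValueError when the lists differ in length, because zip()
    would otherwise drop the unmatched tail without any warning.
    """
    if len(names) != len(positions):
        raise ValueError(
            f"{len(names)} names but {len(positions)} positions"
        )
    # strip() removes stray whitespace such as the trailing space
    # in 'Jordan Smith '
    return {name.strip(): pos.strip() for name, pos in zip(names, positions)}


# Sample values taken from the printed output above
names = ["Jordan Smith ", "Ed Jenson"]
positions = ["Vice President of Finance", "Broadcaster"]

roster = build_roster(names, positions)
print(roster)
```

With the scraped tags, the inputs would be built from the text of each selected element, for example [tag.text for tag in soup.select(...)]. Note that a dict keyed by name keeps only the last entry if two staff members share the same name.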
    

    【Comments】:
