【Title】: Parsing Link URL with Beautiful Soup
【Posted】: 2021-02-28 05:46:07
【Question】:

I'm using Beautiful Soup (BS4) with Python to scrape data from Yellow Pages via the Wayback Machine / Web Archive. I can easily return the business name and phone number, but when I try to retrieve a business's website URL, I only get back the entire div tag.

#Import Dependencies
from splinter import Browser
from bs4 import BeautifulSoup 
import requests
import pandas as pd 

# Path to chromedriver
!which chromedriver 

# Set the executable path and initialize the chrome browser in splinter
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path) 

#visit Webpage 
url = 'https://web.archive.org/web/20171004082203/https://www.yellowpages.com/houston-tx/air-conditioning-service-repair'
browser.visit(url) 

# Convert the browser html to a soup object and then quit the browser
html = browser.html
soup = BeautifulSoup(html, "html.parser")  

##Scrapers
#business name
print(soup.find('a', class_='business-name').text)
#Telephone
print(soup.find('li', class_='phone primary').text)
#website
print(soup.find('div', class_='links'))

How can I return just the business's website URL? Thanks.

【Discussion】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    Return the href instead:

    print(soup.find('a', class_='business-name')['href'])
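
    A side note on attribute access in BeautifulSoup: `tag['href']` raises a `KeyError` when the attribute is missing, while `tag.get('href')` returns `None`, which is safer when some listings lack a link. A small illustration on a toy fragment (the HTML here is made up for the example):

    ```python
    from bs4 import BeautifulSoup

    # Toy markup: the second anchor deliberately has no href attribute.
    soup = BeautifulSoup(
        '<a class="business-name" href="/biz/richmond-air">Richmond Air</a>'
        '<a class="business-name">No link here</a>',
        "html.parser",
    )

    first, second = soup.find_all("a", class_="business-name")
    print(first["href"])        # /biz/richmond-air
    print(second.get("href"))   # None -- second["href"] would raise KeyError
    ```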
    

    【Discussion】:

    • print(soup.find('a', class_='business-name')['href']) returns the Yellow Pages URL. I'm trying to return the actual business website, e.g. richmondair.com
    【Solution 2】:

    You can do something like this:

    1. Get the list of all the links, then take the value at index 0
    2. Then split it on the separator: "http://"

    Check the updated code below:

    #Import Dependencies
    from splinter import Browser
    from bs4 import BeautifulSoup 
    import requests
    import pandas as pd 
    
    # Path to chromedriver
    !which chromedriver 
    
    # Set the executable path and initialize the chrome browser in splinter
    executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
    browser = Browser('chrome', **executable_path) 
    
    #visit Webpage 
    url = 'https://web.archive.org/web/20171004082203/https://www.yellowpages.com/houston-tx/air-conditioning-service-repair'
    browser.visit(url) 
    
    # Convert the browser html to a soup object and then quit the browser
    html = browser.html
    soup = BeautifulSoup(html, "html.parser")  
    
    ##Scrapers
    #business name
    print(soup.find('a', class_='business-name').text)
    #Telephone
    print(soup.find('li', class_='phone primary').text)
    #website: the first link in the 'links' div points at the business site
    links = soup.find('div', class_='links').findAll("a")
    # the wayback href embeds the original URL after "http://"
    originalLink = links[0].get("href").split("http://")[1]
    print(originalLink)
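
    One caveat with the split above: splitting on "http://" drops the scheme and fails entirely for https:// links. A slightly more robust sketch for recovering the original URL from a Wayback Machine href (the /web/&lt;timestamp&gt;/&lt;original-url&gt; shape matches the archive URL in the question; the helper name is my own):

    ```python
    # Recover the original URL embedded in a Wayback Machine link.
    # A wayback href looks like: /web/20171004082203/http://richmondair.com/
    # Partitioning once on "/http" keeps the scheme for both http and https.
    def original_url(wayback_href):
        prefix, sep, rest = wayback_href.partition("/http")
        if not sep:                # not a wayback-style link; return unchanged
            return wayback_href
        return "http" + rest

    print(original_url("/web/20171004082203/http://richmondair.com/"))
    # -> http://richmondair.com/
    print(original_url("/web/20171004082203/https://example.com/"))
    # -> https://example.com/
    ```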
    

    【Discussion】:

    • Awesome! I can now return values for the first listing on the page. Any ideas on how to retrieve the next 29 listings beyond the first? The find_all command returned a lot of useless information.
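
    To cover every listing rather than just the first, call find_all on the element that wraps each result card and extract the fields per card. The class names 'business-name' and 'links' come from the question's snippet; the 'result' wrapper class is an assumption about the page structure, so this is a self-contained sketch against a toy fragment rather than the live page:

    ```python
    from bs4 import BeautifulSoup

    # Toy HTML mimicking the listing structure from the question; on the
    # real page, `html` would come from browser.html as in the original script.
    html = """
    <div class="result">
      <a class="business-name">Richmond Air</a>
      <div class="links"><a href="/web/20171004082203/http://richmondair.com/">Website</a></div>
    </div>
    <div class="result">
      <a class="business-name">Cool Co</a>
      <div class="links"><a href="/web/20171004082203/https://coolco.example/">Website</a></div>
    </div>
    """
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for card in soup.find_all("div", class_="result"):   # one card per listing
        name = card.find("a", class_="business-name").text
        links = card.find("div", class_="links")
        # .get avoids a KeyError if a card has a links div without an href
        href = links.find("a").get("href") if links else None
        rows.append((name, href))

    for name, href in rows:
        print(name, href)
    ```

    Scoping each find to the individual card (card.find rather than soup.find) is what keeps the fields from different listings paired up correctly.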