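【Posted】:2021-02-28 05:46:07
【Question】:
I am using Beautiful Soup (BS4) with Python to scrape data from the Yellow Pages through the Wayback Machine (web.archive.org). I can return the business name and phone number easily, but when I try to retrieve a business's website URL, I only get back the entire div tag.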
#Import Dependencies
from splinter import Browser
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Path to chromedriver
!which chromedriver
# Set the executable path and initialize the chrome browser in splinter
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path)
#visit Webpage
url = 'https://web.archive.org/web/20171004082203/https://www.yellowpages.com/houston-tx/air-conditioning-service-repair'
browser.visit(url)
# Convert the browser html to a soup object and then quit the browser
html = browser.html
soup = BeautifulSoup(html, "html.parser")
browser.quit()
##Scrapers
#business name
print(soup.find('a', class_='business-name').text)
#Telephone
print(soup.find('li', class_='phone primary').text)
#website
print(soup.find('div', class_='links'))
How can I return only the business's website URL? Thanks.
【Comments】:
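One possible approach, as a minimal sketch: `soup.find('div', class_='links')` returns the whole tag, so the URL has to be read from the `href` attribute of an `<a>` element inside that div. The markup in `SAMPLE_HTML` below is a hypothetical stand-in for one listing, not the real archived page, and the helper `website_url` and the `WAYBACK_PREFIX` pattern are illustrative names. The sketch also assumes that the first link inside `div.links` is the website link and that archived pages rewrite hrefs with a `web.archive.org/web/<timestamp>/` prefix; both should be checked against the actual HTML.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup approximating one listing; the real page may differ.
SAMPLE_HTML = """
<div class="info">
  <a class="business-name" href="/houston-tx/mip/example">Example AC Co.</a>
  <div class="links">
    <a href="https://web.archive.org/web/20171004082203/http://www.example.com">Website</a>
    <a href="/web/20171004082203/https://www.yellowpages.com/directions">Directions</a>
  </div>
</div>
"""

# Assumed shape of the prefix the Wayback Machine adds to archived links.
WAYBACK_PREFIX = re.compile(r'^(?:https?://web\.archive\.org)?/web/\d+[a-z_]*/')


def website_url(soup):
    """Return the href of the first link inside div.links, or None if absent."""
    links = soup.find('div', class_='links')
    if links is None:
        return None
    # href=True matches only <a> tags that actually carry an href attribute.
    anchor = links.find('a', href=True)
    if anchor is None:
        return None
    # Tag attributes are read like dictionary keys; strip the archive prefix.
    return WAYBACK_PREFIX.sub('', anchor['href'])


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(website_url(soup))
```

If the first link in the div turns out not to be the website link, the inner `find` could be narrowed with a `class_` argument matching whatever class the real page uses for that anchor.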
Tags: python web-scraping beautifulsoup