【Title】: Parsing Link URL with Beautiful Soup
【Posted】: 2021-02-28 05:46:07
【Question】:

I'm using Beautiful Soup (BS4) with Python to scrape data from Yellow Pages via the Wayback Machine / Web Archive. I can easily return the business name and phone number, but when I try to retrieve a business's website URL, I only get back the entire div tag.

#Import Dependencies
from splinter import Browser
from bs4 import BeautifulSoup 
import requests
import pandas as pd 

# Path to chromedriver
!which chromedriver 

# Set the executable path and initialize the chrome browser in splinter
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path) 

#visit Webpage 
url = 'https://web.archive.org/web/20171004082203/https://www.yellowpages.com/houston-tx/air-conditioning-service-repair'
browser.visit(url) 

# Convert the browser html to a soup object and then quit the browser
html = browser.html
soup = BeautifulSoup(html, "html.parser")  

##Scrapers
#business name
print(soup.find('a', class_='business-name').text)
#Telephone
print(soup.find('li', class_='phone primary').text)
#website
print(soup.find('div', class_='links'))

How can I return just the business's website URL? Thanks.

【Discussion】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    Return the href instead:

    print(soup.find('a', class_='business-name')['href'])
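
    A side note on attribute access in BeautifulSoup: `tag['href']` raises a `KeyError` when the attribute is missing, while `tag.get('href')` returns `None`, which is safer when some listings lack a link. A small illustration on a toy fragment (the HTML here is made up for the example):

    ```python
    from bs4 import BeautifulSoup

    # Toy markup: the second anchor deliberately has no href attribute.
    soup = BeautifulSoup(
        '<a class="business-name" href="/biz/richmond-air">Richmond Air</a>'
        '<a class="business-name">No link here</a>',
        "html.parser",
    )

    first, second = soup.find_all("a", class_="business-name")
    print(first["href"])        # /biz/richmond-air
    print(second.get("href"))   # None -- second["href"] would raise KeyError
    ```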
    

    【Discussion】:

    • print(soup.find('a', class_='business-name')['href']) returns the Yellow Pages URL. I'm trying to return the actual business website, e.g. richmondair.com
    【Solution 2】:

    You can do something like this:

    1. Get the list of all the links, then take the value at index 0
    2. Then split it on the separator: "http://"

    Check the updated code below:

    #Import Dependencies
    from splinter import Browser
    from bs4 import BeautifulSoup 
    import requests
    import pandas as pd 
    
    # Path to chromedriver
    !which chromedriver 
    
    # Set the executable path and initialize the chrome browser in splinter
    executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
    browser = Browser('chrome', **executable_path) 
    
    #visit Webpage 
    url = 'https://web.archive.org/web/20171004082203/https://www.yellowpages.com/houston-tx/air-conditioning-service-repair'
    browser.visit(url) 
    
    # Convert the browser html to a soup object and then quit the browser
    html = browser.html
    soup = BeautifulSoup(html, "html.parser")  
    
    ##Scrapers
    #business name
    print(soup.find('a', class_='business-name').text)
    #Telephone
    print(soup.find('li', class_='phone primary').text)
    #website: the first link in the 'links' div points at the business site
    links = soup.find('div', class_='links').findAll("a")
    # the wayback href embeds the original URL after "http://"
    originalLink = links[0].get("href").split("http://")[1]
    print(originalLink)
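
    One caveat with the split above: splitting on "http://" drops the scheme and fails entirely for https:// links. A slightly more robust sketch for recovering the original URL from a Wayback Machine href (the /web/&lt;timestamp&gt;/&lt;original-url&gt; shape matches the archive URL in the question; the helper name is my own):

    ```python
    # Recover the original URL embedded in a Wayback Machine link.
    # A wayback href looks like: /web/20171004082203/http://richmondair.com/
    # Partitioning once on "/http" keeps the scheme for both http and https.
    def original_url(wayback_href):
        prefix, sep, rest = wayback_href.partition("/http")
        if not sep:                # not a wayback-style link; return unchanged
            return wayback_href
        return "http" + rest

    print(original_url("/web/20171004082203/http://richmondair.com/"))
    # -> http://richmondair.com/
    print(original_url("/web/20171004082203/https://example.com/"))
    # -> https://example.com/
    ```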
    

    【Discussion】:

    • Awesome! I can now return values for the first listing on the page. Any ideas on how to retrieve the next 29 listings beyond the first? The find_all command returned a lot of useless information.
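
    To cover every listing rather than just the first, call find_all on the element that wraps each result card and extract the fields per card. The class names 'business-name' and 'links' come from the question's snippet; the 'result' wrapper class is an assumption about the page structure, so this is a self-contained sketch against a toy fragment rather than the live page:

    ```python
    from bs4 import BeautifulSoup

    # Toy HTML mimicking the listing structure from the question; on the
    # real page, `html` would come from browser.html as in the original script.
    html = """
    <div class="result">
      <a class="business-name">Richmond Air</a>
      <div class="links"><a href="/web/20171004082203/http://richmondair.com/">Website</a></div>
    </div>
    <div class="result">
      <a class="business-name">Cool Co</a>
      <div class="links"><a href="/web/20171004082203/https://coolco.example/">Website</a></div>
    </div>
    """
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for card in soup.find_all("div", class_="result"):   # one card per listing
        name = card.find("a", class_="business-name").text
        links = card.find("div", class_="links")
        # .get avoids a KeyError if a card has a links div without an href
        href = links.find("a").get("href") if links else None
        rows.append((name, href))

    for name, href in rows:
        print(name, href)
    ```

    Scoping each find to the individual card (card.find rather than soup.find) is what keeps the fields from different listings paired up correctly.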