【Title】: Hidden phone number can't be scraped
【Posted】: 2021-04-13 21:41:31
【Description】:

I'm having trouble extracting the phone number that appears after clicking the "llamar" button. So far I've used the XPath approach with Selenium and also tried Beautiful Soup to extract the number, but unfortunately nothing has worked. With Selenium's XPath selectors I usually get an invalid-selector error, and with BS4 I get: AttributeError: 'NoneType' object has no attribute 'text' ... Hope you can help me!

Here is the URL of the page - https://www.milanuncios.com/venta-de-pisos-en-malaga-malaga/portada-alta-carlos-de-haya-carranque-386352344.htm

Here is the code I tried:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import UnexpectedAlertPresentException

url = 'https://www.milanuncios.com/venta-de-pisos-en-malaga-malaga/portada-alta-carlos-de-haya-carranque-386352344.htm'
path = r'C:\Users\WL-133\anaconda3\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
path1 = r'C:\Users\WL-133\anaconda3\Lib\site-packages\selenium\webdriver\firefox'
# driver = webdriver.Chrome(path)
options = Options()
driver = webdriver.Chrome(path)
driver.get(url)

a = []

mah_div = driver.page_source
soup = BeautifulSoup(mah_div, features='lxml')

cookie_button = '//*[@id="sui-TcfFirstLayerModal"]/div/div/footer/div/button[2]'
btn_press = driver.find_element_by_xpath(cookie_button)
btn_press.click()

llam_button = '//*[@id="ad-detail-contact"]/a[2]'
llam_press = driver.find_element_by_xpath(llam_button)
llam_press.click()
time.sleep(10)

for item in soup.find_all("div", {"class": "contenido"}):
    a.append(item.find("div", {"class": "plaincontenido"}).text)

print(a)

【Comments】:

  • Use soup.select_one("script[type='application/ld+json']:contains('Product')").get_text(strip=True) to parse the relevant script tag, then dig the phone number out of the description value.
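
The comment's approach can be sketched offline against an inline stand-in for the listing page (the HTML below is invented for illustration; the real page embeds similar JSON-LD markup). Note that newer soupsieve versions spell the selector `:-soup-contains` instead of the deprecated `:contains`:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the listing page's JSON-LD script tag;
# the live page embeds a similar "Product" object.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Piso en Portada Alta",
 "description": "Bonito piso. Contacto: 639000000"}
</script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")
# ":-soup-contains" is the current spelling of ":contains" in soupsieve
tag = soup.select_one("script[type='application/ld+json']:-soup-contains('Product')")
data = json.loads(tag.get_text(strip=True))
print(data["description"])  # -> Bonito piso. Contacto: 639000000
```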

Tags: python selenium web-scraping beautifulsoup


【Solution 1】:

The phone number is stored in JavaScript. You can use the re module to extract it:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.milanuncios.com/venta-de-pisos-en-malaga-malaga/portada-alta-carlos-de-haya-carranque-386352344.htm"
phone_url = "https://www.milanuncios.com/datos-contacto/?usePhoneProxy=0&from=detail&includeEmail=false&id={}"

ad_id = re.search(r"(\d+)\.htm", url).group(1)

html_text = requests.get(phone_url.format(ad_id)).text

soup = BeautifulSoup(html_text, "html.parser")
phone = re.search(r"getTrackingPhone\((.*?)\)", html_text).group(1)

print(soup.select_one(".texto").get_text(strip=True), phone)

Prints:

ana (Particular) 639....
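
The regex step in the answer can be exercised offline against a snippet shaped like the contact endpoint's response (the JavaScript below is invented for illustration; the real response wraps the number in a getTrackingPhone(...) call):

```python
import re

# Invented sample of the kind of JavaScript the contact endpoint returns.
js_snippet = "var a = 1; getTrackingPhone(639000000); loadWidget();"

# Non-greedy capture of everything between the parentheses.
match = re.search(r"getTrackingPhone\((.*?)\)", js_snippet)
phone = match.group(1)
print(phone)  # -> 639000000
```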

【Discussion】:

【Solution 2】:

With Selenium you need to click the button and then switch to the iframe.

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# an explicit-wait helper is needed for the waits below
wait = WebDriverWait(driver, 10)

tel_button = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, ".def-btn.phone-btn")))
tel_button.click()
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "ifrw")))
tel_number = wait.until(EC.visibility_of_element_located(
    (By.CSS_SELECTOR, ".texto>.telefonos"))).text

Note that I used stable locators throughout.

【Discussion】:

  • Incredible! Works perfectly, thank you. I also just learned that switching to the frame is required in cases like this, which will be very useful for future web scraping.