【Title】: Hidden phone number can't be scraped
【Posted】: 2021-04-13 21:41:31
【Description】:

I'm having trouble extracting the phone number that appears after clicking the "llamar" button. So far I've used the XPath approach with Selenium and also tried Beautiful Soup to extract the number, but unfortunately nothing has worked. With Selenium's XPath selectors I usually get an invalid-selector error, and with BS4 I get: AttributeError: 'NoneType' object has no attribute 'text' ... Hope you can help me!

Here is the URL of the page - https://www.milanuncios.com/venta-de-pisos-en-malaga-malaga/portada-alta-carlos-de-haya-carranque-386352344.htm

Here is the code I tried:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import UnexpectedAlertPresentException

url = 'https://www.milanuncios.com/venta-de-pisos-en-malaga-malaga/portada-alta-carlos-de-haya-carranque-386352344.htm'
path = r'C:\Users\WL-133\anaconda3\Lib\site-packages\selenium\webdriver\chrome\chromedriver.exe'
path1 = r'C:\Users\WL-133\anaconda3\Lib\site-packages\selenium\webdriver\firefox'
# driver = webdriver.Chrome(path)
options = Options()
driver = webdriver.Chrome(path)
driver.get(url)

a = []

mah_div = driver.page_source
soup = BeautifulSoup(mah_div, features='lxml')

cookie_button = '//*[@id="sui-TcfFirstLayerModal"]/div/div/footer/div/button[2]'
btn_press = driver.find_element_by_xpath(cookie_button)
btn_press.click()

llam_button = '//*[@id="ad-detail-contact"]/a[2]'
llam_press = driver.find_element_by_xpath(llam_button)
llam_press.click()
time.sleep(10)

for item in soup.find_all("div", {"class": "contenido"}):
    a.append(item.find("div", {"class": "plaincontenido"}).text)

print(a)

【Comments】:

  • Use soup.select_one("script[type='application/ld+json']:contains('Product')").get_text(strip=True) to parse the relevant script tag, then dig the phone number out of the description value.
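
The comment's approach can be sketched offline against an inline stand-in for the listing page (the HTML below is invented for illustration; the real page embeds similar JSON-LD markup). Note that newer soupsieve versions spell the selector `:-soup-contains` instead of the deprecated `:contains`:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the listing page's JSON-LD script tag;
# the live page embeds a similar "Product" object.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Piso en Portada Alta",
 "description": "Bonito piso. Contacto: 639000000"}
</script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")
# ":-soup-contains" is the current spelling of ":contains" in soupsieve
tag = soup.select_one("script[type='application/ld+json']:-soup-contains('Product')")
data = json.loads(tag.get_text(strip=True))
print(data["description"])  # -> Bonito piso. Contacto: 639000000
```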

Tags: python selenium web-scraping beautifulsoup


【Solution 1】:

The phone number is stored in JavaScript. You can use the re module to extract it:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.milanuncios.com/venta-de-pisos-en-malaga-malaga/portada-alta-carlos-de-haya-carranque-386352344.htm"
phone_url = "https://www.milanuncios.com/datos-contacto/?usePhoneProxy=0&from=detail&includeEmail=false&id={}"

ad_id = re.search(r"(\d+)\.htm", url).group(1)

html_text = requests.get(phone_url.format(ad_id)).text

soup = BeautifulSoup(html_text, "html.parser")
phone = re.search(r"getTrackingPhone\((.*?)\)", html_text).group(1)

print(soup.select_one(".texto").get_text(strip=True), phone)

Prints:

ana (Particular) 639....
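
The regex step in the answer can be exercised offline against a snippet shaped like the contact endpoint's response (the JavaScript below is invented for illustration; the real response wraps the number in a getTrackingPhone(...) call):

```python
import re

# Invented sample of the kind of JavaScript the contact endpoint returns.
js_snippet = "var a = 1; getTrackingPhone(639000000); loadWidget();"

# Non-greedy capture of everything between the parentheses.
match = re.search(r"getTrackingPhone\((.*?)\)", js_snippet)
phone = match.group(1)
print(phone)  # -> 639000000
```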

【Discussion】:

【Solution 2】:

With Selenium you need to click the button and then switch to the iframe.

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# an explicit-wait helper is needed for the waits below
wait = WebDriverWait(driver, 10)

tel_button = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, ".def-btn.phone-btn")))
tel_button.click()
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "ifrw")))
tel_number = wait.until(EC.visibility_of_element_located(
    (By.CSS_SELECTOR, ".texto>.telefonos"))).text

Note that I used stable locators throughout.

【Discussion】:

  • Incredible! Works perfectly, thank you. I also just learned that switching to the frame is required in cases like this, which will be very useful for future web scraping.