[Posted]: 2019-05-21 09:24:03
[Question]:
I am trying to scrape a website: follow each article's href and scrape the comments located after the body text. However, I get blank results. I also tried soup.find_all('li') to grab every li and check whether any comments exist at all, and found that even the full set of li elements contains no comments for the article. Can anyone advise? I suspect the site makes this text harder to fetch.
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(
            EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box ')
        time.sleep(1)
        ahref = [item.find('a')['href'] for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)
        for result in results:
            res = driver.get(result[0])
            soup = bs(driver.page_source, 'lxml')
            time.sleep(6)
            comments_href = soup.find_all('ul', {'id': 'post-list'})
            print(comments_href)
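A likely cause is that the comment widget is injected by JavaScript after the initial page load, often inside an iframe; in that case driver.page_source contains only the top-level document, so ul#post-list is empty or absent. A minimal sketch of one way to check, assuming the comments live in an embedded frame (the iframe selector below is a hypothetical placeholder, not verified against this site), with a small parsing helper split out so it can be tested on static HTML:

```python
from bs4 import BeautifulSoup as bs


def extract_comments(page_source):
    """Return the text of each <li> inside ul#post-list, or [] if the list
    is missing (e.g. because the widget has not rendered yet)."""
    # 'html.parser' is used here so the helper has no extra dependency;
    # the original code uses 'lxml', which works the same way.
    soup = bs(page_source, 'html.parser')
    post_list = soup.find('ul', {'id': 'post-list'})
    if post_list is None:
        return []
    return [li.get_text(strip=True) for li in post_list.find_all('li')]


# Selenium side (sketch only; needs a live driver, and the frame locator
# is an assumption -- inspect the page to find the real iframe id/class):
#
# WebDriverWait(driver, 30).until(
#     EC.frame_to_be_available_and_switch_to_it(
#         (By.CSS_SELECTOR, "iframe#comments-frame")))
# comments = extract_comments(driver.page_source)  # page_source is now the frame's document
# driver.switch_to.default_content()               # return to the main page afterwards
```

Note that driver.page_source reflects whichever browsing context the driver is currently switched into, so the switch must happen before grabbing the source; also consider replacing the fixed time.sleep(6) with an explicit WebDriverWait on the comment list itself.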
[Discussion]:
Tags: python-3.x web-scraping selenium-chromedriver