【Question Title】: How to extract hidden li text
【Posted】: 2019-05-21 09:24:03
【Question】:

I am trying to scrape the website, jump to each article's href, and scrape the comments located below the article body. However, I get blank results. I also tried soup.find_all('li') to grab every li element and check whether any comments were present at all, and found that even extracting all the li elements did not return any of the article's comments. Can someone advise? I suspect the site makes it harder to get at the text.

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box    ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box    ')
        time.sleep(1)
        ahref = [item.find('a')['href']  for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)
        for result in results:
            res = driver.get(result[0])
            soup = bs(driver.page_source, 'lxml')
            time.sleep(6)
            comments_href = soup.find_all('ul', {'id': 'post-list'})
            print(comments_href)
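One quick way to see why find_all('li') comes back empty is to list the iframes in the rendered page: anything inside an iframe's document never appears in the parent page_source. A minimal sketch, using a hypothetical stand-in HTML string in place of driver.page_source:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; the real page embeds its comments
# in an iframe served from another origin (hypothetical markup below)
html = '''
<html><body>
  <div class="post-content">article body</div>
  <iframe name="dsq-app1234" src="https://disqus.com/embed/comments/"></iframe>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# li tags that live inside the iframe's own document are invisible here
print(soup.find_all('li'))  # []

# But the iframe element itself is visible, which tells you where to look
for frame in soup.find_all('iframe'):
    print(frame.get('name'), frame.get('src'))
```

If this prints an iframe but no li elements, the comments are loaded in a separate document and you need to switch into the frame (as the accepted solution below does) before parsing.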

【Question Comments】:

    Tags: python-3.x web-scraping selenium-chromedriver


    【Solution 1】:

    The posts/comments are inside an <iframe> tag. That tag also has a dynamic name attribute that starts with dsq-app. So what you need to do is find that iframe, switch into it, and then you can parse. I chose to use BeautifulSoup to pull out the script tag, read it as JSON, and navigate through that. This should let you pull what you are looking for:

    import requests
    from bs4 import BeautifulSoup as bs
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import time
    import pandas as pd
    import json
    
    urls = [
        'https://hypebeast.com/brands/jordan-brand'
    ]
    
    with requests.Session() as s:
        for url in urls:
            driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
            driver.get(url)
            products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box    ']")))]
            soup = bs(driver.page_source, 'lxml')
            element = soup.select('.post-box    ')
            time.sleep(1)
            ahref = [item.find('a')['href']  for item in element]
            results = list(zip(ahref))
            df = pd.DataFrame(results)
            for result in ahref:
    
                driver.get(result)
                time.sleep(6)
    
                iframe = driver.find_element_by_xpath('//iframe[starts-with(@name, "dsq-app")]')
    
                driver.switch_to.frame(iframe)
                soup = bs(driver.page_source, 'html.parser')
    
                scripts = soup.find_all('script')
                for script in scripts:
                    if 'response' in script.text:
                        jsonStr = script.text
                        jsonData = json.loads(jsonStr)
    
                        for each in jsonData['response']['posts']:
                            author = each['author']['username']
                            message = each['raw_message']
                            print('%s: %s' %(author, message))
    

    Output:

    annvee: Lemme get them BDSM jordans fam
    deathb4designer: Lmao
    zenmasterchen: not sure why this model needed to exist in the first place
    Spawnn: Issa flop.
    disqus_lEPADa2ZPn: looks like an AF1
    Lekkerdan: Hoodrat shoes.
    rubnalntapia: Damn this are sweet
    marcellusbarnes: Dope, and I hate Jordan lows
    marcellusbarnes: The little jumpman on the back is dumb
    chickenboihotsauce: copping those CPFM gonna be aids
    lowercasegod: L's inbound
    monalisadiamante: Sold out in 4 minutes. ?
    nickpurita: Those CPFM’s r overhyped AF.
    ...
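The json.loads call above assumes the matching script body is a pure JSON blob; if the page ever wraps it in JavaScript, loads will raise. A defensive version of that extraction step can be sketched as a small helper (find_thread_json and the sample strings here are hypothetical, not part of the original answer):

```python
import json

def find_thread_json(script_texts):
    """Return the first script body that parses as JSON and has a
    top-level 'response' key, or None if no script qualifies."""
    for text in script_texts:
        if 'response' not in text:
            continue
        try:
            data = json.loads(text)
        except ValueError:
            # Script was JavaScript code, not a pure JSON blob: skip it
            continue
        if isinstance(data, dict) and 'response' in data:
            return data
    return None

# Hypothetical sample mimicking the scripts found in the iframe
sample = [
    'var config = {"foo": 1};',  # plain JS, skipped
    '{"response": {"posts": [{"author": {"username": "annvee"},'
    ' "raw_message": "Lemme get them BDSM jordans fam"}]}}',
]

data = find_thread_json(sample)
for post in data['response']['posts']:
    print('%s: %s' % (post['author']['username'], post['raw_message']))
# annvee: Lemme get them BDSM jordans fam
```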
    

    【Discussion】:

    • Hi Chitown88 - could you show a solution that merges all the comments for each ahref? I am dumping them into a pandas dataframe exported to csv, and I would like 1 cell to contain all the comments.
    • Yes, that is easy to do. So you want 1 cell containing all the comments for each href. A few questions: 1) do you want the username of the commenter? How do you want the comments separated (with a ;)? Do you want just 1 column, n rows?
    • 1) Yes, I want the usernames 2) separated by commas. Also two columns, the first with the href info and the second with the usernames + comments. Thanks a lot
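The follow-up request above (one row per href, with all of that article's comments joined into a single cell) can be sketched with plain pandas. The hrefs and comment tuples below are hypothetical placeholders for whatever the scraper collected:

```python
import pandas as pd

# Hypothetical scraped data: {href: [(username, message), ...]}
comments_by_href = {
    'https://hypebeast.com/example-article-1': [
        ('annvee', 'Lemme get them BDSM jordans fam'),
        ('deathb4designer', 'Lmao'),
    ],
    'https://hypebeast.com/example-article-2': [
        ('Spawnn', 'Issa flop.'),
    ],
}

rows = []
for href, posts in comments_by_href.items():
    # Join each "username: message" pair into one comma-separated cell
    joined = ', '.join('%s: %s' % (user, msg) for user, msg in posts)
    rows.append({'href': href, 'comments': joined})

df = pd.DataFrame(rows, columns=['href', 'comments'])
df.to_csv('comments.csv', index=False)
print(df)
```

This yields the two columns asked for in the comments: the href in the first, and the comma-separated username + comment pairs in the second.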