【Question title】: Comments extraction from YouTube using Python Selenium
【Posted】: 2016-04-15 19:15:55
【Question】:

I am using Python Selenium to extract comments from YouTube:

from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://www.youtube.com/watch?v=a6NhKKl-iR0")
for elem in browser.find_elements_by_xpath('//body'):
    print(elem.text)

How do I get the comments?

【Comments】:

    Tags: python selenium selenium-webdriver web-scraping


    【Solution 1】:

    The comments are in divs with the class comment-renderer-text-content:

    for elem in browser.find_elements_by_xpath('//div[@class="comment-renderer-text-content"]'):
        print(elem.text)
    

    This gives you:

    great stuff man. question: why use selenium for this site when the data you're looking for is in the source code and could be scraped with requests/beautifulsoup? disclaimer: i'm commenting a year later so the source code may be different :)
    Good question, if the data is in source you're right, selenium is overkill.  I use selenium when I find it quicker to not have to reverse engineer a site looking for sever calls which return json data that only exists inside the browser etc...  So the bottom line is if you're really crafty picking off JSON calls to the server and replicating that without needing to have the DOM built for you than it's a much better be to use BeautifulSoup or Python Requests.   However if you're creating for instance an automated program to automatically pin, like stuff on facebook etc... you will most likely not be able to pull that off very easily just using BeautifulSoup. 
    Answered my questions very well.
    Great job! I do have a questions though. What if the site is built in silverlight? Then I cannot see the Xpath of each element...
    the first test was slow because of a slow loading adserver, you can see it in firefox at the bottom bar.
    This is good stuff.
    Clear and useful although i'm using java. Thx
    YOU ARE BETTER THAN A PROFESSIONAL TEACHER MAN!!!.. 
    thanks man
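
To illustrate what that XPath matches, here is a minimal sketch that applies the same class test to a small hand-written snippet using only the standard library. The markup, the comment strings, and the `extract_comments` helper are invented for illustration and are not YouTube's actual page source. `xml.etree.ElementTree` supports only a subset of XPath, so the expression is written relative to the root as `.//div[...]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far larger.
SAMPLE = """
<body>
  <div class="comment-renderer-text-content">first comment</div>
  <div class="other">not a comment</div>
  <div class="comment-renderer-text-content">second comment</div>
</body>
"""


def extract_comments(markup):
    """Return the text of every div whose class attribute equals the comment class."""
    root = ET.fromstring(markup.strip())
    xpath = ".//div[@class='comment-renderer-text-content']"
    return [div.text for div in root.findall(xpath)]


comments = extract_comments(SAMPLE)
for text in comments:
    print(text)
```

Note that `@class="..."` is an exact string comparison on the attribute, so a div carrying additional class names would not match this expression.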
    

    The comments are loaded dynamically, so you may need to wait for the elements to be present:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait


    def wait(dr, x):
        # Block for up to 20 seconds until at least one element matches the
        # XPath x, then return the list of matching elements.
        elements = WebDriverWait(dr, 20).until(
            EC.presence_of_all_elements_located((By.XPATH, x))
        )
        return elements


    browser = webdriver.Firefox()
    browser.get("https://www.youtube.com/watch?v=a6NhKKl-iR0")

    for elem in wait(browser, '//div[@class="comment-renderer-text-content"]'):
        print(elem.text)
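
Conceptually, an explicit wait evaluates a condition repeatedly until it returns a truthy value or a deadline passes. The following is a simplified, Selenium-free sketch of that polling idea in Python 3, not WebDriverWait's actual implementation; `poll_until`, `fake_find_comments`, and their parameters are hypothetical names used only for illustration.

```python
import time


def poll_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Stand-in for a lookup that finds nothing until the third attempt,
# mimicking comments that appear some time after the page itself loads.
attempts = {"count": 0}


def fake_find_comments():
    attempts["count"] += 1
    if attempts["count"] < 3:
        return []
    return ["first comment", "second comment"]


found = poll_until(fake_find_comments, timeout=2.0, poll_interval=0.01)
print(found)
```

If the condition never becomes true within the timeout, the sketch raises `TimeoutError`, which is analogous to the `TimeoutException` that `WebDriverWait.until` raises in Selenium.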
    

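    【Discussion】:

    • Could this be caused by a proxy issue? Please let me know.
    • @VinayakumarR, most likely you just need to wait. I edited the answer; the comments are loaded after the page has loaded.
    • @Padraic Cunningham, I ran your example program, but it shows an error: TimeoutException: Message:
    • Which version of Selenium?
    • Selenium version = 2.53.1. I am using Windows 7 and Python version 2.7.10.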