【Title】: Use Selenium to click all YouTube comment 'reply' buttons and get channel links
【Posted】: 2020-09-10 16:16:04
【Question】:

The goal is to scrape all YouTube channel links from a video's comment section. My current code only gets the usernames rather than the channel links, and it does not look at replies. I don't understand how to do this or why my XPath is wrong.

Code:

from selenium import webdriver
import time

driver=webdriver.Chrome()

driver.get('https://www.youtube.com/watch?v=_p2NvO6KrBs')
time.sleep(5)

#Scrolling
for i in range(4):
    #scroll 1000 px
    driver.execute_script('window.scrollTo(0,(window.pageYOffset+1000))')
    #waiting for the page to load
    time.sleep(1.5) 


#replies
replies = driver.find_element_by_xpath('//*[@id="more-replies"]')
time.sleep(1)
replies.click()


comment_div=driver.find_element_by_xpath('//*[@id="contents"]')
comments=comment_div.find_elements_by_xpath('//*[@id="author-text"]')
for comment in comments:
    print(comment.text)

【Discussion】:

    Tags: python selenium web-scraping youtube selenium-chromedriver


    【Answer 1】:

    If you want the channel URL, you need to read the href attribute of each author element rather than its text:

    for comment in comments:
        print(comment.get_attribute('href'))
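
A small helper can make the href lookup more robust. The sketch below assumes only that each element exposes `get_attribute`, as Selenium's WebElement does; `collect_channel_links` and `FakeElement` are hypothetical names introduced for illustration, and the fake class exists only so the example runs without a browser.

```python
def collect_channel_links(elements):
    """Return unique, non-empty href values in first-seen order."""
    seen = set()
    links = []
    for element in elements:
        # get_attribute returns None when the attribute is absent
        href = element.get_attribute('href')
        if href and href not in seen:
            seen.add(href)
            links.append(href)
    return links


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (demo only)."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == 'href' else None


demo = [
    FakeElement('https://www.youtube.com/channel/abc'),
    FakeElement(None),
    FakeElement('https://www.youtube.com/channel/abc'),
]
print(collect_channel_links(demo))
```

With a real driver, the `comments` list from the question could be passed in directly; duplicates are dropped because the same user often comments more than once.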
    

    If you also want the channel of every reply to each comment, you can try the following. I've added comments on some lines for context...

    main_comments = driver.find_elements_by_css_selector('#contents #comment') # get all the comments
    
    for mc in main_comments:
        main_comment_channel = mc.find_element_by_id('author-text').get_attribute('href')
        print('The commenters channel is: ' + main_comment_channel) # print the channel of the main comment
    
        replies = mc.find_element_by_xpath('..//*[@id="replies"]') # get the replies section of the above comment
        if replies.text.startswith('View'): # check if there are any replies
            replies.find_element_by_css_selector('a').click() # if so open the replies
            time.sleep(3) # wait for load (a better strategy should be used here)
    
            for reply in replies.find_elements_by_id('author-text'):
                reply_channel = reply.get_attribute('href')
                print('Reply channel: ' + reply_channel) # print the channel of each reply
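
One possible "better strategy" than the fixed `time.sleep(3)` is to poll until the replies have actually appeared. The sketch below is a generic, standard-library helper; `wait_until` is a hypothetical name, not a Selenium API. Selenium also ships its own `WebDriverWait` class that follows the same poll-until-true pattern and may be preferable in real code.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.25):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(poll)


# Demo without a browser: the condition becomes truthy on the third call.
calls = {'n': 0}


def ready_on_third_call():
    calls['n'] += 1
    return ['reply'] if calls['n'] >= 3 else []


result = wait_until(ready_on_third_call, timeout=2.0, poll=0.01)
print(result, calls['n'])
```

In the loop above, the condition could be a lambda that returns the list of `author-text` elements found inside `replies`, so the loop continues as soon as at least one reply has loaded.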
    

    Complete solution, including writing to a .txt file:

    file = open("output.txt","w+")
    
    driver.get('https://www.youtube.com/watch?v=_p2NvO6KrBs')
    time.sleep(5)
    
    #new scrolling
    while(len(driver.find_elements_by_css_selector('#sections>#continuations #spinner')) > 0):
        #scroll 1000 px
        driver.execute_script('window.scrollTo(0,(window.pageYOffset+1000))')
        #waiting for the page to load
        time.sleep(1.5) 
    
    
    main_comments = driver.find_elements_by_css_selector('#contents #comment') # get all the comments
    
    for mc in main_comments:
        main_comment_channel = mc.find_element_by_id('author-text').get_attribute('href')
        file.write('The commenters channel is: ' + main_comment_channel + '\n') #write the channel of the main comment to file
    
        replies = mc.find_element_by_xpath('..//*[@id="replies"]') # get the replies section of the above comment
        if replies.text.startswith('View'): # check if there are any replies
            view_replies = replies.find_element_by_css_selector('a') # the "View replies" link
            driver.execute_script("arguments[0].scrollIntoView();", view_replies) # bring the link into view
            driver.execute_script('window.scrollTo(0,(window.pageYOffset-150))') # scroll back up so the YouTube header does not cover it
            view_replies.click() # open the replies
            time.sleep(3) # wait for load (a better strategy should be used here)
    
            for reply in replies.find_elements_by_id('author-text'):
                reply_channel = reply.get_attribute('href')
                file.write('Reply channel: ' + reply_channel + '\n') # write the channel of each reply to file
    
    file.close()
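
The spinner-based scroll loop above depends on YouTube's current markup. A different technique, swapped in here purely as an illustration, is to keep scrolling until the page height stops growing, with a cap on the number of rounds. `scroll_until_stable` and `FakeDriver` are hypothetical names; the function assumes only that the driver has an `execute_script` method, and the fake driver lets the sketch run without a browser.

```python
import time

HEIGHT_JS = 'return document.documentElement.scrollHeight'
SCROLL_JS = 'window.scrollTo(0, document.documentElement.scrollHeight)'


def scroll_until_stable(driver, pause=1.5, max_rounds=50):
    """Scroll to the bottom repeatedly; return the number of rounds performed."""
    last_height = driver.execute_script(HEIGHT_JS)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)  # give lazily loaded comments time to arrive
        new_height = driver.execute_script(HEIGHT_JS)
        if new_height == last_height:
            return rounds  # nothing new was loaded
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in for a WebDriver: each scroll reveals the next height."""

    def __init__(self, heights):
        self._heights = heights
        self._index = 0

    def execute_script(self, script):
        if script.startswith('return'):
            return self._heights[self._index]
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None


rounds = scroll_until_stable(FakeDriver([1000, 2000, 3000]), pause=0)
print(rounds)
```

The `max_rounds` cap is a design choice to avoid looping forever on videos with very long comment sections; a slow connection may need a longer `pause`.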
    

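    【Comments】:

    • Thanks for your answer. Could you help me print the results to a .txt file?
    • I've added a complete solution (minus creating the driver). It adds the file output you wanted, but I also noticed that the original scrolling might not load all the comments, so I changed that too. In addition, the "View replies" button has to be scrolled into view before it is clicked, while allowing for the YouTube header that can cover it. One thing I haven't handled is that YouTube sometimes shows a feedback popup (and a few other popups); they didn't appear consistently for me, but you should be able to handle them fairly easily.
    • Could you help me with another problem that seems to have come up? :) (thread)
    • It's best to create a separate question for that, but here's a hint: look at where you open the file and where you close it (the for loop is catching you out).
    • Ah yes, my mistake. Never mind, it's fixed. Thanks a lot for your help.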
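
The hint about where the file is opened and closed can be sketched as follows: open the file once, outside both loops, and let a `with` block close it. `write_channels` and the sample data are hypothetical, introduced only for illustration; the line formats match the answer's output.

```python
import os
import tempfile


def write_channels(path, threads):
    """Write one line per channel; the file is opened once, outside both loops."""
    with open(path, 'w', encoding='utf-8') as out:
        for main_channel, reply_channels in threads:
            out.write('The commenters channel is: ' + main_channel + '\n')
            for reply_channel in reply_channels:
                out.write('Reply channel: ' + reply_channel + '\n')
    # the with block closes the file here, even if a write raised an exception


# Demo data standing in for scraped results.
demo_threads = [
    ('https://www.youtube.com/channel/a', ['https://www.youtube.com/channel/b']),
    ('https://www.youtube.com/channel/c', []),
]
demo_path = os.path.join(tempfile.mkdtemp(), 'output.txt')
write_channels(demo_path, demo_threads)

with open(demo_path, encoding='utf-8') as handle:
    lines = handle.read().splitlines()
print(lines)
```

Opening or closing the file inside the loop would either truncate earlier output on each pass or fail on the second write, which is likely the kind of problem the hint refers to.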