【Title】Error when scraping with Beautiful Soup and Selenium together
【Posted】2019-07-21 13:38:01
【Question】

I am a beginner with Python and web scraping in general. In this code I use both Bs4 and Selenium: Selenium automates clicking the "show more" button so that I can scrape all the results, not just the ones on the first page of results. I am trying to scrape this site: https://boards.euw.leagueoflegends.com/en/search?query=improve

However, after combining Bs4 with Selenium, three of the fields I scrape (username, server, and topic) now raise the following two errors.

1) I get an AttributeError: 'NoneType' object has no attribute 'text' for both server and username:

Traceback (most recent call last):
  File "failoriginale.py", line 153, in <module>
    main()
  File "failoriginale.py", line 132, in main
    song_data = get_songs(index_page) # Get songs with metadata
  File "failoriginale.py", line 81, in get_songs
    username = row.find(class_='username').text.strip()
AttributeError: 'NoneType' object has no attribute 'text'

2) And I get this error for topic:

Traceback (most recent call last):
  File "failoriginale.py", line 153, in <module>
    main()
  File "failoriginale.py", line 132, in main
    song_data = get_songs(index_page) # Get songs with metadata
  File "failoriginale.py", line 86, in get_songs
    topic = row.find('div', {'class':'discussion-footer byline opaque'}).find_all('a')[1].text.strip()
IndexError: list index out of range

Before combining bs4 with Selenium, these three fields worked just like the others, so I think the problem lies elsewhere. I don't understand what is wrong with song_data in the main function. I have looked through other questions on Stack Overflow but could not solve the problem. I am new to scraping and to the bs4 and Selenium libraries, so I apologize if this is a silly question.
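Both tracebacks point at the same pattern: `find()` returns `None` when a selector does not match, and `find_all()` can return fewer elements than expected, so chaining `.text` or indexing straight away fails. A minimal sketch of the defensive version (the HTML snippet here is made up purely for illustration):

```python
from bs4 import BeautifulSoup

# A made-up row that is missing both the username tag and the footer links
html = '<div class="discussion-list-item"><span class="title-span">Hi</span></div>'
row = BeautifulSoup(html, 'html.parser').find(class_='discussion-list-item')

# find() returns None when nothing matches, so guard before touching .text
username_tag = row.find(class_='username')
username = username_tag.text.strip() if username_tag else 'N/A'

# find_all() returns a list that may be shorter than expected, so check its length
footer = row.find('div', {'class': 'discussion-footer byline opaque'})
links = footer.find_all('a') if footer else []
topic = links[1].text.strip() if len(links) > 1 else 'N/A'

print(username, topic)  # both fall back to 'N/A' for this snippet
```

With guards like these, a row that lacks one of the fields is skipped or filled with a placeholder instead of crashing the whole scrape.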

Here is the code:

import sys
import time
import csv
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox(executable_path='./geckodriver')
browser.get('https://boards.euw.leagueoflegends.com/en/search?query=improve&content_type=discussion')
html = browser.page_source #page_source is where selenium stores the html source

def get_songs(url):

    html = browser.page_source
    index_page = BeautifulSoup(html,'lxml') # Parse the page

    items = index_page.find(id='search-results') # Get the list on from the webpage
    if not items: # If the webpage does not contain the list, we should exit
        print('Something went wrong!', file=sys.stderr)
        sys.exit()
    data = list()
    # If the page has the "show more" button, keep clicking it for ~5 seconds
    if index_page.find('a', {"class": "box show-more",}):
        button = browser.find_element_by_class_name('box.show-more')
        timeout = time.time() + 5
        while True:
            button.click()
            time.sleep(5.25)
            if time.time() > timeout:
                break

    html = browser.page_source
    index_page = BeautifulSoup(html,'lxml')
    items = index_page.find(id='search-results')

    for row in items.find_all(class_='discussion-list-item'):

        username = row.find(class_='username').text.strip()
        question = row.find(class_='title-span').text.strip()
        sentence = row.find('span')['title']
        serverzone = row.find(class_='realm').text.strip()
        #print(serverzone)
        topic = row.find('div', {'class':'discussion-footer byline opaque'}).find_all('a')[1].text.strip()
        #print(topic)
        date=row.find(class_='timeago').get('title')
        #print(date)
        views = row.find(class_='view-counts byline').find('span', {'class' : 'number opaque'}).get('data-short-number')
        comments = row.find(class_='num-comments byline').find('span', {'class' : 'number opaque'}).get('data-short-number')

        # Store the data in a dictionary, and add that to our list
        data.append({
                     'username': username,
                     'topic':topic,
                     'question':question,
                     'sentence':sentence,
                     'server':serverzone,
                     'date':date,
                     'number_of_comments':comments,
                     'number_of_views':views
                    })
    return data
def get_song_info(url):
    browser.get(url)
    html2 = browser.page_source
    song_page = BeautifulSoup(html2, features="lxml")
    interesting_html= song_page.find('div', {'class' : 'list'})
    if not interesting_html: # Check if an article tag was found, not all pages have one
        print('No information availible for song at {}'.format(url), file=sys.stderr)
        return {}
    answer = interesting_html.find('span', {'class' : 'high-quality markdown'}).find('p').text.strip() #.find('span', {"class": "high-quality markdown",}).find('p')
    return {'answer': answer} # Return the data of interest



def main():
    index_page = BeautifulSoup(html,'lxml')
    song_data = get_songs(index_page) # Get songs with metadata
    # For each row in the improve page, enter the link and extract the data
    for row in song_data:
        print('Scraping info on {}.'.format(row['link'])) # Might be useful for debugging
        url = row['link'] #defines that the url is the column link in the csv file 
        song_info = get_song_info(url) # Get lyrics and credits for this song, if available
        for key, value in song_info.items():
            row[key] = value # Add the new data to our dictionary
    with open('results.csv', 'w', encoding='utf-8') as f: # Open a csv file for writing
        fieldnames=['link','username','topic','question','sentence','server','date','number_of_comments','number_of_views','answer'] # These are the values we want to store
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader() # Write the header row
        writer.writerows(song_data) # Write one row per scraped discussion

Thanks for your help!

【Question comments】

    Tags: python selenium selenium-webdriver web-scraping beautifulsoup


    【Solution 1】

    I would be inclined to use requests to retrieve the total result count and the number of results per batch, then click the button in a loop with a wait condition until all the results are present, and scrape them in one go. The outline below can be reworked as required. You can always stop clicking after n pages by keeping a counter inside the loop and breaking once it reaches your limit. You could additionally add a WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.inline-profile .username'))) before the final collection to allow time for the items revealed by the last click to load.

    import requests
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    
    data = requests.get('https://boards.euw.leagueoflegends.com/en/search?query=improve&json_wrap=1').json()
    total = data['searchResultsCount']
    batch = data['resultsCount']
    
    d = webdriver.Chrome()
    d.get('https://boards.euw.leagueoflegends.com/en/search?query=improve')
    
    counter = batch
    while counter < total:
        WebDriverWait(d, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.show-more-label'))).click()
        counter +=batch
        #print(counter)
    
    userNames = [item.text for item in d.find_elements_by_css_selector('.inline-profile .username')]
    topics = [item.text for item in d.find_elements_by_css_selector('.inline-profile + a')]
    servers = [item.text for item in d.find_elements_by_css_selector('.inline-profile .realm')]
    results = list(zip(userNames, topics, servers))
    

    Interestingly, the page does seem to stop updating before the stated end count even though the button can still be clicked. The same thing happens when clicking manually.
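    If you only want the first n pages rather than everything, the number of clicks can be computed up front from the two counts the json_wrap=1 endpoint returns, instead of looping on a running counter. The figures below are illustrative, not real values from the endpoint:

```python
import math

# Values as returned by the json_wrap=1 endpoint (illustrative numbers)
total = 530   # searchResultsCount: total matching results
batch = 50    # resultsCount: results revealed per "show more" click

# Clicks needed to reveal every result (first batch is shown without clicking)
clicks_all = math.ceil((total - batch) / batch)

# Or cap the scrape at n batches of results, e.g. n = 5
n = 5
clicks_capped = min(clicks_all, n - 1)

print(clicks_all, clicks_capped)  # 10 clicks for everything, 4 for five batches
```

    Either value can then drive a plain `for _ in range(clicks)` loop around the `WebDriverWait(...).click()` call shown above.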

    【Discussion】

    • Sorry it took so long to reply, but I have tried many times without success to integrate the code with your suggestion. Even though the timer works correctly, I keep getting several errors. The last one is: for row in song_data: TypeError: 'NoneType' object is not iterable. Since I am new to this, I don't know whether it is because I put the code in the wrong place. One more question: what do you mean by "scrape them in one go"? Thanks
    • pastebin.com your code. In one go - I click until all the results are present, so I scrape everything at once, rather than scraping one batch, clicking, scraping another batch, and so on.