Posted: 2016-02-27 00:27:19
Problem description:
I'm scraping web pages using multithreading and random proxies. My home PC handles this fine, though it takes a lot of processes (in the current code I've set it to 100); RAM usage seems to peak around 2.5 GB. When I run it on my CentOS VPS, however, I get a generic "Killed" message and the program terminates. With 100 processes I get the Killed error very, very quickly. I reduced the count to a more reasonable 8 and still got the same error, just after a longer time. Based on some research, I'm assuming the "Killed" error is related to memory usage. Without multithreading, the error doesn't occur.
So, what can I do to optimize my code so it still runs fast but doesn't use so much memory? Is my best bet simply to reduce the number of processes further? And can I monitor my memory usage from within Python while the program runs?
Edit: I just realized that my VPS has 256 MB of RAM, versus 24 GB on my desktop, which I didn't take into account when I originally wrote the code.
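For the monitoring sub-question, a minimal sketch using the standard-library resource module; this assumes Linux, where ru_maxrss is reported in kilobytes (on macOS it is bytes):

import resource

def log_peak_memory(label):
    # ru_maxrss = peak resident set size of this process (KB on Linux)
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print 'Peak RSS at {}: {:.1f} MB'.format(label, peak_kb / 1024.0)

# e.g. call it after each batch of scraped pages
log_peak_memory('after scrapeall')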
# Imports inferred from usage below; working_proxies and user_agents
# are assumed to be defined elsewhere in the script
import random
import sys

import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

#Request soup of url, using random proxy / user agent - try different combinations until valid results are returned
def getsoup(url):
    attempts = 0
    while True:
        try:
            proxy = random.choice(working_proxies)
            headers = {'user-agent': random.choice(user_agents)}
            proxy_dict = {'http': 'http://' + proxy}
            # headers must be passed by keyword, otherwise requests treats them as query params
            r = requests.get(url, headers=headers, proxies=proxy_dict, timeout=5)
            soup = BeautifulSoup(r.text, "html5lib") #"html.parser"
            totalpages = int(soup.find("div", class_="pagination").text.split(' of ', 1)[1].split('\n', 1)[0]) #Looks for totalpages to verify proper page load
            currentpage = int(soup.find("div", class_="pagination").text.split('Page ', 1)[1].split(' of', 1)[0])
            if totalpages < 5000: #One particular proxy wasn't returning pagelimit=60 or offset requests properly ..
                break
        except Exception as e:
            # print 'Error! Proxy: {}, Error msg: {}'.format(proxy, e)
            attempts = attempts + 1
            if attempts > 30:
                print 'Too many attempts .. something is wrong!'
                sys.exit()
    return (soup, totalpages, currentpage)
#Return soup of page of ads, connecting via random proxy/user agent
def scrape_url(url):
    soup, totalpages, currentpage = getsoup(url)
    #Extract ads from page soup
    ###[A bunch of code to extract individual ads from the page..]
    # print 'Success! Scraped page #{} of {} pages.'.format(currentpage, totalpages)
    sys.stdout.flush()
    return ads
def scrapeall():
    global currentpage, totalpages, offset
    url = "url"
    _, totalpages, _ = getsoup(url + "0")
    url_list = [url + str(60*i) for i in range(totalpages)]
    # Make the pool of workers
    pool = ThreadPool(100)
    # Open the urls in their own threads and return the results
    results = pool.map(scrape_url, url_list)
    # Close the pool and wait for the work to finish
    pool.close()
    pool.join()
    flatten_results = [item for sublist in results for item in sublist] #Flattens the list of lists returned by multithreading
    return flatten_results
adscrape = scrapeall()
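On the optimization question, two low-risk levers, sketched under the assumption that getsoup and scrape_url stay as above: html5lib tends to be the slowest and heaviest of the parsers BeautifulSoup supports (the commented-out "html.parser" is lighter), and pool.map holds every page's result list until all workers finish, whereas imap_unordered lets results be consumed as they arrive. MAX_WORKERS and scrapeall_lowmem are illustrative names:

MAX_WORKERS = 8  # illustrative; tune down for a 256 MB VPS

def scrapeall_lowmem():
    url = "url"
    _, totalpages, _ = getsoup(url + "0")
    url_list = [url + str(60*i) for i in range(totalpages)]
    pool = ThreadPool(MAX_WORKERS)
    all_ads = []
    # imap_unordered yields each page's ads as soon as a worker finishes,
    # so the full list-of-lists that pool.map builds (plus its flattened
    # copy) is never held in memory all at once
    for ads in pool.imap_unordered(scrape_url, url_list):
        all_ads.extend(ads)
    pool.close()
    pool.join()
    return all_ads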
Discussion:
- With only 256 MB of RAM, the process will most likely be killed for excessive memory use even when it isn't multithreaded. You have to remember that not even all of the 256 MB is available. Scraping uses a lot of memory, depending on the pages.
- Do you want to queue the requests up instead? (a sketch of this idea follows the comments)
- Peter, what can I do to reduce the memory usage? I've removed the multithreading, and yes, it still crashes.
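A minimal sketch of the queueing idea from the comment above, using the Python 2 standard-library Queue and threading modules and reusing scrape_url from the question; the bounded queue applies backpressure so only a handful of URLs and pages are in flight at once (NUM_WORKERS and scrape_queued are illustrative names):

import threading
from Queue import Queue  # Python 2; the module is named 'queue' on Python 3

NUM_WORKERS = 4  # illustrative

def worker(url_queue, results):
    while True:
        url = url_queue.get()  # blocks until a url (or stop sentinel) arrives
        if url is None:
            break  # sentinel: no more work
        results.extend(scrape_url(url))  # list.extend is atomic in CPython
        url_queue.task_done()

def scrape_queued(url_list):
    url_queue = Queue(maxsize=NUM_WORKERS * 2)  # bounded = backpressure on the producer
    results = []
    threads = [threading.Thread(target=worker, args=(url_queue, results))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for url in url_list:
        url_queue.put(url)  # blocks while the queue is full
    for _ in threads:
        url_queue.put(None)  # one stop sentinel per worker
    for t in threads:
        t.join()
    return results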
Tags: python multithreading memory screen-scraping