【Question Title】: How to correctly use threads if the number of cities is very large?
【Posted】: 2017-11-20 18:35:50
【Question】:

How can I fix this problem?

warning: conflicting chdir during another chdir block

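I fetch all the places in a city and create a folder with a file in it. How can I properly optimize the code and get it working correctly? How do I check whether the folder containing the file already exists and then append new text to the existing file?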

require 'open-uri'
require 'json'
require 'thread'

def scrape_instagram_city_page(page)
    cityArray = []
    id = 0
    begin
        instagram_source = open(page).read
        content = JSON.parse(instagram_source.split("window._sharedData = ")[1].split(";</script>")[0])
        locationName = content['entry_data']['LocationsDirectoryPage'][0]['city_info']['name']
        nextpage = content['entry_data']['LocationsDirectoryPage'][0]['next_page'] 
        Dir.mkdir("#{locationName}")
        loop do
            id +=1
            instagram_source = open(page+"?page=#{id}").read
            content = JSON.parse(instagram_source.split("window._sharedData = ")[1].split(";</script>")[0])
            locationsList = content['entry_data']['LocationsDirectoryPage'][0]['location_list']
            locationsList.each do |hash|
                cityArray.push(hash['id'].to_i)
            end
            if nextpage == "null"
                break
            end
        Dir.chdir("#{locationName}") do
            fileName = "#{locationName}.txt"
            File.open(fileName, 'w') do |file|
                cityArray.each do |item|
                    file << "https://www.instagram.com/explore/locations/#{item}/\n"
                end
            end
        end
        end
    rescue Exception => e
        return nil
    end
end

threads = []
city = ["https://www.instagram.com/explore/locations/c2269433/dhewng-thailand/","https://www.instagram.com/explore/locations/c2260532/ban-poek-thailand/","https://www.instagram.com/explore/locations/c2267999/ban-wang-takrai-thailand/","https://www.instagram.com/explore/locations/c2255595/ban-nong-kho-thailand/","https://www.instagram.com/explore/locations/c2252832/ban-na-khum-thailand/","https://www.instagram.com/explore/locations/c2267577/ban-wang-khaen-thailand/","https://www.instagram.com/explore/locations/c2248064/ban-khung-mae-luk-on-thailand/","https://www.instagram.com/explore/locations/c2243370/ban-hua-dong-kheng-thailand/","https://www.instagram.com/explore/locations/c2269271/chieng-sean-thailand/","https://www.instagram.com/explore/locations/c2256442/ban-nong-phiman-thailand/","https://www.instagram.com/explore/locations/c2246490/ban-khlong-khwang-thai-thailand/"]
city.each do |page|
    threads << Thread.new do
        scrape_instagram_city_page "#{page}"
    end
end

threads.each(&:join)

【Comments】:

    Tags: json ruby parsing


    【Solution 1】:

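    Before answering the question, I'll point out that scraping a website is often against that site's terms of service. You should check this and make sure you are not doing anything illegal.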

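    The "current directory" changed by chdir is a process-wide setting shared by all threads. That is why you get this warning when two threads try to change it at the same time. It has nothing to do with the number of threads you create.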
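
To make the "process-wide" point concrete, here is a minimal sketch that uses only the standard library. The helper name cwd_seen_by_other_thread is hypothetical and not part of the original code. A second thread that never calls chdir itself still observes the directory chosen by the first thread:

```ruby
require 'tmpdir'

# Returns the working directory as observed from a separate thread
# while the calling thread is inside a Dir.chdir block.
# (cwd_seen_by_other_thread is a hypothetical name for this sketch.)
def cwd_seen_by_other_thread(dir)
  Dir.chdir(dir) do
    # The new thread only reads Dir.pwd, yet it sees the changed directory.
    Thread.new { Dir.pwd }.value
  end
end

Dir.mktmpdir do |tmp|
  expected = File.realpath(tmp) # resolve symlinks so the comparison is fair
  seen = cwd_seen_by_other_thread(tmp)
  puts "other thread saw: #{seen}"
  puts "same as chdir target: #{seen == expected}"
end
```

Because every thread shares this one setting, concurrent Dir.chdir blocks in different threads cannot each have their own directory.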

    To avoid the problem, don't change the current directory. Just include the directory in the path:

    def scrape_instagram_city_page(page)
        cityArray = []
        id = 0
        begin
            instagram_source = open(page).read
            content = JSON.parse(instagram_source.split("window._sharedData = ")[1].split(";</script>")[0])
            locationName = content['entry_data']['LocationsDirectoryPage'][0]['city_info']['name']
            nextpage = content['entry_data']['LocationsDirectoryPage'][0]['next_page'] 
            Dir.mkdir("#{locationName}")
            loop do
                id +=1
                instagram_source = open(page+"?page=#{id}").read
                content = JSON.parse(instagram_source.split("window._sharedData = ")[1].split(";</script>")[0])
                locationsList = content['entry_data']['LocationsDirectoryPage'][0]['location_list']
                locationsList.each do |hash|
                    cityArray.push(hash['id'].to_i)
                end
                if nextpage == "null"
                    break
                end
                # Include the directory in the path instead of calling Dir.chdir
                fileName = "#{locationName}/#{locationName}.txt"
                File.open(fileName, 'w') do |file|
                    cityArray.each do |item|
                        file << "https://www.instagram.com/explore/locations/#{item}/\n"
                    end
                end
            end
        rescue Exception => e
            return nil
        end
    end
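
The question also asks how to check whether the folder already exists and append to an existing file. The following is a minimal sketch rather than a definitive implementation. The helper name append_location_links and the 'example-city' value are made up for illustration. It relies on FileUtils.mkdir_p, which does nothing if the directory is already there, and on the 'a' file mode, which appends instead of overwriting:

```ruby
require 'fileutils'
require 'tmpdir'

# Appends one link per location id to <base_dir>/<city_name>/<city_name>.txt.
# (append_location_links is a hypothetical helper, not from the original code.)
def append_location_links(base_dir, city_name, ids)
  dir = File.join(base_dir, city_name)
  FileUtils.mkdir_p(dir) # no error if the folder already exists
  path = File.join(dir, "#{city_name}.txt")
  File.open(path, 'a') do |file| # 'a' appends; the file is created if missing
    ids.each { |id| file << "https://www.instagram.com/explore/locations/#{id}/\n" }
  end
  path
end

# Demo in a temporary directory so nothing is left behind.
Dir.mktmpdir do |tmp|
  append_location_links(tmp, 'example-city', [1, 2])
  path = append_location_links(tmp, 'example-city', [3])
  puts File.readlines(path).length # 3: the second call appended to the first
end
```

If an explicit check is preferred, Dir.exist?(dir) returns true when the folder is already present.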
    

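    【Comments】:

    • How can I speed up the processing? How do I use threads effectively?
    • That is very subjective and broad, and can't be answered easily here. For example, your main performance bottleneck is most likely your Internet connection or the site you are connecting to.
    • What is the maximum, or the most effective, number of threads that can be used in this case?
    • Because the work is I/O-bound rather than CPU-bound, there's no way to say for sure. For example, some people would suggest switching to asynchronous I/O and using only one thread. If you stay with synchronous I/O, it depends on how long the server takes to respond.
    • How do I switch to asynchronous I/O, and how does it work?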
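
The comments above leave the thread count open. One common way to keep it bounded when the city list is very large is a fixed-size worker pool fed from a Queue, instead of one thread per URL. The sketch below is an illustration under assumptions, not the answerer's method: each_with_pool and pool_size are hypothetical names, and a suitable pool size has to be found by measurement.

```ruby
require 'thread' # no-op on modern Ruby; makes Queue available on older versions

# Runs the block once per item using at most pool_size threads and
# returns the results in input order.
# (each_with_pool and pool_size are hypothetical names for this sketch.)
def each_with_pool(items, pool_size, &block)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  queue.close # after closing, pop returns nil once the queue is empty
  results = Array.new(items.length)

  workers = Array.new(pool_size) do
    Thread.new do
      while (job = queue.pop)
        item, index = job
        results[index] = block.call(item) # each job writes its own slot
      end
    end
  end
  workers.each(&:join)
  results
end

# Stand-in work; in the script above the block would call
# scrape_instagram_city_page(page) for each URL in the city array.
p each_with_pool(%w[a b c d e], 2) { |page| page.upcase }
```

With this shape, the number of simultaneous requests stays at pool_size no matter how many cities are queued.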