Python: Speeding up reverse DNS lookups (answers)

【Question title】: Python: Speed up Reverse DNS lookup
【Posted】: 2015-09-27 18:53:52
【Question】:

I plan to run reverse DNS lookups on 47 million IPs. Here is my code:

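import socket

ips = []
with open(file, 'r') as f:  # `file` holds the input path; each line is "ip|countdomain"
    with open('./ip_ptr_new.txt', 'a') as w:
        for l in f:
            la = l.rstrip('\n')
            ip, countdomain = la.split('|')
            ips.append(ip)

            try:
                # Blocking reverse (PTR) lookup; returns (hostname, aliases, addresses)
                ais = socket.gethostbyaddr(ip)
                print("%s|%s|%s" % (ip, ais[0], countdomain), file=w)
            except OSError:  # no PTR record, or the lookup failed
                print("%s|%s|%s" % (ip, "None", countdomain), file=w)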

Right now it is very slow. Does anyone have suggestions for speeding it up?

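【Question discussion】:

Run multiple threads/processes.
@matino My machine only has two CPUs, and running two processes with Python's multiprocessing module does not speed things up. Could you elaborate?
Having 2 CPUs does not mean you can run only 2 processes at a time. Try running more; this code is unlikely to be CPU-bound.
You can use 2 processes, each spawning multiple threads, where each thread 1) reads only part of the file and 2) looks up the hosts for that part.
@matino Thanks. It would really help me if you could write that up as a separate answer.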
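
The comments above point out that the job is unlikely to be CPU-bound, so far more workers than CPUs can be useful. A minimal sketch of that idea with a thread pool follows. It is not code from the thread: the names `reverse_lookup` and `lookup_all`, the `resolver` parameter, and the default of 50 workers are assumptions chosen for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def reverse_lookup(line, resolver=socket.gethostbyaddr):
    """Turn one 'ip|countdomain' line into 'ip|hostname|countdomain'."""
    ip, countdomain = line.rstrip("\n").split("|")
    try:
        hostname = resolver(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        hostname = "None"
    return "%s|%s|%s" % (ip, hostname, countdomain)


def lookup_all(lines, workers=50, resolver=socket.gethostbyaddr):
    """Resolve many lines concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads spend their time waiting on the network, so the worker
        # count (an assumed default here) can far exceed the CPU count.
        return list(pool.map(lambda line: reverse_lookup(line, resolver), lines))
```

Called on a list of `ip|countdomain` lines, `lookup_all` returns the output lines in input order. Because it collects every result in a list, it suits batches that fit in memory rather than the full 47 million lines.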

Tags: python dns reverse reverse-dns

【Solution 1】:

Try the multiprocessing module. I timed it on about 8000 IPs and got this:

#dns.py
real    0m2.864s
user    0m0.788s
sys     0m1.216s


#slowdns.py
real    0m17.841s
user    0m0.712s
sys     0m0.772s


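# dns.py
from multiprocessing import Pool
import socket


def dns_lookup(item):
    # Each work item is an (ip, countdomain) tuple; return the formatted output line.
    ip, countdomain = item
    try:
        ais = socket.gethostbyaddr(ip)
        return "%s|%s|%s" % (ip, ais[0], countdomain)
    except OSError:  # no PTR record, or the lookup failed
        return "%s|%s|%s" % (ip, "None", countdomain)


if __name__ == '__main__':
    filename = "input.txt"
    ips = []
    with open(filename, 'r') as f:
        for l in f:
            la = l.rstrip('\n')
            ip, countdomain = la.split('|')
            ips.append((ip, countdomain))
    # 5 worker processes run the blocking lookups in parallel.
    with Pool(5) as p:
        results = p.map(dns_lookup, ips)
    # Workers only resolve; the parent process does all the writing.
    with open('./ip_ptr_new.txt', 'a') as w:
        for line in results:
            print(line, file=w)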





# slowdns.py
import socket

filename = "input.txt"
if __name__ == '__main__':
    ips = []
    with open(filename, 'r') as f:
        with open('./ip_ptr_new.txt', 'a') as w:
            for l in f:
                la = l.rstrip('\n')
                ip, countdomain = la.split('|')
                ips.append(ip)
                try:
                    # One blocking lookup at a time: this is what makes it slow.
                    ais = socket.gethostbyaddr(ip)
                    print("%s|%s|%s" % (ip, ais[0], countdomain), file=w)
                except OSError:  # no PTR record, or the lookup failed
                    print("%s|%s|%s" % (ip, "None", countdomain), file=w)

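【Discussion】:

Thanks for your help. The problem is that there are so many IPs that putting them into a list takes almost 90% of memory. Is there a way to read the file line by line and pass each line to the multiprocessing module?
Sorry, I missed that requirement. In that case you cannot use a list; you have to use a queue instead. Reading from the large file is still fine, because Python iterates over a file lazily, like a generator, rather than loading it into a list. Please check this thread: stackoverflow.com/questions/14677287/…
Hmm, thanks. But I still don't know how to use a queue in this case.
OK, I'll see whether I can post an edit with actual code today.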
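
No follow-up code appears in the thread. The sketch below is one way to keep memory bounded; it is an assumption, not the commenter's approach. Instead of an explicit queue, it reads the file in fixed-size batches and hands each batch to a thread pool, so only `batch_size` lines are held at once. The names `format_lookup` and `stream_lookups`, the `resolver` parameter, and the default sizes are hypothetical.

```python
import functools
import itertools
import socket
from concurrent.futures import ThreadPoolExecutor


def format_lookup(line, resolver=socket.gethostbyaddr):
    """Turn one 'ip|countdomain' line into 'ip|hostname|countdomain'."""
    ip, countdomain = line.rstrip("\n").split("|")
    try:
        hostname = resolver(ip)[0]
    except OSError:  # no PTR record, or the lookup failed
        hostname = "None"
    return "%s|%s|%s" % (ip, hostname, countdomain)


def stream_lookups(lines, out, workers=100, batch_size=10000,
                   resolver=socket.gethostbyaddr):
    """Resolve lines batch by batch and write each result to `out`.

    Returns the number of lines written. At most `batch_size` input lines
    and their results are in memory at any time.
    """
    worker = functools.partial(format_lookup, resolver=resolver)
    line_iter = iter(lines)
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            # islice pulls the next batch lazily from the file iterator.
            batch = list(itertools.islice(line_iter, batch_size))
            if not batch:
                break
            for result in pool.map(worker, batch):
                out.write(result + "\n")
                written += 1
    return written
```

With the file names from the answer above, this could be called as `stream_lookups(f, w)` inside `with open("input.txt") as f, open("./ip_ptr_new.txt", "a") as w:`. The trade-off of batching is that each batch waits for its slowest lookup before the next one starts; a queue-based design avoids that at the cost of more code.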

【Solution 2】:

One solution here is to use the nslookup shell command with the timeout option. The host command might work too... An imperfect but useful example!

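import subprocess

def sh_dns(ip, dns):
    # `timeout 0.2` stops nslookup after 0.2 s; `dns` is the server to query.
    a = subprocess.Popen(['timeout', '0.2', 'nslookup', '-norec', ip, dns],
                         stdout=subprocess.PIPE)
    sortie = a.communicate()[0].decode(errors='replace')
    # A PTR answer contains "name = host.example."; keep what follows the last '='.
    tab = sortie.split('=')
    if len(tab) > 1:
        return tab[-1].strip()
    else:
        return ""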

【Discussion】:

【Solution 3】:

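We recently had to deal with this problem too. Running on multiple processes did not give a good enough solution. Processing millions of IPs from a powerful AWS machine could take several days. What worked well was Amazon EMR, which took about half an hour on a 10-machine cluster. You cannot scale very far with one machine (and usually one network interface), because this is a network-intensive task. Using Map Reduce across multiple machines really did the job.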
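
The answer gives no code. As a rough illustration of the underlying idea only (splitting the input so that each machine resolves its own share), here is a small sketch. It is not EMR or Map Reduce code; the function name `shard` and its parameters are hypothetical.

```python
import itertools


def shard(lines, num_shards, shard_index):
    """Lazily yield the lines that belong to one of `num_shards` shards.

    Line i goes to shard i % num_shards, so the shards are disjoint and
    together cover the whole input.
    """
    return itertools.islice(lines, shard_index, None, num_shards)
```

Each of 10 machines could then iterate over `shard(open_file, 10, k)` with its own `k` from 0 to 9 and run the lookups on just those lines.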

【Discussion】: