[Question title]: Calculating Incremental Entropy for Data that is not real numbers
[Posted]: 2018-02-03 19:55:02
[Question]:

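I have a dataset containing an ID, a timestamp, and identifiers. I have to go through it, calculate the entropy, and save some other links for the data. At each step, more identifiers are added to the dictionary of identifiers, and I have to recompute the entropy and append it. I have a very large amount of data, and the program gets stuck because of the growing number of identifiers and the entropy calculation after every step. I read the following solution, but it is about data consisting of numbers: Incremental entropy computation

I copied two functions from that page, but at every step the incremental entropy calculation gives a different value from the classical full entropy calculation. Here is my code: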

from math import log
# ---------------------------------------------------------------------#
# Functions copied from  https://stackoverflow.com/questions/17104673/incremental-entropy-computation
# maps x to -x*log2(x) for x>0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#
# Below is the input data (Actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
          ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
          ["7","2008-01-06T02:13:00Z","x,y"]]
total_identifiers = {} # To store the occurrences of identifiers. Values shows the number of occurrences
all_entropies = []  # Classical way of calculating entropy at every step
updated_entropies = []  # Incremental way of calculating entropy at every step
for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # Sum of all identifiers
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # Get previous entropy calculation
    for identifier in temp:
        S_new = len(temp)  # sum of new samples
        temp_dictionaty = {a:1 for a in temp}  # Store current identifiers and their occurrence
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1
    current_entropy = entropy(total_identifiers.values())  # Entropy for current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)

    entropy_value = entropy(total_identifiers.values())  # Classical entropy calculation for comparison. This step becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # Gives 12 while the sum is 14 ???
print("All Classical Entropies:     ", all_entropies)  # print for comparison
print("All Updated Entropies:       ", updated_entropies)

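The other issue is that when I print the sum of total_identifiers, it gives 12 instead of 14! (Because the data is very large, I read the actual file line by line and write the results directly to disk, keeping nothing in memory except the dictionary of identifiers.)

[Comments]: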

    Tags: python python-3.x math python-3.5 entropy


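    [Answer 1]:

    The code above uses Theorem 4; it seems to me that you want to use Theorem 5 instead (from the paper cited below).

    Note, however, that if the number of identifiers really is the problem, then the incremental approach below will not work either: at some point the dictionary gets too large.

    Below is a proof-of-concept Python implementation that follows the description in Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.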
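For reference, the update performed by the implementation below can be written out as follows. This is a reading of the code rather than a quotation from the paper, and the symbols are assumptions introduced here: $S$ is the current total count, $H$ the current entropy, $n_i$ the current count of label $i$, $\Delta_i$ the change in that count, $r = \sum_{i \in C} \Delta_i$, and $C$ the set of labels whose counts change.

```latex
H' = \frac{S}{S+r}\left(H - \log_2\frac{S}{S+r}\right)
     - \sum_{i \in C}\left(
         \frac{n_i+\Delta_i}{S+r}\,\log_2\frac{n_i+\Delta_i}{S+r}
         - \frac{n_i}{S+r}\,\log_2\frac{n_i}{S+r}
       \right)
```

The first term rescales the old entropy to the new total $S + r$; the sum then swaps the stale terms of the changed labels for their updated ones, with the convention $0 \log_2 0 = 0$.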

    import collections
    import math
    import random
    
    
    def log2(p):
        return math.log(p, 2) if p > 0 else 0
    
    
    CountChange = collections.namedtuple('CountChange', ('label', 'change'))
    
    
    class EntropyHolder:
        def __init__(self):
            self.counts_ = collections.defaultdict(int)
    
            self.entropy_ = 0
            self.sum_ = 0
    
        def update(self, count_changes):
            r = sum([change for _, change in count_changes])
    
            residual = self._compute_residual(count_changes)
    
            self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
    
            self._update_counts(count_changes)
    
            return self.entropy_
    
        def _compute_residual(self, count_changes):
            r = sum([change for _, change in count_changes])
            residual = 0
    
            for label, change in count_changes:
                p_new = (self.counts_[label] + change) / (self.sum_ + r)
                p_old = self.counts_[label] / (self.sum_ + r)
    
                residual += p_new * log2(p_new) - p_old * log2(p_old)
    
            return residual
    
        def _update_counts(self, count_changes):
            for label, change in count_changes:
                self.sum_ += change
                self.counts_[label] += change
    
        def entropy(self):
            return self.entropy_
    
    
    
    def naive_entropy(counts):
        s = sum(counts)
        return sum([-(r/s) * log2(r/s) for r in counts])
    
    
    if __name__ == '__main__':
        print(naive_entropy([1, 1]))
        print(naive_entropy([1, 1, 1, 1]))
    
        entropy = EntropyHolder()
        freq = collections.defaultdict(int)
        for _ in range(100):
            index = random.randint(0, 5)
            entropy.update([CountChange(index, 1)])
            freq[index] += 1
    
        print(naive_entropy(freq.values()))
        print(entropy.entropy())
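The demo above feeds random integer labels, but nothing in the update depends on the labels being numbers; any hashable value, such as a string identifier, should work. The following is a minimal sketch, not the original class: `IncrementalEntropy`, `add`, and `max_gap` are hypothetical names, the update handles one label at a time, and the rows are the identifier column of the question's sample data. It compares the incremental value with a full recomputation after every identifier.

```python
import math
from collections import defaultdict


class IncrementalEntropy:
    """Hypothetical compact variant of the idea above: one label per update."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.total = 0
        self.entropy = 0.0

    @staticmethod
    def _plog2(p):
        # p * log2(p), with the convention 0 * log2(0) = 0
        return p * math.log2(p) if p > 0 else 0.0

    def add(self, label):
        new_total = self.total + 1
        if self.total > 0:
            # Rescale the old entropy to the new total
            ratio = self.total / new_total
            rescaled = ratio * (self.entropy - math.log2(ratio))
        else:
            rescaled = 0.0
        # Replace the stale term of this label with its updated term
        p_old = self.counts[label] / new_total
        p_new = (self.counts[label] + 1) / new_total
        self.entropy = rescaled - (self._plog2(p_new) - self._plog2(p_old))
        self.counts[label] += 1
        self.total = new_total
        return self.entropy


def naive_entropy(counts):
    # Full recomputation over all counts, used only for comparison
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


rows = ["foo,bar", "bar,blup", "foo,bar", "foo,xy", "foo,bar", "foo,bar", "x,y"]
tracker = IncrementalEntropy()
max_gap = 0.0  # largest difference seen between the two methods
for row in rows:
    for label in row.split(","):
        incremental = tracker.add(label)
        classic = naive_entropy(tracker.counts.values())
        max_gap = max(max_gap, abs(incremental - classic))

print(max_gap)
```

Any remaining gap between the two values comes from floating-point rounding, so on long streams it may be worth recomputing the entropy from the counts occasionally to reset the drift.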
    

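    [Comments]:

      [Answer 2]:

      Thanks to @blazs for providing the entropy_holder class. That solved the problem. The idea is to import entropy_holder.py from (https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738), use it to hold the previous entropy, and update it at each step as new identifiers arrive.

      So a minimal working example looks like this: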

      import entropy_holder
      
      input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
                ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
                ["7","2008-01-06T02:13:00Z","x,y"]]
      
      entropy = entropy_holder.EntropyHolder() # This class will hold the current entropy and counts of identifiers
      for item in input_data:
          for identifier in item[2].split(","):
              entropy.update([entropy_holder.CountChange(identifier, 1)])
      
      print(entropy.entropy())
      

      The entropy computed with Blaz's incremental formula is very close to the one computed the classical way, and it avoids iterating over all the data again and again.

      [Comments]:
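As for the side question about the printed sum being 12 rather than 14: in the question's loop, identifiers_sum is assigned at the top of each iteration, before the current row's identifiers are counted, so after the loop it holds the total as it stood before the last row. A minimal sketch of the same effect, using hypothetical names (`rows`, `sum_before_last_row`, `sum_after_all_rows`) and only the identifier column of the sample data:

```python
from collections import Counter

rows = ["foo,bar", "bar,blup", "foo,bar", "foo,xy", "foo,bar", "foo,bar", "x,y"]

counts = Counter()
sum_before_last_row = 0
for row in rows:
    # Snapshot taken BEFORE this row's identifiers are counted
    sum_before_last_row = sum(counts.values())
    counts.update(row.split(","))

# Recomputed AFTER the loop, so the last row is included
sum_after_all_rows = sum(counts.values())

print(sum_before_last_row)  # 12
print(sum_after_all_rows)   # 14
```

Taking the sum after the loop, or keeping a running total that is incremented alongside the counts, gives the expected 14.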