【Question title】:Transform flair language model tensors for viewing in TensorBoard Projector
【Posted】:2020-03-22 14:37:15
【Question】:

I would like to convert "vectors",

vectors = [token.embedding for token in sentence]
print(type(vectors))
<class 'list'>

print(vectors)
[tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028]),
...
tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])]

to

0.0077 -0.0227 -0.0004 ... 0.1377 -0.0003 0.0028
...
0.0003 -0.0461 0.0043 ... -0.0126 -0.0004 0.0142

and write that to a TSV file.

As an aside: those embeddings are from flair (https://github.com/zalandoresearch/flair). How can I get the full output, rather than the abbreviated -0.0004 ... 0.1377 output?

【Comments】:

    Tags: python numpy pytorch tensorboard flair


    【Answer 1】:

    Use token.text to get each token's text and token.embedding.tolist() to get its embedding as a plain list:

    from flair.data import Sentence
    from flair.embeddings import FlairEmbeddings

    def flair_embeddings(sentences, output_file=None):
        # Open the output file only if a path was given.
        f = open(output_file, 'w', encoding='utf8') if output_file else None
        # init embedding
        flair_embedding_forward = FlairEmbeddings('news-forward')
        try:
            for i, sent in enumerate(sentences):
                print("Encoding the {}th input sentence!".format(i))
                # create a sentence
                sentence = Sentence(sent)

                # embed words in sentence
                flair_embedding_forward.embed(sentence)

                for token in sentence:
                    # One row per token: the text, then every embedding value, tab-separated.
                    row = token.text + "\t" + "\t".join(str(num) for num in token.embedding.tolist())
                    if f is not None:
                        f.write(row + '\n')
                    else:
                        print(row)
        finally:
            # Close the file even if embedding fails part-way through.
            if f is not None:
                f.close()
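
The function above puts the token and its vector on the same line. The Embedding Projector's upload dialog takes the vectors and the labels as two separate TSV files, so a combined file may need splitting first. The following is a minimal sketch of that step using only the standard library; the function name split_combined_tsv and the file names are hypothetical, and the demo vectors are made up rather than real Flair output.

```python
import csv
import os
import tempfile


def split_combined_tsv(combined_path, vectors_path, metadata_path):
    """Split rows of 'token<TAB>v1<TAB>v2...' into a vectors file and a labels file.

    Hypothetical helper: returns the number of rows written to each output file.
    """
    n_rows = 0
    with open(combined_path, encoding="utf8", newline="") as src, \
         open(vectors_path, "w", encoding="utf8", newline="") as vec_out, \
         open(metadata_path, "w", encoding="utf8", newline="") as meta_out:
        # QUOTE_NONE so that tokens containing a quote character are read literally.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        vec_writer = csv.writer(vec_out, delimiter="\t", lineterminator="\n")
        for row in reader:
            if not row:
                continue  # skip blank lines
            token, values = row[0], row[1:]
            vec_writer.writerow(values)   # numbers only, one vector per line
            meta_out.write(token + "\n")  # matching label on the same line number
            n_rows += 1
    return n_rows


# Tiny demo with made-up 3-dimensional vectors (not real Flair output).
with tempfile.TemporaryDirectory() as tmp:
    combined = os.path.join(tmp, "combined.tsv")
    with open(combined, "w", encoding="utf8") as f:
        f.write("The\t0.1\t0.2\t0.3\n")
        f.write("nucleus.\t0.4\t0.5\t0.6\n")
    vectors_file = os.path.join(tmp, "tensors.tsv")
    labels_file = os.path.join(tmp, "metadata.tsv")
    n_rows = split_combined_tsv(combined, vectors_file, labels_file)
    with open(vectors_file, encoding="utf8") as f:
        vectors_text = f.read()
    with open(labels_file, encoding="utf8") as f:
        labels_text = f.read()

print(n_rows)
```

Keeping the two files line-aligned is the point of the design: row N of the labels file describes row N of the vectors file.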
    

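    【Comments】:

      【Answer 2】:

      OK, I dug around ...

      1. It turns out those are PyTorch tensors (Flair uses PyTorch). For a simple conversion to nested Python lists (per the PyTorch docs at https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tolist and a StackOverflow answer), use the PyTorch method tolist(). If a NumPy array is needed instead, a CPU tensor also has a numpy() method.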

        >>> import torch
        >>> a = torch.randn(2, 2)
        >>> print(a)
        tensor([[-2.1693,  0.7698],
                [ 0.0497,  0.8462]])
        
        >>> a.tolist()
        [[-2.1692984104156494, 0.7698001265525818],
         [0.049718063324689865, 0.8462421298027039]]
        

      2. Per my original question, here is how to convert those data to plain text and write them to TSV files.

        from flair.data import Sentence
        from flair.embeddings import FlairEmbeddings, StackedEmbeddings
        from flair.models import SequenceTagger
        
        embeddings_f = FlairEmbeddings('pubmed-forward')
        embeddings_b = FlairEmbeddings('pubmed-backward')
        
        sentence = Sentence('The RAS-MAPK signalling cascade serves as a central node in transducing signals from membrane receptors to the nucleus.')
        
        tagger = SequenceTagger.load('ner')
        
        tagger.predict(sentence)
        embeddings_f.embed(sentence)
        
        stacked_embeddings = StackedEmbeddings([
            embeddings_f,
            embeddings_b,
        ])
        
        stacked_embeddings.embed(sentence)
        
        # for token in sentence:
        #     print(token)
        #     print(token.embedding)
        #     print(token.embedding.shape)
        
        tokens = [token for token in sentence]
        print(tokens)
        '''
          [Token: 1 The, Token: 2 RAS-MAPK, Token: 3 signalling, Token: 4 cascade, Token: 5 serves, Token: 6 as, Token: 7 a, Token: 8 central, Token: 9 node, Token: 10 in, Token: 11 transducing, Token: 12 signals, Token: 13 from, Token: 14 membrane, Token: 15 receptors, Token: 16 to, Token: 17 the, Token: 18 nucleus.]
        '''
        
        ## token.text returns the surface string directly, which is more robust
        ## than parsing the printed form of each Token with str(token).split().
        
        tokens = [token.text for token in sentence]
        print(tokens)
        '''
          ['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central', 'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors', 'to', 'the', 'nucleus.']
        '''
        
        tensors = [token.embedding for token in sentence]
        print(tensors)
        '''
          [tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028]),
          tensor([-0.0007, -0.1601, -0.0274,  ...,  0.1982,  0.0013,  0.0042]),
          tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01,  ...,  5.9336e-02, -9.4445e-05,  1.0025e-02]),
          tensor([ 0.0026, -0.0087, -0.1398,  ..., -0.0037,  0.0012,  0.0274]),
          tensor([-0.0005, -0.0164, -0.0233,  ..., -0.0013,  0.0039,  0.0004]),
          tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02,  ..., -2.8906e-03, -4.4556e-04,  5.6909e-05]),
          tensor([ 0.0035, -0.0207,  0.1700,  ..., -0.0193,  0.0017,  0.0006]),
          tensor([ 0.0159, -0.4097, -0.0489,  ...,  0.0743,  0.0005,  0.0012]),
          tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02,  ..., -6.6284e-02, 2.3646e-04,  1.0505e-02]),
          tensor([ 0.0219, -0.0677, -0.0154,  ...,  0.0102,  0.0066,  0.0016]),
          tensor([ 0.0092, -0.0431, -0.0450,  ...,  0.0060,  0.0002,  0.0005]),
          tensor([ 0.0047, -0.2732, -0.0408,  ...,  0.0136,  0.0005,  0.0072]),
          tensor([ 0.0072, -0.0173, -0.0149,  ..., -0.0013, -0.0004,  0.0056]),
          tensor([ 0.0086, -0.1151, -0.0629,  ...,  0.0043,  0.0050,  0.0016]),
          tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02,  ..., -5.4974e-04, -1.4646e-04,  6.6120e-03]),
          tensor([ 0.0038, -0.0354, -0.1337,  ...,  0.0060, -0.0004,  0.0102]),
          tensor([ 0.0186, -0.0151, -0.0641,  ...,  0.0188,  0.0391,  0.0069]),
          tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])]
        '''
        
        # ----------------------------------------
        ## Write those data to TSV files.
        
        ## https://stackoverflow.com/a/29896136/1904943
        
        import csv
        
        metadata_f = 'metadata.tsv'
        tensors_f = 'tensors.tsv'
        
        with open(metadata_f, 'w', encoding='utf8', newline='') as tsv_file:
            tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
            for token in tokens:
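                ## writerow() returns the number of characters written; assigning it to a
                ## dummy variable ( _ ) keeps that count from being echoed in an interactive session.
                ## Pass a one-item list ([token]): a bare string would be split into one column per character.
                _ = tsv_writer.writerow([token])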
        
        ## metadata.tsv :
        '''
          The
          RAS-MAPK
          signalling
          cascade
          serves
          as
          a
          central
          node
          in
          transducing
          signals
          from
          membrane
          receptors
          to
          the
          nucleus.
        '''
        
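        ## Alternative layout: all tokens on a single tab-separated row.
        ## Note: this overwrites the one-token-per-line metadata.tsv written above.
        with open(metadata_f, 'w', encoding='utf8', newline='') as tsv_file:
            tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
            _ = tsv_writer.writerow(tokens)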
        
        ## metadata.tsv :
        '''
          The   RAS-MAPK    signalling  cascade serves  as  a   central node    in  transducing signals from    membrane    receptors   to  the nucleus.
        '''
        
        with open(tensors_f, 'w', encoding='utf8', newline='') as tsv_file:
            tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
            for token in sentence:
                embedding = token.embedding
                _ = tsv_writer.writerow(embedding.tolist())
        
        ## tensors.tsv (18 lines: one embedding per token in metadata.tsv):
        ## note: enormous output, even for this simple sentence.
        '''
          0.007691788021475077  -0.02268664352595806    -0.0004340760060586035  ...
        '''
        

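      3. Finally, my intention was to load contextual language embeddings (Flair, etc.) into TensorFlow's Embedding Projector. It turns out that all I needed to do was convert the data (here, from Flair) to NumPy arrays and load them into a TensorFlow TensorBoard instance (no TSV files needed!).

        I describe this in detail in my blog post: Visualizing Language Model Tensors (Embeddings) in TensorFlow's TensorBoard [TensorBoard Projector: PCA; t-SNE; ...]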
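
The NumPy conversion mentioned in the last step can be sketched as follows. This is only an illustration under stated assumptions, not necessarily what the blog post does: the helper name to_matrix is hypothetical, and the vectors are short made-up stand-ins for what token.embedding.tolist() would return for each token. An array of this shape, together with a matching list of labels, is the kind of input that TensorBoard writers such as SummaryWriter.add_embedding in torch.utils.tensorboard accept.

```python
import numpy as np


def to_matrix(rows):
    """Stack equal-length vectors (one per token) into a 2-D float32 array.

    Hypothetical helper: 'rows' is a list of lists of floats.
    """
    matrix = np.asarray(rows, dtype=np.float32)
    if matrix.ndim != 2:
        raise ValueError("expected one equal-length vector per token")
    return matrix


# Stand-in data: three tokens with made-up 4-dimensional vectors.
# With Flair, each row would come from token.embedding.tolist().
labels = ["The", "RAS-MAPK", "signalling"]
rows = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

matrix = to_matrix(rows)

# The projector pairs row i of the matrix with label i, so the counts must agree.
assert len(labels) == matrix.shape[0]
print(matrix.shape)  # (3, 4)
```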

      【Comments】:
