【Posted】:2019-11-02 02:23:46
【Question】:
I have referred to the keras guide on using multiple inputs. However, since I am new to RNNs and CNNs, I am still confused. I am using Keras to train a neural-network classifier. My csv file has 3 fields:
- sentence
- probability
- target
Each sentence has exactly 5 words, and there are 1860 such sentences in total. probability is a float in the range [0, 1], and target is the field to be predicted (0 or 1).
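For concreteness, the data layout described above can be sketched with a tiny in-memory csv (the column names follow the description; the values are made up for illustration, and the real file may use a different name such as stv for the probability column):

```python
import csv
import io

# A tiny stand-in for the real csv file described above.
raw = """sentence,probability,target
the quick brown fox jumps,0.456736827,1
a slow green turtle crawls,0.765142873,0
"""

rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    words = row["sentence"].split()   # each sentence has exactly 5 words
    p = float(row["probability"])     # probability is a float in [0, 1]
    t = int(row["target"])            # target is the 0/1 label to predict
    print(words, p, t)
```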
I first tokenize the sentences and encode them for a randomly initialized embedding, as shown below.
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np
import gensim
import pandas as pd
import os
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from gensim.models import Word2Vec, KeyedVectors
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, GRU
from keras.initializers import Constant
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from termcolor import colored
from keras.utils import to_categorical
import tensorflow as tf
import warnings
warnings.filterwarnings("ignore")
nltk.download('stopwords')
nltk.download('punkt')  # required by word_tokenize
# one hot encode
seed = 42
np.random.seed(seed)
tf.set_random_seed(seed)
df = pd.read_csv('../../data/sentence_with_stv.csv')
sentence_lines = list()
lines = df['sentence'].values.tolist()
stv = df['stv'].values.tolist()
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))
for line in lines:
    tokens = word_tokenize(line)
    tokens = [w.lower() for w in tokens]
    stripped = [w.translate(table) for w in tokens]
    words = [word for word in stripped if word.isalpha()]
    words = [w for w in words if w not in stop_words]
    sentence_lines.append(words)
print('Number of lines:', len(sentence_lines))
EMBEDDING_DIM = 200
# Vectorize the text samples into a 2D integer tensor
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(sentence_lines)
sequences = tokenizer_obj.texts_to_sequences(sentence_lines)
print(colored(sequences,'green'))
This gives me an output like the following,
Number of lines: 1860
[[2, 77, 20, 17, 81],
[12, 21, 17, 82],
[2, 83, 20, 17, 82],
[2, 20, 17, 43],
[12, 21, 17, 81],
...
Now, I need to append the probability to each of these rows, so that the new sequences look like this.
[[2, 77, 20, 17, 81, 0.456736827],
[12, 21, 17, 82, 0.765142873],
[2, 83, 20, 17, 82, 0.335627635],
[2, 20, 17, 43, 0.5453652],
[12, 21, 17, 81, 0.446739202],
...
I tried taking each row of the sequences and appending the probability as,
sequences[x] = np.append(sequences[x], probability[x])
where probability is an array of the same size, 1860, containing only the probability values. After doing this for all the rows, I printed one row to check whether the value was appended. However, I got output like the one below.
[2. 77. 20. 17. 81. 0.456736827]
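That all-float row is NumPy's normal type promotion at work: np.append builds a new array, and mixing integer token ids with a single float upcasts the whole row to float64. A minimal sketch reproducing the behavior (values taken from the rows above):

```python
import numpy as np

row = [2, 77, 20, 17, 81]        # integer token ids from the tokenizer
prob = 0.456736827               # float probability for this row

appended = np.append(row, prob)  # np.append always returns a NEW array
print(appended.dtype)            # float64: the int ids were upcast to match the float
print(appended)                  # every entry now prints as a float, as in the output above
```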
Any advice in this regard would be greatly appreciated.
【Comments】:
Tags: python tensorflow keras tokenize embedding