【Question Title】: Getting single letters instead of sentences after applying NLTK's sentence tokenizer in Python 3.5.1
【Posted】: 2016-05-16 16:56:43
【Description】:
import codecs, os
import re
import string
import mysql
import mysql.connector
y_ = ""

'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan     Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root, file), "r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines
'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)

tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]

'''Removing punctuation marks.'''

regex = re.compile('[%s]' % re.escape(string.punctuation)) 

tokenized_docs_no_punctuation = ''

for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token

    tokenized_docs_no_punctuation += new_review
print(tokenized_docs_no_punctuation)

'''Connecting and inserting tokenized documents without punctuation in database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'test' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,(tokenized_docs_no_punctuation[i])))
        conn.commit()
        conn.close()
if __name__ == '__main__':
    connect()


After running the above code, the result looks like

            |  2 | S |   N |
            |  3 | S |   o |
            |  4 | S |     |
            |  5 | S |   d |
            |  6 | S |   o |
            |  7 | S |   u |
            |  8 | S |   b |
            |  9 | S |   t |
            | 10 | S |     |
            | 11 | S |   m |
            | 12 | S |   y |
            | 13 | S |     |
            | 14 | S |   d |

in the database.

It should be like:
     1 | S      | No doubt, my dear friend.
     2 | S      | no doubt.                                                                                                                                                                   

【Comments】:

    Tags: python mysql tokenize punctuation


    【Solution 1】:

    I would suggest the following edits (use whichever you want), but this is what I used to get your code running. Your problem is that review in for review in tokenized_docs: is already a string, so token in for token in review: iterates over single characters. Therefore, to fix that, I tried -
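    The character-splitting effect can be reproduced without NLTK or a database; it is simply how Python iterates over a string (a small illustrative sketch, not the poster's data):

    ```python
    # When the list elements are whole sentences (strings),
    # iterating one of them yields single characters.
    tokenized_docs = ["No doubt, my dear friend.", "no doubt."]

    review = tokenized_docs[0]           # a str, not a list of tokens
    tokens = [token for token in review]

    print(tokens[:4])                    # ['N', 'o', ' ', 'd']
    ```

    This is exactly why one-letter rows end up in the database table above.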

    tokenized_docs = ['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."']
    
    '''Removing punctuation marks.'''
    
    regex = re.compile('[%s]' % re.escape(string.punctuation)) 
    
    tokenized_docs_no_punctuation = []
    for review in tokenized_docs:
        new_token = regex.sub(u'', review)
        if not new_token == u'':
            tokenized_docs_no_punctuation.append(new_token)
    
    print(tokenized_docs_no_punctuation)
    

    and got this -

    ['No doubt my dear friend no doubt but in the meanwhile suppose we talk of this annuity', 'Shall we say one thousand francs a year', 'What', 'asked Bonelle looking at him very fixedly', 'My dear friend I mistook I meant two thousand francs per annum hurriedly rejoined Ramin', 'Monsieur Bonelle closed his eyes and appeared to fall into a gentle slumber', 'The mercer coughed\nthe sick man never moved', 'Monsieur Bonelle']
    

    The final format of the output is up to you. I prefer working with the list, but you can also join it into a single string.
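    Both options look like this (a minimal sketch using two of the cleaned sentences from above):

    ```python
    sentences = ['No doubt my dear friend no doubt',
                 'Shall we say one thousand francs a year']

    # Option 1: keep the list -- one element per cleaned sentence.
    print(len(sentences))    # 2

    # Option 2: concatenate everything into a single string.
    joined = " ".join(sentences)
    print(joined)
    # No doubt my dear friend no doubt Shall we say one thousand francs a year
    ```

    The list form is more convenient if each sentence should become its own database row.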

    【Comments】:

    • The output of tokenized_docs is: [['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', '"Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."', 'No answer.',....
    • Thanks. But how do we pass the elements of the list into the database?
    • Assuming splitted_sentences is of string type, this line is accurate - cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,(tokenized_docs_no_punctuation[i]))) . Otherwise, join it into a single string with " ".join(tokenized_docs_no_punctuation).
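    To store one row per cleaned sentence, each list element can be passed as a query parameter. A sketch assuming the same table and credentials as in the question; it uses mysql.connector's executemany to batch the rows, and takes explicit ids from enumerate rather than cursor.lastrowid, which is not meaningful before any INSERT has run:

    ```python
    def rows_for_insert(sentences):
        # Pair each cleaned sentence with an explicit 1-based id.
        return [(i, s) for i, s in enumerate(sentences, start=1)]

    def insert_sentences(sentences):
        import mysql.connector  # imported here so rows_for_insert stays standalone
        conn = mysql.connector.connect(user='root', password='',
                                       unix_socket='/tmp/mysql.sock',
                                       database='test')
        cursor = conn.cursor()
        cursor.executemany(
            "INSERT INTO splitted_sentences (sentence_id, splitted_sentences) "
            "VALUES (%s, %s)",
            rows_for_insert(sentences))
        conn.commit()
        conn.close()
    ```

    Opening the connection once, rather than once per row as the original loop did, also avoids repeated connect/close overhead.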
    【Solution 2】:
    nw = []
    for review in tokenized_docs[0]:
        new_review = ''
        for token in review:
            new_token = regex.sub(u'', token)
            if not new_token == u'':
                new_review += new_token
        nw.append(new_review)
    '''Inserting into database'''
    def connect():
        for j in nw:
            conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis' )
            cursor = conn.cursor()
            cursor.execute("""INSERT INTO splitted_sentences(sentence_id,  splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,j))
            conn.commit()
            conn.close()
    if __name__ == '__main__':
        connect()
    

    【Comments】:
