【Question Title】: Getting single letters instead of sentences after applying NLTK's sentence tokenizer in Python 3.5.1
【Posted】: 2016-05-16 16:56:43
【Description】:
import codecs, os
import re
import string
import mysql
import mysql.connector
y_ = ""

'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan     Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root, file), "r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines
'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)

tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]

'''Removing punctuation marks.'''

regex = re.compile('[%s]' % re.escape(string.punctuation)) 

tokenized_docs_no_punctuation = ''

for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token

    tokenized_docs_no_punctuation += new_review
print(tokenized_docs_no_punctuation)

'''Connecting and inserting tokenized documents without punctuation in database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'test' )
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,(tokenized_docs_no_punctuation[i])))
        conn.commit()
        conn.close()
if __name__ == '__main__':
    connect()


After running the above code, the result looks like

            |  2 | S |   N |
            |  3 | S |   o |
            |  4 | S |     |
            |  5 | S |   d |
            |  6 | S |   o |
            |  7 | S |   u |
            |  8 | S |   b |
            |  9 | S |   t |
            | 10 | S |     |
            | 11 | S |   m |
            | 12 | S |   y |
            | 13 | S |     |
            | 14 | S |   d |

in the database.

It should be like:
     1 | S      | No doubt, my dear friend.
     2 | S      | no doubt.                                                                                                                                                                   

【Comments】:

    Tags: python mysql tokenize punctuation


    【Solution 1】:

    I would suggest the following edits (use whichever you want), but this is what I used to get your code running. Your problem is that review in for review in tokenized_docs: is already a string, so token in for token in review: iterates over single characters. Therefore, to fix that, I tried -
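    The character-splitting effect can be reproduced without NLTK or a database; it is simply how Python iterates over a string (a small illustrative sketch, not the poster's data):

    ```python
    # When the list elements are whole sentences (strings),
    # iterating one of them yields single characters.
    tokenized_docs = ["No doubt, my dear friend.", "no doubt."]

    review = tokenized_docs[0]           # a str, not a list of tokens
    tokens = [token for token in review]

    print(tokens[:4])                    # ['N', 'o', ' ', 'd']
    ```

    This is exactly why one-letter rows end up in the database table above.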

    tokenized_docs = ['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."']
    
    '''Removing punctuation marks.'''
    
    regex = re.compile('[%s]' % re.escape(string.punctuation)) 
    
    tokenized_docs_no_punctuation = []
    for review in tokenized_docs:
        new_token = regex.sub(u'', review)
        if not new_token == u'':
            tokenized_docs_no_punctuation.append(new_token)
    
    print(tokenized_docs_no_punctuation)
    

    and got this -

    ['No doubt my dear friend no doubt but in the meanwhile suppose we talk of this annuity', 'Shall we say one thousand francs a year', 'What', 'asked Bonelle looking at him very fixedly', 'My dear friend I mistook I meant two thousand francs per annum hurriedly rejoined Ramin', 'Monsieur Bonelle closed his eyes and appeared to fall into a gentle slumber', 'The mercer coughed\nthe sick man never moved', 'Monsieur Bonelle']
    

    The final format of the output is up to you. I prefer working with the list, but you can also join it into a single string.
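    Both options look like this (a minimal sketch using two of the cleaned sentences from above):

    ```python
    sentences = ['No doubt my dear friend no doubt',
                 'Shall we say one thousand francs a year']

    # Option 1: keep the list -- one element per cleaned sentence.
    print(len(sentences))    # 2

    # Option 2: concatenate everything into a single string.
    joined = " ".join(sentences)
    print(joined)
    # No doubt my dear friend no doubt Shall we say one thousand francs a year
    ```

    The list form is more convenient if each sentence should become its own database row.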

    【Comments】:

    • The output of tokenized_docs is: [['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', '"Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."', 'No answer.',....
    • Thanks. But how do we pass the elements of the list into the database?
    • Assuming splitted_sentences is of string type, this line is accurate - cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,(tokenized_docs_no_punctuation[i]))) . Otherwise, join it into a single string with " ".join(tokenized_docs_no_punctuation).
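    To store one row per cleaned sentence, each list element can be passed as a query parameter. A sketch assuming the same table and credentials as in the question; it uses mysql.connector's executemany to batch the rows, and takes explicit ids from enumerate rather than cursor.lastrowid, which is not meaningful before any INSERT has run:

    ```python
    def rows_for_insert(sentences):
        # Pair each cleaned sentence with an explicit 1-based id.
        return [(i, s) for i, s in enumerate(sentences, start=1)]

    def insert_sentences(sentences):
        import mysql.connector  # imported here so rows_for_insert stays standalone
        conn = mysql.connector.connect(user='root', password='',
                                       unix_socket='/tmp/mysql.sock',
                                       database='test')
        cursor = conn.cursor()
        cursor.executemany(
            "INSERT INTO splitted_sentences (sentence_id, splitted_sentences) "
            "VALUES (%s, %s)",
            rows_for_insert(sentences))
        conn.commit()
        conn.close()
    ```

    Opening the connection once, rather than once per row as the original loop did, also avoids repeated connect/close overhead.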
    【Solution 2】:
    nw = []
    for review in tokenized_docs[0]:
        new_review = ''
        for token in review:
            new_token = regex.sub(u'', token)
            if not new_token == u'':
                new_review += new_token
        nw.append(new_review)
    '''Inserting into database'''
    def connect():
        for j in nw:
            conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis' )
            cursor = conn.cursor()
            cursor.execute("""INSERT INTO splitted_sentences(sentence_id,  splitted_sentences) VALUES(%s, %s)""",(cursor.lastrowid,j))
            conn.commit()
            conn.close()
    if __name__ == '__main__':
        connect()
    

    【Comments】:
