As Antti mentioned, you should prefer python3 and leave all that annoying
python2 junk behind you. The main script further down works with both python2 and python3.
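To read/write files, use the open function from the io module, which is
python2/python3 compatible. Always use the with statement to open resources such as files. A with statement wraps the execution of a block in a Python context manager. File objects implement the context manager protocol, so they are closed automatically when the with block is left.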
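As a small aside illustrating the with behaviour, the sketch below writes and reads a throwaway file and checks that both file objects are closed once their blocks are left. The helper name `roundtrip`, the file name `demo.txt` and the temporary directory are assumptions made up for this example.

```python
import io
import os
import tempfile

def roundtrip(text):
    """Write text to a scratch file, read it back, and return the
    text plus both file objects so their state can be inspected."""
    # Hypothetical scratch location; any writable path would do.
    path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)
    # Leaving the with block has already closed `out`.
    with io.open(path, encoding="utf-8") as inp:
        data = inp.read()
    return data, out, inp

data, out, inp = roundtrip(u"100 \u20ac")
print(u"closed after with: %s %s" % (out.closed, inp.closed))
```

No explicit close() call is needed, and the files are closed even if an exception is raised inside the block.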
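Independent of python, if you want to read a text file you need to know
the encoding of that file to read it correctly (if you are unsure, try utf-8
first). Also, the correct name for the UTF-8 encoding is utf-8, and mode U is
deprecated (it was removed in Python 3.11).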
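To see why the encoding matters, here is a minimal side example. The file name `euro.txt` and the sample text are assumptions for illustration only. The same bytes are read once as utf-8 and once as latin-1, and only the first read gives back the Euro sign.

```python
import io
import os
import tempfile

# Hypothetical scratch file containing a non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), "euro.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"100 \u20ac")

# Reading with the encoding the file was written in works.
with io.open(path, encoding="utf-8") as f:
    right = f.read()

# Reading with a wrong encoding does not fail here, but the three
# bytes of the Euro sign are decoded as three unrelated characters.
with io.open(path, encoding="latin-1") as f:
    wrong = f.read()

print(u"utf-8 read is correct: %s" % (right == u"100 \u20ac"))
print(u"lengths: %d vs %d" % (len(right), len(wrong)))
```

Note that the wrong encoding silently produces garbage rather than raising an error, which is why guessing is risky.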
#!/usr/bin/env python
# -*- coding: utf-8; mode: python -*-

from nltk.util import ngrams
import collections
import io, sys

def main(inFile, outFile):

    with io.open(inFile, encoding="utf-8") as i:
        # n=2 yields bigrams; raise n for longer n-grams
        bigrams = ngrams(i.read().split(), 2)

    result = collections.Counter(bigrams)
    templ = "%-10s %s\n"

    with io.open(outFile, "w", encoding="utf-8") as o:

        o.write(templ % (u"count", u"words"))
        o.write(templ % (u"-" * 10, u"-" * 30))

        # Sorting can be expensive, so filter out the items you don't
        # want first. The *count* goes first in each tuple so that the
        # sort orders by count.
        filtered = [(c, w) for w, c in result.items() if c > 1]
        filtered.sort(reverse=True)

        for count, item in filtered:
            o.write(templ % (count, " ".join(item)))

if __name__ == '__main__':
    sys.exit(main("text.txt", "out_text.txt"))
With the input file text.txt:
At eight o'clock on Thursday morning and Arthur didn't feel very good
he missed 100 € on Thursday morning. The Euro symbol of 100 € is here
to test the encoding of non ASCII characters, because encoding errors
do occur only on Thursday morning.
I get the following output in out_text.txt:
count      words
---------- ------------------------------
3          on Thursday
2          Thursday morning.
2          100 €
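If NLTK is not at hand, the counting step for the n=2 case can be approximated with zip, which makes it easy to check counts like the ones above by hand. This is only a sketch: `bigram_counts` is a hypothetical helper, not part of NLTK, and the sample sentence is made up for the example.

```python
import collections

def bigram_counts(text):
    """Count adjacent word pairs in whitespace-separated text."""
    words = text.split()
    # zip(words, words[1:]) pairs every word with its successor.
    return collections.Counter(zip(words, words[1:]))

sample = u"on Thursday morning and on Thursday evening"
counts = bigram_counts(sample)

# Show the single most frequent pair.
for pair, count in counts.most_common(1):
    print(u"%d %s" % (count, u" ".join(pair)))
```

As in the script, splitting on whitespace keeps punctuation attached to words, which is why "morning." and "morning" are counted as different tokens.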