【Posted】:2018-10-23 08:35:47
【Question】:
I am using polyglot to tokenize Burmese text. Here is what I am doing:
from polyglot.text import Text
blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
text = Text(blob)
When I run this:
print(text.words)
the output looks like this:
[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c', u'\u1000\u1039\u103b', u'\u1019', u'\u1014\u1039', u'\u1019\u102c', u'\u101c\u102f', u'\u1015\u1039', u'\u101e\u102c\u1038', u'\u1019\u103a\u102c\u1038', u'\u1000\u102d\u102f', u'\u101c\u102f\u1036', u'\u107f', u'\u1001\u1033\u1036\u1031', u'\u101b\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015\u107f', u'\u1015\u102e\u1038', u'\u1011\u102d\u102f', u'\u1004\u1039\u1038', u'\u101b\u1032', u'\u1006', u'\u1000\u1039', u'\u101c', u'\u1000\u1039', u'\u1016', u'\u1019\u1039\u1038', u'\u1006\u102e\u1038', u'\u104a', u'\u1027', u'\u100a\u1037\u1039', u'\u1005\u102c', u'\u101b', u'\u1004\u1039\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015', u'\u104d', u'\u1012', u'\u100f\u1039\u1031', u'\u1004\u103c\u1090\u102d\u102f', u'\u1000\u1039']
What is this output, and why does it look like this? How can I convert it back into a form I can read?
I also tried the following:
text.words[1].decode('unicode-escape')
But it throws an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
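For context, a minimal sketch of why that call fails, assuming Python 2 (which the u'...' prefixes in the output suggest). The tokens are already unicode objects, and calling .decode() on a unicode object makes Python 2 first encode it with the ASCII codec, which is where the UnicodeEncodeError comes from. The sample token is hard-coded from the output above, and the names `escaped` and `restored` are illustrative only.

```python
# Second token from the output above, hard-coded for illustration
# (polyglot is not needed to reproduce this).
word = u'\u1004\u1039\u1038\u1031'

# The token is already text: four code points, not an escaped byte string.
print(len(word))

# 'unicode-escape' is meant for the other direction:
# text -> ASCII bytes containing backslash escapes.
escaped = word.encode('unicode-escape')
print(escaped)

# Decoding those bytes gives back the original token.
restored = escaped.decode('unicode-escape')
print(restored == word)
```

In other words, nothing needs to be decoded here; the escapes appear only in how the list is displayed.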
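【Comments】:
-
@KenY-N I tried this, but it raises an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
-
Maybe this will help? Upgrading to Python 3 is probably the best option...
-
When you print blob, does it print correctly? If so, what happens when you print the strings in the text.words list one by one?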
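The suggestion in the last comment can be sketched as follows. This is a minimal illustration rather than a confirmed fix: the tokens are hard-coded from the start of the output above instead of coming from polyglot, and it assumes the terminal encoding can represent Burmese (for example UTF-8). On Python 2, printing a list shows the repr() of each element, which escapes non-ASCII characters, while printing a unicode string directly writes the characters themselves.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# First three tokens copied from the output above (hard-coded sample).
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']

# Printing the whole list uses repr() for each element; on Python 2 that
# is the escaped u'\u....' form seen in the question.
print(words)

# Printing each element on its own writes the actual characters,
# provided the terminal encoding supports them.
for word in words:
    print(word)

# Joining the tokens into one string also avoids the list repr.
joined = u" ".join(words)
print(joined)
```

On Python 3 the first print is also expected to show the characters, since repr() of a str keeps printable non-ASCII text.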
Tags: python unicode tokenize python-unicode