Title: How to define tokens in spaCy NLP in Python?
Posted: 2017-05-25 14:50:08
Question:

I want to use spaCy's NLP features in my FlaskApp. I have been looking through the various examples on the official sites: (for spaCy) https://spacy.io/docs/usage/tutorials

and (for Flask) https://realpython.com/blog/python/flask-by-example-part-3-text-processing-with-requests-beautifulsoup-nltk/

In MyWebapp, I have code that posts the result of the NLP analysis from parse_news_from:

@app.route('/submit', methods=['POST'])
def submit_textarea():
    if parse_news_from(format(request.form["text"])):
        print("The news was parsed successfully!")
    return talk_title

Currently parse_news_from works with the NLTK library, but I am switching to spaCy. Here is the spaCy code I took from the official sources:

from spacy.en import English
import _regex
parser = English()

# Test Data
multiSentence = "There is an art, it says, or rather, a knack to flying." \
                 "The knack lies in learning how to throw yourself at the ground and miss." \
                 "In the beginning the Universe was created. This has made a lot of people "\
                 "very angry and been widely regarded as a bad move."
# all you have to do to parse text is this:
#note: the first time you run spaCy in a file it takes a little while to load up its modules
parsedData = parser(multiSentence)

# Let's look at the tokens
# All you have to do is iterate through the parsedData
# Each token is an object with lots of different properties
# A property with an underscore at the end returns the string representation
# while a property without the underscore returns an index (int) into spaCy's vocabulary
# The probability estimate is based on counts from a 3 billion word
# corpus, smoothed using the Simple Good-Turing method.
for i, token in enumerate(parsedData):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")
    if i > 1:
        break

After running it I get this error:

File "/home/xxx/anaconda3/lib/python3.6/site-packages/_regex_core.py", line 21, in <module>
    import _regex
ImportError: /home/xxx/anaconda3/lib/python3.6/site-packages/_regex.cpython-36m-x86_64-linux-gnu.so: undefined symbol: PySlice_AdjustIndices

Is there any working example of how to get started? Where is my mistake? Thanks.
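For diagnosis (this check is not part of the original question): `PySlice_AdjustIndices` was added to the CPython C API in version 3.6.1, so a compiled `_regex` extension that references it cannot load on Python 3.6.0. A minimal version-check sketch, with illustrative messages:

```python
import sys

# PySlice_AdjustIndices entered the CPython C API in 3.6.1; an older
# 3.6.0 interpreter loading a _regex extension built against >= 3.6.1
# fails with exactly this kind of "undefined symbol" ImportError.
REQUIRED = (3, 6, 1)

if sys.version_info < REQUIRED:
    print("interpreter too old:", sys.version.split()[0])
else:
    print("interpreter ok:", sys.version.split()[0])
```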

Comments:

    Tags: python ios web-applications nlp


    Solution 1:

    I found the cause of the error above, and it was quite unexpected for me. It is described here: How to fix a python spaCy error: "undefined symbol: PySlice_AdjustIndices"?
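    As a complement to the linked fix, whether the running interpreter actually exports the missing symbol can be probed with `ctypes`; this is a diagnostic sketch, not part of the linked answer:

```python
import ctypes
import sys

print("interpreter:", sys.version.split()[0])

# ctypes.pythonapi resolves names against the interpreter's own C API;
# a missing attribute here would mean this build predates CPython 3.6.1,
# matching the "undefined symbol" failure from the compiled _regex module.
available = hasattr(ctypes.pythonapi, "PySlice_AdjustIndices")
print("PySlice_AdjustIndices available:", available)
```

    If this prints False, the fix is to upgrade the interpreter rather than to reinstall spaCy or regex.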

    Discussion:
