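【Posted】: 2021-12-18 14:16:44
【Question】:
I need the index of the `word` in the `sentence`, but sometimes the same word occurs more than once. The `phrase` information should help, as should the previous or next row in the `word` column.
Basically, I just need to identify the word in the utterance. For example, if `word` is "seaside", I want to know which "seaside" in the sentence it is. I have additional information from `phrase` that helps with the identification. The order in which the rows appear in the dataframe helps too.
This is what I have now: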
| file_id | phrase | word | sentence | word_indices |
|---|---|---|---|---|
| A | I am | I | I am a happy bird. I sing every day. I eat worms. | [0, 5, 9] |
| B | the seaside is | the | she is by the seaside. The seaside is packed. | [3, 5] |
| B | the seaside is | seaside | she is by the seaside. The seaside is packed. | [4, 6] |
| B | the seaside is | is | she is by the seaside. The seaside is packed. | [1, 7] |
| C | nobody knows | nobody | nobody knows what is going on. She can find nobody | [0, 9] |
| C | find nobody | nobody | nobody knows what is going on. She can find nobody | [0, 9] |
| D | it is such a sunny day | sunny | it is such a sunny day ah I am so happy when it's sunny such a sunny day is the best | [4, 13, 16] |
What I want to get is the content of the `target` column:
| file_id | phrase | word | sentence | word_indices | target |
|---|---|---|---|---|---|
| A | I am | I | I am a happy bird. I sing every day. I eat worms. | [0, 5, 9] | [0] |
| B | the seaside is | the | she is by the seaside. The seaside is packed. | [3, 5] | [5] |
| B | the seaside is | seaside | she is by the seaside. The seaside is packed. | [4, 6] | [6] |
| B | the seaside is | is | she is by the seaside. The seaside is packed. | [1, 7] | [7] |
| C | nobody knows | nobody | nobody knows what is going on. She can find nobody | [0, 9] | [0] |
| C | find nobody | nobody | nobody knows what is going on. She can find nobody | [0, 9] | [9] |
| D | it is such a sunny day | sunny | it is such a sunny day ah I am so happy when it's sunny such a sunny day is the best | [4, 13, 16] | [4] |
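One way the `target` column above could be computed is to locate the first occurrence of `phrase` in `sentence` and report where `word` sits inside that match. This is only a minimal sketch under assumptions not stated in the question: tokens are obtained by splitting on whitespace, matching ignores case and surrounding punctuation (so "seaside." matches "seaside"), and the helper names `normalize` and `target_indices` are hypothetical.

```python
import string


def normalize(token):
    """Lowercase a token and strip punctuation from both ends."""
    return token.strip(string.punctuation).lower()


def target_indices(sentence, phrase, word):
    """Return the index of `word` inside the first occurrence of `phrase`.

    Assumption: indices count whitespace-separated tokens of `sentence`,
    which is consistent with the `word_indices` column in the tables.
    """
    tokens = [normalize(t) for t in sentence.split()]
    phrase_tokens = [normalize(t) for t in phrase.split()]
    word = normalize(word)
    n = len(phrase_tokens)

    # Slide a window of the phrase length over the sentence tokens.
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase_tokens:
            # Map each position of `word` within the phrase back to the sentence.
            return [start + k for k, tok in enumerate(phrase_tokens) if tok == word]

    # The phrase was not found at all.
    return []


sentence = "she is by the seaside. The seaside is packed."
print(target_indices(sentence, "the seaside is", "seaside"))  # [6]
```

On a dataframe this could be applied row by row, for example with `DataFrame.apply(..., axis=1)`. The sketch does not use the row order as an extra hint.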
I found a similar question here: Find index of words in matched text, but unfortunately it is in Java and I need an answer in Python.
Thanks a lot!
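【Comments】:
- Can you give a more precise definition? I assume that if `word` is not unique in the sentence, the algorithm should look up `phrase` and return the index of the word within the first occurrence of that phrase, right? What happens if `phrase` occurs more than once? What if `word` occurs several times but `phrase` does not occur at all?
- Thanks for your comment. Yes, the questions you are asking are my questions too. Basically, I just need to identify the word in the utterance. For example, if `word` is "seaside", I want to know which "seaside" in the sentence it is. I have additional information from `phrase` that helps with the identification. The order in which the rows appear in the dataframe helps too.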
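
The two open cases raised in the comments (`phrase` matching more than once, or not at all) have no agreed answer in the thread. One possible convention, shown here purely as an assumption, is to return every candidate index across all phrase matches and to fall back to all positions of `word` when the phrase is never found. The names `_norm` and `candidate_indices` are hypothetical, and tokenization again assumes whitespace splitting with case and surrounding punctuation ignored.

```python
import string


def _norm(token):
    """Lowercase a token and strip punctuation from both ends."""
    return token.strip(string.punctuation).lower()


def candidate_indices(sentence, phrase, word):
    """Collect indices of `word` from every occurrence of `phrase`.

    Assumed fallback: if `phrase` never matches, return all positions
    of `word` in the sentence so the caller can disambiguate otherwise.
    """
    tokens = [_norm(t) for t in sentence.split()]
    phrase_tokens = [_norm(t) for t in phrase.split()]
    word = _norm(word)
    n = len(phrase_tokens)

    hits = []
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase_tokens:
            hits.extend(start + k for k, tok in enumerate(phrase_tokens) if tok == word)

    if hits:
        return sorted(set(hits))
    # Phrase not found: every occurrence of the word remains a candidate.
    return [i for i, tok in enumerate(tokens) if tok == word]


sentence = ("it is such a sunny day ah I am so happy when it's sunny "
            "such a sunny day is the best")
print(candidate_indices(sentence, "such a sunny day", "sunny"))  # [4, 16]
```

When more than one candidate remains, the row order in the dataframe mentioned above could serve as a tie-breaker. That step is not implemented in this sketch.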
Tags: python pandas indexing nlp nltk