Extracting data from text files: answers

【Question title】: Extracting data from text files
【Posted】: 2020-12-20 03:58:37
【Question】:

I have a file containing nearly 2000 English tweets. It looks like this:

{"data":[{"no.":"1241583652212862978","created":"2020-03-22T04:33:04.000Z","tweet":"@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?"},{"no.":"1241583655538941959","created":"2020-03-22T04:33:05.000Z","tweet":" I know it’s from a few days ago, but these books are in good shape}, .......]}

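I only want the tweets. How can I extract just the tweet text from the text file? Any suggestions would help. Thanks in advance.

【Comments】:

Does this answer your question? Reading JSON from a file?
Hi @Rakesh, thanks for your reply, but that doesn't solve my problem. I'm trying to solve this using only the "re" package, so it doesn't help me much.
There's no need for a regular expression here... it's a JSON file. You can access the information you need by key.
@Rakesh, the file is a ".txt" file, not a ".json" file. The problem I'm working on requires me to use regular expressions.

Tags: python-3.x twitter text-files text-extraction tweets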

【Solution 1】:

Your file is in JSON format. Take a look at Python's json library, which will let you extract the tweets. https://docs.python.org/3/library/json.html
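As a rough illustration of this suggestion, here is a minimal sketch using the standard json module. It assumes the file has the shape shown in the question (a top-level "data" list whose items each carry a "tweet" key) and that it is UTF-8 encoded. The function names and the file name in the usage note are made up for illustration. The json module parses by content, so a ".txt" extension is not a problem.

```python
import json


def tweets_from_json_text(text):
    """Parse JSON text and return the "tweet" value of every record under "data"."""
    payload = json.loads(text)
    return [record["tweet"] for record in payload["data"]]


def tweets_from_file(path):
    """Read a file (any extension) and extract the tweets from its JSON content."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        return tweets_from_json_text(f.read())
```

Usage would look something like tweets_from_file("tweets.txt"), where "tweets.txt" is a hypothetical file name.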

【Comments】:

Hi @wildener, is it possible to solve this with a regular expression?
Well, JSON is by far the best solution, but yes, you can use this pattern: \"tweet\":\"(.*?)\"} See it here: regex101.com/r/qfbjgY/1
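For the regex-only route, the pattern from the comment above might be applied with re.findall roughly as follows. This is a sketch, not a definitive implementation: the function names are hypothetical, and the second pattern is my own variant rather than something from the thread. The lazy pattern stops at the first "} it meets, so a tweet containing an escaped quote followed by a closing brace gets cut short. The variant steps over backslash escapes instead. In both cases JSON escapes such as \" or \n are returned raw, not decoded.

```python
import re

# Pattern from the comment: lazily capture everything between "tweet":" and "}
# Assumption: "tweet" is the last key in each record, as in the question's sample.
LAZY_PATTERN = re.compile(r'"tweet":"(.*?)"\}')

# Variant (an assumption, not from the thread): treat a backslash plus the next
# character as one unit, so an escaped quote inside a tweet does not end the match.
ESCAPE_AWARE_PATTERN = re.compile(r'"tweet"\s*:\s*"((?:[^"\\]|\\.)*)"')


def tweets_lazy(text):
    """Return tweet texts using the pattern suggested in the comment."""
    return LAZY_PATTERN.findall(text)


def tweets_escape_aware(text):
    """Return tweet texts (escapes left raw) using the escape-aware variant."""
    return ESCAPE_AWARE_PATTERN.findall(text)
```

On data like the sample in the question both functions should return the same list. They differ only when a tweet contains escaped quotes.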

【Solution 2】:

Assuming you use d to refer to the object, it's simple:

tweet = d["data"][0]["tweet"]

Also, in case it helps, here's an example of what I did in the shell with your sample:

>>> d = {'data': [{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]}
>>> print(d["data"])
[{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]
>>> print(d["data"][0]["tweet"])
@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?
>>>
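The session above pulls out only the first tweet. To collect every tweet, the same indexing can be applied in a loop. The sketch below assumes d has the structure shown above. Skipping records that lack a "tweet" key is my own assumption, not part of the original answer.

```python
# Same structure as the d used in the shell session above.
d = {'data': [
    {'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z',
     'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'},
    {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z',
     'tweet': ' I know it’s from a few days ago, but these books are in good shape'},
]}

# Collect the "tweet" value of every record; records without that key are skipped.
tweets = [record["tweet"] for record in d["data"] if "tweet" in record]

for tweet in tweets:
    print(tweet)
```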

【Comments】: