python/perl script for text extraction [closed]

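【Question title】: python/perl script for text extraction [closed]
【Posted】: 2014-10-10 13:04:47
【Question description】:

I am currently working in machine learning mathematics (NLP, to be precise). While working on a task I ran into a problem. I want to print the lines that contain any of the following regular expressions: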

1)fbchat

2)fb_timeline

3)Facebook Wall Post

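into separate text files, one for each of the strings mentioned above.

Then, within each resulting text file, I want to sort the lines by the thread ID field from messages.dmp. I am a theoretical person with very little programming experience.

The download link for the database dump is below:

messages.dmp

Update:

Here is the script I tried to write: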

import re
from sys import argv

scrip, file_name = argv

dfile = open(file_name, 'r')

for line in dfile:
    if re.match("fbchat", line):
        print line

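But the script does nothing.

【Question comments】:

I understand that you are a theoretical person with very little programming experience, but please refer to the help: you should not ask questions you haven't tried to find an answer for, and you need to show your work.
@KobiK I have updated my question... please take a look.

Tags: python regex perl text

【Solution 1】:

Given the following text file, file.txt: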

text1
fbchat !
text2
Facebook Wall Post line

You can use the following code:

# open the file
with open('c:\\file.txt') as f:
    # read all lines into a list
    lines = f.readlines()
# iterate over the list
for line in lines:
    # print the line if it contains any of the strings in the list
    # (using a generator expression, a close relative of list comprehensions)
    if any(s in line for s in ['fbchat', 'fb_timeline', 'Facebook Wall Post']):
        # remove the newline character before printing
        print(line.strip("\n"))

Output:

fbchat !
Facebook Wall Post line

You can read more about Python With, Python: List Comprehensions, and Strip().
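The question also asks for the matching lines to go into a separate file per string rather than to the screen. A minimal sketch of one way to do that in Python 3 follows. The output file names and the function name `split_by_keyword` are hypothetical, chosen only for illustration.

```python
# Keywords from the question mapped to output file names.
# ASSUMPTION: the output file names are made up; rename them as needed.
KEYWORDS = {
    "fbchat": "fbchat.txt",
    "fb_timeline": "fb_timeline.txt",
    "Facebook Wall Post": "facebook_wall_post.txt",
}


def split_by_keyword(input_path, keywords=KEYWORDS):
    """Copy each line of input_path into the file of every keyword it contains."""
    # Open one output file per keyword.
    outputs = {kw: open(path, "w") for kw, path in keywords.items()}
    try:
        with open(input_path) as source:
            # Iterate over the file line by line instead of loading it all at once.
            for line in source:
                for kw, out in outputs.items():
                    if kw in line:
                        out.write(line)
    finally:
        # Close every output file, even if reading failed part-way through.
        for out in outputs.values():
            out.close()
```

Calling `split_by_keyword("messages.dmp")` would then create the three files in the current directory. A line that contains more than one of the strings is written to each matching file.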

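【Comments】:

Thanks, this works... but could you point out what was wrong with my script? That would help a lot, since I intend to learn. Thanks also for sharing the extra resources; it is hard for a newbie to filter through the mass of material that Google turns up for any reference query.
Glad it helped. Your problem lies in understanding re.match(). You can read this tutorial on regular expressions and Python, which is short and easy to follow, and you can also try this post.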
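To make the re.match() point concrete: re.match() only tries to match at the beginning of the string, whereas re.search() scans the whole string. The sample lines below are made up for illustration, since the real dump is not shown; this is a small Python 3 sketch, not the asker's data.

```python
import re

# Made-up sample lines; the real layout of messages.dmp is not shown in the question.
lines = [
    "fbchat at the start of the line",
    "some prefix, then fbchat later on",
    "no keyword here",
]

# re.match() succeeds only when the pattern matches at position 0.
matched = [line for line in lines if re.match("fbchat", line)]

# re.search() finds the pattern anywhere in the line.
searched = [line for line in lines if re.search("fbchat", line)]

print(matched)
print(searched)
```

If "fbchat" never appears at the very start of a line in the dump, a loop built on re.match("fbchat", line) prints nothing, which would be consistent with what the question reports. Swapping in re.search(), or a plain `"fbchat" in line` test, avoids that.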
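But I could not come up with a solution for the second part... that is, once I have all the lines containing fbchat in the chat file, how can I sort those lines by the thread ID field in them? Please take a look... thanks.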
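For the sorting question, one possible approach is `sorted()` with a key function that pulls the thread ID out of each line. The thread does not show what a line of messages.dmp looks like, so the sketch below assumes, purely for illustration, that every line carries its ID as `thread_id=<digits>`. The pattern, the sample lines, and the helper names are all hypothetical and would need adapting to the real dump.

```python
import re

# ASSUMPTION: each line carries its thread ID as "thread_id=<digits>".
# The real field layout of messages.dmp is not shown, so adjust this pattern.
THREAD_ID = re.compile(r"thread_id=(\d+)")


def thread_id_key(line):
    """Return the numeric thread ID of a line, or -1 if none is found."""
    found = THREAD_ID.search(line)
    return int(found.group(1)) if found else -1


def sort_by_thread_id(lines):
    """Return the lines ordered by thread ID; lines without an ID come first."""
    return sorted(lines, key=thread_id_key)


# Made-up sample lines in the assumed format.
sample = [
    "fbchat thread_id=42 hello",
    "fbchat thread_id=7 hi",
    "fbchat thread_id=13 hey",
]
for line in sort_by_thread_id(sample):
    print(line)
```

Converting the ID to `int` makes the order numeric, so 7 sorts before 13; comparing the raw strings would put "13" before "7".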