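[Posted]: 2016-06-13 12:09:30
[Question]:
Hello, I'm looking for a way to extract the header names (shown in bold) from this block of text (originally from an mbox file). I tried the following regex, which works in Sublime Text's regex search but not in Python: ^\w+-?(\w+)?-?(\w+)?: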
rgex = re.findall('^\w+-?(\w+)?-?(\w+)?:', mail);
This is what mail contains:
X-Apparently-To: test@yahoo.com; Thu, 09 Jun 2016 13:41:21 +0000
Return-Path:
Received-SPF: pass (domain of yahoo.com designates 72.30.235.45 as permitted sender)
Received: from 127.0.0.1 (EHLO n3-vm9.bullet.mail.bf1.yahoo.com) (72.30.235.45) by mta1287.mail.ne1.yahoo.com with SMTPS; Thu, 09 Jun
2016 13:41:21 +0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo-inc.com; s=yibm; t=1465479679;
To: test@yahoo.com
From: "Yahoo"
Reply-To: "Yahoo"
X-YMailISG:PCypxycWLDvGv4Bg8ShrtzVYi3vpFMAjYaqWyWybcVJ_ZQff eyquyqb..Qu6UKhX_Tyz5b3da2iDtRStJpVnNulZHOb8GznJQTCKk9sjvboS KsbzY4E1uScWz0Ieo0jjG0YHrB1dTCzOSeMiPNumCCFS1sR3_SkyMBGG_D2D wWtdRducxLa2YgEMMubVpMtNJMBv.bwk0.E.jQNEy8I3LnJEqcDpmIUM7bZL XgkEFz7yl1Zo6Sj4r0z6pGlVIFOql7uG9Bwq2VJoK1Q1upKJUOBfQqzf64y2 9fXLnQsWENpZloxwncGzLhdzEYGgE3xNuFV8QFxZGXyvtKZFoykH49M03URN jtx8Yg6ypjyRbBIRVJGVFbjAvW6io3yeyIFh042jlgYQtLxbneFA60hn9ifT Mit3bQ5l7Tginw0OgRM2cbqLo0tEZFt9vlN597Z3vPGwsVdBcTp9wnk6orj2 TqjEpAmODy3Yru2HzDP7Dbwq9CGaIozUm91VNWqw5Dy7AMQEsuvnBop7Fflk G21m1WKMBgrS.2bOLQ4797E09LjlyyoWI9FouUNNhDljnPPf2AeKUKzauctw ULOQPveWAm4lDsNLMp5yvXDYNIe5HMor84SVd8_xF3Icna1PAftXGzJUHrXK NZSEN_VO0GprGfaNQg4uSW_0wXFXwC6TYQ4CMjz53o0qNGpILogVfRLwFCFL DtW8nimkLLsNzmDajzJsR_juA86Orw2NE5ED4qdpPxmyxyrXYOQPu3O6zeYf 7mBzU0aX7VHJUxJ4L3HDB9qTjbTaCdnySrnjGtd7u9Cn9yRJirDNeg3UA82P PeA1ZDfc0vKdrn5QI6e6YKa2TTt7Dspy3jObgSapH5epc3LyQVyN7yjpxrq_ MXAbpqedjUfcwq3c7lpt8xxUxy.MXWg0fJO059xijvb_sYTaQTGUWAMeVU.6 IW.hSksejwpn._CgE9Kqabbk5qgYIdYRW1pmz5OBYh0skCX1TrFRuxbGvDit R_wr.wbTpJGiSST.b0ZetmgN72bVvlRtmNPw1Dk.zxaacXxhGSMWupPUDLJZ OMrap2ax8oiQrxT3jIhk8seIkaNJ.tGUhlPx6G4lJJaz0g89LmjBaEjGUG8P W3Phh9db3hjxUIX5UC0jg5ai2XZ7u_wXn2Muk61N1eRCZ0oA2S25YDPK1dh。 3VQ6pH8SSBxVkQHUJXbZUNqLAzi5V5wRS7oeitXERGgA2DiZB268.rJxS7di OMT5eGoITG4LnAo1M3nsVQ6xceHDd4v6KD9KfBgTHX_iLUv_skCv4dVUgVvj edKOFiOMHBTpJ9J9BECjTTzEUpc.fCNUcRwSsiSkqbRhUsAdCbxQZir3Nb1Z 6FzI6J2eNqpj4azjmDeI15R8MyN7VFc6bl6pCZySk2Tx5SQESDm.sVkADSVR pI2nuscEjU3xo_qGUxbh5mbAA17K2zYpcFXaOce8_9Eszos5pURCcdtBYUqI I_DOtvNe.zWY1ShRcr9ZzTj3ibmc7NBmvumhVMjqirb12mfJ6oxHv8d86gze HtAJmJghczUg5otSzdxSgEJJxjMZrzSidJ9FP.gPiPWtuukz82YpZ32MnCVs 6.V2DRxpUmZa31KH93QSEzwMlCn3FFTLBv9izcjoFP81yeAn.3QloF8XIC3K WmtXtloyeGjuygAhlkd_prXmmGGC5JmPlY8xu4k1NavkdDh6pG6zIkt83Wsd p.D.0BgM
X-Originating-IP: [75.30.245.45]
Authentication-Results: mta1287.mail.ne1.yahoo.com from=yahoo-inc.com; domainkeys=neutral (no sig); from=yahoo-inc.com; dkim=pass (ok)
[Comments]:
-
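Hmm... try
(?![^\s:]*\d+:\d+)([^\s:]+):
-
Is there a reason not to just use the email package to parse the headers (and thereby get the header names)?
-
donkopotamus - Yes, in my case it has to come from a file
-
Wiktor Stribiżew - Thanks, that did it :)
-
@user1731805 You can just read the file and pass the text to email.Parser (which will read rfc822-style messages)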
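
A minimal sketch of the two suggestions in the comments, in case it helps. It assumes the message text is already in a string and that long headers are folded the usual way, with continuation lines starting with whitespace. The sample `mail` value is a shortened, hypothetical stand-in for the text in the question. The regex is a simplified pattern, not the one from the question or the comment. Two things likely explain why the original attempt behaved differently in Python: without re.MULTILINE, ^ only matches at the very start of the string, and re.findall returns the captured groups rather than the whole match when the pattern contains groups. The sketch uses email.message_from_string instead of the email.Parser module named in the comment, since that module is spelled email.parser in Python 3.

```python
import re
from email import message_from_string

# Hypothetical, shortened sample standing in for the `mail` string in the question.
mail = (
    "X-Apparently-To: test@yahoo.com; Thu, 09 Jun 2016 13:41:21 +0000\n"
    "Received-SPF: pass (domain of yahoo.com designates 72.30.235.45 as permitted sender)\n"
    "Received: from 127.0.0.1 (EHLO n3-vm9.bullet.mail.bf1.yahoo.com) (72.30.235.45)\n"
    "  by mta1287.mail.ne1.yahoo.com with SMTPS; Thu, 09 Jun\n"
    " 2016 13:41:21 +0000\n"
    "To: test@yahoo.com\n"
    "X-Originating-IP: [75.30.245.45]\n"
    "\n"
    "body text\n"
)

# Approach 1: let the standard-library email package parse the headers.
# keys() lists the header names in the order they appear in the message.
header_names = list(message_from_string(mail).keys())

# Approach 2: a simplified regex (an assumption, not the pattern from the thread).
# re.MULTILINE makes ^ match at the start of every line, and the single capture
# group means findall returns just the header name without the trailing colon.
regex_names = re.findall(r"^([\w-]+):", mail, flags=re.MULTILINE)

print(header_names)
print(regex_names)
```

If the text comes from a file, reading it with open(...).read() and passing the result to message_from_string should work the same way. The parser approach is probably the more robust of the two, since it handles folded headers and stops at the blank line before the body.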