【Posted】:2021-06-29 22:02:52
【Question】:
Can anyone tell me how to import an unstructured file into pandas?
By unstructured I mean:
- a log file with variable-length lines, like this:
2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Default opID=823a15d0] Accepted password for user root from 127.0.0.1
2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Vimsvc opID=823a15d0] [Auth]: User root
2021-01-26T09:40:01.193Z info hostd[2101947] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=823a15d0] Event 24138 : User root@127.0.0.1 logged in as pyvmomi
2021-01-26T09:40:01.268Z info hostd[2101940] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=823a15de user=root] Event 24139 : User root@127.0.0.1 logged out (login time: Tuesday, 26 January, 2021 09:40:01 AM, number of API invocations: 0, user agent: pyvmomi)
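I tried several approaches and did some googling, but everyone seems to be importing well-structured CSV files, and I could not find any reference on importing log files. (I'm not a programmer, I just want to write this small program with pandas.)
*Several attempts along these lines:*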
# giving a range for column names, but this is not adequate: if I want to search through the logs for errors later, I'd have to use all 54 columns?!
pd.read_csv("mylog",sep='\s+',header=None,error_bad_lines=False, engine="python",quoting=csv.QUOTE_NONE,names=range(55))
# or putting everything into index :D
pd.read_csv("mylog",sep='\t', lineterminator='\n', index_col=0)
*oh yeah, want to use timeframe as INDEX column*
pd.read_csv("mylog", sep = None, iterator = True)
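The idea is:
- have the timeframe (timestamp) as the index
- have the other entries in a second (or second and third) column, so that searching for strings/errors is easy
Thanks in advance!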
【Discussion】:
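One possible approach, sketched here rather than offered as a definitive answer: skip `read_csv`, split each line yourself with a regular expression, and build the DataFrame from the pieces. The sketch assumes every line starts with a timestamp and a level separated by whitespace, and treats everything after that as one free-text message column. `SAMPLE`, `LINE_RE` and `read_log` are made-up names for illustration; the sample lines are the ones from the question.

```python
import io
import re

import pandas as pd

# The four lines from the question, standing in for the real log file.
SAMPLE = """\
2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Default opID=823a15d0] Accepted password for user root from 127.0.0.1
2021-01-26T09:40:01.192Z info hostd[2101947] [Originator@6876 sub=Vimsvc opID=823a15d0] [Auth]: User root
2021-01-26T09:40:01.193Z info hostd[2101947] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=823a15d0] Event 24138 : User root@127.0.0.1 logged in as pyvmomi
2021-01-26T09:40:01.268Z info hostd[2101940] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=823a15de user=root] Event 24139 : User root@127.0.0.1 logged out (login time: Tuesday, 26 January, 2021 09:40:01 AM, number of API invocations: 0, user agent: pyvmomi)
"""

# Assumed layout: timestamp, level, then the rest of the line as free text.
LINE_RE = re.compile(r"^(?P<timestamp>\S+)\s+(?P<level>\S+)\s+(?P<message>.*)$")


def read_log(handle):
    """Parse an open text file into a DataFrame indexed by timestamp."""
    rows = []
    for line in handle:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:  # lines that do not fit the assumed layout are skipped
            rows.append(match.groupdict())
    df = pd.DataFrame(rows, columns=["timestamp", "level", "message"])
    # Turn the ISO 8601 strings into timezone-aware datetimes.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    return df.set_index("timestamp")


df = read_log(io.StringIO(SAMPLE))

# Plain, case-insensitive substring search over the free-text column.
hits = df[df["message"].str.contains("password", case=False, regex=False)]
print(hits)
```

For a real file, something like `with open("mylog") as fh: df = read_log(fh)` should work in place of the `io.StringIO` call. Keeping the whole tail of the line in a single `message` column avoids the 54-column problem, because the search only ever has to look at one column.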