Filter out extra headers in middle of table: answers

【Question title】: Filter out extra headers in middle of table
【Posted】: 2017-10-07 04:02:09
【Question description】:

I am trying to import a very large data file. It is a text file structured like this:

***** Information about Data ***********
Information about data
Information about Data
Information about Data

Information about Data

    Col1     Col2
     1.0      1.0
     1.0      1.0
     1.0      1.0
     1.0      1.0
     ...(10k+ lines)
     1.0      1.0
     1.0      1.0
***** Information about Data ***********
Information about data
Information about Data
Information about Data

Information about Data

    Col1     Col2
     1.0      1.0
     1.0      1.0
     1.0      1.0
     1.0      1.0
     ...(10k+ lines)
     1.0      1.0
     1.0      1.0

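This pattern repeats an arbitrary number of times. The number of rows between headers varies, and the whole file is more than 1 million lines long.

Is there a way to strip these headers without going through the file line by line? I wrote a line-by-line search, but it is too slow to be practical.

The header differs slightly each time it appears.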

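【Comments】:

Is Header info literally Header info?
No, I will edit accordingly.
np.genfromtxt accepts any input that feeds it lines one at a time. Since it already reads the file with readline, inserting a line-by-line search into the pipeline should not slow it down. Using pandas' compiled reader may be a different story.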
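
The last comment can be sketched as a generator that sits between the file and np.genfromtxt. This is only a minimal sketch: the helper name numeric_lines and the rule "keep a line only if every whitespace-separated field parses as a float" are assumptions for illustration, not something stated in the thread, and a header line made up entirely of numbers would slip through.

```python
import numpy as np

def numeric_lines(lines):
    """Yield only the lines whose whitespace-separated fields all parse as floats.

    Hypothetical helper: the filtering rule is an assumption about what
    separates data rows from header text.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            for field in fields:
                float(field)
        except ValueError:
            continue  # header text or column names
        yield line

# Small in-memory stand-in for the real file.
sample = """***** Information about Data ***********
Information about data

Information about Data

    Col1     Col2
     1.0      1.0
     1.0      1.0
***** Information about Data ***********
Information about data

    Col1     Col2
     1.0      1.0
     1.0      1.0
"""

# genfromtxt pulls lines from the generator one at a time.
data = np.genfromtxt(numeric_lines(sample.splitlines()))
```

For a real file, an open file object can be passed in place of sample.splitlines(), for example np.genfromtxt(numeric_lines(f)) inside a with open(...) as f block.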

Tags: python-2.7 pandas numpy

【Solution 1】:

Assume your file is named test.txt

Read the whole file in as a single string

Split on '\n*'

     new line
             \ 
  1.0      1.0
***** Information about Data ***********
 \
  followed by asterisks

rsplit each piece on '\n\n' and take the last element of the result

       first new line
                     \
Information about Data

 \
  second new line
    Col1     Col2
     1.0      1.0
     1.0      1.0
     1.0      1.0

Parse each remaining piece with read_csv
Combine the resulting frames with pd.concat

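from io import StringIO
import pandas as pd

def rtxt(txt):
    # Parse one whitespace-delimited table held in a string.
    # sep=r'\s+' behaves like delim_whitespace=True, which newer pandas deprecates.
    return pd.read_csv(StringIO(txt), sep=r'\s+')

fname = 'test.txt'

# Read the whole file at once and close it promptly.
with open(fname) as f:
    text = f.read()

pd.concat(
    # '\n*' cuts before each header block; '\n\n' keeps what follows the last blank line.
    [rtxt(st.rsplit('\n\n', 1)[-1])
     for st in text.split('\n*')],
    ignore_index=True
)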

    Col1  Col2
0    1.0   1.0
1    1.0   1.0
2    1.0   1.0
3    1.0   1.0
4    1.0   1.0
5    1.0   1.0
6    1.0   1.0
7    1.0   1.0
8    1.0   1.0
9    1.0   1.0
10   1.0   1.0
11   1.0   1.0
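
To see what the two splitting steps produce before pandas gets involved, here is a minimal sketch in plain Python. The sample string and its distinct values (1.0, 2.0, 3.0) are made up for illustration so the pieces can be told apart; they are not taken from the original file.

```python
# Hypothetical sample with two header blocks and distinct values per row.
sample = (
    "***** Information about Data ***********\n"
    "Information about data\n"
    "\n"
    "Information about Data\n"
    "\n"
    "    Col1     Col2\n"
    "     1.0      1.0\n"
    "     2.0      2.0\n"
    "***** Information about Data ***********\n"
    "Information about data\n"
    "\n"
    "    Col1     Col2\n"
    "     3.0      3.0\n"
)

# Step 1: cut at every newline that is immediately followed by an asterisk.
chunks = sample.split('\n*')

# Step 2: inside each chunk, keep only the text after the last blank line,
# which is the column-name row plus the data rows.
tables = [chunk.rsplit('\n\n', 1)[-1] for chunk in chunks]

for table in tables:
    print(repr(table))
```

Note that this relies on the table itself containing no blank line and on no blank line following the last data row of a block; otherwise rsplit would cut in the wrong place.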

【Discussion】: