【Question title】: Convert my data to a DataFrame where each row contains a list of tuples for each sentence
【Posted】: 2020-11-25 23:34:13
【Question description】:

I want to read a .dat file in Python. I tried several ways of reading it and ended up with this code:

datContent = open("..\\data\\train.dat.abs", 'r')
MyList=[]
for line in datContent:
    print(line)

It prints the content in this form:

1   Should  O
2   students    O
3   be  O
4   taught  O
5   to  O
6   compete O
7   or  O
8   to  O
9   cooperate   O
10  ?   O

------------------> THIS SHOWS, STARTING OF THE NEXT SENTENCES

1   It  O
2   is  O
3   always  O
4   said    O
5   that    O
6   competition O
7   can O
8   effectively O
9   promote O
10  the O
11  development O
12  of  O
13  economy O
14  .   O

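But I want to extract the second and third columns (the token and its tag) as a list of tuples:

[(Should, O), (students, O), (be, O), (taught, O), (to, O), (compete, O), (or, O), (to, O), (cooperate, O), (?, O)]

Each sentence (sentences are separated by a blank line in the original format) should become one row of the DataFrame. I tried splitting the lines. This is what I have so far: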

datContent = open("..\\data\\train.dat.abs", 'r', encoding='utf-8' )
MyList=[]
for line in datContent:
    a=line.split()
    print(a)

The result looks like this:

['1', 'Should', 'O']
['2', 'students', 'O']
['3', 'be', 'O']
['4', 'taught', 'O']
['5', 'to', 'O']
['6', 'compete', 'O']
['7', 'or', 'O']
['8', 'to', 'O']
['9', 'cooperate', 'O']
['10', '?', 'O']
[]
['1', 'It', 'O']
['2', 'is', 'O']
['3', 'always', 'O']
['4', 'said', 'O']
['5', 'that', 'O']
['6', 'competition', 'O']
['7', 'can', 'O']
['8', 'effectively', 'O']
['9', 'promote', 'O']
['10', 'the', 'O']
['11', 'development', 'O']
['12', 'of', 'O']
['13', 'economy', 'O']
['14', '.', 'O']

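As I said, I want to save:

[(Should, O), (students, O), (be, O), (taught, O), (to, O), (compete, O), (or, O), (to, O), (cooperate, O), (?, O)]

as one row of the DataFrame (basically the 2nd and 3rd items of each list above). As you can see, the empty list [] separates the sentences.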

df

row 1 = [(Should, O), (students, O), (be, O), (taught, O), (to, O), (compete, O), (or, O), (to, O), (cooperate, O), (?, O)]
row 2 = ...

and so on.

【Comments】:

    Tags: python regex dataframe


    【Solution 1】:

    Try this:

    See the regex demo for more information.

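    import re

    # Returns a dict of the form {'row1': [...], 'row2': [...], ...}
    def getRowContainer(data):
        rowContainer = {}
        rowData = []
        rowCount = 1
        # Matches either an "<index> <token> <tag>" line (capturing token and tag)
        # or a "------>" separator line (both captured groups are then empty).
        dataSet = re.findall(r'(?:^\d{1,14}\s+([a-zA-Z0-9?!.,]{1,20})\s+([^\s]+))|^-{1,20}>', data, flags=re.MULTILINE)
        for item in dataSet:
            if item[0] == '':  # separator line: start a new row
                rowCount += 1
                rowData = []
                continue
            rowData.append(item)
            rowContainer[f'row{rowCount}'] = rowData
        return rowContainer

    rows = getRowContainer(data)

    for x in range(1, len(rows) + 1):
        print(f'row {x}')
        print(rows[f'row{x}'])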
    

    I used your input data as follows (define data before calling getRowContainer):

    data='''
    1   Should  O
    2   students    O
    3   be  O
    4   taught  O
    5   to  O
    6   compete O
    7   or  O
    8   to  O
    9   cooperate   O
    10  ?   O
    
    ------------------> THIS SHOWS, STARTING OF THE NEXT SENTENCES
    
    1   It  O
    2   is  O
    3   always  O
    4   said    O
    5   that    O
    6   competition O
    7   can O
    8   effectively O
    9   promote O
    10  the O
    11  development O
    12  of  O
    13  economy O
    14  .   O'''
    

    The output I get:

    row 1
    [('Should', 'O'), ('students', 'O'), ('be', 'O'), ('taught', 'O'), ('to', 'O'), ('compete', 'O'), ('or', 'O'), ('to', 'O'), ('cooperate', 'O'), ('?', 'O')]
    row 2
    [('It', 'O'), ('is', 'O'), ('always', 'O'), ('said', 'O'), ('that', 'O'), ('competition', 'O'), ('can', 'O'), ('effectively', 'O'), ('promote', 'O'), ('the', 'O'), ('development', 'O'), ('of', 'O'), ('economy', 'O'), ('.', 'O')]
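
The split() output in the question shows an empty list between the two sentences, which suggests the real file may separate sentences with a blank line only, and that the "------>" line could be an annotation added for the post. Under that assumption, a minimal regex sketch that splits on blank lines instead of relying on the arrow line might look like this. The names rows_from_blank_lines and LINE_RE are hypothetical and are not part of the answer above.

```python
import re

# Hypothetical helper (not from the answer above).
# One "<index> <token> <tag>" line; [ \t] keeps a match from spilling
# over into the next line.
LINE_RE = re.compile(r'^[ \t]*\d+[ \t]+(\S+)[ \t]+(\S+)[ \t]*$', flags=re.MULTILINE)

def rows_from_blank_lines(text):
    """Return one list of (token, tag) tuples per blank-line-separated block."""
    rows = []
    for block in re.split(r'\n\s*\n', text.strip()):
        pairs = LINE_RE.findall(block)
        if pairs:  # skip blocks with no token lines, e.g. an annotation line
            rows.append(pairs)
    return rows

sample = "1\tShould\tO\n2\tstudents\tO\n\n1\tIt\tO\n2\tis\tO\n"
print(rows_from_blank_lines(sample))
```

Because blocks without token lines are skipped, the same function also copes with input that does contain a separator line between two blank lines.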
    

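    【Comments】:

      【Solution 2】:

      In short, the solution is to collect the tuples of each sentence in a temporary list, append that temporary list to MyList whenever a blank line is reached, and finally build the DataFrame, as follows: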

      import pandas as pd
      
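      MyList = []    # one list of (token, tag) tuples per sentence
      tmp_list = []  # tuples of the sentence currently being read

      # Use a context manager so the file is closed after reading.
      with open("..\\data\\train.dat.abs", 'r', encoding='utf-8') as datContent:
          for line in datContent:
              a = line.split()
              if len(a) == 0:  # blank line between sentences
                  if tmp_list:  # ignore repeated blank lines
                      MyList.append(tmp_list)
                  tmp_list = []
                  continue
              tmp_list.append((a[1], a[2]))

      if len(tmp_list) > 0:  # append the last sentence if the file does not end with a blank line
          MyList.append(tmp_list)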
      
      df = pd.DataFrame({'sentence': MyList})
      
      print(df)
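
The loop above treats every non-blank line as a token line, so a stray annotation line (such as the "------>" line shown in the question) would be picked up as a tuple. As a rough sketch, the same grouping can be written with itertools.groupby while keeping only well-formed lines. The helper name parse_sentences is hypothetical, and the code assumes every token line has exactly three whitespace-separated fields with a numeric index first.

```python
from itertools import groupby

def parse_sentences(lines):
    """Group "<index> <token> <tag>" lines into one list of tuples per sentence."""
    def fields(line):
        parts = line.split()
        # Assumption: a token line has exactly three fields and a numeric index.
        # Blank lines and anything else become None and act as boundaries.
        if len(parts) == 3 and parts[0].isdigit():
            return (parts[1], parts[2])
        return None

    parsed = (fields(line) for line in lines)
    # Consecutive non-None items form one sentence; runs of None are dropped.
    return [list(group)
            for is_token, group in groupby(parsed, key=lambda p: p is not None)
            if is_token]

sample = [
    "1\tShould\tO\n",
    "2\tstudents\tO\n",
    "\n",
    "\n",
    "1\tIt\tO\n",
    "2\tis\tO\n",
]
sentences = parse_sentences(sample)
print(sentences)
```

Since the function only iterates over lines, an open file handle can be passed in place of the sample list, and the result can go into pd.DataFrame({'sentence': sentences}) in the same way as MyList above.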
      

      【Comments】:
