Format output text file based on data frame in pandas (answers)

【Question title】: Format output text file based on data frame in pandas
【Posted】: 2020-01-16 09:06:09
【Question description】:

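I have a data frame that describes a store and its customers' purchases.

I want to output the data in the data frame in a specific format. The data frame consists of the following columns:

Customer ID, # of products, List of Products, Class of product.

Some example entries in the data frame are: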

df = [{Customer ID: 00001, 00002, 00003}, 
{# of products: 3, 2, 5},
{List of Products: (Milk, Cheese, Bread), (Butter, Steak), (Bread, Apple, Steak, Pasta, Bananas)}, 
{Class of Product: {[1,2,'D'], [3,3,'G']}, {[1,1,'D'], [2,2,'M']}, {[1,1,'G'], [2,2,'F'],[3,3,'M'], [4,4,'G'], [5,5,'F']}

I would like the text file output to look like this:

00001 # Customer ID
3 # Number of Products
Milk Cheese Bread # List of Products separated using single spacing
D D G # Class corresponding to the products, where D = dairy, G = Gluten, also separated using single spacing
# New line

00002 # Next customer number (Next row of data frame)
2 # number of products
Butter Steak # List of products they purchased separated using single spacing
D M # Class corresponding to the products, where D = Dairy and M = Meat, also separated using single spacing
# New Line

00003 # Next customer number (Next row of data frame)
5 # number of products
Bread Apple Steak Pasta Bananas # List of products separated using single spacing, 
G F M G F # Corresponding to the products where F = Fruit, also separated using single spacing
# New Line

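And so on for the entire data frame.

I am not sure how to specify the exact format of the text file, or how to make sure the product class is printed correctly for each product. Taking customer 00001 as an example, [1,2,'D'], [3,3,'G'] should be printed in the correct order as D D G, separated by single spaces.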
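
The expansion described above can be sketched as follows. This is a minimal illustration, assuming each triple is [start, end, label] with 1-based, inclusive positions and that the triples are already in product order; the helper name expand_classes is hypothetical and not part of the original post.

```python
def expand_classes(class_ranges):
    """Turn [[start, end, label], ...] into one label per product position."""
    labels = []
    for start, end, label in class_ranges:
        # A range covering positions start..end contributes (end - start + 1) labels.
        labels.extend([label] * (end - start + 1))
    # Single spaces between labels, no trailing space.
    return " ".join(labels)


print(expand_classes([[1, 2, "D"], [3, 3, "G"]]))  # D D G
```

Collecting the labels in a list and joining them once avoids a trailing space at the end of the line.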

Update:


    Customer_ID  Num_Items  List_of_Products                 Classes
    00001        3          Milk Cheese Bread                [[1,2,'D'],[3,3,'G']]
    00002        2          Butter Steak                     [[1,1,'D'],[2,2,'M']]
    00003        5          Bread Apple Steak Pasta Bananas  [[1,1,'G'],[2,2,'F'],[3,3,'M'],[4,4,'G'],[5,5,'F']]

【Question discussion】:

Tags: python pandas dataframe text output

【Solution 1】:

Assuming each entry in your class column is a list of lists,

import pandas as pd

df = pd.DataFrame({'Customer ID': ['00001', '00002', '00003'], 
      '# of products': [3, 2, 5],
      'List of Products': ['Milk Cheese Bread', 'Butter Steak','Bread Apple Steak Pasta Bananas'], 
      'Class of Product': [[[1,2,'D'], [3,3,'G']], [[1,1,'D'], [2,2,'M']], [[1,1,'G'], [2,2,'F'],[3,3,'M'], [4,4,'G'], [5,5,'F']]]
     })

>>>df

  Customer ID  # of products                 List of Products                                   Class of Product
0       00001              3                Milk Cheese Bread                             [[1, 2, D], [3, 3, G]]
1       00002              2                     Butter Steak                             [[1, 1, D], [2, 2, M]]
2       00003              5  Bread Apple Steak Pasta Bananas  [[1, 1, G], [2, 2, F], [3, 3, M], [4, 4, G], [...

Now use pandas.DataFrame.iterrows to iterate over each row and save it to a text file:

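# Append one line per column for every row of df (defined above).
with open('/path_to_file/file_name.txt', 'a') as fp:
    for _, row in df.iterrows():
        for i, value in enumerate(row):
            if i == 3:
                # 'Class of Product' column: expand each [start, end, letter]
                # triple into one letter per product position.
                extract = ''
                for item in value:
                    if item:
                        extract+= ((item[1]-item[0]+1) * item[2])
                # Joining a string puts a space between its characters,
                # so this relies on single-character class letters.
                value = ' '.join(extract)
            else:
                if not isinstance(value, str):
                    value = str(value)

            fp.write(value + '\n')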

Output

00001
3
Milk Cheese Bread
D D G
00002
2
Butter Steak
D M
00003
5
Bread Apple Steak Pasta Bananas
G F M G F

【Discussion】:

When I do this, I get the error "string index out of range" on the line extract+= ((item[1]-item[0]+1) * item[2])
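
A likely cause of this error, offered as an assumption rather than a confirmed diagnosis, is that the class column holds text such as "[[1,2,'D'],[3,3,'G']]" instead of real lists. Iterating over a string yields single characters, so item is '[' and item[1] does not exist. The sketch below reproduces that on a hypothetical cell value and shows one way to convert the text with ast.literal_eval from the standard library.

```python
import ast

# Hypothetical cell value: the list of triples stored as text, not as a list.
cell = "[[1,2,'D'],[3,3,'G']]"

# Iterating over a string yields one character at a time.
first_item = next(iter(cell))  # '['

try:
    first_item[1]  # a one-character string has no index 1
    error = None
except IndexError as exc:
    error = str(exc)

# literal_eval safely parses Python literals (lists, numbers, strings) from text.
parsed = ast.literal_eval(cell)

print(type(cell).__name__, repr(first_item), error)
print(parsed)
```

Checking type(df['Class of Product'].iloc[0]) on the real data would show whether this assumption holds.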

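【Solution 2】:

Could you provide a valid df definition? As written, it raises a type error, and without knowing exactly what type of data each column holds, it is hard to help you.

Assuming this block creates a data frame in the same format as yours: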

import pandas as pd

df = pd.DataFrame({'Customer ID': ['00001', '00002', '00003'], 
      '# of products': [3, 2, 5],
      'List of Products': [['Milk','Cheese','Bread'], ['Butter','Steak'],['Bread','Apple','Steak','Pasta','Bananas']], 
      'Class of Product': [[[1,2,'D'], [3,3,'G']], [[1,1,'D'], [2,2,'M']], [[1,1,'G'], [2,2,'F'],[3,3,'M'], [4,4,'G'], [5,5,'F']]]
     })

Then the following code should do the trick:

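# The with statement closes the file even if an error occurs.
with open('outputfile.txt', 'a') as file:
    for idx, row in df.iterrows():
        block = str(row['Customer ID']) + '\n'
        block += str(row['# of products']) + '\n'
        # Products on one line, separated by single spaces (no trailing space).
        block += ' '.join(str(product) for product in row['List of Products']) + '\n'
        classes = []
        current = 1  # position at which the next [start, end, letter] triple must begin
        for classP in row['Class of Product']:
            if len(classP) == 3 and classP[0] == current:
                # Repeat the letter once per product position the triple covers.
                classes += (1 + classP[1] - classP[0]) * [str(classP[2])]
                current = classP[1] + 1
            else:
                print("Class of Product should be a list of 2 numbers and one letter but I got: " + str(classP))
        block += ' '.join(classes) + '\n\n'
        print(block)
        file.write(block)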

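Of course, since you did not provide the code block that generates df, I cannot be sure this will work for you.

【Discussion】:

It is still not clear to me what the data type of List_of_products is (list or tuple?). In any case, have you tried simply looping over each row and writing a short function that formats each column's data the way you want?
List_of_Products is a list of lists
What is "classP" in this code? I get an error message saying it is not defined
Sorry, I made a mistake when copying the code. I have just edited my answer.
When I try this, it just goes into the else statement and prints every character of the string for each row? E.g. "Class of Product should be a list of 2 numbers and one letter but I got: [" \n "Class of Product should be a list of 2 numbers and one letter but I got: 1", etc...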
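
The character-by-character output suggests the class column contains strings, which often happens when a data frame is loaded from a file. As a sketch under that assumption, the example below reads hypothetical semicolon-separated data matching the table in the question's update, parses the Classes column while loading through the converters parameter of pandas.read_csv, and builds the requested text blocks. The CSV content, the separator, and the helper name format_row are illustrative and not taken from the original post.

```python
import ast
import io

import pandas as pd

# Hypothetical file content; ';' avoids clashing with the commas inside the lists.
csv_text = """Customer_ID;Num_Items;List_of_Products;Classes
00001;3;Milk Cheese Bread;[[1,2,'D'],[3,3,'G']]
00002;2;Butter Steak;[[1,1,'D'],[2,2,'M']]
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    dtype={"Customer_ID": str},                # keep the leading zeros
    converters={"Classes": ast.literal_eval},  # turn the list text into real lists
)


def format_row(row):
    """Build the four-line block for one customer, followed by a blank line."""
    letters = []
    for start, end, letter in row["Classes"]:
        letters.extend([letter] * (end - start + 1))
    lines = [
        row["Customer_ID"],
        str(row["Num_Items"]),
        row["List_of_Products"],
        " ".join(letters),
    ]
    return "\n".join(lines) + "\n\n"


output = "".join(format_row(row) for _, row in df.iterrows())
print(output)
```

To write the result to disk, output can be passed to the write method of a file opened in text mode.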