【Question title】:Format output text file based on data frame in pandas
【Posted】:2020-01-16 09:06:09
【Question】:

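I have a data frame that describes a store and its customers' purchases.

I would like to output the data in the data frame in a specific format. The data frame consists of the following columns: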

Customer ID | # of products | List of Products | Class of product

Some example entries in the data frame are:

df = [{Customer ID: 00001, 00002, 00003}, 
{# of products: 3, 2, 5},
{List of Products: (Milk, Cheese, Bread), (Butter, Steak), (Bread, Apple, Steak, Pasta, Bananas)}, 
{Class of Product: {[1,2,'D'], [3,3,'G']}, {[1,1,'D'], [2,2,'M']}, {[1,1,'G'], [2,2,'F'],[3,3,'M'], [4,4,'G'], [5,5,'F']}

I would like the text file output to look like this:

00001 # Customer ID
3 # Number of Products
Milk Cheese Bread # List of Products separated using single spacing
D D G # Class corresponding to the products, where D = dairy, G = Gluten, also separated using single spacing
# New line

00002 # Next customer number (Next row of data frame)
2 # number of products
Butter Steak # List of products they purchased separated using single spacing
D M # Class corresponding to the products, where D = Dairy and M = Meat, also separated using single spacing
# New Line

00003 # Next customer number (Next row of data frame)
5 # number of products
Bread Apple Steak Pasta Bananas # List of products separated using single spacing, 
G F M G F # Corresponding to the products where F = Fruit, also separated using single spacing
# New Line

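And so on for the whole data frame.

I am not sure how to specify the exact format of the text file, or how to make sure the class of each product is printed correctly. Taking customer 00001 as an example: given [1,2,'D'], [3,3,'G'], the classes should be printed in the correct order as D D G, separated by single spaces.

Update: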


    Customer_ID Num_Items   List_of_Products          Classes   
    00001        3         Milk Cheese Bread        [[1,2,'D'],[3,3,'G']]   
    00002        2         Butter Steak         [[1,1,'D'],[2,2,'M']]   
    00003        5         Bread Apple Steak Pasta Bananas  [[1,1,'G'], [2,2,'F'], [3,3, 'M'], [4,4,'G'], [5,5,'F']


【Discussion】:

    Tags: python pandas dataframe text output


    【Solution 1】:

    Assuming each row of your class column is a list of lists

    import pandas as pd

    df = pd.DataFrame({'Customer ID': ['00001', '00002', '00003'], 
          '# of products': [3, 2, 5],
          'List of Products': ['Milk Cheese Bread', 'Butter Steak','Bread Apple Steak Pasta Bananas'], 
          'Class of Product': [[[1,2,'D'], [3,3,'G']], [[1,1,'D'], [2,2,'M']], [[1,1,'G'], [2,2,'F'],[3,3,'M'], [4,4,'G'], [5,5,'F']]]
         })
    
    >>>df
    
      Customer ID  # of products                 List of Products                                   Class of Product
    0       00001              3                Milk Cheese Bread                             [[1, 2, D], [3, 3, G]]
    1       00002              2                     Butter Steak                             [[1, 1, D], [2, 2, M]]
    2       00003              5  Bread Apple Steak Pasta Bananas  [[1, 1, G], [2, 2, F], [3, 3, M], [4, 4, G], [...
    

    Now iterate over each row with pandas.DataFrame.iterrows and save it to a text file

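    with open('/path_to_file/file_name.txt', 'a') as fp:
        for _, row in df.iterrows():
            for i, value in enumerate(row):
                if i == 3:
                    # 'Class of Product': expand each [start, end, label]
                    # range into one label per product position.
                    extract = ''
                    for item in value:
                        if item:
                            extract += (item[1] - item[0] + 1) * item[2]
                    # Joining a string puts a space between its characters,
                    # so this assumes single-character labels.
                    value = ' '.join(extract)
                elif not isinstance(value, str):
                    value = str(value)

                fp.write(value + '\n')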
    

    Output

    00001
    3
    Milk Cheese Bread
    D D G
    00002
    2
    Butter Steak
    D M
    00003
    5
    Bread Apple Steak Pasta Bananas
    G F M G F
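
The same idea can be split into small helpers, which also makes it easy to put the blank line between customers that the question asks for. This is only a sketch under the assumptions above (the class column holds real lists of [start, end, label]); the names expand_classes and format_row are made up for illustration and are not part of pandas.

```python
import pandas as pd


def expand_classes(ranges):
    """Expand [[start, end, label], ...] into one label per product."""
    labels = []
    for start, end, label in ranges:
        # Positions start..end (inclusive) yield end - start + 1 labels.
        labels.extend([label] * (end - start + 1))
    return labels


def format_row(row):
    """Build the four-line text block for one customer."""
    products = row["List of Products"]
    # Accept either a space-separated string or a list of product names.
    if not isinstance(products, str):
        products = " ".join(products)
    return "\n".join([
        str(row["Customer ID"]),
        str(row["# of products"]),
        products,
        " ".join(expand_classes(row["Class of Product"])),
    ])


df = pd.DataFrame({
    "Customer ID": ["00001", "00002", "00003"],
    "# of products": [3, 2, 5],
    "List of Products": ["Milk Cheese Bread", "Butter Steak",
                         "Bread Apple Steak Pasta Bananas"],
    "Class of Product": [
        [[1, 2, "D"], [3, 3, "G"]],
        [[1, 1, "D"], [2, 2, "M"]],
        [[1, 1, "G"], [2, 2, "F"], [3, 3, "M"], [4, 4, "G"], [5, 5, "F"]],
    ],
})

# One block per customer, separated by a blank line.
text = "\n\n".join(format_row(row) for _, row in df.iterrows()) + "\n"
print(text)
```

Writing the result is then a single fp.write(text) inside a with open(...) block. Building the labels as a list and joining them avoids trailing spaces and also works if a label is longer than one character.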
    

    【Comments】:

    • When I do this, I get the error string index out of range on the line extract+= ((item[1]-item[0]+1) * item[2])
    【Solution 2】:

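    Could you provide a valid definition of df? As written, it raises a type error, and without knowing exactly what type of data each column holds, it is hard to help you.

    Assuming this block creates a data frame with the same format as yours: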

    import pandas as pd

    df = pd.DataFrame({'Customer ID': ['00001', '00002', '00003'], 
          '# of products': [3, 2, 5],
          'List of Products': [['Milk','Cheese','Bread'], ['Butter','Steak'],['Bread','Apple','Steak','Pasta','Bananas']], 
          'Class of Product': [[[1,2,'D'], [3,3,'G']], [[1,1,'D'], [2,2,'M']], [[1,1,'G'], [2,2,'F'],[3,3,'M'], [4,4,'G'], [5,5,'F']]]
         })
    

    then the following code should do the trick:

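    # Use a context manager so the file is flushed and closed when done.
    with open('outputfile.txt', 'a') as file:
        for idx, row in df.iterrows():
            block = str(row['Customer ID']) + '\n'
            block += str(row['# of products']) + '\n'
            # Products on one line, separated by single spaces.
            block += ' '.join(str(product) for product in row['List of Products']) + '\n'
            current = 1  # product position the next range is expected to start at
            labels = []
            for classP in row['Class of Product']:
                if len(classP) == 3 and classP[0] == current:
                    # [start, end, label] covers end - start + 1 products.
                    labels += (1 + classP[1] - classP[0]) * [str(classP[2])]
                    current = classP[1] + 1
                else:
                    print("Class of Product should be a list of 2 numbers and one letter but I got: " + str(classP))
            block += ' '.join(labels) + '\n\n'
            print(block)
            file.write(block)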
    

    Of course, since you did not provide the code that generates df, I cannot be sure this will work for you.

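    【Comments】:

    • I am still not clear what the data type of List_of_products is (a list or a tuple?). In any case, have you tried simply looping over each row and writing a short function that formats each column's data the way you want?
    • List_of_Products is a list of lists
    • What is "classP" in this code? I get an error saying it is not defined
    • Sorry, I made a mistake when copying the code. I have just edited my answer.
    • When I try this, it just goes into the else statement and prints each character of the string for every row? E.g. "Class of Product should be a list of 2 numbers and one letter but I got: [" \n "Class of Product should be a list of 2 numbers and one letter but I got: 1", etc...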
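
Both symptoms reported in the comments (string index out of range in Solution 1, and one message per character here) are what you would expect if the class cells are strings such as "[[1,2,'D'],[3,3,'G']]" rather than real lists, for example after loading the frame from a CSV file. That is a guess, since the code that builds df was never posted. If it is the case, one option is to parse each cell with ast.literal_eval from the standard library before running either solution. The helper name parse_classes below is hypothetical.

```python
import ast


def parse_classes(value):
    """Return a list of [start, end, label] ranges from a list or its string form."""
    if isinstance(value, str):
        # literal_eval only evaluates Python literals (lists, numbers,
        # strings, ...), so it is safer than eval for untrusted text.
        value = ast.literal_eval(value)
    return [list(item) for item in value]


raw = "[[1,2,'D'],[3,3,'G']]"  # what a cell might look like if stored as text
print(parse_classes(raw))
```

Assuming the column is named Classes as in the question's update, something like df['Classes'] = df['Classes'].apply(parse_classes) would convert the whole column. Note that literal_eval raises an error on malformed text, such as a cell with a missing closing bracket.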