How to split the column values of a dataframe into multiple columns

【Question title】: How to split the column values of dataframe into multiple columns
【Posted】: 2021-07-08 05:48:33
【Question description】:

Below is my dataframe, in which three columns have been merged into one:

   PLUGS\nDESIGN\nGEAR
0  700\nDaewoo 8000  Gearless   
1  300\nHyundai 4400  Gearless   
2  600\nSTX 2600  Gearless   
3  200\nB170 \nGeared   
4  362 Wenchong 1700 Mk II \nGeared   
5  252\nRichMax 1550  Gearless   
6  220\nCV 1100 Plus \nGeared   
7  232\nOrskov Mk VII  Gearless   
8  119\nKouan 1000  Gearless   
9  100\nHanjin 700  Gearless

I want to split this column into three separate columns: PLUGS, DESIGN, and GEAR. Is there a way to do this?

Here is the code I tried:

new_df[['PLUGS', 'DESIGN', 'GEAR']] = new_df['PLUGS\nDESIGN\nGEAR'].str.split(' ')
print(new_df)

Expected output:

   PLUGS  DESIGN               GEAR
0  700    Daewoo 8000          Gearless   
1  300    Hyundai 4400         Gearless   
2  600    STX 2600             Gearless   
3  200    B170                 Geared   
4  362    Wenchong 1700 Mk II  Geared   
5  252    RichMax 1550         Gearless   
6  220    CV 1100 Plus         Geared   
7  232    Orskov Mk VII        Gearless   
8  119    Kouan 1000           Gearless   
9  100    Hanjin 700           Gearless

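【Question discussion】:

What does your original CSV file look like? How did you read the file? @Saran
I used camelot to extract this information from a PDF.
Is it possible to have the raw text instead of the dataframe?
Does df["PLUGS\nDESIGN\nGEAR"].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+(.+)$") work on the real data?
@KarnKumar Sorry for the late reply... the regex does indeed seem to work. Thanks for your answer, +1.

Tags: python pandas dataframe split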

【Solution 1】:

As suggested in the comments, a regex should work well here.

Sample dataframe:

>>> df
                   PLUGS\nDESIGN\nGEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless

First, replace the \n sequences in the column name with spaces, so the name is easier to read and to use.

>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)

Now the column name no longer contains any special characters:

>>> df
                     PLUGS DESIGN GEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless
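
Whether the header holds a real newline character or the two characters backslash and n depends on how the table was extracted. The replace call above uses the pattern \\n, which matches only a backslash followed by n. A minimal sketch that covers both cases, using an assumed one-row frame whose header contains real newlines:

```python
import pandas as pd

# Hypothetical one-row frame; the header contains real newline characters.
df = pd.DataFrame({"PLUGS\nDESIGN\nGEAR": ["700\nDaewoo 8000  Gearless"]})

# Replace either a literal backslash + "n" or a real newline with a space.
df.columns = df.columns.str.replace(r"\\n|\n", " ", regex=True)

print(list(df.columns))  # ['PLUGS DESIGN GEAR']
```

If the header already looks right after the original call, this variant behaves the same way.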

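Now we can use pandas.Series.str.extract. Each capture group () in the regex becomes one column of the result, and named groups supply the column names directly.

Because the groups used here are unnamed, the resulting columns get the default integer labels 0, 1, 2, so we rename them to the desired names as follows: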

>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})

Result:

>>> print(df)
  PLUGS                DESIGN      GEAR
0   700          Daewoo 8000   Gearless
1   300         Hyundai 4400   Gearless
2   600             STX 2600   Gearless
3   200                 B170     Geared
4   362  Wenchong 1700 Mk II     Geared
5   252         RichMax 1550   Gearless
6   220         CV 1100 Plus     Geared
7   232        Orskov Mk VII   Gearless
8   119           Kouan 1000   Gearless
9   100           Hanjin 700   Gearless
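
The rename step can be avoided by naming the capture groups. The following is only a sketch with a simplified pattern, not the pattern from the answer: it assumes GEAR always starts with "Gear" (as in Geared and Gearless) and that the separators are whitespace or a literal backslash followed by n. The three sample rows are copied from the question.

```python
import pandas as pd

# A few merged cells from the question (real newlines assumed here).
raw = pd.Series([
    "700\nDaewoo 8000  Gearless",
    "200\nB170 \nGeared",
    "362 Wenchong 1700 Mk II \nGeared",
])

# Named groups (?P<name>...) become the column names of the result.
# (?:\\n|\s)+ matches a run of whitespace and/or literal backslash-n pairs.
pattern = (
    r"^(?P<PLUGS>\d+)(?:\\n|\s)+"
    r"(?P<DESIGN>.+?)(?:\\n|\s)+"
    r"(?P<GEAR>Gear\w+)\s*$"
)

out = raw.str.extract(pattern)
print(out)
```

The lazy .+? in DESIGN stops at the last separator before the GEAR token, so DESIGN comes out without trailing whitespace.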

Regex explanation:

You can check it at regex101.com

(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\|^Gear][a-z]+)

First capturing group (\d+)

    \d matches a digit (equivalent to [0-9])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

Second capturing group ([^\\]+)

    Match a single character not present in the list below [^\\]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

Third capturing group ([\|^Gear][a-z]+)

    Match a single character present in the list below [\|^Gear]
    \| matches the character | literally (case sensitive)
    ^Gear matches a single character in the list ^Gear (case sensitive)
    Match a single character present in the list below [a-z]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)

Global pattern flags

    g modifier: global. All matches (don't return after first match)
    m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
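
As the breakdown says, a bracket expression matches one character from a set; it is not an alternation between words. A small check with Python's re module, using the pattern from the extract call earlier in this answer and one sample cell from the question (real newline assumed), illustrates this and also shows that the greedy second group keeps one trailing space:

```python
import re

# Third group on its own: one character from the set, then lowercase letters.
third = re.compile(r"[\\n|^Gear][a-z]+")

print(bool(third.fullmatch("Gearless")))  # True
print(bool(third.fullmatch("rabbit")))    # True: "r" is in the set
print(bool(third.fullmatch("Xyz")))       # False: "X" is not in the set

# Full pattern as used in the str.extract call.
full = re.compile(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)")

print(full.search("700\nDaewoo 8000  Gearless").groups())
# ('700', 'Daewoo 8000 ', 'Gearless')
```

If the trailing space in DESIGN matters, calling .str.strip() on that column after the extract removes it.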

【Discussion】:

This is great, sir, a one-liner with regex.

【Solution 2】:

Starting from your dataframe:

>>> import pandas as pd

>>> df = pd.DataFrame({'PLUGS\nDESIGN\nGEAR': ['700\nDaewoo 8000  Gearless', '300\nHyundai 4400  Gearless', '600\nSTX 2600  Gearless', '200\nB170 \nGeared', '362 Wenchong 1700 Mk II \nGeared', '252\nRichMax 1550  Gearless'], }, 
...                   index = [0, 1, 2, 3, 4, 5]) 
>>> df
    PLUGS\nDESIGN\nGEAR
0   700\nDaewoo 8000 Gearless
1   300\nHyundai 4400 Gearless
2   600\nSTX 2600 Gearless
3   200\nB170 \nGeared
4   362 Wenchong 1700 Mk II \nGeared
5   252\nRichMax 1550 Gearless

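You can indeed use the split method with several delimiters at once, here \n and a space. A pattern longer than one character, such as '\n| ', is interpreted as a regular expression:

>>> df = pd.DataFrame(df['PLUGS\nDESIGN\nGEAR'].str.split('\n| '))
>>> df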
    PLUGS\nDESIGN\nGEAR
0   [700, Daewoo, 8000, , Gearless]
1   [300, Hyundai, 4400, , Gearless]
2   [600, STX, 2600, , Gearless]
3   [200, B170, , Geared]
4   [362, Wenchong, 1700, Mk, II, , Geared]
5   [252, RichMax, 1550, , Gearless]
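
The empty strings in these lists come from consecutive delimiters (two spaces, or a space followed by a newline). As a hedged alternative, calling str.split with no pattern splits on runs of any whitespace, so no empty tokens are produced. The two sample cells are taken from the frame above:

```python
import pandas as pd

s = pd.Series(["700\nDaewoo 8000  Gearless", "200\nB170 \nGeared"])

# With no pattern, str.split treats any run of whitespace as one delimiter.
tokens = s.str.split()

print(tokens.tolist())
# [['700', 'Daewoo', '8000', 'Gearless'], ['200', 'B170', 'Geared']]
```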

Then you can assign the first and last elements to their columns, and everything in between to the DESIGN column:

>>> df['PLUGS'] = df['PLUGS\nDESIGN\nGEAR'].str[0]
>>> df['DESIGN'] = df['PLUGS\nDESIGN\nGEAR'].str[1:-1]
>>> df['GEAR'] = df['PLUGS\nDESIGN\nGEAR'].str[-1]
>>> df
    PLUGS\nDESIGN\nGEAR                         PLUGS   DESIGN                      GEAR
0   [700, Daewoo, 8000, , Gearless]             700     [Daewoo, 8000, ]            Gearless
1   [300, Hyundai, 4400, , Gearless]            300     [Hyundai, 4400, ]           Gearless
2   [600, STX, 2600, , Gearless]                600     [STX, 2600, ]               Gearless
3   [200, B170, , Geared]                       200     [B170, ]                    Geared
4   [362, Wenchong, 1700, Mk, II, , Geared]     362     [Wenchong, 1700, Mk, II, ]  Geared
5   [252, RichMax, 1550, , Gearless]            252     [RichMax, 1550, ]           Gearless

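The last step is to turn the DESIGN column from lists back into strings with the join method (stripping the trailing space left by the empty token), and then drop the PLUGS\nDESIGN\nGEAR column like this:

>>> df['DESIGN'] = df['DESIGN'].apply(lambda x: ' '.join(x).strip())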
>>> df.drop(['PLUGS\nDESIGN\nGEAR'], axis=1)
    PLUGS   DESIGN               GEAR
0   700     Daewoo 8000          Gearless
1   300     Hyundai 4400         Gearless
2   600     STX 2600             Gearless
3   200     B170                 Geared
4   362     Wenchong 1700 Mk II  Geared
5   252     RichMax 1550         Gearless
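
The same idea (first token is PLUGS, last token is GEAR, everything in between is DESIGN) can also be sketched without intermediate lists, by splitting once from the left and once from the right. This is an alternative under the assumption that PLUGS and GEAR are always single whitespace-free tokens; the variable names are illustrative only.

```python
import pandas as pd

# A few merged cells from the question (real newlines assumed).
col = pd.Series([
    "700\nDaewoo 8000  Gearless",
    "200\nB170 \nGeared",
    "362 Wenchong 1700 Mk II \nGeared",
])

# Split once from the left: first token vs. everything else.
head = col.str.split(n=1, expand=True)
# Split the remainder once from the right: middle part vs. last token.
tail = head[1].str.rsplit(n=1, expand=True)

result = pd.DataFrame({"PLUGS": head[0], "DESIGN": tail[0], "GEAR": tail[1]})
print(result)
```

Because both calls split on whitespace runs, the DESIGN values need no extra join or strip step.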

【Discussion】:

It works, thanks a lot @tlentail