【Question title】: Split and replace in one dataframe based on a condition with another dataframe in pandas
【Posted】: 2020-04-28 00:29:35
【Question】:

I have two dataframes, both loaded from SQL tables.

This is my first dataframe:

Original_Input           Cleansed_Input        Core_Input    Type_input
TECHNOLOGIES S.A         TECHNOLOGIES SA        
A & J INDUSTRIES, LLC    A J INDUSTRIES LLC     
A&S DENTAL SERVICES      AS DENTAL SERVICES     
A.M.G Médicale Inc       AMG Mdicale Inc        
AAREN SCIENTIFIC         AAREN SCIENTIFIC   

My second dataframe is:

Name_Extension     Company_Type     Priority
co llc             Company LLC       2
Pvt ltd            Private Limited   8
Corp               Corporation       4
CO Ltd             Company Limited   3
inc                Incorporated      5
CO                 Company           1

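I removed punctuation, non-ASCII characters and digits, and stored the result in the Cleansed_Input column of df1.

The Cleansed_Input column in df1 needs to be checked against the Name_Extension column of df2. If a value in Cleansed_Input ends with any value from Name_Extension, that ending should be split off and placed in the Type_input column of df1, not as it is but replaced with the corresponding Company_Type.

For example, if CO is present in Cleansed_Input, it should become Company in the Type_input column, and the remaining text should go in the Core_Input column of df1. There is also a Priority column; I am not sure whether it is needed.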

Expected output:

Original_Input          Cleansed_Input        Core_Input       Type_input
TECHNOLOGIES S.A        TECHNOLOGIES SA       TECHNOLOGIES      SA
A & J INDUSTRIES, LLC   A J INDUSTRIES LLC    A J INDUSTRIES    LLC
A&S DENTAL SERVICES     AS DENTAL SERVICES      
A.M.G Médicale Inc      AMG Mdicale Inc       AMG Mdicale       Incorporated
AAREN SCIENTIFIC        AAREN SCIENTIFIC        

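I tried many approaches, such as isin, mask and contains, but I do not know where to put them.

I get an error saying "Series are mutable, they cannot be hashed". I am not sure why that error appears when I am working with dataframes.

I no longer have that code. I am using Jupyter Notebook with SQL Server, and isin does not seem to work in Jupyter.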

Another split needs to be done the same way: the Original_Input column is split into a parent company name and an alias.

Here is my code:

import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re

#Connecting SQL with Python

server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'


engine = create_engine('mssql+pyodbc://**:******@'+server+'/'+database+'?driver=SQL+server')

#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size() 
#print(gp)

#1.Removing non-ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(lambda x: ''.join(['' if ord(i) < 32 or ord(i) > 126 else i for i in x]))

#2.Removing punctuation
data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

#3.Removing digits
data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i in x if i not in string.digits]))

#4.Removing trailing and leading spaces (df is only defined further down, so use data here)
data['Cleansed_Input'] = data['Cleansed_Input'].apply(lambda x: x.strip())

df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)


df2 = pd.DataFrame({ 
"Name_Extension": ["llc",
                   "Pvt ltd",
                   "Corp",
                   "CO Ltd",
                   "inc", 
                   "CO",
                   "SA"],
"Company_Type": ["Company LLC",
                 "Private Limited",
                 "Corporation",
                 "Company Limited",
                 "Incorporated",
                 "Company",
                 "Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})

data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)

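【Discussion】:

  • SA is not defined in df2, but it has been split off. Is that expected?
  • Yes, it is one of them. df2 has 24 rows, and SA will also be among the abbreviations in df2.

Tags: python pandas dataframe replace split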


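【Solution 1】:

IIUC, we can use some basic regex:

First we strip all leading and trailing spaces and split on whitespace. This returns a list of lists, which we can flatten using chain.from_iterable.

Then we use some regex with the pandas methods str.findall and str.replace to match your input.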

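import re
from itertools import chain

# split every extension into its individual words
ext = df2['Name_Extension'].str.strip().str.split(r'\s+')

# flatten the list of lists into one list of words
ext = list(chain.from_iterable(ext))

# first extension word found in each name (NaN when none is found)
df['Type_Input'] = df['Cleansed_Input'].str.findall('|'.join(ext),flags=re.IGNORECASE).str[0]

# the name with every extension word removed
s = df['Cleansed_Input'].str.replace('|'.join(ext),'',regex=True,case=False).str.strip()

# fill Core_Input only for rows where an extension was found
df.loc[df['Type_Input'].notna(),'Core_Input'] = s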

print(df)

          Original_Input      Cleansed_Input type_input      core_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA        NaN             NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC        LLC  A J INDUSTRIES
2    A&S DENTAL SERVICES  AS DENTAL SERVICES        NaN             NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc        Inc     AMG Mdicale
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC        NaN             NaN
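The pattern built from '|'.join(ext) is not anchored, so it matches an extension word anywhere in the name, and it returns the matched text rather than the Company_Type. A possible variation, sketched here under the assumption that an extension only counts as a whole word (or words) at the end of the name, uses str.extract with an anchored pattern and then maps the match to Company_Type. The sample frames reuse the data from the question, with SA added as in the asker's code; `alternatives`, `pattern`, `parts` and `lookup` are names chosen for this sketch.

```python
import re
import pandas as pd

df = pd.DataFrame({"Cleansed_Input": [
    "TECHNOLOGIES SA", "A J INDUSTRIES LLC", "AS DENTAL SERVICES",
    "AMG Mdicale Inc", "AAREN SCIENTIFIC"]})
df2 = pd.DataFrame({
    "Name_Extension": ["llc", "Pvt ltd", "Corp", "CO Ltd", "inc", "CO", "SA"],
    "Company_Type": ["Company LLC", "Private Limited", "Corporation",
                     "Company Limited", "Incorporated", "Company",
                     "Anonymous Company"],
})

# Longest extensions first so "CO Ltd" is tried before "CO".
exts = sorted(df2["Name_Extension"].str.strip(), key=len, reverse=True)
# Allow any run of whitespace inside a multi-word extension.
alternatives = "|".join(r"\s+".join(map(re.escape, e.split())) for e in exts)
# Group 1 = core name, group 2 = extension as whole word(s) at the end.
pattern = rf"(?i)^(.*?)\s+({alternatives})$"

parts = df["Cleansed_Input"].str.extract(pattern)
lookup = dict(zip(df2["Name_Extension"].str.strip().str.lower(), df2["Company_Type"]))

df["Core_Input"] = parts[0]
# Normalize the matched extension before looking up its Company_Type.
df["Type_Input"] = (parts[1].str.lower()
                            .str.replace(r"\s+", " ", regex=True)
                            .map(lookup))
print(df)
```

Rows without a trailing extension keep NaN in both new columns, which matches the blank cells in the expected output.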

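【Discussion】:

  • Thanks a lot, Datanovice. The splitting works, but the abbreviation part is not done yet. I am trying to do that.
  • I just ran it on your original data and it works fine. What error are you getting? @DhanalakshmiV
  • There is no error, but the values end up in different columns. I dropped and recreated the table, ran it again with the correct column names, and now I get a programming error. Could you help me use the same column names in the code instead of renaming the columns?
  • @DhanalakshmiV What do you mean by different columns? I renamed them to match the column names in your output, so I am not sure what the problem is.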
  • Original_Input Cleansed_Input Core_Input Type_input ext TECHNOLOGIES S.A TECHNOLOGIES SA SA company TECHNOLOGIES
【Solution 2】:

Assuming you have read the dataframes as df1 and df2, the first step is to create 2 lists, one for Name_Extension (keys) and one for Company_Type (values), as shown below:

keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print (keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type']) 
values = [value.strip().lower() for value in values]
print (values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']

The next step is to map each value in Cleansed_Input to Core_Input and Type_Input. We can use the pandas apply method on the Cleansed_Input column to get Core_input:

def get_core_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data end with any of the keys
    for key in keys:
        if data.endswith(key):
            return data[:-len(key)].strip() # drop the matched key from the end and return the rest
    return None

df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print (df1)
>>>
 Original_Input      Cleansed_Input   Core_Input  Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None         NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None         NaN
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None         NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale         NaN
4       AAREN SCIENTIFIC   AAREN SCIENTIFIC          None         NaN

To get Type_input:

def get_type_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data end with any of the keys
    for idx in range(len(keys)):
        if data.endswith(keys[idx]):
            return values[idx].strip() # return the value of the corresponding matched key
    return None

df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)
print (df1)
>>>
Original_Input      Cleansed_Input   Core_Input    Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None          None
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None          None
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None          None
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale  incorporated
4       AAREN SCIENTIFIC   AAREN SCIENTIFIC          None          None

This solution is very easy to follow, but I am sure it is not the most efficient way to solve the problem. Hopefully it covers your use case.
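The two functions above scan the keys twice and lowercase the returned core name. As a rough sketch of one way to combine them, the helper below (`split_name`, a name invented here) returns both parts in one pass, compares whole words so that a name such as "TEXACO" is not treated as ending in "co", and keeps the original casing of the core. It assumes the same df2 contents as the lists printed above.

```python
import pandas as pd

df1 = pd.DataFrame({"Cleansed_Input": [
    "TECHNOLOGIES SA", "A J INDUSTRIES LLC", "AS DENTAL SERVICES",
    "AMG Mdicale Inc", "AAREN SCIENTIFIC"]})
df2 = pd.DataFrame({
    "Name_Extension": ["co llc", "Pvt ltd", "Corp", "CO Ltd", "inc", "CO"],
    "Company_Type": ["Company LLC", "Private Limited", "Corporation",
                     "Company Limited", "Incorporated", "Company"],
})

# Map each normalized extension to its expanded company type.
lookup = {k.strip().lower(): v
          for k, v in zip(df2["Name_Extension"], df2["Company_Type"])}
# Try longer extensions first so "co ltd" is checked before "co".
ordered_keys = sorted(lookup, key=len, reverse=True)

def split_name(name):
    """Return (core, company_type), or (None, None) when no extension matches."""
    words = str(name).split()
    for key in ordered_keys:
        key_words = key.split()
        n = len(key_words)
        # Compare whole trailing words, case-insensitively.
        if len(words) > n and [w.lower() for w in words[-n:]] == key_words:
            return " ".join(words[:-n]), lookup[key]
    return None, None

result = pd.DataFrame([split_name(v) for v in df1["Cleansed_Input"]],
                      columns=["Core_Input", "Type_input"], index=df1.index)
df1 = df1.join(result)
print(df1)
```

With this df2, only the "AMG Mdicale Inc" row gets values, as in the output above, because "LLC" alone and "SA" are not among the keys.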

【Discussion】:

    【Solution 3】:

    Here is a possible solution you can implement:

    df = pd.DataFrame({
        "Original_Input": ["TECHNOLOGIES S.A", 
                           "A & J INDUSTRIES, LLC", 
                           "A&S DENTAL SERVICES", 
                           "A.M.G Médicale Inc", 
                           "AAREN SCIENTIFIC"],
        "Cleansed_Input": ["TECHNOLOGIES SA", 
                           "A J INDUSTRIES LLC", 
                           "AS DENTAL SERVICES", 
                           "AMG Mdicale Inc", 
                           "AAREN SCIENTIFIC"]
    })
    
    df_2 = pd.DataFrame({ 
        "Name_Extension": ["llc",
                           "Pvt ltd",
                           "Corp",
                           "CO Ltd",
                           "inc", 
                           "CO",
                           "SA"],
        "Company_Type": ["Company LLC",
                         "Private Limited",
                         "Corporation",
                         "Company Limited",
                         "Incorporated",
                         "Company",
                         "Anonymous Company"],
        "Priority": [2, 8, 4, 3, 5, 1, 9]
    })
    
    # Preprocessing text
    df["lower_input"] = df["Cleansed_Input"].str.lower()
    df_2["lower_extension"] = df_2["Name_Extension"].str.lower()
    
    # Getting the lowest priority matching the end of the string
    extensions_list = [ (priority, extension.lower_extension.values[0]) 
                        for priority, extension in df_2.groupby("Priority") ]
    df["extension_priority"] = df["lower_input"] \
        .apply(lambda p: next(( priority 
                                for priority, extension in extensions_list 
                                if p.endswith(extension)), None))
    
    # Merging both dataframes based on priority. This step can be ignored if you only need
    # one column from the df_2. In that case, just give the column you require instead of 
    # `priority` in the previous step.
    df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")
    
    # Removing the matched extensions from the `Cleansed_Input` string
    df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
    df["Core_Input"] = df.apply(
        lambda p: p["Cleansed_Input"] 
                  if p["aux"] == 0 
                  else p["Cleansed_Input"][:p["aux"]].strip(), 
        axis=1
    )
    
    # Selecting required columns
    df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]
    

    I am assuming the Priority column has unique values. If that is not the case, just sort by priority and create an index based on that order, like this:

    df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
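To make the effect of that line concrete, here is a small sketch on made-up data in which two extensions share priority 2. The `kind="stable"` argument is an addition for this example so that tied rows keep their original order; the new `index` column can then stand in for `Priority` in the lookup and merge steps.

```python
import pandas as pd

# Hypothetical data with a repeated priority value.
df_2 = pd.DataFrame({
    "Name_Extension": ["llc", "inc", "CO", "Corp"],
    "Priority": [2, 5, 1, 2],
})

ranked = (
    df_2.sort_values("Priority", kind="stable")  # ties keep their original order
        .assign(index=range(df_2.shape[0]))       # 0, 1, 2, ... in priority order
)
print(ranked)
```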
    

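    Also, next time please give the sample data in a format anyone can load easily. The format you posted is cumbersome to work with.

    EDIT: Unrelated to the question, but it may help. You can simplify steps 1 to 4 with the following: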

    data['Cleansed_Input'] = (
        data["Original_Input"]
        .str.replace(r"[^\w ]+", "", regex=True)  # removes everything except word characters and spaces
        .str.replace(r" +", " ", regex=True)      # collapses repeated spaces
        .str.strip()                              # removes spaces before or after the string
    )
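One difference from steps 1 to 3 in the question is worth spelling out: `\w` keeps digits, underscores and accented letters, whereas the original steps drop digits and anything outside printable ASCII. The comparison below is a sketch on sample names from the question plus one invented entry ("3M Co."); the ASCII-only pattern is an assumption about what the asker wants, not part of the answer.

```python
import pandas as pd

data = pd.DataFrame({"Original_Input": [
    "TECHNOLOGIES S.A", "A & J INDUSTRIES, LLC", "A.M.G Médicale Inc",
    "3M Co."]})  # "3M Co." is a made-up example containing a digit
col = data["Original_Input"]

# \w-based cleanup: keeps letters (including accented ones), digits and "_".
word_chars = (col.str.replace(r"[^\w ]+", "", regex=True)
                 .str.replace(r" +", " ", regex=True)
                 .str.strip())

# ASCII-letters-only cleanup: closer to steps 1 to 3 in the question.
ascii_letters = (col.str.replace(r"[^A-Za-z ]+", "", regex=True)
                    .str.replace(r" +", " ", regex=True)
                    .str.strip())

print(pd.DataFrame({"word_chars": word_chars, "ascii_letters": ascii_letters}))
```

Which variant fits depends on whether accented letters and digits should survive the cleansing.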
    

    EDIT 2: SQL version of the solution (I am using PostgreSQL, but with standard SQL operators, so the differences should be small).

    SELECT t.Original_Name,
           t.Cleansed_Input,
           t.Name_Extension,
           t.Company_Type,
           t.Priority
    FROM (
        SELECT df.Original_Name,
               df.Cleansed_Input,
               df_2.Name_Extension,
               df_2.Company_Type,
               df_2.Priority,
               ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
        FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
                     ('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
                     ('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
             LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
                               ('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
                               ('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
                ON  lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
    ) t
    WHERE rn = 1
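For readers who want to stay in pandas, the same idea (pair every name with every extension, keep the pairs where the name ends with the extension, then take the lowest Priority per name) can be approximated as below. This is a sketch, not a translation checked against a database: it uses the Original_Input column name from the question rather than Original_Name, relies on `how="cross"` (available in pandas 1.2 and later), and assumes Original_Input values are unique.

```python
import pandas as pd

df = pd.DataFrame({
    "Original_Input": ["TECHNOLOGIES S.A", "A & J INDUSTRIES, LLC",
                       "A&S DENTAL SERVICES", "A.M.G Médicale Inc",
                       "AAREN SCIENTIFIC"],
    "Cleansed_Input": ["TECHNOLOGIES SA", "A J INDUSTRIES LLC",
                       "AS DENTAL SERVICES", "AMG Mdicale Inc",
                       "AAREN SCIENTIFIC"],
})
df_2 = pd.DataFrame({
    "Name_Extension": ["llc", "Pvt ltd", "Corp", "CO Ltd", "inc", "CO", "SA"],
    "Company_Type": ["Company LLC", "Private Limited", "Corporation",
                     "Company Limited", "Incorporated", "Company",
                     "Anonymous Company"],
    "Priority": [2, 8, 4, 3, 5, 1, 9],
})

# Every name paired with every extension (the join before its ON filter).
pairs = df.merge(df_2, how="cross")
mask = pd.Series(
    [c.lower().endswith(e.lower())
     for c, e in zip(pairs["Cleansed_Input"], pairs["Name_Extension"])],
    index=pairs.index,
)
# Keep matching pairs, then the lowest Priority per name (the rn = 1 filter).
best = pairs[mask].sort_values("Priority").drop_duplicates("Original_Input")
# Left join back so names without a matching extension are kept.
result = df.merge(
    best[["Original_Input", "Name_Extension", "Company_Type", "Priority"]],
    on="Original_Input", how="left",
)
print(result)
```

The cross join grows with the product of the two tables, so for large inputs a suffix lookup per row may be cheaper.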
    

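    【Discussion】:

    • Hi, thanks for your answer. But I still do not get the required output. It just copies Cleansed_Input into a new column called lower_input. It does not split.
    • Did you preprocess "Cleansed_Input" to lowercase? Can you share your code?
    • @DhanalakshmiV I do not see the code that finds the matching extension for each row in df. Could you add that too? Also, please check my edit, it may help.
    • Thanks a lot, it works for the digits, punctuation and spaces. I did not save any of the code I wrote for the split. What I did was write the logic in SQL and execute it through Python, because I could not get it to work in Python.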
    • engine.execute('''update A set A.Type_input = B.Company_Type from [dbo].[TempCompanyName] A (nolock), [dbo].[company_Extension]B where A.Cleansed_Input like '%'+B.Name_Extension''') engine.execute('''update A set A.Core_Input =replace(A.[Cleansed_Input],B.Name_Extension,'') from [TempCompanyName] A (nolock), [company_Extension]B where A.Cleansed_Input like '%'+B.Name_Extension''') engine.execution_options(autocommit=True)