【发布时间】:2020-04-28 00:29:35
【问题描述】:
我有两个数据框,都包含 sql 表。
这是我的第一个数据框
Original_Input Cleansed_Input Core_Input Type_input
TECHNOLOGIES S.A TECHNOLOGIES SA
A & J INDUSTRIES, LLC A J INDUSTRIES LLC
A&S DENTAL SERVICES AS DENTAL SERVICES
A.M.G Médicale Inc AMG Mdicale Inc
AAREN SCIENTIFIC AAREN SCIENTIFIC
我的第二个数据框是:
Name_Extension Company_Type Priority
co llc Company LLC 2
Pvt ltd Private Limited 8
Corp Corporation 4
CO Ltd Company Limited 3
inc Incorporated 5
CO Company 1
我删除了标点符号、ASCII 和数字,并将这些数据放在 df1 的 cleansed_input 列中。
df1 中的 cleansed_input 列需要与 df2 的 Name_Extension 列进行检查。如果来自cleansed_input 的值在末尾有来自Name_Extension 的任何值,那么应该将其拆分并放入df1 的type_input column 中,而不是像这样,而是缩写。
例如,如果CO 存在于cleansed_column 中,则应将其缩写为Company 并放入type_input column,其余文本应位于df1 的core_type 列中。还有优先级,不确定是否需要。
预期输出:
Original_Input Cleansed_Input Core_Input Type_input
TECHNOLOGIES S.A TECHNOLOGIES SA TECHNOLOGIES SA
A & J INDUSTRIES, LLC A J INDUSTRIES LLC A J INDUSTRIES LLC
A&S DENTAL SERVICES AS DENTAL SERVICES
A.M.G Médicale Inc AMG Mdicale Inc AMG Mdicale Incorporated
AAREN SCIENTIFIC AAREN SCIENTIFIC
我尝试了很多方法,例如 isin、mask、contains 等,但不知道该放在哪里。
我收到一条错误消息,提示 "Series are mutable, they cannot be hashed"。当我尝试使用数据框时,我不确定为什么会出现该错误。
我没有该代码,并且正在使用 jupiter notebook 和 sql server,而 isin 似乎在 jupiter 中不起作用。
以同样的方式进行另一个拆分。 original_input 列被拆分为 parent_compnay 名称和别名。
Here is my code:
import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re
#Connecting SQL with Python
server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'
engine = create_engine('mssql+pyodbc://**:******@'+server+'/'+database+'?
driver=SQL+server')
#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size()
#print(gp)
#1.Removing ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(lambda x:''.join([''
if ord(i) < 32 or ord(i) > 126 else i for i in x]))
#2.Removing punctuations
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda
x:''.join([x.translate(str.maketrans('', '', string.punctuation))]))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i
in x if i not in string.punctuation]))
#3.Removing numbers in a table.
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda
x:x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i
in x if i not in string.digits]))
#4.Removing trialing and leading spaces
data['Cleansed_Input']=df['Cleansed_Input'].apply(lambda x: x.strip())
df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)
df2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)
【问题讨论】:
-
SA未在df2中定义,但已拆分。是预期的吗? -
是的。这是一种。在 df2 中有 24 行。 df2 中的缩写也将是 SA
标签: python pandas dataframe replace split