【Question title】: Merge two CSV files using Python or pandas
【Posted】: 2019-02-22 12:59:12
【Question】:
**csv file 1**

date    yearMonth   deviceCategory  channelGrouping eventCategory   Totalevents
20160719    201607  desktop Direct  _GW_Legal_RM_false  149
20160719    201607  desktop Direct  _GW_Risk_RM_false   298
20160719    201607  desktop Direct  _GW_Risk_RM_true    149
20160719    201607  desktop Direct  _GW__Product-Sign-In__  895
20160719    201607  desktop Organic Search  _GW_Legal_RM_false  149
20160719    201607  desktop Organic Search  _GW_Risk_RM_false   746
20160719    201607  desktop Organic Search  _GW__Product-Sign-In__  1342
20160719    201607  desktop Referral    _GW__Product-Sign-In__  1044
20160719    201607  mobile  Direct  _GW_Legal_RM_false  149
20160719    201607  mobile  Social  _GW_Legal_RM_false  149
20160719    201607  tablet  Direct  _GW_Legal_RM_false  149
20160720    201607  desktop Branded Paid Search _GW_Legal_RM_false  149
20160720    201607  desktop Direct  _GW_Legal_RM_false  149
20160720    201607  desktop Direct  _GW__Product-Sign-In__  746
20160720    201607  desktop Non-Branded Paid Search _GW_Legal_RM_false  149
20160720    201607  desktop Non-Branded Paid Search _GW_Risk_RM_false   149
20160720    201607  desktop Organic Search  _GW_Legal_RM_false  1939
20160720    201607  desktop Organic Search  _GW_Risk_RM_false   298

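I have 2 CSV files that I want to merge on a common column, but the common column has a different length in each file! Is there a way to merge/combine them without duplicating values?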

**csv file 2**

eventCategory   event_type
_GW_Legal_RM_false  Legal
_GW_Legal_RM_true   Legal
_GW_Legal_RM_   Legal
_GW_Risk_RM_false   Risk
_GW_Risk_RM_true    Risk
_GW_Risk_RM_    Risk
_GW__Product-Sign-In__  Sign-in

**output.csv**

eventCategory   event_type  date    yearMonth   deviceCategory  channelGrouping Totalevents
 _GW_Legal_RM_false Legal   20160719    201607  desktop Direct  149
 _GW_Legal_RM_false Legal   20160719    201607  desktop Organic Search  149
 _GW_Legal_RM_false Legal   20160719    201607  mobile  Direct  149
 _GW_Legal_RM_false Legal   20160719    201607  mobile  Social  149

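【Question comments】:

  • Can you give an example of what the two input CSVs look like and what output you want?
  • Will edit the question.
  • Is this the complete output you expect, or just a subset of it? I can't find any particular logic for why only those 4 rows appear.
  • @ALollz, this is only a subset of the output; more precisely, it is an example of the output format I need.

Tags: python pandas csv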


【Solution 1】:

To expand on ALollz's reply:

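import pandas as pd

# Adjust sep to match the delimiter actually used in your files
df1 = pd.read_csv("1.csv", sep=" ")
df2 = pd.read_csv("2.csv", sep=" ")

# pd.merge takes the two frames as separate arguments, not as a list
df = pd.merge(df1, df2, on='eventCategory', how='left')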
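
As a rough illustration of the left merge above, here is a minimal sketch on a few in-memory rows taken from the question. The column reordering at the end is an assumption about the desired layout, based on the sample output.csv in the question; the variable names `merged` and `front` are just for illustration.

```python
import pandas as pd

# A few rows from csv file 1 in the question
df1 = pd.DataFrame({
    "date": [20160719, 20160719, 20160719],
    "yearMonth": [201607, 201607, 201607],
    "deviceCategory": ["desktop", "desktop", "mobile"],
    "channelGrouping": ["Direct", "Organic Search", "Social"],
    "eventCategory": ["_GW_Legal_RM_false", "_GW_Risk_RM_false", "_GW_Legal_RM_false"],
    "Totalevents": [149, 746, 149],
})

# Part of the lookup table from csv file 2
df2 = pd.DataFrame({
    "eventCategory": ["_GW_Legal_RM_false", "_GW_Legal_RM_true", "_GW_Risk_RM_false"],
    "event_type": ["Legal", "Legal", "Risk"],
})

# A left merge keeps every row of df1 and attaches the matching event_type
merged = pd.merge(df1, df2, on="eventCategory", how="left")

# Assumed layout: key column and event_type first, remaining columns after
front = ["eventCategory", "event_type"]
merged = merged[front + [c for c in merged.columns if c not in front]]
print(merged)
```

Because df2 has one row per eventCategory, the result has the same number of rows as df1, even though df2 is shorter and contains a key that df1 never uses.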

【Comments】:

    【Solution 2】:

    Use map together with set_index:

    import pandas as pd
    from io import StringIO
    
    csv1 = StringIO("""date    yearMonth   deviceCategory  channelGrouping  eventCategory   Totalevents
    20160719    201607  desktop  Direct  _GW_Legal_RM_false  149
    20160719    201607  desktop  Direct  _GW_Risk_RM_false   298
    20160719    201607  desktop  Direct  _GW_Risk_RM_true    149
    20160719    201607  desktop  Direct  _GW__Product-Sign-In__  895
    20160719    201607  desktop  Organic Search  _GW_Legal_RM_false  149
    20160719    201607  desktop  Organic Search  _GW_Risk_RM_false   746
    20160719    201607  desktop  Organic Search  _GW__Product-Sign-In__  1342
    20160719    201607  desktop  Referral    _GW__Product-Sign-In__  1044
    20160719    201607  mobile  Direct  _GW_Legal_RM_false  149
    20160719    201607  mobile  Social  _GW_Legal_RM_false  149
    20160719    201607  tablet  Direct  _GW_Legal_RM_false  149
    20160720    201607  desktop  Branded Paid Search  _GW_Legal_RM_false  149
    20160720    201607  desktop  Direct  _GW_Legal_RM_false  149
    20160720    201607  desktop  Direct  _GW__Product-Sign-In__  746
    20160720    201607  desktop  Non-Branded Paid Search  _GW_Legal_RM_false  149
    20160720    201607  desktop  Non-Branded Paid Search  _GW_Risk_RM_false   149
    20160720    201607  desktop  Organic Search  _GW_Legal_RM_false  1939
    20160720    201607  desktop  Organic Search  _GW_Risk_RM_false   298""")
    
    csv2= StringIO("""eventCategory   event_type
    _GW_Legal_RM_false  Legal
    _GW_Legal_RM_true   Legal
    _GW_Legal_RM_   Legal
    _GW_Risk_RM_false   Risk
    _GW_Risk_RM_true    Risk
    _GW_Risk_RM_    Risk
    _GW__Product-Sign-In__  Sign-in""")
    
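    # Split on runs of two or more whitespace characters; regex separators need the python engine
    df1 = pd.read_csv(csv1, sep=r'\s\s+', engine='python')
    df2 = pd.read_csv(csv2, sep=r'\s\s+', engine='python')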
    
    df1['event_type'] = df1['eventCategory'].map(df2.set_index('eventCategory')['event_type'])
    
    df1
    

    Output:

            date  yearMonth deviceCategory          channelGrouping           eventCategory  Totalevents event_type
    0   20160719     201607        desktop                   Direct      _GW_Legal_RM_false          149      Legal
    1   20160719     201607        desktop                   Direct       _GW_Risk_RM_false          298       Risk
    2   20160719     201607        desktop                   Direct        _GW_Risk_RM_true          149       Risk
    3   20160719     201607        desktop                   Direct  _GW__Product-Sign-In__          895    Sign-in
    4   20160719     201607        desktop           Organic Search      _GW_Legal_RM_false          149      Legal
    5   20160719     201607        desktop           Organic Search       _GW_Risk_RM_false          746       Risk
    6   20160719     201607        desktop           Organic Search  _GW__Product-Sign-In__         1342    Sign-in
    7   20160719     201607        desktop                 Referral  _GW__Product-Sign-In__         1044    Sign-in
    8   20160719     201607         mobile                   Direct      _GW_Legal_RM_false          149      Legal
    9   20160719     201607         mobile                   Social      _GW_Legal_RM_false          149      Legal
    10  20160719     201607         tablet                   Direct      _GW_Legal_RM_false          149      Legal
    11  20160720     201607        desktop      Branded Paid Search      _GW_Legal_RM_false          149      Legal
    12  20160720     201607        desktop                   Direct      _GW_Legal_RM_false          149      Legal
    13  20160720     201607        desktop                   Direct  _GW__Product-Sign-In__          746    Sign-in
    14  20160720     201607        desktop  Non-Branded Paid Search      _GW_Legal_RM_false          149      Legal
    15  20160720     201607        desktop  Non-Branded Paid Search       _GW_Risk_RM_false          149       Risk
    16  20160720     201607        desktop           Organic Search      _GW_Legal_RM_false         1939      Legal
    17  20160720     201607        desktop           Organic Search       _GW_Risk_RM_false          298       Risk
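
One thing the map approach leaves open is what happens when a category in the first file has no entry in the lookup table. The sketch below is only an illustration: the key `_GW_Unknown_` and the fill value `"Unknown"` are made up for the example and do not come from the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    # "_GW_Unknown_" is a hypothetical key with no entry in the lookup table
    "eventCategory": ["_GW_Legal_RM_false", "_GW_Unknown_", "_GW_Risk_RM_true"],
    "Totalevents": [149, 10, 149],
})
df2 = pd.DataFrame({
    "eventCategory": ["_GW_Legal_RM_false", "_GW_Risk_RM_true"],
    "event_type": ["Legal", "Risk"],
})

# Build the lookup Series; map needs its index (the keys) to be unique
lookup = df2.set_index("eventCategory")["event_type"]
assert lookup.index.is_unique

df1["event_type"] = df1["eventCategory"].map(lookup)

# Keys missing from the lookup come back as NaN; collect them for inspection
unmatched = df1[df1["event_type"].isna()]

# Optionally replace NaN with a placeholder label (assumed choice)
df1["event_type"] = df1["event_type"].fillna("Unknown")
print(df1)
```

Like a left merge, map never adds or drops rows in df1, so the different lengths of the two files are not a problem.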
    

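    【Comments】:

    • So, @ScottBoston, both input files are in .csv format, not strings.. can you guide me!
    • Oh.. that is just a test to simulate your csv input. You can use the same approach; just change csv1 and csv2 to the paths of your .csv files.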
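
To make the last comment concrete, here is a small sketch that reads from file paths instead of StringIO objects. It writes two tiny sample files to a temporary directory first so that it runs on its own; the file names, the reduced column set, and the two-space delimiter are assumptions for the example, so substitute your real paths and separator.

```python
import os
import tempfile

import pandas as pd

# Create two small sample files (stand-ins for the real .csv files)
tmpdir = tempfile.mkdtemp()
path1 = os.path.join(tmpdir, "1.csv")
path2 = os.path.join(tmpdir, "2.csv")

with open(path1, "w") as f:
    f.write("date  channelGrouping  eventCategory  Totalevents\n"
            "20160719  Organic Search  _GW_Legal_RM_false  149\n"
            "20160719  Direct  _GW_Risk_RM_true  149\n")

with open(path2, "w") as f:
    f.write("eventCategory  event_type\n"
            "_GW_Legal_RM_false  Legal\n"
            "_GW_Risk_RM_true  Risk\n")

# Same approach as above, but passing paths to read_csv
df1 = pd.read_csv(path1, sep=r"\s\s+", engine="python")
df2 = pd.read_csv(path2, sep=r"\s\s+", engine="python")

df1["event_type"] = df1["eventCategory"].map(
    df2.set_index("eventCategory")["event_type"]
)
print(df1)
```

Splitting on two or more whitespace characters keeps values such as "Organic Search" in one field, which a single-space separator would break apart.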
    【Solution 3】:
    import pandas as pd

    df1 = pd.read_csv("csv1.csv")
    
    df2 = pd.read_csv("csv2.csv")
    
    df = pd.merge(df1, df2, on='eventCategory', how='left')
    

    A slight modification of @FrankZhu's answer.
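
Since the question asks about merging without duplicating values, it may help to see when a left merge does duplicate rows: only when the lookup table repeats a key. The sketch below assumes a lookup table with an accidental duplicate (not present in the question's data) and uses the `validate` argument of `pd.merge` as a guard.

```python
import pandas as pd

df1 = pd.DataFrame({
    "eventCategory": ["_GW_Legal_RM_false", "_GW_Legal_RM_false", "_GW_Risk_RM_false"],
    "Totalevents": [149, 149, 298],
})

# Hypothetical lookup table in which one key appears twice
df2 = pd.DataFrame({
    "eventCategory": ["_GW_Legal_RM_false", "_GW_Legal_RM_false", "_GW_Risk_RM_false"],
    "event_type": ["Legal", "Legal", "Risk"],
})

# Without a check, each Legal row in df1 matches two rows in df2
naive = pd.merge(df1, df2, on="eventCategory", how="left")

# validate="many_to_one" raises MergeError if the right-hand keys are not unique
try:
    pd.merge(df1, df2, on="eventCategory", how="left", validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# Drop repeated keys from the lookup table, then merge safely
deduped = df2.drop_duplicates(subset="eventCategory")
safe = pd.merge(df1, deduped, on="eventCategory", how="left", validate="many_to_one")

print(len(naive), duplicate_keys_detected, len(safe))
```

With the lookup table from the question, where every eventCategory appears once, the plain left merge already returns exactly one output row per row of the first file.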

    【Comments】:

    • Thanks @Keshav Sharma.. very helpful (I know cat)