Fuzzy match for two variables in a dataset

【Question title】: Fuzzy match for two variables in a dataset
【Posted】: 2018-10-10 19:01:06
【Question】:

How can I do a fuzzy match (roughly a 75% match) between two variables in a Stata dataset?

In my example, I generate Match_yes = 1 if the value in Brand_1 exists anywhere in Brand_2:

Brand_1     Brand_2    Match_yes
Samsung     Samsung         1
Microsoft   Apple           0
Apple       Sony            1
Panasonic   Motorola        0
Miumiu                      0
Mottorrola                  1  
LG                          0

How can I get the value Mottorrola in Brand_1 to also produce Match_yes = 1, given that it is about 80% similar to the value Motorola in Brand_2?
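
One way to arrive at the 80% figure, assuming similarity is measured as one minus the Levenshtein edit distance divided by the length of the longer string (an assumption; the question does not name a measure): Mottorrola becomes Motorola by deleting one t and one r, so the distance is 2 and the similarity is 1 - 2/10 = 0.80. A minimal Python sketch of that arithmetic, for illustration only (it is not Stata code, and the function names are hypothetical):

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn a into b (classic dynamic-programming formulation)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("Mottorrola", "Motorola"))           # 2
print(round(similarity("Mottorrola", "Motorola"), 2))  # 0.8
```

Under this measure the pair sits exactly at 0.80, so it passes a 75% cutoff but not a strict "greater than 0.80" test; other similarity measures will give other values.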

【Comments】:

Tags: match stata fuzzy-logic

【Solution 1】:

Using your toy example:

clear

input strL(Brand_1 Brand_2)
Samsung     Samsung     
Microsoft   Apple          
Apple       Sony           
Panasonic   Motorola       
Miumiu                     
Mottorrola                  
LG                          
end

Here is a "hack" that uses the community-contributed command matchit to produce the desired output:

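* Start with no match for every observation
local obs = _N
generate Cont = 0

forvalues i = 1 / `obs' {
    forvalues j = 1 / `obs' {
        * Exact match between Brand_1[i] and Brand_2[j]
        replace Cont = 1 if Brand_1[`i'] == Brand_2[`j'] in `i'

        * Put the current pair in observation 1 and let matchit score it
        generate b1 = Brand_1[`i'] in 1
        generate b2 = Brand_2[`j'] in 1
        matchit b1 b2, generate(simscore)

        * Fuzzy match: score above 0.80 (a missing score would otherwise
        * compare as larger than any number, so it is excluded explicitly)
        replace Cont = 1 if simscore[1] > 0.80 & !missing(simscore[1]) in `i'

        drop b1 b2 simscore
    }
}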

list

     +------------------------------+
     |    Brand_1    Brand_2   Cont |
     |------------------------------|
  1. |    Samsung    Samsung      1 |
  2. |  Microsoft      Apple      0 |
  3. |      Apple       Sony      1 |
  4. |  Panasonic   Motorola      0 |
  5. |     Miumiu                 0 |
     |------------------------------|
  6. | Mottorrola                 1 |
  7. |         LG                 0 |
     +------------------------------+
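
The nested loop above compares every Brand_1 value with every Brand_2 value and sets the flag on an exact match or a score above 0.80. The same logic can be sketched outside Stata. The Python below is illustrative only: it uses difflib.SequenceMatcher from the standard library as a stand-in scorer, which is an assumption and will not necessarily reproduce the scores matchit computes, and the function name flag_matches is hypothetical.

```python
from difflib import SequenceMatcher

# Toy data from the question; empty strings stand for the blank Brand_2 cells.
brand_1 = ["Samsung", "Microsoft", "Apple", "Panasonic", "Miumiu", "Mottorrola", "LG"]
brand_2 = ["Samsung", "Apple", "Sony", "Motorola", "", "", ""]


def flag_matches(left, right, threshold=0.80):
    """Return 1 for each value in left that equals, or scores above
    threshold against, at least one non-empty value in right; else 0."""
    candidates = [r for r in right if r]  # skip blank cells
    flags = []
    for value in left:
        hit = any(
            value == cand
            or SequenceMatcher(None, value, cand).ratio() > threshold
            for cand in candidates
        )
        flags.append(int(hit))
    return flags


for brand, flag in zip(brand_1, flag_matches(brand_1, brand_2)):
    print(f"{brand:<12}{flag}")
```

For these seven rows the flags agree with the Cont column in the listing above. On other data, a different scorer or threshold can change which pairs count as matches.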

【Discussion】: