【问题标题】:Stata merging data sets based on a range of datesStata基于日期范围合并数据集
【发布时间】:2016-03-30 20:35:48
【问题描述】:

我知道有一个具有这个确切标题的 Stata 论坛,但我没有发现它的语法很有帮助,尤其是因为我的数据集有点不同。我有两个数据集。一个是一个人在设施中的停留时间,包括设施名称。它看起来像这样:

+---+-------------+---------------+-----------------------+
|ID#|Entrance Date|  Exit Date    | Facility Name         |
|1  | 7/22/2009   | 2/24/2010     | Facility 1            |
|1  | 7/10/2010   | 11/21/2010    | Facility 2            |
|2  | 3/31/2010   | 9/23/2010     | Facility 1            |
|3  | 11/24/2010  | 7/5/2011      | Facility 3            |
|4  | 3/7/2007    | 4/19/2010     | Facility 2            |
+---+-------------+---------------+-----------------------+

下一个数据集显示访问它们的日期。里面只有身份证和访问日期:

+---+-------------+
|ID#|Visit Date   |
| 1 | 08/21/2009  |
| 1 | 09/02/2009  |
| 1 | 09/23/2009 |
| 3 | 04/22/2011  |
| 3 | 05/05/2011  |
+---+-------------+

我想将这两个文件合并到ID# 上,其中VisitDate 介于Entrance DateExit Date 之间,这样我就可以看到 1. 谁有访客,2. 他们在什么设施中。

【问题讨论】:

  • 您指的是“Stata 论坛”,大概是指某处的线程。最好通过提供 URL 来说明它是什么。
  • 对于未来的问题,请研究stackoverflow.com/help/mcve另外,请使用dataex(在Stata中,ssc inst dataex)给出可重现的数据生成代码。 (从他们显示的输出中复制日期真的很痛苦。)

标签: merge stata


【解决方案1】:

SSC 上有一个名为rangejoin 的新用户编写程序,专为此类问题量身定制。要安装它,请在 Stata 的命令窗口中输入:

ssc install rangejoin

rangejoin 将根据每次住宿的进出日期(所需间隔的范围)和访问日期配对。所有日期都必须是数字,所以我在下面的示例中将所有日期预先转换为 Stata 日期。

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 visit int nvisit
1 "08/21/2009" 18130
1 "09/02/2009" 18142
1 "09/23/2009" 18163
3 "04/22/2011" 18739
3 "05/05/2011" 18752
end
format %td nvisit
save "visits.dta", replace

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10(Entrance Exit Name) int(datein dateout)
1 "7/22/2009"  "2/24/2010"  "Facility 1" 18100 18317
1 "7/10/2010"  "11/21/2010" "Facility 2" 18453 18587
2 "3/31/2010"  "9/23/2010"  "Facility 1" 18352 18528
3 "11/24/2010" "7/5/2011"   "Facility 3" 18590 18813
4 "3/7/2007"   "4/19/2010"  "Facility 2" 17232 18371
end
format %td datein
format %td dateout

rangejoin nvisit datein dateout using "visits.dta", by(id)
bysort id datein: egen visit_count = total(!mi(nvisit))
list, sepby(id)

     +-------------------------------------------------------------------------------------------------------+
     | id     Entrance         Exit         Name      datein     dateout        visit      nvisit   visit_~t |
     |-------------------------------------------------------------------------------------------------------|
  1. |  1    7/22/2009    2/24/2010   Facility 1   22jul2009   24feb2010   08/21/2009   21aug2009          3 |
  2. |  1    7/22/2009    2/24/2010   Facility 1   22jul2009   24feb2010   09/02/2009   02sep2009          3 |
  3. |  1    7/22/2009    2/24/2010   Facility 1   22jul2009   24feb2010   09/23/2009   23sep2009          3 |
  4. |  1    7/10/2010   11/21/2010   Facility 2   10jul2010   21nov2010                        .          0 |
     |-------------------------------------------------------------------------------------------------------|
  5. |  2    3/31/2010    9/23/2010   Facility 1   31mar2010   23sep2010                        .          0 |
     |-------------------------------------------------------------------------------------------------------|
  6. |  3   11/24/2010     7/5/2011   Facility 3   24nov2010   05jul2011   04/22/2011   22apr2011          2 |
  7. |  3   11/24/2010     7/5/2011   Facility 3   24nov2010   05jul2011   05/05/2011   05may2011          2 |
     |-------------------------------------------------------------------------------------------------------|
  8. |  4     3/7/2007    4/19/2010   Facility 2   07mar2007   19apr2010                        .          0 |
     +-------------------------------------------------------------------------------------------------------+

然后,如果需要,您可以使用以下方法恢复原始观察:

by id datein: keep if _n == 1
keep id Entrance Exit Name datein dateout visit_count
list
     +------------------------------------------------------------------------------+
     | id     Entrance         Exit         Name      datein     dateout   visit_~t |
     |------------------------------------------------------------------------------|
  1. |  1    7/22/2009    2/24/2010   Facility 1   22jul2009   24feb2010          3 |
  2. |  1    7/10/2010   11/21/2010   Facility 2   10jul2010   21nov2010          0 |
  3. |  2    3/31/2010    9/23/2010   Facility 1   31mar2010   23sep2010          0 |
  4. |  3   11/24/2010     7/5/2011   Facility 3   24nov2010   05jul2011          2 |
  5. |  4     3/7/2007    4/19/2010   Facility 2   07mar2007   19apr2010          0 |
     +------------------------------------------------------------------------------+

【讨论】:

    【解决方案2】:

    任何类型的merge 在这里似乎都无用且不合适,因为您只能匹配标识符。我会使用append

    clear 
    input ID str10 (Entrance Exit) Name  
    1 "7/22/2009"  "2/24/2010"  1  
    1 "7/10/2010"  "11/21/2010" 2  
    2 "3/31/2010"  "9/23/2010"  1  
    3 "11/24/2010" "7/5/2011"   3  
    4 "3/7/2007"   "4/19/2010"  2 
    end 
    gen DateEntrance = daily(Entrance, "MDY") 
    gen DateExit = daily(Exit, "MDY") 
    drop Entrance Exit 
    sort ID, stable 
    by ID : gen T = _n 
    reshape long Date, i(ID T) j(Event) string 
    drop T 
    save Master, replace 
    clear 
    input ID str10 Visit 
    1 "08/21/2009"  
    1 "09/02/2009"  
    1 "09/23/2009" 
    3 "04/22/2011"  
    3 "05/05/2011"  
    end 
    gen Date = daily(Visit, "MDY") 
    drop Visit 
    gen Event = "Visit" 
    append using Master 
    sort ID Date 
    format Date %td 
    list, sepby(ID)  
    
        +----------------------------------+
         | ID        Date      Event   Name |
         |----------------------------------|
      1. |  1   22jul2009   Entrance      1 |
      2. |  1   21aug2009      Visit      . |
      3. |  1   02sep2009      Visit      . |
      4. |  1   23sep2009      Visit      . |
      5. |  1   24feb2010       Exit      1 |
      6. |  1   10jul2010   Entrance      2 |
      7. |  1   21nov2010       Exit      2 |
         |----------------------------------|
      8. |  2   31mar2010   Entrance      1 |
      9. |  2   23sep2010       Exit      1 |
         |----------------------------------|
     10. |  3   24nov2010   Entrance      3 |
     11. |  3   22apr2011      Visit      . |
     12. |  3   05may2011      Visit      . |
     13. |  3   05jul2011       Exit      3 |
         |----------------------------------|
     14. |  4   07mar2007   Entrance      2 |
     15. |  4   19apr2010       Exit      2 |
         +----------------------------------+
    

    立即查看here for how to fill in the missings

    【讨论】:

      【解决方案3】:

      另一种方法使用joinby

      /* Set up Visits Data */
      clear 
      input ID str10 Visit 
      1 "08/21/2009"  
      1 "09/02/2009"  
      1 "09/23/2009" 
      3 "04/22/2011"  
      3 "05/05/2011"  
      end 
      gen DateVisit = daily(Visit, "MDY") 
      drop Visit 
      tempfile Visits
      save `Visits'
      
      /* Set up Facilities Data */
      clear 
      input ID str10 (Entrance Exit Name)  
      1 "7/22/2009"  "2/24/2010"  "Facility 1"  
      1 "7/10/2010"  "11/21/2010" "Facility 2"  
      2 "3/31/2010"  "9/23/2010"  "Facility 1" 
      3 "11/24/2010" "7/5/2011"   "Facility 3"
      4 "3/7/2007"   "4/19/2010"  "Facility 2" 
      end 
      gen DateEntrance = daily(Entrance, "MDY") 
      gen DateExit = daily(Exit, "MDY") 
      drop Entrance Exit 
      
      /* Create pairwise combinations within ID using -joinby- */
      joinby ID using `Visits', unmatched(both)
      drop _merge
      format Date* %td
      
      /* Whatever else you want now... */
      gen Visitor = 0
      replace Visitor = 1 if DateEntrance <= DateVisit & DateVisit <= DateExit
      
      * or... 
      collapse (sum) countVisits = Visitor, by(ID Name DateEntrance DateExit)
      
      * or...
      replace DateVisit = . if !Visitor
      by ID Name (DateVisit), sort : gen VisitNumber = _n * Visitor
      collapse (sum) Visitor, by(ID Name DateEntrance DateExit DateVisit VisitNumber)
      drop VisitNumber
      list, sepby(ID)
      
           +---------------------------------------------------------------+
           | ID         Name   DateEnt~e    DateExit   DateVisit   Visitor |
           |---------------------------------------------------------------|
        1. |  1   Facility 1   22jul2009   24feb2010   21aug2009         1 |
        2. |  1   Facility 1   22jul2009   24feb2010   02sep2009         1 |
        3. |  1   Facility 1   22jul2009   24feb2010   23sep2009         1 |
        4. |  1   Facility 2   10jul2010   21nov2010           .         0 |
           |---------------------------------------------------------------|
        5. |  2   Facility 1   31mar2010   23sep2010           .         0 |
           |---------------------------------------------------------------|
        6. |  3   Facility 3   24nov2010   05jul2011   22apr2011         1 |
        7. |  3   Facility 3   24nov2010   05jul2011   05may2011         1 |
           |---------------------------------------------------------------|
        8. |  4   Facility 2   07mar2007   19apr2010           .         0 |
           +---------------------------------------------------------------+
      

      【讨论】:

        猜你喜欢
        • 2014-02-28
        • 1970-01-01
        • 2021-08-06
        • 2015-09-28
        • 1970-01-01
        • 1970-01-01
        • 2016-12-27
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多