【Question title】: How to extract a substring with regex in R [duplicate]
【Posted】: 2019-06-16 13:38:27
【Question description】:

I have the following string:

x <- "\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\n\t\t\tGEO Publications\n\t\t\t\t\tHandout\n\t\t\t\t\t\tNAR 2013 (latest)\n\t\t\t\t\t\tNAR 2002 (original)\n\t\t\t\t\t\tAll publications\n\t\t\t\t\t\n\t\t\t\tFAQ\n\t\t\t\tMIAME\n\t\t\t\tEmail GEO\n\t\t\t\n                    \n                \n                    \n                    \n                \n                    \n                           NCBI > GEO > Accession Display\nNot logged in | Login\n\n                    \n                \n                    \n                    \n                \n                    \n                        \n                                    \n\n \n \n\nGEO help: Mouse over screen elements for information.\n\nScope: SelfPlatformSamplesSeriesFamily\n  Format: HTMLSOFTMINiML\n  Amount: BriefQuick\n GEO accession:   \n\n\n\n    Sample GSM935277\n\nQuery DataSets for GSM935277\nStatus\nPublic on May 22, 2012\nTitle\nStanford_ChipSeq_GM12878_TBP_IgG-mus\nSample type\nSRA\n \n\nSource name\nGM12878\nOrganism\nHomo sapiens\nCharacteristics\nlab: Stanfordlab description: Snyder - Stanford Universitydatatype: ChipSeqdatatype description: Chromatin IP Sequencingcell: GM12878cell organism: humancell description: B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Viruscell karyotype: normalcell lineage: mesodermcell sex: Ftreatment: Nonetreatment description: No special treatment or protocol appliesantibody: TBPantibody antibodydescription: Mouse monoclonal. Immunogen is synthetic peptide conjugated to KLH derived from within residues 1 - 100 of HumanTATA binding protein TBP. Antibody Target: TBPantibody targetdescription: General transcription factor that functions at the core of the DNA-binding multiprotein factor TFIID. 
Binding of TFIID to the TATA box is the initial transcriptional step of the pre-initiation complex (PIC), playing a role in the activation of eukaryotic genes transcribed by RNA polymerase II."

What I want to do is detect a pattern of this form:

Antibody Target: TBPantibody 

and return the substring TBPantibody.

I tried this regex, but it doesn't work:

sub("Antibody Target: ([A-Zaz]+)\\W+", "\\1", x)

What is the correct way to do this?

【Question comments】:

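  • You do realize that silently deleting a question (another question) after it has been answered can leave a bad feeling with the people who took the time to help you, don't you?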

Tags: r regex


【Solution 1】:

You can use:

sub(".*Antibody Target: ([A-Za-z]+).*", "\\1", x)
#[1] "TBPantibody"
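
The effect of the leading and trailing `.*` can be seen in a minimal sketch on a shortened, hypothetical stand-in for `x` (the names `s`, `partial` and `full` are assumptions for illustration only). `sub` replaces just the part of the string that the pattern matches, so the pattern has to cover the whole string for only the captured group to remain.

```r
# Shortened, hypothetical stand-in for the string x from the question.
s <- "cell sex: F Antibody Target: TBPantibody targetdescription: General factor"

# The pattern matches only part of the string, so the unmatched text
# before and after the match is kept in the result.
partial <- sub("Antibody Target: ([A-Za-z]+)", "\\1", s)
print(partial)

# With .* on both sides the pattern covers the whole string, so the
# whole string is replaced by the first capture group.
full <- sub(".*Antibody Target: ([A-Za-z]+).*", "\\1", s)
print(full)
```

Note also that the character class in the question, `[A-Zaz]`, is missing a hyphen: it matches uppercase letters plus the literal characters `a` and `z`, not the lowercase range `a-z`.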

【Comments】:

    【Solution 2】:

    Could you please try the following.

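    sub("(.*Antibody Target: )([^ ]*).*", "\\2", variable)

    Explanation: This assumes the OP's sample value is stored in a variable named variable. It uses sub, base R's substitution function.

    Syntax of sub:

    sub("regex_to_match", "value_from_memory_of_matched_regex OR new_value_to_put_in_place_of_the_matched_part", variable_that_needs_to_be_processed)

    "(.*Antibody Target: )([^ ]*).*": The first group, (.*Antibody Target: ), matches from the start of the variable's value up to and including the string Antibody Target: and keeps that match in memory (the parentheses mean the match is captured). The second group, ([^ ]*), captures everything up to the first space. The trailing .* consumes the rest of the string, so the pattern covers the whole value. \\2, which must be passed as a quoted string, then replaces the whole value with the second captured group, that is, the string that follows Antibody Target: .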

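As an alternative to replacing the whole string, the captured group can be pulled out directly. The following is a minimal sketch using base R's `regexec` and `regmatches` on a shortened, hypothetical stand-in for the question's string (the names `s`, `m`, `parts` and `target` are assumptions for illustration, not part of the original answers).

```r
# Shortened, hypothetical stand-in for the string from the question.
s <- "cell sex: F Antibody Target: TBPantibody targetdescription: General factor"

# regexec() returns the positions of the full match and of each capture group.
m <- regexec("Antibody Target: ([^ ]*)", s)

# regmatches() turns those positions into strings: element 1 is the full
# match, element 2 is the first capture group.
parts <- regmatches(s, m)[[1]]
target <- parts[2]
print(target)
```

With this approach the pattern does not need `.*` on either side, and when there is no match `parts` is empty, so `target` is `NA` rather than the unchanged input string.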
    【Comments】:
