【Question title】:Dataframe summary of unique pages
【Posted】:2016-06-24 03:58:10
【Question】:

This is my dataframe:

import pandas as pd
import re

!wget https://s3.amazonaws.com/todel162/elastic.csv

df=pd.read_csv('elastic.csv')

def mysearch(mystring):
    # raw string so that \( and \) reach the regex engine unchanged
    urls = re.findall(r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', mystring)
    return urls

df['mysearch']=df.Body.apply(mysearch)

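Each value in the mysearch column can hold several URLs. I need to group every unique HTML page (the page name, not the full URL) with the ParentId values it appears under. The output should look like this: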

query-dsl-term-query.html 35564374, 46568374
query-dsl-bool-query.html 35594195, 75694493
plugins-inputs-jdbc.html 34203007
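
To make the "page, not URL" requirement concrete, here is a minimal standard-library sketch. The pattern `URL_RE`, the helper `html_pages`, and the sample rows with their IDs are all hypothetical and are not taken from elastic.csv. The pattern is deliberately simpler than the one in the question.

```python
import re
from collections import defaultdict

# Simplified, hypothetical pattern: a guide URL runs until whitespace or a quote/bracket.
URL_RE = re.compile(r'elastic\.co/guide[^\s"<>()]*')

def html_pages(text):
    """Return the unique .html page names found in text, in first-seen order."""
    pages = []
    for url in URL_RE.findall(text):
        path = re.split(r'[?#]', url, maxsplit=1)[0]  # drop query string / fragment
        page = path.rsplit('/', 1)[-1]                # last path segment
        if page.endswith('.html') and page not in pages:
            pages.append(page)
    return pages

# Made-up sample rows: (ParentId, Body)
rows = [
    ('111', 'See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html#notes'),
    ('222', 'Docs: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-term-query.html and '
            'https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html'),
]

page_to_parents = defaultdict(list)
for parent_id, body in rows:
    for page in html_pages(body):
        page_to_parents[page].append(parent_id)

for page, parents in page_to_parents.items():
    print(page, ', '.join(parents))
```

The two different URLs for query-dsl-term-query.html (version `current` and version `2.3`) collapse into one page entry, which is the behaviour being asked for.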

【Question discussion】:

    Tags: python pandas dataframe group-by unique


    【Solution 1】:

    You can use:

    import pandas as pd
    
    #force column ParentId as string
    df=pd.read_csv('https://s3.amazonaws.com/todel162/elastic.csv', dtype={'ParentId':str})
    #print (df)
    
    #find all patterns, create new dataframe
    pat = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    df1 = pd.DataFrame([x for x in df.Body.str.findall(pat)])
    
    #see http://stackoverflow.com/a/37592047/2901002
    df1 = df.drop('Body',axis=1).join(df1.stack().reset_index(drop=True, level=1).rename('Body'))
    
    #keep only rows whose URL contains a literal '.html' (rows without a URL are dropped)
    df1 = df1[df1.Body.str.contains('.html', regex=False, na=False)]
    
    #split on the last `/` and keep the final segment (the page name)
    df1['url'] = df1.Body.str.rsplit('/', n=1, expand=False).str[1]
    #print (df1)
    
    #join by unique url
    df2 = df1.groupby('url')['ParentId'].apply(lambda x: ','.join(x.astype(str))).reset_index()
    
    print (df2)
    
                                                       url  \
    0                                   _add_an_index.html   
    1                                   _add_failover.html   
    2                         _aggregation_test_drive.html   
    3                                 _basic_concepts.html   
    4                               _batch_processing.html   
    5                                    _best_fields.html   
    6                         _boosting_query_clauses.html   
    7                            _bucket_aggregations.html   
    8                         _buckets_inside_buckets.html   
    9                                        _cat_api.html   
    10                              _closer_is_better.html   
    11                                _cluster_health.html   
    12                _combining_queries_with_filters.html   
    13                                _community_dsls.html   
    14                        _community_integrations.html   
    15                                 _configuration.html   
    16                          _controlling_analysis.html   
    17                           _coping_with_failure.html   
    18                          _cross_fields_queries.html   
    19   _dealing_with_json_arrays_and_objects_in_php.html   
    20                      _dealing_with_null_values.html   
    21                               _delete_an_index.html   
    22                             _deleting_an_index.html   
    23                            _deleting_documents.html   
    24                _deploying_in_jboss_eap6_module.html   
    25         _developer_guide_adding_a_new_protocol.html   
    26                             _elasticsearch_net.html   
    27                                  _empty_search.html   
    28                            _exact_value_fields.html   
    29                        _executing_aggregations.html   
    ..                                                 ...   
    923                             suggester-context.html   
    924                       synonyms-analysis-chain.html   
    925                   synonyms-expand-or-contract.html   
    926                                         tasks.html   
    927                            term-level-queries.html   
    928                                   term-vector.html   
    929                             term-vs-full-text.html   
    930                        terms-list-query-usage.html   
    931                             testing-framework.html   
    932                                    time-based.html   
    933                                    time-units.html   
    934                                   token-count.html   
    935                                      top-hits.html   
    936                                      translog.html   
    937                              transport-client.html   
    938                         unicode-normalization.html   
    939                                    unit-tests.html   
    940                                    update-doc.html   
    941                                    user-based.html   
    942              using-elasticsearch-test-classes.html   
    943               using-kibana-for-the-first-time.html   
    944                      using-language-analyzers.html   
    945                               using-stopwords.html   
    946                                using-synonyms.html   
    947               verbatim-and-strict-query-usage.html   
    948                                     visualize.html   
    949                              watch-definition.html   
    950                                watch-log-data.html   
    951                          working-with-plugins.html   
    952                               writing-queries.html   
    
                                                  ParentId  
    0                                                  nan  
    1                                                  nan  
    2                                                  nan  
    3     35958492,nan,35374339,31180988,29818589,32869841  
    4                                             34509058  
    5                                             33398143  
    6    33398143,31836937,34069554,31967672,34006197,3...  
    7                                          nan,nan,nan  
    8                                         nan,30063221  
    9                                             29526147  
    10                 31311687,34323428,34255519,30517904  
    11                                            36026339  
    12                  33395412,nan,28989479,36325156,nan  
    13                                            34143066  
    14                                            34143066  
    15                                            30886182  
    16             31591210,35914330,32246656,32463762,nan  
    17                                        35078736,nan  
    18                          33398143,34631940,36569635  
    19                                                 nan  
    20                                 nan,nan,nan,nan,nan  
    21                                            32872677  
    22                                        nan,22924300  
    23                                                 nan  
    24                                             nan,nan  
    25                                            34132278  
    26                                        nan,30956854  
    27                                   31027308,33658619  
    28                               29923047,33757901,nan  
    29                            nan,nan,30280206,nan,nan  
    ..                                                 ...  
    923  37189942,36802797,36802797,35683069,nan,362040...  
    924                                           34358802  
    925                                  33250379,34358802  
    926                                           36508292  
    927                                           34312196  
    928                              32269054,nan,34680820  
    929                36414571,32264571,32075616,32619266  
    930                                  36697563,36565189  
    931                                           30755194  
    932            28984723,33827559,32635456,32718927,nan  
    933                                           36752424  
    934       36025764,34148626,32059804,34882813,34171223  
    935                  nan,nan,nan,29896839,nan,31411664  
    936                         33110371,33110371,35465922  
    937  nan,35064511,35876176,31453270,nan,27170739,25...  
    938                                                nan  
    939                                                nan  
    940                      nan,33218812,31424380,nan,nan  
    941                                                nan  
    942                                                nan  
    943                                           33996619  
    944                                  30195926,37218517  
    945  31625943,33370591,36794324,30132959,32694958,3...  
    946                          29254643,34255519,nan,nan  
    947                                  37697866,37697866  
    948                                           35347332  
    949                                           31831689  
    950                                  33831247,31831689  
    951                                  37007206,31809884  
    952                                                nan  
    
    [953 rows x 2 columns]
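
The output above still contains `nan` entries and repeated IDs (for example `37697866,37697866`), while the question asks for unique pages with their parent IDs. A minimal sketch of one way to tidy that up, assuming a frame shaped like `df1` after the `url` column has been added. The small frame and its IDs below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df1 (ParentId as string, url already extracted).
df1 = pd.DataFrame({
    'ParentId': ['111', None, '111', '222', '222'],
    'url': ['a.html', 'a.html', 'a.html', 'a.html', 'b.html'],
})

df2 = (
    df1.dropna(subset=['ParentId'])                    # drop rows with no parent
       .drop_duplicates(subset=['url', 'ParentId'])    # one row per (page, parent)
       .groupby('url')['ParentId']
       .apply(','.join)                                # join the remaining IDs
       .reset_index()
)
print(df2)
```

Note that a page whose rows all lack a ParentId disappears from the result with this approach. If such pages should still be listed, the `dropna` step would need to be replaced with something that keeps them.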
    
