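Answers: Filtering a CSV file of URLs based on string match

【Question title】: Filtering CSV file of URLS based on String Match
【Posted】: 2015-02-26 19:25:55
【Question description】:

I have a large CSV file with several thousand rows and no header; each row contains a single URL.

Some sample rows: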

http://www.whitehouse.gov/the-press-office/2012/01/27/remarks-president-college-affordability-ann-arbor-michigan 
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-dnc-event-palm-beach-fl 
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-energy-aurora-colorado
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-dnc-event-sarasota-fl
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-goya-foods-miplato-announcement-tampa-fl 
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-made-energy
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-intel-ocotillo-campus-chandler-az 
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-first-lady-school-lunch-standards-announcement 
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-conveyor-engineering-and-manufacturing-cedar-rapids-io
http://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address 
http://www.whitehouse.gov/the-press-office/2012/01/23/remarks-president-welcoming-2011-stanley-cup-champion-boston-bruins 
http://www.whitehouse.gov/the-press-office/2012/01/21/weekly-address-creating-jobs-boosting-tourism
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-2 
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-1 
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-0
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event 
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-unveiling-strategy-help-boost-travel-and-tourism
http://www.whitehouse.gov/the-press-office/2012/01/17/remarks-president-and-first-lady-honoring-2011-world-champion-st-louis-c

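I want to filter these URLs so that the results can be piped into separate CSV files. I have tried several grep and awk options, but I keep getting too many results that don't match the string I'm quoting.

For example, I would like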

grep "remarks-president" speechurls.csv >> remarks-president_urls.csv

to return only the URLs that contain "remarks-president". Example:

http://www.whitehouse.gov/the-press-office/2012/01/27/remarks-president-college-affordability-ann-arbor-michigan 
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-energy-aurora-colorado
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-made-energy
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-intel-ocotillo-campus-chandler-az 
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-conveyor-engineering-and-manufacturing-cedar-rapids-io
http://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address 
http://www.whitehouse.gov/the-press-office/2012/01/23/remarks-president-welcoming-2011-stanley-cup-champion-boston-bruins 
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-2 
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-1 
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-0
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event 
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-unveiling-strategy-help-boost-travel-and-tourism
http://www.whitehouse.gov/the-press-office/2012/01/17/remarks-president-and-first-lady-honoring-2011-world-champion-st-louis-c

Similarly,

grep "remarks-first-lady"  speechurls.csv >> remarks-first-lady_urls.csv

should return every speech with "remarks-first-lady" in its URL.

Other variations I tried didn't help.

grep -w -l "remarks-president" speechurls.csv >> remarks-president_urls.csv

I also tried the following, without much luck.

awk -F, '$1 ~ /remarks-president|president-obama/ {print}' speechurls.csv

fgrep -w "remarks-vice-president" speechurls.csv

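I'm not entirely sure how to tackle this. Any help would be greatly appreciated. If there is a better way to do this in Python, I'm open to that solution as well.

【Question discussion】:

Could you edit the question to show the output you expect from grep "remarks-president" speechurls.csv?
Sure, updated to reflect the expected output.
I don't understand. You are running the right tool with the right options and getting the expected output, so what is the problem? Please edit your question to show sample input, the command you are running, and the output you don't want.

Tags: python unix csv awk grep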

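【Solution 1】:

I don't quite understand the question. The command grep "remarks-first-lady" speechurls.csv should work fine in this case.

The problem you're seeing may come from ">>". ">>" appends new lines to an existing file, so each run adds to whatever the file already holds. If you want a file that contains only the output of the command, use ">" instead of ">>".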
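
The append-versus-overwrite difference described above can be sketched in Python, where file mode 'a' behaves like ">>" and mode 'w' behaves like ">". This is only an illustration: the helper names write_matches and count_lines and the temporary file names are made up for the example, and the two URLs are taken from the question's sample rows.

```python
import os
import tempfile


def write_matches(lines, needle, out_path, mode):
    """Write every line containing `needle` to out_path using the given file mode."""
    with open(out_path, mode) as out:
        for line in lines:
            if needle in line:
                out.write(line + "\n")


def count_lines(path):
    """Count the lines currently stored in a file."""
    with open(path) as f:
        return sum(1 for _ in f)


urls = [
    "http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-dnc-event-sarasota-fl",
    "http://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address",
]

# Hypothetical output locations, used only for this demonstration.
tmp_dir = tempfile.mkdtemp()
appended = os.path.join(tmp_dir, "appended.csv")
overwritten = os.path.join(tmp_dir, "overwritten.csv")

# Run the same filter twice, as happens when a command is retried.
for _ in range(2):
    write_matches(urls, "remarks-first-lady", appended, "a")     # like >>
    write_matches(urls, "remarks-first-lady", overwritten, "w")  # like >

appended_count = count_lines(appended)        # the single match is stored twice
overwritten_count = count_lines(overwritten)  # only the latest run survives
```

If the output files were built up over several experiments with ">>", they would also hold the results of every earlier pattern, which may look like "too many results".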

If you can also point out what is going wrong with your code, I may be able to identify your problem better.

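【Discussion】:

I meant to, but I don't have the privilege.
Thanks, I believe the threshold is 50, and I'm almost there now.

【Solution 2】:

Cases like this are a fun excuse to write a quick and dirty Python script. I believe the following should work.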

import csv

# Append each URL to the output file for the first substring it contains.
with open('speechurls.csv', 'r') as f:
    for row in csv.reader(f):
        if 'remarks-president' in row[0]:
            with open('remarks-president_urls.csv', 'a') as f1: f1.write("{}\n".format(row[0]))
        elif 'remarks-first-lady' in row[0]:
            with open('remarks-first-lady_urls.csv', 'a') as f2: f2.write("{}\n".format(row[0]))
        # Rows matching neither substring are skipped.

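It isn't pretty and it has no elegant design, but it works and seems to do what you asked.

【Discussion】:

Thanks Muttonchop! This works as expected! The only odd error I got was "_csv.Error: new-line character seen in unquoted field". However, if I change 'r' to 'rU' in the script, it does exactly what I want!
I'm glad it helped! If it answered your question, please consider marking it as the accepted answer.
Could you please explain what this question is about? What is the task that the OP tried and failed to accomplish with his grep command?
To clarify, what I was trying to do with the grep commands was to filter one large CSV file into several other filtered CSVs. The problem was that the original grep approach returned some URLs I wanted and many URLs I didn't. @Muttonchop's original answer does exactly what I wanted. I modified his code and posted it below as an answer. Thanks again, everyone, for the help!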
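
A note on the "_csv.Error: new-line character seen in unquoted field" message mentioned in this discussion: one common cause on Python 2 is a file that uses bare carriage returns as line endings, which is what the 'rU' (universal newlines) mode works around. Whether that was the cause here is an assumption. On Python 3 the csv documentation recommends opening the file with newline='' instead. A minimal sketch, where the helper name read_first_column and the temporary sample file are made up for illustration:

```python
import csv
import os
import tempfile


def read_first_column(path):
    """Return the first field of every non-empty row in a CSV file."""
    # newline='' passes line endings through untouched so the csv module
    # can recognise \r, \n and \r\n by itself.
    with open(path, "r", newline="") as f:
        return [row[0] for row in csv.reader(f) if row]


# Hypothetical sample file that uses bare \r line endings.
tmp_path = os.path.join(tempfile.mkdtemp(), "speechurls.csv")
with open(tmp_path, "wb") as f:
    f.write(
        b"http://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address\r"
        b"http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-dnc-event-sarasota-fl\r"
    )

urls = read_first_column(tmp_path)  # one entry per URL, without the \r
```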

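【Solution 3】:

I just wanted to post an update to my question. Thanks to @Muttonchop's help, I was able to solve the CSV filtering problem.

This Python solution works well. Modifying @Muttonchop's initial response, here is the complete code I ended up with: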

    def filterSpeechURL():
        import csv 
        with open('speechurls.csv', 'rU') as f:
            for row in csv.reader(f):

                #Filter President Obama
                if 'remarks-president' in row[0]:
                    with open('__president_urls.csv','a') as f1: f1.write("{}\n".format(row[0]))
                elif 'weekly-address' in row[0]:
                    with open('__president_urls.csv','a') as f1: f1.write("{}\n".format(row[0]))
                elif 'letter' in row[0]:
                    with open('__president_urls.csv','a') as f1: f1.write("{}\n".format(row[0]))
                elif 'statement-president' in row[0]:
                    with open('__president_urls.csv','a') as f1: f1.write("{}\n".format(row[0]))
                elif 'president-obama' in row[0]:
                    with open('__president_urls.csv','a') as f1: f1.write("{}\n".format(row[0]))
                elif 'excerpts-president' in row[0]:
                    with open('__president_urls.csv','a') as f1: f1.write("{}\n".format(row[0]))


                #Filter First Lady
                elif 'remarks-first-lady' in row[0]:
                    with open('__first-lady_urls.csv', 'a') as f2: f2.write("{}\n".format(row[0]))


                #Filter VP
                elif 'vice-president' in row[0]:
                    with open('__vice_president_urls.csv', 'a') as f2: f2.write("{}\n".format(row[0]))


                #Filter Jill Biden
                elif 'jill' in row[0]:
                    with open('__second-lady_urls.csv', 'a') as f2: f2.write("{}\n".format(row[0]))
                elif 'dr-biden' in row[0]:
                    with open('__second-lady_urls.csv', 'a') as f2: f2.write("{}\n".format(row[0]))


                #Filter Everything Else
                else:
                    with open('__other_urls.csv', 'a') as f2: f2.write("{}\n".format(row[0]))

    filterSpeechURL()
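
For readers on current Python versions, the same first-match-wins logic can be sketched with a rule table instead of a long elif chain. This is a variation under stated assumptions, not the code from the answer above: the names RULES, DEFAULT_OUTPUT, classify and filter_speech_urls are invented for the sketch, the 'rU' mode is replaced by newline='' (the 'U' mode was removed in Python 3.11), and output files are written with 'w' rather than 'a', so rerunning the script replaces earlier results instead of appending to them.

```python
import csv
from collections import defaultdict

# (substring, output file) pairs, checked in order; the first match wins.
RULES = [
    ("remarks-president", "__president_urls.csv"),
    ("weekly-address", "__president_urls.csv"),
    ("letter", "__president_urls.csv"),
    ("statement-president", "__president_urls.csv"),
    ("president-obama", "__president_urls.csv"),
    ("excerpts-president", "__president_urls.csv"),
    ("remarks-first-lady", "__first-lady_urls.csv"),
    ("vice-president", "__vice_president_urls.csv"),
    ("jill", "__second-lady_urls.csv"),
    ("dr-biden", "__second-lady_urls.csv"),
]
DEFAULT_OUTPUT = "__other_urls.csv"


def classify(url, rules=RULES, default=DEFAULT_OUTPUT):
    """Return the output file name for the first rule whose substring is in url."""
    for needle, output in rules:
        if needle in url:
            return output
    return default


def filter_speech_urls(in_path="speechurls.csv"):
    """Group the URLs of in_path by output file, then write each file once."""
    buckets = defaultdict(list)
    with open(in_path, "r", newline="") as f:
        for row in csv.reader(f):
            if row:  # skip blank lines
                buckets[classify(row[0])].append(row[0])
    for output, urls in buckets.items():
        # 'w' overwrites: a rerun does not duplicate earlier results.
        with open(output, "w") as out:
            out.write("".join(url + "\n" for url in urls))
    return buckets
```

Calling filter_speech_urls() in the directory that holds speechurls.csv writes the output files. Because the rules are checked in order, a URL such as remarks-president-and-first-lady-... lands in the president file, just as in the elif chain.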

【Discussion】: