【Posted】: 2015-02-26 19:25:55
【Question】:
I have a large CSV file with several thousand lines and no header. Each line contains a single URL.
Some sample lines:
http://www.whitehouse.gov/the-press-office/2012/01/27/remarks-president-college-affordability-ann-arbor-michigan
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-dnc-event-palm-beach-fl
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-energy-aurora-colorado
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-dnc-event-sarasota-fl
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-first-lady-goya-foods-miplato-announcement-tampa-fl
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-made-energy
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-intel-ocotillo-campus-chandler-az
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-first-lady-school-lunch-standards-announcement
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-conveyor-engineering-and-manufacturing-cedar-rapids-io
http://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address
http://www.whitehouse.gov/the-press-office/2012/01/23/remarks-president-welcoming-2011-stanley-cup-champion-boston-bruins
http://www.whitehouse.gov/the-press-office/2012/01/21/weekly-address-creating-jobs-boosting-tourism
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-2
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-1
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-0
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-unveiling-strategy-help-boost-travel-and-tourism
http://www.whitehouse.gov/the-press-office/2012/01/17/remarks-president-and-first-lady-honoring-2011-world-champion-st-louis-c
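I want to filter these URLs and write the results to separate CSV files. I have tried several grep and awk options, but I keep getting too many results that do not match the string I quoted.
For example, I want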
grep "remarks-president" speechurls.csv >> remarks-president_urls.csv
to return only the URLs that contain "remarks-president". Example:
http://www.whitehouse.gov/the-press-office/2012/01/27/remarks-president-college-affordability-ann-arbor-michigan
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-energy-aurora-colorado
http://www.whitehouse.gov/the-press-office/2012/01/26/remarks-president-american-made-energy
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-intel-ocotillo-campus-chandler-az
http://www.whitehouse.gov/the-press-office/2012/01/25/remarks-president-conveyor-engineering-and-manufacturing-cedar-rapids-io
http://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address
http://www.whitehouse.gov/the-press-office/2012/01/23/remarks-president-welcoming-2011-stanley-cup-champion-boston-bruins
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-2
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-1
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event-0
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-campaign-event
http://www.whitehouse.gov/the-press-office/2012/01/19/remarks-president-unveiling-strategy-help-boost-travel-and-tourism
http://www.whitehouse.gov/the-press-office/2012/01/17/remarks-president-and-first-lady-honoring-2011-world-champion-st-louis-c
Similarly,
grep "remarks-first-lady" speechurls.csv >> remarks-first-lady_urls.csv
should return all speeches with "remarks-first-lady" in the URL.
Other variations I tried did not help.
grep -w -l "remarks-president" speechurls.csv >> remarks-president_urls.csv
I also tried the following, with no luck.
awk -F, '$1 ~ /remarks-president|president-obama/ {print}' speechurls.csv
fgrep -w "remarks-vice-president" speechurls.csv
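I am not entirely sure how to solve this. Any help would be greatly appreciated. If there is a better way to do this in Python, I am open to that solution as well.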
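Since the question is open to Python, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the file holds one URL per line and that the category of a speech is identified by the start of the last path segment (for example "remarks-president-..."). The helper names `slug`, `filter_by_slug_prefix`, and `write_filtered` are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse


def slug(url):
    """Return the last path segment of a URL, e.g. 'remarks-president-...'."""
    return urlparse(url.strip()).path.rstrip("/").rsplit("/", 1)[-1]


def filter_by_slug_prefix(urls, prefix):
    """Keep the URLs whose last path segment starts with `prefix`."""
    return [u.strip() for u in urls if slug(u).startswith(prefix)]


def write_filtered(src_path, dst_path, prefix):
    """Read one URL per line from src_path and write the matches to dst_path.

    Note: this overwrites dst_path, unlike the shell's `>>`, which appends.
    Returns the number of matching URLs written.
    """
    with open(src_path, encoding="utf-8") as src:
        matches = filter_by_slug_prefix(src, prefix)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(m + "\n" for m in matches)
    return len(matches)
```

With the file names from the question, a call might look like `write_filtered("speechurls.csv", "remarks-president_urls.csv", "remarks-president")`. Anchoring the match to the start of the last path segment is stricter than a plain substring search, so a URL is only selected when its slug begins with the given prefix.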
【Comments】:
-
Could you edit the question to show the output you expect from
grep "remarks-president" speechurls.csv?
-
Sure, updated to reflect the expected output.
-
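I don't understand. You are running the right tool with the right options and getting the expected output, so what is the problem? Please edit your question to show sample input, the command you are running, and the output you do not want.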