【Question title】: Pig capture matching string with regex
【Posted】: 2014-11-04 22:40:09
【Question】:

I am trying to capture image URLs from tweets.

REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar';

--Load Json

loadJson = LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
B = FOREACH loadJson GENERATE flatten(json#'tweets') as (m:map[]);
tweetText = FOREACH B GENERATE FLATTEN(m#'text') as (str:chararray);

The intermediate data looks like this:

(@somenameontwitter your nan makes me laugh with some of the things she comes out with like http://somepics.com/my.jpg)

I then try the following to get just the image URL:

 x = foreach tweetText generate REGEX_EXTRACT_ALL(str, '((http)(.*)(.jpg|.bmp|.png))');

dump x;

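But this doesn't seem to work. I have also been experimenting with FILTER, to no avail.

Even when I try the above with .* added, it returns empty results: () or (())

I'm not good with regex and I'm fairly new to Pig, so I may be missing something simple here that I'm not seeing.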
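
A likely reason for the empty result, sketched in Java because Pig's regex built-ins run on java.util.regex. The assumption here, worth checking against your Pig version, is that REGEX_EXTRACT_ALL only returns groups when the pattern covers the entire input string (the behavior of Matcher.matches()), while REGEX_EXTRACT searches for the first occurrence (Matcher.find()) by default. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    // The pattern from the question, unchanged.
    static final Pattern QUESTION_PATTERN =
            Pattern.compile("((http)(.*)(.jpg|.bmp|.png))");

    // Whole-string semantics: true only if the pattern covers the entire input.
    static boolean matchesWholeString(String input) {
        return QUESTION_PATTERN.matcher(input).matches();
    }

    // Search semantics: group 1 of the first occurrence, or null if there is none.
    static String findFirst(String input) {
        Matcher m = QUESTION_PATTERN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tweet = "@somenameontwitter your nan makes me laugh with some of the "
                + "things she comes out with like http://somepics.com/my.jpg";
        // The tweet starts with '@', so a whole-string match cannot succeed.
        System.out.println(matchesWholeString(tweet)); // false
        // Searching inside the string does locate the URL.
        System.out.println(findFirst(tweet));          // http://somepics.com/my.jpg
    }
}
```

Under that assumption, a function with search semantics such as REGEX_EXTRACT is the more natural fit for pulling a URL out of the middle of a tweet.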

Update

Sample input data

 {"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"@Beace_ your nan makes me laugh with some of the things she comes out with blabla http://t.co/b7hjMWNg is an url, but not a valid one http://www.something.com/this.jpg should be a valid url","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet
_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}

【Comments】:

  • You want to extract the image URL from each tweet, right? That is, the final output should be "somepics.com/my.jpg"?
  • Correct, that's what I want.

Tags: regex apache-pig


【Solution 1】:

Try this and let me know if it works:

x = foreach tweetText generate REGEX_EXTRACT(str,'.*(http://.*.[jpg|bmp|png])',1);
DUMP x;
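
One thing to watch in this pattern: square brackets define a character class, so `[jpg|bmp|png]` matches exactly one character from the set j, p, g, |, b, m, n rather than one of the three extensions, and the unescaped `.` before it matches any character. That means a link such as http://t.co/b7hjMWNg, which merely ends in "g", can pass. A small Java sketch (Pig's regex functions are built on java.util.regex; the class, constant, and helper names are illustrative, not part of Pig) compares it with a grouped alternation:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Pattern from the answer: [jpg|bmp|png] is a one-character class.
    static final String WITH_CLASS = ".*(http://.*.[jpg|bmp|png])";
    // Variant with an escaped dot and a non-capturing alternation group.
    static final String WITH_GROUP = ".*(http://.*\\.(?:jpg|bmp|png))";

    // Returns capture group 1 of the first occurrence, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String shortLink = "look http://t.co/b7hjMWNg";
        String image = "look http://www.something.com/this.jpg";
        // The character class accepts the short link because it ends in 'g'.
        System.out.println(firstGroup(WITH_CLASS, shortLink)); // http://t.co/b7hjMWNg
        // The alternation requires a literal ".jpg", ".bmp" or ".png".
        System.out.println(firstGroup(WITH_GROUP, shortLink)); // null
        System.out.println(firstGroup(WITH_GROUP, image));     // http://www.something.com/this.jpg
    }
}
```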

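【Comments】:

  • That gives me better results, but it's not there yet. Here are some samples of what I get back: () (http://t.co/Vbiq6ZuvzB) RT to enter Entran) () () (http://t.co/TQu8XGg) (http://t.co/EYp) () (http://t.co/g0p) () (http://t.co/efo13URHBg http) () (http://t.co/DJU5KlsiCr http) () () (http://t.co/HUVPF9j) (http://t.co/iipidujn) () () (http://t.co/Xd6NqApcnC http) () () () () () (http://t.co/tXQT891XA5 http) () () () (http://t.co/b7hjMWNg)
  • Can you paste your JSON file? I'd like to check the input for image URLs.
  • Added a sample to the question.
  • Where is the image URL in the JSON input? You are only printing the value of the key "text": "@Beace_ your nan makes me laugh with some of the things she comes out with". This "text" key doesn't contain any image URL. Can you give me more information?
  • Sorry, I'll adjust it in a minute. But basically it doesn't matter: any tweet whose text value contains an image URL should pass the filter, and a matching URL anywhere in the text should be captured.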
【Solution 2】:

I managed to get it working (though I doubt it is entirely optimal):

x = foreach tweetText generate REGEX_EXTRACT(str,'(http://.*(.jpg|.bmp|.png))',1) as image;


filtered = FILTER x BY $0 is not null;


dump filtered;

So the initial problem was just the regex (and my lack of knowledge on the subject).

Thanks to sivasakthi jayaraman for the help!
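
A possible tightening, offered as a sketch rather than a tested Pig job: in the pattern above the dots before the extensions are unescaped and `.*` is greedy, so when a tweet holds two links the capture can run from the first `http://` through to the last image extension. The Java snippet below (java.util.regex is what Pig's regex functions use; `TIGHTER` and the class name are illustrative) compares it with a variant that escapes the dot, forbids whitespace inside the URL, and also accepts https:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageUrlDemo {
    // Pattern from this answer.
    static final String ORIGINAL = "(http://.*(.jpg|.bmp|.png))";
    // Hypothetical tighter variant: no whitespace in the URL, literal dot, http or https.
    static final String TIGHTER = "(https?://\\S+?\\.(?:jpg|bmp|png))";

    // Returns capture group 1 of the first occurrence, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "blabla http://t.co/b7hjMWNg is an url "
                + "http://www.something.com/this.jpg should be valid";
        // Greedy .* spans both links and the words between them.
        System.out.println(firstGroup(ORIGINAL, text));
        // \S+? cannot cross the space, so only the image URL is captured.
        System.out.println(firstGroup(TIGHTER, text));
    }
}
```

If you try the tighter pattern in a Pig script, the backslashes will probably need to be doubled inside the quoted string there as well; treat that as something to verify rather than a given.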
