【Question title】: Not able to scrape a website for HTML?
【Posted】: 2012-09-29 07:39:58
【Question description】:

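So I'm trying to get my app to visit a website, fetch its HTML, strip the unnecessary elements from that HTML, and then load the "content" in my makeshift app, since I don't have an API or a feed. I'm using Jsoup. It works when I'm not scraping from within Android, but Android doesn't like it.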

public class SimpleDiggActivity extends Activity {

private WebView browser;
final Activity activity = this;

@Override
public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    this.getWindow().requestFeature(Window.FEATURE_PROGRESS);

    setContentView(R.layout.main);

    getWindow().setFeatureInt(Window.FEATURE_PROGRESS, Window.PROGRESS_VISIBILITY_ON);

    String url = "http://www.digg.com";
    Document digg;
    browser = (WebView) findViewById(R.id.mybrowser);
    final Button homeDigg = (Button) findViewById(R.id.button1);

    browser.setWebViewClient(new SimpleWebViewClient());

    browser.getSettings().setJavaScriptEnabled(true);
    browser.getSettings().setUseWideViewPort(true);
    browser.getSettings().setLoadWithOverviewMode(true);
    browser.getSettings().setDisplayZoomControls(false);
    browser.getSettings().setEnableSmoothTransition(true);
    browser.getSettings().setBuiltInZoomControls(true);
    browser.getSettings().setUserAgentString("Android");

    // progressCircle = ProgressDialog.show(SimpleDiggActivity.this, "", "Loading...");
    final ProgressDialog progressCircle = new ProgressDialog(activity);
    progressCircle.setProgressStyle(ProgressDialog.STYLE_SPINNER);
    progressCircle.setMessage("Loading...");
    progressCircle.setCancelable(false);

    try{
        Toast.makeText(getApplicationContext(), "No Steps down", Toast.LENGTH_SHORT).show();
        Document diggTest = Jsoup.connect("http://digg.com/enable/mobile").get();
        Toast.makeText(getApplicationContext(), "1 Steps down", Toast.LENGTH_SHORT).show();
        String diggTitle = diggTest.title();
        Toast.makeText(getApplicationContext(), "2 Steps down", Toast.LENGTH_SHORT).show();
        Document compressed = Jsoup.parseBodyFragment(diggTitle);
        Toast.makeText(getApplicationContext(), "3 Steps down", Toast.LENGTH_SHORT).show();
        org.jsoup.select.Elements div = diggTest.select("div");
        Toast.makeText(getApplicationContext(), "4 Steps down", Toast.LENGTH_SHORT).show();
        String divBrow = div.toString();
        Toast.makeText(getApplicationContext(), "5 Steps down", Toast.LENGTH_SHORT).show();
        browser.loadUrl(divBrow);
    }catch (Exception e){
        e.printStackTrace();

        Toast.makeText(getApplicationContext(), "Gave up", Toast.LENGTH_SHORT).show();
        String diggBrow = url;
        browser.loadUrl("http://www.google.com");
    }

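Sorry if it's messy; I'm just fooling around, and this is my first attempt. The Toasts are there to tell me when the code in the try block fails and falls back to the catch. When I run it, it doesn't get past: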

 Document diggTest = Jsoup.connect("http://digg.com/enable/mobile").get();
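
The catch block in the snippet above only shows a "Gave up" toast, so the reason the Jsoup.connect(...).get() call fails stays hidden. Below is a minimal sketch, in plain Java, of a helper that condenses an exception and its causes into one line that could be shown in the toast or written to the log. ExceptionSummary and describe are hypothetical names used for illustration; they are not part of Jsoup or Android.

```java
import java.io.IOException;

/** Hypothetical helper: summarizes an exception and its cause chain in one line. */
final class ExceptionSummary {

    private ExceptionSummary() {}

    /** Returns the simple class name and message of the error and each of its causes. */
    static String describe(Throwable error) {
        StringBuilder summary = new StringBuilder();
        Throwable current = error;
        // Cap the walk so a self-referencing cause chain cannot loop forever.
        for (int depth = 0; current != null && depth < 10; depth++) {
            if (depth > 0) {
                summary.append(" <- ");
            }
            summary.append(current.getClass().getSimpleName());
            if (current.getMessage() != null) {
                summary.append(": ").append(current.getMessage());
            }
            current = current.getCause();
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        // Made-up failure, used only to demonstrate the output format.
        Exception sample = new RuntimeException("fetch failed", new IOException("timeout"));
        // Prints: RuntimeException: fetch failed <- IOException: timeout
        System.out.println(describe(sample));
    }
}
```

In the Activity, the catch block could then show ExceptionSummary.describe(e) in the toast instead of the fixed "Gave up" text, which would name the exception type that is actually thrown.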

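【Question discussion】:

  • There are no errors; it just can't connect to digg.com or get the HTML from it. I tried it in a regular Java workspace and it ran fine.
  • I assume you are running it on the emulator. Do you have to set a permission to allow external connections? (I'm asking; unfortunately I don't know.)
  • Do you have this in your manifest?
  • I was actually running this on the emulator; I can try my phone and see whether I get a different result. Yes, that is my only permission. It has full access to the Internet, with and without the app.
  • On my phone I get exactly the same thing: it gives up immediately and moves on to the catch.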
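
None of the comments above settles the cause. One possibility, which this thread does not confirm, is that the fetch runs on the UI thread: onCreate executes there, and for apps targeting Android 3.0 (API 11) or later, network I/O on that thread throws NetworkOnMainThreadException, which the broad catch (Exception e) would swallow. The sketch below, in plain Java so it can run outside Android, shows the general shape of moving a blocking fetch onto a worker thread. BackgroundFetch, Callback, and fetchAsync are hypothetical names, and the Callable is a stand-in for the real Jsoup call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: run a blocking fetch on a worker thread, not the caller's thread. */
final class BackgroundFetch {

    /** Hypothetical callback; both methods are invoked on the worker thread. */
    interface Callback {
        void onSuccess(String html);
        void onError(Exception error);
    }

    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Submits the blocking call and reports its outcome through the callback. */
    Future<?> fetchAsync(final Callable<String> blockingFetch, final Callback callback) {
        return worker.submit(new Runnable() {
            @Override
            public void run() {
                String html;
                try {
                    html = blockingFetch.call();
                } catch (Exception error) {
                    callback.onError(error);
                    return;
                }
                callback.onSuccess(html);
            }
        });
    }

    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        BackgroundFetch fetcher = new BackgroundFetch();
        // Stand-in for Jsoup.connect("http://digg.com/enable/mobile").get().select("div").toString()
        Callable<String> fakeFetch = new Callable<String>() {
            @Override
            public String call() {
                return "<div id=\"container\">stories</div>";
            }
        };
        Future<?> pending = fetcher.fetchAsync(fakeFetch, new Callback() {
            @Override
            public void onSuccess(String html) {
                System.out.println("fetched " + html.length() + " chars on "
                        + Thread.currentThread().getName());
            }

            @Override
            public void onError(Exception error) {
                System.out.println("failed: " + error);
            }
        });
        pending.get(); // block only so the demo prints before the JVM exits
        fetcher.shutdown();
    }
}
```

On Android, the callback bodies would need to hop back to the UI thread (for example with Activity.runOnUiThread) before touching the WebView or showing a Toast, because the callback here runs on the worker thread.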

Tags: java android eclipse parsing jsoup


【Solution 1】:

I tried your code with Jsoup version 1.7.1 and it works fine on my end. Here is the working code:

public class SimpleDiggActivity extends Activity {

    final Activity activity = this;

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        this.getWindow().requestFeature(Window.FEATURE_PROGRESS);

        setContentView(R.layout.activity_simple_digg);

        getWindow().setFeatureInt(Window.FEATURE_PROGRESS,
                Window.PROGRESS_VISIBILITY_ON);

        String url = "http://www.digg.com";
        Document digg;

        // progressCircle = ProgressDialog.show(SimpleDiggActivity.this, "",
        // "Loading...");
        final ProgressDialog progressCircle = new ProgressDialog(activity);
        progressCircle.setProgressStyle(ProgressDialog.STYLE_SPINNER);
        progressCircle.setMessage("Loading...");
        progressCircle.setCancelable(false);

        try {
            Toast.makeText(getApplicationContext(), "No Steps down",
                    Toast.LENGTH_SHORT).show();
            Document diggTest = Jsoup.connect("http://digg.com/enable/mobile")
                    .get();
            Toast.makeText(getApplicationContext(), "1 Steps down",
                    Toast.LENGTH_SHORT).show();
            String diggTitle = diggTest.title();
            Toast.makeText(getApplicationContext(), "2 Steps down",
                    Toast.LENGTH_SHORT).show();
            Document compressed = Jsoup.parseBodyFragment(diggTitle);
            Toast.makeText(getApplicationContext(), "3 Steps down",
                    Toast.LENGTH_SHORT).show();
            org.jsoup.select.Elements div = diggTest.select("div");
            Toast.makeText(getApplicationContext(), "4 Steps down",
                    Toast.LENGTH_SHORT).show();
            String divBrow = div.toString();
            Toast.makeText(getApplicationContext(), "5 Steps down",
                    Toast.LENGTH_SHORT).show();
            Log.d(this.getClass().getSimpleName(), "data is " + divBrow);
        } catch (Exception e) {
            e.printStackTrace();

            Toast.makeText(getApplicationContext(), "Gave up",
                    Toast.LENGTH_SHORT).show();
            String diggBrow = url;
        }
    }
}

Here is the value of divBrow:

10-10 11:58:45.631: D/SimpleDiggActivity(350): data is <div class="site-header-container page-container"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):  <header class="site-header"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):   <h1 class="site-header-logo-container"><a href="/" id="site-header-logo" class="image-replace">Digg</a></h1> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):  </header> 
10-10 11:58:45.631: D/SimpleDiggActivity(350): </div>
10-10 11:58:45.631: D/SimpleDiggActivity(350): <div id="container" class="page-container"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):  <ul id="top-stories"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):   <li class="story-container story-1" data-content-id="Racz8K" id="story-Racz8K"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):    <div class="story-details"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-kicker">
10-10 11:58:45.631: D/SimpleDiggActivity(350):       NO FILTER 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-headline"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <a data-position="0" class="story-link" href="http://www.fastcompany.com/3001994/no-filter-inside-hipstamatics-lost-year-searching-next-killer-social-app" data-content-id="Racz8K"> Inside Hipstamatic’s Lost Year Searching For The Next Killer Social&nbsp;App </a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-domain"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <div class="story-link-wrapper"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):       <a data-position="0" class="story-link" href="http://www.fastcompany.com/3001994/no-filter-inside-hipstamatics-lost-year-searching-next-killer-social-app" data-content-id="Racz8K">fastcompany.com</a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <div class="story-actions"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):       <span class="story-action-item story-score"> <span class="story-score-details"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):         <ul class="story-score-details-list"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-thumb-Racz8K story-score-thumb">20</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-tweets-Racz8K story-score-twitter">402</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-fb_shares-Racz8K story-score-facebook">72</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):         </ul> </span> <span class="story-score-Racz8K">494</span> </span> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-image"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <a data-position="0" class="story-link" href="http://www.fastcompany.com/3001994/no-filter-inside-hipstamatics-lost-year-searching-next-killer-social-app" data-content-id="Racz8K"><img src="http://static.digg.com/images/Racz8K_1_www_large_thumb.jpeg" alt="" width="312" height="170" /></a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-preview">
10-10 11:58:45.631: D/SimpleDiggActivity(350):      From rooftop bashes and acquisition talks to staff clashes and layoffs, Hipstamatic’s founders and ex-employees describe the startup’s losing struggle to keep pace with Instagram, Facebook, and others in the white-hot photo-sharing space.
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):    </div> </li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):   <li class="story-container story-1" data-content-id="Qa2sP3" id="story-Qa2sP3"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):    <div class="story-details"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-kicker">
10-10 11:58:45.631: D/SimpleDiggActivity(350):       PHOTOGRAPHY 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-headline"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <a data-position="1" class="story-link" href="http://lens.blogs.nytimes.com/2012/10/09/looking-into-the-eyes-of-made-in-china/" data-content-id="Qa2sP3"> Looking Into The Eyes Of 'Made In&nbsp;China' </a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-domain"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <div class="story-link-wrapper"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):       <a data-position="1" class="story-link" href="http://lens.blogs.nytimes.com/2012/10/09/looking-into-the-eyes-of-made-in-china/" data-content-id="Qa2sP3">lens.blogs.nytimes.com</a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <div class="story-actions"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):       <span class="story-action-item story-score"> <span class="story-score-details"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):         <ul class="story-score-details-list"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-thumb-Qa2sP3 story-score-thumb">0</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-tweets-Qa2sP3 story-score-twitter">252</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):          <li class="story-score-fb_shares-Qa2sP3 story-score-facebook">411</li> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):         </ul> </span> <span class="story-score-Qa2sP3">663</span> </span> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-image"> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):      <a data-position="1" class="story-link" href="http://lens.blogs.nytimes.com/2012/10/09/looking-into-the-eyes-of-made-in-china/" data-content-id="Qa2sP3"><img src="http://static.digg.com/images/Qa2sP3_1_www_large_thumb.jpeg" alt="" width="312" height="170" /></a> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 
10-10 11:58:45.631: D/SimpleDiggActivity(350):     <div class="story-preview">
10-10 11:58:45.631: D/SimpleDiggActivity(350):      In “Faces of Made in China,” a series of typological portraits looking at workers inside six Chinese factories, the photographer Lucas Schifres seeks to consider the otherwise anonymous people who produce our essential possessions by looking directly into their eyes.
10-10 11:58:45.631: D/SimpleDiggActivity(350):     </div> 

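Please try it on your end and let me know how it goes.

【Discussion】:

  • OK, thanks. I'll keep working on it. Basically, I need to be able to grab that container div and display it the way it normally looks on mobile Digg, just without the header and footer. Yay. Thanks.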
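
For the display step, note that the question's code passes the extracted markup to browser.loadUrl(divBrow), which expects a URL rather than HTML. A hedged sketch of one way to show only the container div: wrap the fragment in a minimal page and hand the result to WebView.loadDataWithBaseURL. FragmentPage and wrap are hypothetical names; the div#container selector is taken from the logged output above, and the base URL in the comment is an assumption.

```java
/** Hypothetical helper: wraps an extracted HTML fragment in a minimal page. */
final class FragmentPage {

    private FragmentPage() {}

    /** Builds a complete HTML document around the given fragment. */
    static String wrap(String fragmentHtml) {
        return "<!DOCTYPE html>\n"
                + "<html><head><meta charset=\"UTF-8\"></head>\n"
                + "<body>\n"
                + fragmentHtml + "\n"
                + "</body></html>";
    }

    public static void main(String[] args) {
        // Stand-in for diggTest.select("div#container").toString()
        String fragment = "<div id=\"container\" class=\"page-container\">stories</div>";
        String page = wrap(fragment);
        System.out.println(page);
        // On Android (not runnable here), the page could then be displayed with:
        // browser.loadDataWithBaseURL("http://digg.com/", page, "text/html", "UTF-8", null);
    }
}
```

Selecting only div#container leaves out the site-header-container div seen in the log, and supplying a base URL lets relative links such as href="/" resolve. The site's stylesheets live in the page head, which this sketch does not carry over, so the result would likely look unstyled.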