【Posted】:2012-04-02 21:14:59
【Question】:
I am building a web crawler on Google App Engine. To store the crawled information in the Datastore, I use JDO with the fields below. The code is as follows:
import javax.jdo.annotations.Extension;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

// @PersistenceCapable is assumed here; it is not shown in the original excerpt.
@PersistenceCapable
public class LinkInfo
{
    @PrimaryKey
    @Persistent private String id;
    @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
    @Persistent private int linkNo;
    @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
    @Persistent private String link;
    @Persistent private int version;
    @Persistent private String fetchDate;
    @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
    @Persistent private long fetchTime;
    @Persistent private String nextFetch;
    @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
    @Persistent private String pageCreationDate;
    @Persistent private int retries;
    @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
    @Persistent private int retryInterval;
    @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
    @Persistent private int outLinks;
    @Persistent private float score;
    @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
    @Persistent private String abstractContent;
    @Persistent private String contentType;
    @Persistent private String parent;
    @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
    @Persistent private String title;
...
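Of the 16 fields, I have marked 8 as unindexed because I do not need to filter or sort on them. Even so, I am still exceeding the Datastore Write Operations quota.
Any suggestions for reducing the number of Datastore write operations?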
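For context, here is a rough sketch of how the number of indexed fields could translate into write operations. It assumes the per-put billing rule as I recall it from the App Engine documentation of that period: a new entity costs 2 writes plus 2 per indexed property value plus 1 per composite index value, and an update costs 1 write plus 4 per modified indexed property value plus 2 per modified composite index value. The class name `WriteOpEstimate` and its methods are hypothetical helpers, not part of any SDK, and the rates should be checked against the current billing documentation.

```java
// Back-of-the-envelope estimator for Datastore write operations per put.
// ASSUMPTION: the cost formulas below reflect the App Engine billing rules
// as documented around 2012; they are not read from any API.
public final class WriteOpEstimate {
    private WriteOpEstimate() {}

    /** New entity: 2 writes + 2 per indexed property value + 1 per composite index value. */
    public static int newEntityPut(int indexedValues, int compositeIndexValues) {
        return 2 + 2 * indexedValues + compositeIndexValues;
    }

    /** Existing entity: 1 write + 4 per modified indexed value + 2 per modified composite index value. */
    public static int existingEntityPut(int modifiedIndexedValues, int modifiedCompositeValues) {
        return 1 + 4 * modifiedIndexedValues + 2 * modifiedCompositeValues;
    }

    public static void main(String[] args) {
        // LinkInfo as posted: 7 indexed fields besides the key
        // (version, fetchDate, nextFetch, retries, score, contentType, parent).
        System.out.println(newEntityPut(7, 0));       // 2 + 2*7  = 16
        // The same entity if all 15 non-key fields were indexed.
        System.out.println(newEntityPut(15, 0));      // 2 + 2*15 = 32
        // A re-fetch that changes, say, fetchDate, nextFetch and retries.
        System.out.println(existingEntityPut(3, 0));  // 1 + 4*3  = 13
    }
}
```

Under these assumptions, each new LinkInfo still costs about 16 write operations even with 8 fields unindexed, so the remaining indexed fields and the number of puts per crawled page are the places to look for savings.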
【Discussion】:
Tags: google-app-engine google-cloud-datastore web-crawler