【Posted】: 2015-01-11 13:41:52
【Question】:
I have a file that I parsed from the internet. It contains JSON-formatted data.
I am trying to split this file into smaller parts.
For example:
Original file:
{
"kind": "customsearch#search",
"url": {
"type": "application/json",
"template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
},
"queries": {
"nextPage": [
{
"title": "Google Custom Search - pagerank",
"totalResults": "14700000",
"searchTerms": "pagerank",
"count": 10,
"startIndex": 11,
"inputEncoding": "utf8",
"outputEncoding": "utf8",
"safe": "off",
"cx": "017576662512468239146:omuauf_lfve"
}
],
"request": [
{
"title": "Google Custom Search - pagerank",
"totalResults": "14700000",
"searchTerms": "pagerank",
"count": 10,
"startIndex": 1,
"inputEncoding": "utf8",
"outputEncoding": "utf8",
"safe": "off",
"cx": "017576662512468239146:omuauf_lfve"
}
]
},
"context": {
"title": "CS Curriculum",
"facets": [
[
{
"label": "lectures",
"anchor": "Lectures",
"label_with_op": "more:lectures"
}
],
[
{
"label": "assignments",
"anchor": "Assignments",
"label_with_op": "more:assignments"
}
],
[
{
"label": "reference",
"anchor": "Reference",
"label_with_op": "more:reference"
}
]
]
},
"searchInformation": {
"searchTime": 0.239874,
"formattedSearchTime": "0.24",
"totalResults": "14700000",
"formattedTotalResults": "14,700,000"
},
"items": [
{
"kind": "customsearch#result",
"title": "Lecture slides on PageRank",
"htmlTitle": "Lecture slides on \u003cb\u003ePageRank\u003c/b\u003e",
"link": "https://www.cs.utexas.edu/users/novak/lec5-pagerank.ppt",
"displayLink": "www.cs.utexas.edu",
"snippet": "Distributed Computing Seminar. Lecture 5: Graph Algorithms & PageRank. \nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.",
"htmlSnippet": "Distributed Computing Seminar. Lecture 5: Graph Algorithms & \u003cb\u003ePageRank\u003c/b\u003e. \u003cbr\u003e\nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.",
"cacheId": "CwgPK6hTEZQJ",
"mime": "application/vnd.ms-powerpoint",
"fileFormat": "Microsoft Powerpoint",
"formattedUrl": "https://www.cs.utexas.edu/users/novak/lec5-pagerank.ppt",
"htmlFormattedUrl": "https://www.cs.utexas.edu/users/novak/lec5-\u003cb\u003epagerank\u003c/b\u003e.ppt",
"pagemap": {
"metatags": [
{
"author": "jhebert",
"last saved by": "Google"
}
]
}
},
{
"kind": "customsearch#result",
"title": "The PageRank Citation Ranking: Bringing Order to the Web January ...",
"htmlTitle": "The \u003cb\u003ePageRank\u003c/b\u003e Citation Ranking: Bringing Order to the Web January \u003cb\u003e...\u003c/b\u003e",
"link": "http://www.cis.upenn.edu/~mkearns/teaching/NetworkedLife/pagerank.pdf",
"displayLink": "www.cis.upenn.edu",
"snippet": "Jan 29, 1998 ... We compare PageRank to an idealized random Web surfer. We show how to ... \nThis ranking, called PageRank, helps search engines and.",
"htmlSnippet": "Jan 29, 1998 \u003cb\u003e...\u003c/b\u003e We compare \u003cb\u003ePageRank\u003c/b\u003e to an idealized random Web surfer. We show how to ... \u003cbr\u003e\nThis ranking, called \u003cb\u003ePageRank\u003c/b\u003e, helps search engines and.",
"cacheId": "akmuPYNhiKMJ",
"mime": "application/pdf",
"fileFormat": "PDF/Adobe Acrobat",
"formattedUrl": "www.cis.upenn.edu/~mkearns/teaching/.../pagerank.pdf",
"htmlFormattedUrl": "www.cis.upenn.edu/~mkearns/teaching/.../\u003cb\u003epagerank\u003c/b\u003e.pdf",
"pagemap": {
"cse_image": [
{
"src": "x-raw-image:///9a2d934c7c41f83c4c97c3fb9a4cb4cc8fbcb453aaf1002ed6f970005773aa0e"
}
],
"cse_thumbnail": [
{
"width": "262",
"height": "193",
"src": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQcCouA-BJlMWA0HZNMSxsXzbqIZzgu6tXXRqiuse2sttpJaNK2b0cNbm4"
}
],
"metatags": [
{
"producer": "AFPL Ghostscript 7.0",
"creator": "dvipsk 5.58f Copyright 1986, 1994 Radical Eye Software",
"title": "prpaperdraft.dvi"
}
]
}
},
{
"kind": "customsearch#result",
"title": "MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES ...",
"htmlTitle": "MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES \u003cb\u003e...\u003c/b\u003e",
"link": "http://stanford.edu/class/math51/PageRank.pdf",
"displayLink": "stanford.edu",
"snippet": "Google's method1 is called the PageRank algorithm and was developed by \nGoogle founders Sergey Brin and Larry Page while they were graduate students.",
"htmlSnippet": "Google's method1 is called the \u003cb\u003ePageRank\u003c/b\u003e algorithm and was developed by \u003cbr\u003e\nGoogle founders Sergey Brin and Larry Page while they were graduate students.",
"cacheId": "RKV6ZEmHrjUJ",
"mime": "application/pdf",
"fileFormat": "PDF/Adobe Acrobat",
"formattedUrl": "stanford.edu/class/math51/PageRank.pdf",
"htmlFormattedUrl": "stanford.edu/class/math51/\u003cb\u003ePageRank\u003c/b\u003e.pdf",
"pagemap": {
"metatags": [
{
"producer": "pdfTeX-1.40.13",
"creator": "TeX",
"creationdate": "D:20130604152429-07'00'",
"moddate": "D:20130604152429-07'00'",
"fullbanner": "This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012) kpathsea version 6.1.0"
}
]
}
},
Processed file:
{u'snippet': u'Distributed Computing Seminar. Lecture 5: Graph Algorithms & PageRank. \nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.', u'title': u'Lecture slides on PageRank'} {u'snippet': u'Jan 29, 1998 ... We compare PageRank to an idealized random Web surfer. We show how to ... \nThis ranking, called PageRank, helps search engines and.', u'title': u'The PageRank Citation Ranking: Bringing Order to the Web January ...'} {u'snippet': u"Google's method1 is called the PageRank algorithm and was developed by \nGoogle founders Sergey Brin and Larry Page while they were graduate students.", u'title': u'MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES ...'}
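I want to split this into three separate files (.txt or .json),
each one starting with {u'snippet' ... '}.
I am trying to do this so that I can run a text-comparison process on the pieces.
P.S.: I have already cut the data down to the only parts I need, the title and snippet fields, so I may have lost the JSON format along the way.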
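The extraction step described in the P.S. is probably where the JSON format was lost: printing a Python dict writes its repr (the u'...' form above), not JSON. Below is a minimal sketch of keeping only the title and snippet while staying in JSON. The helper name `extract_title_and_snippet` and the shortened `response` dict are assumptions made for illustration; the real input would be the full parsed API response.

```python
import json


def extract_title_and_snippet(response):
    """Keep only the title and snippet of each entry under "items"."""
    return [
        {"title": item.get("title"), "snippet": item.get("snippet")}
        for item in response.get("items", [])
    ]


# Shortened, hypothetical stand-in for the parsed API response shown above.
response = {
    "kind": "customsearch#search",
    "items": [
        {"title": "Lecture slides on PageRank",
         "snippet": "Distributed Computing Seminar.",
         "displayLink": "www.cs.utexas.edu"},
        {"title": "MATH 51 LECTURE NOTES",
         "snippet": "Google's method1 is called the PageRank algorithm",
         "displayLink": "stanford.edu"},
    ],
}

records = extract_title_and_snippet(response)

# str() produces the Python repr of the dict, which is not JSON.
as_repr = str(records[0])
# json.dumps() produces text that json.loads() can read back.
as_json = json.dumps(records[0], sort_keys=True)
```

Writing `as_json` rather than `as_repr` to disk keeps every later step able to use the `json` module directly.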
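【Comments】:
-
Can you share what you have tried?
-
Split into smaller parts according to what criterion?
-
@utdemir I have tried reading docs.python.org/2/library/… but could not find a solution there.
-
@Jasper The above is a small sample of my data. I want to split it at the points where the pattern starts repeating. In this example, everything from one {u'snippet': ... to its end goes into one text file, the next {u'snippet': ... goes into another file, and so on. I expect about 70 files.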
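One way to get one file per record, sketched under the assumption that the records are still available as a list of Python dicts (before they were printed). The function name `write_one_file_per_record`, the `result_NNN.json` naming scheme, and the sample `records` list are all hypothetical.

```python
import json
import os
import tempfile


def write_one_file_per_record(records, out_dir, prefix="result"):
    """Write each record to its own numbered .json file and return the paths."""
    paths = []
    for index, record in enumerate(records, start=1):
        # Zero-padded numbers keep the files sorted in listing order.
        path = os.path.join(out_dir, "{0}_{1:03d}.json".format(prefix, index))
        with open(path, "w") as handle:
            json.dump(record, handle, indent=2)
        paths.append(path)
    return paths


# Hypothetical sample data; the real list would hold about 70 records.
records = [
    {"title": "first title", "snippet": "first snippet"},
    {"title": "second title", "snippet": "second snippet"},
    {"title": "third title", "snippet": "third snippet"},
]

out_dir = tempfile.mkdtemp()  # throwaway directory for the demonstration
paths = write_one_file_per_record(records, out_dir)
```

Each resulting file holds exactly one JSON object, so a later comparison step can read it back with `json.load`.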
-
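Please double-check the exact format of your input.
Something like {...} {...} {...} is not valid JSON; you should see something like [{...}, {...}, {...}] or {"first": {...}, "second": {...}, "third": {...}} instead, i.e. a list or an object at the top level.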
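The point about the top level can be checked directly. The sketch below uses made-up sample strings and two hypothetical helpers: `is_valid_json` shows that `json.loads` rejects back-to-back objects, and `iter_concatenated_objects` reads such a stream one object at a time with `json.JSONDecoder.raw_decode`. Note that this only applies to real JSON text; it would not parse the u'...' Python repr shown in the question.

```python
import json

# Made-up sample inputs for illustration.
concatenated = '{"title": "a"} {"title": "b"} {"title": "c"}'
as_list = '[{"title": "a"}, {"title": "b"}, {"title": "c"}]'


def is_valid_json(text):
    """Return True if the whole text parses as a single JSON document."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def iter_concatenated_objects(text):
    """Yield each JSON value from a string of whitespace-separated values."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        if text[position].isspace():
            position += 1
            continue
        value, position = decoder.raw_decode(text, position)
        yield value


objects = list(iter_concatenated_objects(concatenated))
```

Here `is_valid_json(concatenated)` is false while `is_valid_json(as_list)` is true, and `objects` holds the same three dicts that parsing `as_list` would give.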
Tags: python json python-2.7