【Posted】: 2020-08-28 20:12:19
【Question】:
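I have just started learning web scraping with Python 3 and am trying to apply it to a small project that extracts data from job listings. I did search for an answer and found a few questions on similar topics, but none of them seems to cover exactly the same use case, at least as far as I can tell.
I extracted the company URLs from the site's search results and appended them to a list named sitelis. I then iterate over sitelis to extract the json data from each company URL. However, retrieving the json data fails for some of the company URLs (see the traceback: json.decoder.JSONDecodeError: Invalid \escape), while most URLs work fine. Any idea what causes this? I am a bit lost, because about 90% of the URLs work and I cannot find any difference that would explain the few that fail to parse.
Any help is much appreciated!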
Here is an example of such an error. This is the traceback:
Traceback (most recent call last):
  File "glassdoor_json.py", line 117, in <module>
    company_js = json.loads(company_jdata.text, strict=False) # to get a Python list
  File "/Users/spw/anaconda3/lib/python3.7/json/__init__.py", line 361, in loads
    return cls(**kw).decode(s)
  File "/Users/spw/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/spw/anaconda3/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 54 column 92 (char 2146)
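For reference, the same message can apparently be reproduced without any scraping. The sketch below uses a made-up one-line payload (an assumption for illustration, not the real page content) containing a backslash followed by a character that JSON does not define as an escape. It also suggests that strict=False does not help here, since that flag only relaxes the rule about raw control characters inside strings.

```python
import json

# Hypothetical payload: the JSON text contains the two characters \* inside a string.
# JSON only defines the escapes \" \\ \/ \b \f \n \r \t and \uXXXX.
raw = '{"description": "Jede\\*r bei STORMING"}'

try:
    json.loads(raw, strict=False)  # strict=False only permits raw control characters
except json.JSONDecodeError as err:
    error_message = err.msg
    print("decode failed:", error_message)
else:
    error_message = None
    print("decoded without error")
```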
This is the code of the for loop:
for site in sitelis:
    count_site = count_site + 1
    try:
        companyrstarturl = Request(site, headers={'User-Agent': 'Mozilla/5.0'})
        fhand_company = urllib.request.urlopen(companyrstarturl, context=ctx)
        companydata = fhand_company.read()
        company_soup = BeautifulSoup(companydata, 'lxml')
        company_jdata = company_soup.select("[type='application/ld+json']")[0]
        company_js = json.loads(company_jdata.text, strict=False) # to get a Python list
        print('')
        print('>>>>>>>>>>>>>>> (json) Company', count_site, '<<<<<<<<<<<<<<<' )
        print('')
        print(json.dumps(company_js, indent=4))
        print('')
    except KeyboardInterrupt:
        print('')
        print('(2) Program interrupted by user...')
        break
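One way to narrow the problem down might be to catch json.JSONDecodeError for each site and print the text around the reported position, so that a single bad page does not stop the loop. This is only a sketch: try_parse and its label argument are hypothetical names, and the sample strings stand in for company_jdata.text.

```python
import json

def try_parse(raw, label):
    """Return the decoded object, or None after printing where decoding failed."""
    try:
        return json.loads(raw, strict=False)
    except json.JSONDecodeError as err:
        # err.pos is the character offset reported as "(char N)" in the traceback.
        start = max(err.pos - 20, 0)
        print(f"{label}: {err.msg} at line {err.lineno}, column {err.colno}")
        print("  context:", repr(raw[start:err.pos + 20]))
        return None

# Hypothetical stand-ins for the text of two scraped <script> tags.
good = try_parse('{"title": "Back-End Node Developer"}', "site 1")
bad = try_parse('{"description": "Mitarbeitern\\*innen"}', "site 2")
```

Inside the loop, the call would presumably take company_jdata.text and count_site in place of the sample strings.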
Here is the relevant json data from the company page (the one that produces the traceback):
{
"@context": "http://schema.org",
"@type": "JobPosting",
"title": "Back-End Developer (m/w/d)",
"url": "https://www.glassdoor.de/job-listing/back-end-developer-mwd-storming-creative-studios-JV_IC2561561_KO0,22_KE23,48.htm?jl=3655406413",
"datePosted": "2020-08-25",
"employmentType": "FULL_TIME",
"salaryCurrency": "EUR",
"validThrough": "2020-09-26",
"hiringOrganization": {
"@type": "Organization",
"name": "STORMING GmbH Creative Studios"
},
"jobLocation": {
"@type": "Place",
"address": {
"@type": "PostalAddress",
"addressLocality": "Leonberg",
"addressRegion": "01",
"addressCountry": {
"@type" : "Country",
"name" : "DE"
}
}
,
"geo": {
"@type": "GeoCoordinates",
"latitude": "48.8005",
"longitude": "9.0168"
}
}
,"description": "In den STORMING Creative Studios wird Kommunikation neu gedacht: In f&uuml;nf einzigartigen Studios b&uuml;ndeln wir unsere Kompetenzen und erschaffen innovative Kommunikationsdienstleistungen. Wir blicken auf beeindruckendes Wachstum zur&uuml;ck und schauen in eine ambitionierte Zukunft. Werde jetzt Teil des Teams.
<br/><br/>
In unserem Development Studio entstehen innovative Webseiten, durchdachte Apps und hilfreiche Software. Dabei bieten wir unseren Kunden immer die neuesten Technologien und zielf&uuml;hrendsten L&ouml;sungen. F&uuml;r unser Team suchen wir daher Back-End Developer f&uuml;r folgende Aufgaben:
<ul>
<li>Arbeit an Unternehmenssoftware zur Digitalisierung von Prozessen</li>
<li>Anpassungen an CMS-Backends</li>
<li>Erstellung von Konfiguratoren</li>
<li>Enge Zusammenarbeit mit Front-End Developern und Projektleitungen</li>
</ul>
Uns ist wichtig, dass wir uns aufeinander verlassen und uns vertrauen k&ouml;nnen. Jede\*r bei STORMING ist ein wichtiger Teil des Unternehmens, tr&auml;gt Verantwortung und unterst&uuml;tzt aktiv unser Wachstum. Aus diesem Grund suchen wir nach loyalen Mitarbeitern\*innen mit hoher Motivation. Dar&uuml;ber hinaus ist uns folgendes wichtig:
<ul>
<li>Hervorragende Kenntnisse in PHP, Javascript &amp; SQL</li>
<li>Kenntnisse in Pythan, Objective-C/Swift &amp; Java von Vorteil</li>
<li>Berufserfahrung</li>
<li>Zuverl&auml;ssigkeit</li>
</ul>
F&uuml;r uns sind faire Bezahlung und geldwerte Vorteile eine Selbstverst&auml;ndlichkeit. Doch auch dar&uuml;ber hinaus ist unser Ziel, einen Ort zu schaffen, an dem Menschen sich gern aufhalten und sie selbst sein k&ouml;nnen.Zusammengefasst bieten wir dir:
<ul>
<li>gute Work-Life-Balance</li>
<li>faire Bezahlung &amp; geldwerte Vorteile</li>
<li>moderne Ausstattung &amp; firmeneigene Parkpl&auml;tze</li>
<li>flache Hierarchien &amp; kurze Entscheidungswege</li>
</ul>
Interesse? Dann bewirb dich jetzt per Mail mit deinem Lebenslauf und Portfolio. Wir freuen uns auf dich!
<br/><br/>
Art der Stelle: Vollzeit"
}
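The description above contains the sequences Jede\*r and Mitarbeitern\*innen, and \* is not an escape that JSON defines, which would be consistent with the Invalid \escape message. A possible workaround, sketched under that assumption, is to double every backslash that does not start a valid escape before calling json.loads. The helper names escape_stray_backslashes and loads_lenient are hypothetical, and the sample is a shortened stand-in for the real description.

```python
import json
import re

# Matches a valid JSON escape (kept as-is, captured in group 1) or a lone backslash.
_ESCAPE_RE = re.compile(r'\\(["\\/bfnrt]|u[0-9a-fA-F]{4})|\\')

def escape_stray_backslashes(raw):
    """Double each backslash that does not begin a valid JSON escape sequence."""
    def fix(match):
        if match.group(1) is not None:
            return match.group(0)  # valid escape: leave untouched
        return '\\\\'              # stray backslash: emit two backslashes
    return _ESCAPE_RE.sub(fix, raw)

def loads_lenient(raw):
    """Decode JSON text after repairing stray backslashes; raw newlines are allowed."""
    return json.loads(escape_stray_backslashes(raw), strict=False)

# Shortened stand-in for the failing payload: one stray \* and one raw newline.
sample = '{"description": "Jede\\*r bei STORMING\nist ein wichtiger Teil"}'
job = loads_lenient(sample)
print(job["description"])
```

Doubling keeps the backslash in the decoded text; if it is only a leftover from Markdown-style escaping, removing it instead may be the better choice.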
Here is an example of json that works fine (from a company page):
{
"@context": "http://schema.org",
"@type": "JobPosting",
"title": "Back-End Node Developer",
"url": "https://www.glassdoor.de/job-listing/back-end-node-developer-ust-global-JV_IC2622109_KO0,23_KE24,34.htm?jl=3615685703",
"datePosted": "2020-08-28",
"employmentType": "FULL_TIME",
"salaryCurrency": "EUR",
"validThrough": "2020-09-27",
"industry": "Information Technology",
"hiringOrganization": {
"@type": "Organization",
"name": "UST Global",
"logo": "https://media.glassdoor.com/sqll/155577/ust-global-squarelogo-1579115891630.png",
"sameAs": "www.ust-global.com"
},
"jobLocation": {
"@type": "Place",
"address": {
"@type": "PostalAddress",
"addressLocality": "Berlin",
"addressRegion": "16",
"addressCountry": {
"@type" : "Country",
"name" : "DE"
}
}
,
"geo": {
"@type": "GeoCoordinates",
"latitude": "52.5177",
"longitude": "13.4055"
}
}
,"occupationalCategory" : ["15-1132.00", "Software Developers, Applications"]
,"description": "<p>UST Global is increasing its International Digital &amp; Innovation Hub in Berlin in partnership model with one of our Fortune 500 clients, to deliver new digital solutions in more than 60 countries as part of their business transformation model.</p>
<p>The Hub team leads the end-to-end process of creating new capabilities, products and platforms, applying best practices, top technology trends and agile techniques.</p>
<p>As part of our Digital Hub based in Berlin, you will have the opportunity to work in a multicultural and highly dynamic environment. You will have the chance to live the first steps of this international, highly skilled, success-oriented team.</p>
<p><strong>Key responsibilities</strong></p>
<p>As part of our digital squads you will work on state-of-the-art technologies to design and create products for creating new business models, that will transform the way of interacting between people and enterprises:</p>
<ul>
<li>E2E responsibility for building digital products and solutions.</li>
<li>Build AI solutions.</li>
<li>Apply architecture principles and development standards.</li>
<li>Work closely with other technical teams undertaking product development coordination and delivery.</li>
</ul>
<p><strong>Basic qualifications:</strong></p>
<ul>
<li>You are experienced in JavaScript/TypeScript</li>
<li>You are familiar with building scalable applications and services with Node.js</li>
<li>You have knowledge in relational databases like Postgres</li>
<li>Monitoring with ELK (Elastic Search, LogStash, Kibana)</li>
<li>Use of best practices in clean code, testing and code review.</li>
<li>Understanding on quality documentation and diagrams.</li>
<li>Experience working with Agile principles and best practices.</li>
<li>Good time management skills.</li>
<li>Real passion of coding and technology</li>
<li>A degree in computer science, or similar professional certifications.</li>
<li>Fluent English and excellent communication skills.</li>
<li>Used to work in multinational projects.</li>
</ul>
<p><strong>Desirable qualifications:</strong></p>
<ul>
<li>Experience implementing and managing CI/CD solutions.</li>
<li>Experience under some of the following frameworks: Angular or React.</li>
<li>Knowledge of Kafka is a plus</li>
</ul>
<p><strong>Suitable candidates:</strong></p>
<ul>
<li>German passport holders</li>
<li>German valid working visa</li>
<li>European Union passport holders</li>
</ul>
<p><strong> </strong><strong>Who we are:</strong></p>
<p>We are a multinational digital company with over 20.000 employees all over the world and presence in more than 25 countries.</p>
<p>We transform lives with our human centered innovative solutions, touching 3 billion &ldquo;personas&rdquo; through digital solutions and technologies.</p>
<p>UST Global is a Great Place to Work&reg; and Top Employer&reg; certified company.</p>
<p>For further details please go to www.ust-global.com</p>
<p><strong>What we offer:</strong></p>
<ul>
<li>Competitive compensation package and benefits.</li>
<li>Flexible Payment Plan so you can adapt your salary according to your preferences (child care checks, transport card, online German and English lessons with native teachers, health insurance&hellip;).</li>
<li>25 working days of holidays.</li>
<li>Free breakfast, food and drinks.</li>
<li>Team activities like barbecues, game nights, team events and much more.</li>
<li>Professional career in our Center of Excellence where you could participate on several projects inside the company.</li>
<li>International environment and close contact with colleagues specialized in the core technologies of the company, with whom you will share your knowledge.</li>
<li>We have an (internal) program to compensate referrals from which you can benefit when you refer professionals that get on the company.</li>
</ul>
<p>If you want to know more, don&rsquo;t hesitate to apply and we&rsquo;ll get in touch with you to give more details about the offer. If you are a digital native, this is an amazing opportunity to join one of the leading initiatives in Central Europe, Berlin.</p>"
}
【Discussion】:
Tags: json python-3.x web-scraping