Scrapy分布式爬虫打造搜索引擎- (四)通过CrawlSpider对拉勾网进行整站爬取

四、通过CrawlSpider对招聘网站拉钩网进行整站爬取

使用CrawlSpider对于拉勾网进行整站爬取。
附带源码解读,数据库建表与爬取后将数据存入数据库等内容。

推荐工具cmder
http://cmder.net/
下载full版本,使我们在windows环境下也可以使用linux部分命令。
配置path环境变量

1. 设计拉勾网的数据表结构

拉勾网数据库表设计

2. 初始化拉钩网项目并解读crawl源码

scrapy genspider --list
查看可使用的初始化模板
ailable templates:

  • basic
  • crawl
  • csvfeed
  • xmlfeed
1
scrapy genspider -t crawl lagou www.lagou.com

cmd与pycharm不同,mark root
setting.py 设置目录
crawl模板

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
class LagouSpider(CrawlSpider):
name = 'lagou'
allowed_domains = ['www.lagou.com']
start_urls = ['http://www.lagou.com/']

rules = (
Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
)

def parse_item(self, response):
i = {}
#i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
#i['name'] = response.xpath('//div[@id="name"]').extract()
#i['description'] = response.xpath('//div[@id="description"]').extract()
return i

源码阅读剖析
https://doc.scrapy.org/en/1.3/topics/spiders.html#crawlspider

提供了一些可以让我们进行简单的follow的规则,link,迭代爬取

rules:

规则,crawel spider读取并执行

parse_start_url(response):

example:

rules是一个可迭代对象,里面有Rule实例->LinkExtractor的分析
allow=('category\.php', ), callback='parse_item',
allow允许的url模式。callback,要回调的函数名。
因为rules里面没有self,无法获取到方法。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com']

rules = (
# Extract links matching 'category.php' (but not matching 'subsection.php')
# and follow links from them (since no callback means follow=True by default).
Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

# Extract links matching 'item.php' and parse them with the spider's method parse_item
Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
)

def parse_item(self, response):
self.logger.info('Hi, this is an item page! %s', response.url)
item = scrapy.Item()
item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
return item

分析拉勾网模板代码

  1. 将http加上s
  2. 重命名parse_item为我们自定义的parse_job
  3. 点击class LagouSpider(CrawlSpider):的CrawlSpider,进入crawl源码
  4. class CrawlSpider(Spider):可以看出它继承于spider
  5. 入口:def start_requests(self):
  6. alt+左右方向键,不同代码跳转
  7. 5->之后默认parse CrawlSpider里面有parse函数。但是这次我们不能向以前一样覆盖

Crawl.py核心函数parse。

parse函数调用_parse_response

1
2
def parse(self, response):
return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

_parse_response

  1. 判断是否有callback即有没有self.parse_start_url
  2. 我们可以重载parse_start_url加入自己的处理
  3. 把参数传递给函数,并调用process_results函数

_parse_response函数

1
2
3
4
5
6
7
8
9
10
def _parse_response(self, response, callback, cb_kwargs, follow=True):
if callback:
cb_res = callback(response, **cb_kwargs) or ()
cb_res = self.process_results(response, cb_res)
for requests_or_item in iterate_spider_output(cb_res):
yield requests_or_item

if follow and self._follow_links:
for request_or_item in self._requests_to_follow(response):
yield request_or_item

parse_start_url的return值将会被process_results方法接收处理
如果不重写,因为返回为空,然后就相当于什么都没做

1
2
def process_results(self, response, results):
return results

点击followlink

1
2
3
def set_crawler(self, crawler):
super(CrawlSpider, self).set_crawler(crawler)
self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

如果setting中有这个参数,则可以进一步执行到parse

_requests_to_follow

  1. 判断传入的是不是response,如果不是直接returns
  2. 针对当前response设置一个空set,去重
  3. 把self的rules通过enumerate变成一个可迭代对象
  4. 跳转rules详情
  5. 拿到link通过link_extractor.extract_links抽取出具体的link
  6. 执行我们的process_links
  7. link制作完成发起Request,回调_response_downloaded函数
  8. 然后执行parse_respose
1
2
3
4
5
6
7
8
9
10
11
12
13
14
def _requests_to_follow(self, response):
if not isinstance(response, HtmlResponse):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = Request(url=link.url, callback=self._response_downloaded)
r.meta.update(rule=n, link_text=link.text)
yield rule.process_request(r)

_compile_rules

  1. 在我们初始化时会调用_compile_rules
  2. copy.copy(r) for r in self.rules]将我们的rules进行一个copy
  3. 调用回调函数get_method。
  4. 调用rules里面我们定义的process_links
  5. 调用rules里面我们定义的process_request
1
2
3
4
5
6
7
8
9
10
11
12
def _compile_rules(self):
def get_method(method):
if callable(method):
return method
elif isinstance(method, six.string_types):
return getattr(self, method, None)

self._rules = [copy.copy(r) for r in self.rules]
for rule in self._rules:
rule.callback = get_method(rule.callback)
rule.process_links = get_method(rule.process_links)
rule.process_request = get_method(rule.process_request)
self.process_links = process_links
self.process_request = process_request

可以通过在rules里面传入我们自己的处理函数,实现对url的自定义。
达到负载均衡,多地不同ip访问。

_response_downloaded
通过rule取到具体的rule
调用我们自己的回调函数

1
2
3
def _response_downloaded(self, response):
rule = self._rules[response.meta['rule']]
return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
  • allow :符合这个url我就爬取
  • deny : 符合这个url规则我就放弃
  • allow_domin : 这个域名下的我才处理
  • allow_domin : 这个域名下的我不处理
  • restrict_xpaths:进一步限定xpath
1
2
3
self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
tags=('a', 'area'), attrs=('href',), canonicalize=True,
unique=True, process_value=None, deny_extensions=None, restrict_css=()

extract_links
如果有restrict_xpaths,他会进行读取执行

1
2
3
4
5
6
7
8
9
10
11
12
13
def extract_links(self, response):
base_url = get_base_url(response)
if self.restrict_xpaths:
docs = [subdoc
for x in self.restrict_xpaths
for subdoc in response.xpath(x)]
else:
docs = [response.selector]
all_links = []
for doc in docs:
links = self._extract_links(doc, response.url, response.encoding, base_url)
all_links.extend(self._process_links(links))
return unique_list(all_links)

get_base_url:

urllib.parse.urljoin替我们拼接好url

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def get_base_url(text, baseurl='', encoding='utf-8'):
"""Return the base url if declared in the given HTML `text`,
relative to the given base url.

If no base url is found, the given `baseurl` is returned.

"""

text = to_unicode(text, encoding)
m = _baseurl_re.search(text)
if m:
return moves.urllib.parse.urljoin(
safe_url_string(baseurl),
safe_url_string(m.group(1), encoding=encoding)
)
else:
return safe_url_string(baseurl)

编写rule规则

1
2
3
4
5
rules = (
Rule(LinkExtractor(allow=("zhaopin/.*",)), follow=True),
Rule(LinkExtractor(allow=("gongsi/j\d+.html",)), follow=True),
Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
)

3. 设计lagou的items

需要用到的方法

1
2
3
4
5
6
7
8
9
from w3lib.html import remove_tags
def remove_splash(value):
#去掉工作城市的斜线
return value.replace("/","")

def handle_jobaddr(value):
addr_list = value.split("\n")
addr_list = [item.strip() for item in addr_list if item.strip()!="查看地图"]
return "".join(addr_list)

定义好的item

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
class LagouJobItem(scrapy.Item):
#拉勾网职位信息
title = scrapy.Field()
url = scrapy.Field()
url_object_id = scrapy.Field()
salary = scrapy.Field()
job_city = scrapy.Field(
input_processor=MapCompose(remove_splash),
)
work_years = scrapy.Field(
input_processor = MapCompose(remove_splash),
)
degree_need = scrapy.Field(
input_processor = MapCompose(remove_splash),
)
job_type = scrapy.Field()
publish_time = scrapy.Field()
job_advantage = scrapy.Field()
job_desc = scrapy.Field()
job_addr = scrapy.Field(
input_processor=MapCompose(remove_tags, handle_jobaddr),
)
company_name = scrapy.Field()
company_url = scrapy.Field()
tags = scrapy.Field(
input_processor = Join(",")
)
crawl_time = scrapy.Field()

重写的itemloader
设置默认只提取第一个

1
2
3
class LagouJobItemLoader(ItemLoader):
#自定义itemloader
default_output_processor = TakeFirst()

4. 提取字段值并存入数据库

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def parse_job(self, response):

#解析拉勾网的职位
item_loader = LagouJobItemLoader(item=LagouJobItem(), response=response)
item_loader.add_css("title", ".job-name::attr(title)")
item_loader.add_value("url", response.url)
item_loader.add_value("url_object_id", get_md5(response.url))
item_loader.add_css("salary", ".job_request .salary::text")
item_loader.add_xpath("job_city", "//*[@class='job_request']/p/span[2]/text()")
item_loader.add_xpath("work_years", "//*[@class='job_request']/p/span[3]/text()")
item_loader.add_xpath("degree_need", "//*[@class='job_request']/p/span[4]/text()")
item_loader.add_xpath("job_type", "//*[@class='job_request']/p/span[5]/text()")

item_loader.add_css("tags", '.position-label li::text')
item_loader.add_css("publish_time", ".publish_time::text")
item_loader.add_css("job_advantage", ".job-advantage p::text")
item_loader.add_css("job_desc", ".job_bt div")
item_loader.add_css("job_addr", ".work_addr")
item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
item_loader.add_css("company_url", "#job_company dt a::attr(href)")
item_loader.add_value("crawl_time", datetime.now())

job_item = item_loader.load_item()

return job_item

获得的拉勾网item数据
拉勾网item数据

5. items中添加get_insert_sql实现存入数据库

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def get_insert_sql(self):
insert_sql = """
insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
job_type, publish_time, job_advantage, job_desc, job_addr, company_name, company_url,
tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
"""
params = (
self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
self["work_years"], self["degree_need"], self["job_type"],
self["publish_time"], self["job_advantage"], self["job_desc"],
self["job_addr"], self["company_name"], self["company_url"],
self["job_addr"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
)

return insert_sql, params
-------------本文结束感谢您的阅读-------------

本文标题:Scrapy分布式爬虫打造搜索引擎- (四)通过CrawlSpider对拉勾网进行整站爬取

文章作者:mtianyan

发布时间:2017年07月01日 - 00:07

最后更新:2018年02月02日 - 20:02

原始链接:http://blog.mtianyan.cn/post/d083b798.html

许可协议: 署名-非商业性使用-禁止演绎 4.0 国际 转载请保留原文链接及作者。

请博主吃包辣条
0%