scrapy
Installation
sudo apt-get install gcc python-virtualenv python-dev libxml2-dev libxslt-dev
pip install Scrapy
Crawling
scrapy crawl douban_book
Arguments
scrapy crawl myspider -a category=electronics -a domain=system
Spiders receive arguments in their constructors:
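# BaseSpider is the old (Scrapy 0.x) name; newer releases use scrapy.Spider.
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, category='', domain=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Build the start URL from the -a category=... argument.
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        self.domain = domain
        ...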
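As a rough illustration of how the -a NAME=VALUE pairs end up as constructor keyword arguments, here is a minimal sketch that does not use Scrapy itself. FakeSpider and parse_spider_args are hypothetical names made up for this example, and Scrapy's real command-line handling is more involved than this.

```python
class FakeSpider(object):
    """Hypothetical stand-in that only mirrors the constructor shown above."""

    def __init__(self, category='', domain=None):
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        self.domain = domain


def parse_spider_args(argv):
    """Collect the NAME=VALUE pair that follows each -a flag into a dict."""
    kwargs = {}
    i = 0
    while i < len(argv):
        if argv[i] == '-a' and i + 1 < len(argv):
            # Split on the first '=' only, so values may contain '=' themselves.
            name, _, value = argv[i + 1].partition('=')
            kwargs[name] = value
            i += 2
        else:
            i += 1
    return kwargs


# Same arguments as in: scrapy crawl myspider -a category=electronics -a domain=system
argv = ['-a', 'category=electronics', '-a', 'domain=system']
kwargs = parse_spider_args(argv)
spider = FakeSpider(**kwargs)

print(kwargs)
print(spider.start_urls)
print(spider.domain)
```

Values passed with -a always arrive as strings, so a spider that needs a number or a list has to convert the value itself.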
selector
Sample
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br />
    <img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br />
    <img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br />
    <img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br />
    <img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br />
    <img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
xpath
Select attributes with @
Select a node's text with text()
Select a node's href attribute with @href
css
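Select text and attributes with :: pseudo-elements (a Scrapy extension, not standard CSS)
Select a node's text with ::text
Select a node's href attribute with ::attr(href)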
xpath
sel.xpath(
sel.xpath(
sel.xpath(
sel.xpath(
[u'image1.html', u'image2.html', u'image3.html',
u'image4.html', u'image5.html']
sel.xpath(
[u'image1_thumb.jpg', u'image2_thumb.jpg', u'image3_thumb.jpg',
u'image4_thumb.jpg', u'image5_thumb.jpg']
css
sel.css('title::text').extract()
sel.css('base::attr(href)').extract()
sel.css('a[href*=image]::attr(href)').extract()
[u'image1.html', u'image2.html', u'image3.html', u'image4.html', u'image5.html']
sel.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg', u'image2_thumb.jpg', u'image3_thumb.jpg',
u'image4_thumb.jpg', u'image5_thumb.jpg']