It's been a while since I posted anything, so with some free time on my hands I put together a little project for fun O(∩_∩)O haha~
Anyone who has learned the Scrapy framework already knows how to create and run a project, so let me start by showing the project I created:
scrapy startproject pachong2          # create a project named pachong2
cd pachong2                           # enter the project directory
scrapy genspider aijia wx.5i5j.com    # generate a spider named aijia
This generates the following files.
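For reference, a freshly generated project typically looks like the sketch below (the exact layout can vary slightly between Scrapy versions):

pachong2/
    scrapy.cfg
    pachong2/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            aijia.py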
Open the 5i5j (I Love My Home) homepage and go to the second-hand housing page at https://wx.5i5j.com/ershoufang; now we can start writing the spider.
In the generated aijia spider, some URLs have already been filled in automatically:
import scrapy


class AijiaSpider(scrapy.Spider):
    name = 'aijia'
    allowed_domains = ['www.5i5j.com']
    start_urls = ['http://www.5i5j.com/']

    def parse(self, response):
        pass
Since we are crawling second-hand listings, the start_urls entry has to change. Looking at the site's pagination, the pages are n2, n3, and so on, so we only need to change the number after n to paginate. A list comprehension gives us the URLs:
start_urls = ['https://wx.5i5j.com/ershoufang/n' + str(x) for x in range(1,6)]
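As a quick sanity check (runnable in any Python shell), the comprehension expands to the five listing-page URLs:

>>> ['https://wx.5i5j.com/ershoufang/n' + str(x) for x in range(1, 6)]
['https://wx.5i5j.com/ershoufang/n1', 'https://wx.5i5j.com/ershoufang/n2', 'https://wx.5i5j.com/ershoufang/n3', 'https://wx.5i5j.com/ershoufang/n4', 'https://wx.5i5j.com/ershoufang/n5']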
We need to define the fields to scrape in items.py, so the item definition is:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Pachong2Item(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()       # listing title
    danjia = scrapy.Field()      # unit price
    zongjia = scrapy.Field()     # total price
    house = scrapy.Field()       # layout
    floor = scrapy.Field()       # floor
    area = scrapy.Field()        # floor area
    renovation = scrapy.Field()  # renovation
    chaoxiang = scrapy.Field()   # orientation
    xiaoqu = scrapy.Field()      # residential community
    jingjiren = scrapy.Field()   # agent
    date = scrapy.Field()        # listing date
Now let's write the spider itself. Press F12, locate the element that contains the full listing list, right-click it and choose Copy XPath.
Define an overall list titles from the copied XPath, then loop over it so we get every listing on the page:
titles = response.xpath('/html/body/div[6]/div[1]/div[2]/ul/li')
for bt in titles:
    item = Pachong2Item()
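Before wiring the XPath into parse(), it can be checked interactively with scrapy shell. A rough sketch (the site may still need the request headers we configure later in settings.py to respond normally):

scrapy shell "https://wx.5i5j.com/ershoufang/n1"
>>> titles = response.xpath('/html/body/div[6]/div[1]/div[2]/ul/li')
>>> len(titles)                                             # number of listings found on the page
>>> titles[0].xpath('div[2]/h3/a/text()').extract_first()   # title of the first listing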
After that we could scrape the key information straight from the list page, but we can also click through to each listing's detail page.
So we need to extract the URL of each detail page.
Using scrapy.Request to request that URL builds a Request object. Through the meta parameter we assign the item dict to the 'item' key of the meta dict, i.e. meta={'item': item}; this meta dict travels inside the Request object and is handed over to the detail() callback:
def parse(self, response):
    titles = response.xpath('/html/body/div[6]/div[1]/div[2]/ul/li')
    for bt in titles:
        item = Pachong2Item()
        url = 'https://wx.5i5j.com' + bt.xpath('div[2]/h3/a/@href').extract()[0]  # detail page URL
        yield scrapy.Request(url, meta={'item': item}, callback=self.detail)
Once on the detail page, we just pull out the fields we need one by one:
def detail(self, response):
    # this response carries the meta dict from parse(); assigning it back completes the hand-off,
    # so this item is the same one that was built in parse()
    item = response.meta['item']
    item['house'] = response.xpath('//*[@class="house-infor clear"]/div[1]/p[1]/text()').extract_first()   # layout
    item['floor'] = response.xpath('//*[@class="house-infor clear"]/div[1]/p[2]/text()').extract_first()   # floor
    item['area'] = response.xpath('//*[@class="house-infor clear"]/div[2]/p[1]/text()').extract_first()    # area
    item['area'] = item['area'] + response.xpath('//*[@class="house-infor clear"]/div[2]/p[1]/span/text()').extract_first()
    item['renovation'] = response.xpath('//*[@class="house-infor clear"]/div[2]/p[2]/text()').extract_first()  # renovation
    item['chaoxiang'] = response.xpath('//*[@class="house-infor clear"]/div[3]/p/text()').extract_first()  # orientation
    item['xiaoqu'] = response.xpath('//*[@class = "zushous"]/ul/li[1]/a/text()').extract_first()           # residential community
    item['jingjiren'] = response.xpath('/html/body/div[5]/div[2]/div[2]/div[3]/ul/li[2]/h3/a/text()').extract_first()  # agent
    item['date'] = response.xpath('/html/body/div[5]/div[3]/div[3]/div[1]/div/div/ul/li[3]/span/text()').extract_first()  # listing date
    yield item
Of course we can't just crawl blindly like this; we need a few counter-measures against anti-crawling. So we add an IP proxy pool and a User-Agent pool, configured in settings.py and middlewares.py respectively.
The changes needed to set up the IP proxy are as follows.
Manually add an IP pool in settings.py; proxy IPs can be picked from https://www.xicidaili.com/:
IPPOOL = [
    {"ipaddr": "https://113.12.202.50:40498"},
    {"ipaddr": "https://42.59.85.83:1133"},
    {"ipaddr": "https://60.5.254.169:8081"},
    {"ipaddr": "https://124.237.83.14:53281"},
    {"ipaddr": "https://120.26.208.102:88"}
]
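Free proxies like these die quickly, so it can be worth testing each entry before trusting it. A minimal sketch using the requests library (a separate helper script run from the project root, not part of the Scrapy project; httpbin.org is just a convenient test endpoint):

import requests

from pachong2.settings import IPPOOL

def proxy_alive(proxy, timeout=5):
    # return True if the proxy can fetch a test page within the timeout
    try:
        r = requests.get('https://httpbin.org/ip',
                         proxies={'http': proxy, 'https': proxy},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

for entry in IPPOOL:
    print(entry['ipaddr'], proxy_alive(entry['ipaddr']))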
Then configure DOWNLOADER_MIDDLEWARES in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'pachong2.middlewares.MyproxiesSpiderMiddleware': 125,
}
In middlewares.py, create a class that picks a random IP:
import random
from pachong2.settings import IPPOOL
# downloader middleware that picks a random IP
class MyproxiesSpiderMiddleware(object):
    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        thisip = random.choice(IPPOOL)
        print(">>>>" + thisip["ipaddr"])
        request.meta["proxy"] = thisip["ipaddr"]
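Importing IPPOOL straight from pachong2.settings works, but a middleware can also receive it through the crawler settings, which keeps the class reusable across projects. A sketch of that variant (not what this post uses):

import random

class MyproxiesSpiderMiddleware(object):
    def __init__(self, ippool):
        self.ippool = ippool

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when building the middleware;
        # read IPPOOL out of settings.py via the crawler object
        return cls(crawler.settings.getlist('IPPOOL'))

    def process_request(self, request, spider):
        thisip = random.choice(self.ippool)
        request.meta["proxy"] = thisip["ipaddr"]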
Next, set up a User-Agent pool in settings.py (lists like this are easy to find online):
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
]
Configure DOWNLOADER_MIDDLEWARES in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 543,
    'pachong2.middlewares.RandomUserAgent': 100,
}
Similarly, create a class in middlewares.py that picks a random User-Agent:
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from pachong2.settings import USER_AGENTS
# downloader middleware that picks a random User-Agent
class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # pick one entry at random from USER_AGENTS in settings
        user_agent = random.choice(USER_AGENTS)
        print(">>>" + user_agent)
        request.headers.setdefault('User-Agent', user_agent)
Once the middleware is configured, the data needs to be stored somewhere. We'll start with CSV, so add the following to pipelines.py:
import csv
import os
from pachong2.items import Pachong2Item
from pachong2 import settings


class Pachong2Pipeline(object):
    def process_item(self, item, spider):
        # write woaiwojia.csv next to this file
        store_file = os.path.join(os.path.dirname(__file__), 'woaiwojia.csv')
        self.file = open(store_file, 'a', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)
        print('writing row')
        self.writer.writerow([item['title'], item['danjia'], item['zongjia'], item['house'],
                              item['floor'], item['area'], item['renovation'], item['chaoxiang'],
                              item['xiaoqu'], item['jingjiren'], item['date']])
        return item

    def close_spider(self, spider):
        self.file.close()
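Opening the CSV file inside process_item reopens it once per item. A slightly tidier variant (just a sketch, not the original code) opens it once in open_spider and writes a header row the first time the file is created:

import csv
import os

class Pachong2Pipeline(object):
    def open_spider(self, spider):
        store_file = os.path.join(os.path.dirname(__file__), 'woaiwojia.csv')
        new_file = not os.path.exists(store_file)
        self.file = open(store_file, 'a', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)
        if new_file:
            # header row matching the fields written below
            self.writer.writerow(['title', 'danjia', 'zongjia', 'house', 'floor', 'area',
                                  'renovation', 'chaoxiang', 'xiaoqu', 'jingjiren', 'date'])

    def process_item(self, item, spider):
        self.writer.writerow([item['title'], item['danjia'], item['zongjia'], item['house'],
                              item['floor'], item['area'], item['renovation'], item['chaoxiang'],
                              item['xiaoqu'], item['jingjiren'], item['date']])
        return item

    def close_spider(self, spider):
        self.file.close()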
To store the data in MySQL instead, we again modify pipelines.py. The prerequisite is that MySQL is installed and the database and table have been created, as follows:
create database aijia charset utf8;   # create the aijia database
use aijia;                            # switch to the new database
create table woaiwojia(
    title char(50),
    danjia char(50),
    zongjia char(20),
    house char(20),
    floor char(20),
    area char(20),
    renovation char(20),
    chaoxiang char(20),
    xiaoqu char(50),
    jingjiren char(20),
    date char(50)
);                                    # create the table
select * from woaiwojia;              # view the table contents
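Before pointing the pipeline at MySQL, the connection and table can be checked from a short standalone script (a sketch using the same example host, user and password as in the pipeline below):

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', db='aijia', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute('show tables;')
    print(cursor.fetchall())   # should include ('woaiwojia',)
conn.close()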
Then update the MySQL-related code in pipelines.py as follows:
import pymysql.cursors
from pachong2.items import Pachong2Item
from pachong2 import settings
class Pachong2Pipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('spider started')
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='aijia')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(
                """insert into woaiwojia(title,danjia,zongjia,house,floor,area,
                                         renovation,chaoxiang,xiaoqu,jingjiren,date)
                   values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)""",
                (item['title'], item['danjia'], item['zongjia'], item['house'],
                 item['floor'], item['area'], item['renovation'], item['chaoxiang'],
                 item['xiaoqu'], item['jingjiren'], item['date']))
            self.conn.commit()
        except Exception:
            # undo the partial insert if anything goes wrong
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
That's basically it. Run scrapy crawl aijia in PowerShell and the crawl starts.
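As an aside, for a quick dump without a custom pipeline, Scrapy's built-in feed export can write the scraped items directly to a file, e.g.:

scrapy crawl aijia -o woaiwojia.csv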
CSV output:
MySQL output:
If you use the IP proxy and get "no connection could be made because the target machine actively refused it", check the proxy IP settings.
If the error is code 10060 (the connected party did not respond properly after a period of time), it is usually just a poor network; switch to a better connection.
If it still fails, comment out the IP proxy part and run with only the User-Agent pool; that works too.
The complete files are attached below.
aijia.py:
# -*- coding: utf-8 -*-
import scrapy
from pachong2.items import Pachong2Item
from scrapy.http import Request


class AijiaSpider(scrapy.Spider):
    name = 'aijia'
    allowed_domains = ['wx.5i5j.com']
    start_urls = ['https://wx.5i5j.com/ershoufang/n' + str(x) for x in range(1,6)]

    def parse(self, response):
        titles = response.xpath('/html/body/div[6]/div[1]/div[2]/ul/li')
        for bt in titles:
            item = Pachong2Item()
            item['title'] = bt.xpath('div[2]/h3/a/text()').extract_first()  # listing title
            try:
                item['danjia'] = bt.xpath('div[2]/div/div/p[2]/text()').extract_first()[2:]  # unit price, drop the leading label
            except:
                item['danjia'] = ''
            item['zongjia'] = bt.xpath('div[2]/div/div/p[1]/strong/text()').extract_first()
            item['zongjia'] = item['zongjia'] + bt.xpath('div[2]/div/div/p[1]/text()').extract_first()  # total price
            url = 'https://wx.5i5j.com' + bt.xpath('div[2]/h3/a/@href').extract()[0]  # detail page URL
            yield scrapy.Request(url, meta={'item': item}, callback=self.detail)

    def detail(self, response):
        item = response.meta['item']
        item['house'] = response.xpath('//*[@class="house-infor clear"]/div[1]/p[1]/text()').extract_first()   # layout
        item['floor'] = response.xpath('//*[@class="house-infor clear"]/div[1]/p[2]/text()').extract_first()   # floor
        item['area'] = response.xpath('//*[@class="house-infor clear"]/div[2]/p[1]/text()').extract_first()    # area
        item['area'] = item['area'] + response.xpath('//*[@class="house-infor clear"]/div[2]/p[1]/span/text()').extract_first()
        item['renovation'] = response.xpath('//*[@class="house-infor clear"]/div[2]/p[2]/text()').extract_first()  # renovation
        item['chaoxiang'] = response.xpath('//*[@class="house-infor clear"]/div[3]/p/text()').extract_first()  # orientation
        item['xiaoqu'] = response.xpath('//*[@class = "zushous"]/ul/li[1]/a/text()').extract_first()           # residential community
        item['jingjiren'] = response.xpath('/html/body/div[5]/div[2]/div[2]/div[3]/ul/li[2]/h3/a/text()').extract_first()  # agent
        item['date'] = response.xpath('/html/body/div[5]/div[3]/div[3]/div[1]/div/div/ul/li[3]/span/text()').extract_first()  # listing date
        yield item
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Pachong2Item(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()       # listing title
    danjia = scrapy.Field()      # unit price
    zongjia = scrapy.Field()     # total price
    house = scrapy.Field()       # layout
    floor = scrapy.Field()       # floor
    area = scrapy.Field()        # floor area
    renovation = scrapy.Field()  # renovation
    chaoxiang = scrapy.Field()   # orientation
    xiaoqu = scrapy.Field()      # residential community
    jingjiren = scrapy.Field()   # agent
    date = scrapy.Field()        # listing date
settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for pachong2 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'pachong2'

SPIDER_MODULES = ['pachong2.spiders']
NEWSPIDER_MODULE = 'pachong2.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pachong2 (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# IPPOOL=[
# {"ipaddr":"https://113.12.202.50:40498"},
# {"ipaddr":"https://42.59.85.83:1133"},
# {"ipaddr":"https://60.5.254.169:8081"},
# {"ipaddr":"https://124.237.83.14:53281"},
# {"ipaddr":"https://120.26.208.102:88"}
# ]
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Cookie': 'PHPSESSID=bfep4ds3alrdh57ht157713hif; domain=wx; yfx_c_g_u_id_10000001=_ck20060819481414154416101364846; yfx_f_l_v_t_10000001=f_t_1591616894410__r_t_1591616894410__v_t_1591616894410__r_c_0; _ga=GA1.2.1765837557.1591616894; _gid=GA1.2.547071340.1591616894; _gat=1; __TD_deviceId=TT9BDIH91LKRA5K4; _dx_uzZo5y=6a38293bdf00802fe0a018001bb0e29e6ebd01757b26f5375e86414db3515987f98cfb75; Hm_lvt_94ed3d23572054a86ed341d64b267ec6=1591616895; Hm_lpvt_94ed3d23572054a86ed341d64b267ec6=1591616895; _Jo0OQK=1550512396DD7A63810EEF8F23C18E02CD7F785BDE115DC664C2A006951F4858E966DDD1C64EE8E77769E76E91E5CB12D92BF4AA004796C4C13237C33CD7E7B74159A46BDFB12DE2C98989D845E2D7305CF989D845E2D7305CF217CB44B7A47FF7BGJ1Z1RA==; smidV2=20200608194815b617fe97736465acaaa1a246c829a2560039239acab230290; gr_user_id=3c316764-40a4-461d-b689-2bfa5a26f0cc; 8fcfcf2bd7c58141_gr_session_id=db1cb732-1dc6-46cf-a1e7-9d4b18566c2e; grwng_uid=53009e0b-ca49-444b-902e-c1eb9d4b8c04; 8fcfcf2bd7c58141_gr_session_id_db1cb732-1dc6-46cf-a1e7-9d4b18566c2e=true',
    'Referer': 'https://wx.5i5j.com/ershoufang/',
}

USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
]

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'pachong2.middlewares.Pachong2SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    # 'pachong2.middlewares.MyproxiesSpiderMiddleware': 125,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 543,
    'pachong2.middlewares.RandomUserAgent': 100,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'pachong2.pipelines.Pachong2Pipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
middlewares.py:
import random
from scrapy import signals
# from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
# from pachong2.settings import IPPOOL
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from pachong2.settings import USER_AGENTS
# downloader middleware that picks a random IP
# class MyproxiesSpiderMiddleware(object):
# def __init__(self,ip=''):
# self.ip=ip
# def process_request(self,request,spider):
# thisip = random.choice(IPPOOL)
# print(">>>>" + thisip["ipaddr"])
# request.meta["proxy"] = thisip["ipaddr"]
# downloader middleware that picks a random User-Agent
class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # pick one entry at random from USER_AGENTS in settings
        user_agent = random.choice(USER_AGENTS)
        print(">>>" + user_agent)
        request.headers.setdefault('User-Agent', user_agent)
pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import csv
import os
from pachong2.items import Pachong2Item
from pachong2 import settings


class Pachong2Pipeline(object):
    def process_item(self, item, spider):
        # write woaiwojia.csv next to this file
        store_file = os.path.join(os.path.dirname(__file__), 'woaiwojia.csv')
        self.file = open(store_file, 'a', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)
        print('writing row')
        self.writer.writerow([item['title'], item['danjia'], item['zongjia'], item['house'],
                              item['floor'], item['area'], item['renovation'], item['chaoxiang'],
                              item['xiaoqu'], item['jingjiren'], item['date']])
        return item

    def close_spider(self, spider):
        self.file.close()