Lagou Crawler

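Analyzing the Lagou site:

Searching Lagou for a keyword (Python in this example) returns the matching job postings. We first collect every city listed on the site, then loop over those cities to crawl Python postings nationwide.

The Headers panel of the captured request shows the details we need to reproduce: the request URL, the request method, the cookies, and the User-Agent.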
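
As a rough illustration of those four pieces, the sketch below builds (but never sends) a request and reads the pieces back. It uses the standard-library `urllib.request` rather than the `requests` library used in the rest of this post, and the cookie value is a made-up placeholder, not something Lagou issues.

```python
from urllib.request import Request

# URL and User-Agent are taken from this post; the cookie is a made-up placeholder.
CITY_URL = "https://www.lagou.com/jobs/allCity.html"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"


def build_request(url, user_agent, cookie=None):
    """Assemble a GET request object without sending it."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie
    return Request(url, headers=headers, method="GET")


req = build_request(CITY_URL, USER_AGENT, cookie="placeholder=1")
print(req.get_method())               # request method
print(req.full_url)                   # request URL
print(req.get_header("User-agent"))   # urllib capitalizes header names this way
print(req.get_header("Cookie"))       # cookie header
```

Nothing here touches the network, so it is safe to run while experimenting with header values.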

Fetching all cities:

            
import re

import requests


class CrawlLaGou(object):
    def __init__(self):
        # Use a session so cookies persist between requests
        self.lagou_session = requests.session()
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
        }
        self.city_list = []

    # Fetch the city list
    def crawl_city(self):
        # Regex that pulls the city names out of the links in the HTML.
        # The closing </a> is needed: a trailing lazy group on its own only matches the empty string.
        city_search = re.compile(r'www\.lagou\.com\/.*\/">(.*?)</a>')
        # Page that lists every city
        city_url = "https://www.lagou.com/jobs/allCity.html"
        city_result = self.crawl_request(method="GET", url=city_url)
        self.city_list = city_search.findall(city_result)
        self.lagou_session.cookies.clear()

    # Send the request and return the response text
    def crawl_request(self, method, url, data=None, info=None):
        if method == "GET":
            response = self.lagou_session.get(url=url, headers=self.header)
        elif method == "POST":
            response = self.lagou_session.post(url=url, headers=self.header, data=data)
        response.encoding = "utf8"
        return response.text


if __name__ == '__main__':
    lagou = CrawlLaGou()
    lagou.crawl_city()
    print(lagou.city_list)

The User-Agent value in self.header comes from the same Headers panel. The code first downloads the page source at the given URL, then uses a regular expression to extract every city name from it.
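
To make the extraction step concrete without touching the network, here is a minimal sketch that runs a city regex over a hand-written HTML fragment. The fragment is hypothetical (it only imitates the shape of the links on the all-cities page), and the pattern is a slightly tighter variant that uses `[^"]*` for the URL path.

```python
import re

# Hypothetical fragment shaped like the links on allCity.html (not real site output).
SAMPLE_HTML = '''
<li><a href="https://www.lagou.com/beijing-zhaopin/">北京</a></li>
<li><a href="https://www.lagou.com/shanghai-zhaopin/">上海</a></li>
'''

# The lazy group is closed by </a>, so it captures the link text.
CITY_PATTERN = re.compile(r'www\.lagou\.com/[^"]*/">(.*?)</a>')

# Without anything after it, a lazy group is satisfied by the empty string.
LAZY_ONLY_PATTERN = re.compile(r'/">(.*?)')


def extract_cities(html):
    """Return the city names found in the given HTML text."""
    return CITY_PATTERN.findall(html)


cities = extract_cities(SAMPLE_HTML)
empty_matches = LAZY_ONLY_PATTERN.findall(SAMPLE_HTML)
print(cities)
print(empty_matches)
```

The second pattern is included only to show why the capture group needs a terminator after it: on its own, `(.*?)` stops as early as possible and captures nothing.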

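With all the city names collected, we move on to the postings for each city. Go back to the job list page (https://www.lagou.com/jobs/list_python) and locate the request whose response carries the job data, together with its request headers.

Once we have the response body, a JSON viewer such as https://www.json.cn/ makes it easier to inspect and to find the fields we need.

The parsed JSON shows that the list of postings lives at 'content' → 'positionResult' → 'result'.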
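
A small sketch of walking that path, assuming a response shaped as described. The sample payload is invented for illustration and contains only a few of the real fields. Chained `.get()` calls are used so that an unexpected response (for example an error message) yields an empty list instead of a `KeyError`.

```python
import json

# Invented sample; only the nesting 'content' -> 'positionResult' -> 'result' comes from the post.
SAMPLE_RESPONSE = '''
{"content": {"positionResult": {"result": [
    {"positionId": 1, "positionName": "Python Developer", "city": "北京", "salary": "15k-25k"}
]}}}
'''


def extract_jobs(raw_text):
    """Return the list of postings, or an empty list if the path is missing."""
    payload = json.loads(raw_text)
    return payload.get("content", {}).get("positionResult", {}).get("result", [])


jobs = extract_jobs(SAMPLE_RESPONSE)
for job in jobs:
    print(job["positionId"], job["positionName"], job["city"], job["salary"])
```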

Fetching the job postings:

            
import json

# Fetch the job postings for one city (method of CrawlLaGou; also needs `import re`)
def crawl_city_job(self, city):
    # URL of the job list page
    first_request_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % city
    first_response = self.crawl_request(method="GET", url=first_request_url)
    # Regex that reads the total number of result pages from the list page
    total_page_search = re.compile(r'class="span\stotalNum">(\d+)')
    try:
        total_page = total_page_search.search(first_response).group(1)
    except AttributeError:
        # search() returned None: this city has no postings, so stop here
        return
    else:
        for i in range(1, int(total_page) + 1):
            # Form fields: page number and search keyword
            data = {
                "pn": i,
                "kd": "python"
            }
            # URL that returns the job data as JSON
            page_url = "https://www.lagou.com/jobs/positionAjax.json?city=%s&needAddtionalResult=false" % city
            # Set the matching Referer; encoded to bytes because the city name is not ASCII
            referer_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % city
            self.header['Referer'] = referer_url.encode()
            response = self.crawl_request(method="POST", url=page_url, data=data, info=city)
            lagou_data = json.loads(response)
            # List of postings found by inspecting the JSON
            job_list = lagou_data['content']['positionResult']['result']
            for job in job_list:
                print(job)

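The code above first requests the job list page (first_request_url) and reads the page count from its HTML. The page count also tells us whether the city has any postings: if it is missing, the method returns. Otherwise it requests the job data URL (page_url) once per page, sending the data fields and the Referer header, and finally follows 'content' → 'positionResult' → 'result' in the response to reach the postings we need.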
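
The paging flow described above can be sketched as a pure function that prepares one request per page without sending anything. The function name is hypothetical. Percent-encoding the city with `urllib.parse.quote` is an alternative way of keeping the Referer header ASCII-safe, and whether the site accepts that form is an assumption that has not been verified here.

```python
from urllib.parse import quote

# URL templates taken from this post.
LIST_URL = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput="
AJAX_URL = "https://www.lagou.com/jobs/positionAjax.json?city=%s&needAddtionalResult=false"


def build_page_requests(city, keyword, total_pages):
    """Return one (url, headers, form_data) tuple per result page."""
    encoded_city = quote(city)  # percent-encode so the header value stays ASCII
    referer = LIST_URL % encoded_city
    url = AJAX_URL % encoded_city
    return [
        (url, {"Referer": referer}, {"pn": page, "kd": keyword})
        for page in range(1, total_pages + 1)
    ]


requests_to_send = build_page_requests("北京", "python", 3)
for url, headers, form_data in requests_to_send:
    print(form_data["pn"], url)
```

Separating request preparation from sending makes the page loop easy to check on its own before pointing it at the live site.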

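Handling the "operation too frequent, please try again later" response:

If this message appears while the crawler is running, the site has detected the crawler. We then need to clear the cookies, request the list page again to get fresh ones, and pause the program for 10 seconds before continuing.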

            
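import time

# Send the request and return the response text (method of CrawlLaGou)
def crawl_request(self, method, url, data=None, info=None):
    while True:
        if method == "GET":
            response = self.lagou_session.get(url=url, headers=self.header)
        elif method == "POST":
            response = self.lagou_session.post(url=url, headers=self.header, data=data)
        response.encoding = "utf8"
        # Handle the "too frequent" response; the site replies in Simplified Chinese
        if '频繁' in response.text:
            print(response.text)
            # Drop the flagged cookies
            self.lagou_session.cookies.clear()
            # Reload the list page to obtain fresh cookies (a plain GET avoids nested retries)
            first_request_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % info
            self.lagou_session.get(url=first_request_url, headers=self.header)
            # Wait 10 seconds, then retry the original request
            time.sleep(10)
            continue
        return response.text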
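
The same recovery logic can be written as a small helper that is testable without the network. This is only a sketch: the helper name and its parameters are hypothetical, and unlike the loop above it gives up after a fixed number of attempts instead of retrying forever. The fetch, reset, and sleep steps are passed in as functions so that the demo can use stand-ins.

```python
import time


def fetch_with_retry(fetch, reset, is_throttled, wait_seconds=10, max_attempts=5, sleep=time.sleep):
    """Call fetch() until the response is not throttled or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        text = fetch()
        if not is_throttled(text):
            return text
        reset()  # e.g. clear cookies and reload the list page
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise RuntimeError("still throttled after %d attempts" % max_attempts)


# Demo with stand-ins: two throttled responses, then a normal one.
responses = iter(["操作太频繁", "操作太频繁", '{"content": {}}'])
resets = []
waits = []
result = fetch_with_retry(
    fetch=lambda: next(responses),
    reset=lambda: resets.append("reset"),
    is_throttled=lambda text: "频繁" in text,
    sleep=waits.append,  # record the wait instead of actually sleeping
)
print(result, len(resets), waits)
```

In the crawler, `fetch` would wrap the session's GET or POST call and `reset` would clear the cookies and reload the list page.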

Saving the crawled data to a database:

So far we have only printed every entry of the result list as-is, which is hard to read. We need to create a database, pick out the fields we care about, and insert them.
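
Picking out the fields can be sketched as a plain dictionary filter. The field names come from the table definition in this post, but the selection shown here is shortened, and the sample posting (including its `hypotheticalExtra` key) is invented.

```python
# A shortened selection of the fields stored later in this post.
WANTED_FIELDS = (
    "positionId", "positionName", "city", "salary",
    "workYear", "education", "companyShortName",
)


def pick_fields(job, fields=WANTED_FIELDS):
    """Keep only the wanted keys; missing keys become None."""
    return {name: job.get(name) for name in fields}


# Invented posting for illustration.
raw_job = {
    "positionId": 1,
    "positionName": "Python Developer",
    "city": "北京",
    "salary": "15k-25k",
    "hypotheticalExtra": "ignored",
}
row = pick_fields(raw_job)
print(row)
```

Using `.get()` means a posting that lacks an optional field still produces a complete row, with `None` in the gap.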

Creating the database table:

            
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker  # SQLAlchemy 1.4+

# Create the database connection
engine = create_engine("mysql+pymysql://root:root@127.0.0.1:3306/lagou?charset=utf8")
# Session factory used to talk to the database
Session = sessionmaker(bind=engine)
# Declarative base class
Base = declarative_base()

class Lagoutables(Base):
    # Table name
    __tablename__ = 'lagou_java'
    # id: primary key, auto-increment
    id = Column(Integer, primary_key=True, autoincrement=True)
    # Position ID
    positionID = Column(Integer, nullable=True)
    # Longitude
    longitude = Column(Float, nullable=False)
    # Latitude
    latitude = Column(Float, nullable=False)
    # Position name
    positionName = Column(String(length=50), nullable=False)
    # Years of experience
    workYear = Column(String(length=20), nullable=False)
    # Education
    education = Column(String(length=20), nullable=False)
    # Job nature
    jobNature = Column(String(length=20), nullable=True)
    # Company type (funding stage)
    financeStage = Column(String(length=30), nullable=True)
    # Company size
    companySize = Column(String(length=30), nullable=True)
    # Industry
    industryField = Column(String(length=30), nullable=True)
    # City
    city = Column(String(length=10), nullable=False)
    # Position tags
    positionAdvantage = Column(String(length=200), nullable=True)
    # Company short name
    companyShortName = Column(String(length=50), nullable=True)
    # Company full name
    companyFullName = Column(String(length=200), nullable=True)
    # Salary
    salary = Column(String(length=20), nullable=False)
    # Crawl date
    crawl_date = Column(String(length=20), nullable=False)

# Create the table if it does not exist yet
Base.metadata.create_all(engine)

Inserting data:

            
# Methods of the storage class (needs `import time`)
def __init__(self):
    self.mysql_session = Session()
    self.date = time.strftime("%Y-%m-%d", time.localtime())

# Store one posting
def insert_item(self, item):
    # Today's date
    date = time.strftime("%Y-%m-%d", time.localtime())
    # Build the row
    data = Lagoutables(
        # Position ID
        positionID=item['positionId'],
        # Longitude
        longitude=item['longitude'],
        # Latitude
        latitude=item['latitude'],
        # Position name
        positionName=item['positionName'],
        # Years of experience
        workYear=item['workYear'],
        # Education
        education=item['education'],
        # Job nature
        jobNature=item['jobNature'],
        # Company type (funding stage)
        financeStage=item['financeStage'],
        # Company size
        companySize=item['companySize'],
        # Industry
        industryField=item['industryField'],
        # City
        city=item['city'],
        # Position tags
        positionAdvantage=item['positionAdvantage'],
        # Company short name
        companyShortName=item['companyShortName'],
        # Company full name
        companyFullName=item['companyFullName'],
        # Salary
        salary=item['salary'],
        # Crawl date
        crawl_date=date
    )

    # Before storing, check whether this posting was already saved today
    query_result = self.mysql_session.query(Lagoutables).filter(Lagoutables.crawl_date == date,
                                                                Lagoutables.positionID == item['positionId']).first()

    if query_result:
        print('Posting already exists %s:%s:%s' % (item['positionId'], item['city'], item['positionName']))
    else:
        # Add the row
        self.mysql_session.add(data)
        # Commit the transaction
        self.mysql_session.commit()
        print('Added posting %s' % item['positionId'])
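
An alternative to querying before every insert is to let the database reject duplicates through a unique constraint. The sketch below shows that idea with the standard-library `sqlite3` module and an in-memory database, not the SQLAlchemy and MySQL setup used in this post. The table and column names are simplified stand-ins.

```python
import sqlite3


def open_store():
    """Create an in-memory table with one row allowed per (position, crawl date)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            position_id INTEGER NOT NULL,
            position_name TEXT NOT NULL,
            crawl_date TEXT NOT NULL,
            UNIQUE (position_id, crawl_date)
        )
        """
    )
    return conn


def insert_job(conn, position_id, position_name, crawl_date):
    """Insert a row; return True if it was new, False if it was a duplicate."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO jobs (position_id, position_name, crawl_date) VALUES (?, ?, ?)",
        (position_id, position_name, crawl_date),
    )
    conn.commit()
    return cursor.rowcount == 1


conn = open_store()
first = insert_job(conn, 1, "Python Developer", "2019-08-20")
second = insert_job(conn, 1, "Python Developer", "2019-08-20")    # same day: duplicate
next_day = insert_job(conn, 1, "Python Developer", "2019-08-21")  # new day: stored again
count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(first, second, next_day, count)
```

The uniqueness rule mirrors the check above (same position ID on the same crawl date), so a posting is stored once per day however often it is seen.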

The job postings are now saved in the database.

Full code:
GitHub: https://github.com/KeerZhou/crawllagou
CSDN: https://download.csdn.net/download/keerzhou/11584694
