
My crawler learning notes

2022-06-13 01:18:00 NI3E

Getting started

 

Introductory article: "How to get started with Python web crawling?" on Zhihu. My slogan now: write code! search for the problem! debug!

Main reference book: 《Python Web Crawling: From Beginner to Practice》. Environment: PyCharm + Python 3. The browser is mainly Chrome.

Supplementary knowledge

The HTTP protocol: request methods, request headers, request data

HTTP request/response steps:
The client connects to the web server -> sends an HTTP request -> the server accepts the request and returns an HTTP response -> the TCP connection is released -> the client browser parses the HTML content.
1. The client connects to the web server.
2. The client sends an HTTP request.
Through a TCP socket, the client sends a text request message to the web server. A request message is made up of 4 parts: a request line, request headers, a blank line, and the request data.
3. The server accepts the request and returns an HTTP response.
4. The TCP connection is released.
5. The client browser parses the HTML content.

An HTTP response is also made up of four parts: a status line, message headers, a blank line, and the response body.
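To make the four-part structure concrete, here is a minimal sketch of my own (not from the book) that sends a raw GET request over a TCP socket to httpbin.org and splits the response into its head and body:

import socket

# The four parts of a request message: request line, headers, blank line, (empty) body.
request = (
    "GET /get HTTP/1.1\r\n"      # request line
    "Host: httpbin.org\r\n"      # request headers
    "Connection: close\r\n"
    "\r\n"                       # blank line; this GET request has no body
)

with socket.create_connection(("httpbin.org", 80)) as s:
    s.sendall(request.encode("ascii"))
    response = b""
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        response += chunk

# The response: status line, headers, blank line, response body.
head, _, body = response.partition(b"\r\n\r\n")
print(head.decode("iso-8859-1"))   # e.g. "HTTP/1.1 200 OK" followed by the headers
print(len(body), "bytes of body")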

The 8 HTTP request methods (a brief introduction) - Wei Banggang - 博客园 (cnblogs)

The 14 request methods in the HTTP protocol | Wonderful Every Day
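Most of these methods are available directly in the requests library. A quick illustration of my own against httpbin.org (an echo service, used only as an example):

import requests

# Each call sends a different HTTP method; httpbin simply echoes the request back.
print(requests.get("http://httpbin.org/get").status_code)
print(requests.post("http://httpbin.org/post", data={"k": "v"}).status_code)
print(requests.put("http://httpbin.org/put").status_code)
print(requests.delete("http://httpbin.org/delete").status_code)
print(requests.head("http://httpbin.org/get").status_code)
print(requests.options("http://httpbin.org/get").status_code)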

My first crawler

  

import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
# Request the page, pretending to be a normal browser via the User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows '
                         'NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
# Extract the data we need
soup = BeautifulSoup(r.text, "lxml")
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)
# Save the title to a text file
with open('title.txt', "a+") as f:
    f.write(title)

Explanation:

1. requests.get(link, headers=headers) fetches the web page. headers makes the request look like it comes from a browser. r is the returned Response object, and r.text is the page content.

2. BeautifulSoup converts the HTML into a soup object. soup.find("h1", class_="post-title").a.text.strip() extracts the title of the first article.

3. Chrome's "Inspect" feature: Ctrl+Shift+I opens developer mode, where you can locate the code for any element on the page.

4. Standard Python file operations store the result in a txt file.

 

 

Problem encountered: bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

As described in "bs4.FeatureNotFound when parsing a web page with bs4" (LitterTwo's blog, CSDN), the problem goes away after installing the lxml package.
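One possible workaround (my own sketch, not from the book): fall back to the built-in html.parser when lxml is not installed.

import requests
from bs4 import BeautifulSoup, FeatureNotFound

r = requests.get("http://www.santostang.com/")
try:
    soup = BeautifulSoup(r.text, "lxml")           # fast, but requires: pip install lxml
except FeatureNotFound:
    soup = BeautifulSoup(r.text, "html.parser")    # pure-Python parser from the standard library
print(soup.title)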

Static web page capture

Learning to use requests to get the response content:

import requests

r = requests.get('http://www.santostang.com/')
print("Text encoding:", r.encoding)
print("Response status code:", r.status_code)
print("String-form response body:", r.text)

Explanation:

1. r.text — the server's response content, decoded automatically according to the character encoding in the response headers.

2. r.encoding — the text encoding used for the server content.

3. r.status_code — the response status code.

4. r.content — the response body as bytes; gzip and deflate transfer encodings are decoded automatically.

5. r.json() — the built-in JSON decoder.
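A quick demonstration of these attributes (my own example, against httpbin.org; any JSON endpoint would do):

import requests

r = requests.get("http://httpbin.org/get")
print(r.status_code)       # e.g. 200
print(r.encoding)          # encoding guessed from the response headers
print(type(r.text))        # <class 'str'>   - decoded text body
print(type(r.content))     # <class 'bytes'> - raw byte body
print(r.json()["url"])     # parsed by the built-in JSON decoder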

Customized requests: some sites require extra request parameters, such as URL parameters, custom request headers, POST data, and timeouts.

URL parameters:

On the WWW, every information resource has a uniform and unique address, the URL (Uniform Resource Locator). A URL can be described as three parts (resource type, host domain name, resource file name) or as four parts (protocol, host, port, path). URL parameters are name/value pairs appended to the URL. The parameter list starts with a question mark (?) and uses the name=value format; multiple parameters are separated by ampersands (&).

import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=key_dict)
print("URL after encoding:", r.url)
print("String-form response body:\n", r.text)
# Output:
# URL after encoding: http://httpbin.org/get?key1=value1&key2=value2

Explanation:

The parameters are stored in a dictionary and passed via params, which requests uses to build the URL.

Request headers:

《Python crawlers: setting request headers (Headers) with Requests》
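A minimal sketch of passing custom headers with requests (the header values below are placeholders of my own):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder UA string
    "Referer": "http://www.santostang.com/",
}
r = requests.get("http://httpbin.org/headers", headers=headers)
print(r.json()["headers"])   # httpbin echoes back the headers it received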

POST requests:

When you need to send data encoded in form format, use a POST request. Pass a dictionary to the data parameter of Requests; the dictionary is automatically encoded as a form when the request is sent.

import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=key_dict)
print(r.text)

Explanation:

In the response, the form field contains the values from key_dict, which the quick check below verifies.
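A small self-contained check (my addition): httpbin echoes the form data back in its JSON response.

import requests

r = requests.post("http://httpbin.org/post", data={"key1": "value1", "key2": "value2"})
print(r.json()["form"])    # expected: {'key1': 'value1', 'key2': 'value2'}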

Also: I found suggestions that using Postman improves efficiency.

Practice & problems

Crawl the Douban Movie Top 250. URL: http://movie.douban.com/top250.

Request headers: use the actual browser request headers.

Steps:

· Site analysis: open the page and use the "Inspect" feature to view the page's request headers.

· Use requests to get the source of each movie page, turning pages with a for loop.

· Use BeautifulSoup to parse the pages and extract the movie data. (Parsing is not covered in this part of the notes; a sketch follows after the code below.)

 

Problems encountered:

1. Failed to load resource: net::ERR_TIMED_OUT (timeout).

2. Failed to load resource: net::ERR_CONNECTION_CLOSED (network error: connection closed).

3. A warning about a deprecated JS module file.

4. The errors are different every time I refresh the page.

No idea why.

What I understand now is that the filter was filtering the requests out. See: "Why is the Name column under the Network tab of Chrome DevTools always empty" (Baidu Tieba).

 

Yes, just switching the filter back to blue fixes it.

# Fetch the HTML of each page
import requests

def get_movies():
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Host': 'movie.douban.com'
    }
    for i in range(0, 10):
        # Each page shows 25 movies; the start parameter is the offset
        link = 'https://movie.douban.com/top250?start=' + str(i * 25)
        r = requests.get(link, headers=headers, timeout=10)
        print("Page", i + 1, "response status code:", r.status_code)
        print(r.text)

get_movies()
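Here is a hedged sketch of the parsing step these notes skip: extracting the titles from one page with BeautifulSoup. The div.hd / first-span structure is an assumption based on the Top 250 page layout at the time and may have changed.

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
    'Host': 'movie.douban.com'
}
r = requests.get('https://movie.douban.com/top250?start=0', headers=headers, timeout=10)
soup = BeautifulSoup(r.text, "html.parser")
for hd in soup.find_all('div', class_='hd'):   # each movie entry's header block (assumed class)
    print(hd.a.span.text.strip())              # the first span holds the movie title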

· How to get the request header (Headers) information in the browser - yqning123's blog - CSDN

· "net::ERR_CONNECTION_CLOSED" appears when accessing a web site? | WeChat Open Community

Dynamic web page capture

 

The content of a static page lives entirely in the HTML, but a page that uses JavaScript is dynamic: its content is not all in the HTML. Dynamic pages require learning two techniques: ① find the real data address by inspecting elements; ② use selenium to simulate a browser.

Finding the real data address by inspecting elements

1. AJAX, the asynchronous update technique

AJAX tutorial | 菜鸟教程 (runoob)

2. Capturing data via the real address

See the original book for details: Chapter 4, section 4.2, capturing data via the real address.

① Open the Network panel in developer mode.

② Find the real data address among the captured ("packet capture") requests; the data usually comes back in JSON format. Look for it under the XHR tab.

③ Request that address to get the JSON data.

④ Extract the comment data from the JSON, parsing it with the json library.

If you want to crawl a page whose comments are loaded with AJAX, like Taobao's, you cannot find the data you want in the page source. You need the browser's inspect feature to find the real data address, and then crawl that address.

import requests
link = """https://api-zero.livere.com/v1/comments/list?callback=jQuery112403473268296510956_1531502963311&limit=10&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1531502963313"""
headers = {'User-Agent' : 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers= headers)
print (r.text)

import json
# Extract the JSON part of the string (strip the JSONP callback wrapper)
json_string = r.text
json_string = json_string[json_string.find('{'):-2]
json_data = json.loads(json_string)
comment_list = json_data['results']['parents']
for eachone in comment_list:
    message = eachone['content']
    print(message)

Explanation: json_string[json_string.find('{'):-2] extracts only the part of the string that is valid JSON, stripping the JSONP callback wrapper and the trailing ");". Then json.loads converts the string-form response body into JSON data. Using the structure of the JSON data, we extract the comments into the list comment_list, and finally a for loop extracts and prints each comment's text.
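A tiny worked example of that slicing, on a made-up JSONP-style string (the callback name is fabricated):

import json

raw = 'jQuery112403473268296510956_1531502963311({"results": {"parents": []}});'
json_string = raw[raw.find('{'):-2]    # drop "callback(" at the front and ");" at the end
print(json_string)                     # {"results": {"parents": []}}
print(json.loads(json_string))         # {'results': {'parents': []}}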

3. URL pattern and crawling with a for loop

Two important variables in the URL: offset and limit. limit is the maximum number of comments per page; offset marks where this page's first comment falls in the overall comment list. Changing offset in the URL changes the page.

import requests
import json

def single_page_comment(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    r = requests.get(link, headers=headers)
    # Extract the JSON part of the string
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)

for page in range(1, 4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112403473268296510956_1531502963311&limit=10&offset="
    link2 = "&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1531502963316"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)

Supplement

Quick start with crawlers (part 2): dynamic page crawling - Zhihu

"The selenium library is cumbersome to use and relatively slow to crawl with, so the first method is more commonly used."

I did not manage to run the selenium method's code successfully.

① Find the source after the JS has rendered, i.e. the source of the content actually displayed on the site.

In developer mode, Headers contains the request and response headers of the current request; Preview is a preview of the response body; Response is the raw server response. Click through the requests under XHR and JS and use Preview to look at each response until you find the result we want.

② Find the Request URL, the path that is really being requested.

③ Write the code based on the path found.

Using selenium to simulate the browser

Usually it may not be easy to find the source data address in Chrome; in that case, use the browser rendering engine approach: let the browser itself parse the HTML, apply the CSS, and execute the JavaScript, and use Python's selenium library to drive the browser and complete the crawl.

A selenium script can control the browser. The following code uses Firefox. After installing the package, you need to download geckodriver (Releases · mozilla/geckodriver · GitHub) and add it to PATH.

from selenium import webdriver

# Change the path below to the location of geckodriver.exe on your computer
driver = webdriver.Firefox(executable_path=r'C:\Users\santostang\Desktop\geckodriver.exe')
driver.get("http://www.santostang.com/2018/07/04/hello-world/")

Because selenium uses the browser to render the page, the comment data has already been rendered into the HTML. We can use Chrome's "Inspect" feature to locate the elements.

1. Find the HTML tag that holds the comments: open the article page in Chrome and locate the comment data.
2. Try to get one comment.
3. Print driver.page_source and handle the iframe.
from selenium import webdriver
import time

# Change the path below to the location of geckodriver.exe on your computer
driver = webdriver.Firefox(executable_path=r'C:\Users\santostang\Desktop\geckodriver.exe')
driver.implicitly_wait(20)  # implicit wait, at most 20 seconds
driver.get("http://www.santostang.com/2018/07/04/hello-world/")
time.sleep(5)
for i in range(0, 3):
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Switch into the comment iframe, find the "load more" button and click it
    driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
    load_more = driver.find_element_by_css_selector('button.more-btn')
    load_more.click()
    # Switch back out of the iframe
    driver.switch_to.default_content()
    time.sleep(2)
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[title='livere']"))
comments = driver.find_elements_by_css_selector('div.reply-content')
for eachcomment in comments:
    content = eachcomment.find_element_by_tag_name('p')
    print(content.text)

After that there is a bug and I can't get it to open.

 

How selenium selects elements:

  • find_element_by_id: select by element id, e.g. driver.find_element_by_id('loginForm')
  • find_element_by_name: select by element name, e.g. driver.find_element_by_name('password')
  • find_element_by_xpath: select by XPath, e.g. driver.find_element_by_xpath("//form[1]")
  • find_element_by_link_text: select by the link text
  • find_element_by_partial_link_text: select by part of the link text
  • find_element_by_tag_name: select by tag name
  • find_element_by_class_name: select by class name
  • find_element_by_css_selector: select with a CSS selector
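A short sketch using a couple of these selectors (selenium 3 style, matching the rest of these notes; the driver path and the CSS selector are only illustrative):

from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'C:\Users\santostang\Desktop\geckodriver.exe')
driver.get("http://www.santostang.com/2018/07/04/hello-world/")

title = driver.find_element_by_css_selector("h1.post-title")   # select with a CSS selector
links = driver.find_elements_by_tag_name("a")                  # select every <a> tag
print(title.text, "/", len(links), "links on the page")
driver.quit()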

Parse web pages

After crawling the source code with the requests library, you need to parse the page to extract the data.

Three common methods: regular expressions, BeautifulSoup, and lxml. Due to my laziness, I'm not going to learn the regular expression method.

Install the bs4 library in advance.

Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html ; BeautifulSoup beginner tutorial - python - web教程网

Python standard library parser: BeautifulSoup(markup, "html.parser")

lxml HTML parser: BeautifulSoup(markup, "lxml") — requires installing the C library

lxml XML parser: BeautifulSoup(markup, ["lxml", "xml"]) — requires installing the C library

html5lib: BeautifulSoup(markup, "html5lib") — slow, does not depend on external extensions, best fault tolerance.

Using Beautiful Soup to parse web page content - CSDN Blog

Example:

import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
first_title = soup.find("h1", class_="post-title").a.text.strip()
print("The title of the first article is:", first_title)

A BeautifulSoup object is a complex tree structure, and getting web content is the process of extracting the objects (tree nodes) we need from it. The extraction methods are: traversing the document tree, searching the document tree, and CSS selectors; see the short sketch below.
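The three styles on one tiny made-up document (a sketch of my own):

from bs4 import BeautifulSoup

html = '<div id="main"><h1 class="post-title"><a href="/hello">Hello world!</a></h1></div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.div.h1.a.text)                                 # traverse the document tree
print(soup.find("h1", class_="post-title").a["href"])     # search the document tree
print(soup.select("div#main h1.post-title a")[0].text)    # CSS selector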

A little learning reflection

· Read the book together with notes found online; the knowledge in a book is not guaranteed to be the latest.

· The Chinese-language support in developer mode is really nice.

Original site

Copyright notice
This article was created by [NI3E]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/02/202202280553489337.html