
Web Crawler Exercises (I)

2022-07-04 13:00:00 InfoQ

This post has special commemorative significance: it was published on my birthday, marks the start of my online note-taking, and has deepened my endless love of web crawlers. I hope you can support my very first post!!! There are more good things to come.

10. (Elective question 1) Target site: https://www.sogou.com/
Requirements:
1. The user enters a search term, a start page, and an end page.
2. Crawl the source code of the relevant result pages based on the user's input.
3. Save the fetched data locally.
import requests

word = input("Please enter the search term: ")
start = int(input("Please enter the start page: "))
end = int(input("Please enter the end page: "))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
for n in range(start, end + 1):
    url = f'https://www.sogou.com/web?query={word}&page={n}'
    response = requests.get(url, headers=headers)
    # Save each result page to a local HTML file
    with open(f'{word}_page_{n}.html', "w", encoding="utf-8") as file:
        file.write(response.content.decode("utf-8"))
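
As a side note (a sketch, not part of the original solution): requests can also assemble the query string itself through its params argument, which percent-encodes non-ASCII search terms automatically. The parameter names query and page below are the ones identified in the analysis that follows.

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # shortened UA purely for illustration
word, n = "python", 2
# requests appends ?query=...&page=... and handles the URL-encoding itself
response = requests.get('https://www.sogou.com/web',
                        params={'query': word, 'page': n},
                        headers=headers)
print(response.url)  # the final URL that requests built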
I. Analyzing the web page
1. First open the site and run a search:
python - Sogou Search (sogou.com)
https://www.sogou.com/web?query=python&_ast=1650447467&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=7606&sst0=1650447682406&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650447682406
 
2. Search for "Python" and "中国" (China) separately and compare the resulting URLs:
China - Sogou Search (sogou.com)
https://www.sogou.com/web?query=%E4%B8%AD%E5%9B%BD&_ast=1650446881&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=9319&sst0=1650447465594&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650447465594


 
3. Split the URL at each "&" to look at the parameters one by one.
Conclusions: (1) The query parameter carries the search term; Chinese characters are percent-escaped (no need to memorize the escaped form).
(2) The ie parameter is the encoding used for that escaping, e.g. utf-8. It shows up under Network → Payload as ie=utf-8, and for some requests under Network → Response as charset=utf-8 (screenshot omitted).
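To verify conclusion (1), the escaping can be reproduced with the standard library; a quick check, assuming UTF-8 (which matches the %E4%B8%AD%E5%9B%BD visible in the URL above):

from urllib.parse import quote, unquote

print(quote("中国"))                   # %E4%B8%AD%E5%9B%BD, matching the query parameter above
print(unquote("%E4%B8%AD%E5%9B%BD"))  # 中国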
 
4. These two parameters alone do not satisfy the exercise's page-turning requirement, so turn the page manually and inspect the URL again:
 
https://www.sogou.com/web?query=python&page=2&ie=utf8
    
5. Finally, the critical moment! Having found the pattern and the meaning of the important parameters, we can build a general URL:
url = f'https://www.sogou.com/web?query={word}&page={n}'
 
# Fill the variable parts with Python variables; each loop iteration builds a fresh URL
II. Finding the parameters (compare the captured URLs, then strip them down step by step):
https://www.sogou.com/web?query=Python&_asf=www.sogou.com&_ast=&w=01019900&p=40040100&ie=utf8&from=index-nologin&s_from=index&sut=12736&sst0=1650428312860&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428312860
https://www.sogou.com/web?query=java&_ast=1650428313&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=10734&sst0=1650428363389&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428363389
https://www.sogou.com/web?query=C%E8%AF%AD%E8%A8%80&_ast=1650428364&_asf=www.sogou.com&w=01029901&p=40040100&dp=1&cid=&s_from=result_up&sut=11662&sst0=1650428406805&lkt=0%2C0%2C0&sugsuv=1650427656976942&sugtime=1650428406805
https://www.sogou.com/web?
https://www.sogou.com/web?query=Python&
https://www.sogou.com/web?query=Python&page=2&ie=utf8
 
# Values captured from the browser's developer tools:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44',
    'cookie': "IPLOC=CN3600; SUID=191166B6364A910A00000000625F8708; SUV=1650427656976942; browerV=3; osV=1; ABTEST=0|1650428297|v17; SNUID=636A1DCD7B7EA775332A80CB7B347D43; sst0=663; [email protected]@@@@@@@@@; LSTMV=229,37; LCLKINT=1424"
}
# The captured URL, and the template distilled from it:
url = "https://www.sogou.com/web?query=Python&_ast=1650429998&_asf=www.sogou.com&w=01029901&cid=&s_from=result_up&sut=5547&sst0=1650430005573&lkt=0,0,0&sugsuv=1650427656976942&sugtime=1650430005573"
url = "https://www.sogou.com/web?query={}&page={}"


# The UA must be passed to requests as a dictionary through the headers argument
1. Errors around headers:
# A dictionary entry has the form "key": "value", and the trailing comma must never be forgotten
# headers is a keyword argument, so it must be spelled exactly; misspell it and you get the errors below
 
 
import requests

url = "https://www.bxwxorg.com/"
hearders = {  # note the deliberate misspelling "hearders", used consistently below
    'cookie': 'Hm_lvt_46329db612a10d9ae3a668a40c152e0e=1650361322; mc_user={"id":"20812","name":"20220415","avatar":"0","pass":"2a5552bf13f8fa04f5ea26d15699233e","time":1650363349}; Hm_lpvt_46329db612a10d9ae3a668a40c152e0e=1650363378',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
response = requests.get(url, hearders=hearders)  # requests.get has no parameter named "hearders"
print(response.content.decode("UTF-8"))
 
 
Traceback (most recent call last):
  File "D:/pythonproject/second_assignment.py", line 141, in <module>
    response = requests.get(url, hearders=hearders)
  File "D:\python37\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "D:\python37\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'hearders'
 
# Reason: all three occurrences of "hearders" are spelled consistently, but the keyword must be "headers", so requests raises a TypeError
 
 
 
 
# If the name is instead misspelled inconsistently (e.g. "heades" in the definition), a different error appears
 
import requests

word = input("Please enter the search term: ")
start = int(input("Please enter the start page: "))
end = int(input("Please enter the end page: "))
heades = {  # defined as "heades" ...
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
for n in range(start, end + 1):
    url = f'https://www.sogou.com/web?query={word}&page={n}'
    response = requests.get(url, headers=headers)  # ... but referenced as "headers"
    with open(f'{word}_page_{n}.html', "w", encoding="utf-8") as file:
        file.write(response.content.decode("utf-8"))
 
 
Traceback (most recent call last):
  File "D:/pythonproject/second_assignment.py", line 117, in <module>
    response = requests.get(url, headers=headers)
NameError: name 'headers' is not defined
 
# Reason: the name is spelled inconsistently, so Python raises a NameError for the undefined "headers"
 
 
 
 
# The correct version. Try not to get it wrong!
 
import requests

url = "https://www.bxwxorg.com/"
headers = {
    'cookie': 'Hm_lvt_46329db612a10d9ae3a668a40c152e0e=1650361322; mc_user={"id":"20812","name":"20220415","avatar":"0","pass":"2a5552bf13f8fa04f5ea26d15699233e","time":1650363349}; Hm_lpvt_46329db612a10d9ae3a668a40c152e0e=1650363378',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.44'
}
response = requests.get(url, headers=headers)
print(response.content.decode("UTF-8"))
 
 
III. The loop
for n in range(start, end + 1):
# Why end + 1: range() is left-closed and right-open, so end itself would not be reached; adding 1 brings the loop up to the real end page
Because we crawl page by page and need every page's URL, the URL construction must sit inside the loop.
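
A quick sanity check of range()'s half-open behavior:

print(list(range(1, 3)))      # [1, 2]  (page 3 would be skipped)
print(list(range(1, 3 + 1)))  # [1, 2, 3]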
IV. Basic introduction to requests (built on top of urllib3):
1. Installation: Win + R → cmd → type pip install requests
# If the module cannot be imported after installation, the package went into the system Python environment while your project uses a virtual environment that lacks it, so you must point the project at the right interpreter
# Specifying the interpreter (PyCharm): type "where python" in the terminal to find the installation path → File → Settings → Project → Project Interpreter → click the settings icon → Add → System Interpreter → ... → (the path found by "where python") → OK → Apply → OK
2. The standard-library request module for crawlers: urllib.request
(1) Common methods:
urllib.request.urlopen("url")
Purpose: send a request to the site and receive the response
byte_stream = response.read().decode('utf-8')
urllib.request.Request("url", headers=headers_dict)
# needed because urlopen() does not support a custom User-Agent
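
A minimal sketch of the urllib.request flow just described, wrapping the URL in a Request object so a custom User-Agent can be attached (the UA string is shortened here for readability):

from urllib.request import Request, urlopen

url = "https://www.sogou.com/web?query=python&page=1"
headers = {'User-Agent': 'Mozilla/5.0'}

# urlopen() alone cannot set a User-Agent, so build a Request first
request = Request(url, headers=headers)
response = urlopen(request)
html = response.read().decode('utf-8')  # read() returns bytes; decode to str
print(html[:200])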
(2) The requests module
(2-1) Common usage:
requests.get(url)
(2-2) Methods and attributes of the response object:
response.text returns the data as a Unicode string (str)
response.content returns the raw byte stream (binary)
response.content.decode('utf-8') decodes the bytes manually
response.url returns the URL
response.encoding = "encoding" sets the encoding used by response.text
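
A short demonstration of these attributes against the Sogou page used earlier (a sketch; the commented outputs reflect typical behavior):

import requests

response = requests.get('https://www.sogou.com/web?query=python&page=1',
                        headers={'User-Agent': 'Mozilla/5.0'})
response.encoding = 'utf-8'                    # the codec that response.text will use
print(response.url)                            # final URL after any redirects
print(type(response.text))                     # <class 'str'>
print(type(response.content))                  # <class 'bytes'>
print(response.content.decode('utf-8')[:100])  # manual decode of the raw bytes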
(3) Sending a POST request through the requests module:
cookie:
A cookie identifies the user through information recorded on the client.
HTTP is a connectionless protocol: the client-server interaction is limited to one request/response cycle, after which the connection is dropped, and the next request is treated as coming from a brand-new client. To keep the association alive, so the server knows a request was initiated by the same user as before, the client-side information must be saved somewhere.
session:
A session identifies the user through information recorded on the server; "session" here means a conversation.
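
requests models exactly this with requests.Session, which stores the cookies a server sets and sends them back on later requests. A hedged sketch (the httpbin.org endpoints are used purely for illustration):

import requests

session = requests.Session()

# The server sets a cookie; the Session stores it on the client side
session.get('https://httpbin.org/cookies/set/user/alice')

# Later requests automatically carry the stored cookie, so the server
# can recognize the requests as one continuing conversation
print(session.get('https://httpbin.org/cookies').text)  # {"cookies": {"user": "alice"}}

# A POST request through the same session works the same way
response = session.post('https://httpbin.org/post', data={'query': 'python'})
print(response.status_code)  # 200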

With this, you can get the page source code!!!
 
Don't worry: the best things arrive when you least expect them. So what we should do is try with hope and wait for the good to come. May your world always be full of sunshine, free of unnecessary sadness. May you travel the road of life to the places you want to reach, and may the person walking it be the self you like best.
Today is a happy day, April 21st. It is not only my birthday but also my first CSDN draft. As a newcomer, please give me your support; I will post a link this afternoon, and I would appreciate a like, a favorite, and a follow!!!
