Programmers Show No Martial Virtue: Using Multithreading for a Girlfriend
2022-07-04 05:29:00 【Program ape Li Xun Tian】
1. For love
Just yesterday, a friend suddenly sent me a private message:
"Xun Tian, want to see a joke?"
I said: "I'd rather watch you become the joke! What's so funny?"
He said: "No, I found a joke website, and there are tons of jokes on it!"
I said: "So what?"
He said: "The programmer behind that site seems a bit dim. I want to write a program to grab all the jokes and send my girlfriend one joke every day!"
I said: "Oh, go grab them then. I don't have a girlfriend, so I don't need them."
2. Stupid programmer
At his not-so-sincere pleading, I went to take a look at a page on the joke site:
http://xiaohua.zol.com.cn/detail1/1.html
The site does have its flaws:
First, it doesn't use HTTPS, which is basic standard equipment for a modern website.
Second, its URLs are easy to guess. You see 1.html, so is there a 2.html? I tried it, and there really is.
That makes things easy: to grab everything, just follow the vine and fetch 1, 2, 3 ... 100000 in turn.
You could say this site has almost no defense against crawlers.
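"Following the vine" boils down to generating one URL per number. Here is a minimal sketch, assuming the 1.html, 2.html pattern really holds for every ID; the names BASE_URL, joke_url and joke_urls are made up for illustration:

```python
# Assumption: every joke page follows the <number>.html pattern seen above.
BASE_URL = 'http://xiaohua.zol.com.cn/detail1/'


def joke_url(joke_id):
    """Return the page URL for one joke ID."""
    return f'{BASE_URL}{joke_id}.html'


def joke_urls(first, last):
    """Yield the URL for every ID from first to last, inclusive."""
    for joke_id in range(first, last + 1):
        yield joke_url(joke_id)


if __name__ == '__main__':
    # Show the first three guessed URLs without touching the network.
    for page in joke_urls(1, 3):
        print(page)
```

Note the trailing slash in BASE_URL: without it, the ID would be glued onto "detail1" and the request would go to a different path.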
3. A simple crawler
A few minutes later, he came to me again, his face caught between joy and tears.
Yes, you read that right! Joy, because he had quickly written the crawler and grabbed some jokes. Tears, because the program died after grabbing only a few.
Take a look at his program:
import requests
import bs4

url = 'http://xiaohua.zol.com.cn/detail1/'  # trailing slash, so the ID becomes the file name

with open('joke.txt', 'w', encoding='utf-8') as f:  # open for writing, not reading
    for joke_id in range(1, 100001):  # IDs 1 through 100000
        response = requests.get(f'{url}{joke_id}.html')
        soup = bs4.BeautifulSoup(response.text, 'lxml')
        joke_text = soup.select('div.article-text')[0].getText().strip()
        f.write(f'{joke_id}, {joke_text}\n')
His code is simple:
- Fetch the page content with requests.get, building each page's URL dynamically, which only works thanks to the site programmer's lack of defenses.
- Parse out the joke text with BeautifulSoup.
- Save it to joke.txt. Good heavens, that's 100,000 jokes in one go. How many girlfriends does he have??
On the surface the program looks fine, but one glance with my not-too-nearsighted eyes told me it would not survive long. Think about what the problem is.
4. We have to optimize
The program above would survive only one episode of a TV drama, because if any single network request raises an error, the whole program dies! And failed network requests are perfectly normal; there are many reasons a request can fail.
This has to change. It must change!
import requests
import bs4

url = 'http://xiaohua.zol.com.cn/detail1/'

with open('joke.txt', 'w', encoding='utf-8') as f:
    for joke_id in range(1, 100001):
        try:
            # A timeout keeps one stalled request from hanging the whole crawl
            response = requests.get(f'{url}{joke_id}.html', timeout=10)
            soup = bs4.BeautifulSoup(response.text, 'lxml')
            joke_text = soup.select('div.article-text')[0].getText().strip()
            f.write(f'{joke_id}, {joke_text}\n')
        except Exception as e:
            print("I didn't catch the joke, moving on to the next")
With the network request wrapped in try/except, a failed request just prints "I didn't catch the joke, moving on to the next", and at least the program doesn't stop!
Now this thing can stay alive, and steadily too!
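Many network failures are temporary, so one option beyond skipping a joke is to retry it a few times first. A rough sketch of that idea, which is not part of my friend's program: with_retries is a hypothetical helper that accepts any zero-argument function, for example a lambda wrapping requests.get with a timeout.

```python
import time


def with_retries(fetch, attempts=3, delay=0.5):
    """Call fetch() up to `attempts` times and return its first successful result.

    Sleeps a little longer after each failure; re-raises the last error
    if every attempt fails. `fetch`, `attempts` and `delay` are
    illustrative names, not taken from the original program.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay * attempt)  # back off before trying again
    raise last_error
```

Inside the crawl loop, the call could look like with_retries(lambda: requests.get(page_url, timeout=10)), where page_url is the URL for the current joke; the surrounding try/except still catches the case where every attempt fails.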
However, it can't afford to live too long either. With 100,000 pages to fetch, how long is this going to take? The girlfriend is going to say: you're just not up to it!
This has to change. It must change!
5. Multithreading
That's easy enough to change. Bring in multithreading:
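import requests
import bs4
import threading

url = 'http://xiaohua.zol.com.cn/detail1/'
write_lock = threading.Lock()  # serialize writes to the shared file

def get_joke(joke_id, file):
    # Exceptions raised inside a thread never reach the main thread, so handle them here
    try:
        response = requests.get(f'{url}{joke_id}.html', timeout=10)
        soup = bs4.BeautifulSoup(response.text, 'lxml')
        joke_text = soup.select('div.article-text')[0].getText().strip()
        with write_lock:
            file.write(f'{joke_id}, {joke_text}\n')
    except Exception:
        print("I didn't catch the joke, moving on to the next")

with open('joke.txt', 'w', encoding='utf-8') as f:
    threads = []
    for joke_id in range(1, 100001):
        # One thread per joke: this is exactly what overwhelms the machine
        thread = threading.Thread(target=get_joke, args=(joke_id, f))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()  # keep the file open until every thread has finished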
Code notes:
- Imported the threading module
- Moved the joke-grabbing code into a function
- Created a separate thread for each joke to grab
Run it and there should be no problem. But his computer blew up!!!
Because too many threads were started in a short time.
This needs to be controlled. It must be controlled.
6. Thread pool
Simple: use a thread pool to control the number of threads:
import requests
import bs4
import concurrent.futures
import threading

url = 'http://xiaohua.zol.com.cn/detail1/'
write_lock = threading.Lock()  # serialize writes to the shared file

def get_joke(joke_id, file):
    # An error inside a submitted task is stored on its future, not raised by submit(), so handle it here
    try:
        response = requests.get(f'{url}{joke_id}.html', timeout=10)
        soup = bs4.BeautifulSoup(response.text, 'lxml')
        joke_text = soup.select('div.article-text')[0].getText().strip()
        with write_lock:
            file.write(f'{joke_id}, {joke_text}\n')
    except Exception:
        print("I didn't catch the joke, moving on to the next")

with open('joke.txt', 'w', encoding='utf-8') as f:
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        for joke_id in range(1, 100001):
            executor.submit(get_joke, joke_id, f)
Code notes:
- Imported the concurrent.futures module
- Created a thread pool with at most 100 threads
- Submitted each crawl task to the thread pool; at most 100 threads work at once, and the rest of the tasks have to wait in line
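One thing worth knowing about executor.submit: an exception raised inside a task does not surface at submit time. It is stored on the returned future and re-raised when result() is called. The sketch below collects results and failures that way and leaves all file writing to the main thread. crawl_all and fetch_joke are hypothetical names; fetch_joke stands for any function that takes a joke ID and returns the joke text.

```python
import concurrent.futures


def crawl_all(fetch_joke, joke_ids, max_workers=100):
    """Run fetch_joke(joke_id) in a thread pool.

    Returns (results, failed): a dict mapping joke ID to text, and a
    list of the IDs whose task raised an exception.
    """
    results, failed = {}, []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future belongs to which joke ID.
        future_to_id = {executor.submit(fetch_joke, joke_id): joke_id
                        for joke_id in joke_ids}
        for future in concurrent.futures.as_completed(future_to_id):
            joke_id = future_to_id[future]
            try:
                results[joke_id] = future.result()  # re-raises the task's exception
            except Exception:
                failed.append(joke_id)
    return results, failed
```

Because only the caller touches the results, they can be written to joke.txt in ID order afterwards, and the failed IDs can be retried later.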
... a few hours later ... it's finally done!
I can't help feeling that learning to program really pays off!
7. The result
A few days later he came to me again, wilted like a frostbitten eggplant. It turned out his girlfriend had broken up with him!
He said: "Who would have thought that in the end, I was the joke."
I said: "Who would have thought that I really would get to watch you become the joke. What happened?"
It turned out he hadn't reviewed the jokes by hand, and the 3 jokes the program randomly sent his girlfriend ended their relationship:
2020-11-17:
Q: What if your girlfriend is ugly?
A: Being ugly isn't her fault; her looks came from her mom and dad. If you want her to be beautiful, take her to Korea for cosmetic surgery. If you don't have the money, then still finding her ugly is your fault. Either accept it or change it.
2020-11-18:
After the breakup I was fine during the day, but at night I could no longer hold back my feelings, and alone under the quilt I secretly laughed.
2020-11-19:
I broke up with my girlfriend today, and some things still make my heart ache.
After an inner struggle, I finally worked up the courage to make a call: "Hello, is this China Mobile? Um, here's the thing: I broke up with my girlfriend, and the day before yesterday I topped up her phone bill by 200 yuan. Can I get that refunded?"
8. Postscript
The crawler still has room for improvement:
- All those threads come from the same IP address, so there's a good chance it will get banned. Consider rotating proxies to keep the IP from being blocked.
- Coroutines can be used to further improve efficiency.
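For the coroutine idea, here is a minimal sketch using only the standard library's asyncio. It is an illustration under assumptions, not a drop-in replacement: requests is a blocking library, so a real version would need an asynchronous HTTP client in place of the hypothetical fetch_joke coroutine, and crawl_async is likewise a made-up name.

```python
import asyncio


async def crawl_async(fetch_joke, joke_ids, max_concurrent=100):
    """Fetch many jokes concurrently, at most `max_concurrent` at a time.

    `fetch_joke` is assumed to be a coroutine function that takes a joke ID
    and returns the joke text. Failed jokes are simply left out of the result.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(joke_id):
        async with semaphore:  # plays the same role as the thread pool's size limit
            try:
                return joke_id, await fetch_joke(joke_id)
            except Exception:
                return joke_id, None

    pairs = await asyncio.gather(*(fetch_one(joke_id) for joke_id in joke_ids))
    return {joke_id: text for joke_id, text in pairs if text is not None}
```

It would be driven with asyncio.run(crawl_async(fetch_joke, range(1, 100001))). The semaphore caps the number of requests in flight, much like max_workers does for the thread pool.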
If you were the programmer who built the site, how would you defend it?
- Don't use easily guessed numbers as the joke IDs on your site; use randomly generated long strings instead. Look at Taobao's URLs and you'll see.
- Prevent a single IP from hitting your site too many times in a short period.
- After 10 consecutive requests, show a CAPTCHA, and allow access to continue only once it has been passed.
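The defenses above can be sketched in a few lines. This is a simplified, in-memory illustration, not how any real site does it: new_joke_slug and RateLimiter are hypothetical names, and the limit of 10 requests per 60 seconds is an assumed threshold.

```python
import secrets
import time
from collections import defaultdict, deque


def new_joke_slug(num_bytes=16):
    """Return a random, URL-safe ID that cannot be guessed by counting."""
    return secrets.token_urlsafe(num_bytes)


class RateLimiter:
    """Allow at most `max_requests` per IP within a sliding time window."""

    def __init__(self, max_requests=10, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # IP address -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if the request may proceed, False if the IP is over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this is where the site would show a CAPTCHA
        hits.append(now)
        return True
```

A request handler would call allow(client_ip) before serving a page and switch to the CAPTCHA flow whenever it returns False.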
Finally, my advice to young people is to conduct yourselves well. Whether you're building a website or building a crawler, we should all observe martial virtue and not get careless!