1. What is a crawler?
A crawler (also called a spider) is a program that automatically fetches data from the web. Anything that appears on a web page, such as images, text, comments, or product details, can in principle be crawled.
2. Basic workflow of a crawler
- Find the target URL, send a request, and wait for the server to respond
- Get the server's response content
- Parse the content (regular expressions, XPath, bs4, etc.)
- Save the data (local file, database, etc.)
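A minimal sketch of these four steps using requests and a regular expression (the URL and pattern below are placeholders):
import re
import requests

url = "https://example.com"                            # 1. target URL (placeholder)
response = requests.get(url, timeout=10)               #    send the request, wait for the response
html = response.text                                   # 2. get the response content
titles = re.findall(r"<title>(.*?)</title>", html)     # 3. parse the content
with open("result.txt", "w", encoding="utf-8") as f:   # 4. save the data
    f.write("\n".join(titles))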
3. The difference between greedy and non-greedy matching in regular expressions
1) What greedy and non-greedy matching are
- Greedy matching always tries to match as many characters as possible
- Non-greedy matching is the opposite: it always tries to match as few characters as possible
2) Differences
- In form, a non-greedy quantifier has a "?" appended to it
- In function, a greedy quantifier consumes as much of the input as it can while the overall pattern still matches, so one match may swallow several candidate substrings; a non-greedy quantifier consumes as little as possible while the overall pattern still matches (see the example after this list)
3) The non-greedy quantifiers
- *? repeats any number of times, but as few as possible
- +? repeats 1 or more times, but as few as possible
- ?? repeats 0 or 1 time, but as few as possible
- {n,m}? repeats n to m times, but as few as possible
- {n,}? repeats n or more times, but as few as possible
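A quick illustration of the difference (a minimal sketch):
import re

text = "<b>bold</b><i>italic</i>"
print(re.findall(r"<.+>", text))    # greedy: ['<b>bold</b><i>italic</i>']
print(re.findall(r"<.+?>", text))   # non-greedy: ['<b>', '</b>', '<i>', '</i>']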
4. The difference between the re module's match() and search()
1) Similarity: both look for a substring in a string; if one is found they return a Match object, otherwise they return None.
2) Difference: match() only matches from the beginning of the string, while search() scans the whole string and can find a match at any position.
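For example (a minimal sketch):
import re

print(re.match(r"\d+", "abc123"))    # None: the string does not start with digits
print(re.search(r"\d+", "abc123"))   # <re.Match object; span=(3, 6), match='123'>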
5. How to find and replace strings with regular expressions
1) Find: the findall() function
re.findall(r"target pattern", "original string"), e.g. re.findall(r"Zhang San", "I love Zhang San")[0]
2) Replace: the sub() function
re.sub(r"old substring pattern", "new substring", "original string"), e.g. re.sub(r"Li Si", "python", "I love Li Si")
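A runnable version of the two calls above (a minimal sketch):
import re

text = "I love Li Si"
print(re.findall(r"Li Si", text))        # ['Li Si']
print(re.sub(r"Li Si", "python", text))  # 'I love python'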
6. Write a regular expression that matches an IP address
1) IP address format: (1-255).(0-255).(0-255).(0-255)
2) The corresponding regular expression
- "^(1\d{2}|2[0-4]\d|25[0-5]|[1-9]\d|[1-9])\."
- +"(1\d{2}|2[0-4]\d|25[0-5]|[1-9]\d|\d)\."
- +"(1\d{2}|2[0-4]\d|25[0-5]|[1-9]\d|\d)\."
- +"(1\d{2}|2[0-4]\d|25[0-5]|[1-9]\d|\d)$"
3) Explanation
- \d matches any digit from 0 to 9
- {2} means the preceding element occurs exactly twice
- [0-4] matches any digit from 0 to 4
- | means "or"
- The parentheses ( ) are required: they group the alternatives and capture the matched text; each ( ) in the expression corresponds to one captured group
- 1\d{2}: any number from 100 to 199
- 2[0-4]\d: any number from 200 to 249
- 25[0-5]: any number from 250 to 255
- [1-9]\d: any number from 10 to 99
- [1-9]: any number from 1 to 9
- \.: a literal dot (the . is escaped)
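A minimal check of the assembled pattern:
import re

ip_pattern = (r"^(1\d{2}|2[0-4]\d|25[0-5]|[1-9]\d|[1-9])\."
              r"(1\d{2}|2[0-4]\d|25[0-5]|[1-9]\d|\d)\."
              r"(1\d{2}|2[0-4]\d|25[0-5]|[1-9]\d|\d)\."
              r"(1\d{2}|2[0-4]\d|25[0-5]|[1-9]\d|\d)$")
print(bool(re.match(ip_pattern, "192.168.1.1")))   # True
print(bool(re.match(ip_pattern, "256.1.1.1")))     # False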
7. Write a regular expression for an email address
^[A-Za-z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$
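A quick check of the pattern (a minimal sketch; the addresses are made up):
import re

email_pattern = r"^[A-Za-z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$"
print(bool(re.match(email_pattern, "user_01@example.com")))   # True
print(bool(re.match(email_pattern, "not-an-email")))          # False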
8. The difference between group() and groups()
- 1) m.group(N) returns the text matched by the N-th group
- 2) m.group() == m.group(0) == the entire matched text
- 3) m.groups() returns the text of all captured groups as a tuple
- m.groups() == (m.group(1), m.group(2), ...)
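For example (a minimal sketch):
import re

m = re.match(r"(\d{4})-(\d{2})", "2023-07")
print(m.group())     # '2023-07'
print(m.group(1))    # '2023'
print(m.groups())    # ('2023', '07')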
9. The difference between response.content and response.text in requests
1) Different return types
response.text returns Unicode (str) data, while response.content returns bytes, i.e. raw binary data
2) Different use cases
Use response.text to get text; use response.content to get images and other files
3) Different ways to handle the encoding
- response.text has type str; requests guesses the encoding from the HTTP headers
- Change the encoding with response.encoding = "gbk"
- response.content has type bytes; no decoding is applied
- Decode it yourself with response.content.decode("utf8")
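A minimal sketch of the two attributes (the URL is a placeholder):
import requests

response = requests.get("https://example.com")
response.encoding = "utf-8"                  # override the guessed encoding if needed
html = response.text                         # decoded str, good for text
raw = response.content                       # raw bytes, good for images/files
html2 = response.content.decode("utf-8")     # decode the bytes manually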
10. Differences between the urllib and requests modules
1) urllib is part of Python's standard library and needs no separate installation
2) requests is a third-party library and must be installed separately
3) requests is built on top of urllib and is much easier to use
4) requests can send GET and POST requests directly, while urllib requires building a request object first and then sending it
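A side-by-side sketch (the URL is a placeholder):
import urllib.request
import requests

# urllib: build a Request object, then open it
req = urllib.request.Request("https://example.com", headers={"User-Agent": "Mozilla/5.0"})
html1 = urllib.request.urlopen(req).read().decode("utf-8")

# requests: a single call
html2 = requests.get("https://example.com", headers={"User-Agent": "Mozilla/5.0"}).text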
11. Why does a requests request need to carry headers?
1) Reason: to imitate a browser and "fool" the server, so that it returns the same content a browser would get
2) Header format: a dictionary
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
3) Usage: requests.get(url, headers=headers)
12. What are some useful tricks when using the requests module?
1) requests.utils.dict_from_cookiejar converts a cookie jar object into a dictionary
2) Skip SSL certificate verification for a request
response = requests.get("https://www.12306.cn/mormhweb/", verify=False)
3) Set a timeout
response = requests.get(url, timeout=10)
4) Use the status code to check whether the request succeeded
response.status_code == 200
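Putting these tips together (a sketch, reusing the URL from the example above):
import requests

response = requests.get("https://www.12306.cn/mormhweb/", verify=False, timeout=10)
if response.status_code == 200:
    cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)  # cookie jar -> dict
    print(cookies_dict)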
13. Differences between the json module's dumps()/loads() and dump()/load()
- json.dumps() encodes Python data (e.g. a dict) into a JSON string
- json.loads() decodes a JSON string into the corresponding Python data type (e.g. a dict)
- json.dump(x, y): x is the JSON-serializable object, y is a file object; it writes the object to the file as JSON
- json.load(y): reads a JSON object from the file object y
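For example (a minimal sketch; the file name is a placeholder):
import json

data = {"name": "spider", "pages": 10}
s = json.dumps(data)              # dict -> JSON string
print(json.loads(s))              # JSON string -> dict

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f)            # write JSON to a file
with open("data.json", encoding="utf-8") as f:
    print(json.load(f))           # read JSON back from the file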
14. What are the common HTTP request methods?
- GET: requests the specified page and returns the entity body
- HEAD: similar to GET, but the response has no body; used to fetch only the headers
- POST: submits data to the specified resource for processing (e.g. submitting a form or uploading a file); the data is carried in the request body
- PUT: sends data from the client to the server to replace the content of the specified document
- DELETE: requests deletion of the specified page
- CONNECT: reserved in HTTP/1.1 for proxies that can switch the connection into tunnel mode
- OPTIONS: lets the client query the capabilities of the server
- TRACE: echoes back the request the server received; mainly used for testing or diagnosis
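With requests each of these methods maps to one call (a sketch; httpbin.org is a public test service):
import requests

requests.get("https://httpbin.org/get")
requests.head("https://httpbin.org/get")
requests.post("https://httpbin.org/post", data={"k": "v"})
requests.put("https://httpbin.org/put", data={"k": "v"})
requests.delete("https://httpbin.org/delete")
requests.options("https://httpbin.org/get")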
15. What are the advantages and disadvantages of the HTTPS protocol?
1) Advantages
- HTTPS authenticates both the user and the server, making sure data is sent to the correct client and server
- HTTPS is HTTP built on top of SSL, providing encrypted transmission and identity authentication; it is more secure than HTTP, prevents data from being stolen or tampered with in transit, and protects data integrity
- HTTPS is the most secure option under the current architecture; it is not absolutely safe, but it significantly raises the cost of man-in-the-middle attacks
2) Disadvantages
- The protection HTTPS offers is limited; it helps little against hacking, denial-of-service attacks, or server hijacking
- HTTPS also interferes with caching and increases data overhead and power consumption; it can even affect existing security measures
- SSL certificates cost money, and the stronger the certificate, the higher the cost, so personal and small websites often do not use them
- An HTTPS connection consumes more server resources; the handshake phase takes extra time and hurts the site's response speed
- HTTPS connection caching is not as efficient as HTTP's
16. What does HTTP communication consist of?
HTTP communication consists of two parts: the client's request message and the server's response message.
17. What is the role of the robots.txt file?
When a search engine visits a website, the first file it accesses is robots.txt. Through the Robots protocol the website tells search engines which pages may be crawled and which may not.
18. Where is the robots.txt file placed?
The file must be placed in the root directory of the website, and the file name is case-sensitive: it must be all lowercase. Inside the file, each directive starts with a capital letter and continues in lowercase, and there should be a space after the directive's colon.
19. How is the Robots protocol written?
- User-agent: specifies which search engine the rules apply to, e.g. User-agent: Baiduspider targets Baidu
- Disallow: paths that must not be accessed
- Allow: paths that are allowed to be accessed
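A minimal robots.txt example (the paths and rules are hypothetical):
User-agent: Baiduspider
Disallow: /admin/
Allow: /public/

User-agent: *
Disallow: /private/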
20. Talk about your understanding of processes, threads, and coroutines. Did you use them in your project?
- Process: a running program (code) is a process; code that is not running is just a program. A process is the smallest unit of system resource allocation. Each process has its own independent memory space, so processes do not share data, and the overhead is large.
- Thread: the smallest unit of scheduling and execution, also called an execution path. A thread cannot exist on its own; it belongs to a process, and every process has at least one thread, the main thread. Threads in a process share memory (shared data, shared global variables), which can greatly improve program efficiency.
- Coroutine: a lightweight user-space thread whose scheduling is entirely controlled by the user. It has its own register context and stack. When a coroutine is switched out, its register context and stack are saved elsewhere; when it is switched back, they are restored. Switching operates directly on the stack with essentially no kernel overhead, and global variables can be accessed without locks, so context switching is very fast.
- The project used multithreading to crawl data, which improved efficiency.
21. The life cycle of a thread
- New: the thread is created (t = threading.Thread(target=some_function), or via a Thread subclass)
- Ready: after the thread is started it enters the ready state; ready threads are placed in a CPU scheduling queue, and the CPU is responsible for running them and moving them to the running state
- Running: the CPU schedules a ready thread and it becomes running
- Blocked: a running thread becomes blocked while it waits; a blocked thread must return to the ready state before it can run again
- Dead: the thread has finished executing
22. Advantages and disadvantages of multithreading and multiprocessing
1) Advantages of multithreading
- Program logic and control flow are relatively simple
- All threads can share memory and variables directly
- Threads consume fewer total resources than processes
2) Disadvantages of multithreading
- Every thread shares the address space of the main program and is limited by it (e.g. a 2 GB address space on 32-bit systems)
- Synchronization and locking between threads are troublesome
- A crash in one thread can affect the stability of the whole program
3) Advantages of multiprocessing
- Processes are independent of each other and do not affect the stability of the main program; a crashed child process does little harm
- Performance is easy to scale by adding CPUs
- Each child process has its own address space (e.g. 2 GB on 32-bit systems) and related resources, so the achievable overall performance ceiling is high
4) Disadvantages of multiprocessing
- Logic control is more complex and requires interaction with the main process
- Data must cross process boundaries, which is a poor fit when large amounts of data have to be transferred; it suits small data transfers and compute-intensive work
- The scheduling overhead of multiple processes is relatively large
In real development, the choice between multithreading and multiprocessing depends on the actual situation; combining the two is often best.
23. Is it better to write a crawler with multiple processes or multiple threads? Why?
For IO-bound code (file read/write, web crawlers, etc.), multithreading effectively improves efficiency. In a single thread, an IO operation blocks and wastes time waiting; with multiple threads, while thread A is waiting the program automatically switches to thread B, so no CPU resources are wasted and overall execution efficiency improves.
In real data collection you also have to consider network speed, server responses, and your own machine's hardware when deciding between multiprocessing and multithreading.
Multiprocessing suits CPU-bound work (many CPU instructions, e.g. long floating-point computations). Multithreading suits IO-bound work (many read/write operations, e.g. crawlers), as in the sketch below.
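Since crawling is IO-bound, a thread pool is usually the simplest way to parallelize it (a sketch; the URLs are placeholders):
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

def fetch(url):
    return url, requests.get(url, timeout=10).status_code

with ThreadPoolExecutor(max_workers=5) as pool:        # threads overlap their IO waits
    for url, status in pool.map(fetch, urls):
        print(url, status)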
24. What is a race condition between threads?
Threads are not independent: threads in the same process share data, so when several threads access the same data resource they compete for it. The data may be touched by multiple threads almost simultaneously and become inconsistent, which is what "thread-unsafe" means.
So how do we solve the race problem? With locks.
25. What is a lock, and what are its pros and cons?
A lock (Lock) is an object Python provides for thread synchronization.
- Benefit of locks: they guarantee that a critical section of code (one that touches shared data) is executed by only one thread at a time from start to finish, which solves the resource-competition problem of multithreading.
- Drawback of locks: concurrent execution is blocked; code that holds a lock effectively runs single-threaded, which greatly reduces efficiency.
- The fatal problem of locks: deadlock.
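A minimal sketch of protecting a shared counter with threading.Lock:
import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    for _ in range(100000):
        with lock:                 # only one thread at a time updates the counter
            counter += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                     # 400000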
26. What is a deadlock?
A deadlock occurs when several threads compete for system resources and each waits for the others to release the resources they hold; nobody is willing to release first, they all wait for each other, and the program cannot make progress.
27. Which library does XPath use to parse data?
XPath parsing relies on the lxml library.
28. The main XPath syntax used in web crawlers
- . selects the current node
- .. selects the parent of the current node
- @ selects an attribute
- * matches anything; //div/* selects any element under a div
- // selects matching nodes anywhere in the document, regardless of their position
- //div selects all div elements
- //div[@class="demo"] selects the div nodes whose class is demo
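A minimal lxml/XPath sketch (the HTML snippet is made up):
from lxml import etree

html = etree.HTML('<div class="demo"><a href="/item/1">first</a></div>')
print(html.xpath('//div[@class="demo"]/a/@href'))    # ['/item/1']
print(html.xpath('//div[@class="demo"]/a/text()'))   # ['first']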
29. Your understanding of the MySQL, MongoDB, and Redis databases
1) MySQL: an open-source relational database. You have to create databases, tables, and table fields explicitly; tables can be related to each other (one-to-many, many-to-many); storage is persistent.
2) MongoDB: a non-relational database whose three core concepts are database, collection, and document. It can do persistent storage and can also be used as an in-memory database. Data needs no predefined schema and is stored as key-value documents.
3) Redis: a non-relational database. No schema is needed before use; data is saved as key-value pairs and the format is quite free. It is mainly used as a cache and a database, and it can also persist data.
30. Common MongoDB commands
- use yourDB; switch to / create a database
- show dbs; list all databases
- db.dropDatabase(); delete the currently used database
- db.getName(); show the currently used database
- db.version(); show the current db version
- db.addUser("name"); add a user, e.g. db.addUser("userName", "pwd123", true);
- show users; list all current users
- db.removeUser("userName"); delete a user
- db.collectionName.count(); count the documents in a collection
- db.collectionName.find({key:value}); query documents
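The same kind of operations from Python, via pymongo (a sketch; the connection string, database, and collection names are placeholders):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["yourDB"]
db["items"].insert_one({"title": "demo", "price": 9.9})          # insert a document
print(db["items"].count_documents({"title": "demo"}))            # count matching documents
for doc in db["items"].find({"title": "demo"}):                  # query documents
    print(doc)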
31. Why do crawler projects use MongoDB instead of MySQL?
1) MySQL is a relational database with the following characteristics:
- Storage differs across engines
- Queries use traditional SQL statements; the ecosystem is mature
- Its share among open-source databases keeps growing
- Processing massive amounts of data becomes noticeably slower
2) MongoDB is a non-relational database with the following characteristics:
- Data is structured as key-value pairs
- Storage: virtual memory plus persistence
- Queries use MongoDB's own query language
- High availability
- Data is stored on disk, but frequently read data is loaded into physical memory, which gives fast reads and writes
32. Advantages of MongoDB
Document-oriented, high performance, high availability, easy to scale, supports sharding, and friendly for storing crawled data.
33. What data types does MongoDB support?
- String, Integer, Double, Boolean
- Object, Object ID
- Arrays
- Min/Max Keys
- Code, Regular Expression, etc.
34. What does a MongoDB "Object ID" consist of?
The "Object ID" data type is used to store a document's id.
It consists of four parts: a timestamp, a machine identifier, a process id, and a three-byte incrementing counter.
35. What data types does Redis support?
1) String: the most commonly used Redis type. The structure is key/value, and a String value can hold any data. Common commands: set, get, decr, incr, mget, etc.
2) Hash: can be seen as a map container whose keys and values are both Strings. Common commands: hget, hset, hgetall, etc.
3) List: stores an ordered list of strings; typical operations add elements at either end of the list or fetch a slice of it. Common commands: lpush, rpush, lpop, rpop, lrange, etc.
4) Set: an unordered collection of strings in which elements do not repeat; only one copy of identical elements is kept. Common commands: sadd, spop, smembers, sunion, etc.
5) Sorted Set: a set in which every element is associated with a score; Redis keeps the members ordered by score. Common commands: zadd, zrange, zrem, zcard, etc.
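The same commands from Python, via the redis-py client (a sketch; the connection details and keys are placeholders):
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
r.set("page:count", 1)                                                   # String
r.incr("page:count")
r.lpush("url_queue", "https://example.com/1", "https://example.com/2")   # List
r.sadd("seen_urls", "https://example.com/1")                             # Set
print(r.get("page:count"), r.lrange("url_queue", 0, -1), r.smembers("seen_urls"))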
36. How many databases does Redis have?
Redis has 16 databases; database 0 is used by default, and select 1 switches to database 1.
37. Talk about the Selenium framework
Selenium is a Web automation testing tool. Following our instructions, it can make a browser load pages automatically, fetch the data we need, take screenshots of pages, and check whether certain actions have happened on a site. Selenium does not ship with a browser and has no browser functionality of its own; it must be used together with a third-party browser such as Firefox or Chrome, and you need to download the corresponding browser Driver first.
38. How many ways does Selenium have to locate elements, and what are they?
1) Selenium provides eight locator strategies
- name-related: ByName, ByClassName, ByTagName
- link-related: ByLinkText, ByPartialLinkText
- id-related: ById
- all-purpose: ByXpath and ByCssSelector
2) The most commonly used is ByXpath, because HTML attributes are often not standardized and a single attribute cannot locate an element uniquely, while XPath can still pin down a unique element. Strictly speaking the fastest is ById, because an id is unique, but most page elements have no id set.
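A minimal locating sketch (assumes Selenium 4 with Chrome and its driver installed; the URL and selectors are placeholders):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
title = driver.find_element(By.XPATH, "//h1").text        # locate by XPath
link = driver.find_element(By.CSS_SELECTOR, "a")          # locate by CSS selector
print(title, link.get_attribute("href"))
driver.quit()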
39. What pitfalls have you hit while using Selenium, and how did you solve them?
The program can be unstable: sometimes an operation fails and the data cannot be retrieved.
Solutions:
- Add an implicit wait for elements: driver.implicitly_wait(30)
- Add a forced wait (e.g. time.sleep() in Python)
- Use try/except to catch and handle exceptions, and prepare several locator strategies so that if the first fails the next one is tried automatically
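A sketch combining an explicit wait with a fallback locator (continuing from the driver in the previous example; the element names are placeholders):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "result")))              # wait up to 10 s
except Exception:
    element = driver.find_element(By.XPATH, "//div[@class='result']")   # fallback locator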
40. What are the common attributes and methods of the driver object?
- driver.page_source: the page source rendered in the current tab
- driver.current_url: the URL of the current tab
- driver.close(): close the current tab; if it is the only tab, the whole browser is closed
- driver.quit(): close the browser
- driver.forward(): go forward one page
- driver.back(): go back one page
- driver.save_screenshot(img_name): take a screenshot of the page
41. Talk about your understanding of Scrapy
With the Scrapy framework you only need to write a small amount of code to crawl data quickly. Scrapy uses the Twisted asynchronous networking framework to handle network communication, which speeds up downloads without requiring you to implement an asynchronous framework yourself. It also provides various middleware interfaces and is flexible enough to meet all kinds of needs.
The workflow of the Scrapy framework:
- 1) The Spiders hand the URLs to be requested (requests) to the ScrapyEngine (engine), which passes them to the Scheduler.
- 2) The Scheduler sorts and enqueues the requests, then hands them back to the ScrapyEngine, which passes them to the Downloader.
- 3) The Downloader sends the requests to the Internet, receives the download responses, and hands the responses via the ScrapyEngine back to the Spiders.
- 4) The Spiders process the responses and extract data, which the ScrapyEngine hands to the ItemPipeline for saving (to a local file or a database); extracted URLs go back through the ScrapyEngine to the Scheduler for the next cycle, until there are no more URL requests and the program ends (a minimal spider is sketched below).
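A minimal spider sketch (the site, selectors, and item fields are placeholders):
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        for item in response.xpath("//div[@class='item']"):
            yield {"title": item.xpath("./a/text()").get()}          # sent to the item pipeline
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)    # back to the scheduler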
42. The basic structure of the Scrapy framework
1) Engine (Scrapy Engine): handles the data flow of the whole system and triggers events (the core of the framework).
2) Scheduler: accepts requests from the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (pages or links to crawl); it decides what to crawl next and removes duplicate URLs.
3) Downloader: downloads web content and returns it to the spiders (Scrapy's downloader is built on Twisted, an efficient asynchronous model).
4) Spiders: the main workers; they extract the required information, the so-called items, from specific web pages. Users can also extract links from them to let Scrapy continue crawling the next pages.
5) Item Pipeline: processes the items extracted from web pages by the spiders; its main jobs are persisting items, validating them, and discarding unwanted information. After a page has been parsed by a spider, the items are sent to the pipeline and processed in a specific order.
6) Downloader middlewares: sit between the Scrapy engine and the downloader and mainly process the requests and responses passing between them.
7) Spider middlewares: sit between the Scrapy engine and the spiders and mainly process the spiders' response input and request output.
8) Scheduler middlewares: sit between the Scrapy engine and the scheduler and handle the requests and responses sent from the engine to the scheduler.
43. How the Scrapy framework deduplicates requests
Keep the dont_filter parameter set to False (the default) so that duplicate filtering is enabled.
For each URL request, the scheduler computes a fingerprint from the request's information and compares it with the fingerprints already stored in a set(). If the fingerprint is already in the set, the Request is not put into the queue; if it is not, the Request object is put into the queue and waits to be scheduled.
44. Advantages and disadvantages of the Scrapy framework
1) Advantages:
- Scrapy is asynchronous; its asynchrony is handled by the Twisted networking framework, and the concurrency level can be configured in settings.py (the default is 16)
- It uses the more readable XPath instead of regular expressions
- Powerful statistics and logging systems
- It can crawl different URLs at the same time
- It supports a shell mode, which makes independent debugging easy
- Middlewares can be written to implement unified filters
- Data is written to the database through pipelines
2) Disadvantages:
- Distributed crawling is not supported out of the box
- Being an asynchronous framework, it does not stop the other tasks after an error, so it is hard to notice when the data goes wrong
45. The difference between Scrapy and requests
1) Scrapy is a packaged framework that includes a downloader, a parser, logging, and exception handling. It processes network IO asynchronously through Twisted, which speeds up downloads without requiring you to implement an asynchronous framework yourself; on the other hand it is less extensible and less flexible.
2) requests is an HTTP library used only to send requests. For HTTP requests it is a powerful library, but downloading and parsing are entirely up to you. It is more flexible, and high concurrency and distributed deployment are also very flexible to implement.
46. The difference between Scrapy and Scrapy-Redis
Scrapy is a Python crawler framework; crawling with it is very efficient and highly customizable, but it does not support distributed crawling.
Scrapy-Redis is a set of components, based on the Redis database, that runs on top of the Scrapy framework and lets Scrapy support a distributed strategy: the slaver nodes share the item queue, request queue, and request fingerprint set stored in the master node's Redis database.
47. Why do distributed crawlers choose the Redis database?
Because Redis supports master-slave replication and keeps data cached in memory, a Redis-based distributed crawler can read requests and data at high frequency with very high efficiency.
Advantages of Redis:
- Fast reads, because data is stored in memory
- Supports transactions
- Data persistence with snapshots and logs, which makes recovery easy
- Rich data types: list, string, set, sorted set (zset), hash
- Supports master-slave replication, so data can be backed up
- Rich features: it can be used as a cache or a message queue, and keys can be given an expiration time after which they are deleted automatically
48. What problems does a distributed crawler mainly solve?
It works around the limits of IP addresses, bandwidth, CPU, and IO on a single machine.
49. How does the Scrapy framework implement distributed crawling?
It can be done with the scrapy_redis library.
In distributed crawling there is a master machine and there are slave machines; the master acts as the core server and the slaves are the actual crawler servers.
The Redis database is installed on the master server and the URLs to crawl are stored in it. All slave crawler servers fetch links from that Redis database while crawling; because scrapy-redis has its own queue mechanism, the URLs obtained by different slaves do not conflict. The crawled results are then stored in the database, and the master's Redis database also stores all fetched URLs for deduplication.
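A sketch of the settings.py entries that enable scrapy-redis (assumes the scrapy_redis package is installed; the Redis URL is a placeholder):
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # schedule requests through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # keep request fingerprints in Redis
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = "redis://master-host:6379"                      # the master's Redis server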