Large-Scale Web Scraping of E-Commerce Websites (Ultimate Guide)
2022-07-29 01:57:00 [Oxylabs China]

Compared with small projects, large-scale web scraping brings a different set of challenges, such as building the infrastructure, managing resource costs, and bypassing bot-detection measures.
This article walks you through data collection at scale, with a focus on the e-commerce field.
Web scraping infrastructure
Building and managing the web scraping infrastructure is one of the first tasks. Of course, we will assume that you have already built a data collection tool (also known as a web scraper or crawler).
The general web scraping workflow looks like this:

In short, you start by scraping some targets. For large-scale operations, scraping without proxies will not last long, because the website will block you quickly. Proxies are an essential element of large-scale data collection.
The best practice for collecting data at scale is to use several proxy solutions, and even several providers. Let's start with the proxy provider.
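To make provider diversification concrete, here is a minimal sketch in Python using the `requests` library, rotating requests round-robin across two proxy gateways. The gateway URLs, credentials, and target page are hypothetical placeholders, not the settings of any real provider.

```python
import itertools
import requests

# Hypothetical gateways from two different proxy providers (placeholders).
PROXY_POOLS = [
    "http://user1:pass1@gateway.provider-a.example:7777",
    "http://user2:pass2@gateway.provider-b.example:8080",
]

# Round-robin so no single provider carries all of the traffic.
proxy_cycle = itertools.cycle(PROXY_POOLS)

def fetch(url: str) -> requests.Response:
    """Fetch a page through the next proxy in the rotation."""
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

# Example usage with a placeholder product URL:
# response = fetch("https://www.example-shop.com/product/123")
```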
Part 1: Choosing a proxy provider
Choosing the right proxy provider is crucial, because it directly affects your scraping program.
If the provider you work with is unreliable, your in-house data retrieval tools will not be effective.
Part 2: Choosing a proxy type
If you are looking for a single proxy type suited to e-commerce data collection, consider residential proxies. By their nature they are unlikely to be blocked, and they provide a large proxy pool with broad geographic coverage.
Part 3: Bypassing security measures
E-commerce websites deploy security measures to stop unwanted bots. Some common methods are listed below (a sketch of handling the header- and cookie-related checks follows the list):
● IP detection. The server can tell whether an IP address belongs to a data center or a residential connection.
● CAPTCHA. A challenge-response test that usually asks the user to type the correct characters or identify objects in an image.
● Cookies. Ordinary users rarely land directly on a specific product page.
● Browser fingerprinting. Information about the device, collected for identification purposes.
● Headers. They tell a website the user's geographic location, time zone, language, and so on.
● Behavior that is inconsistent with that of a natural user.
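As a rough illustration of the header and cookie points above, the sketch below sends browser-like headers and lets a `requests.Session` carry cookies across requests, visiting a category page before the product page so the navigation looks less robotic. The URLs and header values are illustrative assumptions, not a guaranteed bypass.

```python
import requests

session = requests.Session()

# Browser-like headers; values are illustrative examples only.
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
})

# Visit a listing page first; the session keeps any cookies it sets,
# so the later product request does not arrive "out of nowhere".
session.get("https://www.example-shop.com/category/shoes", timeout=30)
response = session.get("https://www.example-shop.com/product/123", timeout=30)
print(response.status_code)
```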
Part 4: The subtle art of storage
All the data you collect needs to be stored somewhere, so large-scale scraping naturally demands a lot of storage resources.
When the speed at which data arrives differs from the speed at which it is processed, you usually use a buffer.
# Create a buffer for data transmission
To explain buffering in plain language, let's use an office as an example. Suppose you work in an office and, from time to time, someone adds new tasks to the pile of documents on your desk. Once you finish the task at hand, you move on to the next one in the pile. That pile of documents is a buffer. If the pile grows too tall it will topple over, so you have to limit how many pages it can hold. That limit is the buffer's capacity; exceed it and the buffer overflows.
While you wait for another service to receive the information, you need a buffer and a view of how much data is in transit. That way you can avoid overflow, just as you avoid the pile of documents toppling over. If the buffer does overflow, you have to give up some work. In that case, you have three options:
1. Discard the oldest data in the buffer
2. Discard the newly arriving data
3. Pause the data collection process to prevent overflow
However, if you choose to pause the scraping process, some work is postponed; once things return to normal, you will have more scraping to catch up on.
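The three options above can be expressed as policies on a simple bounded buffer. The sketch below is a minimal, single-process illustration built on Python's standard library; a production pipeline would more likely rely on Kafka, RabbitMQ, or Redis, as discussed in the next section.

```python
from collections import deque

class ScrapeBuffer:
    """A bounded buffer that drops the oldest item, drops the new item,
    or tells the caller to pause scraping when it is full."""

    def __init__(self, capacity: int, policy: str = "drop_oldest"):
        self.items = deque()
        self.capacity = capacity
        self.policy = policy  # "drop_oldest", "drop_newest", or "pause"

    def put(self, item) -> bool:
        """Return False when the caller should pause data collection."""
        if len(self.items) < self.capacity:
            self.items.append(item)
            return True
        if self.policy == "drop_oldest":   # option 1: discard the oldest data
            self.items.popleft()
            self.items.append(item)
            return True
        if self.policy == "drop_newest":   # option 2: discard the incoming data
            return True
        return False                       # option 3: pause the scraper

    def get(self):
        """Consumer side: take the next item, or None if the buffer is empty."""
        return self.items.popleft() if self.items else None

# buffer = ScrapeBuffer(capacity=10_000, policy="pause")
# if not buffer.put(html_page):
#     ...  # slow down or stop the crawler until get() drains the buffer
```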
# Database storage services
What if you want to process the incoming data and convert it into a readable format (such as JSON)? At that point you no longer really need the raw data, which means you can keep the information in short-term storage. But what if you need both the HTML documents and the processed data? Then long-term storage is the better choice.
However, since we are talking about large-scale data collection, we recommend combining both approaches. Our recommended practice is as follows:
In this setup, the short-term storage is very fast and can handle a large number of requests, so it is used to absorb the bulk of the scraped data. With this solution you can feed data into the parser while also moving unprocessed HTML files into long-term storage.
You could also use long-term storage alone as the buffer, but then you would need to invest more resources to make sure every process finishes on time.
Here are some services for short-term and long-term storage :
1. Long-term: MySQL, BigQuery, Cassandra, Hadoop, etc.
These solutions usually persist data to permanent storage (disk rather than memory/RAM). Because the information is expected to stay around for a long time, they come with tools for filtering the data you need out of the entire data set.
2. Short-term: Apache Kafka, RabbitMQ, Redis, etc.
These stores offer limited data-filtering capabilities, so they are generally not suited to long-term data storage. On the other hand, they are extremely fast; although a lot of functionality is sacrificed, they remain highly available and deliver the throughput that large-scale operations require.
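To show how the two layers can work together, here is a rough sketch in which Redis serves as the fast short-term buffer and SQLite stands in for a long-term store such as MySQL. The Redis key name, table schema, and JSON envelope are assumptions made for illustration only.

```python
import json
import sqlite3

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)
db = sqlite3.connect("pages.db")  # stand-in for MySQL, BigQuery, etc.
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, html TEXT)")

def buffer_page(url: str, html: str) -> None:
    """Scraper side: push a raw page into the fast short-term buffer."""
    r.lpush("scraped_pages", json.dumps({"url": url, "html": html}))

def drain_once() -> None:
    """Worker side: pop one page and persist the raw HTML long-term."""
    item = r.brpop("scraped_pages", timeout=5)
    if item is None:
        return  # buffer is currently empty
    page = json.loads(item[1])
    db.execute("INSERT INTO pages VALUES (?, ?)", (page["url"], page["html"]))
    db.commit()
```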
Of course, you can also avoid the storage process entirely. Our Real-Time Crawler is an advanced scraper tailored to high-load data retrieval operations, and it is particularly well suited to scraping e-commerce product pages. One of its advantages is that it spares you the trouble of data storage, because all you need to give it is a URL. Real-Time Crawler handles the scraping, storage, and processing on its own, and everything it returns to you is usable data (HTML or JSON).
Part 5: Processing the scraped data
Once the storage requirements are settled, we have to think about processing, that is, parsing. Data parsing means analyzing the incoming information and extracting the relevant pieces into a format suitable for further work. It is a key step in web scraping.
However, like everything else discussed in this post so far, parsing is not that simple. At a small scale, building and maintaining parsers is easy. For large-scale web scraping, the situation is far more complicated.
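To show what a small-scale parser looks like, here is a sketch that pulls a product title and price out of an HTML document with BeautifulSoup. The CSS selectors are hypothetical; a real e-commerce page uses its own markup, which is exactly why parsers need constant maintenance.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def parse_product(html: str) -> dict:
    """Extract a few product fields from raw HTML (selectors are assumed)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.product-title")
    price = soup.select_one("span.price")
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }

# sample = '<h1 class="product-title">Running Shoes</h1><span class="price">$59.99</span>'
# parse_product(sample)  # -> {'title': 'Running Shoes', 'price': '$59.99'}
```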
# The problems of large-scale data parsing
● The target website may change its page layout
● When you use a third-party parser, the process may be forced to stop
● If you use third-party services, you need several of them
● The data sets returned by different services differ in structure
● If you build your own parsers, you need a lot of them
● When a parser process terminates, your buffer may overflow
To make a long story short, you either build and maintain your own parsers or obtain parsing through a third-party solution. For large-scale operations, we recommend trying either of these approaches. Spreading the work across several good third-party solutions (diversifying services) helps keep the web scraping running smoothly.
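One way to diversify at the code level is a simple fallback chain: try the primary parser and, if it fails or returns nothing useful, move on to the next one. The parser functions below are placeholders for your own or third-party implementations.

```python
from typing import Callable, Optional

def parse_with_fallback(
    html: str,
    parsers: list[Callable[[str], Optional[dict]]],
) -> Optional[dict]:
    """Try each parser in order and return the first non-empty result."""
    for parser in parsers:
        try:
            result = parser(html)
        except Exception:
            continue  # a broken or terminated parser must not stop the pipeline
        if result:
            return result
    return None

# result = parse_with_fallback(html, [in_house_parser, third_party_parser])
```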
You can also choose a solution that handles both parsing and scraping, such as Oxylabs' exclusive Next-Gen Residential Proxies. Like any regular proxy, the solution can be customized, but it also ensures a higher success rate and offers adaptive parsing. This clever little feature adapts to any e-commerce product page and parses all of its HTML code.
Summary
Large-scale web scraping takes a long time and is genuinely difficult. Whether your project is still being planned or is already under way, we hope this guide helps.
If you have any questions about proxies, the best solutions, or scraping best practices, check out our articles to learn more, or contact the customer service team on our website at any time; we will do everything we can to help.