Large-Scale Web Scraping of E-Commerce Websites (Ultimate Guide)
2022-07-29 01:57:00 [Oxylabs China]

Compared with small projects, large-scale web scraping brings a different set of challenges, such as building the infrastructure, managing resource costs, and bypassing bot-detection measures.
This article walks you through data collection at scale, with a focus on the e-commerce field.
Web scraping infrastructure
Building and managing the web scraping infrastructure is one of the first tasks. Of course, we will assume that you have already built a data collection tool (also known as a web scraper or crawler).
The general web scraping workflow looks like this:

In short, you start by scraping some targets. For large-scale operations, scraping without proxies will not last long, because the website will block you quickly. Proxies are an essential element of large-scale data collection.
The best practice for collecting data at scale is to use several proxy solutions, and even several providers. Let's start with the proxy provider.
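To make provider diversification concrete, here is a minimal sketch in Python using the `requests` library, rotating requests round-robin across two proxy gateways. The gateway URLs, credentials, and target page are hypothetical placeholders, not the settings of any real provider.

```python
import itertools
import requests

# Hypothetical gateways from two different proxy providers (placeholders).
PROXY_POOLS = [
    "http://user1:pass1@gateway.provider-a.example:7777",
    "http://user2:pass2@gateway.provider-b.example:8080",
]

# Round-robin so no single provider carries all of the traffic.
proxy_cycle = itertools.cycle(PROXY_POOLS)

def fetch(url: str) -> requests.Response:
    """Fetch a page through the next proxy in the rotation."""
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

# Example usage with a placeholder product URL:
# response = fetch("https://www.example-shop.com/product/123")
```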
Part 1: Choosing a proxy provider
Choosing the right proxy provider is crucial, because it directly affects your scraping program.
If the provider you work with is unreliable, your in-house data retrieval tools will not be effective.
Part 2: Choosing a proxy type
If you are looking for a single proxy type suited to e-commerce data collection, consider residential proxies. By their nature they are unlikely to be blocked, and they provide a large proxy pool with broad geographic coverage.
Part 3: Bypassing security measures
E-commerce websites deploy security measures to stop unwanted bots. Some common methods are listed below (a sketch of handling the header- and cookie-related checks follows the list):
● IP detection. The server can tell whether an IP address belongs to a data center or a residential connection.
● CAPTCHA. A challenge-response test that usually asks the user to type the correct characters or identify objects in an image.
● Cookies. Ordinary users rarely land directly on a specific product page.
● Browser fingerprinting. Information about the device, collected for identification purposes.
● Headers. They tell a website the user's geographic location, time zone, language, and so on.
● Behavior that is inconsistent with that of a natural user.
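As a rough illustration of the header and cookie points above, the sketch below sends browser-like headers and lets a `requests.Session` carry cookies across requests, visiting a category page before the product page so the navigation looks less robotic. The URLs and header values are illustrative assumptions, not a guaranteed bypass.

```python
import requests

session = requests.Session()

# Browser-like headers; values are illustrative examples only.
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
})

# Visit a listing page first; the session keeps any cookies it sets,
# so the later product request does not arrive "out of nowhere".
session.get("https://www.example-shop.com/category/shoes", timeout=30)
response = session.get("https://www.example-shop.com/product/123", timeout=30)
print(response.status_code)
```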
Part 4: The subtle art of storage
All the data you collect needs to be stored somewhere, so large-scale scraping naturally demands a lot of storage resources.
When the speed at which data arrives differs from the speed at which it is processed, you usually use a buffer.
# Create a buffer for data transmission
To explain buffering in plain language, let's use an office as an example. Suppose you work in an office and, from time to time, someone adds new tasks to the pile of documents on your desk. Once you finish the task at hand, you move on to the next one in the pile. That pile of documents is a buffer. If the pile grows too tall it will topple over, so you have to limit how many pages it can hold. That limit is the buffer's capacity; exceed it and the buffer overflows.
While you wait for another service to receive the information, you need a buffer and a view of how much data is in transit. That way you can avoid overflow, just as you avoid the pile of documents toppling over. If the buffer does overflow, you have to give up some work. In that case, you have three options:
1. Discard the oldest data in the buffer
2. Discard the newly arriving data
3. Pause the data collection process to prevent overflow
However, if you choose to pause the scraping process, some work is postponed; once things return to normal, you will have more scraping to catch up on.
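The three options above can be expressed as policies on a simple bounded buffer. The sketch below is a minimal, single-process illustration built on Python's standard library; a production pipeline would more likely rely on Kafka, RabbitMQ, or Redis, as discussed in the next section.

```python
from collections import deque

class ScrapeBuffer:
    """A bounded buffer that drops the oldest item, drops the new item,
    or tells the caller to pause scraping when it is full."""

    def __init__(self, capacity: int, policy: str = "drop_oldest"):
        self.items = deque()
        self.capacity = capacity
        self.policy = policy  # "drop_oldest", "drop_newest", or "pause"

    def put(self, item) -> bool:
        """Return False when the caller should pause data collection."""
        if len(self.items) < self.capacity:
            self.items.append(item)
            return True
        if self.policy == "drop_oldest":   # option 1: discard the oldest data
            self.items.popleft()
            self.items.append(item)
            return True
        if self.policy == "drop_newest":   # option 2: discard the incoming data
            return True
        return False                       # option 3: pause the scraper

    def get(self):
        """Consumer side: take the next item, or None if the buffer is empty."""
        return self.items.popleft() if self.items else None

# buffer = ScrapeBuffer(capacity=10_000, policy="pause")
# if not buffer.put(html_page):
#     ...  # slow down or stop the crawler until get() drains the buffer
```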
# Database storage services
What if you want to process the incoming data and convert it into a readable format (such as JSON)? At that point you no longer really need the raw data, which means you can keep the information in short-term storage. But what if you need both the HTML documents and the processed data? Then long-term storage is the better choice.
However, since we are talking about large-scale data collection, we recommend combining both approaches. Our recommended practice is as follows:
In this setup, the short-term storage is very fast and can handle a large number of requests, so it is used to absorb the bulk of the scraped data. With this solution you can feed data into the parser while also moving unprocessed HTML files into long-term storage.
You could also use long-term storage alone as the buffer, but then you would need to invest more resources to make sure every process finishes on time.
Here are some services for short-term and long-term storage :
1. Long-term: MySQL, BigQuery, Cassandra, Hadoop, etc.
These solutions usually persist data to permanent storage (disk rather than memory/RAM). Because the information is expected to stay around for a long time, they come with tools for filtering the data you need out of the entire data set.
2. Short-term: Apache Kafka, RabbitMQ, Redis, etc.
These stores offer limited data-filtering capabilities, so they are generally not suited to long-term data storage. On the other hand, they are extremely fast; although a lot of functionality is sacrificed, they remain highly available and deliver the throughput that large-scale operations require.
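To show how the two layers can work together, here is a rough sketch in which Redis serves as the fast short-term buffer and SQLite stands in for a long-term store such as MySQL. The Redis key name, table schema, and JSON envelope are assumptions made for illustration only.

```python
import json
import sqlite3

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)
db = sqlite3.connect("pages.db")  # stand-in for MySQL, BigQuery, etc.
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, html TEXT)")

def buffer_page(url: str, html: str) -> None:
    """Scraper side: push a raw page into the fast short-term buffer."""
    r.lpush("scraped_pages", json.dumps({"url": url, "html": html}))

def drain_once() -> None:
    """Worker side: pop one page and persist the raw HTML long-term."""
    item = r.brpop("scraped_pages", timeout=5)
    if item is None:
        return  # buffer is currently empty
    page = json.loads(item[1])
    db.execute("INSERT INTO pages VALUES (?, ?)", (page["url"], page["html"]))
    db.commit()
```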
Of course, you can also avoid the storage process entirely. Our Real-Time Crawler is an advanced scraper tailored to high-load data retrieval operations, and it is particularly well suited to scraping e-commerce product pages. One of its advantages is that it spares you the trouble of data storage, because all you need to give it is a URL. Real-Time Crawler handles the scraping, storage, and processing on its own, and everything it returns to you is usable data (HTML or JSON).
Part 5: Processing the scraped data
Once the storage requirements are settled, we have to think about processing, that is, parsing. Data parsing means analyzing the incoming information and extracting the relevant pieces into a format suitable for further work. It is a key step in web scraping.
However, like everything else discussed in this post so far, parsing is not that simple. At a small scale, building and maintaining parsers is easy. For large-scale web scraping, the situation is far more complicated.
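To show what a small-scale parser looks like, here is a sketch that pulls a product title and price out of an HTML document with BeautifulSoup. The CSS selectors are hypothetical; a real e-commerce page uses its own markup, which is exactly why parsers need constant maintenance.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def parse_product(html: str) -> dict:
    """Extract a few product fields from raw HTML (selectors are assumed)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.product-title")
    price = soup.select_one("span.price")
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }

# sample = '<h1 class="product-title">Running Shoes</h1><span class="price">$59.99</span>'
# parse_product(sample)  # -> {'title': 'Running Shoes', 'price': '$59.99'}
```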
# The problems of large-scale data parsing
● The target website may change its page layout
● When you use a third-party parser, the process may be forced to stop
● If you use third-party services, you need several of them
● The data sets returned by different services differ in structure
● If you build your own parsers, you need a lot of them
● When a parser process terminates, your buffer may overflow
To make a long story short, you either build and maintain your own parsers or obtain parsing through a third-party solution. For large-scale operations, we recommend trying either of these approaches. Spreading the work across several good third-party solutions (diversifying services) helps keep the web scraping running smoothly.
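One way to diversify at the code level is a simple fallback chain: try the primary parser and, if it fails or returns nothing useful, move on to the next one. The parser functions below are placeholders for your own or third-party implementations.

```python
from typing import Callable, Optional

def parse_with_fallback(
    html: str,
    parsers: list[Callable[[str], Optional[dict]]],
) -> Optional[dict]:
    """Try each parser in order and return the first non-empty result."""
    for parser in parsers:
        try:
            result = parser(html)
        except Exception:
            continue  # a broken or terminated parser must not stop the pipeline
        if result:
            return result
    return None

# result = parse_with_fallback(html, [in_house_parser, third_party_parser])
```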
You can also choose a solution that handles both parsing and scraping, such as Oxylabs' exclusive Next-Gen Residential Proxies. Like any regular proxy, the solution can be customized, but it also ensures a higher success rate and offers adaptive parsing. This clever little feature adapts to any e-commerce product page and parses all of its HTML code.
Summary
Large-scale web scraping takes a long time and is genuinely difficult. Whether your project is still being planned or is already under way, we hope this guide helps.
If you have any questions about proxies, the best solutions, or scraping best practices, check out our articles to learn more, or contact the customer service team on our website at any time; we will do everything we can to help.