Web crawler knowledge day04
2022-06-29 03:47:00 · Young Chen Gong
1. Encapsulating HttpClient
We use HttpClient frequently, so it is worth wrapping it in a small utility class that is easy to reuse.
@Component
public class HttpUtils {

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // Maximum total number of connections in the pool
        cm.setMaxTotal(200);
        // Maximum number of concurrent connections per host
        cm.setDefaultMaxPerRoute(20);
    }

    // Fetch the page content of the given URL
    public String getHtml(String url) {
        // Obtain an HttpClient backed by the pooled connection manager
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Declare the HttpGet request object
        HttpGet httpGet = new HttpGet(url);
        // Apply the request parameters (RequestConfig)
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Execute the request and obtain the response
            response = httpClient.execute(httpGet);
            // Parse the response and return the body
            if (response.getStatusLine().getStatusCode() == 200) {
                String html = "";
                // If response.getEntity() is null, EntityUtils.toString would throw,
                // so check the entity for null first
                if (response.getEntity() != null) {
                    html = EntityUtils.toString(response.getEntity(), "UTF-8");
                }
                return html;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    // Close the response, which returns the connection to the pool
                    response.close();
                }
                // Do not close the client itself: that would shut down
                // the shared connection manager
                // httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    // Download an image and return the generated file name
    public String getImage(String url) {
        // Obtain an HttpClient backed by the pooled connection manager
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Declare the HttpGet request object
        HttpGet httpGet = new HttpGet(url);
        // Apply the request parameters (RequestConfig)
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Execute the request and obtain the response
            response = httpClient.execute(httpGet);
            // Parse the response and save the image
            if (response.getStatusLine().getStatusCode() == 200) {
                // Derive the file extension from the URL
                String extName = url.substring(url.lastIndexOf("."));
                // Generate a unique image name with a UUID
                String imageName = UUID.randomUUID().toString() + extName;
                // Declare the output file
                OutputStream outstream = new FileOutputStream(new File("D:/images/" + imageName));
                // Write the response body to the file, then close the stream
                response.getEntity().writeTo(outstream);
                outstream.close();
                // Return the generated image name
                return imageName;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    // Close the response, which returns the connection to the pool
                    response.close();
                }
                // Do not close the client itself: that would shut down
                // the shared connection manager
                // httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    // Build the request configuration
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // timeout for establishing a connection
                .setConnectionRequestTimeout(500) // timeout for obtaining a connection from the pool
                .setSocketTimeout(10000)          // timeout for data transfer on the socket
                .build();
        return config;
    }
}
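One caveat in getImage above: url.substring(url.lastIndexOf(".")) treats everything after the last dot as the extension, which misbehaves when the image URL carries a query string or when the only dot sits in the host name. A minimal sketch of a safer extraction (the class and method names here are hypothetical, not part of the original code):

```java
public class ExtNameDemo {

    // Safer alternative to url.substring(url.lastIndexOf(".")):
    // strip any query string or fragment first, and only accept a dot
    // that appears in the last path segment.
    static String extName(String url) {
        String path = url.split("[?#]")[0];
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        return (dot > slash) ? path.substring(dot) : "";
    }

    public static void main(String[] args) {
        System.out.println(extName("https://img.jd.com/n1/abc.jpg"));     // .jpg
        System.out.println(extName("https://img.jd.com/n1/abc.jpg?v=2")); // .jpg
        System.out.println(extName("https://img.jd.com/n1/abc"));         // (empty)
    }
}
```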
2. Implementing Data Capture
Using a scheduled task, the latest data can be captured at regular intervals.
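In Spring, @Scheduled(fixedDelay = ...) measures the interval from the end of one run to the start of the next, the same semantics as the JDK's ScheduledExecutorService.scheduleWithFixedDelay. A minimal plain-JDK sketch of that behavior (the class and method names are illustrative, not part of the project, and the 100 ms interval is only for demonstration):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedDelayDemo {

    // Count how many times a task runs in totalMillis when rescheduled
    // with a fixed delay measured from the end of the previous run.
    public static int runPasses(long delayMillis, long totalMillis) throws Exception {
        AtomicInteger passes = new AtomicInteger();
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ScheduledFuture<?> f = ses.scheduleWithFixedDelay(
                passes::incrementAndGet, 0, delayMillis, TimeUnit.MILLISECONDS);
        Thread.sleep(totalMillis);
        f.cancel(true);
        ses.shutdown();
        return passes.get();
    }

    public static void main(String[] args) throws Exception {
        // With a 100 ms delay over 350 ms, the task runs several times
        System.out.println("passes: " + runPasses(100, 350));
    }
}
```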
@Component
public class ItemTask {

    @Autowired
    private HttpUtils httpUtils;

    @Autowired
    private ItemService itemService;

    public static final ObjectMapper MAPPER = new ObjectMapper();

    // Run again 100 seconds after the previous run finishes
    @Scheduled(fixedDelay = 1000 * 100)
    public void process() throws Exception {
        // From analyzing the site, this is the address to visit; the page
        // parameter starts at 1 and increases by 2 for each next page
        String url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&s=5760&click=0&page=";
        // Iterate to fetch all the pages
        for (int i = 1; i < 10; i = i + 2) {
            // Request the page data, starting with the first page
            String html = this.httpUtils.getHtml(url + i);
            // Parse the page data and save it to the database
            this.parseHtml(html);
        }
        System.out.println("Execution completed");
    }

    // Parse the page and save the data to the database
    private void parseHtml(String html) throws Exception {
        // Parse the page with jsoup
        Document document = Jsoup.parse(html);
        // Select the product list items
        Elements spus = document.select("div#J_goodsList > ul > li");
        // Iterate over the product SPU elements
        for (Element spuEle : spus) {
            // Get the product SPU; some list items carry no data-spu
            // attribute, so skip them to avoid a NumberFormatException
            String spuAttr = spuEle.attr("data-spu");
            if (spuAttr.isEmpty()) {
                continue;
            }
            Long spuId = Long.parseLong(spuAttr);
            // Get the product SKU elements
            Elements skus = spuEle.select("li.ps-item img");
            for (Element skuEle : skus) {
                // Get the product SKU
                Long skuId = Long.parseLong(skuEle.attr("data-sku"));
                // Check whether the product was already captured, keyed by SKU
                Item param = new Item();
                param.setSku(skuId);
                List<Item> list = this.itemService.findAll(param);
                if (list.size() > 0) {
                    // A result means the product was already downloaded; move on
                    continue;
                }
                // Save the product data; declare the item object
                Item item = new Item();
                // Product SPU
                item.setSpu(spuId);
                // Product SKU
                item.setSku(skuId);
                // Product URL
                item.setUrl("https://item.jd.com/" + skuId + ".html");
                // Creation time
                item.setCreated(new Date());
                // Modification time
                item.setUpdated(item.getCreated());
                // Fetch the product title
                String itemHtml = this.httpUtils.getHtml(item.getUrl());
                String title = Jsoup.parse(itemHtml).select("div.sku-name").text();
                item.setTitle(title);
                // Fetch the product price
                String priceUrl = "https://p.3.cn/prices/mgets?skuIds=J_" + skuId;
                String priceJson = this.httpUtils.getHtml(priceUrl);
                // Parse the JSON to get the price
                double price = MAPPER.readTree(priceJson).get(0).get("p").asDouble();
                item.setPrice(price);
                // Build the image URL (switch the /n9/ thumbnail path to /n1/)
                String pic = "https:" + skuEle.attr("data-lazy-img").replace("/n9/", "/n1/");
                System.out.println(pic);
                // Download the image
                String picName = this.httpUtils.getImage(pic);
                item.setPic(picName);
                // Save the product data
                this.itemService.save(item);
            }
        }
    }
}
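The price endpoint is read with Jackson's readTree: the response is a small JSON array whose first element carries the price under the "p" field. When Jackson is not on the classpath, that single field can also be pulled with a regex. A minimal sketch (the class name and the sample JSON shape are assumptions for illustration, not the endpoint's documented format):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceParseDemo {

    // Extract the value of the "p" field from a JSON string,
    // whether it is quoted ("p":"5999.00") or bare ("p":5999.00).
    static double parsePrice(String json) {
        Matcher m = Pattern.compile("\"p\"\\s*:\\s*\"?([0-9.]+)\"?").matcher(json);
        if (m.find()) {
            return Double.parseDouble(m.group(1));
        }
        throw new IllegalArgumentException("no price field in: " + json);
    }

    public static void main(String[] args) {
        // Mocked response shape; the real endpoint may differ
        String sample = "[{\"id\":\"J_100012043978\",\"op\":\"6399.00\",\"p\":\"5999.00\"}]";
        System.out.println(parsePrice(sample)); // 5999.0
    }
}
```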