Web crawler knowledge day04
2022-06-29 03:47:00 【Young Chen Gong】
1. Encapsulating HttpClient
HttpClient is used throughout the crawler, so we wrap it in a utility class to make it easy to reuse.
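import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.UUID;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

@Component
public class HttpUtils {
    // Connection pool shared by every request made through this class
    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // Maximum number of connections in the pool
        cm.setMaxTotal(200);
        // Maximum number of concurrent connections per route (host)
        cm.setDefaultMaxPerRoute(20);
    }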
    // Fetch a page and return its body as a string, or null if the request fails
    public String getHtml(String url) {
        // Build an HttpClient backed by the shared connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Create the GET request
        HttpGet httpGet = new HttpGet(url);
        // Apply the timeout settings (RequestConfig)
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Execute the request and obtain the response
            response = httpClient.execute(httpGet);
            // Only read the body when the status code is 200
            if (response.getStatusLine().getStatusCode() == 200) {
                String html = "";
                // EntityUtils.toString throws if the entity is null,
                // so check for a non-null entity first
                if (response.getEntity() != null) {
                    html = EntityUtils.toString(response.getEntity(), "UTF-8");
                }
                return html;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    // Close the response to release its connection
                    response.close();
                }
                // Do not close httpClient here: it uses the shared connection manager
                // httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return null;
    }
    // Download an image and return the generated file name, or null if the download fails
    public String getImage(String url) {
        // Build an HttpClient backed by the shared connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Create the GET request
        HttpGet httpGet = new HttpGet(url);
        // Apply the timeout settings (RequestConfig)
        httpGet.setConfig(this.getConfig());
        CloseableHttpResponse response = null;
        try {
            // Execute the request and obtain the response
            response = httpClient.execute(httpGet);
            // Only download the image when the status code is 200
            if (response.getStatusLine().getStatusCode() == 200) {
                // Take the file extension from the URL (assumes the URL ends in ".ext")
                String extName = url.substring(url.lastIndexOf("."));
                // Generate a unique image name with a UUID
                String imageName = UUID.randomUUID().toString() + extName;
                // Open the output file (D:/images/ must already exist);
                // try-with-resources closes the stream once the body is written
                try (OutputStream outstream = new FileOutputStream(new File("D:/images/" + imageName))) {
                    // Write the response body to the file
                    response.getEntity().writeTo(outstream);
                }
                // Return the generated image name
                return imageName;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    // Close the response to release its connection
                    response.close();
                }
                // Do not close httpClient here: it uses the shared connection manager
                // httpClient.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return null;
    }
    // Build the request configuration (all timeouts are in milliseconds)
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000) // timeout for establishing the connection
                .setConnectionRequestTimeout(500) // timeout for obtaining a connection from the pool
                .setSocketTimeout(10000) // timeout while waiting for response data
                .build();
        return config;
    }
}
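The getImage method above takes everything after the last dot in the URL as the file extension. That works for URLs that end in something like ".jpg", but a URL with a query string or without an extension would give an odd file name. The following is a minimal sketch of a more defensive alternative, using only the JDK. The class name ImageNames, the method extensionOf and the example URLs are hypothetical and not part of the original code.

```java
import java.net.URI;

// Hypothetical helper, not part of the original HttpUtils class.
class ImageNames {

    // Returns the extension (including the dot) of the last path segment,
    // or the given fallback when the path has no usable extension.
    static String extensionOf(String url, String fallback) {
        // URI.getPath() leaves out the host, the query string and the fragment
        String path = URI.create(url).getPath();
        if (path == null) {
            return fallback;
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // The dot must come after the last slash and must not be the final character
        if (dot > slash && dot < path.length() - 1) {
            return path.substring(dot);
        }
        return fallback;
    }
}
```

With this helper, getImage could call something like ImageNames.extensionOf(url, ".jpg") instead of url.substring(url.lastIndexOf(".")). Note that URI.create throws IllegalArgumentException for strings that are not valid URIs, so a caller may still want to guard against that.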
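2. Implementing the data crawl
A scheduled task lets the crawler fetch the latest data at regular intervals. This assumes scheduling is enabled in the Spring application, for example with @EnableScheduling.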
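import java.util.Date;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import com.fasterxml.jackson.databind.ObjectMapper;

// Item, ItemService and HttpUtils are project classes; import them from your own packages
@Component
public class ItemTask {
    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private ItemService itemService;
    public static final ObjectMapper MAPPER = new ObjectMapper();

    // Run again 100 seconds after the previous run has finished
    @Scheduled(fixedDelay = 1000 * 100)
    public void process() throws Exception {
        // Search URL found by inspecting the results page. The page parameter starts at 1
        // and increases by 2 for each following page
        String url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&s=5760&click=0&page=";
        // Loop over pages 1, 3, 5, 7 and 9
        for (int i = 1; i < 10; i = i + 2) {
            // Request the page and get its HTML
            String html = this.httpUtils.getHtml(url + i);
            // getHtml returns null on failure, so skip pages that could not be fetched
            if (html == null) {
                continue;
            }
            // Parse the page and save the data to the database
            this.parseHtml(html);
        }
        System.out.println("Execution completed");
    }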
    // Parse a search results page and save the products to the database
    private void parseHtml(String html) throws Exception {
        // Parse the page with jsoup
        Document document = Jsoup.parse(html);
        // Select the product (SPU) list items
        Elements spus = document.select("div#J_goodsList > ul > li");
        // Iterate over the SPU elements
        for (Element spuEle : spus) {
            // Read the SPU id
            Long spuId = Long.parseLong(spuEle.attr("data-spu"));
            // Select the SKU elements of this SPU
            Elements skus = spuEle.select("li.ps-item img");
            for (Element skuEle : skus) {
                // Read the SKU id
                Long skuId = Long.parseLong(skuEle.attr("data-sku"));
                // Check by SKU whether this product has already been crawled
                Item param = new Item();
                param.setSku(skuId);
                List<Item> list = this.itemService.findAll(param);
                // If a record was found
                if (list.size() > 0) {
                    // the product is already saved, so move on to the next SKU
                    continue;
                }
                // Create the item object that will be saved
                Item item = new Item();
                // SPU id
                item.setSpu(spuId);
                // SKU id
                item.setSku(skuId);
                // Product page URL
                item.setUrl("https://item.jd.com/" + skuId + ".html");
                // Creation time
                item.setCreated(new Date());
                // Last-modified time
                item.setUpdated(item.getCreated());
                // Fetch the product page and read the title
                String itemHtml = this.httpUtils.getHtml(item.getUrl());
                String title = Jsoup.parse(itemHtml).select("div.sku-name").text();
                item.setTitle(title);
                // Fetch the product price
                String priceUrl = "https://p.3.cn/prices/mgets?skuIds=J_"+skuId;
                String priceJson = this.httpUtils.getHtml(priceUrl);
                // Parse the JSON and read the price from field "p" of the first array element
                double price = MAPPER.readTree(priceJson).get(0).get("p").asDouble();
                item.setPrice(price);
                // Build the image URL, switching from the /n9/ path to the /n1/ path
                String pic = "https:" + skuEle.attr("data-lazy-img").replace("/n9/","/n1/");
                System.out.println(pic);
                // Download the image
                String picName = this.httpUtils.getImage(pic);
                item.setPic(picName);
                // Save the item to the database
                this.itemService.save(item);
            }
        }
    }
}
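The fixedDelay attribute used on process() means the 100-second wait is counted from the end of one run to the start of the next, so a slow crawl never overlaps with the following one. The JDK method ScheduledExecutorService.scheduleWithFixedDelay has the same timing rule, and the sketch below uses it to make the behaviour visible without Spring. It is only an illustration: the class FixedDelayDemo, the method measureGaps and the timings are hypothetical, and this is not how the task above is wired.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only illustrates fixed-delay timing with the JDK scheduler.
class FixedDelayDemo {

    // Runs a task lasting workMs for `runs` iterations with a fixed delay of delayMs.
    // Returns the gap in milliseconds between the end of each run and the start of the next.
    static List<Long> measureGaps(long workMs, long delayMs, int runs) {
        List<Long> starts = new ArrayList<>();
        List<Long> ends = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(runs);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            if (done.getCount() == 0) {
                return; // enough samples collected, ignore any later run
            }
            starts.add(System.nanoTime());
            try {
                Thread.sleep(workMs); // stand-in for the crawl itself
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            ends.add(System.nanoTime());
            done.countDown();
        }, 0, delayMs, TimeUnit.MILLISECONDS);
        try {
            done.await();
            scheduler.shutdownNow();
            scheduler.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the demo runs", e);
        }
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < runs; i++) {
            // Gap between the end of run i-1 and the start of run i
            gaps.add((starts.get(i) - ends.get(i - 1)) / 1_000_000);
        }
        return gaps;
    }
}
```

Calling FixedDelayDemo.measureGaps(20, 50, 3) returns two gaps, each at least 50 ms, however long the simulated work takes. With fixedRate instead of fixedDelay the interval would be counted from the start of each run.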