Crawler framework crawler
2022-07-25 17:11:00 【wangmcn】
crawler
Contents
- 1、 Introduction
- 2、 Installation and deployment
- 3、 Framework description
- 4、 Using the framework
1、 Introduction
crawler crawls pages with requests + lxml. Both the content to extract and the URLs to follow are located with XPath expressions (for XPath, see the XPath reference manual chapter).
GitHub repository: https://github.com/shuizhubocai/crawler
requests is an excellent third-party Python library: an HTTP library designed for humans that wraps many tedious HTTP operations and greatly reduces the amount of code needed to make an HTTP request.
lxml is a Python parsing library. It supports HTML and XML parsing, supports XPath queries, and parses very efficiently.
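To make the division of labor concrete, here is a minimal sketch of how the two libraries are typically combined: requests downloads the page and lxml evaluates an XPath expression against it. The function names fetch and extract and the SAMPLE markup are illustrative assumptions, not code taken from the crawler framework.

```python
import requests
from lxml import html


def fetch(url, timeout=10):
    """Download a page with requests and return its text; raises on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract(page_text, xpath):
    """Parse HTML text with lxml and return whatever the XPath expression matches."""
    tree = html.fromstring(page_text)
    return tree.xpath(xpath)


# Hypothetical page, used here so the example runs without network access.
SAMPLE = """
<html><body>
  <a href="/a.html">First</a>
  <a href="/b.html">Second</a>
</body></html>
"""

titles = extract(SAMPLE, "//a/text()")
links = extract(SAMPLE, "//a/@href")
print(titles)
print(links)
```

For a live page, the text returned by fetch(url) would be passed to extract in place of SAMPLE.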
2、 Installation and deployment
The environment used here is 64-bit Windows with Python 3.6.5.
1、 Open the project page and download the framework. The downloaded file is crawler-master.zip.
2、 Unzip the file to a directory of your choice (for example D:\crawler).
3、 In the installation directory, run pip install -r requrements.txt on the command line to install the libraries the framework depends on.
Contents of requrements.txt:
certifi==2018.4.16
chardet==3.0.4
idna==2.7
requests==2.19.1
urllib3==1.23
4、 Install lxml, version 4.2.5.
Download address: https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
Download the build that matches your environment: cp36 stands for Python 3.6 and win_amd64 stands for 64-bit Windows. Choose the matching file, otherwise the installation fails with an error saying that the platform does not match.
When the download completes, install lxml: on the command line, change to the directory that contains the downloaded file and run:
pip install lxml-4.2.5-cp36-cp36m-win_amd64.whl
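The tags in the wheel filename can also be checked programmatically before installing. The sketch below splits a filename into its fields and reports the running interpreter's version tag and bitness for comparison. The helper names are hypothetical, and the cp tag construction assumes CPython.

```python
import struct
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # Layout: name-version[-build]-python_tag-abi_tag-platform_tag
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def current_interpreter():
    """Describe the running interpreter (the cp tag assumes CPython)."""
    return {
        "python_tag": "cp{}{}".format(*sys.version_info[:2]),
        "bits": struct.calcsize("P") * 8,  # pointer size in bits: 32 or 64
    }


wheel = parse_wheel_filename("lxml-4.2.5-cp36-cp36m-win_amd64.whl")
print(wheel)
print(current_interpreter())
```

If the printed python_tag values differ, or the wheel is win_amd64 while the interpreter reports 32 bits, the file does not match the environment.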
3、 Framework description
1、 The crawler.py file:
Urls class: address (URL) manager
Download class: page downloader
Parser class: page parser
Output class: exports data to HTML
Scheduler class: crawler scheduler
2、 The modules\useragent directory contains chrome.py, firefox.py, and other browser user-agent files.
3、 data.html: the crawled data is written to this file.
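The way these components cooperate can be sketched roughly as follows. Only the class names Urls, Output, and Scheduler come from the description above; every method name, the constructor arguments, and the in-memory FAKE_SITE are assumptions made for illustration, and the real crawler.py may be organized differently. The downloader and parser are passed in as plain callables so the sketch runs without network access.

```python
from html import escape


class Urls:
    """Address manager: tracks URLs waiting to be crawled and URLs already seen."""

    def __init__(self):
        self.pending = []
        self.seen = set()

    def add(self, url):
        if url not in self.seen:  # never queue the same address twice
            self.seen.add(url)
            self.pending.append(url)

    def has_next(self):
        return bool(self.pending)

    def next(self):
        return self.pending.pop(0)


class Output:
    """Collects (title, url) rows and renders them as an HTML list."""

    def __init__(self):
        self.rows = []

    def add(self, title, url):
        self.rows.append((title, url))

    def render(self):
        items = "\n".join(
            '<li><a href="{}" target="_blank">{}</a></li>'.format(
                escape(url, quote=True), escape(title))
            for title, url in self.rows)
        return "<html><body><ul>\n{}\n</ul></body></html>".format(items)


class Scheduler:
    """Crawler scheduler: download, parse, queue new URLs, collect output."""

    def __init__(self, download, parse, max_pages=10):
        self.download = download  # callable: url -> page
        self.parse = parse        # callable: page -> (rows, new_urls)
        self.max_pages = max_pages
        self.urls = Urls()
        self.output = Output()

    def run(self, start_url):
        self.urls.add(start_url)
        crawled = 0
        while self.urls.has_next() and crawled < self.max_pages:
            page = self.download(self.urls.next())
            rows, new_urls = self.parse(page)
            for title, link in rows:
                self.output.add(title, link)
            for url in new_urls:
                self.urls.add(url)
            crawled += 1
        return self.output.render()


# Hypothetical in-memory "site" standing in for the downloader and parser.
FAKE_SITE = {
    "page-1": ([("Post A", "thread-1")], ["page-2"]),
    "page-2": ([("Post B", "thread-2")], ["page-1"]),
}
scheduler = Scheduler(download=FAKE_SITE.__getitem__, parse=lambda page: page)
result = scheduler.run("page-1")
print(result)
```

Keeping the seen set in the address manager is what stops the loop when pagination links point back to pages that were already visited.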
4、 Using the framework
Requirement: visit the 51testing forum and get the post titles and URLs for a specified range of pages (1-10).
Figure: the post titles to fetch.
Figure: pages 1-10 to fetch.
1、 Modify the script (crawler.py).
(1) In the Parser class, change the html.xpath value in the getDatas method to:
//tbody[contains(@id,'normalthread')]/tr/th/a[3]
Figure: debugging and locating the expression with Firefox + FirePath.
(2) In the Parser class, change the html.xpath value in the getUrls method to:
//span[@id='fd_page_bottom']/div//a[not(@class)]//@href
Figure: debugging and locating the expression with Firefox + FirePath.
(3) Instantiation
Add the start address: http://bbs.51testing.com/forum-279-1.html
2、 Run the script (crawler.py).
In the installation directory, run python crawler.py on the command line.
3、 View the crawl results.
After the script finishes, a data.html file is generated automatically in the installation directory.
Open data.html to see the crawled data. Clicking a title opens a new window and jumps to the corresponding address.
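The two XPath expressions from step 1 can be tried outside the framework against a small page fragment. The markup below is a hypothetical, heavily simplified stand-in for the forum page (the real page is more complex), and the use of relative links is an assumption. Only the two XPath strings and the start address come from the steps above.

```python
from urllib.parse import urljoin

from lxml import html

BASE_URL = "http://bbs.51testing.com/forum-279-1.html"

# Hypothetical markup that mimics only the parts the expressions rely on.
SAMPLE_PAGE = """
<html><body>
<table>
  <tbody id="stickthread_1"><tr><th>
    <a href="#">icon</a><a href="#">tag</a>
    <a href="thread-1-1-1.html">Pinned notice</a>
  </th></tr></tbody>
  <tbody id="normalthread_101"><tr><th>
    <a href="#">icon</a><a href="#">tag</a>
    <a href="thread-101-1-1.html">First post title</a>
  </th></tr></tbody>
  <tbody id="normalthread_102"><tr><th>
    <a href="#">icon</a><a href="#">tag</a>
    <a href="thread-102-1-1.html">Second post title</a>
  </th></tr></tbody>
</table>
<span id="fd_page_bottom"><div>
  <a class="prev" href="forum-279-1.html">prev</a>
  <a href="forum-279-2.html">2</a>
  <a href="forum-279-3.html">3</a>
  <a class="nxt" href="forum-279-2.html">next</a>
</div></span>
</body></html>
"""

TITLE_XPATH = "//tbody[contains(@id,'normalthread')]/tr/th/a[3]"
PAGE_XPATH = "//span[@id='fd_page_bottom']/div//a[not(@class)]//@href"

tree = html.fromstring(SAMPLE_PAGE)

# Third link in each normal thread row: its text is the title, its href the post URL.
posts = [
    (a.text_content().strip(), urljoin(BASE_URL, a.get("href")))
    for a in tree.xpath(TITLE_XPATH)
]

# Pagination links without a class attribute, resolved against the start address.
pages = [urljoin(BASE_URL, href) for href in tree.xpath(PAGE_XPATH)]

print(posts)
print(pages)
```

In this sample the contains(@id,'normalthread') test skips the pinned row, and not(@class) skips the prev and next links, so only the numbered page links are queued.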