Crawler framework: crawler
2022-07-25 17:11:00 【wangmcn】
crawler
Contents
- 1. Introduction
- 2. Installation and deployment
- 3. Framework overview
- 4. Using the framework
1. Introduction
crawler fetches pages with requests and parses them with lxml. Both the content to extract and the URLs to follow are selected with XPath expressions (for XPath itself, see the XPath reference manual).
GitHub repository: https://github.com/shuizhubocai/crawler
requests is a popular third-party HTTP library for Python, billed as "HTTP for Humans". It wraps many tedious HTTP details and greatly reduces the amount of code needed to send a request.
lxml is a Python parsing library. It supports HTML and XML, supports XPath queries, and parses very efficiently.
2. Installation and deployment
The environment used here is 64-bit Windows with Python 3.6.5.
1. Download the framework from the project site; the downloaded file is crawler-master.zip.
2. Unzip the file to a directory of your choice (for example D:\crawler).
3. In the installation directory, run pip install -r requrements.txt from the command line to install the libraries the framework depends on.
Contents of requrements.txt:
certifi==2018.4.16
chardet==3.0.4
idna==2.7
requests==2.19.1
urllib3==1.23
4. Install lxml, version 4.2.5.
Download address: https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
Download the build that matches your environment: cp36 stands for Python 3.6 and win_amd64 stands for 64-bit Windows. If you pick the wrong build, pip fails with an error saying the wheel is not supported on this platform.
When the download is complete, open a command line in the directory that contains the file and run:
pip install lxml-4.2.5-cp36-cp36m-win_amd64.whl
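The tags in the wheel file name can also be checked programmatically before installing. The following is a minimal sketch, not part of the framework: parse_wheel_name and local_python_tag are hypothetical helper names, and the parser assumes a wheel name without the optional build tag (exactly five dash-separated parts).

```python
import platform
import sys


def parse_wheel_name(filename):
    """Split a wheel file name into its dash-separated parts."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # Without the optional build tag, a wheel name has exactly five parts:
    # distribution-version-python_tag-abi_tag-platform_tag
    if len(parts) != 5:
        raise ValueError("unexpected wheel name: " + filename)
    keys = ("name", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


def local_python_tag():
    """Return the CPython tag of the running interpreter, e.g. 'cp36'."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


info = parse_wheel_name("lxml-4.2.5-cp36-cp36m-win_amd64.whl")
print(info)
print("interpreter tag:", local_python_tag(), "| machine:", platform.machine())
```

If the interpreter tag printed on your machine differs from the python_tag of the file you downloaded, that file was built for a different Python version.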
3. Framework overview
1. crawler.py contains the following classes:
Urls class: URL manager
Download class: page downloader
Parser class: page parser
Output class: exports the data to HTML
Scheduler class: crawler scheduler
2. The modules\useragent directory contains chrome.py, firefox.py, and other browser User-Agent definitions.
3. data.html is the file the crawled data is written to.
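How a URL manager and a scheduler typically cooperate can be sketched as follows. This is an illustration of the general pattern, not the code in crawler.py: UrlManager, crawl, and FAKE_SITE are hypothetical names, and the downloader and parser are passed in as plain functions so the loop runs without network access.

```python
from collections import deque


class UrlManager:
    """Tracks which URLs are waiting to be crawled and which were seen."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def add(self, url):
        # Ignore empty URLs and URLs that were already queued or crawled.
        if url and url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def has_next(self):
        return bool(self._pending)

    def next(self):
        return self._pending.popleft()


def crawl(start_url, download, parse, max_pages=10):
    """Run the fetch/parse loop.

    download(url) returns the page text; parse(text) returns a pair
    (items, new_urls). Both are supplied by the caller.
    """
    urls = UrlManager()
    urls.add(start_url)
    results = []
    pages = 0
    while urls.has_next() and pages < max_pages:
        text = download(urls.next())
        items, new_urls = parse(text)
        results.extend(items)
        for url in new_urls:
            urls.add(url)
        pages += 1
    return results


# Offline stand-ins for the downloader and parser, for illustration only.
FAKE_SITE = {
    "page1": (["title A"], ["page2", "page1"]),
    "page2": (["title B"], ["page1"]),
}
print(crawl("page1", download=lambda url: url, parse=lambda text: FAKE_SITE[text]))
```

The set of seen URLs is what keeps the loop from revisiting page1 when page2 links back to it.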
4. Using the framework
Requirement: visit the 51testing forum and collect the post titles and URLs from a given range of pages (pages 1-10).
As shown in the figure: the post titles to collect.
As shown in the figure: pages 1-10 to crawl.
1. Modify the script (crawler.py).
(1) In the Parser class, change the html.xpath value in the getDatas method to:
//tbody[contains(@id,'normalthread')]/tr/th/a[3]
As shown in the figure: locating and debugging the expression with Firefox + FirePath.
(2) In the Parser class, change the html.xpath value in the getUrls method to:
//span[@id='fd_page_bottom']/div//a[not(@class)]//@href
As shown in the figure: locating and debugging the expression with Firefox + FirePath.
(3) Instantiation
Add the start URL: http://bbs.51testing.com/forum-279-1.html
2. Run the script (crawler.py).
In the installation directory, run python crawler.py from the command line.
3. View the crawl results.
After the script finishes, data.html is generated automatically in the installation directory.
Open data.html to see the crawled data. Clicking a title opens a new window and jumps to that post's address.
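To see what the two XPath expressions select without running the whole framework, they can be tried on a small HTML fragment with lxml. The sketch below is an assumption-laden illustration: SAMPLE is hand-written markup that only imitates the structure the expressions expect, the thread URLs in it are made up, and render_rows is a hypothetical helper that shows one way to produce links that open in a new window (the actual layout of data.html may differ).

```python
from html import escape
from urllib.parse import urljoin

from lxml import etree

# Hand-written stand-in for a forum list page; the real page is far larger.
SAMPLE = """
<html><body>
<table>
  <tbody id="normalthread_101"><tr><th>
    <a href="#">icon</a><a href="#">tag</a>
    <a href="thread-101-1-1.html">First post</a>
  </th></tr></tbody>
  <tbody id="normalthread_102"><tr><th>
    <a href="#">icon</a><a href="#">tag</a>
    <a href="thread-102-1-1.html">Second post</a>
  </th></tr></tbody>
</table>
<span id="fd_page_bottom"><div>
  <strong>1</strong>
  <a href="forum-279-2.html">2</a>
  <a href="forum-279-3.html">3</a>
  <a class="nxt" href="forum-279-2.html">Next</a>
</div></span>
</body></html>
"""

BASE_URL = "http://bbs.51testing.com/forum-279-1.html"
# The two expressions from the steps above.
TITLE_XPATH = "//tbody[contains(@id,'normalthread')]/tr/th/a[3]"
PAGE_XPATH = "//span[@id='fd_page_bottom']/div//a[not(@class)]//@href"

tree = etree.HTML(SAMPLE)

# The title expression returns <a> elements: read their text and href.
titles = [(a.text.strip(), urljoin(BASE_URL, a.get("href")))
          for a in tree.xpath(TITLE_XPATH) if a.text]

# The page expression returns href strings; the link with a class
# attribute ("Next") is filtered out by not(@class).
page_urls = [urljoin(BASE_URL, href) for href in tree.xpath(PAGE_XPATH)]


def render_rows(rows):
    """Build an HTML list whose links open in a new window."""
    items = ['<li><a href="{}" target="_blank">{}</a></li>'.format(
        escape(url, quote=True), escape(title)) for title, url in rows]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(titles)
print(page_urls)
print(render_rows(titles))
```

urljoin turns the relative href values into absolute addresses based on the start URL, which is needed before they can be fetched or written into an output file.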