Using Scrapy to crawl beautiful sentences from the Juzimi website and store them locally (a blessing for those who love excerpts!)
2022-06-26 13:06:00 【Rowing in the waves】
Using Scrapy to crawl the Juzimi website and save its beautiful sentences locally (a blessing for those who love excerpts!)
- 1. Preface
- 2. Create a new Scrapy project
- 3. For convenience, create a main.py file in the top-level directory; the code is as follows:
- 4. Write the code in items.py as follows:
- 5. Write the main spider code in sentence.py:
- 6. Use CsvItemExporter to save the results locally
- 7. Set up the proxy middleware
- 8. Screenshots of the running project
1. Preface
I have always collected sentences or lines that strike me as beautiful. When I first used Sina Weibo I noticed accounts that regularly shared excellent lines and quotes, and I copied almost all of them down. My writing is not great, but even now, whenever I see a good sentence I copy and paste it into my phone's memo app and reread it when I have time. Below is my collection (ps: these are all sentences I gathered in my spare time while browsing various websites, out of pure affection; their sources are unknown, so please let me know if any infringes a copyright).
- It is not what we hate that destroys us, but precisely what we love.
- These people are not heroes, yet not necessarily villains either. Their social status may be extremely low, or extremely high. It is hard to say whether they are good or bad, but because they exist, many simple things turn chaotic, ambiguous and dirty, and many peaceful relationships turn tense and awkward... They never want to take responsibility for anything, and indeed they cannot be held responsible... When you finally lose your temper and gather all your thunder to strike, you find they are no longer angry with you, and your target has suddenly vanished... What is a villain? If it could be defined clearly, they would not be so hateful...
- "Wherever the heart leads, I go in plain shoes; life is a journey upstream, sailed on a single reed."
- Tall buildings will eventually fail one's youth; freedom sooner or later unsettles the rest of life.
- The wind of blood and iron howled, and the world lost the scent of white plum. The beloved drifts away with the snow; the one who lives on keeps the cross-shaped scar alone. ——Rurouni Kenshin (ps: I saw this while watching the anime; it is still stirring to read)
- A kind word warms through three winters; a harsh word wounds even in June.
- All the world hustles and bustles, and all for profit.
- One man's honey is another man's arsenic.
- While I love you, you are my dearest; when I love you no more, I care not who the dearest may be.
- I am not sociable by nature. In most cases it is not that I find others boring; I am simply afraid they will find me boring. But I refuse to endure boredom, and I refuse to force myself to be interesting, which is far too tiring. I am most at ease alone, because I never feel bored then, and even if I did, I would bear it myself without dragging anyone else in.
- The green lamp lights the wall as one first falls asleep; cold rain taps the window and the quilt is not yet warm.
- Memory can make a person neurotic: the corners of the mouth lift one second, and the eyes moisten the next. Sometimes a sudden moment tied to your memories, or a merely similar scene, is enough to bring you to tears...
- On the Lantern Festival night last year, the flower-market lamps were bright as day. The moon rose above the willow tops; we met after dusk. On the same night this year, the moon and the lamps are as before. But the one from last year is gone, and tears soak the sleeves of my spring shirt.
- When I first heard the song I did not grasp its meaning; now that I have become the one in the song, what need is there to understand it again.
- Since ancient times confessions have mostly been in vain, and love letters have rarely carried love.
- We speak of youth as but a few years, yet so casually pledge a whole lifetime.
2. Create a new Scrapy project
The crawl template is used below because it is well suited to extracting links from article pages: Scrapy automatically follows URLs of the given style, which avoids parsing too many pages and greatly cuts down crawl time.
Create a project named juzimi, with a spider file called sentence.py:
```shell
scrapy startproject juzimi
cd juzimi
scrapy genspider -t crawl sentence www.juzimi.com  # use Scrapy's alternative "crawl" spider template
```
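As a rough illustration of how the crawl template keeps the spider focused, the Rule defined later in the spider only follows links matching `r'article/\d+'` (sentence-list pages such as https://www.juzimi.com/article/28410). A minimal sketch of that filtering with plain `re` (the URLs below are examples from this article, not a live crawl):

```python
import re

# The same pattern the spider's LinkExtractor uses: follow only links whose
# URL contains "article/" followed by digits (sentence-list pages).
pattern = re.compile(r'article/\d+')

urls = [
    'https://www.juzimi.com/article/28410',       # sentence-list page -> followed
    'https://www.juzimi.com/allarticle/sanwen',   # index page, no digits -> not followed
]
followed = [u for u in urls if pattern.search(u)]
print(followed)  # only the /article/28410 link survives
```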
3. For convenience, create a main.py file in the top-level directory so the spider can be started from the IDE; the code is as follows:

```python
from scrapy import cmdline

# equivalent to running `scrapy crawl sentence` on the command line
cmdline.execute('scrapy crawl sentence'.split())
```
4. Write the code in items.py as follows:

```python
import scrapy


class JuzimiItem(scrapy.Item):
    title = scrapy.Field()     # source of the sentence
    sentence = scrapy.Field()  # the sentence itself
    writer = scrapy.Field()    # author
    love = scrapy.Field()      # number of "likes"
    url = scrapy.Field()       # article URL, stored so bad field values can be traced back easily
```
5. Write the main spider code in sentence.py:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags  # w3lib ships with Scrapy; strips HTML tags from content
from ..items import JuzimiItem  # .. is the relative path to the package containing this file


class SentenceSpider(CrawlSpider):
    name = 'sentence'
    allowed_domains = ['www.juzimi.com']
    start_urls = [
        'https://www.juzimi.com/allarticle/jingdiantaici',
        'https://www.juzimi.com/allarticle/sanwen',
    ]
    rules = (
        # sentence-list URLs look like 'https://www.juzimi.com/article/28410'
        Rule(LinkExtractor(allow=r'article/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sentences = [remove_tags(item) for item in response.css('.xlistju').extract()]
        # strip the leading "喜欢" ("like") label, keeping only the count
        loves = [item.lstrip('喜欢') for item in
                 response.css('div.view-content:nth-child(2) .flag-action::text').extract()]
        # some sentences have no author: keep '' when the tag is missing
        writers = [item.css('.views-field-field-oriwriter-value::text').extract()
                   if item.css('.views-field-field-oriwriter-value::text') else ''
                   for item in response.css('div.view-content:nth-child(2) .xqjulistwafo')]
        titles = [item for item in
                  response.css('div.view-content:nth-child(2) .xqjulistwafo .active::text').extract()]
        for sentence, love, writer, title in zip(sentences, loves, writers, titles):
            item = JuzimiItem()
            item['sentence'] = sentence
            item['love'] = love
            item['writer'] = writer
            item['title'] = title
            item['url'] = response.url
            yield item  # yield inside the loop so every sentence on the page is returned
```
I find I have really fallen in love with list comprehensions. Each sentence-list page holds ten sentences, and the field values are taken from the list page rather than the sentence detail page: the detail page does not carry the like count I want, and everything else is available on the list page anyway, so there is no need to visit it. Some sentences have no author, so a conditional is added inside the list comprehension: if the author tag exists, it is extracted; otherwise the field is left as the empty string ''. The returned items look like this:
```
{'love': '(2139)',
 'sentence': 'Strangers are like jade; there is no one in the world like you.',
 'title': 'Mu Yu Cheng Yue',
 'url': 'https://www.juzimi.com/article/45916',
 'writer': ['Ye Fan']}
{'love': '(110)',
 'sentence': 'Life always takes more from you than it gives.',
 'title': 'Guardians of the Galaxy',
 'url': 'https://www.juzimi.com/article/76689',
 'writer': ''}
{'love': '(405)',
 'sentence': "Secrets have a cost. They're not free. Not now, not ever.\r"
             'Secrets come at a price; there are no free secrets, not now, not ever.',
 'title': 'The Amazing Spider-Man',
 'url': 'https://www.juzimi.com/article/27324',
 'writer': ''}
{'love': '(140)',
 'sentence': 'Hatred is like poison; slowly it will make your soul ugly.',
 'title': 'Spider-Man 3',
 'url': 'https://www.juzimi.com/article/37095',
 'writer': ''}
{'love': '(30)',
 'sentence': '"From now on, I will no longer be greedy, merely fond of eating."',
 'title': "Garfield's Happy Life",
 'url': 'https://www.juzimi.com/article/362645',
 'writer': ['Garfield']}
```
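Note in the output above that `writer` comes back as a list (because `.extract()` returns a list) and `love` keeps its parentheses. A hypothetical post-processing sketch, not part of the original spider, that normalises both fields:

```python
# Hypothetical cleanup helpers (not in the original spider), matching the
# sample output shapes above.
def clean_writer(writer):
    # .extract() returns a list such as ['Ye Fan']; missing authors were stored as ''
    if isinstance(writer, list):
        return writer[0] if writer else ''
    return writer


def clean_love(love):
    # '(2139)' -> 2139
    return int(love.strip('()'))


print(clean_writer(['Ye Fan']), clean_love('(2139)'))
```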
6. Use CsvItemExporter to save the results locally
Scrapy provides several exporters; looking at the exporter source code, the available export formats are listed as follows:

```python
__all__ = ['BaseItemExporter', 'PprintItemExporter', 'PickleItemExporter',
           'CsvItemExporter', 'XmlItemExporter', 'JsonLinesItemExporter',
           'JsonItemExporter', 'MarshalItemExporter']
```

Here CsvItemExporter is used to export the items returned above, because a CSV file can be opened with Excel. So next, write the pipeline in pipelines.py; the code is as follows:
```python
from scrapy.exporters import CsvItemExporter


class CsvExporterPipeline(object):
    # use Scrapy's built-in CsvItemExporter to export a csv file

    def __init__(self):
        # open the output file in binary mode; it is created if it does not exist
        self.file = open('new_sentence.csv', 'wb')
        # create the exporter object and specify the encoding
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()  # start exporting

    def close_spider(self, spider):
        self.exporter.finish_exporting()  # finish exporting
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item  # pass the item on to any later pipelines
```
Remember to enable ITEM_PIPELINES in settings.py; the relevant settings are as follows:

```python
ITEM_PIPELINES = {
    'juzimi.pipelines.CsvExporterPipeline': 300,
}
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
```
7. Set up the proxy middleware.
I waited happily for results, only to be met with 403 responses, after which the site banned my IP outright: even the browser could not reach it. With no other choice, I turned to a proxy. The Abuyun proxy service is recommended here: it offers hourly billing, and its interface documentation is written in detail, which makes it convenient. The following code is provided by the Abuyun documentation:

```python
import base64

# proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"

# proxy tunnel credentials (provided after purchase)
proxyUser = "user"
proxyPass = "password"
proxyAuth = "Basic " + base64.urlsafe_b64encode(
    bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
```
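As a quick sanity check of the credential string built above (with the same placeholder user/password, not real Abuyun account values): the Proxy-Authorization header is just "Basic " followed by base64 of "user:password", and decoding it recovers the pair:

```python
import base64

# Placeholder credentials, as in the middleware above
proxyUser = "user"
proxyPass = "password"
proxyAuth = "Basic " + base64.urlsafe_b64encode(
    (proxyUser + ":" + proxyPass).encode("ascii")).decode("utf8")

# decoding the header payload recovers the original credential pair
decoded = base64.urlsafe_b64decode(proxyAuth.split(" ", 1)[1]).decode("ascii")
print(decoded)
```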
Next, just configure the middleware in settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'juzimi.middlewares.ProxyMiddleware': 543,
}
```
8. Screenshots of the running project
One small issue here: the generated CSV file displays fine when opened in PyCharm or Notepad, but appears garbled when opened with Excel. I later used Notepad to re-save the file under a different encoding (ANSI), and the new copy opened normally in Excel. A screenshot of the CSV file follows:
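The Excel garbling happens because Excel only detects UTF-8 when the file starts with a BOM. A minimal sketch with the plain csv module, independent of Scrapy, showing that writing with the utf-8-sig encoding prepends the BOM (with CsvItemExporter the same effect should be achievable by passing encoding='utf-8-sig'):

```python
import csv

# Writing with 'utf-8-sig' prepends the UTF-8 BOM, which lets Excel detect
# the encoding and avoids the garbled characters described above.
rows = [
    {'title': 'Mu Yu Cheng Yue',
     'sentence': 'Strangers are like jade; there is no one like you.'},
]
with open('demo_sentence.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'sentence'])
    writer.writeheader()
    writer.writerows(rows)

# the file now starts with the UTF-8 byte-order mark
with open('demo_sentence.csv', 'rb') as f:
    head = f.read(3)
print(head == b'\xef\xbb\xbf')
```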
Here I sorted by like count and posted the result. The plan was to post only the top ten, but further down the list were Yasuo's lines (HASAKI!), which simply had to be included. It seems a lot of people like Yasuo; his lines do sound imposing.