How to Use robots.txt: A Detailed Explanation
2022-06-30 23:11:00 [Full Stack Programmer Webmaster]
Hello everyone, good to see you again. I'm your friend Quan Jun.
In China, website managers do not seem to pay much attention to robots.txt. At the request of some friends, today I would like to talk about how to write one.
A basic introduction to robots.txt
robots.txt is a plain text file in which a site manager can declare the parts of the site that robots should not visit, or specify that a given search engine may only index certain content.
When a search robot (sometimes called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its crawling scope from the contents of that file; if the file does not exist, the robot simply crawls along the links it finds.
Also note that robots.txt must be placed in the root directory of the site, and the file name must be all lowercase.
robots.txt syntax
First, let's look at a real robots.txt example: http://www.seovip.cn/robots.txt
Visiting that address, we see the following robots.txt contents:
# Robots.txt file from http://www.seovip.cn
# All robots will spider the domain

User-agent: *
Disallow:
This file allows all search robots to access every file on the www.seovip.cn site.
Syntax notes: text after # is a comment; User-agent: is followed by the name of a search robot (or *, meaning all search robots); Disallow: is followed by a path that must not be accessed.
Below, I will list some specific robots.txt usages:
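These rules can be checked programmatically with Python's standard-library robots.txt parser. The sketch below feeds the example file above to urllib.robotparser and asks whether a page may be fetched; the page URL is purely illustrative.

```python
# Minimal sketch: evaluate robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Instead of fetching over the network, feed the rules directly:
rp.parse([
    "User-agent: *",
    "Disallow:",
])
# An empty Disallow means everything is allowed, for every robot.
print(rp.can_fetch("*", "http://www.seovip.cn/any/page.html"))  # True
```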
Allow all robots to access everything:
User-agent: *
Disallow:
Alternatively, you can create an empty "/robots.txt" file.
Block all search engines from every part of the site:
User-agent: *
Disallow: /
Block all search engines from certain sections of the site (the 01, 02, and 03 directories in this example):
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
Block one particular search engine (BadBot in this example):
User-agent: BadBot
Disallow: /
Allow only one search engine (Crawler in this example):
User-agent: Crawler
Disallow:

User-agent: *
Disallow: /
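In this last pattern, the record for the named robot takes precedence over the catch-all * record. A quick way to convince yourself of the behavior is again urllib.robotparser; the host name here is just a placeholder.

```python
# Sketch: a specific User-agent record overrides the * record.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Crawler",
    "Disallow:",          # Crawler may fetch everything
    "",
    "User-agent: *",
    "Disallow: /",        # everyone else is blocked
])
print(rp.can_fetch("Crawler", "http://example.com/page.html"))  # True
print(rp.can_fetch("BadBot", "http://example.com/page.html"))   # False
```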
In addition, I think it is worth expanding the discussion with a brief introduction to the robots META tag.
The robots META tag targets specific pages. Like other META tags (such as the language used, the page description, and keywords), it is placed inside the page's <head></head> section, and it is used specifically to tell search engine robots how to crawl the page's content.
How to write a robots META tag:
The robots META tag is case-insensitive. name="robots" addresses all search engines; it can also target a specific engine, for example name="BaiduSpider". The content attribute takes four directive values: index, noindex, follow, and nofollow, separated by commas.
The INDEX directive tells the search robot to index the page;
The FOLLOW directive tells the search robot that it may continue crawling along the links on the page;
The default for the robots META tag is INDEX,FOLLOW, with the exception of Inktomi, for which the default is INDEX,NOFOLLOW.
This gives four combinations:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Of these:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="ALL">;
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.
At present, most search engine robots obey the rules in robots.txt, while support for the robots META tag is more limited, though growing. The well-known search engine Google fully supports it, and Google also adds an extra directive, "archive", which controls whether Google keeps a cached snapshot of the page. For example:
<META NAME="googlebot" CONTENT="index,follow,noarchive">
means: index this page and follow its links, but do not keep a cached snapshot of it on Google.
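To see how a crawler might read such directives, here is a small sketch that extracts robots META directives from an HTML page using only Python's standard-library html.parser; the class name and the sample HTML are mine, not from any real crawler.

```python
# Sketch: collect robots/googlebot META directives from an HTML page.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> list of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attrs.get("content") or ""
            self.directives[name] = [d.strip().lower() for d in content.split(",")]

html = '<html><head><meta name="googlebot" content="index,follow,noarchive"></head></html>'
p = RobotsMetaParser()
p.feed(html)
print(p.directives)  # {'googlebot': ['index', 'follow', 'noarchive']}
```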
How to use robots.txt
A robots.txt file restricts the search engine crawlers (sometimes called rovers) that crawl the web. These crawlers are automated, and before visiting a page they check whether a robots.txt file restricts their access to it. If you want to keep some of your site's content from being indexed by search engines, robots.txt is a simple and effective tool. Here is a brief introduction to how to use it.
Where to place the robots.txt file
robots.txt is itself a plain text file. It must live in the root directory of the domain and be named "robots.txt". A robots.txt file placed in a subdirectory has no effect, because crawlers only look for the file in the domain's root directory. For example, http://www.example.com/robots.txt is a valid location, while http://www.example.com/mysite/robots.txt is not.
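This root-only rule is easy to check mechanically; the helper below (a sketch of mine, using the standard library's urllib.parse) simply tests whether a URL's path is exactly /robots.txt.

```python
# Sketch: is this URL a location where crawlers would actually look?
from urllib.parse import urlsplit

def is_valid_robots_location(url: str) -> bool:
    # robots.txt is only honored at the root of a host.
    return urlsplit(url).path == "/robots.txt"

print(is_valid_robots_location("http://www.example.com/robots.txt"))         # True
print(is_valid_robots_location("http://www.example.com/mysite/robots.txt"))  # False
```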
Here is an example robots.txt:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~name/

Using robots.txt to block or remove an entire website
To remove your site from search engines and prevent all crawlers from crawling it in the future, place the following robots.txt in your server's root directory:
User-agent: *
Disallow: /
To remove your site from Google only, and prevent only Googlebot from crawling it in the future, place the following robots.txt in your server's root directory:

User-agent: Googlebot
Disallow: /
Each port should have its own robots.txt file. In particular, if you serve content over both http and https, each protocol needs its own robots.txt file. For example, if you want Googlebot to index all your http pages but none of your https pages, you would use the following robots.txt files.
For the http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /
Allow all crawlers to access your pages:

User-agent: *
Disallow:
(Alternatively, create an empty "/robots.txt" file, or simply have no robots.txt at all.)
Using robots.txt to block or remove individual pages
You can use a robots.txt file to prevent Googlebot from crawling pages on your site. For example, to manually create a robots.txt that blocks Googlebot from every page in a particular directory (say, private), you could use this entry:

User-agent: Googlebot
Disallow: /private

To prevent Googlebot from crawling all files of a particular type (for example, .gif), you could use:

User-agent: Googlebot
Disallow: /*.gif$

To prevent Googlebot from crawling any URL containing a ? (specifically, a URL that begins with your domain name, followed by any string, then a question mark, then any string), you could use:

User-agent: Googlebot
Disallow: /*?
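The * and $ wildcards used above are Google extensions to the original robots.txt convention, and not every parser understands them. One simple way to evaluate such patterns, sketched below under my own simplifying assumptions (a trailing $ anchors the end, * matches any run of characters), is to translate each rule into a regular expression.

```python
# Sketch: match Google-style wildcard rules against a URL path.
import re

def rule_matches(pattern: str, path: str) -> bool:
    # A trailing '$' anchors the rule at the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/*.gif$", "/images/photo.gif"))      # True
print(rule_matches("/*.gif$", "/images/photo.gif?x=1"))  # False
print(rule_matches("/*?", "/search?q=robots"))           # True
```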
Note that even when content is blocked from crawling or indexing via robots.txt, if its URLs are found on other pages on the web, those URLs may still be crawled and indexed. As a result, a page's URL and other public information about it, such as the anchor text of links pointing to it, can still appear in Google search results. However, the content on the page itself will not be crawled, indexed, or displayed.

As part of Webmaster Tools, Google provides a robots.txt analysis tool. It reads a robots.txt file exactly as Googlebot does and reports the results for Google user-agents (such as Googlebot). We strongly recommend using it. Before creating a robots.txt file, think through which content users should be able to find and which they should not. Used sensibly, robots.txt lets search engines bring users to your site while keeping private information out of their indexes.
Mistake 1: Every file on my site should be crawled by spiders, so I don't need a robots.txt file. After all, if the file doesn't exist, all search spiders will by default access every page on the site that isn't password-protected.

Whenever a user requests a URL that does not exist, the server logs a 404 error (file not found). Likewise, whenever a search spider requests a robots.txt file that does not exist, the server logs a 404 error, so you should still add a robots.txt to your site.

Mistake 2: Setting robots.txt so that every file can be crawled will increase how much of the site gets indexed. Having spiders index a site's program scripts, style sheets, and similar files does not increase the site's indexing rate; it only wastes server resources. robots.txt should therefore be configured so that search spiders do not index these files; which files to exclude is covered in detail in the usage tips below.

Mistake 3: Since spiders crawling pages wastes server resources, configure robots.txt so that no spider can crawl any page. Doing this means the entire site cannot be indexed by search engines at all.
robots.txt usage tips
1. Whenever a user requests a URL that does not exist, the server logs a 404 error (file not found). Whenever a search spider requests a robots.txt file that does not exist, the server also logs a 404 error, so you should add a robots.txt to your site.
2. Webmasters should keep spider programs away from certain server directories to protect server performance. For example, most web servers store programs in a "cgi-bin" directory, so adding "Disallow: /cgi-bin" to robots.txt is a good idea: it keeps all program files from being indexed by spiders and saves server resources. Files on a typical site that do not need to be crawled include: back-end admin files, program scripts, attachments, database files, code files, style sheets, template files, and navigation and background images. Here is the robots.txt from VeryCMS:

User-agent: *
Disallow: /admin/       # back-end admin files
Disallow: /require/     # program files
Disallow: /attachment/  # attachments
Disallow: /images/      # images
Disallow: /data/        # database files
Disallow: /template/    # template files
Disallow: /css/         # style sheets
Disallow: /lang/        # code files
Disallow: /script/      # script files

3. If your site uses dynamic pages and you create static copies of them to make crawling easier, you need to configure robots.txt to keep the dynamic pages from being indexed, so they are not treated as duplicate content.

4. A robots.txt file can also link directly to your sitemap file, like this:

Sitemap: sitemap.xml

The search engine companies that currently support this are Google, Yahoo, Ask, and MSN; Chinese search engine companies are clearly not yet in this circle. The advantage is that webmasters do not have to go to each search engine's webmaster tools (or a similar webmaster section) to submit their sitemap file; the search engine spiders crawl robots.txt themselves, read the sitemap path, and then crawl the linked pages.

5. Sensible use of robots.txt can also prevent errors during access.
such as , You can't let the searcher go directly to the shopping cart page . Because there is no reason for shopping carts to be included , So you can robots.txt File settings to prevent searchers from directly entering the shopping cart page .
Publisher: Full Stack Programmer. Please credit the source when reprinting: https://javaforall.cn/132200.html Original link: https://javaforall.cn