How to use robots.txt, with a detailed explanation
2022-06-30 23:11:00 【Full stack programmer webmaster】
Hello everyone, nice to meet you again. I'm your friend Quan Jun.
In China, website managers do not seem to pay much attention to robots.txt. At the request of some friends, today I would like to talk about how to write robots.txt.
Basic introduction to robots.txt
robots.txt is a plain text file in which the site manager can declare the parts of the site that should not be visited by robots, or specify that a search engine should only include certain content.
When a search robot (sometimes called a search spider) visits a site, it first checks whether robots.txt exists in the root directory of the site. If it does, the robot determines its scope of access according to the contents of that file; if the file does not exist, the robot simply crawls along links.
In addition, robots.txt must be placed in the root directory of the site, and the file name must be all lowercase.
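For example, placement and casing determine whether crawlers will find the file at all (example.com is a hypothetical domain used only for illustration):

http://www.example.com/robots.txt        found: root directory, all-lowercase file name
http://www.example.com/blog/robots.txt   not checked: crawlers do not look in subdirectories
http://www.example.com/Robots.TXT        risky: on a case-sensitive server, a request for /robots.txt will not match this file name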
robots.txt writing syntax
First, let's look at a robots.txt example: http://www.seovip.cn/robots.txt
Visiting the address above, we can see that the contents of robots.txt are as follows:
# Robots.txt file from http://www.seovip.cn
# All robots will spider the domain

User-agent: *
Disallow:
The text above means that all search robots are allowed to access all files under the site www.seovip.cn.
A brief syntax analysis: text after # is an explanatory comment; User-agent: is followed by the name of a search robot (if the value is *, the rule applies to all search robots); Disallow: is followed by a directory or file that must not be accessed.
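To make the three elements concrete, here is a small hand-written example (ExampleBot and the /private/ directory are hypothetical names used only for illustration): the first group applies only to ExampleBot and keeps it out of /private/, while the second group leaves every other robot unrestricted.

# Keep ExampleBot out of the /private/ directory
User-agent: ExampleBot
Disallow: /private/

# All other robots may access everything
User-agent: *
Disallow: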
Below, I list some specific uses of robots.txt:
Allow all robots to access
User-agent: *
Disallow:
Alternatively, you can create an empty "/robots.txt" file.
Forbid all search engines from accessing any part of the website
User-agent: *
Disallow: /
Forbid all search engines from accessing several parts of the website (the 01, 02, and 03 directories in the example below)
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
Block a particular search engine (BadBot in the example below)
User-agent: BadBot
Disallow: /
Allow only one search engine to access (Crawler in the example below)
User-agent: Crawler
Disallow:

User-agent: *
Disallow: /
In addition, I think it is worth expanding on this with a brief introduction to the robots meta tag.
The Robots META tag targets specific pages. Like other META tags (such as those for language, page description, and keywords), the Robots META tag is placed in the <head></head> section of a page and is used specifically to tell search engine robots how to crawl that page's content.
How to write a Robots META tag:
The Robots META tag is case-insensitive. name="Robots" applies to all search engines; it can also be written for a specific search engine, for example name="BaiduSpider". The content attribute has four directive options: index, noindex, follow, and nofollow, separated by ",".
The INDEX directive tells the search robot that the page may be indexed;
the FOLLOW directive tells the search robot that it may continue crawling along the links on the page.
The default values of the Robots META tag are INDEX and FOLLOW, with the exception of inktomi, for which the default is INDEX, NOFOLLOW.
This gives four combinations:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Among them:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="ALL">;
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.
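As mentioned above, the NAME attribute can also target a single spider instead of all robots. A hypothetical example (assuming you want only Baidu's spider to skip indexing a page and its links):

<META NAME="BaiduSpider" CONTENT="noindex,nofollow">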
At present, most search engine robots follow the rules in robots.txt, while support for the Robots META tag is much less common, although it is growing. The well-known search engine Google fully supports it, and Google has also added a directive, "archive", which controls whether Google keeps a snapshot of the page. For example:
<META NAME="googlebot" CONTENT="index,follow,noarchive">
This means that the page may be indexed and its links may be crawled, but Google will not keep a snapshot of the page.
How to use robots.txt
The robots.txt file restricts the search engine robots (also called crawlers or rovers) that crawl the web. These crawlers are automated; before they visit a page, they check whether a robots.txt file exists that restricts their access to specific pages. If you want to protect some content on your website from being indexed by search engines, robots.txt is a simple and effective tool. Here is a brief introduction to how to use it.
How to place the robots.txt file
robots.txt itself is a plain text file. It must be located in the root directory of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory is invalid, because crawlers only look for this file in the root directory of the domain. For example, http://www.example.com/robots.txt is a valid location, while http://www.example.com/mysite/robots.txt is not.
Here is an example of a robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~name/

Using robots.txt to block or remove the entire website
To remove your website from search engines and prevent all crawlers from crawling it in the future, place the following robots.txt file in the root directory of your server:
User-agent: *
Disallow: /
To remove your website only from Google, and only to prevent Googlebot from crawling it in the future, place the following robots.txt file in the root directory of your server:

User-agent: Googlebot
Disallow: /
Each port should have its own robots.txt file. In particular, if you serve content over both http and https, each protocol needs its own robots.txt file. For example, to let Googlebot index all http pages but no https pages, you would use the following robots.txt files.
For the http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /
Allow all crawlers to access your web pages

User-agent: *
Disallow:
(Another way: create an empty "/robots.txt" file, or do not provide a robots.txt file at all.)
Using robots.txt to block or remove web pages
You can use a robots.txt file to block Googlebot from crawling pages on your website. For example, if you are manually creating a robots.txt file to block Googlebot from crawling all pages in a specific directory (for example, private), you can use the following robots.txt entry:

User-agent: Googlebot
Disallow: /private

To block Googlebot from crawling all files of a specific file type (for example, .gif), you can use the following robots.txt entry:

User-agent: Googlebot
Disallow: /*.gif$

To block Googlebot from crawling any URL that contains a ? (more specifically, any URL that begins with your domain name, followed by any string, then a question mark, and then any further string), you can use the following entry:

User-agent: Googlebot
Disallow: /*?
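To illustrate how these two wildcard patterns behave (example.com and the paths below are hypothetical, used only for illustration):

Disallow: /*.gif$   blocks http://www.example.com/images/photo.gif
                    but not http://www.example.com/images/photo.gif?v=2 (the URL does not end in .gif)
Disallow: /*?       blocks http://www.example.com/search?q=seo
                    but not http://www.example.com/search/seo.html (the URL contains no question mark)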
Although we do not crawl or index the content of pages blocked by robots.txt, we may still crawl and index their URLs if we find them on other pages on the web. As a result, the URL of the page and other public information, such as the anchor text in links pointing to the site, may appear in Google search results. However, the content of your pages will not be crawled, indexed, or displayed.

As part of Webmaster Tools, Google provides a robots.txt analysis tool. It reads a robots.txt file the same way Googlebot does and can report results for Google user-agents (such as Googlebot). We strongly recommend using it. Before creating a robots.txt file, it is worth considering which content should be discoverable by users and which should not. Used sensibly, robots.txt lets search engines bring users to your website while keeping private information out of their indexes.
Common misunderstandings about robots.txt

Mistake 1: Every file on my website needs to be crawled by spiders, so I don't need a robots.txt file at all. After all, if the file does not exist, all search spiders will by default be able to access every page on the website that is not password-protected.

Whenever a user tries to access a URL that does not exist, the server logs a 404 error (file not found). Whenever a search spider requests a robots.txt file that does not exist, the server also logs a 404 error, so you should add a robots.txt file to your website.

Mistake 2: Setting robots.txt so that all files can be crawled by search spiders will increase the site's inclusion rate. Having program scripts, style sheets, and similar files indexed by spiders does not increase the inclusion rate of a website; it only wastes server resources. Therefore, robots.txt should be configured so that search spiders do not index these files. Which files need to be excluded is described in detail in the "robots.txt usage tips" section below.

Mistake 3: Since spiders crawling web pages wastes server resources, robots.txt should be set so that no search spider can crawl any page. If you do this, the entire website cannot be included by search engines.
robots.txt usage tips
1. Whenever a user tries to access a URL that does not exist, the server logs a 404 error (file not found). Whenever a search spider requests a robots.txt file that does not exist, the server also logs a 404 error, so you should add a robots.txt file to your website.
2. Webmasters should keep spider programs away from certain directories on the server in order to protect server performance. For example, most web servers keep programs in the "cgi-bin" directory, so adding "Disallow: /cgi-bin/" to robots.txt is a good idea; it prevents all program files from being indexed by spiders and saves server resources. Files on a typical website that do not need to be crawled by spiders include: back-end administration files, program scripts, attachments, database files, code files, style sheet files, template files, and navigation and background images. Here is the robots.txt file from VeryCMS:

User-agent: *
Disallow: /admin/        # back-end administration files
Disallow: /require/      # program files
Disallow: /attachment/   # attachments
Disallow: /images/       # images
Disallow: /data/         # database files
Disallow: /template/     # template files
Disallow: /css/          # style sheet files
Disallow: /lang/         # code files
Disallow: /script/       # script files

3. If your website uses dynamic pages and you have created static copies of them so that search spiders can crawl them more easily, you should configure robots.txt to keep the dynamic pages from being indexed by spiders, to ensure these pages are not treated as duplicate content.

4. A robots.txt file can also include a link to your sitemap file directly, like this:

Sitemap: sitemap.xml

The search engine companies that currently support this are Google, Yahoo, Ask, and MSN; Chinese search engine companies are clearly not in this circle. The advantage is that webmasters no longer have to go to each search engine's webmaster tools (or a similar webmaster section) to submit their sitemap file: the search engine spider crawls the robots.txt file itself, reads the sitemap path, and then crawls the linked pages.

5. Sensible use of robots.txt can also prevent errors during access. For example, you should not let searchers land directly on the shopping cart page. Since there is no reason for the shopping cart to be indexed, you can use robots.txt to keep searchers from entering the shopping cart page directly.
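Putting tips 4 and 5 together, a complete robots.txt along these lines might look like the sketch below. The /cart/ path and the sitemap URL are hypothetical; note that in practice the Sitemap line is usually given as a full, absolute URL.

User-agent: *
Disallow: /cart/          # keep the shopping cart page out of search results

Sitemap: http://www.example.com/sitemap.xml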
Publisher: Full Stack Programmer. Please indicate the source when reprinting: https://javaforall.cn/132200.html. Link to the original text: https://javaforall.cn