How to Use robots.txt: A Detailed Explanation
2022-06-30 23:11:00 [Full Stack Programmer Webmaster]
Hello everyone, good to see you again. I'm your friend Quan Jun.
In China, website managers do not seem to pay much attention to robots.txt. At the request of some friends, today I would like to talk about how to write one.
A basic introduction to robots.txt
robots.txt is a plain text file in which a site manager can declare the parts of the site that robots should not visit, or specify that a given search engine may only index certain content.
When a search robot (sometimes called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its crawling scope from the contents of that file; if the file does not exist, the robot simply crawls along the links it finds.
Also note that robots.txt must be placed in the root directory of the site, and the file name must be all lowercase.
robots.txt syntax
First, let's look at a real robots.txt example: http://www.seovip.cn/robots.txt
Visiting that address, we see the following robots.txt contents:
# Robots.txt file from http://www.seovip.cn
# All robots will spider the domain

User-agent: *
Disallow:
This file allows all search robots to access every file on the www.seovip.cn site.
Syntax notes: text after # is a comment; User-agent: is followed by the name of a search robot (or *, meaning all search robots); Disallow: is followed by a path that must not be accessed.
Below, I will list some specific robots.txt usages:
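These rules can be checked programmatically with Python's standard-library robots.txt parser. The sketch below feeds the example file above to urllib.robotparser and asks whether a page may be fetched; the page URL is purely illustrative.

```python
# Minimal sketch: evaluate robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Instead of fetching over the network, feed the rules directly:
rp.parse([
    "User-agent: *",
    "Disallow:",
])
# An empty Disallow means everything is allowed, for every robot.
print(rp.can_fetch("*", "http://www.seovip.cn/any/page.html"))  # True
```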
Allow all robots to access everything:
User-agent: *
Disallow:
Alternatively, you can create an empty "/robots.txt" file.
Block all search engines from every part of the site:
User-agent: *
Disallow: /
Block all search engines from certain sections of the site (the 01, 02, and 03 directories in this example):
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
Block one particular search engine (BadBot in this example):
User-agent: BadBot
Disallow: /
Allow only one search engine (Crawler in this example):
User-agent: Crawler
Disallow:

User-agent: *
Disallow: /
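In this last pattern, the record for the named robot takes precedence over the catch-all * record. A quick way to convince yourself of the behavior is again urllib.robotparser; the host name here is just a placeholder.

```python
# Sketch: a specific User-agent record overrides the * record.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Crawler",
    "Disallow:",          # Crawler may fetch everything
    "",
    "User-agent: *",
    "Disallow: /",        # everyone else is blocked
])
print(rp.can_fetch("Crawler", "http://example.com/page.html"))  # True
print(rp.can_fetch("BadBot", "http://example.com/page.html"))   # False
```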
In addition, I think it is worth expanding the discussion with a brief introduction to the robots META tag.
The robots META tag targets specific pages. Like other META tags (such as the language used, the page description, and keywords), it is placed inside the page's <head></head> section, and it is used specifically to tell search engine robots how to crawl the page's content.
How to write a robots META tag:
The robots META tag is case-insensitive. name="robots" addresses all search engines; it can also target a specific engine, for example name="BaiduSpider". The content attribute takes four directive values: index, noindex, follow, and nofollow, separated by commas.
The INDEX directive tells the search robot to index the page;
The FOLLOW directive tells the search robot that it may continue crawling along the links on the page;
The default for the robots META tag is INDEX,FOLLOW, with the exception of Inktomi, for which the default is INDEX,NOFOLLOW.
This gives four combinations:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Of these:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="ALL">;
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.
At present, most search engine robots obey the rules in robots.txt, while support for the robots META tag is more limited, though growing. The well-known search engine Google fully supports it, and Google also adds an extra directive, "archive", which controls whether Google keeps a cached snapshot of the page. For example:
<META NAME="googlebot" CONTENT="index,follow,noarchive">
means: index this page and follow its links, but do not keep a cached snapshot of it on Google.
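To see how a crawler might read such directives, here is a small sketch that extracts robots META directives from an HTML page using only Python's standard-library html.parser; the class name and the sample HTML are mine, not from any real crawler.

```python
# Sketch: collect robots/googlebot META directives from an HTML page.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> list of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attrs.get("content") or ""
            self.directives[name] = [d.strip().lower() for d in content.split(",")]

html = '<html><head><meta name="googlebot" content="index,follow,noarchive"></head></html>'
p = RobotsMetaParser()
p.feed(html)
print(p.directives)  # {'googlebot': ['index', 'follow', 'noarchive']}
```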
How to use robots.txt
A robots.txt file restricts the search engine crawlers (sometimes called rovers) that crawl the web. These crawlers are automated, and before visiting a page they check whether a robots.txt file restricts their access to it. If you want to keep some of your site's content from being indexed by search engines, robots.txt is a simple and effective tool. Here is a brief introduction to how to use it.
Where to place the robots.txt file
robots.txt is itself a plain text file. It must live in the root directory of the domain and be named "robots.txt". A robots.txt file placed in a subdirectory has no effect, because crawlers only look for the file in the domain's root directory. For example, http://www.example.com/robots.txt is a valid location, while http://www.example.com/mysite/robots.txt is not.
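This root-only rule is easy to check mechanically; the helper below (a sketch of mine, using the standard library's urllib.parse) simply tests whether a URL's path is exactly /robots.txt.

```python
# Sketch: is this URL a location where crawlers would actually look?
from urllib.parse import urlsplit

def is_valid_robots_location(url: str) -> bool:
    # robots.txt is only honored at the root of a host.
    return urlsplit(url).path == "/robots.txt"

print(is_valid_robots_location("http://www.example.com/robots.txt"))         # True
print(is_valid_robots_location("http://www.example.com/mysite/robots.txt"))  # False
```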
Here is an example robots.txt:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~name/

Using robots.txt to block or remove an entire website
To remove your site from search engines and prevent all crawlers from crawling it in the future, place the following robots.txt in your server's root directory:
User-agent: *
Disallow: /
To remove your site from Google only, and prevent only Googlebot from crawling it in the future, place the following robots.txt in your server's root directory:

User-agent: Googlebot
Disallow: /
Each port should have its own robots.txt file. In particular, if you serve content over both http and https, each protocol needs its own robots.txt file. For example, if you want Googlebot to index all your http pages but none of your https pages, you would use the following robots.txt files.
For the http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /
Allow all crawlers to access your pages:

User-agent: *
Disallow:
(Alternatively, create an empty "/robots.txt" file, or simply have no robots.txt at all.)
Using robots.txt to block or remove individual pages
You can use a robots.txt file to prevent Googlebot from crawling pages on your site. For example, to manually create a robots.txt that blocks Googlebot from every page in a particular directory (say, private), you could use this entry:

User-agent: Googlebot
Disallow: /private

To prevent Googlebot from crawling all files of a particular type (for example, .gif), you could use:

User-agent: Googlebot
Disallow: /*.gif$

To prevent Googlebot from crawling any URL containing a ? (specifically, a URL that begins with your domain name, followed by any string, then a question mark, then any string), you could use:

User-agent: Googlebot
Disallow: /*?
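The * and $ wildcards used above are Google extensions to the original robots.txt convention, and not every parser understands them. One simple way to evaluate such patterns, sketched below under my own simplifying assumptions (a trailing $ anchors the end, * matches any run of characters), is to translate each rule into a regular expression.

```python
# Sketch: match Google-style wildcard rules against a URL path.
import re

def rule_matches(pattern: str, path: str) -> bool:
    # A trailing '$' anchors the rule at the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/*.gif$", "/images/photo.gif"))      # True
print(rule_matches("/*.gif$", "/images/photo.gif?x=1"))  # False
print(rule_matches("/*?", "/search?q=robots"))           # True
```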
Note that even when content is blocked from crawling or indexing via robots.txt, if its URLs are found on other pages on the web, those URLs may still be crawled and indexed. As a result, a page's URL and other public information about it, such as the anchor text of links pointing to it, can still appear in Google search results. However, the content on the page itself will not be crawled, indexed, or displayed.

As part of Webmaster Tools, Google provides a robots.txt analysis tool. It reads a robots.txt file exactly as Googlebot does and reports the results for Google user-agents (such as Googlebot). We strongly recommend using it. Before creating a robots.txt file, think through which content users should be able to find and which they should not. Used sensibly, robots.txt lets search engines bring users to your site while keeping private information out of their indexes.
Mistake 1: Every file on my site should be crawled by spiders, so I don't need a robots.txt file. After all, if the file doesn't exist, all search spiders will by default access every page on the site that isn't password-protected.

Whenever a user requests a URL that does not exist, the server logs a 404 error (file not found). Likewise, whenever a search spider requests a robots.txt file that does not exist, the server logs a 404 error, so you should still add a robots.txt to your site.

Mistake 2: Setting robots.txt so that every file can be crawled will increase how much of the site gets indexed. Having spiders index a site's program scripts, style sheets, and similar files does not increase the site's indexing rate; it only wastes server resources. robots.txt should therefore be configured so that search spiders do not index these files; which files to exclude is covered in detail in the usage tips below.

Mistake 3: Since spiders crawling pages wastes server resources, configure robots.txt so that no spider can crawl any page. Doing this means the entire site cannot be indexed by search engines at all.
robots.txt usage tips
1. Whenever a user requests a URL that does not exist, the server logs a 404 error (file not found). Whenever a search spider requests a robots.txt file that does not exist, the server also logs a 404 error, so you should add a robots.txt to your site.
2. Webmasters should keep spider programs away from certain server directories to protect server performance. For example, most web servers store programs in a "cgi-bin" directory, so adding "Disallow: /cgi-bin" to robots.txt is a good idea: it keeps all program files from being indexed by spiders and saves server resources. Files on a typical site that do not need to be crawled include: back-end admin files, program scripts, attachments, database files, code files, style sheets, template files, and navigation and background images. Here is the robots.txt from VeryCMS:

User-agent: *
Disallow: /admin/       # back-end admin files
Disallow: /require/     # program files
Disallow: /attachment/  # attachments
Disallow: /images/      # images
Disallow: /data/        # database files
Disallow: /template/    # template files
Disallow: /css/         # style sheets
Disallow: /lang/        # code files
Disallow: /script/      # script files

3. If your site uses dynamic pages and you create static copies of them to make crawling easier, you need to configure robots.txt to keep the dynamic pages from being indexed, so they are not treated as duplicate content.

4. A robots.txt file can also link directly to your sitemap file, like this:

Sitemap: sitemap.xml

The search engine companies that currently support this are Google, Yahoo, Ask, and MSN; Chinese search engine companies are clearly not yet in this circle. The advantage is that webmasters do not have to go to each search engine's webmaster tools (or a similar webmaster section) to submit their sitemap file; the search engine spiders crawl robots.txt themselves, read the sitemap path, and then crawl the linked pages.

5. Sensible use of robots.txt can also prevent errors during access.
such as , You can't let the searcher go directly to the shopping cart page . Because there is no reason for shopping carts to be included , So you can robots.txt File settings to prevent searchers from directly entering the shopping cart page .
Publisher: Full Stack Programmer. Please credit the source when reprinting: https://javaforall.cn/132200.html Original link: https://javaforall.cn