ROBOTS: Difference between revisions

Line 1: Line 1:
[[image:Exciting Comics 3.jpg|thumb|[Image.]&ensp; The Robots Exclusion Protocol will not prevent bad bots from accessing your website.]]
 
One of the first files you should add to your website is "<code>/robots.txt</code>".&ensp; This is a plaintext file for the [https://www.robotstxt.org/ Robots Exclusion Protocol] (ROBOTS language).&ensp; The <code>robots.txt</code> file tells web bots which webdirectories they may access and which they should avoid.


An important thing to remember is that no bot is <em>required</em> to follow the Robots Exclusion Protocol.&ensp; The protocol only affects the behavior of compliant, well-behaved bots, and anyone can program a bot to ignore "<code>robots.txt</code>".&ensp; As such, you should <em>not</em> use the Robots Exclusion Protocol to try to hide sensitive directories: publicly listing them in "<code>robots.txt</code>" simply gives malicious bots an easy way to find the very directories you don't want them to visit.&ensp; To hide directories from public access (on Apache <abbr title="Hypertext Transfer Protocol">HTTP</abbr> Server), use "<code>/.htaccess</code>" (hypertext access) instead.
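
The check a compliant bot performs can be sketched with Python's standard <code>urllib.robotparser</code> module.&ensp; This is only a minimal illustration: the "<code>robots.txt</code>" content, the bot name "<code>ExampleBot</code>", and the <code>example.com</code> addresses are all hypothetical.&ensp; Nothing forces a bot to run such a check, which is why the protocol offers no real protection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path and bot name are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant bot may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved bot asks before fetching; a badly behaved bot simply skips this step.
private_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/notes.html")
public_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/index.html")
print(private_ok, public_ok)
```

Note that the disallowed path is readable by anyone who opens the file, so the rule advertises the directory rather than concealing it.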
Line 31: Line 33:
"<code>robots.txt</code>" should be placed in the root webdirectory ("<code>/</code>"), because compliant bots look for the file only at "<code>/robots.txt</code>".
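
As a rough sketch of why the root location matters, a bot typically derives the address of "<code>robots.txt</code>" from the scheme and host of whatever page it is about to visit.&ensp; The helper name <code>robots_url</code> and the <code>example.com</code> address are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the root-level robots.txt address for the site hosting `page_url`."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; the path is always /robots.txt,
    # and any query string or fragment is discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_url("https://example.com/blog/post.html?page=2")
print(location)
```

A file stored anywhere else, such as a subdirectory, would never be requested under this scheme.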


[https://www.sitemaps.org/ SITEMAP] is an extension of the Robots Exclusion Protocol that allows a sitemap to be listed in the "<code>robots.txt</code>" file.&ensp; Having a precompiled list of the website's links available makes crawling and indexing the site easier and more efficient for the bot.&ensp; Since the first thing a good bot does when accessing a website is check for a "<code>/robots.txt</code>" file, it is best to list the link to the sitemap directly in that file so the bot doesn't have to guess whether the website has a sitemap (which could be at either "<code>/sitemap.txt</code>" or "<code>/sitemap.xml</code>").&ensp; The following example shows a "<code>robots.txt</code>" for a public website with a sitemap.&ensp; Note that, unlike the other ROBOTS instructions, the sitemap should be provided as a full URL (uniform resource locator) and not as a relative link.


<syntaxhighlight lang="text">
<syntaxhighlight lang="text">
Line 39: Line 41:
</syntaxhighlight>
</syntaxhighlight>
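
How a bot might discover the sitemap from "<code>robots.txt</code>" can be sketched with Python's standard <code>urllib.robotparser</code> module (the <code>site_maps()</code> method requires Python 3.8 or later).&ensp; The file content and the <code>example.com</code> address below are hypothetical and are not taken from the example above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a fully public site with one sitemap.
# An empty Disallow value means nothing is off limits.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)
```

Because the sitemap line is stored exactly as written, a full URL is what lets the bot request the sitemap without further guesswork.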


Additional protocols such as [https://www.securitytxt.org/ SECURITY] and [https://humanstxt.org/ HUMANS] can also be added to the Robots Exclusion Protocol, but these are not officially supported, so they should be commented out to avoid confusing bots while still allowing nonbot users to find the relevant files.&ensp; The example below shows a "<code>robots.txt</code>" file for a public website that includes additional protocols for SITEMAP, SECURITY, and HUMANS.


<syntaxhighlight lang="text">
<syntaxhighlight lang="text">
Line 78: Line 80:


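The effect of commenting out unsupported lines can be sketched in Python.&ensp; The comment format used here ("<code># Security: …</code>" and "<code># Humans: …</code>"), the helper name <code>commented_hints</code>, and the <code>example.com</code> addresses are assumptions made purely for illustration; they are not a defined syntax.&ensp; The standard <code>urllib.robotparser</code> module skips the commented lines, while a human or a custom tool can still read them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the commented lines use an assumed, unofficial format.
ROBOTS_TXT = """\
User-agent: *
Disallow:
# Security: https://example.com/.well-known/security.txt
# Humans: https://example.com/humans.txt
"""


def commented_hints(robots_txt: str) -> dict:
    """Collect 'name: value' pairs from comment lines (illustrative helper)."""
    hints = {}
    for line in robots_txt.splitlines():
        if line.startswith("#") and ":" in line:
            # Drop the leading '#' and spaces, then split on the first colon only
            # so the colon inside 'https://' stays part of the value.
            key, value = line.lstrip("# ").split(":", 1)
            hints[key.strip().lower()] = value.strip()
    return hints


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The parser ignores comments, so the rules are unaffected by the extra lines.
still_allowed = parser.can_fetch("ExampleBot", "https://example.com/humans.txt")
hints = commented_hints(ROBOTS_TXT)
print(still_allowed, hints)
```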
[[category:webdevelopment]]
[[category:pages with images]]