For a public website, you want to allow access to all bots in order to get the site crawled and indexed by as many search engines as possible.&ensp; The following "<code>/robots.txt</code>" allows all bots access to all directories (minus anything excluded by "<code>/.htaccess</code>").


<code><highlight lang="robots">
User-agent: *
Disallow:
# All bots can crawl/index all files and directories.
</highlight></code>


Websites and webdirectories that are not publicly indexed by search engines form part of the "deep web" (sometimes loosely called the "darknet").&ensp; For example, you may want to create a mirror of your website for testing purposes but not want the development mirror publicly indexed, since it would create duplicate or misleading search results.&ensp; The following "<code>/robots.txt</code>" instructs all bots compliant with the Robots Exclusion Protocol not to crawl or index any part of the site.


<code><highlight lang="robots">
User-agent: *
Disallow: /
# No bots can crawl/index any files or directories.
</highlight></code>


The following "<code>/robots.txt</code>" excludes two webdirectories ("<code>/sandbox/</code>" and "<code>/testbox/</code>") from crawling/indexing but permits access to all other webdirectories on the site.


<code><highlight lang="robots">
User-agent: *
Disallow: /sandbox/
Disallow: /testbox/
# Bots can crawl/index all files and directories except for "/sandbox/" and "/testbox/" (exclusion is applied recursively to all subdirectories of "sandbox" and "testbox").
</highlight></code>
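The effect of such exclusion rules can be checked with the robots.txt parser in Python's standard library.&ensp; The sketch below (using the hypothetical domain "<code>example.net</code>") shows how a compliant bot would apply the two "<code>Disallow</code>" rules above.

```python
from urllib.robotparser import RobotFileParser

# The exclusion rules a compliant bot would fetch from "/robots.txt".
rules = """\
User-agent: *
Disallow: /sandbox/
Disallow: /testbox/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Pages outside the excluded directories may be crawled...
print(parser.can_fetch("*", "https://www.example.net/index.html"))
# ...but anything under "/sandbox/" or "/testbox/" may not.
print(parser.can_fetch("*", "https://www.example.net/sandbox/page.html"))
```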


"<code>robots.txt</code>" should always be placed in the root webdirectory ("<code>/</code>") if possible.
"<code>robots.txt</code>" should always be placed in the root webdirectory ("<code>/</code>") if possible.
Line 36: Line 36:
[https://www.sitemaps.org/ SITEMAP] is an extension of the Robots Exclusion Protocol that allows listing a sitemap in the "<code>robots.txt</code>" file.&ensp; Having a precompiled list of links for the website makes the bot's job of crawling and indexing the site much easier and more efficient.&ensp; Since the first thing a good bot does when accessing a website is check for a "<code>/robots.txt</code>" file, it is best to link the sitemap directly in "<code>/robots.txt</code>" so the bot doesn't have to guess whether the website has a sitemap available (which could be either "<code>/sitemap.txt</code>" or "<code>/sitemap.xml</code>").&ensp; The following example shows a "<code>robots.txt</code>" for a public website with a sitemap.&ensp; Note that unlike the other ROBOTS instructions, the sitemap should be given as a full URL (uniform resource locator), not as a relative link.


<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.xml
</highlight></code>
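A sitemap itself is simply a list of the site's URLs.&ensp; As a minimal sketch, a "<code>/sitemap.xml</code>" in the [https://www.sitemaps.org/ Sitemaps] XML format might look like the following (the domain, paths, and date are placeholders); each URL entry needs only a location, with the last-modified date being optional.

<code><highlight lang="xml">
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.net/</loc>
    <lastmod>2023-09-04</lastmod>
  </url>
  <url>
    <loc>https://www.example.net/about/</loc>
  </url>
</urlset>
</highlight></code>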


Additional protocols such as [https://www.securitytxt.org/ SECURITY] and [https://humanstxt.org/ HUMANS] can also be added to the "<code>robots.txt</code>" file, but since these are not officially part of the Robots Exclusion Protocol, they should be commented out to avoid confusing bots while still allowing nonbot users to find the relevant files.&ensp; The example below shows a "<code>robots.txt</code>" file for a public website that includes additional protocols for SITEMAP, SECURITY, and HUMANS.


<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.xml
# Security: https://www.example.net/.well-known/security.txt
# Humans: https://www.example.net/humans.txt
</highlight></code>
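The "<code>security.txt</code>" file referenced above uses a simple field-value format.&ensp; As a sketch, a minimal "<code>/.well-known/security.txt</code>" might look like the following (the contact address and expiration date are placeholders); per the securitytxt.org specification, the Contact and Expires fields are required.

<code><highlight lang="text">
Contact: mailto:webmaster@example.net
Expires: 2031-12-31T23:59:00.000Z
</highlight></code>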


Below is the Robots Exclusion Protocol of "<u>[https://www.nicolesharp.net/robots.txt <code>nicolesharp.net/robots.txt</code>]</u>" showing hidden directories ("<code>/sandbox/</code>" and "<code>/testbox/</code>") for webdevelopment plus additional protocols and comments, including a comment line to provide <u>[[attribution]]</u> to the author of the file.


<code><highlight lang="robots">
User-agent: *
Disallow: /sandbox/
Disallow: /testbox/
Sitemap: https://www.nicolesharp.net/sitemap.txt
# 2023-09-04 Nicole Sharp
# https://www.nicolesharp.net/
</highlight></code>


== see also ==