ROBOTS and SECURITY

From NikkiWiki
[[image:Exciting Comics 3.jpg|thumb|[Image.]&ensp; The Robots Exclusion Protocol will not prevent bad bots from accessing your website. <ref><code>[[commons:category:robots in art]]</code></ref>]]
[[image:Rome (Italy), Padlock at Ponte Palatino -- 2013 -- 3.jpg|thumb|[Image.]&ensp; A "<code>/security.txt</code>" file helps to make your website more secure by providing a standardized means for security researchers to contact you about any security vulnerabilities discovered on your website.&ensp; Photo depicts a secured red padlock with a heart-shaped pattern of glittery sequins. <ref><code>[[commons:category:padlocks]]</code></ref> <ref><code>[[commons:category:padlocks by color]]</code></ref> <ref><code>[[commons:category:red padlocks]]</code></ref>]]


One of the first files you should add to your website is "<code>/robots.txt</code>". <ref><code>https://www.rfc-editor.org/rfc/rfc9309</code></ref> <ref><code>https://www.robotstxt.org/</code></ref>&ensp; This is a plaintext file for the Robots Exclusion Protocol (ROBOTS language, Internet Society Request for Comments [RFC] 9309). <ref><code>https://www.robotstxt.org/robotstxt.html</code></ref>&ensp; The "<code>/robots.txt</code>" file instructs web bots which webdirectories they may access and which they should avoid.
The SECURITY website protocol adds a plaintext file at "<code>/security.txt</code>" and/or "<code>/.well-known/security.txt</code>" that tells security researchers how to contact the website administrator in case any security vulnerabilities are discovered on the website. <ref><code>https://www.rfc-editor.org/rfc/rfc9116</code></ref> <ref><code>https://www.securitytxt.org/</code></ref>
 
An important thing to remember is that no bot is <em>required</em> to follow the Robots Exclusion Protocol. <ref><code>https://www.robotstxt.org/faq/prevent.html</code></ref> <ref><code>https://www.robotstxt.org/faq/blockjustbad.html</code></ref> <ref><code>https://www.robotstxt.org/faq/legal.html</code></ref>&ensp; The protocol only affects the behavior of compliant or well-behaved bots, and anyone can program a bot to ignore it.&ensp; As such, you should <em>not</em> use the Robots Exclusion Protocol to try to hide sensitive directories, especially since publicly listing them in "<code>/robots.txt</code>" gives malicious bots an easy way to find the very directories you don't want them to visit. <ref><code>https://www.robotstxt.org/faq/nosecurity.html</code></ref>&ensp; On Apache HTTP (Hypertext Transfer Protocol) Server, use "<code>/.htaccess</code>" (hypertext access) configuration files instead to block public access to such directories, as sketched below.
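
For example, on Apache 2.4 or later (a minimal sketch, assuming "<code>.htaccess</code>" overrides are enabled in the server configuration), a one-line "<code>.htaccess</code>" file closes a directory to all public access:

<code><syntaxhighlight lang="apache">
# Deny all HTTP clients access to this directory and its subdirectories.
Require all denied
</syntaxhighlight></code>

Unlike a "<code>robots.txt</code>" exclusion, this rule is enforced by the webserver itself, so it also blocks noncompliant bots.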


== documentation ==


* [https://www.rfc-editor.org/rfc/rfc9309 Internet Society Request for Comments (RFC) 9309: Robots Exclusion Protocol]
* [https://www.rfc-editor.org/rfc/rfc9116 Internet Society RFC 9116: A File Format to Aid in Security Vulnerability Disclosure]
* [https://www.robotstxt.org/ <code>robotstxt.org</code>: The Web Robots Pages]
* [https://www.securitytxt.org/ <code>security.txt</code>: A Proposed Standard Which Allows Websites to Define Security Policies]
* [https://developers.google.com/search/docs/crawling-indexing/robots/ Google Developers: Introduction to <code>robots.txt</code>]
* [[wikipedia:security.txt|<code>security.txt</code> (Wikipedia)]]
* [[wikipedia:robots.txt|<code>robots.txt</code> (Wikipedia)]]


== editor ==


{{webtext editor}}
== HUMANS ==
SECURITY is somewhat redundant with <u>[[HUMANS]]</u> and more technical to set up and use.&ensp; If you already have "<code>/humans.txt</code>" then you don't really need "<code>/security.txt</code>", but it can still be helpful, since it provides a standardized way for security researchers (as opposed to humans in general) to reach you if a security vulnerability is discovered on your website.
* <u><code>https://www.nicolesharp.net/humans.txt</code></u>


== directory ==
== directory ==


"<code>robots.txt</code>" will only work from the root webdirectory ("<code>/</code>"). <ref><code>https://www.robotstxt.org/faq/shared.html</code></ref>
I recommend putting "<code>security.txt</code>" in the root webdirectory ("<code>/</code>") together with "<code>[[ROBOTS|/robots.txt]]</code>", "<code>[[SITEMAP|/sitemap.txt]]</code>", and "<code>/humans.txt</code>", but a copy should also be placed in "<code>/.well-known/</code>", since that is the location recommended by the protocol.&ensp; When you update "<code>security.txt</code>", remember to save it to both locations (or redirect one location to the other, as sketched below).
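
Instead of maintaining two copies, one alternative (a minimal sketch, assuming Apache HTTP Server with "<code>mod_alias</code>" enabled) is to redirect the well-known path to the root copy so that only one file ever needs updating:

<code><syntaxhighlight lang="apache">
# Serve a single copy of "security.txt":
# requests for the well-known path are redirected to the root file.
Redirect permanent /.well-known/security.txt /security.txt
</syntaxhighlight></code>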


* <u><code>https://www.nicolesharp.net/security.txt</code></u>
* <u><code>https://www.nicolesharp.net/.well-known/security.txt</code></u>


== comments ==


Comments are added to the Robots Exclusion Protocol ("<code>/robots.txt</code>") and to SECURITY ("<code>/security.txt</code>") with a hash ("<code>#</code>") at the beginning of a new line.
<code><highlight lang="robots">
# A comment.
</highlight></code>


== security ==


A canonical "<code>security.txt</code>" file should only be accessible over HTTPS (Hypertext Transfer Protocol Secure).


=== HTTP ===


If your site does not have a security certificate, then you should use a comment in the Robots Exclusion Protocol ("<code>/robots.txt</code>") instead of "<code>security.txt</code>" to provide security contact information. <ref><u><code>[[ROBOTS#SECURITY]]</code></u></ref> <ref><code>https://www.rfc-editor.org/rfc/rfc9309</code></ref> <ref><code>https://www.robotstxt.org/</code></ref>&ensp; In the example Robots Exclusion Protocol below, "<code>security.txt</code>" has been replaced by "<code>security.htm</code>", a nonsecure HTTP link to the security policy webpage that does not use the SECURITY protocol.


<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: http://www.example.net/sitemap.txt
# Security: http://www.example.net/security.htm
# Humans: http://www.example.net/humans.txt
</highlight></code>


== examples ==


=== public ===


For a public website, you want to allow access to all bots in order to get the site crawled and indexed by as many search engines as possible.&ensp; The following "<code>/robots.txt</code>" allows all bots access to all files and directories (apart from anything blocked by "<code>/.htaccess</code>").
<code><highlight lang="robots">
User-agent: *
Disallow:
# All bots can crawl/index all files and directories.
</highlight></code>
 


=== private ===


Websites and webdirectories that are not publicly indexed on search engines are referred to as the "[[wikipedia:deep web|deep web]]" or "deepnet" (not to be confused with the "[[wikipedia:dark web|dark web]]" or "darknet"). <ref><code>[[wikipedia:deep web]]</code></ref> <ref><code>[[wikipedia:dark web]]</code></ref>&ensp; For example, you may want to create a mirror of your website for testing purposes but not want the development site publicly indexed, since it would create duplicate or misleading results on search engines.&ensp; The following "<code>/robots.txt</code>" creates a "deepnet" website by instructing all bots compliant with the Robots Exclusion Protocol not to crawl or index any part of the site.
"<code>/security.txt</code>" for <u><cite class="u">[[Nicole Sharp's Website]]</cite></u> is given below.


<code><highlight lang="robots">
<code><highlight lang="robots">
User-agent: *
Contact: https://www.nicolesharp.net/wiki/Nicole_Sharp
Disallow: /
Expires: 2024-01-18
# No compliant bots will crawl/index any files or directories.
Acknowledgments: https://www.securitytxt.org/
Preferred-Languages: en
Canonical: https://www.nicolesharp.net/security.txt
Policy: https://www.nicolesharp.net/wiki/security_for_Nicole_Sharp's_Website
# Security for Nicole Sharp's Website.
# 2023-09-06 Nicole Sharp
# https://www.nicolesharp.net/
</highlight></code>
</highlight></code>


=== hybrid ===


The following "<code>/robots.txt</code>" excludes two webdirectories ("<code>/sandbox/</code>" and "<code>/testbox/</code>") from crawling/indexing but permits access to all other files and directories on the site.
The "<code>Expires</code>" field should be for either a) the day before your next domain name registration renewal date or b) the day before your next webhosting service renewal date, whichever is soonest.&ensp; If you don't renew your domain name registration or your webhosting service, bad things can happen and your website security policy should be considered voided (since you don't have a website any more).&ensp; This also means that you should update "<code>security.txt</code>" each time you renew your domain name registration and/or webhosting service.


<code><highlight lang="robots">
"<code>Expires</code>" takes the form of an [[wikipedia:ISO 8601|ISO 8601]] date.&ensp; The actual date of expiration depends on timezone so you should set the expiration time to zero hundred hours zulu (UTC) the day before the date of expiration.&ensp; This will put the time of the expiration for the website security policy as somewhere between zero and twenty-four hours before the time of expiration for the website.
User-agent: *
 
Disallow: /sandbox/
== PREFERRED-LANGUAGES ==
Disallow: /testbox/
# Compliant bots will crawl/index all files and directories except for "/sandbox/" and "/testbox/" (exclusion is applied recursively to all subdirectories of "sandbox" and "testbox").
</highlight></code>
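
For example, a policy set to lapse at zero hundred hours UTC on 2024-01-18 would use the full date-and-time form required by RFC 9116:

<code><highlight lang="robots">
Expires: 2024-01-18T00:00:00.000Z
</highlight></code>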


== PREFERRED-LANGUAGES ==


"<code>Preferred-Languages</code>" is a comma-separated list of language tags, typically two-letter [[wikipedia:ISO 639-1|ISO 639-1]] language codes, indicating the languages in which you prefer to receive security reports.


== CANONICAL ==


"<code>Canonical</code>" gives the preferred uniform resource locator (URL) for "<code>security.txt</code>".&ensp; If you forget to update "<code>/.well-known/security.txt</code>", the field tells security researchers that the canonical version is at "<code>/security.txt</code>" instead.&ensp; The canonical URL must be an HTTPS link.


== SITEMAP ==


<u>[[SITEMAP#ROBOTS|SITEMAP]]</u> is an extension of the Robots Exclusion Protocol that allows listing a [https://www.sitemaps.org/ sitemap] in the "<code>/robots.txt</code>" file. <ref><u><code>[[SITEMAP#ROBOTS]]</code></u></ref> <ref><code>https://www.sitemaps.org/</code></ref> <ref><code>https://www.sitemaps.org/protocol.html</code></ref>&ensp; Having a precompiled list of links for the website makes a bot's job of crawling and indexing the site much easier and more efficient.&ensp; Since the first thing a good bot does when accessing a website is check for a "<code>/robots.txt</code>" file, it is best to list the sitemap directly in "<code>/robots.txt</code>" so the bot doesn't have to guess whether the website has a sitemap available (which could be either "<code>/sitemap.txt</code>" or "<code>/sitemap.xml</code>").&ensp; The following "<code>/robots.txt</code>" provides an example of a public website with a sitemap.&ensp; Note that unlike the other ROBOTS instructions, the sitemap should be given as a full URL, not as a relative link.


<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.xml
</highlight></code>


== SECURITY ==


Additional protocols such as <u>[[SECURITY]]</u> and <u>[[HUMANS]]</u> can also be referenced from the Robots Exclusion Protocol, but they are not officially supported, so they should be commented out ("<code>#</code>") to avoid confusing bots while still letting nonbot users find the relevant files.&ensp; Listing SECURITY this way also lets anyone viewing "<code>/robots.txt</code>" know that you have specified a contactpage for reporting security vulnerabilities.&ensp; The example below shows "<code>/robots.txt</code>" for a public website that includes additional protocols for SITEMAP, SECURITY, and HUMANS.


<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.txt
# Security: https://www.example.net/security.txt
# Humans: https://www.example.net/humans.txt
</highlight></code>
 
== Nicole Sharp's Website ==
 
Below is the Robots Exclusion Protocol of "<u><code>[https://www.nicolesharp.net/robots.txt nicolesharp.net/robots.txt]</code></u>", showing excluded webdevelopment directories ("<code>/sandbox/</code>" and "<code>/testbox/</code>") plus additional protocols and comments, including a comment line providing <u>[[attribution]]</u> to the author of the file (<u>[[Nicole Sharp]]</u>).
 
<code><highlight lang="robots">
User-agent: *
Disallow: /sandbox/
Disallow: /testbox/
Sitemap: https://www.nicolesharp.net/sitemap.txt
# Security: https://www.nicolesharp.net/security.txt
# Humans: https://www.nicolesharp.net/humans.txt
 
# Robots Exclusion Protocol for Nicole Sharp's Website.
# 2023-09-04 Nicole Sharp
# https://www.nicolesharp.net/
</highlight></code>
 
== MediaWiki ==
 
[[mw:Main Page|Wikimedia MediaWiki]] automatically applies different META ROBOTS instructions in the HEAD element to different pages, so you should not add any custom ROBOTS instructions via [[mw:HeadScript|HeadScript]]; use MediaWiki's own configuration settings instead, as sketched below.
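
If you do want to change the sitewide default, MediaWiki core provides the "<code>[[mw:$wgDefaultRobotPolicy|$wgDefaultRobotPolicy]]</code>" setting for "<code>LocalSettings.php</code>" (a minimal sketch showing the core default value of "<code>index,follow</code>"):

<code><syntaxhighlight lang="php">
# Default ROBOTS meta policy applied to all pages.
$wgDefaultRobotPolicy = 'index,follow';
# https://www.mediawiki.org/wiki/Manual:$wgDefaultRobotPolicy
</syntaxhighlight></code>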
 
To get webcrawlers to follow links on MediaWiki, add the following to "<code>[[mw:$wgNoFollowLinks|LocalSettings.php]]</code>". <ref><code>[[mw:$wgNoFollowLinks]]</code></ref>
 
<code><syntaxhighlight lang="php">
$wgNoFollowLinks = false;
# https://www.mediawiki.org/wiki/$wgNoFollowLinks
</syntaxhighlight></code>


== see also ==


* <u><code>https://www.nicolesharp.net/robots.txt</code></u>
* <u><code>[[security for Nicole Sharp's Website]]</code></u>
* <code>https://www.rfc-editor.org/rfc/rfc9309</code>
* <u><code>https://www.nicolesharp.net/security.txt</code></u>
* <code>https://www.robotstxt.org/</code>
* <code>https://www.rfc-editor.org/rfc/rfc9116</code>
* <code>https://developers.google.com/search/docs/crawling-indexing/robots/</code>
* <code>https://www.securitytxt.org/</code>
* <code>[[wikipedia:ROBOTS]]</code>
* <u><code>[[ROBOTS#SECURITY]]</code></u>
* <u><code>[[SITEMAP]]</code></u>
* <u><code>[[SECURITY]]</code></u>
* <u><code>[[HUMANS]]</code></u>


== keywords ==


<code>bots, cybersecurity, development, HUMANS, humans.txt, indexing, ROBOTS, robots.txt, searchbots, SECURITY, security.txt, SITEMAP, sitemap.txt, TXT, web, webcrawlers, webcrawling, webdevelopment, WWW</code>


{{#seo:|keywords=bots, cybersecurity, development, HUMANS, humans.txt, indexing, ROBOTS, robots.txt, searchbots, SECURITY, security.txt, SITEMAP, sitemap.txt, TXT, web, webcrawlers, webcrawling, webdevelopment, WWW}}


[[category:webdevelopment]]
[[category:pages with images]]
