One of the first files you should add to your website is "<code>/robots.txt</code>".&ensp; This is a plaintext file for the Robots Exclusion Protocol (ROBOTS).&ensp; The "<code>robots.txt</code>" file tells web bots (crawlers) which webdirectories they may access and which they should avoid.


An important thing to remember is that no bot is <em>required</em> to follow the Robots Exclusion Protocol.&ensp; The protocol only affects the behavior of compliant or well-behaved bots, and anyone can program a bot to ignore "<code>robots.txt</code>".&ensp; As such, you should <em>not</em> use the Robots Exclusion Protocol to try to hide sensitive directories, especially since publicly listing the directories in "<code>robots.txt</code>" simply gives malicious bots an easy way to find the very directories you don't want them to visit.&ensp; To hide directories from public access (on Apache <abbr title="Hypertext Transfer Protocol">HTTP</abbr> Server), you should use "<code>/.htaccess</code>" (hypertext access) instead.
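

For example, a minimal "<code>.htaccess</code>" that denies all web access to a directory might look like the following.&ensp; This is only a sketch in Apache 2.4 syntax; the directives available depend on your server version and configuration.

<syntaxhighlight lang="apache">
# Deny all web access to this directory and its subdirectories (Apache 2.4).
Require all denied
</syntaxhighlight>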


Comments are added to a Robots Exclusion Protocol file by placing a hash ("<code>#</code>") at the beginning of a line.
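

For example, the following line is skipped by compliant bots.

<syntaxhighlight lang="text">
# This line is a comment and is ignored by bots.
</syntaxhighlight>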


For a public website, you want to allow access to all bots in order to get the site crawled and indexed by as many search engines as possible.&ensp; The following "<code>/robots.txt</code>" allows all bots access to all directories (minus anything excluded by "<code>/.htaccess</code>").


<syntaxhighlight lang="text">
User-agent: *
Disallow:
# All bots can crawl/index all files and directories.
</syntaxhighlight>


Websites and webdirectories that are not publicly indexed on search engines are referred to as the "deep web" or "darknet".&ensp; For example, you may want to create a mirror of your website for testing purposes but not want the development website publicly indexed, since it would create duplicate or misleading results on search engines.&ensp; The following "<code>/robots.txt</code>" instructs all bots compliant with the Robots Exclusion Protocol not to crawl or index any part of the site.


<syntaxhighlight lang="text">
User-agent: *
Disallow: /
# No bots can crawl/index any files or directories.
</syntaxhighlight>
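

Note that "<code>Disallow: /</code>" only stops compliant bots from <em>crawling</em>; a page can still appear in search results if other websites link to it.&ensp; For pages that bots are allowed to crawl, the robots <code>meta</code> element (widely supported, though separate from the Robots Exclusion Protocol) can additionally request that a page not be indexed.

<syntaxhighlight lang="html">
<!-- Ask search engines not to index this page or follow its links. -->
<meta name="robots" content="noindex, nofollow">
</syntaxhighlight>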


The following "<code>/robots.txt</code>" only excludes one webdirectory from crawling/indexing but permits access to all other webdirectories on the site.


<syntaxhighlight lang="text">
User-agent: *
Disallow: /testbox/
# Bots can crawl/index all files and directories except for "/testbox/" (exclusion is applied recursively to all subdirectories of "testbox").
</syntaxhighlight>


"<code>robots.txt</code>" should always be placed in the root webdirectory ("<code>/</code>") if possible.


SITEMAP is an extension of the Robots Exclusion Protocol that allows listing a sitemap in the "<code>robots.txt</code>" file.&ensp; Having a precompiled list of the website's links available makes crawling and indexing the site much easier and more efficient.&ensp; Since the first thing a well-behaved bot does when accessing a website is check for a "<code>/robots.txt</code>" file, it is best to link the sitemap directly in "<code>/robots.txt</code>" so that the bot doesn't have to guess whether the website has a sitemap available (which could be either "<code>/sitemap.txt</code>" or "<code>/sitemap.xml</code>").&ensp; The following example shows a "<code>robots.txt</code>" for a public website with a sitemap.&ensp; Note that unlike the other ROBOTS instructions, the sitemap should be provided as a full URL (uniform resource locator), not as a relative link.


<syntaxhighlight lang="text">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.txt
</syntaxhighlight>
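

A plaintext sitemap is simply a list of absolute URLs, one per line.&ensp; A minimal "<code>/sitemap.txt</code>" might look like the following sketch (the URLs are placeholders).

<syntaxhighlight lang="text">
https://www.example.net/
https://www.example.net/about/
https://www.example.net/contact/
</syntaxhighlight>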


Additional protocols such as SECURITY and HUMANS can also be listed in the "<code>robots.txt</code>" file, but these are not officially supported by the Robots Exclusion Protocol and so should be commented out, which avoids confusing bots while still letting nonbot users find the relevant files.&ensp; The example below shows a "<code>robots.txt</code>" file for a public website that includes additional protocols for SITEMAP, SECURITY, and HUMANS.


<syntaxhighlight lang="text">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.txt
# Security: https://www.example.net/.well-known/security.txt
# Humans: https://www.example.net/humans.txt
</syntaxhighlight>
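

For reference, SECURITY ("<code>security.txt</code>", standardized in RFC 9116) publishes security contact information and HUMANS ("<code>humans.txt</code>") credits the people behind a website.&ensp; A minimal "<code>/.well-known/security.txt</code>" might look like the following sketch; the contact address and expiration date are placeholders, and "<code>Contact</code>" and "<code>Expires</code>" are the two required fields.

<syntaxhighlight lang="text">
Contact: mailto:security@example.net
Expires: 2026-01-01T00:00:00.000Z
</syntaxhighlight>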


Below is the Robots Exclusion Protocol of "<u>[https://www.nicolesharp.net/robots.txt <code>nicolesharp.net/robots.txt</code>]</u>", showing a hidden directory ("<code>/testbox/</code>") for webdevelopment plus additional protocols and comments, including a comment line naming the author of the file (required for attribution under the <u>[[NikkiLicense|<cite>Creative Commons Attribution-ShareAlike 4.0 International Public License</cite>]]</u>).

<syntaxhighlight lang="text">
User-agent: *
Disallow: /testbox/
Sitemap: https://www.nicolesharp.net/sitemap.txt
# Security: https://www.nicolesharp.net/security.txt
# Humans: https://www.nicolesharp.net/humans.txt

# Robots Exclusion Protocol for Nicole Sharp's Website.
# 2023-09-04 Nicole Sharp
# https://www.nicolesharp.net/
</syntaxhighlight>


== see also ==


* <u><code>https://www.nicolesharp.net/robots.txt</code></u>
* <code>https://www.robotstxt.org/</code>
* <code>https://www.sitemaps.org/</code>
* <code>https://www.securitytxt.org/</code>
* <code>https://humanstxt.org/</code>


== keywords ==


<code>bots, crawling, development, HUMANS, humans.txt, indexing, ROBOTS, robots.txt, SECURITY, security.txt, searchbots, SITEMAP, sitemap.txt, TXT, web, webdevelopment, WWW</code>


{{#seo:|keywords=bots, crawling, development, HUMANS, humans.txt, indexing, ROBOTS, robots.txt, SECURITY, security.txt, searchbots, SITEMAP, sitemap.txt, TXT, web, webdevelopment, WWW}}


[[category:webdevelopment]]
