ROBOTS and SITEMAP
[[image:Exciting Comics 3.jpg|thumb|[Image.]&ensp; The Robots Exclusion Protocol will not prevent bad bots from accessing your website. <ref><code>[[commons:category:robots in art]]</code></ref>]]
[[image:Googleplex with Pride colors 2015.gk.jpg|thumb|[Image.]&ensp; A sitemap will help your webpages to get indexed by search engines quickly and efficiently.&ensp; As of September 2023, [https://www.google.com/ Google] is the most popular search engine in the world, followed in second place by [https://www.bing.com/ Microsoft Bing]. <ref><code>https://www.similarweb.com/engines/</code></ref> <ref><code>https://www.google.com/</code></ref> <ref><code>https://www.bing.com/</code></ref>&ensp; Photograph depicting the corporate headquarters for Google Search in California (<abbr title="United States of America">USA</abbr>) with the Google logo decorated in the rainbow colors of the [[wikipedia:LGBT flag|LGBT (lesbian, gay, bi, trans, et al) flag]] in celebration of [https://obamawhitehouse.archives.gov/the-press-office/2011/05/31/presidential-proclamation-lesbian-gay-bisexual-and-transgender-pride-mon/ <abbr title="United States American">USA</abbr> LGBT Pride Month]. <ref><code>[[commons:category:Google]]</code></ref> <ref><code>[[commons:category:Google logos]]</code></ref> <ref><code>[[wikipedia:Googleplex]]</code></ref> <ref><code>[[wikipedia:LGBT flag]]</code></ref> <ref><code>[[wikipedia:USA LGBT Pride Month#Recognition]]</code></ref> <ref><code>https://obamawhitehouse.archives.gov/the-press-office/2011/05/31/presidential-proclamation-lesbian-gay-bisexual-and-transgender-pride-mon/</code></ref>]]


One of the first files you should add to your website is "<code>/robots.txt</code>". <ref><code>https://www.robotstxt.org/</code></ref>&ensp; This is a plaintext file for the [https://www.robotstxt.org/ Robots Exclusion Protocol] (ROBOTS language). <ref><code>https://www.robotstxt.org/robotstxt.html</code></ref>&ensp; The "<code>/robots.txt</code>" file tells web bots which webdirectories they may access and which they should avoid.

Adding a [https://www.sitemaps.org/ sitemap] to your website lets searchbots find your webpages much faster and more efficiently, so that they can be quickly indexed for search engines.&ensp; Sitemaps can be saved as either "<code>/sitemap.txt</code>" or "<code>/sitemap.xml</code>" and should be placed in the root webdirectory ("<code>/</code>"). <ref><code>https://www.sitemaps.org/</code></ref> <ref><code>https://www.sitemaps.org/protocol.html</code></ref>&ensp; Using plaintext (TXT) is much faster and easier than writing extensible markup language (XML).&ensp; I recommend keeping the sitemap as plaintext, allowing the SITEMAP protocol to join the ranks of the other plaintext website protocols for [[ROBOTS]], [https://www.securitytxt.org/ SECURITY], and [https://humanstxt.org/ HUMANS].


An important thing to remember is that no bot is <em>required</em> to follow the [[wikipedia:Robots Exclusion Protocol|Robots Exclusion Protocol]]. <ref><code>https://www.robotstxt.org/faq/prevent.html</code></ref> <ref><code>https://www.robotstxt.org/faq/blockjustbad.html</code></ref> <ref><code>https://www.robotstxt.org/faq/legal.html</code></ref>&ensp; The protocol only affects the behavior of compliant or well-behaved bots and anyone can program a bot to ignore the Robots Exclusion Protocol.&ensp; As such, you should <em>not</em> use the Robots Exclusion Protocol to try to hide sensitive directories, especially since publicly listing the directories in "<code>/robots.txt</code>" simply gives malicious bots an easy way to find the very directories you don't want them to visit. <ref><code>https://www.robotstxt.org/faq/nosecurity.html</code></ref>&ensp; On Apache HTTP (Hypertext Transfer Protocol) Server, you should use "<code>/.htaccess</code>" (hypertext access) instead to hide directories from public access.
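
For example, on Apache an "<code>.htaccess</code>" file placed inside a directory is one way to deny all public web access to that directory.&ensp; This is only a minimal sketch: it assumes Apache 2.4 or later with an "<code>AllowOverride</code>" configuration that permits authorization directives, so check your own server configuration before relying on it.

<code><syntaxhighlight lang="apache">
# Example "/sandbox/.htaccess" denying all web access to the "/sandbox/" directory.
# Assumes Apache 2.4+ and an "AllowOverride" setting that allows authorization directives.
Require all denied

# Alternatively, keep the public out but let yourself in with HTTP Basic authentication
# (the ".htpasswd" path below is only a placeholder):
# AuthType Basic
# AuthName "webdevelopment"
# AuthUserFile /path/to/.htpasswd
# Require valid-user
</syntaxhighlight></code>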


"<code>robots.txt</code>" will only work from the root webdirectory ("<code>/</code>"). <ref><code>https://www.robotstxt.org/faq/shared.html</code></ref>

Comments are added to the Robots Exclusion Protocol with a hash ("<code>#</code>") at the beginning of a new line.

As with all webtext files, you should use an advanced text editor such as [https://www.notepad-plus-plus.org/ Notepad-Plus-Plus] (not Microsoft Windows Notepad). <ref><code>https://www.notepad-plus-plus.org/</code></ref>&ensp; Files should be saved with [https://www.npp-user-manual.org/docs/preferences/#new-document Unix line endings and UTF-8 (Unicode Transformation Format Eight-Bit) character encoding].

For a public website, you want to allow access to all bots in order to get the site crawled and indexed by as many search engines as possible.&ensp; The following "<code>/robots.txt</code>" allows all bots access to all files and directories (minus anything excluded by "<code>/.htaccess</code>").

<code><highlight lang="robots">
User-agent: *
Disallow:
# All bots can crawl/index all files and directories.
</highlight></code>

Websites and webdirectories that are not publicly indexed on search engines are referred to as the "[[wikipedia:deep web|deep web]]" or "deepnet" (not to be confused with the "[[wikipedia:dark web|dark web]]" or "darknet"). <ref><code>[[wikipedia:deep web]]</code></ref> <ref><code>[[wikipedia:dark web]]</code></ref>&ensp; For example, you may want to create a mirror of your website for testing purposes but don't want the development website publicly indexed since it will create duplicate or misleading results on search engines.&ensp; The following "<code>/robots.txt</code>" creates a "deepnet" website that instructs all bots compliant with the Robots Exclusion Protocol to not crawl or index any part of the site.

<code><highlight lang="robots">
User-agent: *
Disallow: /
# No compliant bots will crawl/index any files or directories.
</highlight></code>


The following "<code>/robots.txt</code>" excludes two webdirectories ("<code>/sandbox/</code>" and "<code>/testbox/</code>") from crawling/indexing but permits access to all other files and directories on the site.

<code><highlight lang="robots">
User-agent: *
Disallow: /sandbox/
Disallow: /testbox/
# Compliant bots will crawl/index all files and directories except for "/sandbox/" and "/testbox/" (exclusion is applied recursively to all subdirectories of "sandbox" and "testbox").
</highlight></code>

[https://www.sitemaps.org/ SITEMAP] is an extension of the Robots Exclusion Protocol to allow listing a sitemap in the "<code>/robots.txt</code>" file.&ensp; Having a precompiled list of links for the website available makes it much easier and more efficient for a bot to crawl and index the site.&ensp; Since the first thing a good bot does when accessing a website is to check for a "<code>/robots.txt</code>" file, it is best to have the link to the sitemap listed directly in the "<code>/robots.txt</code>" file so the bot doesn't have to guess whether or not the website has a sitemap available (which could be either "<code>/sitemap.txt</code>" or "<code>/sitemap.xml</code>").&ensp; The following "<code>/robots.txt</code>" provides an example of a public website with a sitemap.&ensp; Note that unlike the other ROBOTS instructions, the sitemap should be provided with a full URL (uniform resource locator) and not with a relative link.

<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.xml
</highlight></code>


Additional protocols such as [https://www.securitytxt.org/ SECURITY] and [https://humanstxt.org/ HUMANS] can also be added to the Robots Exclusion Protocol, but these are not officially supported and so should be commented out to avoid confusing bots while still allowing nonbot users to find the relevant files.&ensp; The example below shows "<code>/robots.txt</code>" for a public website that includes additional protocols for SITEMAP, SECURITY, and HUMANS.

<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.txt
# Security: https://www.example.net/.well-known/security.txt
# Humans: https://www.example.net/humans.txt
</highlight></code>

Below is the Robots Exclusion Protocol of "<code>[https://www.nicolesharp.net/robots.txt nicolesharp.net/robots.txt]</code>" showing the excluded webdevelopment directories ("<code>/sandbox/</code>" and "<code>/testbox/</code>") plus additional protocols and comments, including a comment line to provide [[attribution]] to the author of the file ([[Nicole Sharp]]).

<code><highlight lang="robots">
User-agent: *
Disallow: /sandbox/
Disallow: /testbox/
Sitemap: https://www.nicolesharp.net/sitemap.txt
# Security: https://www.nicolesharp.net/security.txt
# Humans: https://www.nicolesharp.net/humans.txt
# Robots Exclusion Protocol for Nicole Sharp's Website.
# 2023-09-04 Nicole Sharp
# https://www.nicolesharp.net/
</highlight></code>


== canonical links ==

To create a sitemap, you simply make a plaintext list of each URL (uniform resource locator) for the website, with one URL per line and no other content (no comments).&ensp; Only URLs for a single domain should be included; do not add URLs for subdomains or alias domains.&ensp; You should also only list canonical URLs.&ensp; This means that if a particular webpage can be accessed from multiple URLs, only one URL should be listed for that webpage in the sitemap.

For example, there are many different ways to access [[Nicole Sharp's Homepage]]:

<code><pre>
https://www.nicolesharp.net/
https://www.nicolesharp.net/index.htm
https://www.nicolesharp.net/index.html
https://www.nicolesharp.net/w/
https://www.nicolesharp.net/w/index.php
https://www.nicolesharp.net/w/index.php?title=NikkiWiki
https://www.nicolesharp.net/w/index.php?title=Main_Page
https://www.nicolesharp.net/w/index.php?title=NikkiWiki:Main_Page
https://www.nicolesharp.net/wiki/
https://www.nicolesharp.net/wiki/NikkiWiki
https://www.nicolesharp.net/wiki/Main_Page
https://www.nicolesharp.net/wiki/index
</pre></code>

The canonical URL, however, is
: <code>[[about Nicole Sharp's Homepage|https://www.nicolesharp.net/wiki/NikkiWiki]]</code>
since all of the other URLs redirect to that URL.

=== MediaWiki ===

In [[mw:Main Page|Wikimedia MediaWiki]], canonical URLs are provided by adding
: <code>[[mw:$wgEnableCanonicalServerLink|$wgEnableCanonicalServerLink]] = true;</code>
to "<code>LocalSettings.php</code>". <ref><code>[[mw:$wgEnableCanonicalServerLink]]</code></ref>
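
With this setting enabled, MediaWiki adds a canonical link element to the HTML head of every page, similar to the sketch below (the URL shown is just an example).

<code><syntaxhighlight lang="html">
<!-- Canonical link element emitted in the HTML head when $wgEnableCanonicalServerLink is enabled. -->
<link rel="canonical" href="https://www.nicolesharp.net/wiki/NikkiWiki"/>
</syntaxhighlight></code>
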
== no subdomains ==

Here are even more ways to access Nicole Sharp's Homepage:

<code><pre>
https://nicolesharp.net/
https://www.nicolesharp.net/
https://web.nicolesharp.net/
https://en.nicolesharp.net/
https://eng.nicolesharp.net/
https://us.nicolesharp.net/
https://usa.nicolesharp.net/
https://wiki.nicolesharp.net/
https://w.nicolesharp.net/
http://www.nicolesharp.net/
http://nicolesharp.net/
http://nicolesharp.altervista.org/
http://nicolesharp.dreamhosters.com/
https://nicolesharp.dreamhosters.com/
</pre></code>

With the exception of "<code><nowiki>https://www.nicolesharp.net/</nowiki></code>", none of these other URLs should be included in "<code>https://www.nicolesharp.net/sitemap.txt</code>".&ensp; All of the URLs should have the same protocol (either all HTTPS [Hypertext Transfer Protocol Secure] or all HTTP [Hypertext Transfer Protocol]) and all of the URLs should be on the same subdomain (for example, either all with "<code>www</code>" or all without "<code>www</code>").
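
One common way to keep everything on a single canonical protocol and subdomain is to redirect every other hostname to the canonical one.&ensp; The following "<code>.htaccess</code>" sketch is only an illustration: it assumes Apache with mod_rewrite enabled and uses "<code>www.nicolesharp.net</code>" as the canonical host.

<code><syntaxhighlight lang="apache">
# Example "/.htaccess" redirecting every request to the canonical protocol and hostname.
# Assumes Apache 2.4+ with mod_rewrite enabled and "AllowOverride FileInfo" (or broader).
RewriteEngine On
RewriteCond %{HTTPS} !=on [OR]
RewriteCond %{HTTP_HOST} !=www.nicolesharp.net
RewriteRule ^(.*)$ https://www.nicolesharp.net/$1 [R=301,L]
</syntaxhighlight></code>
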
== example ==

The following "<code>/sitemap.txt</code>" example gives a compliant sitemap for "<code>[[Nicole Sharp's Website|https://www.nicolesharp.net/]]</code>":

<code><syntaxhighlight lang="text">
https://www.nicolesharp.net/wiki/NikkiWiki
https://www.nicolesharp.net/wiki/about_NikkiWiki
https://www.nicolesharp.net/wiki/Nicole_Sharp
https://www.nicolesharp.net/wiki/license_for_Nicole_Sharp's_Website
https://www.nicolesharp.net/wiki/analytics_for_Nicole_Sharp's_Website
https://www.nicolesharp.net/wiki/donations
https://www.nicolesharp.net/wiki/security
</syntaxhighlight></code>

Only canonical URLs are included, all of the URLs have the same protocol ("<code>https://</code>"), and all of the URLs are on the same subdomain ("<code>www.nicolesharp.net</code>").&ensp; Each new subdomain will need its own sitemap.
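
If you prefer the XML format instead, the same links can be listed with the [https://www.sitemaps.org/protocol.html SITEMAP XML schema].&ensp; Below is a rough sketch of an equivalent "<code>/sitemap.xml</code>" (only the first two URLs are shown and optional elements such as "<code>&lt;lastmod&gt;</code>" are omitted); the plaintext version above remains the simpler choice.

<code><syntaxhighlight lang="xml">
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal "/sitemap.xml" sketch following https://www.sitemaps.org/protocol.html -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.nicolesharp.net/wiki/NikkiWiki</loc>
  </url>
  <url>
    <loc>https://www.nicolesharp.net/wiki/about_NikkiWiki</loc>
  </url>
</urlset>
</syntaxhighlight></code>
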
== ROBOTS ==

Once your sitemap is completed, you can list it in the [[wikipedia:ROBOTS#Sitemap|Robots Exclusion Protocol]] ("<code>/robots.txt</code>") so that it can be found and indexed by searchbots. <ref><code>[[ROBOTS#SITEMAP]]</code></ref>&ensp; An example "<code>/robots.txt</code>" with a sitemap is given below.

<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.txt
</highlight></code>

== see also ==

* <code>https://www.nicolesharp.net/robots.txt</code>
* <code>https://www.nicolesharp.net/sitemap.txt</code>
* <code>https://www.robotstxt.org/</code>
* <code>https://www.sitemaps.org/</code>
* <code>https://www.securitytxt.org/</code>
* <code>https://humanstxt.org/</code>

== keywords ==

<code>bots, CANONICAL, development, HUMANS, humans.txt, hyperlinks, indexing, links, ROBOTS, robots.txt, SECURITY, security.txt, searchbots, SITEMAP, sitemap.txt, TXT, URLs, web, webcrawlers, webcrawling, webdevelopment, weblinks, WWW</code>

{{#seo:|keywords=bots, CANONICAL, development, HUMANS, humans.txt, hyperlinks, indexing, links, ROBOTS, robots.txt, SECURITY, security.txt, searchbots, SITEMAP, sitemap.txt, TXT, URLs, web, webcrawlers, webcrawling, webdevelopment, weblinks, WWW}}


[[category:webdevelopment]]
[[category:pages with images]]
