[[image:Exciting Comics 3.jpg|thumb|[Image.]&ensp; The Robots Exclusion Protocol will not prevent bad bots from accessing your website. <ref><code>[[commons:category:robots in art]]</code></ref>]]


One of the first files you should add to your website is "<code>/robots.txt</code>". <ref><code>https://www.robotstxt.org/</code></ref>&ensp; This is a plaintext file for the [https://www.robotstxt.org/ Robots Exclusion Protocol] (ROBOTS language). <ref><code>https://www.robotstxt.org/robotstxt.html</code></ref>&ensp; The "<code>/robots.txt</code>" file instructs web bots which webdirectories they should access and which they should avoid.


An important thing to remember is that no bot is <em>required</em> to follow the [[wikipedia:Robots Exclusion Protocol|Robots Exclusion Protocol]]. <ref><code>https://www.robotstxt.org/faq/prevent.html</code></ref> <ref><code>https://www.robotstxt.org/faq/blockjustbad.html</code></ref> <ref><code>https://www.robotstxt.org/faq/legal.html</code></ref>&ensp; The protocol only affects the behavior of compliant (well-behaved) bots; anyone can program a bot to ignore it.&ensp; As such, you should <em>not</em> use the Robots Exclusion Protocol to try to hide sensitive directories, especially since publicly listing the directories in "<code>/robots.txt</code>" gives malicious bots an easy way to find the very directories you don't want them to visit. <ref><code>https://www.robotstxt.org/faq/nosecurity.html</code></ref>&ensp; On Apache HTTP (Hypertext Transfer Protocol) Server, use a "<code>/.htaccess</code>" (hypertext access) file instead to hide directories from public access.
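

The details are webserver-specific, but as a minimal sketch (assuming Apache 2.4 or later and that the server's "<code>AllowOverride</code>" configuration permits authorization directives; the directory name "<code>/private/</code>" is a placeholder), a "<code>.htaccess</code>" file placed in the directory to be protected could contain the following.

<code><highlight lang="apache">
# /private/.htaccess
# Deny all public web access to this directory and its subdirectories.
# "Require all denied" is Apache 2.4+ syntax.
Require all denied
</highlight></code>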


"<code>robots.txt</code>" will only work from the root webdirectory ("<code>/</code>"). <ref><code>https://www.robotstxt.org/faq/shared.html</code></ref>


Comments are added to the Robots Exclusion Protocol with a hash ("<code>#</code>") at the beginning of a new line.
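

For example, a bot parsing the file skips the following line entirely.

<code><highlight lang="robots">
# This is a comment for human readers; bots ignore it.
</highlight></code>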


<cite class="n">NikkiWiki</cite> is powered by [[mw:What is MediaWiki?|<strong>Wikimedia MediaWiki</strong>]], the same free open-source wikisoftware used for [[wikipedia:Main Page|<cite class="n">Wikipedia</cite>]], [[wikibooks:Main Page|<cite class="n">Wikibooks</cite>]], [[wikiversity:Main Page|<cite class="n">Wikiversity</cite>]], [[commons:Main Page|<cite class="n">Wikimedia Commons</cite>]], and the other wikiprojects of the [https://www.wikimedia.org/ Wikimedia Foundation].
As with all webtext files, you should use an advanced text editor such as [https://www.notepad-plus-plus.org/ Notepad-Plus-Plus] (not Microsoft Windows Notepad). <ref><code>https://www.notepad-plus-plus.org/</code></ref>&ensp; Files should be saved with [https://www.npp-user-manual.org/docs/preferences/#new-document Unix line endings and UTF-8 (Unicode Transformation Format Eight-Bit) character encoding].


<cite class="n">NikkiWiki</cite> is optimized for desktop users.&ensp; A mobile view optimized for mobile web browsers is available but will not provide as rich of an experience for wiki browsing.&ensp; You can switch back and forth between mobile view and desktop view by selecting "Mobile" or "Desktop" on the footer menu.&ensp; Note that desktop view on a mobile browser may display bullets (•) as periods (.) for unordered lists.
== examples ==


=== public ===


For a public website, you want to allow access to all bots in order to get the site crawled and indexed by as many search engines as possible.&ensp; The following "<code>/robots.txt</code>" allows all bots access to all files and directories (apart from anything already blocked at the server level, for example by "<code>/.htaccess</code>").


<cite class="n">Nicole Sharp's Website</cite> was originally <u>[[homepage for Nicole Sharp's Website|published]]</u> in <time datetime="2006">2006</time> as "<code>personal.frostburg.edu/nlsharp0</code>" and remains a nonprofit educational website.
<code><highlight lang="robots">
User-agent: *
Disallow:
# All bots can crawl/index all files and directories.
</highlight></code>


<cite class="n">NikkiWiki</cite> and <cite class="n">Nicole Sharp's Website</cite> are hosted, published, and written in the United States of America.
=== private ===


Websites and webdirectories that are not publicly indexed on search engines are referred to as the "[[wikipedia:deep web|deep web]]" or "deepnet" (not to be confused with the "[[wikipedia:dark web|dark web]]" or "darknet"). <ref><code>[[wikipedia:deep web]]</code></ref> <ref><code>[[wikipedia:dark web]]</code></ref>&ensp; For example, you may want to create a mirror of your website for testing purposes but not want the development site publicly indexed, since that would create duplicate or misleading results on search engines.&ensp; The following "<code>/robots.txt</code>" creates a "deepnet" website by instructing all bots compliant with the Robots Exclusion Protocol not to crawl or index any part of the site.


<code><highlight lang="robots">
User-agent: *
Disallow: /
# No compliant bots will crawl/index any files or directories.
</highlight></code>


=== hybrid ===
 
The following "<code>/robots.txt</code>" excludes two webdirectories ("<code>/sandbox/</code>" and "<code>/testbox/</code>") from crawling/indexing but permits access to all other files and directories on the site.
 
<code><highlight lang="robots">
User-agent: *
Disallow: /sandbox/
Disallow: /testbox/
# Compliant bots will crawl/index all files and directories except for "/sandbox/" and "/testbox/" (exclusion is applied recursively to all subdirectories of "sandbox" and "testbox").
</highlight></code>
 
== SITEMAP ==
 
<u>[[SITEMAP|SITEMAP#ROBOTS]]</u> is an extension of the Robots Exclusion Protocol that allows listing a [https://www.sitemaps.org/ sitemap] in the "<code>/robots.txt</code>" file. <ref><u><code>[[SITEMAP#ROBOTS]]</code></u></ref> <ref><code>https://www.sitemaps.org/</code></ref> <ref><code>https://www.sitemaps.org/protocol.html</code></ref>&ensp; A precompiled list of the website's links makes a bot's job of crawling and indexing the site much easier and more efficient.&ensp; Since the first thing a good bot does when accessing a website is check for a "<code>/robots.txt</code>" file, it is best to list the sitemap's link directly in "<code>/robots.txt</code>" so the bot doesn't have to guess whether the website has a sitemap available (which could be either "<code>/sitemap.txt</code>" or "<code>/sitemap.xml</code>").&ensp; The following "<code>/robots.txt</code>" provides an example of a public website with a sitemap.&ensp; Note that unlike the other ROBOTS instructions, the sitemap must be given as a full URL (uniform resource locator), not as a relative link.
 
<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.xml
</highlight></code>
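
For reference, the plaintext sitemap format itself is simply a list of absolute URLs, one per line, saved in UTF-8; below is a minimal sketch (the URLs are placeholders). <ref><code>https://www.sitemaps.org/protocol.html</code></ref>

<code><highlight lang="text">
https://www.example.net/
https://www.example.net/about/
https://www.example.net/contact/
</highlight></code>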
 
== SECURITY ==
 
References to additional protocols such as [https://www.securitytxt.org/ SECURITY] and [https://humanstxt.org/ HUMANS] can also be added to "<code>/robots.txt</code>", but since they are not officially part of the Robots Exclusion Protocol, they should be commented out: this avoids confusing bots while still allowing nonbot users to find the relevant files.&ensp; The example below shows "<code>/robots.txt</code>" for a public website that includes additional protocols for SITEMAP, SECURITY, and HUMANS.
 
<code><highlight lang="robots">
User-agent: *
Disallow:
Sitemap: https://www.example.net/sitemap.txt
# Security: https://www.example.net/.well-known/security.txt
# Humans: https://www.example.net/humans.txt
</highlight></code>
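
For context, the "<code>security.txt</code>" file referenced above is defined by the SECURITY protocol (RFC 9116) and requires at least a security contact and an expiration date; below is a minimal sketch (the contact address and date are placeholders). <ref><code>https://www.securitytxt.org/</code></ref>

<code><highlight lang="text">
Contact: mailto:security@example.net
Expires: 2026-12-31T23:59:59Z
</highlight></code>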
 
== Nicole Sharp's Website ==
 
Below is the Robots Exclusion Protocol of "<u><code>[https://www.nicolesharp.net/robots.txt nicolesharp.net/robots.txt]</code></u>", showing excluded webdevelopment directories ("<code>/sandbox/</code>" and "<code>/testbox/</code>") plus additional protocols and comments, including a comment line to provide <u>[[attribution]]</u> to the author of the file (<u>[[Nicole Sharp]]</u>).
 
<code><highlight lang="robots">
User-agent: *
Disallow: /sandbox/
Disallow: /testbox/
Sitemap: https://www.nicolesharp.net/sitemap.txt
# Security: https://www.nicolesharp.net/security.txt
# Humans: https://www.nicolesharp.net/humans.txt
 
# Robots Exclusion Protocol for Nicole Sharp's Website.
# 2023-09-04 Nicole Sharp
# https://www.nicolesharp.net/
</highlight></code>
 
== see also ==
 
* <u><code>https://www.nicolesharp.net/robots.txt</code></u>
* <code>https://www.robotstxt.org/</code>
* <u><code>[[SITEMAP]]</code></u>
* <code>https://www.securitytxt.org/</code>
* <code>https://humanstxt.org/</code>
 
== references ==
 
<references />
 
== keywords ==
 
<code>bots, development, HUMANS, humans.txt, indexing, ROBOTS, robots.txt, SECURITY, security.txt, searchbots, SITEMAP, sitemap.txt, TXT, web, webcrawlers, webcrawling, webdevelopment, WWW</code>
 
{{#seo:|keywords=bots, development, HUMANS, humans.txt, indexing, ROBOTS, robots.txt, SECURITY, security.txt, searchbots, SITEMAP, sitemap.txt, TXT, web, webcrawlers, webcrawling, webdevelopment, WWW}}
 
[[category:webdevelopment]]
[[category:pages with images]]
