{"id":148,"date":"2022-05-11T18:21:58","date_gmt":"2022-05-12T01:21:58","guid":{"rendered":"https:\/\/peden.ece.uw.edu\/computing\/?page_id=148"},"modified":"2022-05-11T18:21:58","modified_gmt":"2022-05-12T01:21:58","slug":"about-search-engines","status":"publish","type":"page","link":"https:\/\/peden.ece.uw.edu\/computing\/about-search-engines\/","title":{"rendered":"About Search Engines and &#8220;Spiders&#8221;"},"content":{"rendered":"<p>\u2022 <a href=\"#prevent\">How do I prevent search engines from listing my pages?<\/a><br \/>\n\u2022 <a href=\"#encourage\">How can I make sure the search engines include my pages?<\/a><!--more--><\/p>\n<p><a name=\"prevent\"><\/a><\/p>\n<hr class=\"subbar\" \/>\n<h2>How to prevent the search engines from including your pages in their index<\/h2>\n<p>There are three main ways to accomplish this.<\/p>\n<h3>Meta tags<\/h3>\n<p>Most search engines will honor the &#8220;robots&#8221; meta-tag. This is an HTML tag that&#8217;s inserted into the HEAD section of your Web page. The tag looks like this:<\/p>\n<pre>    &lt;meta name=\"robots\" content=\"noindex,nofollow\"&gt;\r\n<\/pre>\n<p>When the search engine sees &#8220;noindex&#8221;, it leaves the page out of its index. &#8220;nofollow&#8221; tells it not to follow any of the links on that page to other pages. You can use either directive by itself, or both together as shown above.<\/p>\n<h3>The robots.txt file<\/h3>\n<p>The various search engine providers have agreed on a standard for excluding Web pages from their indexes. This is done by creating a <a href=\"http:\/\/info.webcrawler.com\/mak\/projects\/robots\/norobots.html\">robots.txt<\/a> file in the root directory of a Web site. When a search engine spider first visits a site, it looks for this file.<\/p>\n<p>The structure of a robots.txt file is pretty simple. 
It should name one or more spiders (each on a &#8220;User-agent&#8221; line), followed by a list of what that spider shouldn&#8217;t index. Since most people don&#8217;t need different rules for different spiders, you can use &#8220;*&#8221; to indicate all spiders. Comments begin with a pound sign, &#8220;#&#8221;.<\/p>\n<pre>    User-agent: *           # Rules apply to all search engines\r\n    Disallow: \/temp         # Exclude anything starting with \"\/temp\"\r\n                            #   (can be directories or files)\r\n    Disallow: \/private\/     # Exclude the entire directory \"\/private\"\r\n    Disallow: \/secret.html  # Don't index the file secret.html\r\n<\/pre>\n<p>The advantage of using robots.txt is that you can make one rule for a whole directory, instead of adding meta tags to each file. It also makes it easier to track just what is, and isn&#8217;t, being excluded from the search engines. But since the file must be in your Web server&#8217;s root directory, you will have to ask the Webmaster to make additions or changes &#8211; search engines generally do not honor robots.txt files found anywhere else.<\/p>\n<h3>Access control<\/h3>\n<p>Directories that are only accessible to certain users cannot be indexed by the search engines, since the spiders can&#8217;t get into the directory to read the files.<\/p>\n<p><a name=\"encourage\"><\/a><\/p>\n<hr class=\"subbar\" \/>\n<h2>How to get the search engines to index your pages<\/h2>\n<p>In theory, the major search engines will automatically find (and index) your pages if a link to them exists from another page &#8211; so you normally shouldn&#8217;t have to do anything. On rare occasions, however, you may find you need to specifically tell them about your page.<\/p>\n<p>Each of the search engines has a form you can use for this purpose. Visit the search engine&#8217;s site and find the link called &#8220;add your URL&#8221;, &#8220;index your pages&#8221;, or something similar. 
The procedures (and URLs) each engine uses change from time to time, so we&#8217;re not including links to these pages &#8211; but normally you can find links to them from the search engines&#8217; front pages.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>\u2022 How do I prevent search engines from listing my pages? \u2022 How can I make sure the search engines include my pages?<\/p>\n<div><a class=\"more\" href=\"https:\/\/peden.ece.uw.edu\/computing\/about-search-engines\/\">Read more<\/a><\/div>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"tags":[4,17],"class_list":["post-148","page","type-page","status-publish","hentry","tag-faq","tag-web"],"_links":{"self":[{"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/pages\/148","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/comments?post=148"}],"version-history":[{"count":1,"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/pages\/148\/revisions"}],"predecessor-version":[{"id":149,"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/pages\/148\/revisions\/149"}],"wp:attachment":[{"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/media?parent=148"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/peden.ece.uw.edu\/computing\/wp-json\/wp\/v2\/tags?post=148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}