Nearly all search engines, including Google, recognize www.mysite.com and mysite.com as two different websites. This is bad for SEO: instead of having one site listed highly in the search engines, you will have two sites, each with less importance. See more about SEO/SEM to learn why.

To fix this, we can create a very simple .htaccess file that redirects any traffic from mysite.com to www.mysite.com. That way all incoming traffic hits the same URL, and Google gives all of the page-ranking importance to one website. Open up an ASCII text editor such as Notepad and create a file named htaccess.txt containing RewriteEngine On followed by the redirect rules (see the sketch below). Next, upload this file to your server (to the root directory of your website) and change the filename from "htaccess.txt" to ".htaccess". That's all there is to it.

Although less common, you can also implement this strategy in reverse, sending all www traffic to the non-www URL. To do this, you would replace the aforementioned code with the second variant in the sketch below, which again starts with RewriteEngine On. NOTE: Only use one of these two code segments.
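A minimal sketch of what the two variants might contain, assuming Apache with mod_rewrite enabled (mysite.com is a placeholder for your own domain, and the exact rules are an illustration rather than the page's original file):

# Variant 1: send all mysite.com traffic to www.mysite.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^mysite\.com$ [NC]
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]

# Variant 2: send all www.mysite.com traffic to mysite.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.mysite\.com$ [NC]
RewriteRule ^(.*)$ http://mysite.com/$1 [R=301,L]

The [R=301,L] flags issue a permanent redirect and stop further rule processing, which is what lets search engines consolidate ranking on a single URL.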
Hopefully any questions that you may have had about .htaccess files and www-redirects have just been cleared up. If not, please feel free to send an email to Info@asisbiz.com.

Further reading: http://en.wikipedia.org/wiki/Meta_tags

Checking robots.txt with Google Webmaster Tools (https://www.google.com/webmasters/tools)
The robots.txt analysis tool reads the robots.txt file in the same way Googlebot does. If the tool interprets a line as a syntax error, Googlebot doesn't understand that line. If the tool shows that a URL is allowed, Googlebot interprets that URL as allowed.

This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard: it understands Allow: directives, as well as some pattern matching. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site. The tool also provides a link to the current robots.txt file on your site.

To analyze a site's robots.txt file:
1. Sign into Google Webmaster Tools (https://www.google.com/webmasters/tools) with your Google Account (http://www.google.com/accounts/ManageAccount).
2. On the Dashboard, click the URL for the site you want.
3. Click Tools, and then click Analyze robots.txt.

A Standard for Robot Exclusion

Status of this document
This document represents a consensus on 30 June 1994 on the robots mailing list (robots-request@nexor.co.uk), between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk@info.cern.ch). This document is based on a previous working draft under the same title. It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots. The latest version of this document can be found at http://www.robotstxt.org/wc/robots.html.

Introduction
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page (http://www.robotstxt.org/). In 1993 and 1994 there were occasions where robots visited WWW servers where they weren't welcome, for various reasons. Sometimes these reasons were robot-specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting). These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

The Method
The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The contents of this file are specified below.
This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval.
A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside the scope of this document.

The choice of the URL "/robots.txt" was motivated by several criteria: the filename should fit the file-naming restrictions of all common operating systems; the filename extension should not require extra server configuration; the filename should indicate the purpose of the file and be easy to remember; and the likelihood of a clash with existing files should be minimal.

The Format
The format and semantics of the "/robots.txt" file are as follows: the file consists of one or more records, each containing a User-agent line naming the robot (or robots) the record applies to, and one or more Disallow lines giving partial URLs that are not to be visited. A Disallow value matches any URL that starts with it. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. An empty Disallow value indicates that all URLs can be retrieved.

Examples
The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or /foo.html:
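# a sketch matching the paths named above
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html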
This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
This example indicates that no robots should visit this site further:
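# a sketch: "/" matches every URL, so nothing may be crawled
User-agent: *
Disallow: /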
Example Code
Although it is not part of this specification, some example code in Perl is available in norobots.pl. It is a bit more flexible in its parsing than this document specifies, and is provided as-is, without warranty. Note: This code is no longer available. Instead I recommend using the robots exclusion code in the Perl libwww-perl5 library, available from CPAN (http://www.cpan.org/) in the LWP directory (http://www.cpan.org/modules/by-module/LWP/).

Author's Address
Martijn Koster (http://www.greenhills.co.uk/mak/mak.html)

Creating a robots.txt file
The easiest way to create a robots.txt file is to use the Generate robots.txt tool in Webmaster Tools. Once you've created the file, you can use the Analyze robots.txt tool to make sure that it's behaving as you expect. Once you've created your robots.txt file, save it to the root of your domain with the name robots.txt. This is where robots will check for your file; if it's saved elsewhere, they won't find it. You can also create the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file, and the filename should be lowercase. Google and other search engines treat http://www.example.com, https://www.example.com, and http://example.com as different sites. If you want to restrict crawling on each of these sites, you can create a separate robots.txt file for every version of your site's URL.

Syntax
The simplest robots.txt file uses two rules:
User-agent: the robot the following rule applies to
Disallow: the URL you want to block
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.

User-agents and bots
A user-agent is a specific search engine robot. The Web Robots Database (http://www.robotstxt.org/wc/active.html) lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
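# a sketch; the directory name is a placeholder
User-agent: *
Disallow: /folder1/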
Google uses several different bots (user-agents). The bot we use for our web search is Googlebot. Our other bots, like Googlebot-Mobile and Googlebot-Image, follow the rules you set up for Googlebot, but you can set up specific rules for these bots as well.

Blocking user-agents
The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/). To block the entire site, use a forward slash:
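# blocks every URL on the site
Disallow: /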
To block a directory and everything in it, follow the directory name with a forward slash.
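# the directory name is a placeholder
Disallow: /junk-directory/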
To block a page, list the page.
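# the page name is a placeholder
Disallow: /private_file.html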
To remove a specific image from Google image search, add the following:
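# the image path is a placeholder
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg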
To remove all images on your site from Google image search:
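User-agent: Googlebot-Image
Disallow: /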
To block files of a specific file type (for example, .gif), use the following:
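User-agent: Googlebot
Disallow: /*.gif$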
To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn't share pages with the other Google user-agents. For example:
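User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /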
Note that directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp.

Pattern matching
Googlebot (but not all search engines) respects some pattern matching. To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
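User-agent: Googlebot
Disallow: /private*/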
To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
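User-agent: Googlebot
Disallow: /*?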
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
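User-agent: Googlebot
Disallow: /*.xls$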
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
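User-agent: *
Allow: /*?$
Disallow: /*?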
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, with nothing after the ?). The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

Blocking Googlebot
For instance, to block Googlebot entirely, you can use the following syntax:
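User-agent: Googlebot
Disallow: /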
Allowing Googlebot
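For example, to block every bot except Googlebot, entries along these lines can be used (a sketch; the empty Disallow value means nothing is off-limits to Googlebot):

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: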
Googlebot follows the line directed at it, rather than the line directed at everyone.
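Googlebot also recognizes the Allow directive, which can be combined with Disallow to carve out exceptions. For instance, a sketch that blocks a folder while allowing one file inside it (folder1 and myfile.html are illustrative names):

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html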
Those entries would block all pages inside the folder1 directory except for myfile.html.
Introduction to "robots.txt"There is a hidden, relentless force that permeates the web and its billions of web pages and files, unbeknownst to the majority of us sentient beings. I'm talking about search engine crawlers and robots here. Every day hundreds of them go out and scour the web, whether it's Google trying to index the entire web, or a spam bot collecting any email address it could find for less than honorable intentions. As site owners, what little control we have over what robots are allowed to do when they visit our sites exist in a magical little file called "robots.txt." Robots.txt is a regular text file that through its name, has special meaning to the majority of "honorable" robots on the web. By defining a few rules in this text file, you can instruct robots to not crawl and index certain files, directories within your site, or at all. For example, you may not want Google to crawl the /images directory of your site, as it's both meaningless to you and a waste of your site's bandwidth. "Robots.txt" lets you tell Google just that. Creating your "robots.txt" fileSo lets get moving. Create a regular text file called "robots.txt", and make sure it's named exactly that. This file must be uploaded to the root accessible directory of your site, not a subdirectory (ie: http://www.mysite.com but NOT http://www.mysite.com/stuff/). It is only by following the above two rules will search engines interpret the instructions contained in the file. Deviate from this, and "robots.txt" becomes nothing more than a regular text file, like Cinderella after midnight.
Now that you know what to name your text file and where to upload it, you need to learn what to actually put in it to send commands to search engines that follow this protocol (formally the "Robots Exclusion Protocol"). The format is simple enough for most intents and purposes: a User-agent line to identify the crawler in question, followed by one or more Disallow: lines to disallow it from crawling certain parts of your site.
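For example, a minimal "robots.txt" might read:

User-agent: *
Disallow: /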
With the above declared, all robots (indicated by "*") are instructed to not index any of your pages (indicated by "/"). Most likely not what you want, but you get the idea.
3) The following disallows all search engines and robots from crawling select directories and pages:
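# a sketch; the directories and page named here are placeholders
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /private_file.htm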
4) You can conditionally target multiple robots in "robots.txt." Take a look at the below:
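# a sketch matching the description that follows
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/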
This is interesting: here we declare that crawlers in general should not crawl any part of our site, EXCEPT for Google, which is allowed to crawl the entire site apart from /cgi-bin/ and /privatedir/. So the rules of specificity apply, not inheritance.
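A similar sketch can flip the exception to a different crawler (ia_archiver is commonly given as Alexa's user-agent; that name is an assumption here, not part of the original page):

User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow: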
Here I'm saying all crawlers should be prohibited from crawling our site, except for Alexa (http://pages.alexa.com/help/webmasters/), which is allowed.