While simple, a robots.txt file can have a large effect on how search engines crawl your website. This text file is not required, but does provide instructions to search engines on how to crawl the site, and is supported by all major search engines. However, this protocol is purely advisory and can be ignored by web crawling bots if they so choose.
A robots.txt file is composed of disallow and allow statements that tell search engines which sections of the site they should and shouldn’t crawl. Through the use of user-agent statements, you can provide specific allow and disallow statements to particular search engines.
An XML Sitemap declaration can also be added to give search engines an additional signal about your XML Sitemaps or Sitemap Index file.
It is important to remember that the robots.txt file should be found in the root directory of your site.
The robots.txt file should always be located at: https://www.example.com/robots.txt
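Because the file always sits at the root of the host, its location can be derived from any page URL on the site. The following is a minimal Python sketch of that idea; the helper name robots_txt_url and the sample page URL are illustrative assumptions, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL used only for illustration.
print(robots_txt_url("https://www.example.com/category/shoes?page=2"))
# prints: https://www.example.com/robots.txt
```

Note that each host (including each subdomain) gets its own robots.txt under this scheme, which is why the helper keeps the full host name rather than just the domain.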
User agents, in a nutshell, are the way internet browsers and search engine bots identify themselves to web servers. For example, googlebot is the user agent of Google’s web crawler. By utilizing user agents in a robots.txt file, you can include or exclude specific pages, directories, or sections of your website for each search engine. It’s important to note that if you use user agent sections, such as a googlebot section, googlebot will follow only that section and ignore all other sections in the robots.txt file.
Example of a robots.txt file
User-agent: *
Disallow: /ShoppingCartView
Disallow: /webapp/wcs/stores/servel/CheckoutUserLogonView

User-agent: googlebot
Disallow: /catalogs
Disallow: /hidden-categories
In this example, there are two "User-agent" sections. The first applies to all search engines, as designated by the wildcard (*). The second applies only to Googlebot. In this case, Googlebot will follow only the googlebot-specific section and ignore every statement in the other section, so it would still be free to crawl /ShoppingCartView. While it is perfectly valid to have specific User-agent sections, it is key to ensure that each section contains all of the disallow and allow statements that should apply to that crawler.
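To see this section-selection behavior in practice, you can feed the example file to the robots.txt parser in Python's standard library. This is only a sketch: the host www.example.com and the second crawler name, bingbot, are assumptions added for illustration, and Python's parser is one implementation among many, so real search engines may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, one statement per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ShoppingCartView
Disallow: /webapp/wcs/stores/servel/CheckoutUserLogonView

User-agent: googlebot
Disallow: /catalogs
Disallow: /hidden-categories
"""

group_parser = RobotFileParser()
group_parser.parse(ROBOTS_TXT.splitlines())

base = "https://www.example.com"  # assumed host, for illustration only
for agent in ("googlebot", "bingbot"):  # bingbot stands in for "any other bot"
    for path in ("/ShoppingCartView", "/catalogs"):
        print(agent, path, group_parser.can_fetch(agent, base + path))
# prints:
# googlebot /ShoppingCartView True
# googlebot /catalogs False
# bingbot /ShoppingCartView False
# bingbot /catalogs True
```

The googlebot rows show why each section needs its own complete set of statements: the shopping cart is blocked for every other crawler but not for the one that has a dedicated section.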
As discussed above, robots.txt files use two types of statements, and the disallow statement is by far the more common. When you add a disallow statement to your file, you are telling search engines not to crawl certain parts of your site. There are two considerations to take into account when adding a disallow statement.
First, think of a disallow as a sledgehammer: once you add the statement, the pages it covers will not be crawled. Second, these statements should be specific so that only the pages you do not want crawled are included.
As in grade school, capitalization matters with disallow and allow statements: the paths they contain are case-sensitive. The following two disallow statements are different rules:
User-agent: *
Disallow: /cart/
Disallow: /Cart/
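A quick way to confirm the effect of capitalization is to test a file that contains only the lowercase rule. The sketch below uses Python's standard-library parser; the host name and the /checkout paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the lowercase path is disallowed in this test file.
case_parser = RobotFileParser()
case_parser.parse(["User-agent: *", "Disallow: /cart/"])

print(case_parser.can_fetch("*", "https://www.example.com/cart/checkout"))  # prints: False
print(case_parser.can_fetch("*", "https://www.example.com/Cart/checkout"))  # prints: True
```

If your site serves the same content under both spellings, both statements are needed to keep crawlers out of it.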
You may be wondering why anyone would want to prevent a search engine bot from crawling a section of their site. Often, there are sections of a site that provide no benefit by being crawled and indexed by search engines. These may include:
- Shopping carts
- Private folders
- Account and login pages
By disallowing areas with no inherent SEO value, you allow search bots to spend more time and bandwidth crawling the areas of your site that you want in search engine results pages (SERPs). These may include your home page, category pages, product pages, informational pages, and more.
Allow statements can be used to open up smaller subsections of a disallow statement.
In the following example, the disallow statement says not to crawl the content of /folder1/, while the allow statement says to crawl the myfile.html page within that folder. With allow and disallow statements, the more specific statement wins, so search bots will respect this allow statement even though the page is also covered by the disallow.
User-agent: *
Disallow: /folder1/
Allow: /folder1/myfile.html
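The "more specific statement wins" rule can be sketched as a longest-prefix match. The Python function below is an illustrative implementation written for this example, not the code any search engine runs: the name is_allowed is hypothetical, wildcards are not handled, and the tie-break in favor of allow is an assumption based on the commonly documented convention. Python's built-in urllib.robotparser is deliberately not used here because it applies the first matching rule rather than the longest one.

```python
from typing import Iterable, Tuple


def is_allowed(path: str, rules: Iterable[Tuple[str, str]]) -> bool:
    """Decide whether path may be crawled given (directive, prefix) rules."""
    best_len = -1
    allowed = True  # a path matched by no rule may be crawled
    for directive, prefix in rules:
        # Skip empty rules and rules whose prefix does not match the path.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longest matching prefix wins; on a tie, allow wins (assumption).
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/folder1/"), ("Allow", "/folder1/myfile.html")]
print(is_allowed("/folder1/myfile.html", rules))  # prints: True
print(is_allowed("/folder1/other.html", rules))   # prints: False
print(is_allowed("/folder2/page.html", rules))    # prints: True
```

Because the decision depends on prefix length rather than on the order of the lines, the result is the same whether the allow statement is listed before or after the disallow.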
You do not need to rush and add allow statements to your robots.txt file for every page and section of your site you want crawled. However, if you do have sections of your site currently covered by a disallow statement that you want crawled, the allow statement may be useful.
It is important to note that allow statements may not be followed as consistently as disallow statements, because they are a more recent addition to the robots.txt protocol.
XML Sitemap Declarations
Another feature that can be used in the robots.txt file is the XML Sitemap declaration. Since search engine bots start crawling a site by checking the robots.txt file, the declaration gives you an opportunity to notify them of your XML Sitemap(s).
If you do not have an XML Sitemap, don’t worry: this feature, like the robots.txt file as a whole, is not required. You can have multiple XML Sitemap declarations within your file. However, if you have a Sitemap index, you should specify only this index and not each individual Sitemap. Below are examples of XML Sitemap declarations.
Multiple XML Sitemaps
Sitemap: https://www.domain.com/sitemap-products.xml
Sitemap: https://www.domain.com/sitemap-categories.xml
Sitemap: https://www.domain.com/sitemap-blogposts.xml
XML Sitemap index
For clarity, the Sitemap declaration should be placed at the very end of your robots.txt file. For example, here is the content of this site's robots.txt file:
# All robots allowed
User-agent: *
Disallow:

# Sitemap files
Sitemap: https://technicalseo.com/sitemap.xml
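If you want to check which Sitemap declarations a crawler would find in a file like this one, Python's standard-library parser can list them. This is a small sketch assuming Python 3.8 or later (where the site_maps() method is available); the /any-page path is a made-up URL used only to show that the empty Disallow blocks nothing.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content shown above, one statement per line.
ROBOTS_TXT = """\
# All robots allowed
User-agent: *
Disallow:

# Sitemap files
Sitemap: https://technicalseo.com/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the declared Sitemap URLs, or None when there are none.
print(sitemap_parser.site_maps())
# prints: ['https://technicalseo.com/sitemap.xml']

# An empty Disallow value leaves the whole site open to crawling.
print(sitemap_parser.can_fetch("*", "https://technicalseo.com/any-page"))
# prints: True
```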