If you have been in the web development field for long enough, you know that there are plenty of search engines, and each one of them uses a bot to crawl websites and add them to its index.
As a developer, sometimes you may not wish to allow all pages of a website to be crawled and indexed by every single search engine bot.
This is exactly where the robots.txt file comes in.
This file can allow or disallow search engine bots to crawl a particular page, or even the whole website.
What is the robots.txt file?
As the extension suggests, this is a plain text file. Webmasters use it to instruct web robots on how they should ‘crawl’ a particular website.
This file is part of the REP, the Robots Exclusion Protocol.
This protocol governs how robots crawl the web, index content, and serve it to users.
Apart from robots.txt, the REP also covers meta robots tags, as well as page-, subdirectory-, and site-wide instructions.
A robots.txt file guides specific user-agents through a website, telling them which pages they may crawl and which they may not.
You specify this with directives like Disallow and Allow. The basic format of a robots.txt file is:
User-agent: [user-agent name]
Disallow: [URL not to be crawled]
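For example, a minimal robots.txt that asks every crawler to stay out of a hypothetical /private/ directory (the path is just an illustration) would look like this:
User-agent: *
Disallow: /private/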
How does the robots.txt file work?
Before you actually use a robots.txt file, you need to know how it works.
Whenever a search engine robot discovers your website, it first looks for a robots.txt file at the root of your server.
If it finds one, it reads the instructions inside and crawls accordingly.
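For example, before crawling any page on a site (example.com here is just a placeholder), a well-behaved crawler such as Googlebot first fetches:
https://www.example.com/robots.txt
It then applies the rules listed under User-agent: Googlebot, falling back to the rules under User-agent: * if no group names it specifically.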
A robots.txt file comes in handy when you want to keep parts of your website away from web crawlers (keep in mind that it only asks crawlers to stay away; it does not actually block access to those pages).
These are the situations where you can use robots.txt to good effect (a sample file covering several of these cases follows the list):
- In case you have duplicate pages and you do not want the search engines to find them
- If you wish to exclude your internal search result pages from getting indexed
- In case you wish to stop the search engines from crawling certain pages, or even the whole website
- If you wish to stop the search engines from crawling certain files, like PDFs or image files
- In case you wish to point search engine robots to the location of your sitemap
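As a sketch, a robots.txt covering several of these situations might look like the following; the paths and the sitemap URL are hypothetical:
User-agent: *
Disallow: /search/
Disallow: /duplicate-page.html
Disallow: /files/brochure.pdf
Sitemap: https://www.example.com/sitemap.xml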
Robots.txt syntax that you can use
There are certain directives that you can use while writing a robots.txt file.
Most robots.txt files rely on five directives (a combined example follows the list):
- User-Agent: Use this directive to name the web crawler that the instructions that follow apply to. You can find the names of specific user agents online.
- Disallow: Use this directive to tell a particular user-agent which URLs it is not allowed to crawl.
- Allow: Use this directive to specify a folder or file that you do want certain web crawlers to crawl, even inside an otherwise disallowed section.
- Crawl-delay: This instructs the web crawler to wait a certain amount of time before crawling successive pages. If your website has a high page load time, you can use this to slow the search robots down. Note that not every crawler honors it; Googlebot, for instance, ignores Crawl-delay.
- Sitemap: Use this directive to provide the location of your XML sitemap. Only a few search engine robots support it, but those include the major ones.
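Putting all five directives together, a sample robots.txt might look like this (the folder names and sitemap URL are placeholders):
User-agent: Googlebot
Disallow: /archive/
Allow: /archive/featured/
User-agent: *
Crawl-delay: 10
Disallow: /archive/
Sitemap: https://www.example.com/sitemap.xml
Here Googlebot may crawl /archive/featured/ but nothing else under /archive/, while every other crawler is asked to skip /archive/ entirely and wait 10 seconds between requests.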
How to write robots.txt files?
Now that you know everything about robots.txt, it’s time to write one of your own for your website.
Here are the steps that you need to follow:
Before you get going, you will need access to your web hosting control panel, or an FTP account will do just as well.
Once you have that, log in to your web hosting. Now go to your file manager and create a file named robots.txt.
Make sure to create the file in the root directory where your website is located.
If you do not have access to your web hosting control panel, you can just as easily use an FTP account to create the file.
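On a typical shared host the web root is often a folder such as public_html (the exact name varies by host, so treat this as an assumption), so a file saved at:
public_html/robots.txt
would be served at https://your-domain.com/robots.txt, which is exactly where crawlers look for it.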
In this step you will need to edit the file and write some directives to instruct the web crawlers.
All you need to do is find the user-agent names online and specify the ones that you wish to allow or disallow from crawling your website or certain files.
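Some commonly used user-agent tokens are Googlebot (Google), Bingbot (Bing), Slurp (Yahoo), and DuckDuckBot (DuckDuckGo). For example, to keep only Bingbot out of a hypothetical /drafts/ folder while leaving all other crawlers unrestricted:
User-agent: Bingbot
Disallow: /drafts/
User-agent: *
Disallow:
An empty Disallow: line means nothing is disallowed for that group.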
If you wish to block a certain user-agent entirely, you need to use the following instructions:
User-agent: [name of the robot]
Disallow: /
This will block only the robot whose name you specify. To block all web crawlers, use a * instead, and the instructions will be:
User-agent: *
Disallow: /
To ensure that a search crawler does not crawl a certain file or folder, you can use a similar kind of syntax:
User-agent: *
Disallow: /folder-name/
Disallow: /folder-name/file-name.pdf
This will ensure that the web crawlers won't crawl the specified folder and file (the names here are placeholders). To restrict only certain web crawlers, use their user-agent name instead of the *.
Once you are done with the changes, save the robots.txt file. It will now allow or restrict the web crawlers you specified from the pages, folders, or files you listed.
So, are you ready to implement a robots.txt file and start blocking search engines with it?
If you think you are ready, you can get going with the implementation, but there are certain things to remember while writing the file:
- The robots.txt file is case-sensitive: the file itself must be named robots.txt (all lowercase), and the paths in your directives must match the case of your URLs
- Make sure to put the file in the top-level (root) directory of your web hosting
- Every subdomain, as well as the main domain, needs its own robots.txt file. A robots.txt file written for one domain or subdomain will not work for the others, even if they are on the same server.
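For example (example.com is a placeholder), a file at:
https://www.example.com/robots.txt
applies only to www.example.com; a subdomain such as blog.example.com needs its own file at https://blog.example.com/robots.txt.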