What is a robots.txt file and what is it for?

- Andrés Cruz


Our daily lives revolve around questions: every day we ask dozens of them of search engines like Google, which, as if by magic, return results matching whatever we asked. But have you ever stopped to wonder how this is possible, and where those results that appear on our screen actually come from?

The answer is that search engines like Google are constantly discovering new sites and re-analyzing existing ones, always in search of data; and one of the first files involved when search engines index our website is robots.txt.

Which robots are we referring to, and what is a robots.txt?

The robots.txt file is nothing more than a plain text file with a .txt extension that lets you keep the robots that analyze websites from indexing unnecessary information; it even lets you block those robots entirely. In other words, it lets you give guidelines or recommendations on how the different robots should index our website.

Robots are generally search engine programs that access the pages that make up our site in order to analyze them; they are also called bots, spiders, crawlers, etc.

We can also tell spiders or crawlers which parts of the site they may crawl or index and which parts they cannot or should not access. These are, however, only recommendations: nothing guarantees that a spider will not try to get into the parts we want to keep out.

This file is directly tied to the organization of our site: in it we indicate which parts of our website are visible and which are not, and with that, how our website is organized. For this reason it is essential that it be written correctly, and for that you can follow the guidelines in this entry.

Some well-known robots

  • Googlebot: Google's crawler, with which Google discovers new pages and updates existing ones.
  • Mediapartners-Google: the Google robot that crawls pages showing AdSense ads.
  • Bingbot: Microsoft's crawler; as with Googlebot, Bingbot is how Microsoft discovers the new pages that feed Bing.

These are some of the most important ones, but there are many more.

When a robot is about to analyze a website, the first thing it does is look for robots.txt at the root of the site, to find out which pages it may index and whether there are sections it should not crawl or index.
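That first step of a well-behaved crawler can be sketched with Python's standard library, which ships a robots.txt parser. The domain and the rules below are hypothetical examples, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a crawler might find at the site root.
robots_txt = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) answers: may this agent crawl this URL?
print(parser.can_fetch("*", "https://tuweb.com/blog/post.html"))  # True
print(parser.can_fetch("*", "https://tuweb.com/admin/login"))     # False
```

A real crawler would fetch the file over HTTP (for example with `parser.set_url(...)` followed by `parser.read()`) before checking each URL it intends to visit.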

What is robots.txt for?

If we decide to use a robots.txt on our website we can achieve a series of benefits:

  • Prohibit areas: this is one of the most common uses; it basically consists of keeping certain areas of our website out of search engines, for example the site administration section, website logs, or any other content we do not want located by these robots.
  • Eliminate duplicate content: search engines will rank us better if we avoid duplicate content.
  • Avoid overloading the server: we can prevent a robot from saturating the server with excessive requests.
  • Indicate the sitemap: point robots to our XML sitemap.
  • Block robots: there are "bad robots" whose sole purpose is to crawl the web looking for email addresses to spam, to name just one example.

If our website will have no "prohibited zones", no sitemap, and no duplicate content, then we should not include a robots.txt on the website, not even an empty one.

What isn't robots.txt used for?

As we have mentioned throughout the article, robots.txt establishes recommendations on how to index our site, but some robots (which we will call "bad robots") may not respect those recommendations for various reasons:

  • If the website has sensitive or confidential information that you do not want located by search engines, it is advisable to use a real security mechanism, such as authentication, to protect that information rather than relying on robots.txt.

Features of robots.txt

Some of the main characteristics that robots.txt must comply with are the following:

  • The file must be unique on our website.
  • It is a plain text file.
  • URLs are case sensitive, and blank lines should not be left between the rules of a group.
  • The file must be at the root of the website.
  • It can indicate the location of the sitemap, the website map in XML.

How do we create the robots.txt file?

The robots.txt file must be stored at the root of our server; it is a file that anyone can access from a browser. You simply create or upload a file called "robots.txt"; it's that easy. In the end the route should look like:

https://tuweb.com/robots.txt

User-agent: as its name indicates, this corresponds to the user agent; there are several types, and it refers to the search engine robots or spiders. You can see a collection of them at the Robots Database link. Its syntax is:
User-agent: -Robot name-
Disallow: tells the spider that we do not want it to access or crawl a URL, subdirectory or directory of our website or application.
Disallow: -resource you want to block-
Allow: contrary to the previous one, it indicates which URLs, folders or subfolders we do want analyzed and indexed.
Allow: -resource you want to index-
Sitemap: indicates the path of our XML sitemap.

To make the rules above more flexible, you can use some special characters: * and $. The asterisk matches any sequence of characters, and the dollar sign marks the end of a URL (for example, /*.php$ matches all URLs ending in .php). The question mark is not a wildcard; it is matched literally, which is useful for targeting URLs that carry query strings.
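As a rough intuition for how these wildcards behave (this is a sketch, not the matching algorithm any particular search engine uses), a rule with * and $ can be translated into a regular expression:

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a regex (illustrative sketch).

    "*" matches any sequence of characters; a trailing "$" anchors
    the match to the end of the URL path.
    """
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

blocked_php = rule_to_regex("/*.php$")
print(bool(blocked_php.match("/index.php")))         # True
print(bool(blocked_php.match("/index.php?page=2")))  # False: "$" requires the path to end in .php
```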

For example, to indicate in the robots.txt file that you want to make a particular directory off-limits to all spiders, you can apply the following rule:

User-agent: *
Disallow: /private-folder/

If you only want to exclude Googlebot:

User-agent: Googlebot
Disallow: /private-folder/

Or Bing's bot:

User-agent: Bingbot
Disallow: /private-folder/

The following rule indicates that all folders whose names begin with private-folder will be private.

User-agent: *
Disallow: /private-folder*/

To block a URL:

Disallow: /android/my-page-locked.html

Robots.txt file example

User-Agent: *
Disallow: /private-folder*/
Disallow: /admin/
Allow: /uploads/images/
Sitemap: https://tuweb.com/sitemap.xml

In the example above, we are telling all spiders not to index or process folders beginning with private-folder, or the admin folder, but to process the resources in /uploads/images/; we are also telling them where our sitemap is.

This same example shows how to allow the content of our images to all agents and, at the same time, indicate where our Sitemap is.

Remember, though, that these are only recommendations made to search spiders so that they do not index a particular folder or resource.
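The effect of rules like these can be checked locally with Python's standard-library parser. Note that urllib.robotparser follows the original robots.txt convention and treats * in paths literally, so this simplified sketch leaves the wildcard rule out; the domain is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Simplified version of the example above, without the "*" wildcard.
# The Allow line comes first because the parser applies the first
# matching rule in file order.
rules = """\
User-agent: *
Allow: /uploads/images/
Disallow: /uploads/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

for url in ("https://tuweb.com/admin/",
            "https://tuweb.com/uploads/secret.pdf",
            "https://tuweb.com/uploads/images/logo.png"):
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)
```

Running this shows /admin/ and /uploads/secret.pdf blocked while /uploads/images/logo.png remains crawlable, mirroring what the robots.txt example asks of well-behaved spiders.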

Conclusions

Adding a robots.txt file to our website is recommended, since it is a way to organize our site: it tells robots which areas are off-limits, where duplicate content is, where the sitemap of our website is, and so on. But we must remember that these are only recommendations; they will not prevent a malicious robot from accessing the areas disallowed in robots.txt. In addition, robots.txt is a public file whose contents anyone with a web browser can read.
