As of September 1, 2019, Google no longer supports unpublished rules in the Robots Exclusion Protocol (REP). Developers who still use rules such as noindex therefore have to turn to the various alternatives Google already offers.
The American company recently published details on this topic on its blog for webmasters. Google says it does not want to change the rules of the protocol, but rather to define essentially all the scenarios that were previously left undefined for robots.txt matching and parsing. The company also wants to standardize the protocol and to open-source the C++ library it uses to parse robots.txt files.
Google no longer supports the “noindex” directive in robots.txt files
In the proposal submitted to the Internet Engineering Task Force (IETF):
- Google wants to allow robots.txt to be used with any URI-based transfer protocol, no longer restricting it to HTTP; the protocol could also be used over FTP or CoAP;
- Developers must parse at least the first 500 kibibytes of a robots.txt file; setting a maximum file size also reduces the load on servers (a short sketch illustrating this limit follows this list);
- Google proposes a maximum caching time of 24 hours, or the value of the cache directive if one is provided, so that developers can update their robots.txt file at any time;
- If a previously accessible robots.txt file becomes unavailable because of a server failure, the pages it disallowed are no longer crawled for a “reasonably long period”.
- The American company also announced that it is retiring all code that handles unsupported and unpublished rules, such as noindex. It cites maintaining a healthy ecosystem and preparing for possible future open-source releases as the reason. There are, however, several alternatives to noindex.
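To make the parsing rules above more concrete, here is a minimal sketch in Python. It uses the standard-library urllib.robotparser rather than Google's own open-source parser, and it simply fetches a robots.txt file, keeps only the first 500 kibibytes as the draft suggests, and checks whether a crawler may fetch a URL. The example.com URLs and the “MyCrawler” user agent are placeholders, not anything defined by the proposal.

```python
# Minimal sketch (not Google's parser): fetch a robots.txt file over HTTP,
# keep only the first 500 KiB as the IETF draft suggests, then parse it
# with Python's standard-library robotparser.
import urllib.request
from urllib.robotparser import RobotFileParser

MAX_BYTES = 500 * 1024  # 500 kibibytes, the parsing limit proposed in the draft

def fetch_rules(robots_url: str) -> RobotFileParser:
    with urllib.request.urlopen(robots_url, timeout=10) as resp:
        raw = resp.read(MAX_BYTES)  # ignore anything past the first 500 KiB
    parser = RobotFileParser(robots_url)
    parser.parse(raw.decode("utf-8", errors="replace").splitlines())
    return parser

if __name__ == "__main__":
    rules = fetch_rules("https://www.example.com/robots.txt")  # placeholder URL
    print(rules.can_fetch("MyCrawler", "https://www.example.com/private/page"))
```

A real crawler would also cache the parsed rules, for example for the 24 hours mentioned in the proposal, instead of re-fetching the file on every request.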
Alternatives to using the “noindex” directive in robots.txt
Few webmasters use the noindex rule in robots.txt, but they do exist. Since September 1, they have had to turn to other methods.
The noindex robots meta tag:
The noindex directive, supported both as an HTTP response header (X-Robots-Tag: noindex) and as a robots meta tag in the HTML, is an effective way to remove URLs from the index while still allowing crawling, as the short check sketched below illustrates.
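For illustration, here is a small Python sketch that checks whether a given URL carries a noindex directive, either in the X-Robots-Tag response header or in a `<meta name="robots">` tag in the HTML. The URL is a placeholder and the regular expression is deliberately simple: this is a quick check, not a full HTML parser.

```python
# Quick check for a noindex directive, either in the X-Robots-Tag HTTP
# response header or in a <meta name="robots"> tag in the HTML body.
import re
import urllib.request

def has_noindex(url: str) -> bool:
    with urllib.request.urlopen(url, timeout=10) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        body = resp.read().decode("utf-8", errors="replace")
    if "noindex" in header.lower():
        return True
    # Very simple pattern: <meta name="robots" content="...noindex...">
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        body, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(1).lower())

if __name__ == "__main__":
    print(has_noindex("https://www.example.com/some-page"))  # placeholder URL
```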
The 404 and 410 HTTP status codes:
Both the 404 and 410 status codes mean that the page does not exist. More precisely, 404 indicates that the client reached the server, but the server could not find the requested resource; a server may also send this error when the resource exists but access is not granted. The 410 code indicates that the requested resource is no longer available on the origin server, and that this condition may be permanent. URLs returning these codes are removed from the Google index once they have been crawled and processed, which the small checker sketched below makes easy to verify.
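As a companion sketch, the following Python snippet reports the HTTP status code returned for a list of URLs, so you can confirm that removed pages really answer with 404 or 410. The URLs are placeholders.

```python
# Report the HTTP status code of each URL, so you can confirm that
# removed pages actually answer with 404 or 410.
import urllib.error
import urllib.request

def status_of(url: str) -> int:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 404, 410, etc. are raised as HTTPError

if __name__ == "__main__":
    for url in ["https://www.example.com/deleted-page",
                "https://www.example.com/still-here"]:
        print(url, status_of(url))
```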
Password protection:
A page hidden behind a login is automatically removed from the Google index, unless markup indicates that it is subscription-based or paywalled content. A toy illustration of this kind of protection follows.
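As a rough illustration of the idea (and nothing more than that), here is a Python sketch using the standard-library http.server that refuses to serve a page unless the request carries valid HTTP Basic Auth credentials; a crawler that receives the 401 response never sees the content. The credentials, port, and page content are all placeholders, and a real site would use its web server or CMS for this.

```python
# Toy example only: protect pages with HTTP Basic Auth so that
# unauthenticated clients (including crawlers) get a 401 and no content.
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED = "Basic " + base64.b64encode(b"user:secret").decode()  # placeholder credentials

class ProtectedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="Members only"')
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"<html><body>Private content</body></html>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ProtectedHandler).serve_forever()
```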
Disallow in the robots.txt file:
Search engines can only index pages they know about, so blocking a page from being crawled means its content will not be indexed. Even though the search engine may still index the URL based on links from other pages, without having seen the content, Google intends to make such pages less visible in the future. The sketch below shows how a compliant crawler applies a Disallow rule.
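To make the distinction concrete, here is a short Python sketch that feeds a couple of placeholder Disallow rules to the standard-library parser and shows which URLs a compliant crawler would skip. Remember that this only blocks crawling, not necessarily indexing.

```python
# Demonstrate a Disallow rule: a compliant crawler will not fetch the
# blocked paths, although the URLs themselves may still be indexed
# if other pages link to them.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

for url in ("https://www.example.com/private/report.html",
            "https://www.example.com/index.html"):
    verdict = "blocked" if not parser.can_fetch("MyCrawler", url) else "allowed"
    print(url, "->", verdict)
```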
The Search Console URL removal tool:
This is a quick and easy way to temporarily remove a URL from Google search results. Once the request is accepted, the block lasts for a maximum of 90 days; after that, the information can appear in the results again. If, in the meantime, Googlebot can no longer access the URL, it concludes that the page has been removed and will treat any page later found at that URL as a new one, which may then reappear in Google search results.
What is the Robots Exclusion Protocol (REP)?
Webmaster Martijn Koster created the standard, the Robots Exclusion Protocol (REP), in 1994, after his site was overwhelmed by crawlers. REP then took shape with the contributions of many other webmasters and was later adopted by search engines to help website owners manage their server resources more easily.
The resource is a plain-text file located at the root of the website and contains the list of URLs that search robots should not crawl. Well-behaved robots therefore read robots.txt before crawling the website. On a web server, the robots exclusion rules are stored in this text file.
As a result, resources without a clear public interest are kept out of search engine results, which reduces the load on the HTTP server and the traffic on the network. However, it should not be overlooked that this protocol offers no security guarantee; it is merely a hint for well-behaved robots.
Some bots deliberately ignore the file in order to find personal information. A robots.txt file may also contain the address of the site's XML sitemap, intended for search engines, as shown in the short example below.
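For example, a sitemap reference in robots.txt is just a Sitemap: line with an absolute URL. The sketch below (Python 3.8+, since RobotFileParser.site_maps() was added in that version) parses such a file and prints the declared sitemap addresses. The URLs are placeholders.

```python
# Read the Sitemap entries declared in a robots.txt file.
# Requires Python 3.8+ for RobotFileParser.site_maps().
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml'], or None if absent
```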
We will have to find a way to live in “this new world” without noindex in robots.txt
The web giant used to let webmasters keep pages out of the index with noindex in robots.txt. But since September 1, Googlebot no longer takes into account rules such as nofollow, crawl-delay, or noindex in robots.txt, which were never documented by Google. According to the American company, these rules do not conform to the standard that Martijn Koster drew up in 1994.
Developers who used the noindex directive in robots.txt will therefore have to change their habits and choose alternative solutions, such as the robots meta tag or the X-Robots-Tag noindex header, or the 404 and 410 status codes. This change shows that the search engine, which holds around 90% of the global market share, does not rest on its laurels.
I hope you can see more clearly now!
Feel free to send me your comments.