Robots.txt: what it is and how to optimize it for your SEO
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. It is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a page out of Google. To keep a page out of Google, block indexing with noindex or password-protect the page.
The main objective of the robots.txt file is therefore to manage the robot's crawl budget by preventing it from browsing pages with little added value, but which must exist for the user journey (shopping cart, etc.).
How does it work?
Search engines have two fundamental tasks:
- Crawl the web to discover content, and
- Index that content so it can be served to users searching for information.
Explanation:
To crawl sites, search engines follow links to get from one page to the next; they crawl many billions of links and websites. This is called "spidering". When a search engine robot reaches a site, it looks for a robots.txt file. If it finds one, the robot reads that file first before continuing through the site. If the robots.txt file contains no directives disallowing the user agent's activity, or if the site has no robots.txt file at all, the robot crawls the rest of the site.
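To make this concrete, here is a minimal Python sketch of how a polite crawler might apply that check using the standard urllib.robotparser module. The domain, URL, and "MyCrawler" user agent are placeholders for illustration; real crawlers implement far more elaborate logic.

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain).
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # downloads and parses the file

# Before requesting a page, the crawler checks whether its
# user agent is allowed to fetch that URL.
url = "https://www.example.com/some-page"
if parser.can_fetch("MyCrawler", url):
    print(f"Allowed to crawl {url}")
else:
    print(f"robots.txt disallows {url} for this user agent")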
Create a robots.txt file
You can control which files crawlers may access on your site with a robots.txt file. A robots.txt file lives at the root of your site. So, for the site www.example.com, the robots.txt file lives at www.example.com/robots.txt. robots.txt is a plain text file that follows the Robots Exclusion Standard. A robots.txt file consists of one or more rules. Each rule blocks or allows access for a given crawler to a specified file path on that site. Unless you specify otherwise in your robots.txt file, all files are implicitly allowed for crawling.
Here is a simple robots.txt file with two rules:
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml
Here's what that robots.txt file means (a quick way to verify these rules follows the list):
- The user agent named Googlebot is not allowed to crawl any URL that starts with http://example.com/nogooglebot/.
- All other user agents are allowed to crawl the entire site. This could have been omitted and the result would be the same; the default behavior is that user agents are allowed to crawl the entire site.
- The site's sitemap file is located at http://www.example.com/sitemap.xml.
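As a sanity check, the two rules above can be verified by parsing them in memory with Python's urllib.robotparser. The "SomeOtherBot" name is a made-up user agent used only to stand in for "all other user agents".

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is blocked from the /nogooglebot/ section...
print(parser.can_fetch("Googlebot", "http://www.example.com/nogooglebot/page.html"))  # False
# ...but other user agents may crawl the whole site, including that section.
print(parser.can_fetch("SomeOtherBot", "http://www.example.com/nogooglebot/page.html"))  # True
print(parser.can_fetch("Googlebot", "http://www.example.com/about.html"))  # True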
Basic guidelines for creating a robots.txt file
Creating a robots.txt file and making it generally accessible and useful involves four steps:
- Create a file named robots.txt.
- Add rules to the robots.txt file.
- Upload the robots.txt file to your site.
- Test the robots.txt file.
Create a robots.txt file
You can use almost any text editor to create a robots.txt file. For example, Notepad, TextEdit, vi, and emacs can create valid robots.txt files. Don't use a word processor; word processors often save files in a proprietary format and can add unexpected characters, such as curly quotes, which can cause problems for crawlers. Make sure to save the file with UTF-8 encoding if prompted during the save file dialog.
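If you prefer to generate the file programmatically rather than in an editor, a minimal Python sketch like the following writes a plain-text, UTF-8 encoded robots.txt. The rules used here are simply the sample file shown earlier.

from pathlib import Path

# Example rules (the same sample file shown earlier).
rules = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml
"""

# Write a plain text file with UTF-8 encoding, which is what crawlers expect.
Path("robots.txt").write_text(rules, encoding="utf-8")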
Format and location rules:
- The file must be named robots.txt.
- Your site can have only one robots.txt file.
- The robots.txt file must be located at the root of the website host to which it applies. For instance, to control crawling on all URLs below https://www.example.com/, the robots.txt file must be located at https://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at https://example.com/pages/robots.txt). If you're unsure about how to access your website root, or need permissions to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.
- A robots.txt file can apply to subdomains (for example, https://website.example.com/robots.txt) or to non-standard ports (for example, http://example.com:8181/robots.txt).
- A robots.txt file must be a UTF-8 encoded text file (which includes ASCII). Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid. A quick way to check both the location and the encoding is sketched after this list.
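Once the file is uploaded, a simple request like the Python sketch below confirms that it is reachable at the site root and decodes as UTF-8. The domain is a placeholder; replace it with your own site.

from urllib.request import urlopen

# Placeholder domain: replace with your own site.
url = "https://www.example.com/robots.txt"

with urlopen(url) as response:
    status = response.status  # 200 means the file is reachable at the root
    raw = response.read()

# Decoding confirms the file is valid UTF-8 text (ASCII is a subset of UTF-8).
text = raw.decode("utf-8")
print("HTTP status:", status)
print(text[:200])  # preview the first few rules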
Useful robots.txt rules
Here are some common useful robots.txt rules:
1. Disallow crawling of the entire website
Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled.
Note: This does not match the various AdsBot crawlers, which must be named explicitly.
User-agent: *
Disallow: /
2. Disallow crawling of a directory and its contents
Append a forward slash to the directory name to disallow crawling of a whole directory.
Caution: Remember, don't use robots.txt to block access to private content; use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.
User-agent: *
Disallow: /calendar/
Disallow: /junk/
3. Allow access to a single crawler
Only Googlebot-news may crawl the whole site.
User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
4. Allow access to all but a single crawler
Unnecessarybot may not crawl the site; all other bots may.
User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /
5. Disallow crawling of a single web page
For example, disallow the useless_file.html page.
User-agent: *
Disallow: /useless_file.html
6. Block a specific image from Google Images
For example, disallow the dogs.jpg image.
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
7. Block all images on your site from Google Images
Google can't index images and videos without crawling them.
User-agent: Googlebot-Image
Disallow: /
8. Disallow crawling of files of a specific file type
For example, disallow crawling of all .gif files.
User-agent: Googlebot
Disallow: /*.gif$
9. Disallow crawling of an entire site, but allow Mediapartners-Google
This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors on your site.
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
10. Use $ to match URLs that end with a specific string
For example, disallow all .xls files. (A rough sketch of how this wildcard matching works follows the example.)
User-agent: Googlebot
Disallow: /*.xls$
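To illustrate how the * and $ wildcards used in rules 8 and 10 behave, here is a rough Python sketch that translates such a path pattern into a regular expression and tests a few URL paths against it. This is only a simplified approximation for illustration, not Google's actual matching implementation.

import re

def pattern_to_regex(path_pattern: str) -> re.Pattern:
    # Roughly translate a robots.txt path pattern into a regex:
    # '*' matches any sequence of characters, and a trailing '$'
    # anchors the end of the URL path. Without '$', the pattern
    # matches as a prefix.
    regex = re.escape(path_pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile(regex)

rule = pattern_to_regex("/*.xls$")  # from "Disallow: /*.xls$"

for path in ["/reports/q1.xls", "/q1.xls", "/reports/q1.xlsx"]:
    blocked = rule.match(path) is not None
    print(f"{path}: {'disallowed' if blocked else 'allowed'}")
# /reports/q1.xls  -> disallowed
# /q1.xls          -> disallowed
# /reports/q1.xlsx -> allowed (does not end in .xls)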