
Sometimes you want to download an obscure music album from 2007, released by an artist only three and a half people have heard of. You find a torrent file, start it, the download reaches 14.7% and... that's it. Days and weeks go by, and the download doesn't budge. You start searching for the album on Google, scour forums, and finally find links to some file-hosting services - but they stopped working long ago.

This happens more and more often: copyright holders keep shutting down useful resources. And while popular content is still easy to find, tracking down a seven-year-old TV series in Spanish can be extremely difficult.

Whatever you need on the Internet, there are several ways to find it. We offer all of the following options solely for familiarizing yourself with content, and in no case for piracy.

Usenet

Usenet is a distributed network of servers that synchronize data with one another. Its structure resembles a hybrid of a forum and email: users subscribe to newsgroups and read or post messages in them. As with email, each message has a subject line that indicates the topic. Today Usenet is used mostly for file sharing.

Until 2008, major Usenet providers stored files for only 100–150 days; since then, they have kept files indefinitely. Smaller providers retain content for 1,000 days or more, which is often enough.

Around mid-2001, copyright holders began to take notice of Usenet, forcing providers to remove copyrighted content. But enthusiasts quickly found a workaround: they began giving files obfuscated names, password-protecting archives, and listing them on special invitation-only sites.

In Russia, almost no one knows Usenet exists, which cannot be said of countries where the authorities fight piracy in earnest. Unlike with BitTorrent, a Usenet user's IP address cannot be determined without the cooperation of the Usenet provider or the ISP.

How to connect to Usenet

In most cases, you will not be able to connect for free: free access means either short file retention, low speed, or access to text groups only.

Providers offer two types of paid access: monthly subscriptions with unlimited downloads, or plans with no time limit but a traffic cap. The second option suits those who only occasionally need to download something. The largest providers of such services are Altopia, Giganews, Eweka, NewsHosting, and Astraweb.

Next, you need to figure out where to get NZB files - metadata files that play roughly the same role as torrent files. Special search engines called indexers are used for this.

Indexers

Public indexers are full of spam, but they are still good for finding files uploaded five or more years ago. Here are some of them:

Free indexers that require registration are better suited to finding new files. They are well structured, and listings include not only titles but also descriptions with images. You can try the following:

There are also indexers for specific types of content: for example, anizb for anime fans and albumsindex for those looking for music.

Downloading from Usenet

Take Fraser Park (The FP) as an example - a little-known 2011 film that is nearly impossible to find in 1080p. You need to find its NZB file and feed it to a program such as NZBGet or SABnzbd.
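As a rough sketch, assuming NZBGet is installed and a news server is already configured in nzbget.conf, the command-line flow could look like this (the file name is a placeholder):

nzbget -D                   # start NZBGet as a daemon
nzbget -A The.FP.1080p.nzb  # append the NZB file to the download queue
nzbget -L                   # list the queue to check progress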

How to download via IRC

You need an IRC client - almost any will do, since the vast majority support DCC. Connect to the server you are interested in and start downloading.

Largest book servers:

  • irc.undernet.org, room #bookz;
  • irc.irchighway.net, room #ebooks.

Films:

  • irc.abjects.net, room #moviegods;
  • irc.abjects.net, room #beast-xdcc.

Western and Japanese animation:

  • irc.rizon.net, room #news;
  • irc.xertion.org, room #cartoon-world.

You can use the !find or @find commands to search for files; the bot will send the results in a private message. If possible, prefer the @search command - it queries a special bot that returns the results as a single file rather than a huge stream of text.

Let's try to download How Music Got Free, a book about the music industry by Stephen Witt.


The bot responded to the @search request and sent the results as a ZIP file over DCC.

We send a download request.

And we accept the file.



If you found the file through an indexer, you don't need to search for it in the channel - just send the bot a download request using the command listed on the indexer site.
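As an illustration, a typical session in an IRC client might look like this (the bot name and pack number are placeholders you would take from the search results):

/server irc.abjects.net
/join #moviegods
@search The FP 1080p
/msg SomeBot xdcc send #123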

DC++

In a DC network, all communication goes through a server called a hub. On a hub you can search for specific types of files: audio, video, archives, documents, disk images.

Sharing files in DC++ is very simple: just tick the checkbox next to the folder you want to share. Because of this, you can find the most unexpected things - files their owners forgot about long ago but that may suddenly come in handy for someone else.

How to download via DC++

Any client will do. On Windows the best option is FlylinkDC++; Linux users can use AirDC++ Web, among others.

Search and download are conveniently implemented: enter a query, select a content type, click "Search", and double-click a result to download the file. You can also view the full list of files a user shares and download everything from a selected folder - right-click a search result and choose the corresponding item.



If you don't find something, try again later: people often turn on their DC client only when they need to download something themselves.

Indexers

The built-in search only finds files in the lists of online users. To find rare content, you need an indexer.

The only known option is spacelib.dlinkddns.com and its mirror dcpoisk.no-ip.org. The results are presented as magnet links; clicking one starts the download immediately through your DC client. Keep in mind that the indexer is sometimes unavailable for long stretches - occasionally up to two months.

eDonkey2000 (ed2k), Kad

Like DC++, ed2k is a decentralized data-transfer protocol that relies on centralized servers to help users find and connect to one another. In eDonkey2000 you can find much the same things as in DC++: old TV shows with various dubs, music, software, old games, and textbooks on mathematics and biology. New releases show up here as well.

Recently I needed to set up a search engine for indexing HTML pages and settled on mnoGoSearch. While reading the documentation I jotted down a few points that might come in handy later, so that I wouldn't have to dig through the manuals again. The result is a small cheat sheet; in case it is useful to someone, I'm posting it here.

indexer -E create - creates all the required tables in the database (assuming the database itself has already been created).

indexer -E blob - builds the search index over all crawled data. If the blob storage mode is used, it must be run every time after the indexer finishes; otherwise searches will only cover the older data for which indexer -E blob was previously executed.

indexer -E wordstat - builds an index of all found words. search.cgi uses it when the Suggest option is enabled: if a query returns no results, search.cgi will suggest corrected spellings of the query in case the user made a typo.
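Putting these together, a typical crawl-and-reindex cycle (assuming the blob storage mode and an already created database) might look like this:

indexer              # crawl or re-crawl expired documents listed in indexer.conf
indexer -E blob      # rebuild the blob index so the new data becomes searchable
indexer -E wordstat  # refresh word statistics for the Suggest feature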

Documents are re-indexed only when they are considered expired. The expiration period is set by the Period option, which may appear in the config several times, before each definition of the URLs to be indexed. To re-index all documents regardless of expiration, run indexer -a.

Indexer has the options -t, -g, -u, -s, -y to restrict work to part of the link base: -t restricts by tag, -g by category, -u by part of the URL (SQL LIKE patterns with the % and _ wildcards are supported), -s by HTTP document status, and -y by Content-Type. All restrictions given with the same switch are combined with OR, while groups of different switches are combined with AND.

To clear the entire database, use indexer -C. You can also delete only part of the database using the subsection-specifying switches -t, -g, -u, -s, -y.
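For example, a couple of hedged sketches using the -u switch (the URL pattern is a placeholder):

indexer -u http://site.ru/docs/%      # work only with URLs matching this SQL LIKE pattern
indexer -C -u http://site.ru/docs/%   # delete only that part of the database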

Database statistics for SQL servers

If you run indexer -S, it displays database statistics, including the total number of documents and the number of expired documents for each status. The subsection-specifying switches also work with this command.

Status code values:

  • 0 - new (not yet indexed) document
  • If the status is not 0, it equals the HTTP response code. Some of the HTTP response codes:
  • 200 - "OK" (url indexed successfully)
  • 301 - "Moved Permanently"
  • 302 - "Moved Temporarily"
  • 303 - "See Other"
  • 304 - "Not modified" (url has not been modified since the previous indexing)
  • 401 - "Authorization required" (requires login / password for this document)
  • 403 - "Forbidden" (no access to this document)
  • 404 - "Not found" (the specified document does not exist)
  • 500 - "Internal Server Error" (error in cgi, etc.)
  • 503 - "Service Unavailable"
  • 504 - "Gateway Timeout" (timeout when receiving a document)

HTTP response code 401 means the document is password-protected. You can use the AuthBasic command in indexer.conf to set the login:password pair for the corresponding URL.
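A hypothetical indexer.conf fragment (the credentials and URL are placeholders; Server is the usual command for listing URLs to index):

AuthBasic someuser:somepassword
Server http://site.ru/private/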

Link checking (only for SQL servers)

When launched with the -I switch, indexer displays pairs of URLs and the pages that link to them. This is useful for finding broken links. The subsection-constraint switches also work in this mode; for example, indexer -I -s 404 will show the addresses of all documents that were not found, together with the pages that link to them.

Parallel indexing (only for SQL servers)

MySQL and PostgreSQL users can run several indexers at the same time with the same indexer.conf configuration file. Indexer uses the MySQL and PostgreSQL locking mechanisms to avoid indexing the same documents twice in concurrently running processes. Parallel indexing may not work correctly with the other supported SQL servers. You can also use the multi-threaded version of the indexer with any SQL server that supports parallel connections to the database; the multi-threaded version uses its own locking mechanism.
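As a sketch, running several crawler processes against the same database with the same indexer.conf (MySQL/PostgreSQL only) can be as simple as launching them in parallel:

indexer &
indexer &
indexer &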

It is not recommended to point indexers with different indexer.conf files at the same database: one process may add documents while another deletes those very same documents, and both can keep running indefinitely.

On the other hand, you can run several indexers with different configuration files and different databases on any supported SQL server.

Responding to HTTP response codes

The behavior for each response code is described below in pseudo-code:

  • 200 OK
    1. If the -m ("force reindex") switch is given, go to step 4.
    2. Compare the new checksum of the document with the old one stored in the database.
    3. If the checksums are equal: next_index_time = Now() + Period, go to step 7.
    4. Parse the document, build the list of words, add new hypertext links to the database.
    5. Delete the old list of words and sections from the database.
    6. Insert the new list of words and sections.
    7. Done.
  • 304 Not Modified
    1. next_index_time = Now() + Period
    2. Done.
  • 301 Moved Permanently
  • 302 Moved Temporarily
  • 303 See Other
    1. Delete this document's words from the database.
    2. next_index_time = Now() + Period
    3. Add the URL from the Location header to the database.
    4. Done.
  • 300 Multiple Choices
  • 305 Use Proxy (proxy redirect)
  • 400 Bad Request
  • 401 Unauthorized
  • 402 Payment Required
  • 403 Forbidden
  • 404 Not found
  • 405 Method Not Allowed
  • 406 Not Acceptable
  • 407 Proxy Authentication Required
  • 408 Request Timeout
  • 409 Conflict
  • 410 Gone
  • 411 Length Required
  • 412 Precondition Failed
  • 413 Request Entity Too Large
  • 414 Request-URI Too Long
  • 415 Unsupported Media Type
  • 500 Internal Server Error
  • 501 Not Implemented
  • 502 Bad Gateway
  • 505 Protocol Version Not Supported
    1. Delete the document's words from the database.
    2. next_index_time = Now() + Period
    3. Done.
  • 503 Service Unavailable
  • 504 Gateway Timeout
    1. next_index_time = Now() + Period
    2. Done.
Content-Encoding support

mnoGoSearch supports HTTP compression of requests and responses (Content-Encoding). Compressing HTTP traffic can significantly improve performance by reducing the amount of data transferred.

Using compressed responses can cut traffic by a factor of two or more.

The HTTP 1.1 specification (RFC 2616) defines four methods for encoding the content of server responses: gzip, deflate, compress, and identity.

If Content-Encoding support is enabled, indexer sends the header Accept-Encoding: gzip, deflate, compress to the HTTP server.

If the HTTP server supports any of the gzip, deflate, or compress methods, it will send the response encoded with that method.

To build mnoGoSearch with support for compressing HTTP requests, you need the zlib library.

To enable Content encoding support, you need to configure mnoGoSearch with the following key:
./configure --with-zlib
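A hedged build sketch (the install prefix is an assumption; any standard prefix will do):

./configure --with-zlib --prefix=/usr/local/mnogosearch
make
make install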

Boolean search

To express complex queries, you can use boolean search. To do so, set the search mode to bool in the search form.

MnoGoSearch understands the following boolean operators:

& - logical AND. For example, mysql & odbc. mnoGoSearch will search for URLs containing both "mysql" and "odbc". You can also use the + sign for this operator.

| - logical OR. For example, mysql | odbc. mnoGoSearch will search for URLs containing either the word "mysql" or the word "odbc".

~ - logical NOT. For example, mysql & ~odbc. mnoGoSearch will search for URLs containing the word "mysql" but not the word "odbc". Note that ~ only excludes documents from the result set: the query "~odbc" on its own won't find anything!

() - a grouping operator for building more complex queries. For example, (mysql | msql) & ~postgres.

"- phrase highlighting operator. For example," russian apache "&" web server ". You can also use the" sign "for this operator.

Microsoft regularly comes up with a clever trick that is supposed to make using your computer dramatically more comfortable - and, as usual, it ends up making working conditions noticeably worse 🙂 That is what happened with disk content indexing, a feature Microsoft invented to speed up searching for information.

This service runs in the background and gradually scans files. Collecting all that information takes a considerable amount of time, but we are not supposed to notice it. Not supposed to - yet in practice, especially with large amounts of data or external disks attached, the whole system slows down with no end in sight. The SearchFilterHost process can start 5-10 minutes after the system boots and load the computer to its limit, and for laptop owners the problem is especially acute.

How the indexing service works on Windows

It works as follows: the file system is scanned, and all the information is written into a special database (the index); searches are then run against this database. The index contains file names and paths, creation times, key phrases from the content (for documents and HTML pages), document property values, and other data. So when you search by standard means, for example from the Start menu, the operating system does not walk through every file but simply queries the index.

Over time we install new programs, download new files, and new file types get added to the index, and the operating system sometimes gets so carried away with indexing that it slows everything down considerably. This is easy to notice when you are not doing anything, yet the hard drive is grinding away non-stop while the searchfilterhost.exe process sits in Task Manager consuming 30-50% of the CPU.

You could, of course, wait for the process to finish - but what if it takes 30-40 minutes? It is better to deal with the problem right away. There are three ways to resolve it.

End the SearchFilterHost process and turn off the Indexing Service altogether

This can be done in Task Manager. In principle, this option is not bad: it adds stability to the system, frees up space on the system disk, and the slowdowns caused by indexing disappear. Personally, I use the search function of the Total Commander file manager and find it much more convenient than the standard Windows 7/10 search. If you also use a third-party program and have no need for search over document contents, indexing is simply unnecessary. In some setups, such as virtual machines, disabling it is even recommended. This is done very simply:
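If you prefer the command line to the Services snap-in, one way to do the same thing is to stop and disable the Windows Search service (WSearch) from an elevated Command Prompt:

net stop WSearch
sc config WSearch start= disabled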


Suspend Indexing Service

Windows XP had dedicated settings for the indexing system that let you lower the service's priority in favor of running programs. Windows 7-10 has nothing of the sort; all we can do is pause indexing. This is worth doing when the SearchFilterHost process seriously interferes with your work but you don't want to disable the service entirely. To do so, type "indexing options" into the Start menu search bar and select "Indexing Options" from the results.

In the options window, click "Pause" and enjoy working in comfort 🙂

Disable indexing of individual drives

You don't have to disable the service entirely - you can turn off indexing for individual drives instead. To do so, go to "My Computer", right-click the drive in question (for example, one holding a great many files), and select "Properties". In the properties window, uncheck the option that allows files on this drive to have their contents indexed.

I hope this article was interesting and useful. We have looked at the possible problems with the indexing service in Windows 7/8/10 and figured out how to tame the voracious SearchFilterHost process. You can simplify your life even further: I will return to optimization topics more than once in future articles, so subscribe to blog updates to be the first to know.


The robots.txt file is one of the most important files when optimizing any website. Its absence can lead to high load on the site from search robots and to slow indexing and re-indexing, while incorrect settings can cause the site to disappear from search results entirely or never be indexed at all - meaning it won't be found in Yandex, Google, or other search engines. Let's go through all the nuances of configuring robots.txt correctly.


How robots.txt affects site indexing

Search bots will crawl your site whether or not a robots.txt file is present. If the file exists, robots can follow the rules written in it. At the same time, some robots may ignore certain rules, and some rules may apply only to specific bots. In particular, GoogleBot does not use the Host and Crawl-Delay directives, YandexNews has recently begun to ignore Crawl-Delay, and YandexDirect and YandexVideoParser ignore the more general directives in robots.txt (but follow those specified for them personally).

More about exceptions:
Yandex exclusions
Robot Exclusion Standard (Wikipedia)

The greatest load on a site is created by robots that download its content. By specifying what to index and what to ignore, and at what intervals to download, you can significantly reduce the load the robots place on the site while also speeding up crawling by forbidding them from visiting unnecessary pages.

Such unnecessary pages include AJAX and JSON scripts responsible for pop-up forms, banners, CAPTCHA display and so on; order forms and the shopping cart with all checkout steps; site search; the personal account area; and the admin panel.

For most robots it is also advisable to disable indexing of all JS and CSS files. For GoogleBot and Yandex, however, such files must be left open for crawling, since the search engines use them to analyze site usability and ranking (Google proof, Yandex proof).

Robots.txt directives

Directives are rules for robots. There is an original specification from January 30, 1994, and an extended standard from 1996. However, not every search engine and robot supports every directive. It is therefore more useful to know not the standard itself but how the major robots treat particular directives.

Let's look at it in order.

User-agent

This is the most important directive: it determines which robots the rules that follow apply to.

For all robots:
User-agent: *

For a specific bot:
User-agent: GoogleBot

Note that directive and user-agent names in robots.txt are not case-sensitive (URL paths, however, are), so the Google user agent can just as well be written like this:
user-agent: googlebot

Below is a table of the main user agents of various search engines.

Google:
  • Googlebot - Google's main indexing robot
  • Googlebot-News - Google News
  • Googlebot-Image - Google Images
  • Googlebot-Video - video
  • Mediapartners-Google, Mediapartners - Google AdSense, Google Mobile AdSense
  • AdsBot-Google - landing page quality check
  • AdsBot-Google-Mobile-Apps - Google robot for apps

Yandex:
  • YandexBot - Yandex's main indexing robot
  • YandexImages - Yandex.Images
  • YandexVideo - Yandex.Video
  • YandexMedia - multimedia data
  • YandexBlogs - blog search robot
  • YandexAddurl - robot that visits a page when it is submitted through the "Add URL" form
  • YandexFavicons - robot that indexes favicons
  • YandexDirect - Yandex.Direct
  • YandexMetrika - Yandex.Metrica
  • YandexCatalog - Yandex.Catalog
  • YandexNews - Yandex.News
  • YandexImageResizer - mobile services robot

Bing:
  • Bingbot - Bing's main indexing robot

Yahoo!:
  • Slurp - Yahoo!'s main indexing robot

Mail.Ru:
  • Mail.Ru - Mail.Ru's main indexing robot

Rambler:
  • StackRambler - formerly Rambler's main indexing robot. However, since 23.06.2011 Rambler no longer runs its own search engine and uses Yandex technology instead; no longer relevant.

Disallow and Allow

Disallow closes pages and sections of the site from indexing.
Allow forcibly opens pages and sections of the site for indexing.

But everything is not so simple here.

First, you need to know additional operators and understand how they are used - they are *, $ and #.

* - matches any number of characters, including none. You do not need to put an asterisk at the end of a rule; it is implied by default.
$ - indicates that the preceding character must be the last one in the URL.
# - a comment; everything after this character on the line is ignored by the robot.

Examples of using:

Disallow: *?s=
Disallow: /category/$

Second, you need to understand how nested rules are executed.
Remember that the order in which directives are written does not matter. Which rule applies to a given URL - allowing or disallowing - is determined by the paths specified. Let's look at an example.

Allow: *.css
Disallow: /template/

http://site.ru/template/ - closed from indexing
http://site.ru/template/style.css - closed from indexing
http://site.ru/style.css - open for indexing
http://site.ru/theme/style.css - open for indexing

If you want all .css files to be open for indexing, you will have to explicitly allow them for each of the closed folders. In our case:

Allow: *.css
Allow: /template/*.css
Disallow: /template/

Again, the order of the directives is not important.

Sitemap

Directive for specifying the path to the XML Sitemap file. The URL is written in the same way as in the address bar.

For example,

Sitemap: http://site.ru/sitemap.xml

The Sitemap directive is specified anywhere in the robots.txt file without reference to a specific user-agent. Several Sitemap rules can be specified.

Host

Directive for specifying the main mirror of the site (in most cases, with or without www). Note that the main mirror is specified WITHOUT http://, but WITH https:// if the site runs over HTTPS. If necessary, the port is also indicated.
The directive is supported only by the Yandex and Mail.Ru bots; other robots, in particular GoogleBot, ignore it. Host should be specified only once!

Example 1:
Host: site.ru

Example 2:
Host: https://site.ru

Crawl-delay

Directive for setting the interval between the robot's downloads of the site's pages. It is supported by the Yandex, Mail.Ru, Bing, and Yahoo robots. The value is given in seconds and can be a whole number or a fraction (with a dot as the separator).

Example 1:
Crawl-delay: 3

Example 2:
Crawl-delay: 0.5

If the load on the site is light, there is no need to set this rule. But if crawling causes the site to exceed its limits or suffer significant load, up to server outages, this directive can help reduce the pressure.

The higher the value, the fewer pages the robot will load per session. The optimal value is determined individually for each site. It is better to start with small values - 0.1, 0.2, 0.5 - and increase them gradually. For search engine robots that matter less for promotion, such as Mail.Ru, Bing, and Yahoo, you can start with higher values than for Yandex robots.

Clean-param

This rule tells the crawler that URLs with the specified parameters do not need to be indexed. The rule takes two arguments: a parameter and a section URL. The directive is supported by Yandex.

Clean-param: author_id http://site.ru/articles/

Clean-param: author_id&sid http://site.ru/articles/

Clean-Param: utm_source&utm_medium&utm_campaign
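For example (the URLs are hypothetical): with the first rule, the robot treats http://site.ru/articles/?author_id=12 and http://site.ru/articles/ as the same document and crawls only one of them.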

Other parameters

In the extended robots.txt specification you can also find the Request-rate and Visit-time parameters, but they are not currently supported by the major search engines.

The meaning of the directives:
Request-rate: 1/5 - load no more than one page in five seconds
Visit-time: 0600-0845 - download pages only from 6 am to 8:45 am GMT.

Closing robots.txt

If you need your site NOT to be indexed by search robots, specify the following directives:

User-agent: *
Disallow: /

Make sure these directives are present on the test/staging versions of your site.

Correct robots.txt setting

For Russia and the CIS countries, where Yandex's share is significant, directives should be written for all robots and separately for Yandex and Google.

To correctly configure robots.txt, use the following algorithm:

  1. Close the site admin panel from indexing
  2. Close your personal account, authorization, registration from indexing
  3. Close the shopping cart, order form, delivery and order information from indexing
  4. Close from indexing ajax, json scripts
  5. Close the cgi folder from indexing
  6. Close plugins, themes, js, css from indexing for all robots, except Yandex and Google
  7. Close search functionality from indexing
  8. Close service sections from indexing that do not carry any value for the site in search (error 404, list of authors)
  9. Close technical duplicate pages from indexing, as well as pages on which all content in one form or another is duplicated from other pages (calendars, archives, RSS)
  10. Close from indexing a page with filter options, sorting, comparison
  11. Close the page with the parameters of UTM tags and sessions from indexing
  12. Check what Yandex and Google have indexed using the "site:" operator (type "site:site.ru" into the search bar). If the results contain pages that should also be closed from indexing, add them to robots.txt
  13. Specify Sitemap and Host
  14. Add Crawl-Delay and Clean-Param as needed
  15. Check that robots.txt is correct using the Google and Yandex tools (described below)
  16. After 2 weeks, double-check if there are new pages in the search results that should not be indexed. Repeat the above steps if necessary.

Robots.txt example

# Example robots.txt file for setting up a hypothetical site https://site.ru

User-agent: *
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Disallow: *utm=
Crawl-Delay: 5

User-agent: GoogleBot
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Disallow: *utm=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif

User-agent: Yandex
Disallow: /admin/
Disallow: /plugins/
Disallow: /search/
Disallow: /cart/
Disallow: */?s=
Disallow: *sort=
Disallow: *view=
Allow: /plugins/*.css
Allow: /plugins/*.js
Allow: /plugins/*.png
Allow: /plugins/*.jpg
Allow: /plugins/*.gif
Clean-Param: utm_source&utm_medium&utm_campaign
Crawl-Delay: 0.5

Sitemap: https://site.ru/sitemap.xml
Host: https://site.ru

How to add and where is robots.txt located

After you have created the robots.txt file, place it on your site at site.ru/robots.txt, i.e. in the root directory. The crawler always requests the file at the URL /robots.txt.
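A quick way to confirm the file is actually reachable (assuming curl is available; substitute your own domain):

curl -I https://site.ru/robots.txt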

How to check robots.txt

You can check your robots.txt in the following places:

  • In Yandex.Webmaster - Tools tab > Robots.txt analysis
  • In Google Search Console - Crawl tab > robots.txt Tester

Typical robots.txt errors

At the end of the article I will give a few typical errors in the robots.txt file.

  • robots.txt missing
  • in robots.txt the site is closed from indexing (Disallow: /)
  • the file contains only the most basic directives, there is no detailed study of the file
  • pages with UTM tags and session identifiers are not closed from indexing in the file
  • only the directives
    Allow: *.css
    Allow: *.js
    Allow: *.png
    Allow: *.jpg
    Allow: *.gif
    are specified in the file, while the css, js, png, jpg, and gif files themselves are blocked by other directives in a number of directories
  • the Host directive is spelled out several times
  • Host does not specify https protocol
  • the path to the Sitemap is specified incorrectly, or the incorrect protocol or site mirror is specified



All directives of the robots.txt file are described in detail below.

Description of the robots.txt file format

The robots.txt file consists of entries, each of which consists of two fields: a line with the name of the client application (user-agent), and one or more lines starting with the Disallow directive:

Directive ":" value

robots.txt should be saved in Unix text format. Most good text editors can convert Windows line endings to Unix ones, or your FTP client can do it for you. Do not try to edit the file with an HTML editor, especially one that has no plain-text mode for viewing the code.

Directive User-agent:

For Rambler: User-agent: StackRambler
For Yandex: User-agent: Yandex
For Google: User-Agent: googlebot

You can create instructions for all robots:

User-agent: *

Directive Disallow:

The second part of the entry consists of Disallow lines - directives (instructions, commands) for that robot. Each group introduced by a User-agent line must contain at least one Disallow statement. The number of Disallow instructions is unlimited; they tell the robot which files and/or directories it is not allowed to index.

The following directive disallows indexing of the /cgi-bin/ directory:

Disallow: /cgi-bin/

Pay attention to the / at the end of the directory name! To forbid visiting only the directory "/dir", the rule must read "Disallow: /dir/". The line "Disallow: /dir" forbids visiting every page on the server whose full path (from the server root) begins with "/dir": for example, "/dir.html", "/dir/index.html", "/directory.html".

The directive written as follows prohibits indexing of the index.htm file located in the root:

Disallow: /index.htm

Historically, only Yandex understood the Allow directive (today Google supports it as well).

User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# forbids downloading everything except pages whose paths start with "/cgi-bin"

For other search engines you would have to list every closed document explicitly. Plan the structure of the site so that documents closed to indexing are, as far as possible, gathered in one place.

If the Disallow directive is empty, it means that the robot can index ALL files. At least one Disallow directive must be present for each User-agent field for robots.txt to be considered correct. A completely empty robots.txt means the same as if it didn't exist at all.

The Rambler robot understands * as any character, therefore the Disallow: * instruction means prohibiting indexing of the entire site.

Allow and Disallow directives without parameters. The absence of a parameter is interpreted as follows:

User-agent: Yandex
Disallow: # same as Allow: /

User-agent: Yandex
Allow: # same as Disallow: /

Using the special characters "*" and "$".
When specifying paths in the Allow and Disallow directives, you can use the special characters "*" and "$", thereby defining a kind of regular expression. The special character "*" means any (possibly empty) sequence of characters. Examples:

User-agent: Yandex
Disallow: /cgi-bin/*.aspx # forbids "/cgi-bin/example.aspx" and "/cgi-bin/private/test.aspx"
Disallow: /*private # forbids not only "/private" but also "/cgi-bin/private"

The special character "$". By default, "*" is appended to the end of every rule described in robots.txt, for example:

User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages starting with "/cgi-bin"
Disallow: /cgi-bin  # the same

To cancel the implicit "*" at the end of a rule, use the special character "$", for example:

User-agent: Yandex
Disallow: /example$ # forbids "/example" but not "/example.html"

User-agent: Yandex
Disallow: /example # forbids both "/example" and "/example.html"

User-agent: Yandex
Disallow: /example$  # forbids only "/example"
Disallow: /example*$ # the same as "Disallow: /example": forbids both "/example.html" and "/example"

Directive Host.

If your site has mirrors, a special mirroring robot will detect them and form a mirror group for your site. Only the main mirror will take part in the search. You can specify it in robots.txt with the "Host" directive, giving the name of the main mirror as its parameter. The "Host" directive does not guarantee that the specified mirror will be chosen as the main one, but the algorithm treats it with high priority when making the decision. Example:

# If www.glavnoye-zerkalo.ru is the main site mirror, then robots.txt for
# www.neglavnoye-zerkalo.ru looks like this:
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: www.glavnoye-zerkalo.ru

For compatibility with robots that do not fully follow the standard when processing robots.txt, the "Host" directive must be added to the group that starts with the "User-Agent" record, immediately after the "Disallow" ("Allow") directives. The argument of the "Host" directive is a domain name, optionally followed by a colon and a port number (80 by default). The parameter must be a single valid hostname (i.e. RFC 952 compliant and not an IP address) with a valid port number. Incorrectly composed "Host:" lines are ignored.

Examples of ignored Host directives:

Host: www.myhost-.ru
Host: www.-myhost.ru
Host: www.myhost.ru:100000
Host: www.my_host.ru
Host: .my-host.ru:8000
Host: my-host.ru.
Host: my..host.ru
Host: www.myhost.ru/
Host: www.myhost.ru:8080/
Host: 213.180.194.129
Host: www.firsthost.ru, www.secondhost.ru # one domain per line!
Host: www.firsthost.ru www.secondhost.ru # one domain per line!
Host: crew-communication.рф # punycode must be used

Directive Crawl-delay

Sets the delay, in seconds, between the search robot's downloads of pages from your server (Crawl-delay).

If the server is heavily loaded and does not have time to process download requests, use the "Crawl-delay" directive. It allows you to set the search robot a minimum time period (in seconds) between the end of the download of one page and the start of the download of the next. For compatibility with robots that do not fully follow the standard for processing robots.txt, the "Crawl-delay" directive must be added in the group starting with the "User-Agent" entry, immediately after the "Disallow" ("Allow") directives.

The Yandex search robot supports fractional Crawl-Delay values, for example, 0.5. This does not guarantee that the crawler will visit your site every half second, but it gives the crawler more freedom and allows the crawler to crawl faster.

User-agent: Yandex
Crawl-delay: 2 # sets a 2-second delay

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a 4.5-second delay

Directive Clean-param

A directive for excluding URL parameters: requests that contain such a parameter and requests that do not are treated as identical.
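A minimal hedged example (the parameter name and path are placeholders):

Clean-param: ref /catalog/ # /catalog/page?ref=123 and /catalog/page are treated as the same document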

Blank lines and comments

Blank lines are allowed between groups of instructions entered by the User-agent.

The Disallow instruction is taken into account only if it is subordinate to any User-agent string - that is, if there is a User-agent string above it.

Any text from the pound sign "#" to the end of the line is considered a comment and is ignored.

Example:

The following simple file robots.txt prohibits indexing of all pages of the site to all robots, except for the Rambler robot, which, on the contrary, is allowed to index all pages of the site.

# Instructions for all robots
User-agent: *
Disallow: /

# Instructions for the Rambler robot
User-agent: StackRambler
Disallow:

Common mistakes:

Reversed syntax:
User-agent: /
Disallow: StackRambler

It should be:
User-agent: StackRambler
Disallow: /

Several Disallow directives on one line:
Disallow: /css/ /cgi-bin/ /images/

Correct:
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
    Notes:
  1. The presence of empty line breaks between the "User-agent" and "Disallow" ("Allow") directives, as well as between the "Disallow" ("Allow") directives, is unacceptable.
  2. In accordance with the standard, it is recommended to insert a blank line feed before each "User-agent" directive.