
How the crawling works

The crawling process starts from the URL you configure as the "Website frontend URL" in your build trigger settings. It recursively follows all the hyperlinks pointing to your domain.
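The same-domain, link-following strategy described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the actual bot: it walks a hypothetical in-memory site map instead of making real HTTP requests, and uses only Python's standard library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Breadth-first crawl restricted to the start URL's domain.
    `fetch(url)` returns the page's HTML (here a stub, not real HTTP)."""
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)
            # Links to other domains are ignored, as described above.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# Hypothetical two-page site with one external link.
site = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.com/">Ext</a>',
    "https://example.com/about": '<a href="/">Home</a>',
}
pages = crawl("https://example.com/", lambda u: site.get(u, ""))
# pages contains only the two example.com URLs; other.com is skipped
```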

User Agent

The User-Agent used by our crawler is DatoCMSSearchBot.

How can I control which pages will be crawled on my site?

DatoCMSSearchBot respects the robots.txt directives user-agent and disallow. In the example below, DatoCMSSearchBot won't crawl documents that are under /do-not-crawl/ or /not-allowed/.

User-agent: DatoCMSSearchBot # DatoCMS's user agent
Disallow: /do-not-crawl/ # disallow this directory
User-agent: * # any robot
Disallow: /not-allowed/ # disallow this directory

As of today, DatoCMSSearchBot does not support the crawl-delay directive in robots.txt, nor robots meta tags on HTML pages such as nofollow and noindex.
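You can verify rules like the ones above with Python's standard-library robots.txt parser. This is only a convenience for checking your own file; the example URLs and paths are illustrative:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt example from above, without the inline comments.
robots_txt = """\
User-agent: DatoCMSSearchBot
Disallow: /do-not-crawl/

User-agent: *
Disallow: /not-allowed/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The DatoCMSSearchBot group blocks its disallowed directory...
rp.can_fetch("DatoCMSSearchBot", "https://example.com/do-not-crawl/page")  # → False
# ...while everything else stays crawlable for it.
rp.can_fetch("DatoCMSSearchBot", "https://example.com/public/page")        # → True
# The wildcard group applies to any other bot.
rp.can_fetch("SomeOtherBot", "https://example.com/not-allowed/page")       # → False
```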

Sitemaps

In addition to following the links found within pages, if your website provides a Sitemap file, the crawler will utilize it as an extra source of URLs to crawl. Sitemap Index files are also supported.

The crawler will first look for sitemap directives in the robots.txt file. If a robots.txt file does not exist, or it does not offer any sitemap directive, the crawler will fall back to /sitemap.xml at the root of your domain.
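The lookup order just described can be sketched as follows. This is an illustration of the logic, not the bot's actual code; the robots.txt content is passed in as text, with None standing for "no robots.txt":

```python
from urllib.parse import urljoin

def discover_sitemaps(base_url, robots_txt):
    """Return sitemap URLs: Sitemap directives from robots.txt first,
    /sitemap.xml at the domain root as the fallback."""
    sitemaps = []
    if robots_txt:
        for line in robots_txt.splitlines():
            key, _, value = line.partition(":")
            if key.strip().lower() == "sitemap" and value.strip():
                sitemaps.append(value.strip())
    return sitemaps or [urljoin(base_url, "/sitemap.xml")]

# robots.txt declares a sitemap: use it.
found = discover_sitemaps("https://example.com/",
                          "Sitemap: https://example.com/sm.xml")
# No robots.txt at all: fall back to the root sitemap.
fallback = discover_sitemaps("https://example.com/", None)
```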

Ensure the URLs in your sitemaps match your domain!

Any link to a domain different from the one configured as the "Website frontend URL" in your build trigger settings will be ignored by the bot.

Language Detection

Through the HTML global lang attribute present on a page — or language-detection heuristics, if the attribute is missing — we detect the language of every crawled page, so that indexing will happen with proper stemming.

That is, if the visitor searches for "cats", we'll also return results for "cat", "catlike", "catty", etc.
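Reading the lang attribute can be sketched with Python's standard HTML parser. The heuristic fallback is not documented, so the default value here is just a placeholder:

```python
from html.parser import HTMLParser

class LangSniffer(HTMLParser):
    """Reads the lang attribute from the <html> tag, if present."""
    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.lang = dict(attrs).get("lang")

def page_language(html, default="unknown"):
    # In the real bot, language-detection heuristics would run when
    # the attribute is missing; "unknown" is a stand-in here.
    sniffer = LangSniffer()
    sniffer.feed(html)
    return sniffer.lang or default

page_language('<html lang="en"><body>cats</body></html>')  # → "en"
page_language('<html><body>cats</body></html>')            # → "unknown"
```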

Plain HTML only

The crawler does not execute JavaScript on the spidered pages; it only parses plain HTML. If your website is a Single Page App, you'll need to set up pre-rendering to make it readable by our bot.

Excluding content from indexing

To give your users the best experience, it's often useful to instruct DatoCMSSearchBot to exclude certain parts of your pages from indexing — e.g. website headers and footers. Those sections are repeated on every page, so they can only degrade your search results.

To do that, simply add a data-datocms-noindex attribute to the HTML elements you want to exclude: everything contained in those elements will be ignored during indexing.

<body>
  <div class="header" data-datocms-noindex>
    ...
  </div>
  <div class="main-content">
    ...
  </div>
  <div class="footer" data-datocms-noindex>
    ...
  </div>
</body>
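The effect of the attribute can be sketched as a text extractor that skips marked subtrees. This is an illustration built on Python's standard HTML parser, not the bot's actual indexing code:

```python
from html.parser import HTMLParser

class IndexableText(HTMLParser):
    """Collects page text, skipping any subtree whose root element
    carries the data-datocms-noindex attribute."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nesting level inside an excluded subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as (name, None) pairs.
        if self.skip_depth or any(n == "data-datocms-noindex" for n, _ in attrs):
            self.skip_depth += 1
        # (Void tags such as <br> inside excluded blocks would need
        # extra handling in real code.)

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

page = ('<body>'
        '<div class="header" data-datocms-noindex>Menu</div>'
        '<div class="main-content">Hello cats</div>'
        '<div class="footer" data-datocms-noindex>Legal</div>'
        '</body>')
extractor = IndexableText()
extractor.feed(page)
extractor.chunks  # → ['Hello cats']
```

Only the main content survives; the marked header and footer contribute nothing to the index.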

Crawling time

The time needed to finish the crawling operation depends on the number of pages in your website and your hosting's performance; as a rule of thumb, expect around 20 indexed pages per second.