The crawling process starts from the URL you configure as the "Website frontend URL" in your build trigger settings. It recursively follows all the hyperlinks pointing to your domain.
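As a rough illustration of what "recursively follows all the hyperlinks" means, here is a minimal sketch that walks an in-memory link graph instead of the network. The function name `crawl_order`, the `links_by_page` mapping and the breadth-first visiting order are assumptions made for this example, not a description of how the crawler is actually implemented.

```python
from collections import deque


def crawl_order(start_url, links_by_page):
    """Return pages in the order a breadth-first crawl would visit them.

    `links_by_page` is a hypothetical stand-in for fetching a page and
    extracting its hyperlinks: it maps a URL to the URLs it links to.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in links_by_page.get(url, []):
            # Each page is visited once, even if many pages link to it.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


site = {
    "https://www.example.com/": [
        "https://www.example.com/a",
        "https://www.example.com/b",
    ],
    "https://www.example.com/a": [
        "https://www.example.com/",
        "https://www.example.com/c",
    ],
}
visited = crawl_order("https://www.example.com/", site)
```

A page that is not reachable through links from the starting URL never enters the queue, which is why a Sitemap file is a useful extra source of URLs.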
The User-Agent used by our crawler is DatoCMSSearchBot.
DatoCMSSearchBot respects the robots.txt directives user-agent and disallow. For example, if your robots.txt contains disallow rules for /do-not-crawl/ and /not-allowed under the DatoCMSSearchBot user-agent, the crawler won't crawl documents under those paths.
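To check how such rules behave before deploying them, you can evaluate a robots.txt locally. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content is an assumed reconstruction based on the two paths mentioned above, `www.example.com` is a placeholder host, and the standard-library parser is not the crawler's own parser, so treat the results as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, built from the paths mentioned in the text.
ROBOTS_TXT = """\
User-agent: DatoCMSSearchBot
Disallow: /do-not-crawl/
Disallow: /not-allowed
"""


def is_allowed(path, user_agent="DatoCMSSearchBot"):
    """Return True if `user_agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # www.example.com is a placeholder for your own domain.
    return parser.can_fetch(user_agent, "https://www.example.com" + path)


blocked_dir = is_allowed("/do-not-crawl/page.html")
blocked_path = is_allowed("/not-allowed")
open_path = is_allowed("/blog/first-post")
```

With these rules, `blocked_dir` and `blocked_path` are `False`, while `open_path` is `True` because no rule matches it.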
As of today, DatoCMSSearchBot supports neither the crawl-delay directive in robots.txt nor robots meta tags on HTML pages, such as nofollow and noindex.
In addition to following the links found within pages, if your website provides a Sitemap file, the crawler will utilize it as an extra source of URLs to crawl. Sitemap Index files are also supported.
The crawler will first look for sitemap directives in the robots.txt file. If a robots.txt file does not exist, or it does not contain any sitemap directive, the crawler will try /sitemap.xml under the root of your domain.
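The lookup order described above can be sketched as a small pure function. The name `sitemap_candidates` and its arguments are hypothetical; the function takes the robots.txt body as a string (or `None` when the file does not exist) rather than fetching anything, and it only illustrates the documented fallback, not the crawler's actual code.

```python
from urllib.parse import urljoin


def sitemap_candidates(root_url, robots_txt):
    """Return the sitemap URLs to try, following the documented order.

    `robots_txt` is the text of robots.txt, or None if the file is missing.
    """
    found = []
    if robots_txt is not None:
        for line in robots_txt.splitlines():
            # Drop trailing comments, then split "Sitemap: <url>" on the first colon.
            line = line.split("#", 1)[0].strip()
            key, sep, value = line.partition(":")
            if sep and key.strip().lower() == "sitemap" and value.strip():
                found.append(value.strip())
    # No robots.txt, or no sitemap directive: fall back to /sitemap.xml.
    return found or [urljoin(root_url, "/sitemap.xml")]


declared = sitemap_candidates(
    "https://www.example.com",
    "User-agent: *\nSitemap: https://www.example.com/sitemaps/index.xml\n",
)
fallback = sitemap_candidates("https://www.example.com", None)
```

Here `declared` contains the URL taken from the directive, and `fallback` contains only `https://www.example.com/sitemap.xml`.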
Any link to a domain other than the one configured as the "Website frontend URL" in your build trigger settings will be ignored by the bot.
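A minimal sketch of this kind of filtering is shown below. The helper `same_domain_links` is hypothetical, and it assumes that "same domain" means an exact hostname match; how the bot treats subdomains is not stated here, so adjust the comparison if your setup differs.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url, hrefs, frontend_url):
    """Resolve `hrefs` against `page_url` and keep those on the frontend host."""
    allowed_host = urlparse(frontend_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        # Assumption: only http(s) links with an identical hostname are kept.
        if parsed.scheme in ("http", "https") and parsed.hostname == allowed_host:
            kept.append(absolute)
    return kept


links = same_domain_links(
    "https://www.example.com/blog/",
    ["/about", "post-1", "https://other.example.org/x", "mailto:hi@example.com"],
    "https://www.example.com",
)
```

In this example only `/about` and `post-1` survive, resolved to absolute URLs on `www.example.com`.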
Through the HTML global lang attribute present on a page (or language-detection heuristics, if the attribute is missing), we detect the language of every crawled page, so that indexing happens with proper stemming.
That is, if the visitor searches for "cats", we'll also return results for "cat", "catlike", "catty", etc.
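To verify what language a page declares, you can read the lang attribute from its HTML. The sketch below uses Python's standard `html.parser`; the names `LangExtractor` and `page_lang` are made up for this example, and the heuristic fallback used when the attribute is missing is not reproduced here.

```python
from html.parser import HTMLParser


class LangExtractor(HTMLParser):
    """Capture the lang attribute of the first <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            # Missing or empty attribute values are treated as "not declared".
            self.lang = dict(attrs).get("lang") or None


def page_lang(html):
    parser = LangExtractor()
    parser.feed(html)
    return parser.lang


declared_lang = page_lang('<!DOCTYPE html><html lang="it"><body>Ciao</body></html>')
missing_lang = page_lang("<html><body>Hello</body></html>")
```

When `page_lang` returns `None`, as for `missing_lang`, the page relies on language detection instead of an explicit declaration.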
The crawler does not execute JavaScript on the spidered pages; it only parses plain HTML. If your website is a Single Page App, you'll need to set up pre-rendering to make it readable by our bot.
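One way to see what a parser that ignores JavaScript has to work with is to extract the text present in the raw HTML. The following is a simplified sketch using Python's standard `html.parser`; `VisibleText` and `static_text` are hypothetical names, and the extraction rules are an assumption rather than the bot's real ones.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
prerendered = "<html><body><h1>Hello</h1><script>render()</script></body></html>"
spa_text = static_text(spa_shell)
prerendered_text = static_text(prerendered)
```

The Single Page App shell yields an empty string, while the pre-rendered page yields `Hello`, which is the difference pre-rendering makes for a crawler that does not run scripts.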
To give your users the best experience, it's often useful to instruct DatoCMSSearchBot to exclude certain parts of your pages from indexing, e.g. website headers and footers. Those sections are repeated on every page, so they can only degrade your search results.
To do that, you can simply add a data-datocms-noindex attribute to the HTML elements of your page you want to exclude: everything contained in those elements will be ignored during indexing.
<body>
  <div class="header" data-datocms-noindex> ... </div>
  <div class="main-content"> ... </div>
  <div class="footer" data-datocms-noindex> ... </div>
</body>
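To preview which text would remain after the marked elements are dropped, a rough approximation can be written with Python's standard `html.parser`. The class `IndexableText` and the function `indexable_text` are hypothetical, and the sketch assumes reasonably well-formed HTML; the bot's real extraction logic may differ.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collect text outside any element carrying data-datocms-noindex."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth += 1  # nested element inside an excluded one
        elif any(name == "data-datocms-noindex" for name, _ in attrs):
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html):
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<body>
  <div class="header" data-datocms-noindex><a href="/">Home</a><br></div>
  <div class="main-content"><p>Article text</p></div>
  <div class="footer" data-datocms-noindex>Copyright</div>
</body>
"""
remaining = indexable_text(page)
```

For this sample page, `remaining` is `Article text`: the header link and the footer text are skipped, including elements nested inside the excluded containers.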
The time needed to finish the crawling operation depends on the number of pages in your website and your hosting's performance, but the crawler normally indexes about 20 pages/sec.
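For a back-of-the-envelope estimate, you can divide the page count by that rate. The helper below is a hypothetical illustration; the default of 20 pages/sec is the typical figure quoted above, not a guarantee, and real crawls vary with hosting performance.

```python
def estimate_crawl_seconds(page_count, pages_per_second=20.0):
    """Rough crawl duration, assuming a constant indexing rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return page_count / pages_per_second


# A hypothetical 6,000-page website at the typical rate.
seconds = estimate_crawl_seconds(6000)
minutes = seconds / 60
```

At the typical rate, a 6,000-page website would take about 300 seconds, or 5 minutes.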