Following on from the discussion in #63: the HostAliases setting is currently quite limited, requiring an exact match before the crawler will follow a link to that domain.
To make crawling a large number of subdomains easier, support for a wildcard (*) would be useful.
For example:

    using InfinityCrawler;

    var crawler = new Crawler();
    var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings
    {
        UserAgent = "MyVeryOwnWebCrawler/1.0",
        RequestProcessorOptions = new RequestProcessorOptions
        {
            MaxNumberOfSimultaneousRequests = 5
        },
        // Proposed: a wildcard alias covering every subdomain of example.org
        HostAliases = new [] { "*.example.org" }
    });

There likely don't need to be any special rules around wildcard handling. A host alias consisting only of a wildcard would mean crawling any domain that is linked to. This is likely where analyzers of some kind would be useful, along with additional documentation.
Full wildcard support would also allow crawling more complex subdomain patterns like web.*.example.org, which may help in some specific use cases.
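
As a rough illustration of the "no special rules" approach, here is a minimal sketch of how a wildcard alias could be compared against a link's host. HostAliasMatcher and IsMatch are hypothetical names made up for this example and are not part of InfinityCrawler; the sketch assumes a wildcard may stand for any run of characters, including dots.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not an InfinityCrawler API.
public static class HostAliasMatcher
{
    // Returns true when the host matches the alias, where '*' in the alias
    // stands for any sequence of characters (including dots).
    public static bool IsMatch(string host, string alias)
    {
        // Escape the alias so '.' is literal, then turn the escaped
        // wildcard ("\*") back into the regex "match anything" (".*").
        var pattern = "^" + Regex.Escape(alias).Replace("\\*", ".*") + "$";

        // Host names are case-insensitive.
        return Regex.IsMatch(host, pattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

Under these assumptions, IsMatch("blog.example.org", "*.example.org") and IsMatch("web.eu.example.org", "web.*.example.org") are both true, and a bare "*" matches every host. Note that "*.example.org" would not match "example.org" itself, so the root domain would still need its own alias, and that the wildcard can span several labels (for instance "web.a.b.example.org"). Whether either behaviour is desirable is an open design question.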