Add option to respect robots.txt disallows #888
Conversation
This handles the robots.txt matches for pages, but not for subpage resources. Perhaps we should have a RobotsHandler that handles both interception + page queue filtering?
When enabled, the new `--robots` flag will result in the crawler fetching robots.txt for each page origin, caching it in Redis by URL to avoid duplicate fetches, and checking whether URLs are allowed by the policies therein before queueing.
Does not yet include testing that a page URL disallowed by robots is not queued, as I haven't yet been able to find a Webrecorder-managed site with a robots.txt with disallows to test against.
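For illustration, a minimal sketch of the flow described above, assuming hypothetical helper names for the Redis cache (`cacheGet`/`cacheSet` are placeholders, not the PR's actual API), with missing or error responses treated as allow-all:

```ts
import robotsParser from "robots-parser";

// Hypothetical check used before queueing a page URL; cacheGet/cacheSet stand in
// for the Redis-backed cache and are not the names used in the PR.
async function isPageAllowed(
  pageUrl: string,
  userAgent: string,
  cacheGet: (key: string) => Promise<string | null>,
  cacheSet: (key: string, body: string) => Promise<void>,
): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", pageUrl).href;

  // Check the cache first so each origin's robots.txt is fetched only once
  let body = await cacheGet(robotsUrl);
  if (body === null) {
    const resp = await fetch(robotsUrl);
    // A missing or erroring robots.txt is treated as "allow all" (empty body)
    body = resp.ok ? await resp.text() : "";
    await cacheSet(robotsUrl, body);
  }

  if (!body) {
    return true;
  }

  // robots-parser returns undefined when no rule applies; only an explicit
  // disallow (false) should block queueing
  return robotsParser(robotsUrl, body).isAllowed(pageUrl, userAgent) !== false;
}
```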
004c4eb to ae81f47
- move robots logic to `./src/utils/robots.ts`
- add `--robotsAgent` arg, defaulting to `Browsertrix/1.x`
- remove logging for 'using cached robots' as it's too verbose (for every link)
- cache empty "" robots responses and treat as allow all
- treat non-200, non-429, and non-503 responses as empty ""
Based on user feedback, decided to just keep this focused on pages only; can add other options later if there is a need. Added some minor refactoring in e9e3738:
Added caching of empty "" responses, and of error responses as "", since otherwise these were getting fetched every time and not cached.
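As a rough sketch of the status handling described here (not the actual code in `src/util/robots.ts`): a 200 body is returned as-is, 429/503 are left uncached so they can be retried, and anything else becomes an empty body, i.e. allow-all:

```ts
// Sketch only: mirrors the described behavior, not the PR's exact implementation.
async function fetchRobotsBody(robotsUrl: string): Promise<string | null> {
  const resp = await fetch(robotsUrl);

  if (resp.ok) {
    return await resp.text();
  }

  if (resp.status === 429 || resp.status === 503) {
    // Rate-limited or temporarily unavailable: return null so it is not cached
    return null;
  }

  // 404s and other errors: cache as empty string, which is treated as allow-all
  return "";
}
```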
src/util/robots.ts (Outdated)

```ts
  return await resp.text();
}

if (resp.status === 429 || resp.status === 503) {
```
Hm, while the retry mechanism is good and something we should perhaps add for required crawl resources (see: #921), here it could potentially lead to indefinite waiting just to fetch robots.txt, for every link. If a site is rate limiting access to robots.txt, that could potentially slow down the whole crawl.
I think we should reevaluate this in a later pass and remove it for now, also adding a `timedRun()` around the robots fetch, perhaps limiting it to 10 seconds at most.
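For reference, a generic version of that idea, using an AbortController-based timeout rather than the crawler's `timedRun()` helper (whose exact signature may differ):

```ts
// Illustrative only: cap the robots.txt fetch at ~10 seconds and fall back to
// "no robots.txt" on timeout or network error.
async function fetchRobotsWithTimeout(
  robotsUrl: string,
  timeoutSecs = 10,
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutSecs * 1000);

  try {
    const resp = await fetch(robotsUrl, { signal: controller.signal });
    return resp.ok ? await resp.text() : "";
  } catch {
    // Timed out or failed: treat the same as having no robots.txt
    return "";
  } finally {
    clearTimeout(timer);
  }
}
```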
Probably the way to handle this would be to fetch robots async, outside of the main crawler loop, and also avoid starting a fetch multiple times for the same domain.
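One way to avoid duplicate fetches for the same origin is to share a single in-flight promise per origin; a hypothetical sketch (names and structure are illustrative, not the PR's):

```ts
// Concurrent callers for the same origin await the same pending fetch instead
// of each starting their own request for robots.txt.
const pendingFetches = new Map<string, Promise<string>>();

function getRobotsBody(origin: string): Promise<string> {
  let pending = pendingFetches.get(origin);
  if (!pending) {
    pending = fetch(new URL("/robots.txt", origin).href)
      .then((resp) => (resp.ok ? resp.text() : ""))
      .catch(() => "")
      .finally(() => pendingFetches.delete(origin));
    pendingFetches.set(origin, pending);
  }
  return pending;
}
```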
Also added quick batching of requests, to avoid sending multiple requests for the same robots.txt within one crawler.
Could further improve this for multi-instance deployments, but it should be a good first pass.
- for now, limit robots.txt fetches to 10 seconds; if the fetch fails, treat it the same as no robots.txt
- limit the size of robots.txt to 100K
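A possible way to enforce the 100K cap, assuming simple truncation of oversized bodies (the actual implementation may abort the fetch or check Content-Length instead):

```ts
// Sketch: truncate oversized robots.txt bodies so a huge file can't exhaust memory.
const MAX_ROBOTS_TXT_SIZE = 100_000;

async function readRobotsBody(resp: Response): Promise<string> {
  const text = await resp.text();
  return text.length > MAX_ROBOTS_TXT_SIZE
    ? text.slice(0, MAX_ROBOTS_TXT_SIZE)
    : text;
}
```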
Fixes #631
This PR introduces a `--robots` CLI flag to the crawler. When enabled, the crawler will attempt to fetch robots.txt for each host encountered during crawling, and if found, will check if pages are disallowed before adding them to the crawl queue.

Fetched robots.txt bodies are cached by their URL in Redis using an LRU cache mechanism which retains the 100 most recently accessed entries, to prevent memory usage from getting out of control.
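As an illustration of what an LRU over Redis could look like (assuming ioredis, with made-up key names and a hash plus a sorted set scored by last access; not the PR's actual schema):

```ts
import Redis from "ioredis";

const ROBOTS_CACHE_LIMIT = 100;
const BODY_KEY = "robots:bodies"; // hash: robots.txt URL -> body
const LRU_KEY = "robots:lru"; // sorted set: robots.txt URL scored by last access

export class RobotsCache {
  constructor(private redis: Redis) {}

  async get(url: string): Promise<string | null> {
    const body = await this.redis.hget(BODY_KEY, url);
    if (body !== null) {
      // Refresh recency on every cache hit
      await this.redis.zadd(LRU_KEY, Date.now(), url);
    }
    return body;
  }

  async set(url: string, body: string): Promise<void> {
    await this.redis.hset(BODY_KEY, url, body);
    await this.redis.zadd(LRU_KEY, Date.now(), url);

    // Evict the least-recently accessed entries beyond the cache limit
    const over = (await this.redis.zcard(LRU_KEY)) - ROBOTS_CACHE_LIMIT;
    if (over > 0) {
      const evicted = await this.redis.zrange(LRU_KEY, 0, over - 1);
      if (evicted.length) {
        await this.redis.hdel(BODY_KEY, ...evicted);
        await this.redis.zrem(LRU_KEY, ...evicted);
      }
    }
  }
}
```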
Robots.txt bodies are parsed and checked for page allow/disallow status using the https://github.com/samclarke/robots-parser library, which is the most active and well-maintained implementation I could find with TypeScript types.
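For reference, basic usage of robots-parser per its documented API; note that `isAllowed()`/`isDisallowed()` return `undefined` when the URL is not covered by this robots.txt, so callers have to decide how to treat that case:

```ts
import robotsParser from "robots-parser";

const robotsUrl = "https://example.com/robots.txt";
const body = "User-agent: *\nDisallow: /search\n";

const robots = robotsParser(robotsUrl, body);

robots.isAllowed("https://example.com/about", "Browsertrix/1.x"); // true
robots.isDisallowed("https://example.com/search?q=x", "Browsertrix/1.x"); // true
robots.isAllowed("https://other.example.org/", "Browsertrix/1.x"); // undefined (different host)
```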
I have added basic tests that check the log lines to ensure robots.txt is being fetched and cached, but I haven't been able to find a robots.txt on a Webrecorder-managed domain with disallows to test that disallowed URLs are not actually queued. Maybe we could set up a single disallow on our website's robots.txt for this purpose?
Manual testing
- In `util/constants.ts`, set `ROBOTS_CACHE_LIMIT` to 2
- Crawl with `--robots` and `--logging debug` enabled and a mixture of URLs that do and do not have a robots.txt specified, and for those that do, a mixture of URLs that will be allowed and disallowed, e.g.:
- Check the debug log lines for the `robots` and `links` context. What should be present:

Log lines for fetching and caching robots.txt

```
{"timestamp":"2025-09-30T14:45:03.888Z","logLevel":"debug","context":"robots","message":"Fetching robots.txt","details":{"url":"https://forums.gentoo.org/robots.txt"}}
{"timestamp":"2025-09-30T14:45:04.226Z","logLevel":"debug","context":"robots","message":"Caching robots.txt body","details":{"url":"https://forums.gentoo.org/robots.txt"}}
```

Log lines for using cached robots.txt

```
{"timestamp":"2025-09-30T14:45:04.234Z","logLevel":"debug","context":"robots","message":"Using cached robots.txt body","details":{"url":"https://forums.gentoo.org/robots.txt"}}
```

Log lines for not queueing page URLs disallowed by robots

```
{"timestamp":"2025-09-30T14:45:04.235Z","logLevel":"debug","context":"links","message":"Page URL not queued, disallowed by robots.txt","details":{"url":"https://forums.gentoo.org/search.php"}}
```

Log lines for deleting least-recently accessed cached robots.txt when over cache limit

```
{"timestamp":"2025-09-30T14:45:04.843Z","logLevel":"debug","context":"robots","message":"Deleting cached robots.txt, over cache limit","details":{"url":"https://forums.gentoo.org/robots.txt"}}
```

Log lines for hosts where robots.txt cannot be found or fetched

```
{"timestamp":"2025-09-30T14:45:05.619Z","logLevel":"debug","context":"robots","message":"Robots.txt not fetched","details":{"url":"https://bitarchivist.net/robots.txt","status":404}}
```