
Commit ae81f47

tw4l and ikreymer authored and committed
Add tests for robots.txt being fetched and cached
Does not yet include a test that a page URL disallowed by robots.txt is not queued, as I haven't yet found a Webrecorder-managed site with a robots.txt containing disallows to test against.
1 parent 0a3ef30 commit ae81f47

File tree

1 file changed: +35 −0 lines changed


tests/robots_txt.test.js

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
import child_process from "child_process";

test("test robots.txt is fetched and cached", async () => {
  const res = child_process.execSync(
    "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net/ --url https://webrecorder.net/ --scopeType page --robots --logging debug",
  );

  const log = res.toString();

  // robots.txt not found
  expect(
    log.indexOf(
      '"logLevel":"debug","context":"robots","message":"Fetching robots.txt","details":{"url":"https://specs.webrecorder.net/robots.txt"}}',
    ) > 0,
  ).toBe(true);

  expect(
    log.indexOf(
      '"logLevel":"debug","context":"robots","message":"Robots.txt not fetched","details":{"url":"https://specs.webrecorder.net/robots.txt","status":404}}',
    ) > 0,
  ).toBe(true);

  // robots.txt found and cached
  expect(
    log.indexOf(
      '"logLevel":"debug","context":"robots","message":"Fetching robots.txt","details":{"url":"https://webrecorder.net/robots.txt"}}',
    ) > 0,
  ).toBe(true);

  expect(
    log.indexOf(
      '"logLevel":"debug","context":"robots","message":"Caching robots.txt body","details":{"url":"https://webrecorder.net/robots.txt"}}',
    ) > 0,
  ).toBe(true);
});
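The assertions above match raw substrings of the crawler's JSONL debug output, which is brittle if field order or formatting ever changes. As a sketch only, a small helper (the name `hasRobotsLog` is mine, not part of the codebase) could instead parse each log line as JSON and match on the same `context`, `message`, and `details.url` fields that appear in the expected strings:

```javascript
// Hypothetical helper: scan JSONL crawler output for a robots-context
// log entry, matching parsed fields instead of raw substrings.
function hasRobotsLog(log, message, url) {
  return log
    .split("\n")
    .filter((line) => line.trim().startsWith("{"))
    .some((line) => {
      let entry;
      try {
        entry = JSON.parse(line);
      } catch {
        return false; // skip non-JSON lines in mixed output
      }
      return (
        entry.context === "robots" &&
        entry.message === message &&
        entry.details?.url === url
      );
    });
}

// Example against a fabricated log line in the same shape as the test's strings:
const sampleLog =
  '{"logLevel":"debug","context":"robots","message":"Fetching robots.txt","details":{"url":"https://webrecorder.net/robots.txt"}}\n';
console.log(
  hasRobotsLog(sampleLog, "Fetching robots.txt", "https://webrecorder.net/robots.txt"),
); // true
```

This trades a little verbosity for assertions that survive cosmetic changes to the log serialization.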
