Create a course information scraper #8

@AJaccP

🧠 Context

src/data/courses.json is currently hand-crafted with only 8 courses. This ticket builds a script that scrapes the Carleton undergraduate calendar to produce a full course list in the same format.

The scraper's job is data extraction only — it does not parse prerequisites into the AST. Each course it outputs should have prereq: null and prereqRaw set to the raw prerequisite text copied verbatim from the calendar. The prereq parser (separate ticket) handles converting those strings into structured AST nodes later.

The output schema is the Course type defined in src/types/course.ts. Every field must be populated:

  • code — e.g. "COMP 1405"
  • title — full course title
  • credits — as a number (Carleton uses 0.5 for one-term courses)
  • description — course description text
  • prereq: null — always null, filled in by the parser later
  • prereqRaw — raw prerequisite string from the calendar, or null if no prerequisite is listed
  • precludes — array of course codes listed as preclusions, or []
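As a rough illustration of the shape described above, the fields could be modelled as follows. This is a sketch only: the real definition lives in src/types/course.ts and is not reproduced in this ticket, so the `ScrapedCourse` name and the `toScrapedCourse` helper are hypothetical, and the sample values are placeholders rather than calendar data.

```typescript
// Hypothetical mirror of the Course type as the scraper emits it.
// Assumption: check src/types/course.ts for the real definition.
interface ScrapedCourse {
  code: string;             // e.g. "COMP 1405"
  title: string;            // full course title
  credits: number;          // 0.5 for one-term courses
  description: string;      // course description text
  prereq: null;             // always null; the prereq parser fills it in later
  prereqRaw: string | null; // raw calendar text, or null if none is listed
  precludes: string[];      // course codes, or [] if none are listed
}

// Hypothetical helper: builds one entry and applies the defaults the ticket
// asks for (prereq: null, prereqRaw: null when missing, precludes: []).
function toScrapedCourse(fields: {
  code: string;
  title: string;
  credits: number;
  description: string;
  prereqRaw?: string | null;
  precludes?: string[];
}): ScrapedCourse {
  const raw = fields.prereqRaw;
  return {
    code: fields.code,
    title: fields.title,
    credits: fields.credits,
    description: fields.description,
    prereq: null,
    // A missing or whitespace-only string counts as "no prerequisite listed";
    // otherwise the text is passed through unchanged.
    prereqRaw: raw != null && raw.trim() !== "" ? raw : null,
    precludes: fields.precludes ?? [],
  };
}

// Placeholder values only; "TEST 1001" is not a real course.
const example = toScrapedCourse({
  code: "TEST 1001",
  title: "Example Course",
  credits: 0.5,
  description: "Placeholder description.",
});
```

Routing every entry through one constructor like this keeps the "always null" and "or []" rules in a single place instead of repeating them at each extraction site.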

Start with COMP courses. SYSC, MATH, and STAT are stretch goals: CS students commonly take courses from those departments and they appear in some COMP prerequisite trees, but they can be covered after COMP is complete.


🛠️ Implementation Plan

  1. Create a scripts/ folder at the project root. The scraper lives at scripts/scrape-courses.ts and is run with:

    pnpm run scrape:courses

    Add this script to package.json.

  2. Install cheerio for HTML parsing. Use Node's built-in fetch (available in Node 22) for HTTP requests — do not add axios or node-fetch. This project has security settings in pnpm-workspace.yaml — if the cheerio install is blocked by a policy error, flag it to Jacc rather than working around it.

  3. Before writing any code, inspect the Carleton undergraduate calendar course listing pages in your browser. Understand the HTML structure — how courses, titles, credits, descriptions, prerequisites, and preclusions are marked up. Save a real HTML page from the calendar as a fixture file under scripts/fixtures/ to use in tests.

  4. Write the scraper to fetch the course listing page(s), parse the HTML with Cheerio, and extract the fields above for each course entry.

  5. Write tests in scripts/scrape-courses.test.ts that run against the saved HTML fixture — not the live network. Verify that a known course (e.g. COMP 3004) is extracted correctly with the right code, title, credits, and prereqRaw string.

  6. The scraper should write its output to scripts/output/courses-scraped.json, not directly to src/data/courses.json. The merge with the existing hand-crafted entries needs a human review step.
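Once Cheerio has yielded the text of a course entry (step 4), the individual fields still have to be split out of that text. The sketch below shows one way to do that with plain string handling. It assumes, without confirmation from this ticket, that the header looks like "COMP 1405 [0.5 credit]" and that the body contains lines starting with "Prerequisite(s):" and "Precludes additional credit for"; verify both against the fixture saved in step 3. All function names are hypothetical.

```typescript
// Replace non-breaking spaces (common in scraped HTML) and collapse runs of
// spaces/tabs. Newlines are kept so line-based matching still works.
function normalizeSpaces(text: string): string {
  return text.replace(/\u00a0/g, " ").replace(/[ \t]+/g, " ").trim();
}

// Assumed header format: "<4 letters> <4 digits> [<number> credit]".
function parseHeader(header: string): { code: string; credits: number } | null {
  const m = normalizeSpaces(header).match(
    /^([A-Z]{4} \d{4}) \[(\d+(?:\.\d+)?) credits?\]/
  );
  if (!m) return null;
  return { code: m[1], credits: Number(m[2]) };
}

// Returns the rest of the line after the assumed "Prerequisite(s):" label,
// trimmed at the ends but otherwise untouched, or null if the label is absent.
function extractPrereqRaw(body: string): string | null {
  const m = body.match(/Prerequisite\(s\):\s*(.+)/);
  return m ? m[1].trim() : null;
}

// Collects the course codes on the assumed "Precludes additional credit for"
// line, de-duplicated in order of appearance. Returns [] if there is no such line.
function extractPrecludes(body: string): string[] {
  const m = normalizeSpaces(body).match(/Precludes additional credit for ([^\n]*)/);
  if (!m) return [];
  const codes = m[1].match(/\b[A-Z]{4} \d{4}\b/g) ?? [];
  return codes.filter((code, i) => codes.indexOf(code) === i);
}

// Invented sample text; the TEST codes are placeholders, not calendar entries.
const sampleBody = [
  "Placeholder description of the course.",
  "Precludes additional credit for TEST 1002, TEST 1003.",
  "Prerequisite(s): TEST 1001 or permission of the department.",
  "Lectures three hours a week.",
].join("\n");

const header = parseHeader("TEST\u00a01005 [0.5 credit]");
const prereqRaw = extractPrereqRaw(sampleBody);
const precludes = extractPrecludes(sampleBody);
```

Keeping these helpers free of Cheerio and network calls means the fixture tests in step 5 can exercise them directly with strings taken from the saved page.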


✅ Acceptance Criteria

  • pnpm run scrape:courses runs without errors and writes scripts/output/courses-scraped.json
  • Output is a valid JSON array where every entry conforms to the Course type
  • prereq is null on every entry
  • prereqRaw contains the raw prerequisite text from the calendar, or null if none
  • precludes is populated where the calendar lists preclusions, or []
  • Tests run against a saved HTML fixture (no live network calls in tests)
  • COMP courses are fully covered
  • pnpm typecheck passes
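Several of the criteria above can be checked mechanically on the parsed output file. The following is a minimal sketch of such a check, assuming the field list from the Context section; `findViolations` is a hypothetical name and is not part of the existing codebase.

```typescript
// Returns a list of human-readable problems; an empty list means the data
// satisfies the structural criteria (array, field types, prereq always null).
function findViolations(data: unknown): string[] {
  if (!Array.isArray(data)) return ["output is not a JSON array"];
  const problems: string[] = [];

  data.forEach((entry, i) => {
    if (typeof entry !== "object" || entry === null) {
      problems.push(`[${i}] is not an object`);
      return;
    }
    const c = entry as Record<string, unknown>;

    // code, title and description must be non-empty strings.
    for (const key of ["code", "title", "description"]) {
      if (typeof c[key] !== "string" || c[key] === "") {
        problems.push(`[${i}] ${key} must be a non-empty string`);
      }
    }
    if (typeof c.credits !== "number" || Number.isNaN(c.credits)) {
      problems.push(`[${i}] credits must be a number`);
    }
    // A missing prereq key is also reported, since undefined !== null.
    if (c.prereq !== null) {
      problems.push(`[${i}] prereq must be null`);
    }
    if (c.prereqRaw !== null && typeof c.prereqRaw !== "string") {
      problems.push(`[${i}] prereqRaw must be a string or null`);
    }
    if (!Array.isArray(c.precludes) || !c.precludes.every((p) => typeof p === "string")) {
      problems.push(`[${i}] precludes must be an array of strings`);
    }
  });

  return problems;
}
```

The input would typically be the result of `JSON.parse` on the contents of scripts/output/courses-scraped.json. This check only covers structure; whether COMP courses are fully covered still needs a comparison against the calendar.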

Projects

Status
In Progress
