diff --git a/.gitignore b/.gitignore index 0a2799c5..d5153e6b 100644 --- a/.gitignore +++ b/.gitignore @@ -23,3 +23,6 @@ substack_html_pages/* # Ignore downloaded image assets substack_images/ + +# Ignore fetched comment threads +substack_comments/ diff --git a/README.md b/README.md index 59ba6efd..4af0d4c7 100644 --- a/README.md +++ b/README.md @@ -1,77 +1,143 @@ -๏ปฟ# Substack2Markdown - -Substack2Markdown is a Python tool for downloading free and premium Substack posts and saving them as both Markdown and -HTML files, and includes a simple HTML interface to browse and sort through the posts. It will save paid for content as -long as you're subscribed to that substack. - -๐Ÿ†• @Firevvork has built a web version of this tool at [Substack Reader](https://www.substacktools.com/reader) - no -installation required! (Works for free Substacks only.) - - -![Substack2Markdown Interface](./assets/images/screenshot.png) - -Once you run the script, it will create a folder named after the substack in `/substack_md_files`, -and then begin to scrape the substack URL, converting the blog posts into markdown files. Once all the posts have been -saved, it will generate an HTML file in `/substack_html_pages` directory that allows you to browse the posts. - -You can either hardcode the substack URL and the number of posts you'd like to save into the top of the file, or -specify them as command line arguments. - -## Features - -- Converts Substack posts into Markdown files. -- Generates an HTML file to browse Markdown files. -- Supports free and premium content (with subscription). +๏ปฟ# Substack2Markdown + +Substack2Markdown is a Python tool for downloading free and premium Substack posts and saving them as both Markdown and +HTML files, and includes a simple HTML interface to browse and sort through the posts. It will save paid for content as +long as you're subscribed to that substack. + +๐Ÿ†• @Firevvork has built a web version of this tool at [Substack Reader](https://www.substacktools.com/reader) - no +installation required! (Works for free Substacks only.) + + +![Substack2Markdown Interface](./assets/images/main-page.png) + +Once you run the script, it will create a folder named after the substack in `/substack_md_files`, +and then begin to scrape the substack URL, converting the blog posts into markdown files. Once all the posts have been +saved, it will generate an HTML file in `/substack_html_pages` directory that allows you to browse the posts. + +You can either hardcode the substack URL and the number of posts you'd like to save into the top of the file, or +specify them as command line arguments. + +## Features + +- Converts Substack posts into Markdown files. +- Generates an HTML file to browse Markdown files. +- Supports free and premium content (with subscription). - Supports scraping a single post URL directly (for example, `/p/my-post`). - Can download Substack-hosted images locally with `--images`. -- The HTML interface allows sorting essays by date or likes. - -## Installation - -Clone the repo and install the dependencies: - -```bash -git clone https://github.com/yourusername/substack_scraper.git -cd substack_scraper - -# # Optinally create a virtual environment -# python -m venv venv -# # Activate the virtual environment -# .\venv\Scripts\activate # Windows -# source venv/bin/activate # Linux - -pip install -r requirements.txt -``` - -For the premium scraper, update the `config.py` in the root directory with your Substack email and password: - -```python -EMAIL = "your-email@domain.com" -PASSWORD = "your-password" -``` - -You'll also need Microsoft Edge installed for the Selenium webdriver. - -## Usage - -Specify the Substack URL and the directory to save the posts to: - -You can hardcode your desired Substack URL and the number of posts you'd like to save into the top of the file and run: -```bash -python substack_scraper.py -``` - -For free Substack sites: - -```bash -python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts -``` - -For premium Substack sites: - -```bash -python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --premium -``` +- Can fetch each post's comment thread with `--comments` (public threads free; paid-only threads with `--premium`), + rendered into the per-post HTML page and surfaced as a sortable "Comments" column in the index. +- The HTML interface allows sorting essays by date, likes, or comments. +- Cross-platform browser/driver support, including **macOS (Intel & Apple Silicon)** with automatic + `chromedriver`/`edgedriver` download and crash recovery. +- Optional MDX frontmatter output for static-site generators. +- **Substack-styled HTML rendering** โ€” per-post pages match the classic Substack look (Spectral + serif, orange links, centered title/subtitle/byline header). Re-render existing posts with + `render_posts.py` without re-scraping (see [Substack-style Rendering](#substack-style-rendering)). + +## System Architecture + +```mermaid +flowchart TD + CLI["CLI entrypoint
substack_scraper.py main()"] + Scraper{Premium content?} + Free["SubstackScraper
requests-based, no auth"] + Premium["PremiumSubstackScraper
Selenium driver + login"] + BM["BrowserManager
version detection, driver download,
crash recovery, periodic restart"] + Core["BaseSubstackScraper
scrape_posts() loop"] + + FetchBody["get_url_soup()
fetch post HTML body"] + Extract["extract_post_data()
title, date, likes,
comment_count, body"] + ImgOpt{--images?} + ImgProc["process_markdown_images()
download + rewrite links"] + CmtOpt{--comments?} + CmtFetch["scrape_comments_for_post()
public JSON API + cache"] + + SaveMD["save_to_file()
*.md"] + Render["_write_post_html()
body + comments โ†’ *.html"] + Index["generate_html_file()
data/*.json + author index"] + + CLI --> Scraper + Scraper -->|free| Free + Scraper -->|paid| Premium + Premium --> BM + Free --> Core + Premium --> Core + Core --> FetchBody + FetchBody --> Extract + Extract --> ImgOpt + ImgOpt -->|yes| ImgProc --> CmtOpt + ImgOpt -->|no| CmtOpt + CmtOpt -->|yes| CmtFetch --> SaveMD + CmtOpt -->|no| SaveMD + SaveMD --> Render + Render --> Index +``` + +```mermaid +flowchart LR + substack["Substack site"] + subgraph Local["On-disk outputs"] + MD[("substack_md_files/")] + HTML[("substack_html_pages/")] + IMG[("substack_images/")] + CMT[("substack_comments/")] + DATA[("data/.json")] + end + substack --> MD + substack --> HTML + substack --> IMG + substack --> CMT + MD & HTML & CMT --> DATA +``` + +## Installation + +Clone the repo and install the dependencies: + +```bash +git clone https://github.com/yourusername/substack_scraper.git +cd substack_scraper + +# # Optionally create a virtual environment +# python -m venv venv +# # Activate the virtual environment +# .\venv\Scripts\activate # Windows +# source venv/bin/activate # Linux + +pip install -r requirements.txt +``` + +For the premium scraper, update the `config.py` in the root directory with your Substack email and password: + +```python +EMAIL = "your-email@domain.com" +PASSWORD = "your-password" +``` + +For premium scraping you need a Chromium-based browser installed. **Chrome** is the default (recommended); **Microsoft +Edge** is also supported. On macOS the scraper auto-detects Chrome/Edge under `/Applications`, including Apple Silicon +(`mac-arm64`) vs Intel (`mac-x64`) builds, and downloads the matching driver. + +## Usage + +Specify the Substack URL and the directory to save the posts to: + +You can hardcode your desired Substack URL and the number of posts you'd like to save into the top of the file and run: +```bash +python substack_scraper.py +``` + +For free Substack sites: + +```bash +python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts +``` + +For premium Substack sites (Chrome is recommended): + +```bash +python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --premium +``` To scrape a single post directly: @@ -84,34 +150,197 @@ To download images locally and rewrite markdown image links: ```bash python substack_scraper.py --url https://example.substack.com --images ``` - -To scrape a specific number of posts: - -```bash -python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --number 5 -``` - -To emit YAML frontmatter (title/subtitle/date/author/image) suitable for MDX sites -instead of the default `# title` / `**Likes:** N` header: - -```bash -python substack_scraper.py --url https://example.substack.com --frontmatter mdx -``` - -### Online Version - -For a hassle-free experience without any local setup: - -1. Visit [Substack Reader](https://www.substacktools.com/reader) -2. Enter the Substack URL you want to read or export -3. Click "Go" to instantly view the content or "Export" to download Markdown files - -This online version provides a user-friendly web interface for reading and exporting free Substack articles, with no installation required. However, please note that the online version currently does not support exporting premium content. For full functionality, including premium content export, please use the local script as described above. Built by @Firevvork. - -## Viewing Markdown Files in Browser - -To read the Markdown files in your browser, install the [Markdown Viewer](https://chromewebstore.google.com/detail/markdown-viewer/ckkdlimhmcjmikdlpkmbgfkaikojcbjk) -browser extension. But note, we also save the files as HTML for easy viewing, -just set the toggle to HTML on the author homepage. - -Or you can use our [Substack Reader](https://www.substacktools.com/reader) online tool, which allows you to read and export free Substack articles directly in your browser. (Note: Premium content export is currently only available in the local script version) + +### Comments + +Fetch each post's comment thread with `--comments`. Threads are cached under +`substack_comments//.comments.json` (so re-runs are cheap and make no extra network calls), rendered into +the individual post's HTML page, and counted in the sortable index. + +- **Public threads** need no authentication โ€” works with the free scraper. +- **Paid-only threads** require `--premium` (the logged-in browser's cookies authenticate the comment API). + +![Comments](./assets/images/comments.png) + +```bash +# Public comment threads (free scraper) +python substack_scraper.py --url https://example.substack.com --comments + +# Paid-only comment threads +python substack_scraper.py --url https://example.substack.com --premium --comments + +# Newest first instead of Substack's "best" ordering +python substack_scraper.py --url https://example.substack.com --comments --comments-sort most_recent_first +``` + +To scrape a specific number of posts: + +```bash +python substack_scraper.py --url https://example.substack.com --directory /path/to/save/posts --number 5 +``` + +To emit YAML frontmatter (title/subtitle/date/author/image) suitable for MDX sites +instead of the default `# title` / `**Likes:** N` header: + +```bash +python substack_scraper.py --url https://example.substack.com --frontmatter mdx +``` + +### Workflow + +```mermaid +sequenceDiagram + participant User + participant Scraper + participant Substack + participant FS as Local filesystem + + User->>Scraper: run with --url [--premium] [--comments] + Scraper->>Substack: discover post URLs (archive feed) + loop each post + Scraper->>Substack: fetch post body (free: requests / premium: Selenium) + Substack-->>Scraper: post HTML + Scraper->>Scraper: extract metadata + render markdown + opt --images + Scraper->>Substack: download images + Scraper->>FS: substack_images// + end + opt --comments + Scraper->>Substack: GET posts/{slug} โ†’ post id + Scraper->>Substack: GET post/{id}/comments + Substack-->>Scraper: nested comment thread + Scraper->>FS: substack_comments//.comments.json (cache) + end + Scraper->>FS: *.md + *.html (body + comments baked in) + end + Scraper->>FS: data/.json + author index page +``` + +## CLI Reference + +| Flag | Default | Description | +| --- | --- | --- | +| `-u, --url` | โ€” | Base URL of the Substack site (or a single `/p/` URL). | +| `--render-only` | off | Skip scraping; re-render existing Markdown into Substack-styled HTML (no network). | +| `--render-all` | off | With `--render-only`, re-render every author under `data/`. | +| `-d, --directory` | `substack_md_files` | Directory for scraped Markdown files. | +| `--html-directory` | `substack_html_pages` | Directory for scraped HTML files. | +| `-n, --number` | `0` (all) | Number of posts to scrape. | +| `--images` | off | Download Substack-hosted images and rewrite markdown links. | +| `--comments` | off | Fetch each post's comment thread (public free; paid-only needs `--premium`). | +| `--comments-sort` | `best` | Comment order: `best` or `most_recent_first`. | +| `--frontmatter` | `legacy` | `legacy` (`# title` / `**Likes:** N`) or `mdx` (YAML frontmatter). | +| `-p, --premium` | off | Use browser automation for premium/paid content. | +| `--browser` | `chrome` | `chrome` or `edge` for premium scraping. | +| `--headless` | off | Run the browser headless (may trigger CAPTCHA). | +| `--persistent-profile` | off | Reuse a browser profile to persist login across runs. | +| `--skip-login` | off | Skip login (use with `--persistent-profile` after first login). | +| `--chrome-driver-path` | โ€” | Manual `chromedriver` path (troubleshooting). | +| `--edge-driver-path` | โ€” | Manual `msedgedriver` path (troubleshooting). | +| `--chrome-path` | โ€” | Manual Chrome executable path. | +| `--edge-path` | โ€” | Manual Edge executable path. | +| `--user-agent` | โ€” | Custom User-Agent string. | + +## Browser & Driver Support + +The `BrowserManager` resolves a working WebDriver through a layered fallback: + +```mermaid +flowchart TD + Start([create_driver]) --> Detect["Detect installed browser
macOS: /Applications, Win: registry, Linux: PATH"] + Detect --> S1{"Explicit
--*-driver-path?"} + S1 -->|yes| Use1["Use it"] + S1 -->|no| S2["Download matching driver to local cache
(mac-arm64 / mac-x64 / mac64_m1 / win64 / linux64)"] + S2 --> S3{"webdriver_manager?"} + S3 -->|yes| Use2["Use it (rejects non-executable artifacts)"] + S3 -->|no| S4["Selenium Manager (last resort)"] + Use1 & Use2 & S4 --> Probe{"Dead session
or 40-post interval?"} + Probe -->|yes| Recreate["_recreate_driver()
persistent profile โ†’ no re-login"] + Recreate --> Probe + Probe -->|no| Done([scraping]) +``` + +Notable resilience features: + +- **macOS driver detection** โ€” locates Chrome/Edge under `/Applications` (and `~/Applications`) and picks the correct + platform build for Apple Silicon vs Intel. +- **Crash recovery** โ€” on `InvalidSessionIdException`, `tab crashed`, `chrome not reachable`, etc., the driver is + recreated and the same URL is retried. With a persistent profile this requires no re-login. +- **Periodic restart** โ€” every 40 posts the driver is restarted to shed accumulated state/leaks. + +## Backfilling Comment Counts + +Posts scraped before comment support was added have no `comment_count` and render as "0 Comments" in the index. +`backfill_comment_counts.py` resolves each post's count from Substack's public posts API, writes it back into +`data/.json`, and regenerates the author's HTML index. Posts that already have a count are skipped by default. + +```bash +# Backfill one or more authors (defaults to https://.substack.com/) +python backfill_comment_counts.py aischoollibrarian +python backfill_comment_counts.py news --base-url https://aakashgupta.substack.com/ + +# Re-fetch even posts that already have a count +python backfill_comment_counts.py aischoollibrarian --force +``` + +## Substack-style Rendering + +Per-post HTML pages are rendered to match the **classic default Substack article look**: a +Spectral serif body (19px / 1.6 line-height), left-aligned text, orange (`#ff6719`) links on a +white background, a ~728px single column, and a centered header block (cover image โ†’ title โ†’ +subtitle โ†’ author ยท date byline). Title/subtitle/date are pulled out of the body into a +structured header, so they're arranged like a real Substack post rather than inlined as markdown. + +![Rendered post](./assets/images/post.png) + +The rendered HTML is decoupled from scraping, so you can re-apply the theme to already-scraped +posts at any time **without re-scraping or any network calls**. Metadata comes from the on-disk +Markdown (legacy or MDX frontmatter) plus `data/.json`; cached comment threads (from +`--comments`) are baked in when present. + +```bash +# Re-render one or more authors from existing Markdown + data JSON +python render_posts.py aischoollibrarian +python render_posts.py aischoollibrarian news + +# Re-render every author found under data/ +python render_posts.py --all +``` + +The same is available as a flag on the main CLI: + +```bash +python substack_scraper.py --render-only aischoollibrarian +python substack_scraper.py --render-only --render-all +``` + +Newly scraped posts automatically use the Substack-styled renderer. The Markdown/MDX source +files are unchanged โ€” only the HTML output is restyled. + +## Output Layout + +``` +substack_md_files// # scraped posts (.md) +substack_html_pages// # scraped posts (.html, body + optional comments) +substack_images// # downloaded images (--images) +substack_comments// # cached comment threads (--comments) +data/.json # index data (title, date, likes, comment_count, links) +``` + +## Online Version + +For a hassle-free experience without any local setup: + +1. Visit [Substack Reader](https://www.substacktools.com/reader) +2. Enter the Substack URL you want to read or export +3. Click "Go" to instantly view the content or "Export" to download Markdown files + +This online version provides a user-friendly web interface for reading and exporting free Substack articles, with no installation required. However, please note that the online version currently does not support exporting premium content. For full functionality, including premium content export, please use the local script as described above. Built by @Firevvork. + +## Viewing Markdown Files in Browser + +To read the Markdown files in your browser, install the [Markdown Viewer](https://chromewebstore.google.com/detail/markdown-viewer/ckkdlimhmcjmikdlpkmbgfkaikojcbjk) +browser extension. But note, we also save the files as HTML for easy viewing, +just set the toggle to HTML on the author homepage. + +Or you can use our [Substack Reader](https://www.substacktools.com/reader) online tool, which allows you to read and export free Substack articles directly in your browser. (Note: Premium content export is currently only available in the local script version) diff --git a/assets/css/essay-styles.css b/assets/css/essay-styles.css index 00d93bb4..02ea6c78 100644 --- a/assets/css/essay-styles.css +++ b/assets/css/essay-styles.css @@ -1,111 +1,317 @@ +/* ========================================================================== + Classic default-theme Substack article styling. + + Typography values were measured from a live Substack post body via DevTools + (Spectral serif, ~19px/1.6 body, left-aligned, orange links on white) plus + the documented default theme tokens (--print_pop #ff6719, etc.). + ========================================================================== */ + +:root { + --print_pop: #ff6719; + --print_pop_darken: #e45a13; + --web_bg_color: #ffffff; + --print_on_web_bg_color: #363737; + --print_secondary_on_web_bg_color: #868787; + --background_contrast_1: #f6f5f2; + --background_contrast_2: #eeece6; + --body_width: 728px; + --body_font: 'Spectral', Georgia, 'Times New Roman', serif; + --ui_font: 'Spectral', Georgia, 'Times New Roman', serif; +} + body { - font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; + font-family: var(--body_font); margin: 0; - padding: 0; - background-color: rgba(249, 245, 228, 0.51); - color: #333; + padding: 16px 0; + background-color: var(--web_bg_color); + color: var(--print_on_web_bg_color); line-height: 1.6; + -webkit-font-smoothing: antialiased; + text-rendering: optimizeLegibility; } .markdown-content { - max-width: 800px; - margin: 20px auto; - padding: 20px; - background: rgba(255, 255, 250); - box-shadow: 0 2px 4px 0 rgba(0,0,0,.2); + max-width: var(--body_width); + margin: 0 auto; + padding: 0 16px; + background: var(--web_bg_color); + /* Substack is flat: no card shadow / box. */ } -h1, h2, h3, h4, h5, h6 { - padding: 20px 0; +/* ---------- Post header (title / subtitle / byline / cover) ---------- */ + +.markdown-content .post-header { text-align: center; + margin-bottom: 40px; + padding-top: 8px; +} + +.markdown-content .post-cover { + width: 100%; + height: auto; + border-radius: 4px; + margin: 0 0 32px; + display: block; +} + +.markdown-content .post-title { + font-family: var(--body_font); + font-size: 36px; + font-weight: 700; + line-height: 1.15; + letter-spacing: -0.01em; + color: var(--print_on_web_bg_color); + margin: 0 0 10px; + padding: 0; + text-align: center; +} + +.markdown-content .post-subtitle { + font-family: var(--body_font); + font-size: 20px; + font-weight: 400; + line-height: 1.4; + color: var(--print_secondary_on_web_bg_color); + margin: 0 0 20px; + padding: 0; + text-align: center; +} + +.markdown-content .post-byline { + font-size: 13px; + line-height: 1.4; + color: var(--print_secondary_on_web_bg_color); margin: 0; - color: #333; } -h1 { font-size: 2.5em; } -h2 { font-size: 2em; } -h3 { font-size: 1.75em; } -h4 { font-size: 1.5em; } -h5 { font-size: 1.25em; } -h6 { font-size: 1em; } +.markdown-content .post-byline a { + color: var(--print_secondary_on_web_bg_color); +} + +/* ---------- Body content ---------- */ .markdown-content p { - max-width: 70ch; - margin: 0 auto 1em; - text-align: justify; + font-family: var(--body_font); + font-size: 19px; + line-height: 1.6; + color: var(--print_on_web_bg_color); + margin: 0 0 1.4em; + text-align: left; +} + +.markdown-content h1, +.markdown-content h2, +.markdown-content h3, +.markdown-content h4, +.markdown-content h5, +.markdown-content h6 { + font-family: var(--body_font); + color: var(--print_on_web_bg_color); + font-weight: 700; + line-height: 1.25; + text-align: left; + margin: 1.8em 0 0.6em; + padding: 0; } -.markdown-content ul, .markdown-content ol { - max-width: 65ch; - margin: 0 auto 1em; - padding-left: 2em; +.markdown-content h1 { font-size: 28px; } +.markdown-content h2 { font-size: 28px; } +.markdown-content h3 { font-size: 22px; } +.markdown-content h4 { font-size: 18px; } +.markdown-content h5 { font-size: 16px; } +.markdown-content h6 { font-size: 14px; text-transform: uppercase; letter-spacing: 0.04em; } + +.markdown-content ul, +.markdown-content ol { + max-width: 100%; + margin: 0 0 1.4em; + padding-left: 1.6em; + font-size: 19px; + line-height: 1.6; } .markdown-content li { margin-bottom: 0.5em; } +.markdown-content li > ul, +.markdown-content li > ol { + margin-top: 0.5em; + margin-bottom: 0; +} + .markdown-content a { - color: #0066cc; + color: var(--print_pop); text-decoration: none; } .markdown-content a:hover { + color: var(--print_pop_darken); text-decoration: underline; } +.markdown-content strong { + font-weight: 700; + color: var(--print_on_web_bg_color); +} + .markdown-content blockquote { - border-left: 4px solid #ddd; - padding-left: 1em; - margin: 0 auto 1em; - max-width: 65ch; + border-left: 3px solid var(--background_contrast_2); + margin: 1.6em 0; + padding: 0.2em 0 0.2em 1.2em; + font-family: var(--body_font); font-style: italic; - color: #555; + font-size: 19px; + line-height: 1.6; + color: var(--print_on_web_bg_color); + max-width: 100%; +} + +.markdown-content blockquote p:last-child { + margin-bottom: 0; } .markdown-content code { - background-color: #f4f4f4; - padding: 0.2em 0.4em; + background-color: var(--background_contrast_1); + padding: 0.15em 0.35em; border-radius: 3px; - font-family: monospace; + font-family: 'SF Mono', Menlo, Monaco, Consolas, 'Courier New', monospace; + font-size: 0.88em; + color: var(--print_on_web_bg_color); } .markdown-content pre { - background-color: #f4f4f4; - padding: 1em; - border-radius: 5px; + background-color: var(--background_contrast_1); + padding: 14px 16px; + border-radius: 4px; overflow-x: auto; - max-width: 70ch; - margin: 0 auto 1em; + margin: 0 0 1.4em; + line-height: 1.45; +} + +.markdown-content pre code { + background-color: transparent; + padding: 0; + font-size: 0.85em; } .markdown-content img { max-width: 100%; height: auto; display: block; - margin: 0 auto 1em; + margin: 0 0 1.4em; + border-radius: 4px; +} + +.markdown-content hr { + border: none; + border-top: 1px solid var(--background_contrast_2); + margin: 2.2em auto; + width: 100%; + height: 0; } .markdown-content table { border-collapse: collapse; - margin: 0 auto 1em; - max-width: 70ch; + margin: 0 0 1.4em; + width: 100%; + font-size: 17px; } -.markdown-content th, .markdown-content td { - border: 1px solid #ddd; - padding: 8px; +.markdown-content th, +.markdown-content td { + border: 1px solid var(--background_contrast_2); + padding: 8px 12px; text-align: left; } .markdown-content th { - background-color: #f4f4f4; - font-weight: bold; + background-color: var(--background_contrast_1); + font-weight: 700; } -.markdown-content hr { - border: none; - border-top: 1px solid #ddd; - margin: 2em auto; - max-width: 70ch; -} \ No newline at end of file +/* ---------- Comments section (per-post page) ---------- */ +.markdown-content .comments { + border-top: 1px solid var(--background_contrast_2); + margin-top: 3em; + padding-top: 1.5em; +} + +.markdown-content .comments > h2 { + font-size: 22px; + text-align: left; + padding: 0; + margin: 0 0 1em; +} + +.markdown-content .comment-thread, +.markdown-content .comment-children { + list-style: none; + padding-left: 0; + margin: 0; +} + +.markdown-content .comment-children { + padding-left: 1.5em; + border-left: 2px solid var(--background_contrast_2); + margin-left: 0.5em; +} + +.markdown-content .comment { + margin-bottom: 1.25em; +} + +.markdown-content .comment-header { + display: flex; + align-items: center; + gap: 0.5em; + flex-wrap: wrap; + font-size: 14px; + color: var(--print_secondary_on_web_bg_color); +} + +.markdown-content .comment-avatar { + width: 28px; + height: 28px; + border-radius: 50%; + object-fit: cover; +} + +.markdown-content .comment-author { + font-weight: 600; + color: var(--print_on_web_bg_color); +} + +.markdown-content .comment-author-flag { + font-size: 11px; + background: var(--background_contrast_1); + color: var(--print_secondary_on_web_bg_color); + padding: 0.1em 0.45em; + border-radius: 3px; +} + +.markdown-content .comment-date { + color: var(--print_secondary_on_web_bg_color); +} + +.markdown-content .comment-reactions { + color: var(--print_pop); +} + +.markdown-content .comment-body { + margin: 0.25em 0 0.5em; + text-align: left; +} + +.markdown-content .comment-body p { + font-size: 16px; + margin: 0 0 0.7em; +} + +.markdown-content .comment-body p:first-child { + margin-top: 0; +} + +.markdown-content .comment-body p:last-child { + margin-bottom: 0; +} diff --git a/assets/css/style.css b/assets/css/style.css index c2956085..e7eda0cf 100644 --- a/assets/css/style.css +++ b/assets/css/style.css @@ -1,18 +1,27 @@ +:root { + --print_pop: #ff6719; + --print_pop_darken: #e45a13; + --print_on_web_bg_color: #363737; + --print_secondary_on_web_bg_color: #868787; + --background_contrast_1: #f6f5f2; +} + body { - font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; + font-family: 'Spectral', Georgia, 'Times New Roman', serif; margin: 0; padding: 0; - background-color: rgba(249, 245, 228, 0.51); - color: #333; + background-color: #ffffff; + color: var(--print_on_web_bg_color); + -webkit-font-smoothing: antialiased; } h1 { padding: 20px; text-align: center; - margin-bottom: 20px; + margin: 0 0 20px; font-size: 2.5em; - margin: 0; - color: #333; + font-weight: 700; + color: var(--print_on_web_bg_color); } #essays-container { @@ -35,23 +44,28 @@ h1 { } #essays-container li a { - color: #000; + color: var(--print_on_web_bg_color); text-decoration: none; - font-size: 24px; - font-weight: bold; + font-size: 22px; + font-weight: 700; display: block; } +#essays-container li a:hover { + color: var(--print_pop); + text-decoration: underline; +} + #essays-container li .subtitle { - color: #555; - font-size: 18px; - margin: 10px 0; - font-weight: normal; + color: var(--print_secondary_on_web_bg_color); + font-size: 17px; + margin: 8px 0; + font-weight: 400; } #essays-container li .metadata { - color: #888; - font-size: 16px; + color: var(--print_secondary_on_web_bg_color); + font-size: 14px; margin-bottom: 5px; font-style: normal; } @@ -63,19 +77,21 @@ h1 { } button { - background-color: #cbd2f5; + background-color: var(--print_pop); border: none; - color: white; + color: #ffffff; padding: 10px 20px; text-align: center; text-decoration: none; display: inline-block; - font-size: 16px; + font-family: 'Spectral', Georgia, 'Times New Roman', serif; + font-size: 15px; + font-weight: 600; margin: 4px 2px; cursor: pointer; border-radius: 4px; } button:hover { - background-color: #a7b2e1; + background-color: var(--print_pop_darken); } diff --git a/assets/images/comments.png b/assets/images/comments.png new file mode 100644 index 00000000..68968779 Binary files /dev/null and b/assets/images/comments.png differ diff --git a/assets/images/main-page.png b/assets/images/main-page.png new file mode 100644 index 00000000..e9b04ff9 Binary files /dev/null and b/assets/images/main-page.png differ diff --git a/assets/images/post.png b/assets/images/post.png new file mode 100644 index 00000000..e6dcd3d6 Binary files /dev/null and b/assets/images/post.png differ diff --git a/assets/js/populate-essays.js b/assets/js/populate-essays.js index 0c700383..d7c376d7 100644 --- a/assets/js/populate-essays.js +++ b/assets/js/populate-essays.js @@ -1,5 +1,6 @@ let sortLikesAscending = false; let sortDatesAscending = false; +let sortCommentsAscending = false; let showHTML = true; function sortEssaysByDate(data) { @@ -15,13 +16,22 @@ function sortEssaysByLikes(data) { ? a.like_count - b.like_count : b.like_count - a.like_count); } + +function sortEssaysByComments(data) { + sortCommentsAscending = !sortCommentsAscending; // Toggle the sort order + return data.sort((a, b) => { + const ac = a.comment_count || 0; + const bc = b.comment_count || 0; + return sortCommentsAscending ? ac - bc : bc - ac; + }); +} function populateEssays(data) { const essaysContainer = document.getElementById('essays-container'); const list = data.map(essay => `
  • ${essay.title}
    ${essay.subtitle}
    - +
  • `).join(''); essaysContainer.innerHTML = `
      ${list}
    `; @@ -45,13 +55,28 @@ document.addEventListener('DOMContentLoaded', () => { showHTML = false; // Default to showing markdown as there won't be any html files in older versions } - document.getElementById('sort-by-date').addEventListener('click', () => { - populateEssays(sortEssaysByDate([...essaysData])); - }); + // Guard each sort button so a stale/older HTML template missing one of + // them doesn't crash this handler and leave the page blank. + const sortByDateBtn = document.getElementById('sort-by-date'); + if (sortByDateBtn) { + sortByDateBtn.addEventListener('click', () => { + populateEssays(sortEssaysByDate([...essaysData])); + }); + } - document.getElementById('sort-by-likes').addEventListener('click', () => { - populateEssays(sortEssaysByLikes([...essaysData])); - }); + const sortByLikesBtn = document.getElementById('sort-by-likes'); + if (sortByLikesBtn) { + sortByLikesBtn.addEventListener('click', () => { + populateEssays(sortEssaysByLikes([...essaysData])); + }); + } + + const sortByCommentsBtn = document.getElementById('sort-by-comments'); + if (sortByCommentsBtn) { + sortByCommentsBtn.addEventListener('click', () => { + populateEssays(sortEssaysByComments([...essaysData])); + }); + } populateEssays(essaysData); }); diff --git a/author_template.html b/author_template.html index c37d1017..e875196b 100644 --- a/author_template.html +++ b/author_template.html @@ -4,6 +4,9 @@ <!-- AUTHOR_NAME --> + + + @@ -12,6 +15,7 @@

    +
    diff --git a/backfill_comment_counts.py b/backfill_comment_counts.py new file mode 100644 index 00000000..dfd81031 --- /dev/null +++ b/backfill_comment_counts.py @@ -0,0 +1,106 @@ +#!/usr/bin/env python3 +"""Backfill ``comment_count`` into existing ``data/.json`` files. + +The author index HTML pages sort and display comment counts, but posts scraped +before comment counting was added (or without ``--comments``) have no +``comment_count`` field and render as "0 Comments" even when the post has +comments. + +This script resolves each post's comment count from Substack's public posts API +(``{base}/api/v1/posts/{slug}``), writes it back into the JSON, and regenerates +the author's HTML index page. + +Posts that already have a ``comment_count`` (e.g. scraped with ``--comments``, +which yields the accurate recursive count) are skipped by default. + +Usage: + python backfill_comment_counts.py aischoollibrarian + python backfill_comment_counts.py news --base-url https://aakashgupta.substack.com/ + python backfill_comment_counts.py aischoollibrarian natesnewsletter # multiple authors +""" +from __future__ import annotations + +import argparse +import json +import os +import time + +from substack_scraper import JSON_DATA_DIR, get_post_id_from_slug, generate_html_file + + +def slug_from_entry(entry: dict) -> str | None: + """Derive a post slug from an essay entry's local file path.""" + for key in ("file_link", "html_link"): + link = entry.get(key) + if link: + base = os.path.basename(link) + stem, _ = os.path.splitext(base) + if stem: + return stem + return None + + +def backfill_author(author: str, base_url: str, force: bool = False) -> None: + json_path = os.path.join(JSON_DATA_DIR, f"{author}.json") + if not os.path.exists(json_path): + print(f"[SKIP] No data file for author '{author}' ({json_path})") + return + if not base_url.endswith("/"): + base_url += "/" + + with open(json_path, "r", encoding="utf-8") as f: + data = json.load(f) + + updated = 0 + skipped = 0 + failed = 0 + for entry in data: + if not force and entry.get("comment_count") is not None: + skipped += 1 + continue + slug = slug_from_entry(entry) + if not slug: + failed += 1 + continue + try: + meta = get_post_id_from_slug(base_url, slug) + except Exception as exc: # network/parse error + print(f"[ERR] {slug}: {exc}") + failed += 1 + continue + if meta is None: + failed += 1 + continue + _post_id, comment_count, _perms = meta + entry["comment_count"] = comment_count + updated += 1 + # Be gentle with the public API to avoid "too many requests" blocks. + time.sleep(0.4) + + with open(json_path, "w", encoding="utf-8") as f: + json.dump(data, f, ensure_ascii=False, indent=4) + + print(f"[{author}] updated={updated} skipped(existing)={skipped} failed={failed} of {len(data)}") + generate_html_file(author) + print(f"[{author}] regenerated HTML index") + + +def main() -> None: + parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument("authors", nargs="+", help="Author name(s) (= data/.json stem)") + parser.add_argument( + "--base-url", + default=None, + help="Publication base URL. Defaults to https://.substack.com/. " + "Provide when the writer name differs from the subdomain (e.g. news -> aakashgupta.substack.com).", + ) + parser.add_argument("--force", action="store_true", help="Re-fetch even posts that already have a comment_count.") + args = parser.parse_args() + + for author in args.authors: + base_url = args.base_url or f"https://{author}.substack.com/" + backfill_author(author, base_url, force=args.force) + + +if __name__ == "__main__": + main() diff --git a/render_posts.py b/render_posts.py new file mode 100644 index 00000000..ab064e84 --- /dev/null +++ b/render_posts.py @@ -0,0 +1,163 @@ +#!/usr/bin/env python3 +"""Re-render existing scraped posts into the classic Substack look (no re-scraping). + +Posts are scraped once and saved as Markdown/MDX (``substack_md_files/``) with +structured metadata in ``data/.json``. This script re-renders the per-post HTML +pages (``substack_html_pages/``) from that on-disk content using the classic +Substack theme (Spectral, orange links, centered header) โ€” so the new look can be applied +to posts scraped before the structured renderer existed, or re-applied after the theme is +tweaked, **without any network calls**. + +When a cached comment thread exists (``substack_comments//.comments.json``, +created by ``--comments``), it is baked into the rendered page. + +Usage: + python render_posts.py aischoollibrarian # re-render one author + python render_posts.py aischoollibrarian news # re-render multiple authors + python render_posts.py --all # every data/*.json author +""" +from __future__ import annotations + +import argparse +import json +import os + +from substack_scraper import ( + COMMENTS_DATA_DIR, + JSON_DATA_DIR, + count_all_comments, + render_post_to_html_file, +) + + +def detect_frontmatter_format(md_text: str) -> str: + """``mdx`` if the file starts with a YAML frontmatter block, else ``legacy``.""" + return "mdx" if md_text.lstrip().startswith("---") else "legacy" + + +def render_author(author: str, force: bool = True) -> None: + """Re-render every post for one author from its on-disk md + data json.""" + json_path = os.path.join(JSON_DATA_DIR, f"{author}.json") + if not os.path.exists(json_path): + print(f"[SKIP] No data file for author '{author}' ({json_path})") + return + + with open(json_path, "r", encoding="utf-8") as f: + entries = json.load(f) + + rendered = 0 + skipped = 0 + failed = 0 + + for entry in entries: + md_path = entry.get("file_link") + html_path = entry.get("html_link") + if not md_path or not html_path: + failed += 1 + continue + if not os.path.exists(md_path): + print(f"[SKIP] Missing source markdown: {md_path}") + failed += 1 + continue + + # Skip if the HTML is already newer than its source (unless --force). + if not force and os.path.exists(html_path) \ + and os.path.getmtime(html_path) >= os.path.getmtime(md_path): + skipped += 1 + continue + + try: + with open(md_path, "r", encoding="utf-8") as f: + md_text = f.read() + fmt = detect_frontmatter_format(md_text) + _meta, body = _split_for_render(md_text, fmt) + + # Structured metadata comes from the data json (richer than what's in the md). + meta = { + "title": entry.get("title", ""), + "subtitle": entry.get("subtitle", ""), + "author": entry.get("author", ""), + "date": entry.get("date", ""), + "cover_image": entry.get("cover_image", ""), + } + + comments_list = _load_cached_comments(author, md_path) + render_post_to_html_file( + html_path, body, meta=meta, comments_list=comments_list, frontmatter_format=fmt + ) + rendered += 1 + except Exception as exc: # keep going; one bad post shouldn't abort the author + print(f"[ERR] {md_path}: {exc}") + failed += 1 + + print(f"[{author}] rendered={rendered} skipped(up-to-date)={skipped} failed={failed} of {len(entries)}") + + +def _split_for_render(md_text: str, fmt: str): + """Local import to avoid a circular reference at module load time.""" + from substack_scraper import split_metadata_and_body + return split_metadata_and_body(md_text, fmt) + + +def _load_cached_comments(author: str, md_path: str) -> list: + """Load a cached comment thread for a post, if present (no network).""" + slug, _ = os.path.splitext(os.path.basename(md_path)) + cache_path = os.path.join(COMMENTS_DATA_DIR, author, f"{slug}.comments.json") + if not os.path.exists(cache_path): + return [] + try: + with open(cache_path, "r", encoding="utf-8") as f: + comments = json.load(f) + if isinstance(comments, list) and count_all_comments(comments) > 0: + return comments + except (json.JSONDecodeError, OSError) as exc: + print(f"[WARN] Corrupt comments cache {cache_path}: {exc}") + return [] + + +def discover_authors() -> list: + """Return every author stem found under ``data/`` (``.json``).""" + if not os.path.isdir(JSON_DATA_DIR): + return [] + return sorted( + os.path.splitext(name)[0] + for name in os.listdir(JSON_DATA_DIR) + if name.endswith(".json") + ) + + +def main() -> None: + parser = argparse.ArgumentParser( + description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter + ) + parser.add_argument( + "authors", nargs="*", help="Author name(s) (= data/.json stem)." + ) + parser.add_argument( + "--all", action="store_true", help="Re-render every author found in data/." + ) + parser.add_argument( + "--force", action="store_true", default=True, + help="Overwrite HTML even when it is newer than its source markdown (default).", + ) + parser.add_argument( + "--no-force", dest="force", action="store_false", + help="Skip posts whose HTML is already newer than the source markdown.", + ) + args = parser.parse_args() + + if args.all: + authors = discover_authors() + if not authors: + print("[SKIP] No authors found under data/.") + for author in authors: + render_author(author, force=args.force) + elif args.authors: + for author in args.authors: + render_author(author, force=args.force) + else: + parser.error("Provide one or more authors, or use --all.") + + +if __name__ == "__main__": + main() diff --git a/substack_scraper.py b/substack_scraper.py index 5ff195b0..798ce799 100644 --- a/substack_scraper.py +++ b/substack_scraper.py @@ -30,7 +30,12 @@ from selenium.webdriver.edge.service import Service as EdgeService from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC -from selenium.common.exceptions import SessionNotCreatedException, TimeoutException, WebDriverException +from selenium.common.exceptions import ( + SessionNotCreatedException, + TimeoutException, + WebDriverException, + InvalidSessionIdException, +) from config import EMAIL, PASSWORD @@ -39,9 +44,11 @@ BASE_MD_DIR: str = "substack_md_files" BASE_HTML_DIR: str = "substack_html_pages" BASE_IMAGE_DIR: str = "substack_images" +COMMENTS_DATA_DIR: str = "substack_comments" HTML_TEMPLATE: str = "author_template.html" JSON_DATA_DIR: str = "data" NUM_POSTS_TO_SCRAPE: int = 0 +COMMENTS_SORT: str = "best" def resolve_image_url(url: str) -> str: @@ -143,6 +150,458 @@ def replace_image(match): return re.sub(pattern, replace_image, md_content) +# ============================================================================= +# COMMENT HELPERS +# ============================================================================= + +def _request_json_with_rate_limit_retry( + url: str, + session: Optional[requests.Session] = None, + max_attempts: int = 5, +) -> Optional[dict]: + """GET a JSON URL, retrying on rate-limiting with exponential backoff. + + Returns parsed JSON on success, or ``None`` on non-retryable failure / after exhausting + retries. A shared ``session`` may be passed to carry auth cookies (premium scraper). + """ + requester = session if session is not None else requests + for attempt in range(1, max_attempts + 1): + try: + response = requester.get(url) + if not response.ok: + text_lower = (response.text or "").lower() + if response.status_code == 429 or "too many requests" in text_lower: + if attempt == max_attempts: + print(f"[WARN] Max attempts reached for URL: {url}. Too many requests.") + return None + base = 2 ** attempt + delay = base + random.uniform(-0.2 * base, 0.2 * base) + print(f"[{attempt}/{max_attempts}] Too many requests. Retrying in {delay:.2f}s...") + sleep(delay) + continue + return None + return response.json() + except Exception as e: + if attempt == max_attempts: + print(f"[WARN] Error fetching JSON {url}: {e}") + return None + base = 2 ** attempt + delay = base + random.uniform(-0.2 * base, 0.2 * base) + sleep(delay) + return None + + +def get_post_id_from_slug( + base_url: str, slug: str, session: Optional[requests.Session] = None +) -> Optional[Tuple[int, int, str]]: + """Fetch post metadata needed for comments via the public post API. + + Returns ``(post_id, comment_count, write_comment_permissions)`` or ``None`` on failure. + """ + api_url = f"{base_url}api/v1/posts/{slug}" + data = _request_json_with_rate_limit_retry(api_url, session=session) + if not isinstance(data, dict): + return None + post_id = data.get("id") + if post_id is None: + return None + permissions = data.get("write_comment_permissions", "") or "" + if isinstance(permissions, list): + permissions = ",".join(str(p) for p in permissions) + return ( + int(post_id), + int(data.get("comment_count", 0) or 0), + str(permissions), + ) + + +def fetch_comments( + base_url: str, + post_id: int, + sort: str = COMMENTS_SORT, + session: Optional[requests.Session] = None, +) -> List[dict]: + """Fetch the (nested) comment thread for a post via the public comments API.""" + api_url = f"{base_url}api/v1/post/{post_id}/comments?all_comments=true&sort={sort}" + data = _request_json_with_rate_limit_retry(api_url, session=session) + if not isinstance(data, dict): + return [] + return data.get("comments", []) or [] + + +def count_all_comments(comments: List[dict]) -> int: + """Count comments including nested children.""" + total = 0 + for c in comments: + total += 1 + total += count_all_comments(c.get("children", []) or []) + return total + + +def _format_comment_date(date_str: str) -> str: + if not date_str: + return "" + try: + date_obj = datetime.fromisoformat(date_str.replace("Z", "+00:00")) + return date_obj.strftime("%b %d, %Y") + except (ValueError, TypeError): + return date_str + + +def _format_reactions(comment: dict) -> str: + reactions = comment.get("reactions") or {} + if isinstance(reactions, dict) and reactions: + total = sum(int(v) for v in reactions.values() if isinstance(v, (int, float))) + else: + total = int(comment.get("reaction_count", 0) or 0) + return f" โค {total}" if total > 0 else "" + + +def render_comments_markdown(comments: List[dict], depth: int = 0) -> str: + """Render a nested comment thread as Markdown. + + Top-level comments are rendered as normal blocks; replies are nested as blockquotes. + """ + if not comments: + return "" + blocks = [] + for c in comments: + name = c.get("name", "Anonymous") or "Anonymous" + is_author = bool((c.get("metadata") or {}).get("is_author")) + author_flag = " (author)" if is_author else "" + + date_str = _format_comment_date(c.get("date", "")) + edited = " (edited)" if c.get("edited_at") else "" + reactions = _format_reactions(c) + + header = f"**{name}**{author_flag}" + meta_bits = [] + if date_str: + meta_bits.append(date_str + edited) + tail = " ยท ".join(meta_bits) + reactions + if tail.strip(): + header += f" ยท {tail.strip()}" + + body = (c.get("body", "") or "").strip() + if c.get("deleted"): + body = "_[comment deleted]_" + block = f"{header}\n\n{body}" if body else header + + children = c.get("children", []) or [] + if children: + child_md = render_comments_markdown(children, depth + 1) + if child_md: + indented = "\n".join(f"> {ln}" if ln else ">" for ln in child_md.splitlines()) + block += f"\n\n{indented}" + + blocks.append(block) + + return "\n\n".join(blocks) + + +def _html_escape(text: str) -> str: + """Minimal HTML escaping for safe insertion into attribute/text content.""" + if not text: + return "" + return ( + text.replace("&", "&") + .replace("<", "<") + .replace(">", ">") + .replace('"', """) + ) + + +def render_comments_html(comments: List[dict]) -> str: + """Render a nested comment thread as semantic HTML for embedding in a post page. + + Returns ``""`` for an empty thread (caller should then render no section). Each comment + renders an avatar (when available), author name, optional "Author" flag, date, reaction + count, the comment body (converted from its markdown-ish text to HTML), and any nested + replies inside ``
      ``. + """ + if not comments: + return "" + + def render_one(c: dict) -> str: + name = c.get("name", "Anonymous") or "Anonymous" + is_author = bool((c.get("metadata") or {}).get("is_author")) + date_str = _format_comment_date(c.get("date", "")) + photo_url = (c.get("photo_url") or "").strip() + + reactions = c.get("reactions") or {} + if isinstance(reactions, dict) and reactions: + react_total = sum(int(v) for v in reactions.values() if isinstance(v, (int, float))) + else: + react_total = int(c.get("reaction_count", 0) or 0) + + avatar_html = ( + f'' + if photo_url else "" + ) + author_flag_html = ( + 'Author' if is_author else "" + ) + date_html = ( + f'{_html_escape(date_str)}' if date_str else "" + ) + reactions_html = ( + f'โค {react_total}' if react_total > 0 else "" + ) + + header = ( + f'
      {avatar_html}' + f'{_html_escape(name)}' + f'{author_flag_html}{date_html}{reactions_html}
      ' + ) + + body = (c.get("body", "") or "").strip() + if c.get("deleted"): + body = "_[comment deleted]_" + body_html = md_to_html_static(body) + + children = c.get("children", []) or [] + children_html = "" + if children: + inner = "".join(f"
    • {render_one(child)}
    • " for child in children) + children_html = f'
        {inner}
      ' + + return f'{header}
      {body_html}
      {children_html}' + + total = count_all_comments(comments) + items = "".join(f"
    • {render_one(c)}
    • " for c in comments) + return ( + f'
      ' + f'

      Comments ({total})

      ' + f'
        {items}
      ' + f'
      ' + ) + + +def md_to_html_static(md_content: str) -> str: + """Module-level Markdown โ†’ HTML conversion (mirrors BaseSubstackScraper.md_to_html).""" + if not md_content: + return "" + return markdown.markdown(md_content, extensions=['extra']) + + +# ============================================================================= +# STRUCTURED HEADER RENDERING (classic Substack article look) +# ============================================================================= + +def _format_header_date(date_str: str) -> str: + """Format an ISO date (``YYYY-MM-DD``) for display in the byline. + + Falls back to the raw string (including the sentinel ``"Date not found"``). + Mirrors the legacy header date formatting in ``combine_metadata_and_content``. + """ + if not date_str: + return "" + try: + return datetime.fromisoformat(date_str).strftime("%b %d, %Y") + except ValueError: + return date_str + + +def build_post_header(meta: dict) -> str: + """Render a Substack-style centered post header from structured metadata. + + ``meta`` is a dict that may contain: ``title``, ``subtitle``, ``author``, + ``date`` (ISO ``YYYY-MM-DD``), ``cover_image``. Any missing/empty field is + simply omitted. Returns an HTML string for a ``
      `` + block, or ``""`` if there is nothing to render (no title, subtitle, author + or date). The cover image is shown above the title when present. + """ + if not isinstance(meta, dict) or not meta: + return "" + + cover_image = (meta.get("cover_image") or "").strip() + title = (meta.get("title") or "").strip() + subtitle = (meta.get("subtitle") or "").strip() + author = (meta.get("author") or "").strip() + date_str = (meta.get("date") or "").strip() + + if not (title or subtitle or author or date_str): + return "" + + parts = [] + if cover_image: + parts.append( + f'' + ) + if title: + parts.append(f'

      {_html_escape(title)}

      ') + if subtitle: + parts.append(f'

      {_html_escape(subtitle)}

      ') + + byline_bits = [] + if author: + byline_bits.append(_html_escape(author)) + display_date = _format_header_date(date_str) + if display_date: + byline_bits.append(_html_escape(display_date)) + if byline_bits: + parts.append( + f'' + ) + + return f'
      {"".join(parts)}
      ' + + +def split_metadata_and_body(md_content: str, frontmatter_format: str = "legacy") -> Tuple[dict, str]: + """Inverse of ``combine_metadata_and_content``: recover metadata + body. + + Used by ``render_posts.py`` to re-render on-disk markdown into the structured + Substack look without re-scraping. Returns ``(meta_dict, body_md)`` where + ``meta_dict`` has keys ``title``, ``subtitle``, ``author``, ``date``, + ``cover_image``, and ``like_count`` (any that aren't found are absent). + + - ``mdx``: strip the leading YAML frontmatter (``---\\nโ€ฆ\\n---``) and parse it. + - ``legacy``: strip the leading ``# title`` line, an optional ``## subtitle``, + the ``****`` line, and the ``**Likes:** N`` line. + + If the content doesn't match the expected pattern, it is returned unchanged as + the body with an empty metadata dict (so rendering is never destructive). + """ + if not md_content: + return {}, "" + + meta: dict = {} + + if frontmatter_format == "mdx": + m = re.match(r'^---\s*\n(.*?)\n---\s*\n?(.*)$', md_content, re.DOTALL) + if m: + for line in m.group(1).splitlines(): + if ":" not in line: + continue + key, _, raw = line.partition(":") + key = key.strip() + val = raw.strip() + # Strip surrounding YAML quotes. + if (val.startswith('"') and val.endswith('"')) \ + or (val.startswith("'") and val.endswith("'")): + val = val[1:-1] + if key and val: + if key == "image": + meta["cover_image"] = val + else: + meta[key] = val + return meta, m.group(2).lstrip("\n") + + # legacy format + lines = md_content.split("\n") + idx = 0 + + # Title: "# ..." + if idx < len(lines) and lines[idx].startswith("# "): + meta["title"] = lines[idx][2:].strip() + idx += 1 + # Skip the blank line after the title. + if idx < len(lines) and lines[idx].strip() == "": + idx += 1 + # Subtitle: "## ..." + if idx < len(lines) and lines[idx].startswith("## "): + meta["subtitle"] = lines[idx][3:].strip() + idx += 1 + if idx < len(lines) and lines[idx].strip() == "": + idx += 1 + # Date: "**...**" + date_match = re.match(r'^\*\*(.+?)\*\*$', lines[idx]) if idx < len(lines) else None + if date_match: + meta["date"] = date_match.group(1).strip() + idx += 1 + if idx < len(lines) and lines[idx].strip() == "": + idx += 1 + # Likes: "**Likes:** N" + likes_match = re.match(r'^\*\*Likes:\*\*\s*(\d+)\s*$', lines[idx]) if idx < len(lines) else None + if likes_match: + meta["like_count"] = likes_match.group(1) + idx += 1 + if idx < len(lines) and lines[idx].strip() == "": + idx += 1 + + body = "\n".join(lines[idx:]).lstrip("\n") + return meta, body + + +def build_post_document( + html_dir: str, + body_html: str, + comments_html: str = "", + header_html: str = "", + title: Optional[str] = None, +) -> str: + """Assemble the full HTML document for a post page (classic Substack shell). + + Shared by the scraper's ``save_to_html_file`` and the standalone ``render_posts.py`` + so both produce identical markup: Spectral webfont, the essay stylesheet, an optional + structured header above the body, and optional comments below it. + """ + css_path = os.path.relpath("./assets/css/essay-styles.css", html_dir) + css_path = css_path.replace("\\", "/") + + doc_title = _html_escape(title) if title else "Markdown Content" + header_block = f"\n {header_html}" if header_html else "" + comments_block = f"\n {comments_html}" if comments_html else "" + + return f""" + + + + + + {doc_title} + + + + + + +
      {header_block} + {body_html}{comments_block} +
      + + + """ + + +def render_post_to_html_file( + html_filepath: str, + body_md: str, + meta: Optional[dict] = None, + comments_list: Optional[list] = None, + frontmatter_format: str = "legacy", +) -> None: + """Re-render a post page from markdown + structured metadata + cached comments. + + Network-free: reads only local content. Used by ``render_posts.py`` (and the + ``--render-only`` CLI path) to apply the classic Substack look to posts that were + scraped before the structured renderer existed, without re-scraping. + + ``body_md`` is rendered as the post body; ``meta`` (title/subtitle/author/date/ + cover_image) becomes the header. If ``body_md`` still contains a legacy/mdx header + (i.e. it's the full on-disk markdown), pass ``split=True``... otherwise it is split + via ``split_metadata_and_body`` automatically when ``meta`` is empty. + """ + body = body_md + header_meta = meta or {} + + # If no structured meta was supplied, try to recover it from the markdown itself. + if not header_meta: + header_meta, body = split_metadata_and_body(body_md, frontmatter_format) + + body_html = md_to_html_static(body) + comments_html = render_comments_html(comments_list) if comments_list else "" + header_html = build_post_header(header_meta) + title = header_meta.get("title") + + html_dir = os.path.dirname(html_filepath) + document = build_post_document( + html_dir, body_html, comments_html=comments_html, header_html=header_html, title=title + ) + with open(html_filepath, "w", encoding="utf-8") as f: + f.write(document) + + def extract_main_part(url: str) -> str: parts = urlparse(url).netloc.split('.') return parts[1] if parts[0] == 'www' else parts[0] @@ -224,17 +683,26 @@ def get_browser_version(browser: str) -> Optional[str]: except Exception: pass else: # macOS/Linux - try: - result = subprocess.run( - ['google-chrome', '--version'], - capture_output=True, text=True, timeout=10 - ) - if result.returncode == 0: - match = re.search(r'(\d+\.\d+\.\d+\.\d+)', result.stdout) - if match: - version = match.group(1) - except Exception: - pass + candidates = [ + "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", + os.path.expanduser("~/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"), + "google-chrome", # Linux PATH fallback + "chromium", + "chromium-browser", + ] + for candidate in candidates: + try: + result = subprocess.run( + [candidate, "--version"], + capture_output=True, text=True, timeout=10 + ) + if result.returncode == 0: + match = re.search(r'(\d+\.\d+\.\d+\.\d+)', result.stdout) + if match: + version = match.group(1) + break + except Exception: + continue elif browser == 'edge': if os.name == 'nt': # Windows @@ -255,17 +723,24 @@ def get_browser_version(browser: str) -> Optional[str]: except Exception: pass else: # macOS/Linux - try: - result = subprocess.run( - ['microsoft-edge', '--version'], - capture_output=True, text=True, timeout=10 - ) - if result.returncode == 0: - match = re.search(r'(\d+\.\d+\.\d+\.\d+)', result.stdout) - if match: - version = match.group(1) - except Exception: - pass + candidates = [ + "/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge", + os.path.expanduser("~/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge"), + "microsoft-edge", # Linux PATH fallback + ] + for candidate in candidates: + try: + result = subprocess.run( + [candidate, "--version"], + capture_output=True, text=True, timeout=10 + ) + if result.returncode == 0: + match = re.search(r'(\d+\.\d+\.\d+\.\d+)', result.stdout) + if match: + version = match.group(1) + break + except Exception: + continue return version @@ -321,9 +796,25 @@ def get_user_data_dir(browser: str) -> str: if not os.path.exists(base_dir): os.makedirs(base_dir) return os.path.join(base_dir, f'{browser}_profile') - + + @staticmethod + def _driver_platform(browser: str) -> str: + """Return the Chrome-for-Testing / Edge driver platform string for the current OS/arch. + + Detects ARM vs Intel on macOS so Apple Silicon gets mac-arm64 instead of mac-x64. + """ + if os.name == 'nt': + return 'win64' + if sys.platform == 'darwin': + is_arm = os.uname().machine == 'arm64' + if browser == 'edge': + # Edge driver uses a different naming convention: mac64 vs mac64_m1 + return 'mac64_m1' if is_arm else 'mac64' + return 'mac-arm64' if is_arm else 'mac-x64' + return 'linux64' + @classmethod - def download_driver_with_requests(cls, browser: str, browser_version: str) -> Optional[str]: + def download_driver_with_requests(cls, browser: str, browser_version: str, quiet: bool = False) -> Optional[str]: """ Download the correct driver directly using requests. This bypasses webdriver_manager issues and gives us full control. @@ -344,12 +835,14 @@ def download_driver_with_requests(cls, browser: str, browser_version: str) -> Op if os.path.exists(driver_path): cached_version = cls.get_driver_version(driver_path) if cached_version and cls.versions_compatible(browser_version, cached_version): - print(f"Using cached chromedriver {cached_version}") + if not quiet: + print(f"Using cached chromedriver {cached_version}") return driver_path try: # Get the latest driver version for this Chrome version - print(f"Fetching Chrome driver info for version {major_version}...") + if not quiet: + print(f"Fetching Chrome driver info for version {major_version}...") # Try the Chrome for Testing endpoints endpoints = [ @@ -366,7 +859,7 @@ def download_driver_with_requests(cls, browser: str, browser_version: str) -> Op if resp.ok: driver_version = resp.text.strip() # Construct download URL - platform = 'win64' if os.name == 'nt' else ('mac-x64' if sys.platform == 'darwin' else 'linux64') + platform = cls._driver_platform(browser) download_url = f"https://storage.googleapis.com/chrome-for-testing-public/{driver_version}/{platform}/chromedriver-{platform}.zip" except Exception: pass @@ -382,7 +875,7 @@ def download_driver_with_requests(cls, browser: str, browser_version: str) -> Op if driver_version.startswith(major_version): downloads = stable.get('downloads', {}).get('chromedriver', []) - platform = 'win64' if os.name == 'nt' else ('mac-x64' if sys.platform == 'darwin' else 'linux64') + platform = cls._driver_platform(browser) for d in downloads: if d.get('platform') == platform: download_url = d.get('url') @@ -392,7 +885,8 @@ def download_driver_with_requests(cls, browser: str, browser_version: str) -> Op print(f"Could not find chromedriver download URL for Chrome {major_version}") return None - print(f"Downloading chromedriver {driver_version}...") + if not quiet: + print(f"Downloading chromedriver {driver_version}...") resp = requests.get(download_url, timeout=120) if not resp.ok: print(f"Download failed: HTTP {resp.status_code}") @@ -416,7 +910,8 @@ def download_driver_with_requests(cls, browser: str, browser_version: str) -> Op # Make executable on Unix if os.name != 'nt': os.chmod(target_path, 0o755) - print(f"[OK] Chromedriver downloaded to: {target_path}") + if not quiet: + print(f"[OK] Chromedriver downloaded to: {target_path}") return target_path print("Could not find chromedriver in downloaded archive") @@ -434,15 +929,17 @@ def download_driver_with_requests(cls, browser: str, browser_version: str) -> Op if os.path.exists(driver_path): cached_version = cls.get_driver_version(driver_path) if cached_version and cls.versions_compatible(browser_version, cached_version): - print(f"Using cached msedgedriver {cached_version}") + if not quiet: + print(f"Using cached msedgedriver {cached_version}") return driver_path try: # Get latest Edge driver version - print(f"Fetching Edge driver info for version {major_version}...") + if not quiet: + print(f"Fetching Edge driver info for version {major_version}...") # Edge driver download URL pattern - platform = 'win64' if os.name == 'nt' else ('mac64' if sys.platform == 'darwin' else 'linux64') + platform = cls._driver_platform(browser) # Try to get the exact version version_url = f"https://msedgedriver.azureedge.net/LATEST_RELEASE_{major_version}" @@ -457,7 +954,8 @@ def download_driver_with_requests(cls, browser: str, browser_version: str) -> Op download_url = f"https://msedgedriver.azureedge.net/{driver_version}/edgedriver_{platform}.zip" - print(f"Downloading msedgedriver {driver_version}...") + if not quiet: + print(f"Downloading msedgedriver {driver_version}...") resp = requests.get(download_url, timeout=120) if not resp.ok: print(f"Download failed: HTTP {resp.status_code}") @@ -478,7 +976,8 @@ def download_driver_with_requests(cls, browser: str, browser_version: str) -> Op target.write(source.read()) if os.name != 'nt': os.chmod(target_path, 0o755) - print(f"[OK] msedgedriver downloaded to: {target_path}") + if not quiet: + print(f"[OK] msedgedriver downloaded to: {target_path}") return target_path print("Could not find msedgedriver in downloaded archive") @@ -499,6 +998,7 @@ def create_driver( browser_path: Optional[str] = None, user_agent: Optional[str] = None, use_persistent_profile: bool = False, + quiet: bool = False, ) -> webdriver.Remote: """ Creates a WebDriver instance with smart fallback logic. @@ -509,6 +1009,10 @@ def create_driver( 3. Download driver directly to our cache (bypasses PATH issues) 4. Fall back to webdriver_manager 5. Fall back to Selenium Manager + + Args: + quiet: Suppress informational prints (used during periodic driver restarts). + Warnings/errors are still printed. """ browser = browser.lower() if browser not in cls.SUPPORTED_BROWSERS: @@ -525,7 +1029,8 @@ def create_driver( # Detect browser version browser_version = cls.get_browser_version(browser) - print(f"Detected {browser.title()} version: {browser_version or 'unknown'}") + if not quiet: + print(f"Detected {browser.title()} version: {browser_version or 'unknown'}") if not browser_version: print(f"WARNING: Could not detect {browser.title()} version. Make sure it's installed.") @@ -548,7 +1053,8 @@ def create_driver( if use_persistent_profile: profile_dir = cls.get_user_data_dir(browser) options.add_argument(f"user-data-dir={profile_dir}") - print(f"Using persistent profile at: {profile_dir}") + if not quiet: + print(f"Using persistent profile at: {profile_dir}") # Common options for stability options.add_argument("--no-sandbox") @@ -561,7 +1067,8 @@ def create_driver( # Strategy 1: Explicit driver path if driver_path and os.path.exists(driver_path): try: - print(f"Using explicit driver path: {driver_path}") + if not quiet: + print(f"Using explicit driver path: {driver_path}") driver_version = cls.get_driver_version(driver_path) if driver_version: print(f"Driver version: {driver_version}") @@ -580,11 +1087,13 @@ def create_driver( # Strategy 2: Download to our cache (primary method - bypasses PATH issues) if browser_version: - print(f"\nDownloading driver to local cache (bypasses system PATH)...") + if not quiet: + print(f"\nDownloading driver to local cache (bypasses system PATH)...") try: - downloaded_path = cls.download_driver_with_requests(browser, browser_version) + downloaded_path = cls.download_driver_with_requests(browser, browser_version, quiet=quiet) if downloaded_path and os.path.exists(downloaded_path): - print(f"Using downloaded driver: {downloaded_path}") + if not quiet: + print(f"Using downloaded driver: {downloaded_path}") if browser == 'chrome': service = ChromeService(executable_path=downloaded_path) return webdriver.Chrome(service=service, options=options) @@ -596,29 +1105,43 @@ def create_driver( print(f"[FAIL] Direct download failed: {e}") # Strategy 3: webdriver_manager with explicit path - print("\nTrying webdriver_manager...") + if not quiet: + print("\nTrying webdriver_manager...") try: if browser == 'chrome': from webdriver_manager.chrome import ChromeDriverManager from webdriver_manager.core.os_manager import ChromeType mgr = ChromeDriverManager() driver_path_wdm = mgr.install() - print(f"webdriver_manager installed driver to: {driver_path_wdm}") - service = ChromeService(executable_path=driver_path_wdm) - return webdriver.Chrome(service=service, options=options) + if not quiet: + print(f"webdriver_manager installed driver to: {driver_path_wdm}") + # Reject known non-executable artifacts (THIRD_PARTY_NOTICES / LICENSE) returned + # by webdriver_manager on some platforms before trying to exec them. + if driver_path_wdm and os.path.isfile(driver_path_wdm) \ + and not driver_path_wdm.endswith("THIRD_PARTY_NOTICES.chromedriver") \ + and not driver_path_wdm.endswith("LICENSE.chromedriver"): + service = ChromeService(executable_path=driver_path_wdm) + return webdriver.Chrome(service=service, options=options) + print(f"[SKIP] webdriver_manager returned a non-driver file, falling through.") else: from webdriver_manager.microsoft import EdgeChromiumDriverManager mgr = EdgeChromiumDriverManager() driver_path_wdm = mgr.install() - print(f"webdriver_manager installed driver to: {driver_path_wdm}") - service = EdgeService(executable_path=driver_path_wdm) - return webdriver.Edge(service=service, options=options) + if not quiet: + print(f"webdriver_manager installed driver to: {driver_path_wdm}") + if driver_path_wdm and os.path.isfile(driver_path_wdm) \ + and not driver_path_wdm.endswith("THIRD_PARTY_NOTICES.msedgedriver") \ + and not driver_path_wdm.endswith("LICENSE.msedgedriver"): + service = EdgeService(executable_path=driver_path_wdm) + return webdriver.Edge(service=service, options=options) + print(f"[SKIP] webdriver_manager returned a non-driver file, falling through.") except Exception as e: errors.append(f"webdriver_manager failed: {e}") print(f"[FAIL] webdriver_manager failed: {e}") # Strategy 4: Let Selenium Manager try (last resort) - print("\nTrying Selenium Manager (last resort)...") + if not quiet: + print("\nTrying Selenium Manager (last resort)...") try: if browser == 'chrome': return webdriver.Chrome(options=options) @@ -735,6 +1258,8 @@ def __init__( html_save_dir: str, download_images: bool = False, frontmatter_format: str = "legacy", + fetch_comments_flag: bool = False, + comments_sort: str = COMMENTS_SORT, ): if frontmatter_format not in ("legacy", "mdx"): raise ValueError("frontmatter_format must be 'legacy' or 'mdx'") @@ -766,6 +1291,13 @@ def __init__( self.download_images: bool = download_images self.image_dir = Path(BASE_IMAGE_DIR) / self.writer_name + self.fetch_comments: bool = fetch_comments_flag + self.comments_sort: str = comments_sort + self.comments_save_dir: str = os.path.join(COMMENTS_DATA_DIR, self.writer_name) + if self.fetch_comments and not os.path.exists(self.comments_save_dir): + os.makedirs(self.comments_save_dir) + print(f"Created comments directory {self.comments_save_dir}") + if self.is_single_post: self.post_urls: List[str] = [original_url] else: @@ -842,35 +1374,41 @@ def save_to_file(filepath: str, content: str) -> None: @staticmethod def md_to_html(md_content: str) -> str: """Converts Markdown to HTML.""" - return markdown.markdown(md_content, extensions=['extra']) + return md_to_html_static(md_content) + + def save_to_html_file( + self, + filepath: str, + content: str, + comments_html: str = "", + header_html: str = "", + title: Optional[str] = None, + ) -> None: + """Saves HTML content to a file with a link to the external CSS file. + + Renders the classic Substack article shell: Spectral webfont, the essay + stylesheet, and the body inside ``
      ``. - def save_to_html_file(self, filepath: str, content: str) -> None: - """Saves HTML content to a file with a link to an external CSS file.""" + - ``header_html`` (optional): a structured post header block (see + ``build_post_header``) rendered above the body. When omitted the body is + rendered as-is (legacy flat-markdown behaviour). + - ``title`` (optional): used for ````/document title. + - ``comments_html`` (optional): appended inside ``<main>`` after the body + (renders the fetched comment thread on the individual post page). + """ if not isinstance(filepath, str): raise ValueError("filepath must be a string") if not isinstance(content, str): raise ValueError("content must be a string") html_dir = os.path.dirname(filepath) - css_path = os.path.relpath("./assets/css/essay-styles.css", html_dir) - css_path = css_path.replace("\\", "/") - - html_content = f""" - <!DOCTYPE html> - <html lang="en"> - <head> - <meta charset="UTF-8"> - <meta name="viewport" content="width=device-width, initial-scale=1.0"> - <title>Markdown Content - - - -
      - {content} -
      - - - """ + html_content = build_post_document( + html_dir, + content, + comments_html=comments_html, + header_html=header_html, + title=title, + ) with open(filepath, 'w', encoding='utf-8') as file: file.write(html_content) @@ -940,11 +1478,14 @@ def combine_metadata_and_content( metadata += f"**Likes:** {like_count}\n\n" return metadata + content - def extract_post_data(self, soup: BeautifulSoup, url: str = "") -> Tuple[str, str, str, str, str, str, str]: + def extract_post_data(self, soup: BeautifulSoup, url: str = "") -> Tuple[str, str, str, str, str, str, str, str, str]: """Converts a Substack post soup to markdown. Returns: - ``(title, subtitle, author, date, cover_image, like_count, md_content)``. + ``(title, subtitle, author, date, cover_image, like_count, comment_count, + md_content, body_md)``. ``md_content`` is the body merged with the selected + frontmatter header (what gets saved to disk); ``body_md`` is the post body + alone, used by the structured HTML renderer. """ # Title title_element = soup.select_one("h1.post-title, h2") @@ -959,6 +1500,7 @@ def extract_post_data(self, soup: BeautifulSoup, url: str = "") -> Tuple[str, st date = "" author = "" cover_image = "" + comment_count = "0" script_tag = soup.find("script", {"type": "application/ld+json"}) if script_tag and script_tag.string: try: @@ -980,6 +1522,18 @@ def extract_post_data(self, soup: BeautifulSoup, url: str = "") -> Tuple[str, st cover_image = img.get("url", "") if isinstance(img, dict) else str(img) elif isinstance(images, dict): cover_image = images.get("url", "") + # Comment count: prefer the top-level field, then the CommentAction + # statistic in interactionStatistic (matches the public API value). + raw_cc = ld_json.get("comment_count") + if raw_cc is None: + for stat in ld_json.get("interactionStatistic") or []: + if not isinstance(stat, dict): + continue + if "CommentAction" in str(stat.get("interactionType", "")): + raw_cc = stat.get("userInteractionCount") + break + if raw_cc is not None and str(raw_cc).strip().lstrip("-").isdigit(): + comment_count = str(int(raw_cc)) except (json.JSONDecodeError, ValueError, KeyError): pass @@ -1024,7 +1578,7 @@ def extract_post_data(self, soup: BeautifulSoup, url: str = "") -> Tuple[str, st title, subtitle, date, author, cover_image, like_count, md, self.frontmatter_format ) - return title, subtitle, author, date, cover_image, like_count, md_content + return title, subtitle, author, date, cover_image, like_count, comment_count, md_content, md @abstractmethod def get_url_soup(self, url: str) -> str: @@ -1044,6 +1598,119 @@ def save_essays_data_to_json(self, essays_data: list) -> None: with open(json_path, 'w', encoding='utf-8') as f: json.dump(essays_data, f, ensure_ascii=False, indent=4) + def _get_session(self) -> Optional[requests.Session]: + """Return a requests session for JSON API calls (used for comments). + + Default is ``None`` (unauthenticated, free scraper). Premium scrapers override this + to build a session seeded with the logged-in browser's cookies so paid-only comment + threads can be fetched. + """ + return None + + def scrape_comments_for_post(self, url: str) -> Optional[dict]: + """Fetch (and cache) a post's comment thread, returning the comment list. + + Behavior: + - If ``{slug}.comments.json`` already exists on disk, load and return it (cache hit, + **no network calls**) โ€” this makes ``--comments`` cheap to re-run on already-scraped + publications. + - Otherwise resolve the post id via the public posts API, fetch the nested thread, + and persist it to ``{slug}.comments.json`` (raw payload, machine fidelity). + - Returns ``None`` when there are no comments, when a paid-only thread can't be read + without ``--premium`` (logged once), or on fetch failure. + - Returns ``{"total_comments": int, "comments": list, "json_path": str}`` on success. + + The rendered comments live in the individual post HTML page (see ``scrape_posts``); + no separate ``.comments.md`` file is written. + """ + slug = get_post_slug(url) if is_post_url(url) else (url.rstrip('/').split('/')[-1] or "unknown_post") + + json_path = os.path.join(self.comments_save_dir, f"{slug}.comments.json") + + # Cache hit: load from disk without hitting the network. + if os.path.exists(json_path): + try: + with open(json_path, "r", encoding="utf-8") as f: + cached = json.load(f) + if isinstance(cached, list): + return { + "total_comments": count_all_comments(cached), + "comments": cached, + "json_path": json_path, + } + except (json.JSONDecodeError, OSError) as e: + print(f"[WARN] Corrupt comments cache {json_path}: {e}. Refetching.") + + session = self._get_session() + + meta = get_post_id_from_slug(self.base_substack_url, slug, session=session) + if meta is None: + print(f"[SKIP] Could not resolve post metadata for comments: {url}") + return None + post_id, comment_count, permissions = meta + + if comment_count == 0: + return None + + # Small delay between the post-lookup and the comments call to avoid 429s. + sleep(random.uniform(1.0, 2.0)) + + comments = fetch_comments(self.base_substack_url, post_id, sort=self.comments_sort, session=session) + + if not comments: + # comment_count > 0 but empty payload โ†’ likely a paid-only thread without auth. + if permissions and "only_paid" in permissions: + print( + f"[SKIP] {comment_count} comments on {url} are paid-only " + f"โ€” rerun with --premium to fetch them." + ) + return None + + os.makedirs(self.comments_save_dir, exist_ok=True) + with open(json_path, "w", encoding="utf-8") as f: + json.dump(comments, f, ensure_ascii=False, indent=4) + + return { + "total_comments": count_all_comments(comments), + "comments": comments, + "json_path": json_path, + } + + def _write_post_html( + self, + html_filepath: str, + md_content: str, + comments_list: Optional[list] = None, + meta: Optional[dict] = None, + ) -> None: + """Convert markdown to HTML and write the post page, optionally baking in comments. + + Rendering modes: + + - **Structured** (``meta`` given): ``md_content`` is treated as the post *body* only. + Metadata (title/subtitle/author/date/cover) is rendered into a Substack-style + header via ``build_post_header`` and placed above the body, so the title/date are + no longer inlined in the body text. This is the classic Substack look. + - **Flat** (``meta`` is ``None``): ``md_content`` is the merged title+metadata+body + markdown and is rendered wholesale (the historical behaviour). Existing callers and + unit tests that pass the merged markdown rely on this path. + + When ``comments_list`` is non-empty, the thread is rendered (via + ``render_comments_html``) and injected into the page's ``
      `` after the article + body. An empty/None list produces a page without a comments section. + """ + body_html = self.md_to_html(md_content) + comments_html = render_comments_html(comments_list) if comments_list else "" + header_html = build_post_header(meta) if meta else "" + title = (meta or {}).get("title") if meta else None + self.save_to_html_file( + html_filepath, + body_html, + comments_html=comments_html, + header_html=header_html, + title=title, + ) + def scrape_posts(self, num_posts_to_scrape: int = 0) -> None: """Iterates over all posts and saves them as markdown and html files.""" essays_data = [] @@ -1060,12 +1727,20 @@ def scrape_posts(self, num_posts_to_scrape: int = 0) -> None: if not os.path.exists(md_filepath): soup = self.get_url_soup(url) if soup is None: + # Body is paywalled/unavailable. Still attempt comments so the + # free scraper can report paid-only threads (the post metadata + # API is public even when the body is not). + if self.fetch_comments: + try: + self.scrape_comments_for_post(url) + except Exception as ce: + pbar.write(f"[WARN] Comments failed for {url}: {ce}") total += 1 pbar.total = total pbar.refresh() continue - title, subtitle, author, date, cover_image, like_count, md = self.extract_post_data(soup, url) + title, subtitle, author, date, cover_image, like_count, comment_count, md, body_md = self.extract_post_data(soup, url) # Skip writing if extraction clearly failed โ€” leaves no stale file so reruns retry. content_element = soup.select_one("div.available-content") @@ -1086,23 +1761,81 @@ def scrape_posts(self, num_posts_to_scrape: int = 0) -> None: leave=False, ) as img_pbar: md = process_markdown_images(md, self.writer_name, slug, img_pbar) + # Re-apply to the raw body so the rendered HTML body uses the + # same local image paths. Downloads are skipped (files exist). + body_md = process_markdown_images(body_md, self.writer_name, slug) self.save_to_file(md_filepath, md) - html_content = self.md_to_html(md) - self.save_to_html_file(html_filepath, html_content) - essays_data.append({ + # Fetch comments BEFORE rendering the HTML so they can be baked into + # the individual post page. The .md source stays clean. + comments_result = None + if self.fetch_comments: + try: + comments_result = self.scrape_comments_for_post(url) + except Exception as ce: + pbar.write(f"[WARN] Comments failed for {url}: {ce}") + comments_list = comments_result["comments"] if comments_result else [] + + # Structured render: metadata becomes a Substack-style header, the + # body is rendered separately (no inlined # title / **date** block). + post_meta = { + "title": title, + "subtitle": subtitle, + "author": author, + "date": date, + "cover_image": cover_image, + } + self._write_post_html(html_filepath, body_md, comments_list, meta=post_meta) + + essay_entry = { "title": title, "subtitle": subtitle, "author": author, "date": date, "cover_image": cover_image, "like_count": like_count, + # Top-level comment count from the page's ld+json; always + # available (no extra request). When --comments scrapes the + # full thread below, total_comments overrides this with the + # recursive count (includes nested replies). + "comment_count": comment_count, "file_link": md_filepath, "html_link": html_filepath - }) + } + if comments_result: + essay_entry["comment_count"] = comments_result.get("total_comments") + essay_entry["total_comments"] = comments_result.get("total_comments") + essay_entry["comments_json_link"] = comments_result.get("json_path") + essays_data.append(essay_entry) + + # Periodic driver restart to shed accumulated state/leaks before they + # destabilize the renderer. Only applies to scrapers with a driver. + if hasattr(self, "_recreate_driver"): + self._scrape_counter += 1 + if self._scrape_counter % 40 == 0: + pbar.write("[MAINT] Periodic driver restart to shed state...") + self._recreate_driver() + sleep(random.uniform(4, 8)) else: pbar.write(f"File already exists: {md_filepath}") + # Fetch comments independently of the post body so --comments can be + # added to an already-scraped publication without re-scraping posts. + # Re-render the HTML (from the on-disk md) with comments baked in. + if self.fetch_comments: + try: + comments_result = self.scrape_comments_for_post(url) + comments_list = comments_result["comments"] if comments_result else [] + with open(md_filepath, "r", encoding="utf-8") as f: + md_text = f.read() + on_disk_meta, on_disk_body = split_metadata_and_body( + md_text, self.frontmatter_format + ) + self._write_post_html( + html_filepath, on_disk_body, comments_list, meta=on_disk_meta + ) + except Exception as ce: + pbar.write(f"[WARN] Comments failed for {url}: {ce}") except Exception as e: pbar.write(f"Error scraping post: {e}") @@ -1126,9 +1859,17 @@ def __init__( html_save_dir: str, download_images: bool = False, frontmatter_format: str = "legacy", + fetch_comments_flag: bool = False, + comments_sort: str = COMMENTS_SORT, ): super().__init__( - base_substack_url, md_save_dir, html_save_dir, download_images, frontmatter_format + base_substack_url, + md_save_dir, + html_save_dir, + download_images, + frontmatter_format, + fetch_comments_flag=fetch_comments_flag, + comments_sort=comments_sort, ) def get_url_soup(self, url: str, max_attempts: int = 5) -> Optional[BeautifulSoup]: @@ -1180,6 +1921,8 @@ def __init__( use_persistent_profile: bool = False, skip_login: bool = False, frontmatter_format: str = "legacy", + fetch_comments_flag: bool = False, + comments_sort: str = COMMENTS_SORT, ) -> None: """ Initialize the premium scraper with browser automation. @@ -1196,6 +1939,15 @@ def __init__( use_persistent_profile: Reuse browser profile across runs (saves login) skip_login: Skip login if using a pre-authenticated profile """ + # Store settings so the driver can be recreated with identical options after a crash. + self._browser = browser + self._headless = headless + self._driver_path = driver_path + self._browser_path = browser_path + self._user_agent = user_agent + self._base_substack_url = base_substack_url + self._scrape_counter = 0 + # Initialize driver before calling super().__init__ since that fetches URLs self.driver = BrowserManager.create_driver( browser=browser, @@ -1218,7 +1970,13 @@ def __init__( sleep(3) super().__init__( - base_substack_url, md_save_dir, html_save_dir, download_images, frontmatter_format + base_substack_url, + md_save_dir, + html_save_dir, + download_images, + frontmatter_format, + fetch_comments_flag=fetch_comments_flag, + comments_sort=comments_sort, ) def login(self) -> None: @@ -1263,10 +2021,77 @@ def is_login_failed(self) -> bool: error_container = self.driver.find_elements(By.ID, 'error-container') return len(error_container) > 0 and error_container[0].is_displayed() + def _get_session(self) -> Optional[requests.Session]: + """Build a requests.Session seeded with the logged-in browser's cookies. + + Lets the JSON comment endpoints authenticate as the current user, so paid-only + comment threads can be fetched. Cookies are read fresh each call so they stay valid + after a ``_recreate_driver()`` (persistent profile carries the session). + """ + session = requests.Session() + try: + cookies = self.driver.get_cookies() + except Exception as e: + print(f"[WARN] Could not read browser cookies for comments: {e}") + return session + for c in cookies: + try: + session.cookies.set( + c.get("name", ""), + c.get("value", ""), + domain=c.get("domain"), + path=c.get("path", "/"), + ) + except Exception: + continue + try: + session.headers.update({ + "User-Agent": self.driver.execute_script("return navigator.userAgent;"), + }) + except Exception: + pass + return session + + def _driver_is_dead(self) -> bool: + """Cheap liveness probe. Returns True if the driver/session is unusable.""" + try: + _ = self.driver.current_url + return False + except (WebDriverException, InvalidSessionIdException): + return True + + def _recreate_driver(self) -> None: + """Tear down the (possibly dead) driver and build a fresh one with the same settings. + + With a persistent profile, saved cookies mean NO re-login is required after recreation. + """ + try: + self.driver.quit() + except Exception: + pass + self.driver = BrowserManager.create_driver( + browser=self._browser, + headless=self._headless, + driver_path=self._driver_path, + browser_path=self._browser_path, + user_agent=self._user_agent, + use_persistent_profile=self.use_persistent_profile, + quiet=True, + ) + if not self.use_persistent_profile: + self.login() + else: + try: + self.driver.get(self._base_substack_url) + sleep(3) + except Exception: + pass + def get_url_soup(self, url: str, max_attempts: int = 5) -> Optional[BeautifulSoup]: """Gets soup from URL using logged-in Selenium driver, with retry on rate limiting.""" for attempt in range(1, max_attempts + 1): try: + sleep(random.uniform(4.0, 9.0)) self.driver.get(url) # Wait up to 20s for the post body (or a paywall marker) to appear, instead of a fixed sleep. @@ -1300,6 +2125,23 @@ def get_url_soup(self, url: str, max_attempts: int = 5) -> Optional[BeautifulSou except RuntimeError: raise except Exception as e: + msg = str(e).lower() + crashed = ( + isinstance(e, InvalidSessionIdException) + or any(s in msg for s in ( + "tab crashed", + "chrome not reachable", + "no such session", + "session not created", + "unable to connect to renderer", + "target window already closed", + )) + ) + if crashed: + print(f"[{attempt}/{max_attempts}] Tab/session crashed โ€” recreating driver: {e}") + self._recreate_driver() + sleep(random.uniform(5, 10)) # cool-down after recovery + continue # retry the SAME url on a fresh driver raise ValueError(f"Error fetching page: {url}. Error: {e}") from e raise RuntimeError(f"Failed to fetch page after {max_attempts} attempts: {url}") @@ -1335,6 +2177,9 @@ def parse_args() -> argparse.Namespace: # Subsequent runs (skip login, use saved session) python substack_scraper.py --url https://example.substack.com --premium --persistent-profile --skip-login + # Fetch comments too (public threads; add --premium for paid-only comments) + python substack_scraper.py --url https://example.substack.com --comments + # Use manually downloaded driver python substack_scraper.py --url https://example.substack.com --premium --chrome-driver-path /path/to/chromedriver """ @@ -1344,6 +2189,16 @@ def parse_args() -> argparse.Namespace: "-u", "--url", type=str, help="The base URL of the Substack site to scrape." ) + parser.add_argument( + "--render-only", action="store_true", + help="Skip scraping. Re-render existing on-disk Markdown into the Substack-styled " + "HTML (no network). Give authors as positional args or use --all. Equivalent to " + "running render_posts.py." + ) + parser.add_argument( + "--render-all", action="store_true", + help="With --render-only, re-render every author under data/." + ) parser.add_argument( "-d", "--directory", type=str, help="The directory to save scraped markdown posts." @@ -1361,6 +2216,17 @@ def parse_args() -> argparse.Namespace: action="store_true", help="Download images and update markdown to use local paths." ) + parser.add_argument( + "--comments", + action="store_true", + help="Fetch each post's comment thread as separate .comments.md/.comments.json files. " + "Public threads need no auth; paid-only threads require --premium." + ) + parser.add_argument( + "--comments-sort", type=str, default=COMMENTS_SORT, + choices=["best", "most_recent_first"], + help="Comment sort order (default: best)." + ) parser.add_argument( "--frontmatter", type=str, default="legacy", choices=["legacy", "mdx"], help="Header format for scraped markdown. 'legacy' (default) uses the original " @@ -1414,12 +2280,39 @@ def parse_args() -> argparse.Namespace: help="Custom user agent string." ) + parser.add_argument( + "authors", nargs="*", default=[], + help="Author name(s) for --render-only (= data/.json stem).", + ) + return parser.parse_args() +def _run_render_only(args: argparse.Namespace) -> None: + """Delegate the --render-only path to the standalone renderer (network-free).""" + import render_posts + + if args.render_all: + authors = render_posts.discover_authors() + if not authors: + print("[SKIP] No authors found under data/.") + return + for author in authors: + render_posts.render_author(author, force=True) + elif args.authors: + for author in args.authors: + render_posts.render_author(author, force=True) + else: + print("Provide one or more authors, or use --render-only --render-all.") + + def main(): args = parse_args() + if args.render_only: + _run_render_only(args) + return + if args.directory is None: args.directory = BASE_MD_DIR @@ -1449,6 +2342,8 @@ def main(): use_persistent_profile=args.persistent_profile, skip_login=args.skip_login, frontmatter_format=args.frontmatter, + fetch_comments_flag=args.comments, + comments_sort=args.comments_sort, ) else: scraper = SubstackScraper( @@ -1457,6 +2352,8 @@ def main(): html_save_dir=args.html_directory, download_images=args.images, frontmatter_format=args.frontmatter, + fetch_comments_flag=args.comments, + comments_sort=args.comments_sort, ) scraper.scrape_posts(args.number) @@ -1476,6 +2373,8 @@ def main(): use_persistent_profile=args.persistent_profile, skip_login=args.skip_login, frontmatter_format=args.frontmatter, + fetch_comments_flag=args.comments, + comments_sort=args.comments_sort, ) else: scraper = SubstackScraper( @@ -1484,6 +2383,8 @@ def main(): html_save_dir=args.html_directory, download_images=args.images, frontmatter_format=args.frontmatter, + fetch_comments_flag=args.comments, + comments_sort=args.comments_sort, ) scraper.scrape_posts(num_posts_to_scrape=NUM_POSTS_TO_SCRAPE) diff --git a/tests/test_substack_scraper.py b/tests/test_substack_scraper.py index ab8a8bbb..c102de5d 100644 --- a/tests/test_substack_scraper.py +++ b/tests/test_substack_scraper.py @@ -1,5 +1,6 @@ import os import sys +import json import shutil import pytest @@ -227,4 +228,562 @@ def test_scraper_initialization(tmp_path): assert scraper.writer_name == "example" assert os.path.isdir(os.path.join(md_dir, "example")) - assert os.path.isdir(os.path.join(html_dir, "example")) \ No newline at end of file + assert os.path.isdir(os.path.join(html_dir, "example")) + + +# --------------------------------------------------------------------------- +# Comment helpers +# --------------------------------------------------------------------------- + + +# 9. test_render_comments_markdown_flat +def test_render_comments_markdown_flat(): + comments = [{ + "name": "Alice", + "body": "Great post!", + "date": "2026-06-01T22:54:34.749Z", + "reactions": {"โค": 0}, + "metadata": {"is_author": False}, + "children": [], + }] + + rendered = ss.render_comments_markdown(comments) + + assert "**Alice**" in rendered + assert "Great post!" in rendered + assert "Jun 01, 2026" in rendered + # No reactions shown when count is 0 + assert "โค" not in rendered + # No nested blockquotes at top level + assert ">" not in rendered + + +# 10. test_render_comments_markdown_nested +def test_render_comments_markdown_nested(): + comments = [{ + "name": "Alice", + "body": "Parent comment", + "date": "2026-06-01T22:54:34.749Z", + "reactions": {}, + "metadata": {"is_author": False}, + "children": [{ + "name": "Bob", + "body": "Reply one\n\nReply two", + "date": "2026-06-02T15:36:32.599Z", + "reactions": {}, + "metadata": {"is_author": False}, + "children": [], + }], + }] + + rendered = ss.render_comments_markdown(comments) + + assert "Parent comment" in rendered + assert "**Bob**" in rendered + # Child is rendered as a blockquote under the parent + assert "> **Bob**" in rendered + # Multi-paragraph child body: every line prefixed with ">" + assert "> Reply one" in rendered + assert "> Reply two" in rendered + # Blank line inside the blockquote is rendered as ">" + assert "\n>\n>" in rendered + + +# 11. test_render_comments_markdown_author_flag_and_reactions +def test_render_comments_markdown_author_flag_and_reactions(): + comments = [{ + "name": "Janey Park", + "body": "Author reply", + "date": "2026-06-02T15:36:32.599Z", + "reactions": {"โค": 5}, + "metadata": {"is_author": True}, + "children": [], + }] + + rendered = ss.render_comments_markdown(comments) + + assert "**Janey Park** (author)" in rendered + assert "โค 5" in rendered + + +# 12. test_count_all_comments_recursive +def test_count_all_comments_recursive(): + comments = [ + {"children": [{"children": [{"children": []}]}]}, + {"children": [{"children": []}]}, + {"children": []}, + ] + + assert ss.count_all_comments(comments) == 6 + + assert ss.count_all_comments([]) == 0 + + +# 13. test_get_post_id_from_slug +@patch("substack_scraper._request_json_with_rate_limit_retry") +def test_get_post_id_from_slug(mock_request): + mock_request.return_value = { + "id": 201818433, + "comment_count": 2, + "write_comment_permissions": "everyone", + } + + result = ss.get_post_id_from_slug("https://example.substack.com/", "my-post") + + assert result == (201818433, 2, "everyone") + mock_request.assert_called_once_with( + "https://example.substack.com/api/v1/posts/my-post", session=None + ) + + +@patch("substack_scraper._request_json_with_rate_limit_retry") +def test_get_post_id_from_slug_returns_none_on_failure(mock_request): + mock_request.return_value = None + + assert ss.get_post_id_from_slug("https://example.substack.com/", "missing") is None + + +# 14. test_fetch_comments_returns_list +@patch("substack_scraper._request_json_with_rate_limit_retry") +def test_fetch_comments_returns_list(mock_request): + mock_request.return_value = {"comments": [{"id": 1}, {"id": 2}], "automod_hidden_comments": []} + + result = ss.fetch_comments("https://example.substack.com/", 201818433) + + assert result == [{"id": 1}, {"id": 2}] + mock_request.assert_called_once_with( + "https://example.substack.com/api/v1/post/201818433/comments?all_comments=true&sort=best", + session=None, + ) + + +@patch("substack_scraper._request_json_with_rate_limit_retry") +def test_fetch_comments_empty_payload(mock_request): + mock_request.return_value = {"comments": [], "automod_hidden_comments": []} + + assert ss.fetch_comments("https://example.substack.com/", 123) == [] + + +# 15. test_parse_args_supports_comments_flag +def test_parse_args_supports_comments_flag(monkeypatch): + monkeypatch.setattr( + sys, + "argv", + ["substack_scraper.py", "--url", "https://example.substack.com/p/post", "--comments"], + ) + + args = ss.parse_args() + + assert args.comments is True + assert args.comments_sort == "best" + + +# 16. test_scrape_comments_loads_from_cache +def test_scrape_comments_loads_from_cache(tmp_path): + """Cached comments JSON is loaded from disk WITHOUT any network calls.""" + scraper = DummyScraper( + "https://example.substack.com/p/my-post", + str(tmp_path / "md"), + str(tmp_path / "html"), + fetch_comments_flag=True, + ) + + # Pre-create the comments JSON cache with a real comment list. + os.makedirs(scraper.comments_save_dir, exist_ok=True) + slug = "my-post" + json_path = os.path.join(scraper.comments_save_dir, f"{slug}.comments.json") + cached = [{"id": 1, "name": "Alice", "body": "Hi", "children": []}] + with open(json_path, "w", encoding="utf-8") as f: + json.dump(cached, f) + + # Cache hit โ†’ no network (get_post_id_from_slug must NOT be called). + with patch("substack_scraper.get_post_id_from_slug") as mock_meta: + result = scraper.scrape_comments_for_post("https://example.substack.com/p/my-post") + + assert result is not None + assert result["comments"] == cached + assert result["total_comments"] == 1 + assert result["json_path"] == json_path + mock_meta.assert_not_called() + + +# --------------------------------------------------------------------------- +# Comment HTML rendering (for the individual post page) +# --------------------------------------------------------------------------- + + +# 17. test_render_comments_html_flat +def test_render_comments_html_flat(): + comments = [{ + "name": "Alice", + "body": "Great post!", + "date": "2026-06-01T22:54:34.749Z", + "reactions": {"โค": 0}, + "metadata": {"is_author": False}, + "children": [], + }] + + html = ss.render_comments_html(comments) + + assert '
      ' in html + assert "

      Comments (1)

      " in html + assert 'class="comment-author">Alice' in html + assert "Great post!" in html + assert "Jun 01, 2026" in html + # No avatar when photo_url missing; no reactions when count 0; not an author. + assert "comment-avatar" not in html + assert "comment-author-flag" not in html + assert "comment-reactions" not in html + + +# 18. test_render_comments_html_nested +def test_render_comments_html_nested(): + comments = [{ + "name": "Alice", + "body": "Parent", + "date": "2026-06-01T22:54:34.749Z", + "reactions": {}, + "metadata": {"is_author": False}, + "children": [{ + "name": "Bob", + "body": "Reply", + "date": "2026-06-02T15:36:32.599Z", + "reactions": {}, + "metadata": {"is_author": False}, + "children": [], + }], + }] + + html = ss.render_comments_html(comments) + + assert "Parent" in html + assert "Bob" in html + assert "Reply" in html + # Total includes the nested child. + assert "

      Comments (2)

      " in html + # Child is rendered inside a .comment-children list. + assert '
        ' in html + + +# 19. test_render_comments_html_author_flag_and_reactions +def test_render_comments_html_author_flag_and_reactions(): + comments = [{ + "name": "Janey Park", + "body": "Author reply", + "date": "2026-06-02T15:36:32.599Z", + "photo_url": "https://example.com/avatar.png", + "reactions": {"โค": 5}, + "metadata": {"is_author": True}, + "children": [], + }] + + html = ss.render_comments_html(comments) + + assert 'Author' in html + assert 'โค 5' in html + assert 'body

        ", comments_html="
        C
        ") + + content = Path(html_path).read_text(encoding="utf-8") + assert '
        C
        ' in content + assert "

        body

        " in content + + +def test_save_to_html_file_without_comments(tmp_path): + scraper = DummyScraper( + "https://example.substack.com/p/my-post", + str(tmp_path / "md"), + str(tmp_path / "html"), + ) + html_path = str(tmp_path / "out.html") + + scraper.save_to_html_file(html_path, "

        body

        ") + + content = Path(html_path).read_text(encoding="utf-8") + assert "

        body

        " in content + assert "comments" not in content + + +# 22. test_write_post_html_includes_comments_section +def test_write_post_html_includes_comments_section(tmp_path): + scraper = DummyScraper( + "https://example.substack.com/p/my-post", + str(tmp_path / "md"), + str(tmp_path / "html"), + ) + html_path = str(tmp_path / "out.html") + comments = [{"name": "Alice", "body": "Hi", "children": []}] + + scraper._write_post_html(html_path, "# Title\n\nBody", comments) + + content = Path(html_path).read_text(encoding="utf-8") + assert '
        ' in content + assert "Alice" in content + # The markdown body was converted to HTML. + assert "

        Title

        " in content or "

        " in content + + +# --------------------------------------------------------------------------- +# Structured Substack header / render helpers +# --------------------------------------------------------------------------- + + +# 23. test_build_post_header_renders_full_header +def test_build_post_header_renders_full_header(): + meta = { + "title": "My Essay", + "subtitle": "A subtitle", + "author": "Jane Doe", + "date": "2026-06-19", + "cover_image": "https://example.com/cover.png", + } + + html = ss.build_post_header(meta) + + assert '
        ' in html + assert '

        My Essay

        ' in html + assert '

        A subtitle

        ' in html + assert 'Only a title

        ' in html + assert "post-subtitle" not in html + assert "post-byline" not in html + assert "post-cover" not in html + + +# 25. test_build_post_header_empty_returns_empty +@pytest.mark.parametrize("meta", [{}, None]) +def test_build_post_header_empty_returns_empty(meta): + assert ss.build_post_header(meta) == "" + + +# 26. test_build_post_header_escapes_html +def test_build_post_header_escapes_html(): + html = ss.build_post_header({"title": ''}) + assert "