
Conversation

@BoDonkey
Contributor

Summary

This PR adds structured data creation and improved robots.txt generation, along with llms.txt generation. The module has also been restructured to allow easy addition of new schemas at the project level. The README has been updated for the new features.

What are the specific steps to test this change?

  1. Add the module
  2. Add SEO info to the global config
  3. Add structured data to the page.
  4. Save, update, and do a hard reload with the cache cleared. Check the head for the correct meta tags and structured data.
  5. Run the structured data through https://search.google.com/test/rich-results to confirm the data is correctly formed.

What kind of change does this PR introduce?

(Check at least one)

  • Bug fix
  • New feature
  • Refactor
  • Documentation
  • Build-related changes
  • Other

Make sure the PR fulfills these requirements:

  • It includes a) the existing issue ID being resolved, b) a convincing reason for adding this feature, or c) a clear description of the bug it resolves
  • The changelog is updated
  • Related documentation has been updated
  • Related tests have been updated

If adding a new feature without an already open issue, it's best to open a feature request issue first and wait for approval before working on it.

Other information:

Copilot AI left a comment

Pull Request Overview

This PR significantly enhances the SEO module by adding comprehensive JSON-LD structured data support, improved robots.txt generation with AI crawler controls, and llms.txt generation for AI/LLM interactions.

Key Changes

  • Added 18 different JSON-LD schema types (Product, Event, JobPosting, Recipe, HowTo, etc.) with detailed field configurations
  • Enhanced robots.txt with selective AI crawler controls (GPTBot, ClaudeBot, etc.)
  • Implemented llms.txt generation for AI training policy management
  • Added global organization settings and social profiles
  • Expanded internationalization support across multiple languages
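
For context, structured data of this kind is emitted as a JSON-LD script tag in the page head; a minimal Organization example (values illustrative, exact output shape assumed) might look like:

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Corp",
  "url": "https://example.com",
  "sameAs": [
    "https://twitter.com/examplecorp",
    "https://linkedin.com/company/example"
  ]
}
</script>
```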

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| modules/@apostrophecms/seo-fields-global/index.js | Added global SEO settings including AI crawler controls, organization info, social profiles, and llms.txt configuration |
| modules/@apostrophecms/seo-fields-doc-type/index.js | Added 18 schema types with conditional field groups for structured data on pages/pieces |
| lib/jsonld-schemas.js | New handler class for generating JSON-LD structured data with fallback logic and validation |
| lib/nodes.js | Enhanced head generation with theme colors, pagination, JSON-LD injection, and hreflang support |
| lib/utils.js | New utility for extracting image data from relationships |
| index.js | Added methods for custom schema registration |
| i18n/*.json | Comprehensive translations for all new fields across 6 languages |
| CHANGELOG.md | Documented new structured data feature |


@BoDonkey requested a review from boutell on November 14, 2025 at 16:08
@boutell (Member) left a comment

Excited for this.

I do have notes, see above.

Also: this is a lot of code running on every page. Where are the unit tests?

README.md Outdated

### ⚙️ Requires Content Structure Setup

Advanced structured data types need specific fields in your content types:
Member

Links needed to what these things are. It took me a few eyeblinks.

Contributor Author

I added a link to the introductory sentence. Is that enough?

README.md Outdated
Traditional search engines (Googlebot, Bingbot) are always allowed unless using "Block All" mode.

**Technical Notes:**
- A physical `robots.txt` file in your `public/` directory will override these settings
Member

or sites/public and dashboard/public, for multisite

Contributor Author

Added.

README.md Outdated
**Available Modes:**

1. **Allow All (Search + AI)** - Default open access for all crawlers
2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
Member

"by AI agents that choose to respect this standard" (many don't)

Contributor Author

Added

README.md Outdated

1. **Allow All (Search + AI)** - Default open access for all crawlers
2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
3. **Selective AI Crawlers** - Granular control over individual AI crawlers
Member

"that support this standard"

Contributor Author

Added

2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
3. **Selective AI Crawlers** - Granular control over individual AI crawlers
4. **Block All** - Prevents all indexing
5. **Custom** - Write your own robots.txt content
Member

Warning emoji: the only thing worse than leaving your pre-launch site open to Google is forgetting to open up your launched, public site to Google. Make sure you follow up at launch time.

Contributor Author

Added
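
For reference, a hypothetical robots.txt for the "Allow Search, Block AI Training" mode discussed above might look like this (the actual crawler list and directives generated by the module may differ):

```
# AI training crawlers blocked (honored only by bots that respect robots.txt)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Traditional search engines remain allowed
User-agent: *
Allow: /
```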

README.md Outdated
});
```
The module automatically generates `<link rel="preload">` tags for each configured font. The `crossorigin` attribute is automatically added for absolute URLs (CDN/external fonts) and omitted for relative URLs (self-hosted fonts).
Member

So this is an alternate way to load fonts? Should you remove other frontend CSS or HTML you may have written to load these fonts or is it complementary?

Contributor Author

Updated to clarify these points - and it is complementary.
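
For reference, the preload tags described in the excerpt above would look roughly like this (paths illustrative; attributes other than `crossorigin` are assumptions about the generated markup):

```
<!-- Relative URL (self-hosted font): crossorigin omitted -->
<link rel="preload" href="/fonts/inter-variable.woff2" as="font" type="font/woff2">

<!-- Absolute URL (CDN font): crossorigin added -->
<link rel="preload" href="https://cdn.yoursite.com/fonts/inter.woff2" as="font" type="font/woff2" crossorigin>
```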

README.md Outdated
**Where to store fonts:**
1. **Self-hosted (recommended)**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
Member

Why is self-hosted recommended? We sell hosting.

Member

I would call this "simple single-server deployments"

Contributor Author

Changed - not sure if we want to "sell" hosting here with some wording changes?

README.md Outdated
- Requires CORS: `Access-Control-Allow-Origin: *`
- Example: `{ url: 'https://cdn.yoursite.com/fonts/inter.woff2' }`
3. **Don't use with Google Fonts**: They have their own optimization and don't benefit from preload
Member

Not even when you self-serve them? Google stopped recommending use of their CDN a long time ago right?

Contributor Author

I didn't realize they discouraged using their CDN. Thanks! Changed.

**Best practices:**
- Only preload fonts used above the fold (typically 1-2 fonts maximum)
- Use `woff2` format for best compression (supported by all modern browsers)
Member

TIL

}
}

if (process.env.APOS_SEO_DEBUG && date && fallbackSource && !fallbackSource.startsWith('schema.')) {
Member

recommend writing a "log" function for this file that does this process.env test so it's not repeated with potential typos.

Contributor Author

Refactored.
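
A minimal sketch of the kind of helper suggested above, assuming the file keeps keying off `APOS_SEO_DEBUG` (the helper name and message format are hypothetical):

```
// Hypothetical debug logger: the process.env check lives in one place,
// so individual call sites cannot typo the environment variable name.
function debug(...args) {
  if (process.env.APOS_SEO_DEBUG) {
    console.log('[seo]', ...args);
  }
}

// A call site like the one quoted above might then become:
// debug('date fallback used:', fallbackSource);
```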

@BoDonkey
Contributor Author

> Excited for this.
>
> I do have notes, see above.
>
> Also: this is a lot of code running on every page. Where are the unit tests?

Tests added.

@BoDonkey requested a review from boutell on November 19, 2025 at 19:52
@boutell (Member) left a comment

This continues to be very cool

In addition to my notes above...

Unit tests are cool but please add some traditional functional tests that stand up actual pieces and pages and verify it works as expected when run end to end.

2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
3. **Selective AI Crawlers** - Granular control over individual AI crawlers
2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training by AI agents that choose to respect this standard
3. **Selective AI Crawlers** - Granular control over individual AI crawlers that support this standard
Member

You provided specific choices in a dropdown. Has there been testing by us or anyone else to verify that the bots in question respect the setting? If they don't, it feels like a mistake to say "here's how you block [X]" and have it just plain not work.

Contributor Author

See comment above. According to the official docs, they respect robots.txt; I'm not sure how I would test their real-world behavior. I modified the README to hedge a little more.

README.md Outdated
5. **Custom** - Write your own robots.txt content

**Selective Mode Crawlers:**
For fine-grained control, use Selective mode to choose specific AI crawlers:
Member

When I wrote this comment, I thought the presence of these on the list implied they really can be blocked. If that's not known they should not be listed until it is demonstrated.

| **AI Support** | Most AI crawlers respect robots.txt | **Most AI systems do not read llms.txt** |
| **Example** | "Block GPTBot from accessing /api/*" | "Content may be used for search, not training" |

> **⚠️ Important:** While `llms.txt` represents forward-thinking SEO strategy, it should not be relied upon for actual crawler control. Use `robots.txt` for enforceable policies. The `llms.txt` file serves as a policy statement and may gain broader adoption over time.
Member

Even robots.txt may be ignored by some AI crawlers, especially those that falsely identify themselves as web browsers.

Contributor Author

I noted this now.

- **robots.txt** provides technical enforcement
- **llms.txt** clearly communicates your policies to compliant AI systems
- **robots.txt** provides technical enforcement (works now)
- **llms.txt** clearly communicates your policies (may work in the future)
Member

This is a very helpful sentence.

**Recommended approach:** Use both together:
- **robots.txt** provides technical enforcement
- **llms.txt** clearly communicates your policies to compliant AI systems
- **robots.txt** provides technical enforcement (works now)
Member

respected now by many crawlers, some bad actors ignore it.
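
For reference, the generated llms.txt is a plain-text policy statement rather than an enforcement mechanism; a hypothetical example (format and wording assumed, echoing the site description and base URL that the tests later in this thread check for) might be:

```
# Example Corp
> A test site demonstrating llms.txt

Site: https://example.com

## AI training policy
Content may be used for search indexing, but not for AI training.
```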

README.md Outdated
}
```
For **Article** and **Recipe** schemas, you can provide author information in multiple formats. Authors can be stored either as a simple string field or as a relationship to
`@apostrophecms/user`. The SEO module supports both.
Member

I believe I flagged this as a concern: relationships to users usually won't work because users are invisible to non-admin users and the public. I would suggest allowing any relationship with the name _author and documenting it that way.

Contributor Author

Revised:

**Resolution order:**

1. `schema.author` string (if present and non-empty)
2. `document.author` string
3. `_author` relationship: the first joined doc on `document._author`
4. `updatedBy` user on the document: `title`, then `name`
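
A rough sketch of that resolution order (helper name and exact property access are illustrative, not the module's literal implementation):

```
// Illustrative fallback chain for the documented author resolution order
function resolveAuthorName(schema, doc) {
  if (typeof schema.author === 'string' && schema.author.trim()) {
    return schema.author;
  }
  if (typeof doc.author === 'string' && doc.author.trim()) {
    return doc.author;
  }
  if (Array.isArray(doc._author) && doc._author.length) {
    // First joined doc on the _author relationship
    return doc._author[0].title;
  }
  const user = doc.updatedBy;
  return (user && (user.title || user.name)) || null;
}
```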

README.md Outdated
```

**Option 3: Automatic fallback:**
If neither field is provided, the module uses the currently logged-in user's name.
Member

Seems like it was not changed?

README.md Outdated
/* Your existing CSS - keep this! */
@font-face {
font-family: 'Inter';
src: url('/fonts/inter-variable.woff2') format('woff2');
Member

This is going to fail when deploying to the cloud. The styles will be in S3, but the fonts will not be, unless you put them in the public/ subdir of a module, and use a /modules/... path in the CSS to access them. Per our asset docs

README.md Outdated
**Where to store fonts:**
1. **Self-hosted (recommended)**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
1. **Simple single-server deployments**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
Member

Use the /modules/... approach

Contributor Author

Updated.
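
A sketch of the `/modules/...` approach mentioned above, assuming the font files live in an asset module's `public/fonts/` directory (the module name and resulting URL are illustrative):

```
/* Fonts shipped in e.g. modules/asset/public/fonts/ are served from a
   /modules/... URL, which also resolves when assets are deployed to S3 */
@font-face {
  font-family: 'Inter';
  src: url('/modules/asset/fonts/inter-variable.woff2') format('woff2');
}
```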


// helper: ensure URL is absolute
ensureAbsoluteUrl(url, baseUrl) {
if (!url) {
Member

Why not use the URL constructor with its second argument to save a lot of code?
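
For illustration, the URL-constructor approach suggested above could shrink the helper to roughly this (a sketch only, not the final implementation):

```
// Sketch: URL's second argument resolves relative URLs against baseUrl
// and leaves already-absolute URLs unchanged.
ensureAbsoluteUrl(url, baseUrl) {
  if (!url) {
    return null;
  }
  try {
    return new URL(url, baseUrl).toString();
  } catch (e) {
    return null;
  }
}
```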

@BoDonkey requested a review from boutell on December 8, 2025 at 15:54
// Verify social profiles
assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
assert.strictEqual(orgSchema.sameAs.length, 2);
assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test)

'https://twitter.com/examplecorp' can be anywhere in the URL, and arbitrary hosts may come before or after it.

assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
assert.strictEqual(orgSchema.sameAs.length, 2);
assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));
assert(orgSchema.sameAs.includes('https://linkedin.com/company/example'));

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test)

'https://linkedin.com/company/example' can be anywhere in the URL, and arbitrary hosts may come before or after it.

Copilot Autofix (AI):

To fix the problem, we should avoid using substring or loose matches to check the presence of expected URLs in the output array orgSchema.sameAs. Instead of orgSchema.sameAs.includes(...), use a strict equality check on array elements, such as .indexOf(...) !== -1 or, more idiomatically for modern JavaScript, still includes(...), but only for exact matches. However, since the concern is about substring matches in one long string, we must confirm that the test checks for presence of an exact match, not a substring. If .sameAs is an array of full URLs, then .includes() with the full string as argument is sufficient (since .includes() on array checks for strict equality of elements).

Therefore, the usage as stated is actually correct if sameAs is an array of strings, and not a big concatenated string. But since CodeQL flags it, and if we want to make the intent 100% clear, we can use assertions that test exact matches to expected values — and, for completeness, we may want to verify the expected content and order of the array as well.

In short, check for exact matches to the expected array of URLs using deep equality where possible.

What to change

  • In file test/functional-tests.js, in the block where orgSchema.sameAs is checked (lines 117–119), replace loose includes() assertions with strict assertions that the sameAs property equals the expected array.
  • If order is not guaranteed, use array membership checks for exact strings only.

Suggested changeset 1
test/functional-tests.js

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/test/functional-tests.js b/test/functional-tests.js
--- a/test/functional-tests.js
+++ b/test/functional-tests.js
@@ -115,8 +115,14 @@
       // Verify social profiles
       assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
       assert.strictEqual(orgSchema.sameAs.length, 2);
-      assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));
-      assert(orgSchema.sameAs.includes('https://linkedin.com/company/example'));
+      assert.deepStrictEqual(
+        orgSchema.sameAs.sort(),
+        [
+          'https://linkedin.com/company/example',
+          'https://twitter.com/examplecorp'
+        ].sort(),
+        'Organization should have expected LinkedIn and Twitter URLs in sameAs array'
+      );
     });
 
     it('should render homepage with minimal configuration', async function () {
EOF
llmsTxt.includes('A test site demonstrating llms.txt'),
'Should include site description'
);
assert(llmsTxt.includes('https://example.com'), 'Should include base URL');

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test)

'https://example.com' can be anywhere in the URL, and arbitrary hosts may come before or after it.
