Json ld improvement #40
base: main
Conversation
Pull Request Overview
This PR significantly enhances the SEO module by adding comprehensive JSON-LD structured data support, improved robots.txt generation with AI crawler controls, and llms.txt generation for AI/LLM interactions.
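For readers unfamiliar with the format, JSON-LD structured data is a schema.org-typed JSON object embedded in the page head. A minimal illustrative sketch follows; property names come from schema.org, the values are hypothetical, and the module's actual output and field mapping may differ:

```js
// Illustrative sketch only: roughly the shape of the JSON-LD a page could emit
// for an Article. Values are hypothetical; this is not the module's exact output.
const articleJsonLd = {
  '@context': 'https://schema.org',
  '@type': 'Article',
  headline: 'Example headline',
  datePublished: '2024-01-15',
  author: {
    '@type': 'Person',
    name: 'Jane Doe'
  }
};

// Such an object is typically serialized into the page head as:
// <script type="application/ld+json">{ "@context": "https://schema.org", ... }</script>
console.log(JSON.stringify(articleJsonLd, null, 2));
```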
Key Changes
- Added 18 different JSON-LD schema types (Product, Event, JobPosting, Recipe, HowTo, etc.) with detailed field configurations
- Enhanced robots.txt with selective AI crawler controls (GPTBot, ClaudeBot, etc.)
- Implemented llms.txt generation for AI training policy management
- Added global organization settings and social profiles
- Expanded internationalization support across multiple languages
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
Summary per file:

| File | Description |
|---|---|
| modules/@apostrophecms/seo-fields-global/index.js | Added global SEO settings including AI crawler controls, organization info, social profiles, and llms.txt configuration |
| modules/@apostrophecms/seo-fields-doc-type/index.js | Added 18 schema types with conditional field groups for structured data on pages/pieces |
| lib/jsonld-schemas.js | New handler class for generating JSON-LD structured data with fallback logic and validation |
| lib/nodes.js | Enhanced head generation with theme colors, pagination, JSON-LD injection, and hreflang support |
| lib/utils.js | New utility for extracting image data from relationships |
| index.js | Added methods for custom schema registration |
| i18n/*.json | Comprehensive translations for all new fields across 6 languages |
| CHANGELOG.md | Documented new structured data feature |
boutell left a comment:
Excited for this.
I do have notes, see above.
Also: this is a lot of code running on every page. Where are the unit tests?
README.md (Outdated)

> ### ⚙️ Requires Content Structure Setup
>
> Advanced structured data types need specific fields in your content types:
Links needed to what these things are. It took me a few eyeblinks.
I added a link to the introductory sentence. Is that enough?
README.md (Outdated)

> Traditional search engines (Googlebot, Bingbot) are always allowed unless using "Block All" mode.
>
> **Technical Notes:**
> - A physical `robots.txt` file in your `public/` directory will override these settings
or sites/public and dashboard/public, for multisite
Added.
README.md (Outdated)

> **Available Modes:**
>
> 1. **Allow All (Search + AI)** - Default open access for all crawlers
> 2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
"by AI agents that choose to respect this standard" (many don't)
Added
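For context, the mechanism under discussion uses standard robots.txt `User-agent` / `Disallow` directives. The fragment below is a sketch of the convention only, not this module's verified generated output:

```
# Ask an AI training crawler to stay away (honored only by compliant bots)
User-agent: GPTBot
Disallow: /

# Leave search crawlers unrestricted
User-agent: Googlebot
Disallow:
```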
README.md (Outdated)

> 1. **Allow All (Search + AI)** - Default open access for all crawlers
> 2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
> 3. **Selective AI Crawlers** - Granular control over individual AI crawlers
"that support this standard"
Added
> 2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
> 3. **Selective AI Crawlers** - Granular control over individual AI crawlers
> 4. **Block All** - Prevents all indexing
> 5. **Custom** - Write your own robots.txt content
Warning emoji: the only thing worse than leaving your pre-launch site open to Google is forgetting to open up your launched, public site to Google. Make sure you follow up at launch time.
Added
README.md (Outdated)

> The module automatically generates `<link rel="preload">` tags for each configured font. The `crossorigin` attribute is automatically added for absolute URLs (CDN/external fonts) and omitted for relative URLs (self-hosted fonts).
So this is an alternate way to load fonts? Should you remove other frontend CSS or HTML you may have written to load these fonts or is it complementary?
Updated to clarify these points - and it is complementary.
README.md (Outdated)

> **Where to store fonts:**
> 1. **Self-hosted (recommended)**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
Why is self-hosted recommended? We sell hosting.
I would call this "simple single-server deployments"
Changed - not sure if we want to "sell" hosting here with some wording changes?
README.md (Outdated)

> - Requires CORS: `Access-Control-Allow-Origin: *`
> - Example: `{ url: 'https://cdn.yoursite.com/fonts/inter.woff2' }`
> 3. **Don't use with Google Fonts**: They have their own optimization and don't benefit from preload
Not even when you self-serve them? Google stopped recommending use of their CDN a long time ago right?
I didn't realize they discouraged using their CDN. Thanks! Changed.
> **Best practices:**
> - Only preload fonts used above the fold (typically 1-2 fonts maximum)
> - Use `woff2` format for best compression (supported by all modern browsers)
TIL
lib/jsonld-schemas.js (Outdated)

> if (process.env.APOS_SEO_DEBUG && date && fallbackSource && !fallbackSource.startsWith('schema.')) {
recommend writing a "log" function for this file that does this process.env test so it's not repeated with potential typos.
Refactored.
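A minimal sketch of what such a helper could look like (illustrative only; the names are hypothetical and the actual refactor in this PR may differ):

```js
// Hypothetical sketch: centralize the APOS_SEO_DEBUG check so call sites never
// repeat (or typo) the environment-variable test.
function debugLog(...args) {
  if (process.env.APOS_SEO_DEBUG) {
    console.warn('[seo]', ...args);
  }
}

// A former call site could then become:
// if (date && fallbackSource && !fallbackSource.startsWith('schema.')) {
//   debugLog('using fallback date source', fallbackSource);
// }
debugLog('debug logging enabled');
```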
Tests added
boutell left a comment:
This continues to be very cool.
In addition to my notes above...
Unit tests are cool, but please add some traditional functional tests that stand up actual pieces and pages and verify it works as expected when run end to end.
> 2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training by AI agents that choose to respect this standard
> 3. **Selective AI Crawlers** - Granular control over individual AI crawlers that support this standard
You provided specific choices in a dropdown. Has there been testing by us or anyone else to verify that the bots in question respect the setting? If they don't, it feels like a mistake to say "here's how you block [X]" and have it just plain not work.
See comment above. According to official docs, they respect the robots.txt, I'm not sure how I would test their real-world behavior. I modified the README to hedge a little more.
README.md (Outdated)

> 5. **Custom** - Write your own robots.txt content
>
> **Selective Mode Crawlers:**
> For fine-grained control, use Selective mode to choose specific AI crawlers:
When I wrote this comment, I thought the presence of these on the list implied they really can be blocked. If that's not known they should not be listed until it is demonstrated.
> | **AI Support** | Most AI crawlers respect robots.txt | **Most AI systems do not read llms.txt** |
> | **Example** | "Block GPTBot from accessing /api/*" | "Content may be used for search, not training" |
>
> > **⚠️ Important:** While `llms.txt` represents forward-thinking SEO strategy, it should not be relied upon for actual crawler control. Use `robots.txt` for enforceable policies. The `llms.txt` file serves as a policy statement and may gain broader adoption over time.
Even robots.txt may be ignored by some AI crawlers, especially those that falsely identify themselves as web browsers.
I noted this now.
> - **robots.txt** provides technical enforcement (works now)
> - **llms.txt** clearly communicates your policies (may work in the future)
This is a very helpful sentence.
> **Recommended approach:** Use both together:
> - **robots.txt** provides technical enforcement (works now)
respected now by many crawlers, some bad actors ignore it.
README.md (Outdated)

> For **Article** and **Recipe** schemas, you can provide author information in multiple formats. Authors can be stored either as a simple string field or as a relationship to `@apostrophecms/user`. The SEO module supports both.
I believe I flagged this as a concern: relationships to users usually won't work because users are invisible to non-admin users and the public. I would suggest allowing any relationship with the name _author and documenting it that way.
Revised:
**Resolution order:**
1. `schema.author` string (if present and non-empty)
2. `document.author` string
3. `_author` relationship: the first joined doc on `document._author`
4. `updatedBy` user on the document: `title`, then `name`
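A sketch of that resolution order as code (illustrative only; parameter and field names follow the list above, and the module's real implementation may differ):

```js
// Hypothetical sketch of the documented author resolution order.
function resolveAuthorName(schema, doc) {
  if (schema && typeof schema.author === 'string' && schema.author.trim()) {
    return schema.author;
  }
  if (typeof doc.author === 'string' && doc.author.trim()) {
    return doc.author;
  }
  if (Array.isArray(doc._author) && doc._author.length) {
    // Assumes the joined doc's title holds the display name
    return doc._author[0].title;
  }
  if (doc.updatedBy) {
    return doc.updatedBy.title || doc.updatedBy.name || null;
  }
  return null;
}
```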
README.md (Outdated)

> **Option 3: Automatic fallback:**
> If neither field is provided, the module uses the currently logged-in user's name.
Seems like it was not changed?
README.md (Outdated)

> /* Your existing CSS - keep this! */
> @font-face {
>   font-family: 'Inter';
>   src: url('/fonts/inter-variable.woff2') format('woff2');
This is going to fail when deploying to the cloud. The styles will be in S3, but the fonts will not be, unless you put them in the public/ subdir of a module, and use a /modules/... path in the CSS to access them. Per our asset docs
README.md (Outdated)

> **Where to store fonts:**
> 1. **Simple single-server deployments**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
Use the /modules/... approach
Updated.
> // helper: ensure URL is absolute
> ensureAbsoluteUrl(url, baseUrl) {
>   if (!url) {
Why not use the URL constructor with its second argument to save a lot of code?
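A sketch of the suggested simplification using the WHATWG URL constructor (illustrative; the error handling and return values here are assumptions, not the PR's actual code):

```js
// Hypothetical simplification: new URL(url, base) resolves relative URLs
// against the base and leaves absolute URLs untouched.
function ensureAbsoluteUrl(url, baseUrl) {
  if (!url) {
    return null;
  }
  try {
    return new URL(url, baseUrl).toString();
  } catch (e) {
    // Invalid input (e.g. relative path with no usable base): return as-is
    return url;
  }
}

// ensureAbsoluteUrl('/images/logo.png', 'https://example.com')
//   => 'https://example.com/images/logo.png'
// ensureAbsoluteUrl('https://cdn.example.com/a.png', 'https://example.com')
//   => 'https://cdn.example.com/a.png'
```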
> // Verify social profiles
> assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
> assert.strictEqual(orgSchema.sameAs.length, 2);
> assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));
Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test) on `https://twitter.com/examplecorp`.
> assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
> assert.strictEqual(orgSchema.sameAs.length, 2);
> assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));
> assert(orgSchema.sameAs.includes('https://linkedin.com/company/example'));
Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test) on `https://linkedin.com/company/example`.
Copilot Autofix (AI, 28 days ago):

To fix the problem, we should avoid using substring or loose matches to check the presence of expected URLs in the output array `orgSchema.sameAs`. Instead of `orgSchema.sameAs.includes(...)`, use a strict equality check on array elements, such as `.indexOf(...) !== -1` or, more idiomatically for modern JavaScript, still `includes(...)`, but only for exact matches. However, since the concern is about substring matches in one long string, we must confirm that the test checks for presence of an exact match, not a substring. If `.sameAs` is an array of full URLs, then `.includes()` with the full string as argument is sufficient (since `.includes()` on an array checks for strict equality of elements).

Therefore, the usage as stated is actually correct if `sameAs` is an array of strings and not one big concatenated string. But since CodeQL flags it, and if we want to make the intent 100% clear, we can use assertions that test exact matches to expected values; for completeness, we may want to verify the expected content and order of the array as well.

In short, check for exact matches to the expected array of URLs using deep equality where possible.

What to change:
- In `test/functional-tests.js`, in the block where `orgSchema.sameAs` is checked (lines 117–119), replace the loose `includes()` assertions with a strict assertion that the `sameAs` property equals the expected array.
- If order is not guaranteed, use array membership checks for exact strings only.
Suggested change (modified lines R118-R125):

> // Verify social profiles
> assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
> assert.strictEqual(orgSchema.sameAs.length, 2);
> assert.deepStrictEqual(
>   orgSchema.sameAs.sort(),
>   [
>     'https://linkedin.com/company/example',
>     'https://twitter.com/examplecorp'
>   ].sort(),
>   'Organization should have expected LinkedIn and Twitter URLs in sameAs array'
> );
> });
>
> it('should render homepage with minimal configuration', async function () {
> llmsTxt.includes('A test site demonstrating llms.txt'),
> 'Should include site description'
> );
> assert(llmsTxt.includes('https://example.com'), 'Should include base URL');
Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test) on `https://example.com`.
Summary
This PR adds structured data generation and improved robots.txt generation, along with llms.txt generation. The module has also been structured to allow easy addition of new schemas at the project level. The README has been updated for the new features.