
Conversation

@BoDonkey
Contributor

Summary

This PR adds structured data creation and improved robots.txt generation, along with llms.txt generation. The module has also been restructured to allow easy addition of new schemas at the project level. The README has been updated for the new features.

What are the specific steps to test this change?

  1. Add the module
  2. Add SEO info to the global config
  3. Add structured data to the page.
  4. Save, update, and do a hard reload with the cache cleared. Check the head for the correct meta tags and structured data.
  5. Run the structured data through https://search.google.com/test/rich-results to confirm the data is correctly formed.

What kind of change does this PR introduce?

(Check at least one)

  • Bug fix
  • New feature
  • Refactor
  • Documentation
  • Build-related changes
  • Other

Make sure the PR fulfills these requirements:

  • It includes a) the existing issue ID being resolved, b) a convincing reason for adding this feature, or c) a clear description of the bug it resolves
  • The changelog is updated
  • Related documentation has been updated
  • Related tests have been updated

If adding a new feature without an already open issue, it's best to open a feature request issue first and wait for approval before working on it.

Other information:

Copilot AI left a comment

Pull Request Overview

This PR significantly enhances the SEO module by adding comprehensive JSON-LD structured data support, improved robots.txt generation with AI crawler controls, and llms.txt generation for AI/LLM interactions.

Key Changes

  • Added 18 different JSON-LD schema types (Product, Event, JobPosting, Recipe, HowTo, etc.) with detailed field configurations
  • Enhanced robots.txt with selective AI crawler controls (GPTBot, ClaudeBot, etc.)
  • Implemented llms.txt generation for AI training policy management
  • Added global organization settings and social profiles
  • Expanded internationalization support across multiple languages
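
For context, structured data of this kind is emitted as a JSON-LD script tag in the page head; a minimal Organization example (values illustrative, exact output shape assumed) might look like:

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Corp",
  "url": "https://example.com",
  "sameAs": [
    "https://twitter.com/examplecorp",
    "https://linkedin.com/company/example"
  ]
}
</script>
```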

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| modules/@apostrophecms/seo-fields-global/index.js | Added global SEO settings including AI crawler controls, organization info, social profiles, and llms.txt configuration |
| modules/@apostrophecms/seo-fields-doc-type/index.js | Added 18 schema types with conditional field groups for structured data on pages/pieces |
| lib/jsonld-schemas.js | New handler class for generating JSON-LD structured data with fallback logic and validation |
| lib/nodes.js | Enhanced head generation with theme colors, pagination, JSON-LD injection, and hreflang support |
| lib/utils.js | New utility for extracting image data from relationships |
| index.js | Added methods for custom schema registration |
| i18n/*.json | Comprehensive translations for all new fields across 6 languages |
| CHANGELOG.md | Documented new structured data feature |


@BoDonkey requested a review from boutell on November 14, 2025 at 16:08
@boutell (Member) left a comment

Excited for this.

I do have notes, see above.

Also: this is a lot of code running on every page. Where are the unit tests?

README.md Outdated

### ⚙️ Requires Content Structure Setup

Advanced structured data types need specific fields in your content types:
Member

Links needed to what these things are. It took me a few eyeblinks.

Contributor Author

I added a link to the introductory sentence. Is that enough?

README.md Outdated
Traditional search engines (Googlebot, Bingbot) are always allowed unless using "Block All" mode.

**Technical Notes:**
- A physical `robots.txt` file in your `public/` directory will override these settings
Member

or sites/public and dashboard/public, for multisite

Contributor Author

Added.

README.md Outdated
**Available Modes:**

1. **Allow All (Search + AI)** - Default open access for all crawlers
2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
Member

"by AI agents that choose to respect this standard" (many don't)

Contributor Author

Added

README.md Outdated

1. **Allow All (Search + AI)** - Default open access for all crawlers
2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
3. **Selective AI Crawlers** - Granular control over individual AI crawlers
Member

"that support this standard"

Contributor Author

Added

2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
3. **Selective AI Crawlers** - Granular control over individual AI crawlers
4. **Block All** - Prevents all indexing
5. **Custom** - Write your own robots.txt content
Member

Warning emoji: the only thing worse than leaving your pre-launch site open to Google is forgetting to open up your launched, public site to Google. Make sure you follow up at launch time.

Contributor Author

Added
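
For reference, a hypothetical robots.txt for the "Allow Search, Block AI Training" mode discussed above might look like this (the actual crawler list and directives generated by the module may differ):

```
# AI training crawlers blocked (honored only by bots that respect robots.txt)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Traditional search engines remain allowed
User-agent: *
Allow: /
```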

README.md Outdated
});
```
The module automatically generates `<link rel="preload">` tags for each configured font. The `crossorigin` attribute is automatically added for absolute URLs (CDN/external fonts) and omitted for relative URLs (self-hosted fonts).
Member

So this is an alternate way to load fonts? Should you remove other frontend CSS or HTML you may have written to load these fonts or is it complementary?

Contributor Author

Updated to clarify these points - and it is complementary.
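
For reference, the preload tags described in the excerpt above would look roughly like this (paths illustrative; attributes other than `crossorigin` are assumptions about the generated markup):

```
<!-- Relative URL (self-hosted font): crossorigin omitted -->
<link rel="preload" href="/fonts/inter-variable.woff2" as="font" type="font/woff2">

<!-- Absolute URL (CDN font): crossorigin added -->
<link rel="preload" href="https://cdn.yoursite.com/fonts/inter.woff2" as="font" type="font/woff2" crossorigin>
```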

README.md Outdated
**Where to store fonts:**
1. **Self-hosted (recommended)**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
Member

Why is self-hosted recommended? We sell hosting.

Member

I would call this "simple single-server deployments"

Contributor Author

Changed - not sure if we want to "sell" hosting here with some wording changes?

README.md Outdated
- Requires CORS: `Access-Control-Allow-Origin: *`
- Example: `{ url: 'https://cdn.yoursite.com/fonts/inter.woff2' }`
3. **Don't use with Google Fonts**: They have their own optimization and don't benefit from preload
Member

Not even when you self-serve them? Google stopped recommending use of their CDN a long time ago right?

Contributor Author

I didn't realize they discouraged using their CDN. Thanks! Changed.

**Best practices:**
- Only preload fonts used above the fold (typically 1-2 fonts maximum)
- Use `woff2` format for best compression (supported by all modern browsers)
Member

TIL

}
}

if (process.env.APOS_SEO_DEBUG && date && fallbackSource && !fallbackSource.startsWith('schema.')) {
Member

recommend writing a "log" function for this file that does this process.env test so it's not repeated with potential typos.

Contributor Author

Refactored.
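
A minimal sketch of the kind of helper suggested above, assuming the file keeps keying off `APOS_SEO_DEBUG` (the helper name and message format are hypothetical):

```
// Hypothetical debug logger: the process.env check lives in one place,
// so individual call sites cannot typo the environment variable name.
function debug(...args) {
  if (process.env.APOS_SEO_DEBUG) {
    console.log('[seo]', ...args);
  }
}

// A call site like the one quoted above might then become:
// debug('date fallback used:', fallbackSource);
```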

@BoDonkey
Contributor Author

> Excited for this.
>
> I do have notes, see above.
>
> Also: this is a lot of code running on every page. Where are the unit tests?

Tests added.

@BoDonkey requested a review from boutell on November 19, 2025 at 19:52
@boutell (Member) left a comment

This continues to be very cool

In addition to my notes above...

Unit tests are cool but please add some traditional functional tests that stand up actual pieces and pages and verify it works as expected when run end to end.

2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training
3. **Selective AI Crawlers** - Granular control over individual AI crawlers
2. **Allow Search, Block AI Training** - Maintains search rankings while protecting content from AI training by AI agents that choose to respect this standard
3. **Selective AI Crawlers** - Granular control over individual AI crawlers that support this standard
Member

You provided specific choices in a dropdown. Has there been testing by us or anyone else to verify that the bots in question respect the setting? If they don't, it feels like a mistake to say "here's how you block [X]" and have it just plain not work.

Contributor Author

See comment above. According to the official docs, they respect robots.txt; I'm not sure how I would test their real-world behavior. I modified the README to hedge a little more.

README.md Outdated
5. **Custom** - Write your own robots.txt content

**Selective Mode Crawlers:**
For fine-grained control, use Selective mode to choose specific AI crawlers:
Member

When I wrote this comment, I thought the presence of these on the list implied they really can be blocked. If that's not known they should not be listed until it is demonstrated.

| **AI Support** | Most AI crawlers respect robots.txt | **Most AI systems do not read llms.txt** |
| **Example** | "Block GPTBot from accessing /api/*" | "Content may be used for search, not training" |

> **⚠️ Important:** While `llms.txt` represents forward-thinking SEO strategy, it should not be relied upon for actual crawler control. Use `robots.txt` for enforceable policies. The `llms.txt` file serves as a policy statement and may gain broader adoption over time.
Member

Even robots.txt may be ignored by some AI crawlers, especially those that falsely identify themselves as web browsers.

Contributor Author

I noted this now.

- **robots.txt** provides technical enforcement
- **llms.txt** clearly communicates your policies to compliant AI systems
- **robots.txt** provides technical enforcement (works now)
- **llms.txt** clearly communicates your policies (may work in the future)
Member

This is a very helpful sentence.

**Recommended approach:** Use both together:
- **robots.txt** provides technical enforcement
- **llms.txt** clearly communicates your policies to compliant AI systems
- **robots.txt** provides technical enforcement (works now)
Member

respected now by many crawlers, some bad actors ignore it.
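
For reference, the generated llms.txt is a plain-text policy statement rather than an enforcement mechanism; a hypothetical example (format and wording assumed, echoing the site description and base URL that the tests later in this thread check for) might be:

```
# Example Corp
> A test site demonstrating llms.txt

Site: https://example.com

## AI training policy
Content may be used for search indexing, but not for AI training.
```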

README.md Outdated
}
```
For **Article** and **Recipe** schemas, you can provide author information in multiple formats. Authors can be stored either as a simple string field or as a relationship to
`@apostrophecms/user`. The SEO module supports both.
Member

I believe I flagged this as a concern: relationships to users usually won't work because users are invisible to non-admin users and the public. I would suggest allowing any relationship with the name _author and documenting it that way.

Contributor Author

Revised:

**Resolution order:**

1. `schema.author` string (if present and non-empty)
2. `document.author` string
3. `_author` relationship: the first joined doc on `document._author`
4. `updatedBy` user on the document: `title`, then `name`
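
A rough sketch of that resolution order (helper name and exact property access are illustrative, not the module's literal implementation):

```
// Illustrative fallback chain for the documented author resolution order
function resolveAuthorName(schema, doc) {
  if (typeof schema.author === 'string' && schema.author.trim()) {
    return schema.author;
  }
  if (typeof doc.author === 'string' && doc.author.trim()) {
    return doc.author;
  }
  if (Array.isArray(doc._author) && doc._author.length) {
    // First joined doc on the _author relationship
    return doc._author[0].title;
  }
  const user = doc.updatedBy;
  return (user && (user.title || user.name)) || null;
}
```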

README.md Outdated
```

**Option 3: Automatic fallback:**
If neither field is provided, the module uses the currently logged-in user's name.
Member

Seems like it was not changed?

README.md Outdated
/* Your existing CSS - keep this! */
@font-face {
font-family: 'Inter';
src: url('/fonts/inter-variable.woff2') format('woff2');
Member

This is going to fail when deploying to the cloud. The styles will be in S3, but the fonts will not be, unless you put them in the public/ subdir of a module, and use a /modules/... path in the CSS to access them. Per our asset docs

README.md Outdated
**Where to store fonts:**
1. **Self-hosted (recommended)**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
1. **Simple single-server deployments**: Place font files in `public/fonts/` and reference as `/fonts/filename.woff2`
Member

Use the /modules/... approach

Contributor Author

Updated.
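
A sketch of the `/modules/...` approach mentioned above, assuming the font files live in an asset module's `public/fonts/` directory (the module name and resulting URL are illustrative):

```
/* Fonts shipped in e.g. modules/asset/public/fonts/ are served from a
   /modules/... URL, which also resolves when assets are deployed to S3 */
@font-face {
  font-family: 'Inter';
  src: url('/modules/asset/fonts/inter-variable.woff2') format('woff2');
}
```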


// helper: ensure URL is absolute
ensureAbsoluteUrl(url, baseUrl) {
if (!url) {
Member

Why not use the URL constructor with its second argument to save a lot of code?
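
For illustration, the URL-constructor approach suggested above could shrink the helper to roughly this (a sketch only, not the final implementation):

```
// Sketch: URL's second argument resolves relative URLs against baseUrl
// and leaves already-absolute URLs unchanged.
ensureAbsoluteUrl(url, baseUrl) {
  if (!url) {
    return null;
  }
  try {
    return new URL(url, baseUrl).toString();
  } catch (e) {
    return null;
  }
}
```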

@BoDonkey requested a review from boutell on December 8, 2025 at 15:54
// Verify social profiles
assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
assert.strictEqual(orgSchema.sameAs.length, 2);
assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test)

'https://twitter.com/examplecorp' can be anywhere in the URL, and arbitrary hosts may come before or after it.

assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
assert.strictEqual(orgSchema.sameAs.length, 2);
assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));
assert(orgSchema.sameAs.includes('https://linkedin.com/company/example'));

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test)

'https://linkedin.com/company/example' can be anywhere in the URL, and arbitrary hosts may come before or after it.

Copilot Autofix (AI):

To fix the problem, we should avoid using substring or loose matches to check the presence of expected URLs in the output array orgSchema.sameAs. Instead of orgSchema.sameAs.includes(...), use a strict equality check on array elements, such as .indexOf(...) !== -1 or, more idiomatically for modern JavaScript, still includes(...), but only for exact matches. However, since the concern is about substring matches in one long string, we must confirm that the test checks for presence of an exact match, not a substring. If .sameAs is an array of full URLs, then .includes() with the full string as argument is sufficient (since .includes() on array checks for strict equality of elements).

Therefore, the usage as stated is actually correct if sameAs is an array of strings, and not a big concatenated string. But since CodeQL flags it, and if we want to make the intent 100% clear, we can use assertions that test exact matches to expected values — and, for completeness, we may want to verify the expected content and order of the array as well.

In short, check for exact matches to the expected array of URLs using deep equality where possible.

What to change

  • In file test/functional-tests.js, in the block where orgSchema.sameAs is checked (lines 117–119), replace loose includes() assertions with strict assertions that the sameAs property equals the expected array.
  • If order is not guaranteed, use array membership checks for exact strings only.

Suggested changeset 1
test/functional-tests.js

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/test/functional-tests.js b/test/functional-tests.js
--- a/test/functional-tests.js
+++ b/test/functional-tests.js
@@ -115,8 +115,14 @@
       // Verify social profiles
       assert(Array.isArray(orgSchema.sameAs), 'Organization should have sameAs array');
       assert.strictEqual(orgSchema.sameAs.length, 2);
-      assert(orgSchema.sameAs.includes('https://twitter.com/examplecorp'));
-      assert(orgSchema.sameAs.includes('https://linkedin.com/company/example'));
+      assert.deepStrictEqual(
+        orgSchema.sameAs.sort(),
+        [
+          'https://linkedin.com/company/example',
+          'https://twitter.com/examplecorp'
+        ].sort(),
+        'Organization should have expected LinkedIn and Twitter URLs in sameAs array'
+      );
     });
 
     it('should render homepage with minimal configuration', async function () {
EOF
llmsTxt.includes('A test site demonstrating llms.txt'),
'Should include site description'
);
assert(llmsTxt.includes('https://example.com'), 'Should include base URL');

Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization (High, test)

'https://example.com' can be anywhere in the URL, and arbitrary hosts may come before or after it.
