feat(002): add sitemap generation feature

- Refactor kmeContentSourceAdapter.js into getValidToken(), oidcAuthFlow(), and sitemapFlow(); add sitemap generation using hydra:member response structure
- Add searchApiBaseUrl, tenant, proxyBaseUrl fields to kme_CSA_settings.json and kme_CSA_settings.json.example
- Add 17 unit tests for sitemap flow and non-sitemap routing regression
- Add 5 contract tests for sitemap endpoint (proxy-http.test.js)
- Add [Unreleased] sitemap entry to CHANGELOG.md
- Add full specs/002-sitemap-generation/ artifact directory (spec, plan, tasks, data-model, contracts, research, quickstart, checklist)
- Update constitution.md: add redis as permitted global, refresh kme_CSA_settings references
- Update copilot-instructions.md SPECKIT marker to sitemap plan
2026-04-22 22:08:08 -05:00
parent 49a6b2e4e7
commit 50b87297d2
17 changed files with 1879 additions and 40 deletions
--- a/specs/002-sitemap-generation/research.md
+++ b/specs/002-sitemap-generation/research.md
@@ -0,0 +1,190 @@
+# Research: Sitemap XML Generation
+
+**Feature**: `002-sitemap-generation`
+**Branch**: `002-sitemap-generation`
+**Date**: 2025-07-14
+
+---
+
+## R-001: Token Reuse — OIDC Cache Pattern
+
+**Decision**: Reuse `redis.hGet('authorization', 'token')` / `redis.hGet('authorization', 'expiry')`
+and the existing stampede-guard / token-refresh flow verbatim.
+
+**Rationale**: The existing `kmeContentSourceAdapter.js` already implements a correct, battle-tested
+pattern for obtaining a valid OIDC `id_token` from Redis and refreshing it when expired. Duplicating
+only the cache-read portion (steps 1–3 of the existing flow) would create divergence. Routing the
+sitemap flow through the same complete sequence, extracted into one shared helper, avoids that risk
+while reusing the security invariants already proven in production.
+
+**Approach in code**: Refactor the top-level IIFE so that:
+1. URL routing check happens **first** (before any async work).
+2. For sitemap requests, a shared `getValidToken()` helper (inlined in the script, no imports)
+   performs the identical cache-hit → stampede-guard → refresh → cache-write sequence.
+3. For all other requests, the existing flow runs unchanged.
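The cache-hit → stampede-guard → refresh → cache-write sequence named in step 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the adapter's actual code: the `refreshToken` callback, the in-process `inFlightRefresh` promise used as the stampede guard, the epoch-millisecond `expiry` format, and the `createFakeRedis` stand-in are all hypothetical. Only the `redis.hGet('authorization', …)` keys come from the decision above.

```javascript
// Hypothetical sketch: `redis` only needs hGet/hSet; `refreshToken` stands in
// for the existing OIDC refresh call and must resolve to { token, expiresAtMs }.
let inFlightRefresh = null; // assumed in-process stampede guard

async function getValidToken(redis, refreshToken, now = Date.now()) {
  // 1. Cache hit: return the stored token while it is still valid.
  //    Assumes 'expiry' holds an epoch timestamp in milliseconds.
  const token = await redis.hGet('authorization', 'token');
  const expiry = Number(await redis.hGet('authorization', 'expiry'));
  if (token && expiry > now) return token;

  // 2. Stampede guard: concurrent callers share a single refresh promise.
  if (!inFlightRefresh) {
    inFlightRefresh = (async () => {
      try {
        // 3. Refresh, then 4. write the new token back to the cache.
        const fresh = await refreshToken();
        await redis.hSet('authorization', 'token', fresh.token);
        await redis.hSet('authorization', 'expiry', String(fresh.expiresAtMs));
        return fresh.token;
      } finally {
        inFlightRefresh = null; // allow a later refresh after success or failure
      }
    })();
  }
  return inFlightRefresh;
}

// In-memory stand-in for Redis, for illustration and tests only.
function createFakeRedis() {
  const hashes = new Map();
  return {
    async hGet(key, field) {
      return hashes.get(key)?.get(field) ?? null;
    },
    async hSet(key, field, value) {
      if (!hashes.has(key)) hashes.set(key, new Map());
      hashes.get(key).set(field, value);
    },
  };
}
```

Passing `redis` and `refreshToken` as parameters keeps the sketch testable in isolation; inside the adapter's VM context the same logic would read the injected globals directly.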
+
+**Alternatives considered**:
+- Call the existing OIDC logic unconditionally, then branch: rejected because the full OIDC flow
+  would execute before the route is known, adding latency from steps the sitemap path does not need.
+- Separate helper file: rejected by the monolithic architecture constraint (Section I, constitution).
+
+---
+
+## R-002: KME Knowledge Search Service API — Response Envelope
+
+**Decision**: Assume the response body is a JSON object with a top-level `items` array. Each element
+of `items` is an object whose `vkm:url` property holds the canonical document URL.
+
+**Rationale**: The feature spec states:
+> "The `vkm:url` field is present at the top level of each item object in the search results
+> array; the exact response envelope shape will be confirmed against the live API during
+> implementation."
+
+The most common shape for knowledge/search services is `{ items: [ { "vkm:url": "...", ... } ] }`.
+This assumption allows the code to be written and fully unit-tested before live-API access is
+available. Because extraction happens on a single line (`response.data.items ?? response.data`),
+adapting to the real shape is a one-line change.
+
+**Concrete assumption**:
+```json
+{
+  "items": [
+    { "vkm:url": "https://kme.example.com/knowledge/doc-1", "title": "…" },
+    { "vkm:url": "https://kme.example.com/knowledge/doc-2", "title": "…" }
+  ]
+}
+```
+
+**Verification required**: During implementation, run the live API call against
+`<searchApiBaseUrl>/<tenant>` and confirm:
+1. Which top-level key holds the array (likely `items` or `results`), or whether the root is itself
+   an array.
+2. That `vkm:url` is a string property of each item, not nested deeper.
+
+**Fallback**: If the root is a bare array, `response.data` itself is used as the items array.
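As a sketch of the extraction and fallback described above, the helper below wraps the `items ?? data` pattern in a function. The name `extractItems` and the final `Array.isArray` guard are assumptions added for illustration; the source only commits to `response.data.items ?? response.data`.

```javascript
// Hypothetical helper: normalise the search response body into an array of items.
function extractItems(data) {
  // Use `data.items` when the envelope matches the assumption; otherwise fall
  // back to the body itself, which covers a bare-array root.
  const candidate = data?.items ?? data;
  // Assumed extra guard: anything that is not an array yields no items.
  return Array.isArray(candidate) ? candidate : [];
}

const fromEnvelope = extractItems({
  items: [{ 'vkm:url': 'https://kme.example.com/knowledge/doc-1' }],
});
const fromBareArray = extractItems([
  { 'vkm:url': 'https://kme.example.com/knowledge/doc-2' },
]);
console.log(fromEnvelope.length, fromBareArray.length); // prints "1 1"
```

If the live API turns out to use a different key, only the `data?.items` expression would need to change.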
+
+**Alternatives considered**:
+- `results` key: equally plausible; the code will use `response.data.items ?? response.data` as a
+  defensive pattern until confirmed.
+- Deeply nested: no evidence for this; rejected pending confirmation.
+
+---
+
+## R-003: xmlbuilder2 `create()` API for Sitemap XML
+
+**Decision**: Use the `xmlBuilder` context variable (which is `xmlbuilder2`'s `create` function)
+with the following call chain:
+
+```javascript
+const doc = xmlBuilder({ version: '1.0', encoding: 'UTF-8' });
+const urlset = doc.ele('urlset', { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' });
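+for (const item of items) {
+  // locValue is the proxied URL for this item (construction described in R-006)
+  const locValue = `${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}`;
+  urlset.ele('url').ele('loc').txt(locValue).up().up();
+}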
+const xml = doc.end({ prettyPrint: false });
+```
+
+**Rationale**: `xmlbuilder2` v4.x `create()` returns an `XMLBuilder` document node. Calling `.ele()`
+on it creates the root element. Child elements are built by chaining `.ele()` / `.txt()` / `.up()`.
+`doc.end({ prettyPrint: false })` serialises to a string prefixed with
+`<?xml version="1.0" encoding="UTF-8"?>`. `prettyPrint: false` is chosen for minimal byte overhead
+(sitemap consumers parse the XML rather than read it).
+
+**Sitemap namespace**: `http://www.sitemaps.org/schemas/sitemap/0.9` — required by the Sitemaps
+protocol and the XSD schema referenced in SC-004.
+
+**Validation**: The serialised string must begin with `<?xml` and contain a valid `<urlset>` root.
+Unit tests will assert this.
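The validation rule above could be expressed in a unit test roughly as follows. The helper name `looksLikeSitemap` is hypothetical, and the checks are plain string tests: they are a cheap structural sanity check, not a substitute for validating against the sitemap XSD.

```javascript
// Hypothetical test helper: cheap structural checks on the serialised sitemap.
function looksLikeSitemap(xml) {
  const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';
  return (
    typeof xml === 'string' &&
    xml.startsWith('<?xml') &&                  // XML declaration comes first
    xml.includes('<urlset') &&                  // root element is present
    xml.includes(`xmlns="${SITEMAP_NS}"`)       // required sitemap namespace
  );
}

// Hand-written sample in the shape the serialiser is expected to produce.
const sample =
  '<?xml version="1.0" encoding="UTF-8"?>' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' +
  '<url><loc>https://proxy.example.com/adapter?kmeURL=x</loc></url>' +
  '</urlset>';

console.log(looksLikeSitemap(sample));          // prints "true"
console.log(looksLikeSitemap('<urlset/>'));     // prints "false"
```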
+
+**Alternatives considered**:
+- Manual string concatenation: rejected (error-prone escaping, violates FR-008 which requires
+  xmlBuilder).
+- `xmlbuilder` (v1/v2): not the installed package; rejected.
+
+---
+
+## R-004: Axios Error Differentiation — 502 vs 504
+
+**Decision**: Reuse the exact error-detection pattern already present in the script:
+
+| Condition | Status | Detection |
+|---|---|---|
+| `err.response` is defined | 502 Bad Gateway | Axios sets `err.response` for non-2xx HTTP responses |
+| `err.code === 'ECONNABORTED'` | 504 Gateway Timeout | Axios default error code when the configured `timeout` elapses |
+| `err.code === 'ERR_CANCELED'` | 504 Gateway Timeout | Request cancelled, e.g. by an `AbortSignal` timeout |
+| Other | 502 Bad Gateway | Treated as upstream failure |
+
+**Rationale**: The existing script already uses this exact pattern for token-service errors
+(`err.response`, `err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'`). Reusing it for
+search-service errors ensures consistent error classification across all upstream calls.
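The mapping in the table can be sketched as a small pure function. The name `classifyUpstreamError` is hypothetical; the conditions and their order follow the table, with `err.response` checked first.

```javascript
// Hypothetical helper: map an Axios-style error to the proxy's response status.
function classifyUpstreamError(err) {
  // The upstream answered with a non-2xx status: report Bad Gateway.
  if (err && err.response) return 502;
  // No response arrived because the request timed out or was cancelled.
  if (err && (err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED')) return 504;
  // Anything else (DNS failure, connection refused, ...) is an upstream failure.
  return 502;
}

console.log(classifyUpstreamError({ response: { status: 500 } })); // prints 502
console.log(classifyUpstreamError({ code: 'ECONNABORTED' }));      // prints 504
console.log(classifyUpstreamError({ code: 'ERR_CANCELED' }));      // prints 504
console.log(classifyUpstreamError({ code: 'ECONNREFUSED' }));      // prints 502
```

Keeping the classification in one function would let both the token-service and search-service call sites share it, which is the consistency the rationale asks for.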
+
+**Timeout value**: 10 000 ms, as stated in the spec assumption ("consistent with industry-standard
+defaults for proxy-initiated upstream requests").
+
+**Alternatives considered**:
+- `AbortController` + `fetch`: not available in the VM context (only `axios` is injected). Rejected.
+- Different timeout for search vs auth: spec does not require this; YAGNI.
+
+---
+
+## R-005: Settings Validation — New Fields
+
+**Decision**: At the entry point of the sitemap flow, perform an explicit guard before any async
+operation:
+
+```javascript
+const requiredSitemapFields = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl'];
+for (const field of requiredSitemapFields) {
+  if (!kme_CSA_settings[field]) {
+    res.writeHead(500, { 'Content-Type': 'text/plain' });
+    res.end('Configuration error: missing required field: ' + field);
+    return;
+  }
+}
+```
+
+**Rationale**: FR-011 requires HTTP 500 with a descriptive message for missing settings. Checking
+before any async work means no I/O is attempted against an unconfigured upstream, and the error
+message identifies exactly which field is absent.
+
+**The three new fields to add to `kme_CSA_settings.json`**:
+
+| Field | Type | Description |
+|---|---|---|
+| `searchApiBaseUrl` | string | Base URL of the KME Knowledge Search Service |
+| `tenant` | string | Tenant identifier appended to search base URL |
+| `proxyBaseUrl` | string | Externally accessible HTTPS URL of this adapter instance |
+
+---
+
+## R-006: `loc` URL Construction and `vkm:url` Encoding
+
+**Decision**: Construct each `<loc>` as:
+
+```javascript
+`${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}`
+```
+
+**Rationale**: FR-005 specifies exactly this pattern. `encodeURIComponent` is a built-in available
+inside the VM context without injection (it is a standard JavaScript global). Using it percent-encodes
+the `vkm:url` value, producing a safe query-string parameter even if the value contains `://`, `?`,
+`#`, or other URL-special characters.
+
+**Empty/missing guard** (FR-006):
+```javascript
+const vkmUrl = item['vkm:url'];
+if (!vkmUrl) continue; // omit silently
+```
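Putting the construction rule and the empty/missing guard together, a minimal sketch might look like this. The function name `buildLocValues` and the `https://proxy.example.com/adapter` base URL are assumptions for illustration only.

```javascript
// Hypothetical helper: turn search items into <loc> values, skipping unusable ones.
function buildLocValues(items, proxyBaseUrl) {
  const locs = [];
  for (const item of items) {
    const vkmUrl = item['vkm:url'];
    if (!vkmUrl) continue; // omit silently (FR-006)
    // Percent-encode the whole document URL so it is safe as one query value.
    locs.push(`${proxyBaseUrl}?kmeURL=${encodeURIComponent(vkmUrl)}`);
  }
  return locs;
}

const locs = buildLocValues(
  [
    { 'vkm:url': 'https://kme.example.com/knowledge/doc-1' },
    { 'vkm:url': '' },            // empty: skipped
    { title: 'no url field' },    // missing: skipped
  ],
  'https://proxy.example.com/adapter'
);
console.log(locs);
// prints [ 'https://proxy.example.com/adapter?kmeURL=https%3A%2F%2Fkme.example.com%2Fknowledge%2Fdoc-1' ]
```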
+
+---
+
+## Summary of All Decisions
+
+| ID | Topic | Decision |
+|---|---|---|
+| R-001 | Token reuse | Inline shared token-fetch logic; branch on URL first |
+| R-002 | Search API response shape | Assume `{ items: [...] }`; verify against live API |
+| R-003 | xmlbuilder2 API | `xmlBuilder({...}).ele('urlset', {...})…doc.end({})` |
+| R-004 | Error mapping | Reuse existing `err.response` / `err.code` pattern |
+| R-005 | Settings validation | Explicit `requiredSitemapFields` guard → HTTP 500 |
+| R-006 | `loc` construction | `proxyBaseUrl?kmeURL=encodeURIComponent(vkm:url)` |