kme_content_adapter/specs/002-sitemap-generation/research.md
Peter.Morton 50b87297d2 feat(002): add sitemap generation feature
- Refactor kmeContentSourceAdapter.js into getValidToken(), oidcAuthFlow(),
  and sitemapFlow(); add sitemap generation using hydra:member response structure
- Add searchApiBaseUrl, tenant, proxyBaseUrl fields to kme_CSA_settings.json
  and kme_CSA_settings.json.example
- Add 17 unit tests for sitemap flow and non-sitemap routing regression
- Add 5 contract tests for sitemap endpoint (proxy-http.test.js)
- Add [Unreleased] sitemap entry to CHANGELOG.md
- Add full specs/002-sitemap-generation/ artifact directory
  (spec, plan, tasks, data-model, contracts, research, quickstart, checklist)
- Update constitution.md: add redis as permitted global, refresh
  kme_CSA_settings references
- Update copilot-instructions.md SPECKIT marker to sitemap plan
2026-04-22 22:08:08 -05:00

# Research: Sitemap XML Generation
**Feature**: `002-sitemap-generation`
**Branch**: `002-sitemap-generation`
**Date**: 2025-07-14
---
## R-001: Token Reuse — OIDC Cache Pattern
**Decision**: Reuse `redis.hGet('authorization', 'token')` / `redis.hGet('authorization', 'expiry')`
and the existing stampede-guard / token-refresh flow verbatim.
**Rationale**: The existing `kmeContentSourceAdapter.js` already implements a correct, battle-tested
pattern for obtaining a valid OIDC `id_token` from Redis and refreshing it when expired. Duplicating
only the cache-read portion (steps 1-3 of the existing flow) would create divergence. Running the
full existing token sequence before entering the sitemap flow avoids that risk while reusing the
security invariants already proven in production.
**Approach in code**: Refactor the top-level IIFE so that:
1. URL routing check happens **first** (before any async work).
2. For sitemap requests, a shared `getValidToken()` helper (inlined in the script, no imports)
performs the identical cache-hit → stampede-guard → refresh → cache-write sequence.
3. For all other requests, the existing flow runs unchanged.
**Alternatives considered**:
- Call the existing OIDC logic unconditionally, then branch: rejected because it adds unnecessary
latency to non-sitemap requests (token check not needed for sitemap but would execute anyway).
- Separate helper file: rejected by the monolithic architecture constraint (Section I, constitution).
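
The cache-hit → stampede-guard → refresh → cache-write sequence could be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: `redis` and `fetchToken` are passed as parameters only to keep the example self-contained, `refreshToken` and the `{ token, expiresAt }` shape are hypothetical, the expiry is assumed to be stored as epoch milliseconds, and the stampede guard is assumed to be a simple in-process shared promise.

```javascript
// Assumed in-process stampede guard: concurrent callers share one refresh.
let inFlightRefresh = null;

// Hypothetical helper: fetch a fresh token and write it back to the cache.
async function refreshToken(redis, fetchToken) {
  const fresh = await fetchToken(); // assumed shape: { token, expiresAt }
  await redis.hSet('authorization', 'token', fresh.token);
  await redis.hSet('authorization', 'expiry', String(fresh.expiresAt));
  return fresh.token;
}

async function getValidToken(redis, fetchToken) {
  // Cache hit: return the stored token while it is still valid.
  const token = await redis.hGet('authorization', 'token');
  const expiry = Number(await redis.hGet('authorization', 'expiry'));
  if (token && expiry > Date.now()) return token;

  // Cache miss or expired: start a refresh unless one is already running.
  if (!inFlightRefresh) {
    inFlightRefresh = refreshToken(redis, fetchToken)
      .finally(() => { inFlightRefresh = null; }); // release the guard
  }
  return inFlightRefresh;
}
```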
---
## R-002: KME Knowledge Search Service API — Response Envelope
**Decision**: Assume the response body is a JSON object with a top-level `items` array. Each element
of `items` is an object whose `vkm:url` property holds the canonical document URL.
**Rationale**: The feature spec states:
> "The `vkm:url` field is present at the top level of each item object in the search results
> array; the exact response envelope shape will be confirmed against the live API during
> implementation."
The most common shape for knowledge/search services is `{ items: [ { "vkm:url": "...", ... } ] }`.
This assumption allows the code to be written and fully unit-tested before live-API access is
available. A single `items` extraction line (`response.data.items ?? response.data`) means that
adapting to the real shape is a one-line change.
**Concrete assumption**:
```json
{
  "items": [
    { "vkm:url": "https://kme.example.com/knowledge/doc-1", "title": "…" },
    { "vkm:url": "https://kme.example.com/knowledge/doc-2", "title": "…" }
  ]
}
```
**Verification required**: During implementation, run the live API call against
`<searchApiBaseUrl>/<tenant>` and confirm:
1. The top-level key that holds the array (likely `items`, `results`, or the root is directly an
array).
2. That `vkm:url` is a string property, not nested deeper.
**Fallback**: If the root is a bare array, `response.data` itself is used as the items array.
**Alternatives considered**:
- `results` key: equally plausible; the code will use `response.data.items ?? response.data` as a
defensive pattern until confirmed.
- Deeply nested: no evidence for this; rejected pending confirmation.
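
Until the envelope is confirmed, the extraction step could be made slightly more defensive than the one-liner. The sketch below is an assumption-laden illustration: `extractItems` is a hypothetical helper name, and the candidate keys are guesses (`items` and `results` from the discussion above, `hydra:member` from the commit description). Unlike `response.data.items ?? response.data`, it returns an empty array for an unrecognised shape rather than passing a non-array object on to the loop.

```javascript
// Candidate envelope keys; all are assumptions until checked against the live API.
const CANDIDATE_KEYS = ['items', 'results', 'hydra:member'];

// Normalise a search response body into an array of item objects.
function extractItems(body) {
  if (Array.isArray(body)) return body; // root is a bare array
  if (body && typeof body === 'object') {
    for (const key of CANDIDATE_KEYS) {
      if (Array.isArray(body[key])) return body[key];
    }
  }
  return []; // unknown shape: produce no URLs rather than throw
}
```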
---
## R-003: xmlbuilder2 `create()` API for Sitemap XML
**Decision**: Use the `xmlBuilder` context variable (which is `xmlbuilder2`'s `create` function)
with the following call chain:
```javascript
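// `xmlBuilder` is xmlbuilder2's `create`; `items` and `proxyBaseUrl` come from earlier steps.
const doc = xmlBuilder({ version: '1.0', encoding: 'UTF-8' });
const urlset = doc.ele('urlset', { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' });
for (const item of items) {
  // See R-006 for how the <loc> value is constructed.
  const locValue = `${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}`;
  urlset.ele('url').ele('loc').txt(locValue).up().up();
}
const xml = doc.end({ prettyPrint: false });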
```
**Rationale**: `xmlbuilder2` v4.x `create()` returns an `XMLBuilder` document node. Calling `.ele()`
on it creates the root element. Child elements are built by chaining `.ele()` / `.txt()` / `.up()`.
`doc.end({ prettyPrint: false })` serialises to a string prefixed with `<?xml version="1.0"
encoding="UTF-8"?>`. `prettyPrint: false` is chosen for minimal byte overhead (sitemap consumers
parse XML, not read it).
**Sitemap namespace**: `http://www.sitemaps.org/schemas/sitemap/0.9` — required by the Sitemaps
protocol and the XSD schema referenced in SC-004.
**Validation**: The serialised string must begin with `<?xml` and contain a valid `<urlset>` root.
Unit tests will assert this.
**Alternatives considered**:
- Manual string concatenation: rejected (error-prone escaping, violates FR-008 which requires
xmlBuilder).
- `xmlbuilder` (v1/v2): not the installed package; rejected.
---
## R-004: Axios Error Differentiation — 502 vs 504
**Decision**: Reuse the exact error-detection pattern already present in the script:
| Condition | Status | Detection |
|---|---|---|
| `err.response` is defined | 502 Bad Gateway | Axios sets `err.response` for non-2xx HTTP responses |
| `err.code === 'ECONNABORTED'` | 504 Gateway Timeout | Axios `timeout` option exceeded (Axios's default timeout error code) |
| `err.code === 'ERR_CANCELED'` | 504 Gateway Timeout | Request cancelled via an `AbortSignal` (for example a timeout signal) |
| Other | 502 Bad Gateway | Treated as upstream failure |
**Rationale**: The existing script already uses this exact pattern for token-service errors
(`err.response`, `err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'`). Reusing it for
search-service errors ensures consistent error classification across all upstream calls.
**Timeout value**: 10 000 ms, as stated in the spec assumption ("consistent with industry-standard
defaults for proxy-initiated upstream requests").
**Alternatives considered**:
- `AbortController` + `fetch`: not available in the VM context (only `axios` is injected). Rejected.
- Different timeout for search vs auth: spec does not require this; YAGNI.
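
The mapping in the table can be expressed as a small pure function. This is only a sketch of the classification order described above; `upstreamStatus` is a hypothetical name, not something taken from the existing script.

```javascript
// Map an Axios error to the HTTP status the adapter should return.
function upstreamStatus(err) {
  // Upstream answered with a non-2xx response.
  if (err && err.response) return 502;
  // Request timed out or was cancelled before a response arrived.
  if (err && (err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED')) return 504;
  // Anything else is treated as an upstream failure.
  return 502;
}
```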
---
## R-005: Settings Validation — New Fields
**Decision**: At the entry point of the sitemap flow, perform an explicit guard before any async
operation:
```javascript
const requiredSitemapFields = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl'];
for (const field of requiredSitemapFields) {
  if (!kme_CSA_settings[field]) {
    res.writeHead(500, { 'Content-Type': 'text/plain' });
    res.end('Configuration error: missing required field: ' + field);
    return;
  }
}
```
**Rationale**: FR-011 requires HTTP 500 with a descriptive message for missing settings. Checking
before any async work means no I/O is attempted against an unconfigured upstream, and the error
message identifies exactly which field is absent.
**The three new fields to add to `kme_CSA_settings.json`**:
| Field | Type | Description |
|---|---|---|
| `searchApiBaseUrl` | string | Base URL of the KME Knowledge Search Service |
| `tenant` | string | Tenant identifier appended to the search base URL |
| `proxyBaseUrl` | string | Externally accessible HTTPS URL of this adapter instance |
---
## R-006: `loc` URL Construction and `vkm:url` Encoding
**Decision**: Construct each `<loc>` as:
```javascript
`${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}`
```
**Rationale**: FR-005 specifies exactly this pattern. `encodeURIComponent` is a built-in available
inside the VM context without injection (it is a standard JavaScript global). Using it percent-encodes
the `vkm:url` value, producing a safe query-string parameter even if the value contains `://`, `?`,
`#`, or other URL-special characters.
**Empty/missing guard** (FR-006):
```javascript
const vkmUrl = item['vkm:url'];
if (!vkmUrl) continue; // omit silently
```
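
Combining the construction rule with the empty/missing guard, a worked example might look like the following. `buildLoc` is a hypothetical helper and `https://proxy.example.com/adapter` is a placeholder `proxyBaseUrl`; neither comes from the spec.

```javascript
// Build the <loc> value for one search item, or return null to skip it.
function buildLoc(proxyBaseUrl, item) {
  const vkmUrl = item?.['vkm:url'];
  if (!vkmUrl) return null; // FR-006: omit silently
  return `${proxyBaseUrl}?kmeURL=${encodeURIComponent(vkmUrl)}`;
}

const exampleLoc = buildLoc(
  'https://proxy.example.com/adapter', // placeholder value
  { 'vkm:url': 'https://kme.example.com/knowledge/doc-1?a=1#top' }
);
// exampleLoc ===
// 'https://proxy.example.com/adapter?kmeURL=https%3A%2F%2Fkme.example.com%2Fknowledge%2Fdoc-1%3Fa%3D1%23top'
```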
---
## Summary of All Decisions
| ID | Topic | Decision |
|---|---|---|
| R-001 | Token reuse | Inline shared token-fetch logic; branch on URL first |
| R-002 | Search API response shape | Assume `{ items: [...] }`; verify against live API |
| R-003 | xmlbuilder2 API | `xmlBuilder({...}).ele('urlset', {...})…doc.end({})` |
| R-004 | Error mapping | Reuse existing `err.response` / `err.code` pattern |
| R-005 | Settings validation | Explicit `requiredSitemapFields` guard → HTTP 500 |
| R-006 | `loc` construction | `proxyBaseUrl?kmeURL=encodeURIComponent(vkm:url)` |