feat(002): add sitemap generation feature

- Refactor kmeContentSourceAdapter.js into getValidToken(), oidcAuthFlow(), and sitemapFlow(); add sitemap generation using hydra:member response structure
- Add searchApiBaseUrl, tenant, proxyBaseUrl fields to kme_CSA_settings.json and kme_CSA_settings.json.example
- Add 17 unit tests for sitemap flow and non-sitemap routing regression
- Add 5 contract tests for sitemap endpoint (proxy-http.test.js)
- Add [Unreleased] sitemap entry to CHANGELOG.md
- Add full specs/002-sitemap-generation/ artifact directory (spec, plan, tasks, data-model, contracts, research, quickstart, checklist)
- Update constitution.md: add redis as permitted global, refresh kme_CSA_settings references
- Update copilot-instructions.md SPECKIT marker to sitemap plan
190
specs/002-sitemap-generation/research.md
Normal file
@@ -0,0 +1,190 @@
# Research: Sitemap XML Generation

**Feature**: `002-sitemap-generation`
**Branch**: `002-sitemap-generation`
**Date**: 2025-07-14

---

## R-001: Token Reuse — OIDC Cache Pattern

**Decision**: Reuse `redis.hGet('authorization', 'token')` / `redis.hGet('authorization', 'expiry')`
and the existing stampede-guard / token-refresh flow verbatim.

**Rationale**: The existing `kmeContentSourceAdapter.js` already implements a correct, battle-tested
pattern for obtaining a valid OIDC `id_token` from Redis and refreshing it when expired. Duplicating
only the cache-read portion (steps 1–3 of the existing flow) would create divergence. Running the
full existing sequence through one shared helper before the sitemap-specific work avoids that risk
while reusing the security invariants already proven in production.

**Approach in code**: Refactor the top-level IIFE so that:
1. URL routing check happens **first** (before any async work).
2. For sitemap requests, a shared `getValidToken()` helper (inlined in the script, no imports)
   performs the identical cache-hit → stampede-guard → refresh → cache-write sequence.
3. For all other requests, the existing flow runs unchanged.

**Alternatives considered**:
- Call the existing OIDC logic unconditionally, then branch: rejected because it adds unnecessary
  latency to non-sitemap requests (token check not needed for sitemap but would execute anyway).
- Separate helper file: rejected by the monolithic architecture constraint (Section I, constitution).

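The routing-first approach above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the `/sitemap.xml` path test, the `fetchToken` function, the `deps` object, the epoch-millisecond format of `expiry`, and the in-process promise used as a stampede guard are all assumptions made for the example. Only the `authorization` hash fields and the names `getValidToken()`, `oidcAuthFlow()`, and `sitemapFlow()` come from this change.

```javascript
// Hypothetical path check; the real routing rule is defined by the spec, not here.
const isSitemapRequest = (url) => url.split('?')[0].endsWith('/sitemap.xml');

// Refresh currently in flight, shared by concurrent callers. This is an
// in-process stand-in for the script's stampede guard (mechanism assumed).
let inFlightRefresh = null;

// Returns a valid token: cache hit -> stampede guard -> refresh -> cache write.
// `redis` must provide hGet/hSet; `fetchToken` is a hypothetical function that
// resolves to { token, expiresInMs }. `expiry` is assumed to be epoch milliseconds.
async function getValidToken(redis, fetchToken, now = Date.now()) {
  const token = await redis.hGet('authorization', 'token');
  const expiry = Number(await redis.hGet('authorization', 'expiry'));
  if (token && expiry > now) return token; // cache hit

  if (!inFlightRefresh) {
    inFlightRefresh = (async () => {
      const fresh = await fetchToken();
      await redis.hSet('authorization', 'token', fresh.token);
      await redis.hSet('authorization', 'expiry', String(now + fresh.expiresInMs));
      return fresh.token;
    })().finally(() => { inFlightRefresh = null; });
  }
  return inFlightRefresh; // concurrent callers share one refresh
}

// The routing decision is synchronous and comes first; async work starts after it.
async function handle(req, deps) {
  if (isSitemapRequest(req.url)) {
    const token = await getValidToken(deps.redis, deps.fetchToken);
    return deps.sitemapFlow(token);
  }
  return deps.oidcAuthFlow(req); // existing flow, unchanged
}
```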
---

## R-002: KME Knowledge Search Service API — Response Envelope

**Decision**: Assume the response body is a JSON object with a top-level `items` array. Each element
of `items` is an object whose `vkm:url` property holds the canonical document URL.

**Rationale**: The feature spec states:
> "The `vkm:url` field is present at the top level of each item object in the search results
> array; the exact response envelope shape will be confirmed against the live API during
> implementation."

A common shape for knowledge/search services is `{ items: [ { "vkm:url": "...", ... } ] }`.
This assumption allows the code to be written and fully unit-tested before live-API access is
available. A single `items` extraction line (`response.data.items ?? response.data`) means the
adaptation to the real shape is a one-line change.

**Concrete assumption**:
```json
{
  "items": [
    { "vkm:url": "https://kme.example.com/knowledge/doc-1", "title": "…" },
    { "vkm:url": "https://kme.example.com/knowledge/doc-2", "title": "…" }
  ]
}
```

**Verification required**: During implementation, run the live API call against
`<searchApiBaseUrl>/<tenant>` and confirm:
1. The top-level key that holds the array (likely `items`, `results`, or the root is directly an
   array).
2. That `vkm:url` is a string property, not nested deeper.

**Fallback**: If the root is a bare array, `response.data` itself is used as the items array.

**Alternatives considered**:
- `results` key: equally plausible; the code will use `response.data.items ?? response.data` as a
  defensive pattern until confirmed (a `results` envelope would need the key in that line changed).
- Deeply nested: no evidence for this; rejected pending confirmation.

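A minimal sketch of the extraction line with the bare-array fallback, under the `items` assumption stated above. The helper name `extractItems` and the `Array.isArray` guard (which turns any unexpected shape into an empty list rather than an exception) are additions for illustration, not part of the spec.

```javascript
// Normalise the search response body into an array of result objects.
// Covers the assumed `{ items: [...] }` envelope and the bare-array fallback.
// Any other shape (for example a `results` envelope) yields an empty array,
// which is an assumption of this sketch rather than specified behaviour.
function extractItems(data) {
  const candidate = data?.items ?? data;
  return Array.isArray(candidate) ? candidate : [];
}

// Usage, where `response` is the resolved Axios response:
//   const items = extractItems(response.data);
```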
---

## R-003: xmlbuilder2 `create()` API for Sitemap XML

**Decision**: Use the `xmlBuilder` context variable (which is `xmlbuilder2`'s `create` function)
with the following call chain:

```javascript
const doc = xmlBuilder({ version: '1.0', encoding: 'UTF-8' });
const urlset = doc.ele('urlset', { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' });
for (const item of items) {
  // locValue is the <loc> URL built from item['vkm:url'] (see R-006)
  urlset.ele('url').ele('loc').txt(locValue).up().up();
}
const xml = doc.end({ prettyPrint: false });
```

**Rationale**: `xmlbuilder2` v4.x `create()` returns an `XMLBuilder` document node. Calling `.ele()`
on it creates the root element. Child elements are built by chaining `.ele()` / `.txt()` / `.up()`.
`doc.end({ prettyPrint: false })` serialises to a string prefixed with
`<?xml version="1.0" encoding="UTF-8"?>`. `prettyPrint: false` is chosen for minimal byte overhead
(sitemap consumers parse XML, not read it).

**Sitemap namespace**: `http://www.sitemaps.org/schemas/sitemap/0.9`, required by the Sitemaps
protocol and the XSD schema referenced in SC-004.

**Validation**: The serialised string must begin with `<?xml` and contain a valid `<urlset>` root.
Unit tests will assert this.

**Alternatives considered**:
- Manual string concatenation: rejected (error-prone escaping, violates FR-008 which requires
  xmlBuilder).
- `xmlbuilder` (v1/v2): not the installed package; rejected.

---

## R-004: Axios Error Differentiation — 502 vs 504

**Decision**: Reuse the exact error-detection pattern already present in the script:

| Condition | Status | Detection |
|---|---|---|
| `err.response` is defined | 502 Bad Gateway | Axios sets `err.response` for non-2xx HTTP responses |
| `err.code === 'ECONNABORTED'` | 504 Gateway Timeout | Axios `timeout` option elapsed (Axios's default timeout error code) |
| `err.code === 'ERR_CANCELED'` | 504 Gateway Timeout | Request cancelled, e.g. via an `AbortSignal` |
| Other | 502 Bad Gateway | Treated as upstream failure |

**Rationale**: The existing script already uses this exact pattern for token-service errors
(`err.response`, `err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'`). Reusing it for
search-service errors ensures consistent error classification across all upstream calls.

**Timeout value**: 10 000 ms, as stated in the spec assumption ("consistent with industry-standard
defaults for proxy-initiated upstream requests").

**Alternatives considered**:
- `AbortController` + `fetch`: not available in the VM context (only `axios` is injected). Rejected.
- Different timeout for search vs auth: spec does not require this; YAGNI.

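The table above maps directly onto a small classification function. The sketch below is one way to express it; the function name `upstreamErrorStatus` is hypothetical, and the checks are evaluated in the same order as the table rows.

```javascript
// Map an Axios error to the HTTP status the adapter should return.
// Order matters: an upstream HTTP response (err.response) is checked first.
function upstreamErrorStatus(err) {
  if (err.response) return 502; // upstream answered with a non-2xx status
  if (err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED') {
    return 504; // request timed out or was aborted
  }
  return 502; // anything else is treated as an upstream failure
}

// Usage inside a catch block around the search call (timeout: 10000):
//   res.writeHead(upstreamErrorStatus(err), { 'Content-Type': 'text/plain' });
```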
---

## R-005: Settings Validation — New Fields

**Decision**: At the entry point of the sitemap flow, perform an explicit guard before any async
operation:

```javascript
const requiredSitemapFields = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl'];
for (const field of requiredSitemapFields) {
  if (!kme_CSA_settings[field]) {
    res.writeHead(500, { 'Content-Type': 'text/plain' });
    res.end('Configuration error: missing required field: ' + field);
    return;
  }
}
```

**Rationale**: FR-011 requires HTTP 500 with a descriptive message for missing settings. Checking
before any async work means no I/O is attempted against an unconfigured upstream, and the error
message identifies exactly which field is absent.

**The three new fields to add to `kme_CSA_settings.json`**:

| Field | Type | Description |
|---|---|---|
| `searchApiBaseUrl` | string | Base URL of the KME Knowledge Search Service |
| `tenant` | string | Tenant identifier appended to search base URL |
| `proxyBaseUrl` | string | Externally accessible HTTPS URL of this adapter instance |

---

## R-006: `loc` URL Construction and `vkm:url` Encoding

**Decision**: Construct each `<loc>` as:

```javascript
`${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}`
```

**Rationale**: FR-005 specifies exactly this pattern. `encodeURIComponent` is a built-in available
inside the VM context without injection (it is a standard JavaScript global). Using it percent-encodes
the `vkm:url` value, producing a safe query-string parameter even if the value contains `://`, `?`,
`#`, or other URL-special characters.

**Empty/missing guard** (FR-006):
```javascript
const vkmUrl = item['vkm:url'];
if (!vkmUrl) continue; // omit silently
```

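Putting the construction rule and the guard together, a worked example may help show what the encoded value looks like. The helper name `buildLocValues` and the `https://proxy.example.com` base URL are hypothetical; the document URL is the one used in the R-002 sample.

```javascript
// Build the <loc> values for a list of search results.
// Items with an empty or missing `vkm:url` are skipped silently (FR-006).
function buildLocValues(items, proxyBaseUrl) {
  const locs = [];
  for (const item of items) {
    const vkmUrl = item['vkm:url'];
    if (!vkmUrl) continue; // omit silently
    locs.push(`${proxyBaseUrl}?kmeURL=${encodeURIComponent(vkmUrl)}`);
  }
  return locs;
}

const locs = buildLocValues(
  [
    { 'vkm:url': 'https://kme.example.com/knowledge/doc-1' },
    { title: 'no url, skipped' },
  ],
  'https://proxy.example.com' // hypothetical proxyBaseUrl
);
// locs[0] is:
// https://proxy.example.com?kmeURL=https%3A%2F%2Fkme.example.com%2Fknowledge%2Fdoc-1
```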
---

## Summary of All Decisions

| ID | Topic | Decision |
|---|---|---|
| R-001 | Token reuse | Inline shared token-fetch logic; branch on URL first |
| R-002 | Search API response shape | Assume `{ items: [...] }`; verify against live API |
| R-003 | xmlbuilder2 API | `xmlBuilder({...}).ele('urlset', {...})…doc.end({})` |
| R-004 | Error mapping | Reuse existing `err.response` / `err.code` pattern |
| R-005 | Settings validation | Explicit `requiredSitemapFields` guard → HTTP 500 |
| R-006 | `loc` construction | `proxyBaseUrl?kmeURL=encodeURIComponent(vkm:url)` |