kme_content_adapter/specs/002-sitemap-generation/research.md
Peter.Morton 50b87297d2 feat(002): add sitemap generation feature
- Refactor kmeContentSourceAdapter.js into getValidToken(), oidcAuthFlow(),
  and sitemapFlow(); add sitemap generation using hydra:member response structure
- Add searchApiBaseUrl, tenant, proxyBaseUrl fields to kme_CSA_settings.json
  and kme_CSA_settings.json.example
- Add 17 unit tests for sitemap flow and non-sitemap routing regression
- Add 5 contract tests for sitemap endpoint (proxy-http.test.js)
- Add [Unreleased] sitemap entry to CHANGELOG.md
- Add full specs/002-sitemap-generation/ artifact directory
  (spec, plan, tasks, data-model, contracts, research, quickstart, checklist)
- Update constitution.md: add redis as permitted global, refresh
  kme_CSA_settings references
- Update copilot-instructions.md SPECKIT marker to sitemap plan
2026-04-22 22:08:08 -05:00


Research: Sitemap XML Generation

Feature: 002-sitemap-generation | Branch: 002-sitemap-generation | Date: 2025-07-14


R-001: Token Reuse — OIDC Cache Pattern

Decision: Reuse redis.hGet('authorization', 'token') / redis.hGet('authorization', 'expiry') and the existing stampede-guard / token-refresh flow verbatim.

Rationale: The existing kmeContentSourceAdapter.js already implements a correct, production-proven pattern for obtaining a valid OIDC id_token from Redis and refreshing it when expired. Duplicating only the cache-read portion (steps 1-3 of the existing flow) would let the two copies diverge. Running the full existing sequence before entering the sitemap flow avoids that risk and reuses the security invariants already proven in production.

Approach in code: Refactor the top-level IIFE so that:

  1. URL routing check happens first (before any async work).
  2. For sitemap requests, a shared getValidToken() helper (inlined in the script, no imports) performs the identical cache-hit → stampede-guard → refresh → cache-write sequence.
  3. For all other requests, the existing flow runs unchanged.
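
The three steps above might be sketched as follows. This is a minimal illustration under stated assumptions, not the adapter's actual code: the /sitemap.xml path, the isSitemapRequest and refreshToken names, the in-flight-promise stampede guard, redis.hSet for the cache write, and an expiry stored as epoch milliseconds are all hypothetical. Only the redis.hGet('authorization', ...) reads come from the decision above.

```javascript
// Sketch only. Hypothetical names: SITEMAP_PATH, isSitemapRequest, refreshToken.
// Assumes `redis` exposes async hGet/hSet and that 'expiry' holds epoch milliseconds.
const SITEMAP_PATH = '/sitemap.xml'; // assumed route, not confirmed by the spec

// Step 1: a synchronous routing check, done before any async work.
function isSitemapRequest(url) {
  return String(url).split('?')[0] === SITEMAP_PATH;
}

let inFlightRefresh = null; // assumed stampede guard: one shared refresh promise

// Step 2: cache-hit -> stampede-guard -> refresh -> cache-write.
async function getValidToken(redis, refreshToken, now = Date.now()) {
  const token = await redis.hGet('authorization', 'token');
  const expiry = Number(await redis.hGet('authorization', 'expiry'));
  if (token && expiry > now) return token; // cache hit

  if (!inFlightRefresh) {
    inFlightRefresh = (async () => {
      try {
        const fresh = await refreshToken(); // assumed to resolve to { token, expiry }
        await redis.hSet('authorization', 'token', fresh.token);
        await redis.hSet('authorization', 'expiry', String(fresh.expiry));
        return fresh.token;
      } finally {
        inFlightRefresh = null; // allow a later refresh after success or failure
      }
    })();
  }
  return inFlightRefresh; // concurrent callers share the same refresh
}
```

Passing redis and refreshToken as parameters is a choice made for this sketch so it can be exercised with fakes; in the monolithic script they would simply be the injected context variables.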

Alternatives considered:

  • Call the existing OIDC logic unconditionally, then branch: rejected because it adds unnecessary latency to non-sitemap requests (token check not needed for sitemap but would execute anyway).
  • Separate helper file: rejected by the monolithic architecture constraint (Section I, constitution).

R-002: KME Knowledge Search Service API — Response Envelope

Decision: Assume the response body is a JSON object with a top-level items array. Each element of items is an object whose vkm:url property holds the canonical document URL.

Rationale: The feature spec states:

"The vkm:url field is present at the top level of each item object in the search results array; the exact response envelope shape will be confirmed against the live API during implementation."

The most common shape for knowledge/search services is { items: [ { "vkm:url": "...", ... } ] }. This assumption allows the code to be written and fully unit-tested before live-API access is available. Because the items are extracted in a single line (response.data.items ?? response.data), adapting to the real shape is a one-line change.

Concrete assumption:

{
  "items": [
    { "vkm:url": "https://kme.example.com/knowledge/doc-1", "title": "…" },
    { "vkm:url": "https://kme.example.com/knowledge/doc-2", "title": "…" }
  ]
}

Verification required: During implementation, run the live API call against <searchApiBaseUrl>/<tenant> and confirm:

  1. The top-level key that holds the array (likely items or results), or whether the root is itself an array.
  2. That vkm:url is a string property, not nested deeper.

Fallback: If the root is a bare array, response.data itself is used as the items array.
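
The extraction and its fallback could look like the sketch below. extractItems is a hypothetical helper name, and returning an empty list for any non-array shape is an added safeguard that the spec does not call for.

```javascript
// Sketch only: extractItems is a hypothetical helper name.
// `body` stands for the parsed JSON body (response.data in Axios terms).
function extractItems(body) {
  // Prefer the assumed `items` envelope; fall back to a bare root array.
  const candidate = body?.items ?? body;
  // Anything that is still not an array yields an empty list (added safeguard).
  return Array.isArray(candidate) ? candidate : [];
}
```

If the live API turns out to use a different key, such as results, only the first line of the function body would need to change.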

Alternatives considered:

  • results key: equally plausible; the code will use response.data.items ?? response.data as a defensive pattern until confirmed.
  • Deeply nested: no evidence for this; rejected pending confirmation.

R-003: xmlbuilder2 create() API for Sitemap XML

Decision: Use the xmlBuilder context variable (which is xmlbuilder2's create function) with the following call chain:

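const doc = xmlBuilder({ version: '1.0', encoding: 'UTF-8' });
const urlset = doc.ele('urlset', { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' });
for (const item of items) {
  const vkmUrl = item['vkm:url'];
  if (!vkmUrl) continue; // FR-006: skip items without a URL
  // <loc> value as defined in R-006
  const locValue = `${kme_CSA_settings.proxyBaseUrl}?kmeURL=${encodeURIComponent(vkmUrl)}`;
  urlset.ele('url').ele('loc').txt(locValue).up().up();
}
const xml = doc.end({ prettyPrint: false });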

Rationale: xmlbuilder2 v4.x create() returns an XMLBuilder document node. Calling .ele() on it creates the root element. Child elements are built by chaining .ele() / .txt() / .up(). doc.end({ prettyPrint: false }) serialises to a string prefixed with <?xml version="1.0" encoding="UTF-8"?>. prettyPrint: false is chosen for minimal byte overhead (sitemap consumers parse the XML rather than read it).

Sitemap namespace: http://www.sitemaps.org/schemas/sitemap/0.9 — required by the Sitemaps protocol and the XSD schema referenced in SC-004.

Validation: The serialised string must begin with <?xml and contain a valid <urlset> root. Unit tests will assert this.
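
A unit test could express that assertion with a small string-level check like the one below. looksLikeSitemap is a hypothetical helper name, the check assumes double-quoted attributes in the serialised output, and it is only a smoke test, not a substitute for validation against the XSD.

```javascript
// Sketch only: looksLikeSitemap is a hypothetical helper for unit tests.
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';

function looksLikeSitemap(xml) {
  // The string must start with the XML declaration.
  if (typeof xml !== 'string' || !xml.startsWith('<?xml')) return false;
  // It must contain a <urlset> opening tag that carries the sitemap namespace.
  const rootOpen = xml.match(/<urlset\b[^>]*>/);
  return rootOpen !== null && rootOpen[0].includes(`xmlns="${SITEMAP_NS}"`);
}
```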

Alternatives considered:

  • Manual string concatenation: rejected (error-prone escaping, violates FR-008 which requires xmlBuilder).
  • xmlbuilder (v1/v2): not the installed package; rejected.

R-004: Axios Error Differentiation — 502 vs 504

Decision: Reuse the exact error-detection pattern already present in the script:

| Condition | Status | Detection |
|---|---|---|
| err.response is defined | 502 Bad Gateway | Axios sets err.response for non-2xx HTTP responses |
| err.code === 'ECONNABORTED' | 504 Gateway Timeout | Axios timeout option elapsed (default timeout error code) |
| err.code === 'ERR_CANCELED' | 504 Gateway Timeout | Request aborted via AbortSignal or cancel token (for example a signal-based timeout) |
| Other | 502 Bad Gateway | Treated as upstream failure |

Rationale: The existing script already uses this exact pattern for token-service errors (err.response, err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'). Reusing it for search-service errors ensures consistent error classification across all upstream calls.
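
Expressed as a single function, the mapping could look like this sketch. upstreamStatusFor is a hypothetical name; the existing script may inline these checks instead.

```javascript
// Sketch only: upstreamStatusFor is a hypothetical helper name.
function upstreamStatusFor(err) {
  // Upstream answered with a non-2xx status: Axios attaches err.response.
  if (err && err.response) return 502;
  // No response, and the request timed out or was aborted.
  if (err && (err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED')) return 504;
  // Anything else (DNS failure, connection refused, ...) counts as an upstream failure.
  return 502;
}
```

Checking err.response first means a non-2xx reply is always reported as 502, even if the error object also carries a code.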

Timeout value: 10 000 ms, as stated in the spec assumption ("consistent with industry-standard defaults for proxy-initiated upstream requests").

Alternatives considered:

  • AbortController + fetch: not available in the VM context (only axios is injected). Rejected.
  • Different timeout for search vs auth: spec does not require this; YAGNI.

R-005: Settings Validation — New Fields

Decision: At the entry point of the sitemap flow, perform an explicit guard before any async operation:

const requiredSitemapFields = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl'];
for (const field of requiredSitemapFields) {
  if (!kme_CSA_settings[field]) {
    res.writeHead(500, { 'Content-Type': 'text/plain' });
    res.end('Configuration error: missing required field: ' + field);
    return;
  }
}

Rationale: FR-011 requires HTTP 500 with a descriptive message for missing settings. Checking before any async work means no I/O is attempted against an unconfigured upstream, and the error message identifies exactly which field is absent.

The three new fields to add to kme_CSA_settings.json:

| Field | Type | Description |
|---|---|---|
| searchApiBaseUrl | string | Base URL of the KME Knowledge Search Service |
| tenant | string | Tenant identifier appended to search base URL |
| proxyBaseUrl | string | Externally accessible HTTPS URL of this adapter instance |
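
For illustration, the three fields might be populated and combined as in the sketch below. Every value is a placeholder, exampleSettings and searchUrl are hypothetical names, and the search URL follows the <searchApiBaseUrl>/<tenant> form from R-002. Any query parameters the live API requires are not covered here.

```javascript
// Sketch only: every value below is a placeholder, not a real endpoint or tenant.
const exampleSettings = {
  searchApiBaseUrl: 'https://search.example.com/api',
  tenant: 'example-tenant',
  proxyBaseUrl: 'https://adapter.example.com/',
};

// Search endpoint in the <searchApiBaseUrl>/<tenant> form used in R-002.
// searchUrl is a hypothetical helper name.
function searchUrl(settings) {
  return `${settings.searchApiBaseUrl}/${settings.tenant}`;
}
```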

R-006: loc URL Construction and vkm:url Encoding

Decision: Construct each <loc> as:

`${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}`

Rationale: FR-005 specifies exactly this pattern. encodeURIComponent is a built-in available inside the VM context without injection (it is a standard JavaScript global). Using it percent-encodes the vkm:url value, producing a safe query-string parameter even if the value contains ://, ?, #, or other URL-special characters.

Empty/missing guard (FR-006):

const vkmUrl = item['vkm:url'];
if (!vkmUrl) continue; // omit silently
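
Put together, the construction and the guard could be sketched as below. buildLocValues is a hypothetical helper name and the proxyBaseUrl value is a placeholder.

```javascript
// Sketch only: buildLocValues is a hypothetical helper name.
function buildLocValues(items, proxyBaseUrl) {
  const locs = [];
  for (const item of items) {
    const vkmUrl = item['vkm:url'];
    if (!vkmUrl) continue; // FR-006: omit entries with an empty or missing vkm:url
    locs.push(`${proxyBaseUrl}?kmeURL=${encodeURIComponent(vkmUrl)}`);
  }
  return locs;
}

const locs = buildLocValues(
  [{ 'vkm:url': 'https://kme.example.com/knowledge/doc-1' }, { title: 'no url' }],
  'https://adapter.example.com/' // placeholder proxyBaseUrl
);
console.log(locs);
// [ 'https://adapter.example.com/?kmeURL=https%3A%2F%2Fkme.example.com%2Fknowledge%2Fdoc-1' ]
```

The second item has no vkm:url and is dropped without an error, which matches the silent-omit behaviour described above.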

Summary of All Decisions

| ID | Topic | Decision |
|---|---|---|
| R-001 | Token reuse | Inline shared token-fetch logic; branch on URL first |
| R-002 | Search API response shape | Assume { items: [...] }; verify against live API |
| R-003 | xmlbuilder2 API | xmlBuilder({...}).ele('urlset', {...})…doc.end({}) |
| R-004 | Error mapping | Reuse existing err.response / err.code pattern |
| R-005 | Settings validation | Explicit requiredSitemapFields guard → HTTP 500 |
| R-006 | loc construction | proxyBaseUrl?kmeURL=encodeURIComponent(vkm:url) |