# Research: Sitemap XML Generation

**Feature**: `002-sitemap-generation`
**Branch**: `002-sitemap-generation`
**Date**: 2025-07-14

---

## R-001: Token Reuse — OIDC Cache Pattern

**Decision**: Reuse `redis.hGet('authorization', 'token')` / `redis.hGet('authorization', 'expiry')`
and the existing stampede-guard / token-refresh flow verbatim.

**Rationale**: The existing `kmeContentSourceAdapter.js` already implements a correct, battle-tested
pattern for obtaining a valid OIDC `id_token` from Redis and refreshing it when expired. Duplicating
only the cache-read portion (steps 1–3 of the existing flow) would create divergence. Running the
full existing sequence for sitemap requests, rather than a trimmed copy, avoids that risk while
reusing the security invariants already proven in production.

**Approach in code**: Refactor the top-level IIFE so that:
1. URL routing check happens **first** (before any async work).
2. For sitemap requests, a shared `getValidToken()` helper (inlined in the script, no imports)
   performs the identical cache-hit → stampede-guard → refresh → cache-write sequence.
3. For all other requests, the existing flow runs unchanged.

**Alternatives considered**:
- Call the existing OIDC logic unconditionally, then branch: rejected because it adds unnecessary
  latency to non-sitemap requests (token check not needed for sitemap but would execute anyway).
- Separate helper file: rejected by the monolithic architecture constraint (Section I, constitution).

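
The cache-hit → stampede-guard → refresh → cache-write sequence named in the approach could look roughly like the sketch below. It is not the existing script's code: `createFakeRedis`, `createGetValidToken`, the `refreshToken` callback, the in-flight promise used as the stampede guard, and the epoch-millisecond `expiry` format are all assumptions made for illustration. Only the `authorization` hash and its `token` / `expiry` fields come from the decision above.

```javascript
// Minimal in-memory stand-in for the Redis client (only hGet/hSet, both async).
function createFakeRedis() {
  const hashes = new Map();
  return {
    async hGet(key, field) {
      return hashes.get(key)?.get(field) ?? null;
    },
    async hSet(key, field, value) {
      if (!hashes.has(key)) hashes.set(key, new Map());
      hashes.get(key).set(field, String(value));
    },
  };
}

// Hypothetical factory: the real script inlines this logic. The in-flight promise
// as stampede guard and the epoch-millisecond expiry are assumptions.
function createGetValidToken(redis, refreshToken, now = Date.now) {
  let inFlight = null; // shared by concurrent callers while a refresh is running
  return async function getValidToken() {
    const token = await redis.hGet('authorization', 'token');
    const expiry = Number(await redis.hGet('authorization', 'expiry'));
    if (token && expiry > now()) return token; // cache hit
    if (!inFlight) {
      inFlight = (async () => {
        try {
          const fresh = await refreshToken(); // resolves to { token, expiresAt }
          await redis.hSet('authorization', 'token', fresh.token); // cache write
          await redis.hSet('authorization', 'expiry', fresh.expiresAt);
          return fresh.token;
        } finally {
          inFlight = null; // allow a later refresh once this one settles
        }
      })();
    }
    return inFlight; // stampede guard: every caller awaits the same refresh
  };
}

// Usage: three concurrent calls should trigger a single refresh.
let refreshCalls = 0;
const getValidToken = createGetValidToken(createFakeRedis(), async () => {
  refreshCalls += 1;
  return { token: 'id-token-1', expiresAt: Date.now() + 60_000 };
});
const demo = Promise.all([getValidToken(), getValidToken(), getValidToken()]).then((tokens) => {
  console.log(tokens, 'refresh calls:', refreshCalls);
  return tokens;
});
```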
---
## R-002: KME Knowledge Search Service API — Response Envelope

**Decision**: Assume the response body is a JSON object with a top-level `items` array. Each element
of `items` is an object whose `vkm:url` property holds the canonical document URL.

**Rationale**: The feature spec states:
> "The `vkm:url` field is present at the top level of each item object in the search results
> array; the exact response envelope shape will be confirmed against the live API during
> implementation."

A common shape for knowledge/search services is `{ items: [ { "vkm:url": "...", ... } ] }`.
This assumption allows the code to be written and fully unit-tested before live-API access is
available. Keeping the extraction on a single line (`response.data.items ?? response.data`) means
adapting to the real shape is a one-line change.

**Concrete assumption**:
```json
{
  "items": [
    { "vkm:url": "https://kme.example.com/knowledge/doc-1", "title": "…" },
    { "vkm:url": "https://kme.example.com/knowledge/doc-2", "title": "…" }
  ]
}
```

**Verification required**: During implementation, run the live API call against
`<searchApiBaseUrl>/<tenant>` and confirm:
1. The top-level key that holds the array (likely `items` or `results`, or the root may itself be
   an array).
2. That `vkm:url` is a string property, not nested deeper.

**Fallback**: If the root is a bare array, `response.data` itself is used as the items array.

**Alternatives considered**:
- `results` key: equally plausible; the code will use `response.data.items ?? response.data` as a
  defensive pattern until confirmed.
- Deeply nested: no evidence for this; rejected pending confirmation.

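
One caveat with `response.data.items ?? response.data`: when the array sits under a different key, the expression yields the whole envelope object rather than an array. A slightly more defensive variant is sketched below. The helper name `extractItems` and the decision to also probe a `results` key are assumptions for illustration, pending the live-API check.

```javascript
// Hypothetical helper: normalise the search response body into an array of items.
// The candidate keys ('items', 'results') are assumptions until the live API is checked.
function extractItems(data) {
  if (Array.isArray(data)) return data; // root is a bare array
  const candidate = data?.items ?? data?.results; // object envelope
  return Array.isArray(candidate) ? candidate : []; // unknown shape: no items
}

const fromEnvelope = extractItems({
  items: [{ 'vkm:url': 'https://kme.example.com/knowledge/doc-1' }],
});
const fromBareArray = extractItems([{ 'vkm:url': 'https://kme.example.com/knowledge/doc-2' }]);
const fromUnknown = extractItems({ unexpected: true });

console.log(fromEnvelope.length, fromBareArray.length, fromUnknown.length); // 1 1 0
```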
---
## R-003: xmlbuilder2 `create()` API for Sitemap XML

**Decision**: Use the `xmlbuilder2` context variable (which is `xmlbuilder2`'s `create` function)
with the following call chain:

```javascript
const doc = xmlbuilder2({ version: '1.0', encoding: 'UTF-8' });
const urlset = doc.ele('urlset', { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' });
for (const item of items) {
  // locValue: proxied URL for this item, built as described in R-006;
  // proxyBaseUrl comes from kme_CSA_settings (R-005)
  const locValue = `${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}`;
  urlset.ele('url').ele('loc').txt(locValue).up().up();
}
const xml = doc.end({ prettyPrint: false });
```

**Rationale**: `xmlbuilder2` v4.x `create()` returns an `XMLBuilder` document node. Calling `.ele()`
on it creates the root element. Child elements are built by chaining `.ele()` / `.txt()` / `.up()`.
`doc.end({ prettyPrint: false })` serialises to a string prefixed with
`<?xml version="1.0" encoding="UTF-8"?>`. `prettyPrint: false` is chosen for minimal byte overhead
(sitemap consumers parse the XML rather than read it).

**Sitemap namespace**: `http://www.sitemaps.org/schemas/sitemap/0.9`, required by the Sitemaps
protocol and the XSD schema referenced in SC-004.

**Validation**: The serialised string must begin with `<?xml` and contain a valid `<urlset>` root.
Unit tests will assert this.

**Alternatives considered**:
- Manual string concatenation: rejected (error-prone escaping, violates FR-008 which requires
  xmlbuilder2).
- `xmlbuilder` (v1/v2): not the installed package; rejected.

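
The validation rule for this section (XML declaration first, then a `<urlset>` root) could be asserted with a small smoke-test helper such as the one below. `looksLikeSitemap` is a hypothetical name, the sample string is hand-written rather than produced by `xmlbuilder2`, and a regex check of this kind is not a substitute for validating against the XSD.

```javascript
// Hypothetical smoke test: the string must start with the XML declaration and
// contain a <urlset> root that carries the Sitemaps namespace.
function looksLikeSitemap(xml) {
  const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';
  if (!xml.startsWith('<?xml')) return false;
  const root = xml.match(/<urlset\b[^>]*>/);
  return root !== null && root[0].includes(`xmlns="${SITEMAP_NS}"`);
}

// Hand-written sample of the expected output shape (an assumption, not real output).
const sample =
  '<?xml version="1.0" encoding="UTF-8"?>' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' +
  '<url><loc>https://proxy.example.com?kmeURL=https%3A%2F%2Fkme.example.com%2Fknowledge%2Fdoc-1</loc></url>' +
  '</urlset>';

console.log(looksLikeSitemap(sample)); // true
console.log(looksLikeSitemap('<urlset></urlset>')); // false
```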
---
## R-004: Axios Error Differentiation — 502 vs 504

**Decision**: Reuse the exact error-detection pattern already present in the script:

| Condition | Status | Detection |
|---|---|---|
| `err.response` is defined | 502 Bad Gateway | Axios sets `err.response` for non-2xx HTTP responses |
| `err.code === 'ECONNABORTED'` | 504 Gateway Timeout | Axios `timeout` elapsed (default Axios timeout code) |
| `err.code === 'ERR_CANCELED'` | 504 Gateway Timeout | Request cancelled through an `AbortSignal` or cancel token (e.g. a signal-based timeout, Node 18+) |
| Other | 502 Bad Gateway | Treated as upstream failure |

**Rationale**: The existing script already uses this exact pattern for token-service errors
(`err.response`, `err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'`). Reusing it for
search-service errors ensures consistent error classification across all upstream calls.

**Timeout value**: 10 000 ms, as stated in the spec assumption ("consistent with industry-standard
defaults for proxy-initiated upstream requests").

**Alternatives considered**:
- `AbortController` + `fetch`: not available in the VM context (only `axios` is injected). Rejected.
- Different timeout for search vs auth: spec does not require this; YAGNI.

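
As a rough illustration of the mapping table in this section, the classification can be written as one small function. The name `classifyUpstreamError` is hypothetical, and plain objects stand in for Axios errors so the sketch runs without `axios`. The checks are evaluated in table order, so an error that carries a `response` is always reported as 502.

```javascript
// Hypothetical helper: map an Axios-style error to the gateway status from the table.
function classifyUpstreamError(err) {
  if (err && err.response) {
    return 502; // upstream answered with a non-2xx status
  }
  if (err && (err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED')) {
    return 504; // timeout or cancelled request
  }
  return 502; // anything else is treated as an upstream failure
}

// Plain objects stand in for Axios errors (assumption for this sketch).
console.log(classifyUpstreamError({ response: { status: 500 } })); // 502
console.log(classifyUpstreamError({ code: 'ECONNABORTED' })); // 504
console.log(classifyUpstreamError({ code: 'ERR_CANCELED' })); // 504
console.log(classifyUpstreamError({ code: 'ECONNREFUSED' })); // 502
```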
---
## R-005: Settings Validation — New Fields

**Decision**: At the entry point of the sitemap flow, perform an explicit guard before any async
operation:

```javascript
const requiredSitemapFields = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl'];
for (const field of requiredSitemapFields) {
  if (!kme_CSA_settings[field]) {
    res.writeHead(500, { 'Content-Type': 'text/plain' });
    res.end('Configuration error: missing required field: ' + field);
    return;
  }
}
```

**Rationale**: FR-011 requires HTTP 500 with a descriptive message for missing settings. Checking
before any async work means no I/O is attempted against an unconfigured upstream, and the error
message names the first missing field.

**The three new fields to add to `kme_CSA_settings.json`**:

| Field | Type | Description |
|---|---|---|
| `searchApiBaseUrl` | string | Base URL of the KME Knowledge Search Service |
| `tenant` | string | Tenant identifier appended to the search base URL |
| `proxyBaseUrl` | string | Externally accessible HTTPS URL of this adapter instance |

---
## R-006: `loc` URL Construction and `vkm:url` Encoding

**Decision**: Construct each `<loc>` as:

```javascript
`${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}`
```

**Rationale**: FR-005 specifies exactly this pattern. `encodeURIComponent` is a built-in available
inside the VM context without injection (it is a standard JavaScript global). Using it percent-encodes
the `vkm:url` value, producing a safe query-string parameter even if the value contains `://`, `?`,
`#`, or other URL-special characters.

**Empty/missing guard** (FR-006):
```javascript
const vkmUrl = item['vkm:url'];
if (!vkmUrl) continue; // omit silently
```
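
Putting the `<loc>` pattern and the FR-006 guard together, a minimal sketch might look like this. `buildLocValues` is a hypothetical helper name and `https://proxy.example.com` is an assumed value for `proxyBaseUrl`. The sample shows how `?`, `=`, and `#` inside a `vkm:url` value end up percent-encoded instead of being read as part of the proxy URL's own query string.

```javascript
// Hypothetical helper combining the <loc> pattern with the empty/missing guard.
function buildLocValues(items, proxyBaseUrl) {
  const locs = [];
  for (const item of items) {
    const vkmUrl = item['vkm:url'];
    if (!vkmUrl) continue; // omit items without a usable URL
    locs.push(`${proxyBaseUrl}?kmeURL=${encodeURIComponent(vkmUrl)}`);
  }
  return locs;
}

const locValues = buildLocValues(
  [
    { 'vkm:url': 'https://kme.example.com/knowledge/doc-1?lang=en#top' },
    { title: 'no url' }, // skipped: property missing
    { 'vkm:url': '' }, // skipped: empty string
  ],
  'https://proxy.example.com' // assumed value for proxyBaseUrl
);

console.log(locValues);
// [ 'https://proxy.example.com?kmeURL=https%3A%2F%2Fkme.example.com%2Fknowledge%2Fdoc-1%3Flang%3Den%23top' ]
```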
---
## Summary of All Decisions

| ID | Topic | Decision |
|---|---|---|
| R-001 | Token reuse | Inline shared token-fetch logic; branch on URL first |
| R-002 | Search API response shape | Assume `{ items: [...] }`; verify against live API |
| R-003 | xmlbuilder2 API | `xmlbuilder2({...}).ele('urlset', {...})…doc.end({})` |
| R-004 | Error mapping | Reuse existing `err.response` / `err.code` pattern |
| R-005 | Settings validation | Explicit `requiredSitemapFields` guard → HTTP 500 |
| R-006 | `loc` construction | `proxyBaseUrl?kmeURL=encodeURIComponent(vkm:url)` |