Files
Peter.Morton f840587e5e feat: content fetch, sitemap fixes, remove oidcAuthFlow
- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-23 16:40:06 -05:00

147 lines
6.8 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Research: KME Article Content Fetch (003)
**Phase 0 output for `003-kme-content-fetch`**
All questions resolved from existing codebase investigation and spec analysis. No external research
needed. No NEEDS CLARIFICATION items remain.
---
## Decision 1: URL Parameter Extraction Strategy
**Question**: How should `kmeURL` be extracted from `req.url` (a relative path string) inside the VM
sandbox where only the injected `URL` and `URLSearchParams` globals are available?
**Decision**: Use `new URL(req.url, 'http://localhost').searchParams.get('kmeURL')`.
**Rationale**:
- `req.url` in Node's `http.IncomingMessage` is always a relative path + query string (e.g.,
`/?kmeURL=https://...`), never a full absolute URL.
- `new URL(relPath, 'http://localhost')` resolves the relative path against a dummy base, giving
a full `URL` object whose `.searchParams` can be queried safely. The dummy base is never
contacted.
- `URLSearchParams.get()` decodes percent-encoding exactly once — satisfying the spec edge case
"kmeURL must be used verbatim; double-encoding must not occur" (the decoded value is then passed
directly to `axios.get()` without further manipulation, per FR-002).
- `URL` is confirmed injected: `src/server.js` line 19 (`globalVMContext = { URLSearchParams, URL, ... }`).
**Alternatives considered**:
- Manual string split on `?` and parsing manually — rejected: more error-prone, reinvents WHATWG URL parsing.
- `req.url.includes('?kmeURL=')` for routing check only — acceptable for the routing `has()` check
but `new URL().searchParams.has('kmeURL')` is used for consistency with the extraction call.
---
## Decision 2: kmeURL Validation Strategy
**Question**: How should FR-007 (absent/empty → 400) and FR-008 (malformed/non-absolute → 400) be
implemented in the VM sandbox?
**Decision**: Two-stage guard:
1. `if (!kmeURL.trim())` → 400 (handles absent, empty, whitespace-only)
2. `try { const u = new URL(kmeURL); if (u.protocol !== 'http:' && u.protocol !== 'https:') throw new Error(); } catch { → 400 }`
**Rationale**:
- `new URL(value)` (no base) throws on any malformed string, satisfying FR-008.
- Protocol check rejects `ftp:`, `file:`, `data:`, `javascript:` etc. while accepting http/https.
- `URL` is already in context; no helper needed — the two-line guard is readable inline.
- The spec assumption states kmeURL values are verbatim `vkm:url` values from KME Search (always
absolute http/https), so protocol-level validation is a lightweight safety check.
**Alternatives considered**:
- Regex validation — rejected: more complex, less accurate than the URL parser itself.
- Validating domain/path structure — rejected: spec explicitly says not to validate beyond
well-formedness (assumption bullet 1 in spec).
---
## Decision 3: Axios Timeout and Error Code Handling
**Question**: What error codes distinguish axios timeout from HTTP errors from network errors?
**Decision** (confirmed from existing code and tests):
- `err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'` → timeout → 502
- `err.response` populated → HTTP error → inspect `err.response.status`:
- `4xx` → 404 (spec FR-009)
- `5xx` → 502 (spec FR-010)
- No `err.response`, no timeout code → network error → 502
**Rationale**: This pattern already exists verbatim in `sitemapFlow()` (lines 6173 of
`kmeContentSourceAdapter.js`) and is battle-tested in both unit and contract tests. Using the same
pattern for `contentFetchFlow()` is consistent and maintainable.
**Alternatives considered**:
- Using `axios.isAxiosError()` — not needed; the existing pattern already discriminates all cases.
- Mapping all errors to 502 — rejected: spec explicitly requires 404 for upstream 4xx (FR-009).
---
## Decision 4: JSON-LD Response Body Parsing
**Question**: When should the proxy return 502 (unparseable) vs 404 (article not found)?
**Decision**:
1. If `response.data` is already an object (axios auto-parsed JSON) → proceed to extraction.
2. If `response.data` is a string → try `JSON.parse(data)`; on failure → 502.
3. If `data` is not an object after parsing → 502.
4. Call `extractArticleBody(data)`:
- Returns `null` → 404 (absent, null, empty, whitespace)
- Returns string → 200 text/html
**Rationale**:
- Axios auto-parses JSON when `Content-Type: application/json` (confirmed from contract tests
using `startMockServer` with JSON bodies). If the upstream KME service sends non-JSON content-type,
`response.data` is a string — explicit `JSON.parse` is required as a fallback.
- Spec edge case: "If JSON-LD response body is not valid JSON → 502". The string branch handles this.
- Spec edge case: "If `vkm:articleBody` is empty string → treat as absent → 404". This is handled
by `extractArticleBody`'s `body.trim() === ''` check.
**Alternatives considered**:
- Always calling `JSON.parse(JSON.stringify(data))` — rejected: unnecessary if axios already parsed.
- A single `JSON.parse(response.data)` regardless — rejected: axios may have already parsed.
---
## Decision 5: `extractArticleBody` Placement
**Question**: Should `extractArticleBody` go in `kmeContentSourceAdapter.js` or
`kmeContentSourceAdapterHelpers.js`?
**Decision**: `kmeContentSourceAdapterHelpers.js` (helpers file).
**Rationale**:
- The function is a pure data transformation: given an object, return a string or null. No side
effects, no state, no API calls.
- Constitution Section I.I classifies "data transformation" as eligible for the helpers file when
it is a pure utility (no business decisions or authentication).
- Consistent with `extractHydraItems` and `buildSitemapXml` which are also pure transformations.
**Alternatives considered**:
- Inline in `contentFetchFlow()` — valid but reduces testability and clarity.
- Defining it as a local helper inside the adapter — rejected: less testable and inconsistent with
existing pattern where data-extraction helpers live in the helpers file.
---
## Decision 6: Error Handling Boundary for `contentFetchFlow()`
**Question**: Should `contentFetchFlow` let errors bubble to the outer catch (→ 401) or handle all
errors internally?
**Decision**: `contentFetchFlow()` is fully self-contained — handles ALL errors internally, returns
early on every error path, never throws to the outer catch.
**Rationale**:
- The outer catch produces a 401 Unauthorized response, which is correct for `oidcAuthFlow` errors
but wrong for content-fetch errors (should be 400/404/500/502 per spec).
- Self-contained error handling makes the function testable in isolation and prevents accidental
401s from unexpected throws.
- Consistent with `sitemapFlow()`, which also handles all errors internally (the outer catch was
designed for `oidcAuthFlow` only).
---
## No Remaining Unknowns
All NEEDS CLARIFICATION items resolved. Phase 1 design may proceed.