- Add contentFetchFlow() to proxy (FR-001 through FR-012) - Add extractArticleBody() helper with vkm:articleBody / articleBody fallback - Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers - Forward query/size/category params on /sitemap.xml requests - Add Accept: application/ld+json header to content API calls - Remove oidcAuthFlow() - unmatched requests now return 404 Not Found - Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...) - Version bump 0.2.0 → 0.3.0 - 45/45 tests passing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6.8 KiB
Research: KME Article Content Fetch (003)
Phase 0 output for 003-kme-content-fetch
All questions resolved from existing codebase investigation and spec analysis. No external research needed. No NEEDS CLARIFICATION items remain.
Decision 1: URL Parameter Extraction Strategy
Question: How should kmeURL be extracted from req.url (a relative path string) inside the VM
sandbox where only the injected URL and URLSearchParams globals are available?
Decision: Use new URL(req.url, 'http://localhost').searchParams.get('kmeURL').
Rationale:
req.urlin Node'shttp.IncomingMessageis always a relative path + query string (e.g.,/?kmeURL=https://...), never a full absolute URL.new URL(relPath, 'http://localhost')resolves the relative path against a dummy base, giving a fullURLobject whose.searchParamscan be queried safely. The dummy base is never contacted.URLSearchParams.get()decodes percent-encoding exactly once — satisfying the spec edge case "kmeURL must be used verbatim; double-encoding must not occur" (the decoded value is then passed directly toaxios.get()without further manipulation, per FR-002).URLis confirmed injected:src/server.jsline 19 (globalVMContext = { URLSearchParams, URL, ... }).
Alternatives considered:
- Manual string split on
?and parsing manually — rejected: more error-prone, reinvents WHATWG URL parsing. req.url.includes('?kmeURL=')for routing check only — acceptable for the routinghas()check butnew URL().searchParams.has('kmeURL')is used for consistency with the extraction call.
Decision 2: kmeURL Validation Strategy
Question: How should FR-007 (absent/empty → 400) and FR-008 (malformed/non-absolute → 400) be implemented in the VM sandbox?
Decision: Two-stage guard:
if (!kmeURL.trim())→ 400 (handles absent, empty, whitespace-only)try { const u = new URL(kmeURL); if (u.protocol !== 'http:' && u.protocol !== 'https:') throw new Error(); } catch { → 400 }
Rationale:
new URL(value)(no base) throws on any malformed string, satisfying FR-008.- Protocol check rejects
ftp:,file:,data:,javascript:etc. while accepting http/https. URLis already in context; no helper needed — the two-line guard is readable inline.- The spec assumption states kmeURL values are verbatim
vkm:urlvalues from KME Search (always absolute http/https), so protocol-level validation is a lightweight safety check.
Alternatives considered:
- Regex validation — rejected: more complex, less accurate than the URL parser itself.
- Validating domain/path structure — rejected: spec explicitly says not to validate beyond well-formedness (assumption bullet 1 in spec).
Decision 3: Axios Timeout and Error Code Handling
Question: What error codes distinguish axios timeout from HTTP errors from network errors?
Decision (confirmed from existing code and tests):
err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'→ timeout → 502err.responsepopulated → HTTP error → inspecterr.response.status:4xx→ 404 (spec FR-009)5xx→ 502 (spec FR-010)
- No
err.response, no timeout code → network error → 502
Rationale: This pattern already exists verbatim in sitemapFlow() (lines 61–73 of
kmeContentSourceAdapter.js) and is battle-tested in both unit and contract tests. Using the same
pattern for contentFetchFlow() is consistent and maintainable.
Alternatives considered:
- Using
axios.isAxiosError()— not needed; the existing pattern already discriminates all cases. - Mapping all errors to 502 — rejected: spec explicitly requires 404 for upstream 4xx (FR-009).
Decision 4: JSON-LD Response Body Parsing
Question: When should the proxy return 502 (unparseable) vs 404 (article not found)?
Decision:
- If
response.datais already an object (axios auto-parsed JSON) → proceed to extraction. - If
response.datais a string → tryJSON.parse(data); on failure → 502. - If
datais not an object after parsing → 502. - Call
extractArticleBody(data):- Returns
null→ 404 (absent, null, empty, whitespace) - Returns string → 200 text/html
- Returns
Rationale:
- Axios auto-parses JSON when
Content-Type: application/json(confirmed from contract tests usingstartMockServerwith JSON bodies). If the upstream KME service sends non-JSON content-type,response.datais a string — explicitJSON.parseis required as a fallback. - Spec edge case: "If JSON-LD response body is not valid JSON → 502". The string branch handles this.
- Spec edge case: "If
vkm:articleBodyis empty string → treat as absent → 404". This is handled byextractArticleBody'sbody.trim() === ''check.
Alternatives considered:
- Always calling
JSON.parse(JSON.stringify(data))— rejected: unnecessary if axios already parsed. - A single
JSON.parse(response.data)regardless — rejected: axios may have already parsed.
Decision 5: extractArticleBody Placement
Question: Should extractArticleBody go in kmeContentSourceAdapter.js or
kmeContentSourceAdapterHelpers.js?
Decision: kmeContentSourceAdapterHelpers.js (helpers file).
Rationale:
- The function is a pure data transformation: given an object, return a string or null. No side effects, no state, no API calls.
- Constitution Section I.I classifies "data transformation" as eligible for the helpers file when it is a pure utility (no business decisions or authentication).
- Consistent with
extractHydraItemsandbuildSitemapXmlwhich are also pure transformations.
Alternatives considered:
- Inline in
contentFetchFlow()— valid but reduces testability and clarity. - Defining it as a local helper inside the adapter — rejected: less testable and inconsistent with existing pattern where data-extraction helpers live in the helpers file.
Decision 6: Error Handling Boundary for contentFetchFlow()
Question: Should contentFetchFlow let errors bubble to the outer catch (→ 401) or handle all
errors internally?
Decision: contentFetchFlow() is fully self-contained — handles ALL errors internally, returns
early on every error path, never throws to the outer catch.
Rationale:
- The outer catch produces a 401 Unauthorized response, which is correct for
oidcAuthFlowerrors but wrong for content-fetch errors (should be 400/404/500/502 per spec). - Self-contained error handling makes the function testable in isolation and prevents accidental 401s from unexpected throws.
- Consistent with
sitemapFlow(), which also handles all errors internally (the outer catch was designed foroidcAuthFlowonly).
No Remaining Unknowns
All NEEDS CLARIFICATION items resolved. Phase 1 design may proceed.