Files

Peter.Morton f840587e5e feat: content fetch, sitemap fixes, remove oidcAuthFlow

- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-04-23 16:40:06 -05:00

6.8 KiB

Raw Permalink Blame History

Research: KME Article Content Fetch (003)

Phase 0 output for 003-kme-content-fetch

All questions resolved from existing codebase investigation and spec analysis. No external research needed. No NEEDS CLARIFICATION items remain.

Decision 1: URL Parameter Extraction Strategy

Question: How should kmeURL be extracted from req.url (a relative path string) inside the VM sandbox where only the injected URL and URLSearchParams globals are available?

Decision: Use new URL(req.url, 'http://localhost').searchParams.get('kmeURL').

Rationale:

req.url in Node's http.IncomingMessage is always a relative path + query string (e.g., /?kmeURL=https://...), never a full absolute URL.
new URL(relPath, 'http://localhost') resolves the relative path against a dummy base, giving a full URL object whose .searchParams can be queried safely. The dummy base is never contacted.
URLSearchParams.get() decodes percent-encoding exactly once — satisfying the spec edge case "kmeURL must be used verbatim; double-encoding must not occur" (the decoded value is then passed directly to axios.get() without further manipulation, per FR-002).
URL is confirmed injected: src/server.js line 19 (globalVMContext = { URLSearchParams, URL, ... }).

Alternatives considered:

Manual string split on ? and parsing manually — rejected: more error-prone, reinvents WHATWG URL parsing.
req.url.includes('?kmeURL=') for routing check only — acceptable for the routing has() check but new URL().searchParams.has('kmeURL') is used for consistency with the extraction call.

Decision 2: kmeURL Validation Strategy

Question: How should FR-007 (absent/empty → 400) and FR-008 (malformed/non-absolute → 400) be implemented in the VM sandbox?

Decision: Two-stage guard:

if (!kmeURL.trim()) → 400 (handles absent, empty, whitespace-only)
try { const u = new URL(kmeURL); if (u.protocol !== 'http:' && u.protocol !== 'https:') throw new Error(); } catch { → 400 }

Rationale:

new URL(value) (no base) throws on any malformed string, satisfying FR-008.
Protocol check rejects ftp:, file:, data:, javascript: etc. while accepting http/https.
URL is already in context; no helper needed — the two-line guard is readable inline.
The spec assumption states kmeURL values are verbatim vkm:url values from KME Search (always absolute http/https), so protocol-level validation is a lightweight safety check.

Alternatives considered:

Regex validation — rejected: more complex, less accurate than the URL parser itself.
Validating domain/path structure — rejected: spec explicitly says not to validate beyond well-formedness (assumption bullet 1 in spec).

Decision 3: Axios Timeout and Error Code Handling

Question: What error codes distinguish axios timeout from HTTP errors from network errors?

Decision (confirmed from existing code and tests):

err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED' → timeout → 502
err.response populated → HTTP error → inspect err.response.status:
- 4xx → 404 (spec FR-009)
- 5xx → 502 (spec FR-010)
No err.response, no timeout code → network error → 502

Rationale: This pattern already exists verbatim in sitemapFlow() (lines 61–73 of kmeContentSourceAdapter.js) and is battle-tested in both unit and contract tests. Using the same pattern for contentFetchFlow() is consistent and maintainable.

Alternatives considered:

Using axios.isAxiosError() — not needed; the existing pattern already discriminates all cases.
Mapping all errors to 502 — rejected: spec explicitly requires 404 for upstream 4xx (FR-009).

Decision 4: JSON-LD Response Body Parsing

Question: When should the proxy return 502 (unparseable) vs 404 (article not found)?

Decision:

If response.data is already an object (axios auto-parsed JSON) → proceed to extraction.
If response.data is a string → try JSON.parse(data); on failure → 502.
If data is not an object after parsing → 502.
Call extractArticleBody(data):
- Returns null → 404 (absent, null, empty, whitespace)
- Returns string → 200 text/html

Rationale:

Axios auto-parses JSON when Content-Type: application/json (confirmed from contract tests using startMockServer with JSON bodies). If the upstream KME service sends non-JSON content-type, response.data is a string — explicit JSON.parse is required as a fallback.
Spec edge case: "If JSON-LD response body is not valid JSON → 502". The string branch handles this.
Spec edge case: "If vkm:articleBody is empty string → treat as absent → 404". This is handled by extractArticleBody's body.trim() === '' check.

Alternatives considered:

Always calling JSON.parse(JSON.stringify(data)) — rejected: unnecessary if axios already parsed.
A single JSON.parse(response.data) regardless — rejected: axios may have already parsed.

Decision 5: `extractArticleBody` Placement

Question: Should extractArticleBody go in kmeContentSourceAdapter.js or kmeContentSourceAdapterHelpers.js?

Decision: kmeContentSourceAdapterHelpers.js (helpers file).

Rationale:

The function is a pure data transformation: given an object, return a string or null. No side effects, no state, no API calls.
Constitution Section I.I classifies "data transformation" as eligible for the helpers file when it is a pure utility (no business decisions or authentication).
Consistent with extractHydraItems and buildSitemapXml which are also pure transformations.

Alternatives considered:

Inline in contentFetchFlow() — valid but reduces testability and clarity.
Defining it as a local helper inside the adapter — rejected: less testable and inconsistent with existing pattern where data-extraction helpers live in the helpers file.

Decision 6: Error Handling Boundary for `contentFetchFlow()`

Question: Should contentFetchFlow let errors bubble to the outer catch (→ 401) or handle all errors internally?

Decision: contentFetchFlow() is fully self-contained — handles ALL errors internally, returns early on every error path, never throws to the outer catch.

Rationale:

The outer catch produces a 401 Unauthorized response, which is correct for oidcAuthFlow errors but wrong for content-fetch errors (should be 400/404/500/502 per spec).
Self-contained error handling makes the function testable in isolation and prevents accidental 401s from unexpected throws.
Consistent with sitemapFlow(), which also handles all errors internally (the outer catch was designed for oidcAuthFlow only).

No Remaining Unknowns

All NEEDS CLARIFICATION items resolved. Phase 1 design may proceed.

6.8 KiB Raw Permalink Blame History Unescape Escape

Research: KME Article Content Fetch (003)

Decision 1: URL Parameter Extraction Strategy

Decision 2: kmeURL Validation Strategy

Decision 3: Axios Timeout and Error Code Handling

Decision 4: JSON-LD Response Body Parsing

Decision 5: extractArticleBody Placement

Decision 6: Error Handling Boundary for contentFetchFlow()

No Remaining Unknowns

6.8 KiB

Raw Permalink Blame History

Decision 5: `extractArticleBody` Placement

Decision 6: Error Handling Boundary for `contentFetchFlow()`