Files
Peter.Morton f840587e5e feat: content fetch, sitemap fixes, remove oidcAuthFlow
- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-23 16:40:06 -05:00

6.8 KiB
Raw Permalink Blame History

Research: KME Article Content Fetch (003)

Phase 0 output for 003-kme-content-fetch

All questions resolved from existing codebase investigation and spec analysis. No external research needed. No NEEDS CLARIFICATION items remain.


Decision 1: URL Parameter Extraction Strategy

Question: How should kmeURL be extracted from req.url (a relative path string) inside the VM sandbox where only the injected URL and URLSearchParams globals are available?

Decision: Use new URL(req.url, 'http://localhost').searchParams.get('kmeURL').

Rationale:

  • req.url in Node's http.IncomingMessage is always a relative path + query string (e.g., /?kmeURL=https://...), never a full absolute URL.
  • new URL(relPath, 'http://localhost') resolves the relative path against a dummy base, giving a full URL object whose .searchParams can be queried safely. The dummy base is never contacted.
  • URLSearchParams.get() decodes percent-encoding exactly once — satisfying the spec edge case "kmeURL must be used verbatim; double-encoding must not occur" (the decoded value is then passed directly to axios.get() without further manipulation, per FR-002).
  • URL is confirmed injected: src/server.js line 19 (globalVMContext = { URLSearchParams, URL, ... }).

Alternatives considered:

  • Manual string split on ? and parsing manually — rejected: more error-prone, reinvents WHATWG URL parsing.
  • req.url.includes('?kmeURL=') for routing check only — acceptable for the routing has() check but new URL().searchParams.has('kmeURL') is used for consistency with the extraction call.

Decision 2: kmeURL Validation Strategy

Question: How should FR-007 (absent/empty → 400) and FR-008 (malformed/non-absolute → 400) be implemented in the VM sandbox?

Decision: Two-stage guard:

  1. if (!kmeURL.trim()) → 400 (handles absent, empty, whitespace-only)
  2. try { const u = new URL(kmeURL); if (u.protocol !== 'http:' && u.protocol !== 'https:') throw new Error(); } catch { → 400 }

Rationale:

  • new URL(value) (no base) throws on any malformed string, satisfying FR-008.
  • Protocol check rejects ftp:, file:, data:, javascript: etc. while accepting http/https.
  • URL is already in context; no helper needed — the two-line guard is readable inline.
  • The spec assumption states kmeURL values are verbatim vkm:url values from KME Search (always absolute http/https), so protocol-level validation is a lightweight safety check.

Alternatives considered:

  • Regex validation — rejected: more complex, less accurate than the URL parser itself.
  • Validating domain/path structure — rejected: spec explicitly says not to validate beyond well-formedness (assumption bullet 1 in spec).

Decision 3: Axios Timeout and Error Code Handling

Question: What error codes distinguish axios timeout from HTTP errors from network errors?

Decision (confirmed from existing code and tests):

  • err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED' → timeout → 502
  • err.response populated → HTTP error → inspect err.response.status:
    • 4xx → 404 (spec FR-009)
    • 5xx → 502 (spec FR-010)
  • No err.response, no timeout code → network error → 502

Rationale: This pattern already exists verbatim in sitemapFlow() (lines 6173 of kmeContentSourceAdapter.js) and is battle-tested in both unit and contract tests. Using the same pattern for contentFetchFlow() is consistent and maintainable.

Alternatives considered:

  • Using axios.isAxiosError() — not needed; the existing pattern already discriminates all cases.
  • Mapping all errors to 502 — rejected: spec explicitly requires 404 for upstream 4xx (FR-009).

Decision 4: JSON-LD Response Body Parsing

Question: When should the proxy return 502 (unparseable) vs 404 (article not found)?

Decision:

  1. If response.data is already an object (axios auto-parsed JSON) → proceed to extraction.
  2. If response.data is a string → try JSON.parse(data); on failure → 502.
  3. If data is not an object after parsing → 502.
  4. Call extractArticleBody(data):
    • Returns null → 404 (absent, null, empty, whitespace)
    • Returns string → 200 text/html

Rationale:

  • Axios auto-parses JSON when Content-Type: application/json (confirmed from contract tests using startMockServer with JSON bodies). If the upstream KME service sends non-JSON content-type, response.data is a string — explicit JSON.parse is required as a fallback.
  • Spec edge case: "If JSON-LD response body is not valid JSON → 502". The string branch handles this.
  • Spec edge case: "If vkm:articleBody is empty string → treat as absent → 404". This is handled by extractArticleBody's body.trim() === '' check.

Alternatives considered:

  • Always calling JSON.parse(JSON.stringify(data)) — rejected: unnecessary if axios already parsed.
  • A single JSON.parse(response.data) regardless — rejected: axios may have already parsed.

Decision 5: extractArticleBody Placement

Question: Should extractArticleBody go in kmeContentSourceAdapter.js or kmeContentSourceAdapterHelpers.js?

Decision: kmeContentSourceAdapterHelpers.js (helpers file).

Rationale:

  • The function is a pure data transformation: given an object, return a string or null. No side effects, no state, no API calls.
  • Constitution Section I.I classifies "data transformation" as eligible for the helpers file when it is a pure utility (no business decisions or authentication).
  • Consistent with extractHydraItems and buildSitemapXml which are also pure transformations.

Alternatives considered:

  • Inline in contentFetchFlow() — valid but reduces testability and clarity.
  • Defining it as a local helper inside the adapter — rejected: less testable and inconsistent with existing pattern where data-extraction helpers live in the helpers file.

Decision 6: Error Handling Boundary for contentFetchFlow()

Question: Should contentFetchFlow let errors bubble to the outer catch (→ 401) or handle all errors internally?

Decision: contentFetchFlow() is fully self-contained — handles ALL errors internally, returns early on every error path, never throws to the outer catch.

Rationale:

  • The outer catch produces a 401 Unauthorized response, which is correct for oidcAuthFlow errors but wrong for content-fetch errors (should be 400/404/500/502 per spec).
  • Self-contained error handling makes the function testable in isolation and prevents accidental 401s from unexpected throws.
  • Consistent with sitemapFlow(), which also handles all errors internally (the outer catch was designed for oidcAuthFlow only).

No Remaining Unknowns

All NEEDS CLARIFICATION items resolved. Phase 1 design may proceed.