- Add contentFetchFlow() to proxy (FR-001 through FR-012) - Add extractArticleBody() helper with vkm:articleBody / articleBody fallback - Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers - Forward query/size/category params on /sitemap.xml requests - Add Accept: application/ld+json header to content API calls - Remove oidcAuthFlow() - unmatched requests now return 404 Not Found - Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...) - Version bump 0.2.0 → 0.3.0 - 45/45 tests passing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
147 lines
6.8 KiB
Markdown
147 lines
6.8 KiB
Markdown
# Research: KME Article Content Fetch (003)
|
||
|
||
**Phase 0 output for `003-kme-content-fetch`**
|
||
|
||
All questions resolved from existing codebase investigation and spec analysis. No external research
|
||
needed. No NEEDS CLARIFICATION items remain.
|
||
|
||
---
|
||
|
||
## Decision 1: URL Parameter Extraction Strategy
|
||
|
||
**Question**: How should `kmeURL` be extracted from `req.url` (a relative path string) inside the VM
|
||
sandbox where only the injected `URL` and `URLSearchParams` globals are available?
|
||
|
||
**Decision**: Use `new URL(req.url, 'http://localhost').searchParams.get('kmeURL')`.
|
||
|
||
**Rationale**:
|
||
- `req.url` in Node's `http.IncomingMessage` is always a relative path + query string (e.g.,
|
||
`/?kmeURL=https://...`), never a full absolute URL.
|
||
- `new URL(relPath, 'http://localhost')` resolves the relative path against a dummy base, giving
|
||
a full `URL` object whose `.searchParams` can be queried safely. The dummy base is never
|
||
contacted.
|
||
- `URLSearchParams.get()` decodes percent-encoding exactly once — satisfying the spec edge case
|
||
"kmeURL must be used verbatim; double-encoding must not occur" (the decoded value is then passed
|
||
directly to `axios.get()` without further manipulation, per FR-002).
|
||
- `URL` is confirmed injected: `src/server.js` line 19 (`globalVMContext = { URLSearchParams, URL, ... }`).
|
||
|
||
**Alternatives considered**:
|
||
- Manual string split on `?` and parsing manually — rejected: more error-prone, reinvents WHATWG URL parsing.
|
||
- `req.url.includes('?kmeURL=')` for routing check only — acceptable for the routing `has()` check
|
||
but `new URL().searchParams.has('kmeURL')` is used for consistency with the extraction call.
|
||
|
||
---
|
||
|
||
## Decision 2: kmeURL Validation Strategy
|
||
|
||
**Question**: How should FR-007 (absent/empty → 400) and FR-008 (malformed/non-absolute → 400) be
|
||
implemented in the VM sandbox?
|
||
|
||
**Decision**: Two-stage guard:
|
||
1. `if (!kmeURL.trim())` → 400 (handles absent, empty, whitespace-only)
|
||
2. `try { const u = new URL(kmeURL); if (u.protocol !== 'http:' && u.protocol !== 'https:') throw new Error(); } catch { → 400 }`
|
||
|
||
**Rationale**:
|
||
- `new URL(value)` (no base) throws on any malformed string, satisfying FR-008.
|
||
- Protocol check rejects `ftp:`, `file:`, `data:`, `javascript:` etc. while accepting http/https.
|
||
- `URL` is already in context; no helper needed — the two-line guard is readable inline.
|
||
- The spec assumption states kmeURL values are verbatim `vkm:url` values from KME Search (always
|
||
absolute http/https), so protocol-level validation is a lightweight safety check.
|
||
|
||
**Alternatives considered**:
|
||
- Regex validation — rejected: more complex, less accurate than the URL parser itself.
|
||
- Validating domain/path structure — rejected: spec explicitly says not to validate beyond
|
||
well-formedness (assumption bullet 1 in spec).
|
||
|
||
---
|
||
|
||
## Decision 3: Axios Timeout and Error Code Handling
|
||
|
||
**Question**: What error codes distinguish axios timeout from HTTP errors from network errors?
|
||
|
||
**Decision** (confirmed from existing code and tests):
|
||
- `err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'` → timeout → 502
|
||
- `err.response` populated → HTTP error → inspect `err.response.status`:
|
||
- `4xx` → 404 (spec FR-009)
|
||
- `5xx` → 502 (spec FR-010)
|
||
- No `err.response`, no timeout code → network error → 502
|
||
|
||
**Rationale**: This pattern already exists verbatim in `sitemapFlow()` (lines 61–73 of
|
||
`kmeContentSourceAdapter.js`) and is battle-tested in both unit and contract tests. Using the same
|
||
pattern for `contentFetchFlow()` is consistent and maintainable.
|
||
|
||
**Alternatives considered**:
|
||
- Using `axios.isAxiosError()` — not needed; the existing pattern already discriminates all cases.
|
||
- Mapping all errors to 502 — rejected: spec explicitly requires 404 for upstream 4xx (FR-009).
|
||
|
||
---
|
||
|
||
## Decision 4: JSON-LD Response Body Parsing
|
||
|
||
**Question**: When should the proxy return 502 (unparseable) vs 404 (article not found)?
|
||
|
||
**Decision**:
|
||
1. If `response.data` is already an object (axios auto-parsed JSON) → proceed to extraction.
|
||
2. If `response.data` is a string → try `JSON.parse(data)`; on failure → 502.
|
||
3. If `data` is not an object after parsing → 502.
|
||
4. Call `extractArticleBody(data)`:
|
||
- Returns `null` → 404 (absent, null, empty, whitespace)
|
||
- Returns string → 200 text/html
|
||
|
||
**Rationale**:
|
||
- Axios auto-parses JSON when `Content-Type: application/json` (confirmed from contract tests
|
||
using `startMockServer` with JSON bodies). If the upstream KME service sends non-JSON content-type,
|
||
`response.data` is a string — explicit `JSON.parse` is required as a fallback.
|
||
- Spec edge case: "If JSON-LD response body is not valid JSON → 502". The string branch handles this.
|
||
- Spec edge case: "If `vkm:articleBody` is empty string → treat as absent → 404". This is handled
|
||
by `extractArticleBody`'s `body.trim() === ''` check.
|
||
|
||
**Alternatives considered**:
|
||
- Always calling `JSON.parse(JSON.stringify(data))` — rejected: unnecessary if axios already parsed.
|
||
- A single `JSON.parse(response.data)` regardless — rejected: axios may have already parsed.
|
||
|
||
---
|
||
|
||
## Decision 5: `extractArticleBody` Placement
|
||
|
||
**Question**: Should `extractArticleBody` go in `kmeContentSourceAdapter.js` or
|
||
`kmeContentSourceAdapterHelpers.js`?
|
||
|
||
**Decision**: `kmeContentSourceAdapterHelpers.js` (helpers file).
|
||
|
||
**Rationale**:
|
||
- The function is a pure data transformation: given an object, return a string or null. No side
|
||
effects, no state, no API calls.
|
||
- Constitution Section I.I classifies "data transformation" as eligible for the helpers file when
|
||
it is a pure utility (no business decisions or authentication).
|
||
- Consistent with `extractHydraItems` and `buildSitemapXml` which are also pure transformations.
|
||
|
||
**Alternatives considered**:
|
||
- Inline in `contentFetchFlow()` — valid but reduces testability and clarity.
|
||
- Defining it as a local helper inside the adapter — rejected: less testable and inconsistent with
|
||
existing pattern where data-extraction helpers live in the helpers file.
|
||
|
||
---
|
||
|
||
## Decision 6: Error Handling Boundary for `contentFetchFlow()`
|
||
|
||
**Question**: Should `contentFetchFlow` let errors bubble to the outer catch (→ 401) or handle all
|
||
errors internally?
|
||
|
||
**Decision**: `contentFetchFlow()` is fully self-contained — handles ALL errors internally, returns
|
||
early on every error path, never throws to the outer catch.
|
||
|
||
**Rationale**:
|
||
- The outer catch produces a 401 Unauthorized response, which is correct for `oidcAuthFlow` errors
|
||
but wrong for content-fetch errors (should be 400/404/500/502 per spec).
|
||
- Self-contained error handling makes the function testable in isolation and prevents accidental
|
||
401s from unexpected throws.
|
||
- Consistent with `sitemapFlow()`, which also handles all errors internally (the outer catch was
|
||
designed for `oidcAuthFlow` only).
|
||
|
||
---
|
||
|
||
## No Remaining Unknowns
|
||
|
||
All NEEDS CLARIFICATION items resolved. Phase 1 design may proceed.
|