- Add contentFetchFlow() to proxy (FR-001 through FR-012) - Add extractArticleBody() helper with vkm:articleBody / articleBody fallback - Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers - Forward query/size/category params on /sitemap.xml requests - Add Accept: application/ld+json header to content API calls - Remove oidcAuthFlow() - unmatched requests now return 404 Not Found - Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...) - Version bump 0.2.0 → 0.3.0 - 45/45 tests passing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
331 lines
15 KiB
Markdown
331 lines
15 KiB
Markdown
# Implementation Plan: KME Article Content Fetch
|
||
|
||
**Branch**: `003-kme-content-fetch` | **Date**: 2025-07-15 | **Spec**: [spec.md](spec.md)
|
||
**Input**: Feature specification from `specs/003-kme-content-fetch/spec.md`
|
||
|
||
## Summary
|
||
|
||
Add a new `contentFetchFlow()` to `src/proxyScripts/kmeContentSourceAdapter.js` that handles
|
||
requests carrying a `?kmeURL=` query parameter. The flow validates the parameter, obtains an OIDC
|
||
token via the existing `getValidToken()`, performs a GET request to the `kmeURL` with
|
||
`Authorization: OIDC_id_token {token}`, extracts `vkm:articleBody` from the JSON-LD response, and
|
||
returns it as `text/html`. A new pure helper `extractArticleBody(data)` is added to
|
||
`src/globalVariables/kmeContentSourceAdapterHelpers.js`. No new files, no new npm dependencies.
|
||
|
||
## Technical Context
|
||
|
||
**Language/Version**: Node.js ≥18, ESM (`"type": "module"`)
|
||
**Primary Dependencies**: `axios ^1.13` (HTTP client, already in context), `redis ^5` (token cache, injected), `xmlbuilder2 ^4` (sitemap, unrelated to this feature)
|
||
**Storage**: Redis (token cache only — managed by `getValidToken()`, not modified by this feature)
|
||
**Testing**: Node.js built-in test runner (`node:test`) — `npm run test:unit`, `npm run test:contract`
|
||
**Target Platform**: Node.js server (Linux/macOS); proxy script executed inside `vm.createContext` per request
|
||
**Project Type**: HTTP proxy adapter — monolithic VM-sandbox architecture
|
||
**Performance Goals**: End-to-end response ≤11 s (10 s upstream timeout + 1 s proxy overhead) per SC-001
|
||
**Constraints**: Zero new imports/exports in VM sandbox files; no new npm dependencies; no new `src/` files
|
||
**Scale/Scope**: Two files modified (`kmeContentSourceAdapter.js`, `kmeContentSourceAdapterHelpers.js`); new unit tests in `tests/unit/proxy.test.js`; new contract tests in `tests/contract/proxy-http.test.js`
|
||
|
||
## Constitution Check
|
||
|
||
*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
|
||
|
||
| Gate | Status | Notes |
|
||
|------|--------|-------|
|
||
| All business logic stays in `src/proxyScripts/kmeContentSourceAdapter.js` | ✅ PASS | `contentFetchFlow()` lives entirely in the adapter file |
|
||
| Zero `import`/`export` in VM sandbox file | ✅ PASS | No imports added; all dependencies via injected context |
|
||
| `extractArticleBody` is a pure utility → helper file | ✅ PASS | No state, no API calls — qualifies for `kmeContentSourceAdapterHelpers.js` |
|
||
| No new files in `src/` | ✅ PASS | Only two existing files are modified |
|
||
| No new npm dependencies | ✅ PASS | `axios` and `URL`/`URLSearchParams` already available in context |
|
||
| Helpers file uses literal function body pattern | ✅ PASS | New helper added before existing `return { ... }` block |
|
||
| Authentication (`getValidToken`) stays in proxy script (called from adapter, not moved) | ✅ PASS | `getValidToken()` is invoked from `contentFetchFlow()` in the adapter |
|
||
|
||
**Post-design re-check**: All gates pass. No constitutional violations. No complexity tracking required.
|
||
|
||
## Project Structure
|
||
|
||
### Documentation (this feature)
|
||
|
||
```text
|
||
specs/003-kme-content-fetch/
|
||
├── plan.md # This file
|
||
├── research.md # Phase 0 output
|
||
├── data-model.md # Phase 1 output
|
||
├── quickstart.md # Phase 1 output
|
||
├── contracts/
|
||
│ └── http-content-fetch.md # HTTP request/response contract
|
||
└── tasks.md # Phase 2 output (/speckit.tasks — NOT created here)
|
||
```
|
||
|
||
### Source Code (repository root)
|
||
|
||
```text
|
||
src/
|
||
├── proxyScripts/
|
||
│ └── kmeContentSourceAdapter.js # MODIFIED: add contentFetchFlow(), update routing
|
||
├── globalVariables/
|
||
│ └── kmeContentSourceAdapterHelpers.js # MODIFIED: add extractArticleBody()
|
||
├── logger.js # unchanged
|
||
└── server.js # unchanged
|
||
|
||
tests/
|
||
├── unit/
|
||
│ └── proxy.test.js # MODIFIED: add content-fetch describe blocks
|
||
└── contract/
|
||
└── proxy-http.test.js # MODIFIED: add content-fetch contract tests
|
||
```
|
||
|
||
**Structure Decision**: Single-project monolith. Two existing source files modified; two existing test
|
||
files extended. No new files in `src/`. The VM sandbox pattern and helper file pattern are preserved
|
||
exactly as documented in the constitution.
|
||
|
||
---
|
||
|
||
## Phase 0: Research Findings → `research.md`
|
||
|
||
See [research.md](research.md) for full decision log. Summary:
|
||
|
||
- **URL parameter extraction**: `new URL(req.url, 'http://localhost').searchParams.get('kmeURL')` — confirmed safe from VM context research; `URL` is injected at server.js line 19.
|
||
- **URL validation**: `new URL(kmeURL)` + protocol check in try/catch — cleanly handles FR-007/FR-008 in one guard.
|
||
- **Axios error handling**: Confirmed `ECONNABORTED`/`ERR_CANCELED` for timeout; `err.response.status` available for all HTTP errors; JSON auto-parsed when `Content-Type: application/json`.
|
||
- **JSON-LD parsing**: `response.data` is an object when axios auto-parses; fallback `JSON.parse()` needed for non-JSON content-types; non-object → 502.
|
||
- **No unknowns remaining**: All NEEDS CLARIFICATION resolved. Research complete.
|
||
|
||
---
|
||
|
||
## Phase 1: Design
|
||
|
||
### Routing Change (`kmeContentSourceAdapter.js`)
|
||
|
||
Add a new `else if` branch between the existing sitemap check and the `oidcAuthFlow` fallback:
|
||
|
||
```javascript
|
||
// Entry point — URL routing
|
||
try {
|
||
if (req.url.endsWith('/sitemap.xml')) {
|
||
await sitemapFlow();
|
||
} else if (new URL(req.url, 'http://localhost').searchParams.has('kmeURL')) {
|
||
await contentFetchFlow(); // ← NEW
|
||
} else {
|
||
await oidcAuthFlow();
|
||
}
|
||
} catch (err) { /* existing outer catch → 401 */ }
|
||
```
|
||
|
||
`contentFetchFlow()` is fully self-contained — all errors are caught internally and never propagate to the outer catch.
|
||
|
||
### `contentFetchFlow()` — Complete Logic
|
||
|
||
```javascript
|
||
async function contentFetchFlow() {
|
||
// Step 1: Extract kmeURL (FR-001, FR-002)
|
||
const kmeURL = new URL(req.url, 'http://localhost').searchParams.get('kmeURL') ?? '';
|
||
|
||
// Step 2: Validate — absent / empty (FR-007)
|
||
if (!kmeURL.trim()) {
|
||
res.writeHead(400, { 'Content-Type': 'text/plain' });
|
||
res.end('Bad Request: kmeURL parameter is required');
|
||
return;
|
||
}
|
||
|
||
// Step 3: Validate — well-formed absolute http/https URL (FR-008)
|
||
try {
|
||
const u = new URL(kmeURL);
|
||
if (u.protocol !== 'http:' && u.protocol !== 'https:') throw new Error();
|
||
} catch {
|
||
res.writeHead(400, { 'Content-Type': 'text/plain' });
|
||
res.end('Bad Request: kmeURL must be a well-formed absolute http/https URL');
|
||
return;
|
||
}
|
||
|
||
// Step 4: Validate OIDC settings (config guard, returns 500 for missing config)
|
||
const missingField = kmeContentSourceAdapterHelpers.validateSettings(
|
||
kme_CSA_settings,
|
||
['tokenUrl', 'username', 'password', 'clientId', 'scope'],
|
||
);
|
||
if (missingField) {
|
||
console.error({ message: 'Content fetch: config error', missingField });
|
||
res.writeHead(500, { 'Content-Type': 'text/plain' });
|
||
res.end('Configuration error: missing required field: ' + missingField);
|
||
return;
|
||
}
|
||
|
||
// Step 5: Obtain OIDC token (FR-003, FR-011)
|
||
let token;
|
||
try {
|
||
token = await kmeContentSourceAdapterHelpers.getValidToken(req.url, req.method);
|
||
} catch (tokenErr) {
|
||
console.error({ message: 'Content fetch: token acquisition failed', error: tokenErr.message });
|
||
res.writeHead(502, { 'Content-Type': 'text/plain' });
|
||
res.end('Bad Gateway: token acquisition failed');
|
||
return;
|
||
}
|
||
|
||
// Step 6: GET kmeURL verbatim with auth header (FR-002, FR-003, FR-004)
|
||
let response;
|
||
try {
|
||
console.debug({ message: 'Content fetch: fetching article', kmeURL });
|
||
response = await axios.get(kmeURL, {
|
||
headers: { Authorization: `OIDC_id_token ${token}` },
|
||
timeout: 10000,
|
||
});
|
||
} catch (fetchErr) {
|
||
if (fetchErr.code === 'ECONNABORTED' || fetchErr.code === 'ERR_CANCELED') {
|
||
console.error({ message: 'Content fetch: upstream timeout', code: fetchErr.code });
|
||
res.writeHead(502, { 'Content-Type': 'text/plain' });
|
||
res.end('Bad Gateway: upstream request timed out');
|
||
} else if (fetchErr.response) {
|
||
const status = fetchErr.response.status;
|
||
console.error({ message: 'Content fetch: upstream HTTP error', status });
|
||
if (status >= 400 && status < 500) {
|
||
res.writeHead(404, { 'Content-Type': 'text/plain' });
|
||
res.end('Not Found: article not found at upstream');
|
||
} else {
|
||
res.writeHead(502, { 'Content-Type': 'text/plain' });
|
||
res.end('Bad Gateway: upstream error HTTP ' + status);
|
||
}
|
||
} else {
|
||
console.error({ message: 'Content fetch: network error', error: fetchErr.message });
|
||
res.writeHead(502, { 'Content-Type': 'text/plain' });
|
||
res.end('Bad Gateway: ' + fetchErr.message);
|
||
}
|
||
return;
|
||
}
|
||
|
||
// Step 7: Parse body — handle non-JSON content-type (FR-005, FR-010)
|
||
let data = response.data;
|
||
if (typeof data === 'string') {
|
||
try {
|
||
data = JSON.parse(data);
|
||
} catch {
|
||
console.error({ message: 'Content fetch: unparseable response body', kmeURL });
|
||
res.writeHead(502, { 'Content-Type': 'text/plain' });
|
||
res.end('Bad Gateway: unparseable response from upstream');
|
||
return;
|
||
}
|
||
}
|
||
if (typeof data !== 'object' || data === null) {
|
||
console.error({ message: 'Content fetch: unexpected non-object response', kmeURL });
|
||
res.writeHead(502, { 'Content-Type': 'text/plain' });
|
||
res.end('Bad Gateway: unexpected response from upstream');
|
||
return;
|
||
}
|
||
|
||
// Step 8: Extract vkm:articleBody (FR-005, FR-009)
|
||
const articleBody = kmeContentSourceAdapterHelpers.extractArticleBody(data);
|
||
if (!articleBody) {
|
||
console.error({ message: 'Content fetch: vkm:articleBody absent or empty', kmeURL });
|
||
res.writeHead(404, { 'Content-Type': 'text/plain' });
|
||
res.end('Not Found: article body not present in upstream response');
|
||
return;
|
||
}
|
||
|
||
// Step 9: Return article HTML (FR-006)
|
||
console.info({ message: 'Content fetch: article fetched successfully', kmeURL });
|
||
res.writeHead(200, { 'Content-Type': 'text/html' });
|
||
res.end(articleBody);
|
||
}
|
||
```
|
||
|
||
### `extractArticleBody(data)` — New Helper
|
||
|
||
Add to `kmeContentSourceAdapterHelpers.js` before the existing `return { ... }` block:
|
||
|
||
```javascript
|
||
/**
|
||
* Extracts the vkm:articleBody string from a KME Content Service JSON-LD response.
|
||
* Returns null if the field is absent, null, not a string, or an empty/whitespace string.
|
||
* @param {object} data – parsed JSON-LD response from the KME Content Service
|
||
* @returns {string|null}
|
||
*/
|
||
function extractArticleBody(data) {
|
||
if (!data || typeof data !== 'object') return null;
|
||
const body = data['vkm:articleBody'];
|
||
if (body == null || typeof body !== 'string' || body.trim() === '') return null;
|
||
return body;
|
||
}
|
||
```
|
||
|
||
Update the `return { ... }` at the bottom of the helpers file to export the new function:
|
||
|
||
```javascript
|
||
return {
|
||
validateSettings,
|
||
extractHydraItems,
|
||
buildSitemapXml,
|
||
getValidToken,
|
||
extractArticleBody, // ← NEW
|
||
};
|
||
```
|
||
|
||
### Error Response Matrix
|
||
|
||
| Condition | HTTP Status | Response Body |
|
||
|-----------|-------------|---------------|
|
||
| `kmeURL` absent or empty | 400 | `Bad Request: kmeURL parameter is required` |
|
||
| `kmeURL` not a well-formed absolute http/https URL | 400 | `Bad Request: kmeURL must be a well-formed absolute http/https URL` |
|
||
| Missing OIDC config field | 500 | `Configuration error: missing required field: {field}` |
|
||
| Token acquisition failure | 502 | `Bad Gateway: token acquisition failed` |
|
||
| Upstream 4xx response | 404 | `Not Found: article not found at upstream` |
|
||
| Upstream 5xx response | 502 | `Bad Gateway: upstream error HTTP {status}` |
|
||
| Upstream timeout (`ECONNABORTED`/`ERR_CANCELED`) | 502 | `Bad Gateway: upstream request timed out` |
|
||
| Network error (no `err.response`) | 502 | `Bad Gateway: {err.message}` |
|
||
| Response body unparseable as JSON | 502 | `Bad Gateway: unparseable response from upstream` |
|
||
| Non-object response body | 502 | `Bad Gateway: unexpected response from upstream` |
|
||
| `vkm:articleBody` absent, null, or empty | 404 | `Not Found: article body not present in upstream response` |
|
||
| Success | 200 `text/html` | article body HTML |
|
||
|
||
### Test Coverage Plan
|
||
|
||
**Unit tests** (add to `tests/unit/proxy.test.js`):
|
||
|
||
| Describe block | Test case | Verifies |
|
||
|---------------|-----------|---------|
|
||
| `US-content-fetch: happy path` | cached token → 200 HTML body | FR-001, FR-005, FR-006 |
|
||
| `US-content-fetch: happy path` | cache miss → token fetch → 200 HTML body | FR-003 |
|
||
| `US-content-fetch: happy path` | expired token → refresh → 200 HTML body | FR-003 |
|
||
| `US-content-fetch: input validation` | no `kmeURL` → oidcAuthFlow (unchanged 200) | FR-012 |
|
||
| `US-content-fetch: input validation` | `kmeURL` empty string → 400 | FR-007 |
|
||
| `US-content-fetch: input validation` | `kmeURL` whitespace → 400 | FR-007 |
|
||
| `US-content-fetch: input validation` | `kmeURL` relative URL → 400 | FR-008 |
|
||
| `US-content-fetch: input validation` | `kmeURL` non-http protocol (`ftp:`) → 400 | FR-008 |
|
||
| `US-content-fetch: input validation` | `kmeURL` malformed string → 400 | FR-008 |
|
||
| `US-content-fetch: token failure` | `getValidToken` throws → 502 | FR-011 |
|
||
| `US-content-fetch: upstream errors` | upstream 404 → 404 | FR-009 |
|
||
| `US-content-fetch: upstream errors` | upstream 410 → 404 | FR-009 |
|
||
| `US-content-fetch: upstream errors` | upstream 503 → 502 | FR-010 |
|
||
| `US-content-fetch: upstream errors` | timeout `ECONNABORTED` → 502 | FR-010 |
|
||
| `US-content-fetch: upstream errors` | timeout `ERR_CANCELED` → 502 | FR-010 |
|
||
| `US-content-fetch: upstream errors` | network error (no `err.response`) → 502 | FR-010 |
|
||
| `US-content-fetch: body parsing` | unparseable string body → 502 | FR-010 |
|
||
| `US-content-fetch: body parsing` | `vkm:articleBody` absent → 404 | FR-009 |
|
||
| `US-content-fetch: body parsing` | `vkm:articleBody` null → 404 | FR-009 |
|
||
| `US-content-fetch: body parsing` | `vkm:articleBody` empty string → 404 | FR-009 |
|
||
| `US-content-fetch: body parsing` | `vkm:articleBody` whitespace → 404 | FR-009 |
|
||
| `US-content-fetch: passthrough preserved` | no `kmeURL`, not sitemap → 200 'Authorized' | FR-012 |
|
||
| `extractArticleBody helper` | returns body string | FR-005 |
|
||
| `extractArticleBody helper` | null data → null | FR-005 |
|
||
| `extractArticleBody helper` | no `vkm:articleBody` field → null | FR-009 |
|
||
| `extractArticleBody helper` | empty string → null | FR-009 |
|
||
| `extractArticleBody helper` | whitespace string → null | FR-009 |
|
||
|
||
**Contract tests** (add to `tests/contract/proxy-http.test.js`):
|
||
|
||
| Test case | Setup | Verifies |
|
||
|-----------|-------|---------|
|
||
| valid `kmeURL` → real mock HTTP server returning JSON-LD with `vkm:articleBody` → 200 HTML | real HTTP server, real token server | SC-001, FR-006 |
|
||
| real mock server returning 404 → proxy returns 404 | real 404 HTTP server | FR-009 |
|
||
| real mock server returning 503 → proxy returns 502 | real 503 HTTP server | FR-010 |
|
||
| non-responding server → proxy returns 502 within 12 s | real server that never responds | FR-010 |
|
||
|
||
### `extractArticleBody` — Edge Case Coverage
|
||
|
||
| Input | Expected output |
|
||
|-------|----------------|
|
||
| `{ 'vkm:articleBody': '<p>Hello</p>' }` | `'<p>Hello</p>'` |
|
||
| `{ 'vkm:articleBody': '' }` | `null` |
|
||
| `{ 'vkm:articleBody': ' ' }` | `null` |
|
||
| `{ 'vkm:articleBody': null }` | `null` |
|
||
| `{ 'vkm:articleBody': undefined }` (field absent) | `null` |
|
||
| `{}` (field absent) | `null` |
|
||
| `null` | `null` |
|
||
| `'string'` (non-object) | `null` |
|