# Implementation Plan: KME Article Content Fetch
**Branch**: `003-kme-content-fetch` | **Date**: 2025-07-15 | **Spec**: [spec.md](spec.md)
**Input**: Feature specification from `specs/003-kme-content-fetch/spec.md`
## Summary
Add a new `contentFetchFlow()` to `src/proxyScripts/kmeContentSourceAdapter.js` that handles
requests carrying a `?kmeURL=` query parameter. The flow validates the parameter, obtains an OIDC
token via the existing `getValidToken()`, performs a GET request to the `kmeURL` with
`Authorization: OIDC_id_token {token}`, extracts `vkm:articleBody` from the JSON-LD response, and
returns it as `text/html`. A new pure helper `extractArticleBody(data)` is added to
`src/globalVariables/kmeContentSourceAdapterHelpers.js`. No new files, no new npm dependencies.
## Technical Context
**Language/Version**: Node.js ≥18, ESM (`"type": "module"`)
**Primary Dependencies**: `axios ^1.13` (HTTP client, already in context), `redis ^5` (token cache, injected), `xmlbuilder2 ^4` (sitemap, unrelated to this feature)
**Storage**: Redis (token cache only — managed by `getValidToken()`, not modified by this feature)
**Testing**: Node.js built-in test runner (`node:test`) — `npm run test:unit`, `npm run test:contract`
**Target Platform**: Node.js server (Linux/macOS); proxy script executed inside `vm.createContext` per request
**Project Type**: HTTP proxy adapter — monolithic VM-sandbox architecture
**Performance Goals**: End-to-end response ≤11 s (10 s upstream timeout + 1 s proxy overhead) per SC-001
**Constraints**: Zero new imports/exports in VM sandbox files; no new npm dependencies; no new `src/` files
**Scale/Scope**: Two files modified (`kmeContentSourceAdapter.js`, `kmeContentSourceAdapterHelpers.js`); new unit tests in `tests/unit/proxy.test.js`; new contract tests in `tests/contract/proxy-http.test.js`
## Constitution Check
*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
| Gate | Status | Notes |
|------|--------|-------|
| All business logic stays in `src/proxyScripts/kmeContentSourceAdapter.js` | ✅ PASS | `contentFetchFlow()` lives entirely in the adapter file |
| Zero `import`/`export` in VM sandbox file | ✅ PASS | No imports added; all dependencies via injected context |
| `extractArticleBody` is a pure utility → helper file | ✅ PASS | No state, no API calls — qualifies for `kmeContentSourceAdapterHelpers.js` |
| No new files in `src/` | ✅ PASS | Only two existing files are modified |
| No new npm dependencies | ✅ PASS | `axios` and `URL`/`URLSearchParams` already available in context |
| Helpers file uses literal function body pattern | ✅ PASS | New helper added before existing `return { ... }` block |
| Authentication (`getValidToken`) stays in proxy script (called from adapter, not moved) | ✅ PASS | `getValidToken()` is invoked from `contentFetchFlow()` in the adapter |
**Post-design re-check**: All gates pass. No constitutional violations. No complexity tracking required.
## Project Structure
### Documentation (this feature)
```text
specs/003-kme-content-fetch/
├── plan.md                       # This file
├── research.md                   # Phase 0 output
├── data-model.md                 # Phase 1 output
├── quickstart.md                 # Phase 1 output
├── contracts/
│   └── http-content-fetch.md     # HTTP request/response contract
└── tasks.md                      # Phase 2 output (/speckit.tasks, NOT created here)
```
### Source Code (repository root)
```text
src/
├── proxyScripts/
│   └── kmeContentSourceAdapter.js          # MODIFIED: add contentFetchFlow(), update routing
├── globalVariables/
│   └── kmeContentSourceAdapterHelpers.js   # MODIFIED: add extractArticleBody()
├── logger.js                               # unchanged
└── server.js                               # unchanged
tests/
├── unit/
│   └── proxy.test.js                       # MODIFIED: add content-fetch describe blocks
└── contract/
    └── proxy-http.test.js                  # MODIFIED: add content-fetch contract tests
```
**Structure Decision**: Single-project monolith. Two existing source files modified; two existing test
files extended. No new files in `src/`. The VM sandbox pattern and helper file pattern are preserved
exactly as documented in the constitution.

---
## Phase 0: Research Findings → `research.md`
See [research.md](research.md) for full decision log. Summary:
- **URL parameter extraction**: `new URL(req.url, 'http://localhost').searchParams.get('kmeURL')`. Research on the VM context confirmed this is safe: `URL` is injected at server.js line 19.
- **URL validation**: `new URL(kmeURL)` + protocol check in try/catch — cleanly handles FR-007/FR-008 in one guard.
- **Axios error handling**: Confirmed `ECONNABORTED`/`ERR_CANCELED` for timeout; `err.response.status` available for all HTTP errors; JSON auto-parsed when `Content-Type: application/json`.
- **JSON-LD parsing**: `response.data` is an object when axios auto-parses; fallback `JSON.parse()` needed for non-JSON content-types; non-object → 502.
- **No unknowns remaining**: All NEEDS CLARIFICATION resolved. Research complete.
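
The first two findings can be read as a single guard. The sketch below is a minimal illustration, not the adapter code: the helper name `readKmeUrl` and the `{ ok, reason }` result shape are assumptions introduced here, and `kme.example` is a placeholder host. It relies only on the global `URL` available in Node.js ≥18.

```javascript
// Hypothetical helper for illustration only; the plan inlines this logic
// in contentFetchFlow() rather than adding a function with this name.
function readKmeUrl(requestUrl) {
  // req.url is path-relative, so a placeholder base is needed to parse it.
  const raw = new URL(requestUrl, 'http://localhost').searchParams.get('kmeURL') ?? '';
  if (!raw.trim()) {
    return { ok: false, reason: 'missing' };
  }
  try {
    // Without a base, new URL() throws on relative or malformed input.
    const parsed = new URL(raw);
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
      return { ok: false, reason: 'invalid' };
    }
  } catch {
    return { ok: false, reason: 'invalid' };
  }
  return { ok: true, kmeURL: raw };
}

console.log(readKmeUrl('/?kmeURL=https%3A%2F%2Fkme.example%2Farticles%2F1'));
console.log(readKmeUrl('/?kmeURL=ftp%3A%2F%2Fkme.example%2Ffile'));
console.log(readKmeUrl('/?kmeURL=%2Farticles%2F1'));
console.log(readKmeUrl('/?kmeURL='));
```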
---
## Phase 1: Design
### Routing Change (`kmeContentSourceAdapter.js`)
Add a new `else if` branch between the existing sitemap check and the `oidcAuthFlow` fallback:
```javascript
// Entry point: URL routing
try {
  if (req.url.endsWith('/sitemap.xml')) {
    await sitemapFlow();
  } else if (new URL(req.url, 'http://localhost').searchParams.has('kmeURL')) {
    await contentFetchFlow(); // ← NEW
  } else {
    await oidcAuthFlow();
  }
} catch (err) { /* existing outer catch → 401 */ }
```
`contentFetchFlow()` is fully self-contained — all errors are caught internally and never propagate to the outer catch.
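
To make the branch precedence easier to reason about, the routing decision can be isolated as a pure function. This is a sketch under assumptions: `selectFlow` is a hypothetical name that does not exist in the adapter, and the returned strings are labels rather than calls to the real flows.

```javascript
// Hypothetical pure function mirroring the branch order of the routing snippet.
// It returns the name of the flow that would run instead of invoking it.
function selectFlow(requestUrl) {
  if (requestUrl.endsWith('/sitemap.xml')) return 'sitemapFlow';
  const params = new URL(requestUrl, 'http://localhost').searchParams;
  // has() is true even for an empty value, so '?kmeURL=' still reaches
  // contentFetchFlow(), which then rejects it with 400.
  if (params.has('kmeURL')) return 'contentFetchFlow';
  return 'oidcAuthFlow';
}

console.log(selectFlow('/sitemap.xml'));                         // sitemapFlow
console.log(selectFlow('/?kmeURL=https%3A%2F%2Fkme.example'));   // contentFetchFlow
console.log(selectFlow('/?kmeURL='));                            // contentFetchFlow
console.log(selectFlow('/anything-else'));                       // oidcAuthFlow
console.log(selectFlow('/sitemap.xml?size=10'));                 // oidcAuthFlow
```

The last call shows a property of the snippet as written: the sitemap check uses `endsWith` on the raw `req.url`, so a sitemap request carrying a query string does not match the first branch. Whether that matters depends on how the real adapter normalizes `req.url`, which this plan does not show.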
### `contentFetchFlow()` — Complete Logic
```javascript
async function contentFetchFlow() {
  // Step 1: Extract kmeURL (FR-001, FR-002)
  const kmeURL = new URL(req.url, 'http://localhost').searchParams.get('kmeURL') ?? '';

  // Step 2: Validate that kmeURL is present and non-empty (FR-007)
  if (!kmeURL.trim()) {
    res.writeHead(400, { 'Content-Type': 'text/plain' });
    res.end('Bad Request: kmeURL parameter is required');
    return;
  }

  // Step 3: Validate that kmeURL is a well-formed absolute http/https URL (FR-008)
  try {
    const u = new URL(kmeURL);
    if (u.protocol !== 'http:' && u.protocol !== 'https:') throw new Error();
  } catch {
    res.writeHead(400, { 'Content-Type': 'text/plain' });
    res.end('Bad Request: kmeURL must be a well-formed absolute http/https URL');
    return;
  }

  // Step 4: Validate OIDC settings (config guard, returns 500 for missing config)
  const missingField = kmeContentSourceAdapterHelpers.validateSettings(
    kme_CSA_settings,
    ['tokenUrl', 'username', 'password', 'clientId', 'scope'],
  );
  if (missingField) {
    console.error({ message: 'Content fetch: config error', missingField });
    res.writeHead(500, { 'Content-Type': 'text/plain' });
    res.end('Configuration error: missing required field: ' + missingField);
    return;
  }

  // Step 5: Obtain OIDC token (FR-003, FR-011)
  let token;
  try {
    token = await kmeContentSourceAdapterHelpers.getValidToken(req.url, req.method);
  } catch (tokenErr) {
    console.error({ message: 'Content fetch: token acquisition failed', error: tokenErr.message });
    res.writeHead(502, { 'Content-Type': 'text/plain' });
    res.end('Bad Gateway: token acquisition failed');
    return;
  }

  // Step 6: GET kmeURL verbatim with auth header (FR-002, FR-003, FR-004)
  let response;
  try {
    console.debug({ message: 'Content fetch: fetching article', kmeURL });
    response = await axios.get(kmeURL, {
      headers: { Authorization: `OIDC_id_token ${token}` },
      timeout: 10000,
    });
  } catch (fetchErr) {
    if (fetchErr.code === 'ECONNABORTED' || fetchErr.code === 'ERR_CANCELED') {
      console.error({ message: 'Content fetch: upstream timeout', code: fetchErr.code });
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Bad Gateway: upstream request timed out');
    } else if (fetchErr.response) {
      const status = fetchErr.response.status;
      console.error({ message: 'Content fetch: upstream HTTP error', status });
      if (status >= 400 && status < 500) {
        res.writeHead(404, { 'Content-Type': 'text/plain' });
        res.end('Not Found: article not found at upstream');
      } else {
        res.writeHead(502, { 'Content-Type': 'text/plain' });
        res.end('Bad Gateway: upstream error HTTP ' + status);
      }
    } else {
      console.error({ message: 'Content fetch: network error', error: fetchErr.message });
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Bad Gateway: ' + fetchErr.message);
    }
    return;
  }

  // Step 7: Parse body, handling a non-JSON content-type (FR-005, FR-010)
  let data = response.data;
  if (typeof data === 'string') {
    try {
      data = JSON.parse(data);
    } catch {
      console.error({ message: 'Content fetch: unparseable response body', kmeURL });
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Bad Gateway: unparseable response from upstream');
      return;
    }
  }
  if (typeof data !== 'object' || data === null) {
    console.error({ message: 'Content fetch: unexpected non-object response', kmeURL });
    res.writeHead(502, { 'Content-Type': 'text/plain' });
    res.end('Bad Gateway: unexpected response from upstream');
    return;
  }

  // Step 8: Extract vkm:articleBody (FR-005, FR-009)
  const articleBody = kmeContentSourceAdapterHelpers.extractArticleBody(data);
  if (!articleBody) {
    console.error({ message: 'Content fetch: vkm:articleBody absent or empty', kmeURL });
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('Not Found: article body not present in upstream response');
    return;
  }

  // Step 9: Return article HTML (FR-006)
  console.info({ message: 'Content fetch: article fetched successfully', kmeURL });
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(articleBody);
}
```
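
Because Step 6 forwards `kmeURL` verbatim, a caller has to percent-encode the article URL when building the proxy request, otherwise any query string inside it is split at the proxy. The sketch below illustrates this from the client side. The proxy origin `http://localhost:3000` and the article URL on `kme.example` are assumptions chosen for the example, not values from the spec.

```javascript
// Assumption: the proxy listens at this origin; substitute the real host and port.
const proxyOrigin = 'http://localhost:3000';
// Assumption: an illustrative upstream article URL that has its own query string.
const articleUrl = 'https://kme.example/content/articles/42?lang=en&rev=3';

// URLSearchParams percent-encodes the value, so the '?' and '&' inside
// articleUrl are not mistaken for separators of the proxy request.
const requestUrl = new URL('/', proxyOrigin);
requestUrl.searchParams.set('kmeURL', articleUrl);
console.log(requestUrl.href);

// The proxy-side extraction from Step 1 recovers the original value unchanged.
const received = new URL(requestUrl.pathname + requestUrl.search, 'http://localhost')
  .searchParams.get('kmeURL');
console.log(received === articleUrl); // true

// Without encoding, '&rev=3' is parsed as a separate proxy parameter and lost.
const naive = new URL('/?kmeURL=' + articleUrl, 'http://localhost')
  .searchParams.get('kmeURL');
console.log(naive); // https://kme.example/content/articles/42?lang=en
```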
### `extractArticleBody(data)` — New Helper
Add to `kmeContentSourceAdapterHelpers.js` before the existing `return { ... }` block:
```javascript
/**
 * Extracts the vkm:articleBody string from a KME Content Service JSON-LD response.
 * Returns null if the field is absent, null, not a string, or an empty/whitespace string.
 * @param {object} data parsed JSON-LD response from the KME Content Service
 * @returns {string|null}
 */
function extractArticleBody(data) {
  if (!data || typeof data !== 'object') return null;
  const body = data['vkm:articleBody'];
  if (body == null || typeof body !== 'string' || body.trim() === '') return null;
  return body;
}
```
Update the `return { ... }` at the bottom of the helpers file to export the new function:
```javascript
return {
  validateSettings,
  extractHydraItems,
  buildSitemapXml,
  getValidToken,
  extractArticleBody, // ← NEW
};
```
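
The reason the `return { ... }` list must be updated is easiest to see by evaluating a function body directly. The sketch below is an assumption about the mechanics, not a description of the real loader: it uses `new Function` as a stand-in for whatever wraps and invokes the helpers file, and `helperSource` and `notListed` are hypothetical names.

```javascript
// Assumption: the helpers file is a bare function body that some loader wraps
// and invokes. `new Function` stands in for that loader in this sketch.
const helperSource = `
  function extractArticleBody(data) {
    if (!data || typeof data !== 'object') return null;
    const body = data['vkm:articleBody'];
    if (body == null || typeof body !== 'string' || body.trim() === '') return null;
    return body;
  }
  function notListed() { return 'unreachable from outside'; }
  return { extractArticleBody };
`;

const kmeContentSourceAdapterHelpers = new Function(helperSource)();

// Listed in the return object, so the adapter can call it.
console.log(kmeContentSourceAdapterHelpers.extractArticleBody({ 'vkm:articleBody': '<p>Hi</p>' }));
// Defined in the body but not returned, so it is invisible to callers.
console.log(typeof kmeContentSourceAdapterHelpers.notListed); // undefined
```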
### Error Response Matrix
| Condition | HTTP Status | Response Body |
|-----------|-------------|---------------|
| `kmeURL` absent or empty | 400 | `Bad Request: kmeURL parameter is required` |
| `kmeURL` not a well-formed absolute http/https URL | 400 | `Bad Request: kmeURL must be a well-formed absolute http/https URL` |
| Missing OIDC config field | 500 | `Configuration error: missing required field: {field}` |
| Token acquisition failure | 502 | `Bad Gateway: token acquisition failed` |
| Upstream 4xx response | 404 | `Not Found: article not found at upstream` |
| Upstream 5xx response | 502 | `Bad Gateway: upstream error HTTP {status}` |
| Upstream timeout (`ECONNABORTED`/`ERR_CANCELED`) | 502 | `Bad Gateway: upstream request timed out` |
| Network error (no `err.response`) | 502 | `Bad Gateway: {err.message}` |
| Response body unparseable as JSON | 502 | `Bad Gateway: unparseable response from upstream` |
| Non-object response body | 502 | `Bad Gateway: unexpected response from upstream` |
| `vkm:articleBody` absent, null, or empty | 404 | `Not Found: article body not present in upstream response` |
| Success | 200 `text/html` | article body HTML |
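
The upstream-failure rows of the matrix can be expressed as one pure mapping, which is convenient for checking the table against the Step 6 catch block. This is a sketch only: `classifyFetchError` is a hypothetical name, and its input is a plain object shaped like the fields the plan reads from an axios error (`code`, `response.status`, `message`).

```javascript
// Hypothetical pure function reproducing the Step 6 catch-block decisions.
// `err` only needs the fields read below; no axios import is required.
function classifyFetchError(err) {
  if (err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED') {
    return { status: 502, body: 'Bad Gateway: upstream request timed out' };
  }
  if (err.response) {
    const upstream = err.response.status;
    if (upstream >= 400 && upstream < 500) {
      // Every upstream 4xx is reported to the caller as 404.
      return { status: 404, body: 'Not Found: article not found at upstream' };
    }
    return { status: 502, body: 'Bad Gateway: upstream error HTTP ' + upstream };
  }
  // No response object: DNS failure, connection refused, and similar.
  return { status: 502, body: 'Bad Gateway: ' + err.message };
}

console.log(classifyFetchError({ code: 'ECONNABORTED', message: 'timeout' }));
console.log(classifyFetchError({ response: { status: 410 }, message: 'Gone' }));
console.log(classifyFetchError({ response: { status: 503 }, message: 'Unavailable' }));
console.log(classifyFetchError({ code: 'ECONNREFUSED', message: 'connect ECONNREFUSED' }));
```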
### Test Coverage Plan
**Unit tests** (add to `tests/unit/proxy.test.js`):
| Describe block | Test case | Verifies |
|---------------|-----------|---------|
| `US-content-fetch: happy path` | cached token → 200 HTML body | FR-001, FR-005, FR-006 |
| `US-content-fetch: happy path` | cache miss → token fetch → 200 HTML body | FR-003 |
| `US-content-fetch: happy path` | expired token → refresh → 200 HTML body | FR-003 |
| `US-content-fetch: input validation` | no `kmeURL` → oidcAuthFlow (unchanged 200) | FR-012 |
| `US-content-fetch: input validation` | `kmeURL` empty string → 400 | FR-007 |
| `US-content-fetch: input validation` | `kmeURL` whitespace → 400 | FR-007 |
| `US-content-fetch: input validation` | `kmeURL` relative URL → 400 | FR-008 |
| `US-content-fetch: input validation` | `kmeURL` non-http protocol (`ftp:`) → 400 | FR-008 |
| `US-content-fetch: input validation` | `kmeURL` malformed string → 400 | FR-008 |
| `US-content-fetch: token failure` | `getValidToken` throws → 502 | FR-011 |
| `US-content-fetch: upstream errors` | upstream 404 → 404 | FR-009 |
| `US-content-fetch: upstream errors` | upstream 410 → 404 | FR-009 |
| `US-content-fetch: upstream errors` | upstream 503 → 502 | FR-010 |
| `US-content-fetch: upstream errors` | timeout `ECONNABORTED` → 502 | FR-010 |
| `US-content-fetch: upstream errors` | timeout `ERR_CANCELED` → 502 | FR-010 |
| `US-content-fetch: upstream errors` | network error (no `err.response`) → 502 | FR-010 |
| `US-content-fetch: body parsing` | unparseable string body → 502 | FR-010 |
| `US-content-fetch: body parsing` | `vkm:articleBody` absent → 404 | FR-009 |
| `US-content-fetch: body parsing` | `vkm:articleBody` null → 404 | FR-009 |
| `US-content-fetch: body parsing` | `vkm:articleBody` empty string → 404 | FR-009 |
| `US-content-fetch: body parsing` | `vkm:articleBody` whitespace → 404 | FR-009 |
| `US-content-fetch: passthrough preserved` | no `kmeURL`, not sitemap → 200 'Authorized' | FR-012 |
| `extractArticleBody helper` | returns body string | FR-005 |
| `extractArticleBody helper` | null data → null | FR-005 |
| `extractArticleBody helper` | no `vkm:articleBody` field → null | FR-009 |
| `extractArticleBody helper` | empty string → null | FR-009 |
| `extractArticleBody helper` | whitespace string → null | FR-009 |
**Contract tests** (add to `tests/contract/proxy-http.test.js`):
| Test case | Setup | Verifies |
|-----------|-------|---------|
| valid `kmeURL` → real mock HTTP server returning JSON-LD with `vkm:articleBody` → 200 HTML | real HTTP server, real token server | SC-001, FR-006 |
| real mock server returning 404 → proxy returns 404 | real 404 HTTP server | FR-009 |
| real mock server returning 503 → proxy returns 502 | real 503 HTTP server | FR-010 |
| non-responding server → proxy returns 502 within 12 s | real server that never responds | FR-010 |
### `extractArticleBody` — Edge Case Coverage
| Input | Expected output |
|-------|----------------|
| `{ 'vkm:articleBody': '<p>Hello</p>' }` | `'<p>Hello</p>'` |
| `{ 'vkm:articleBody': '' }` | `null` |
| `{ 'vkm:articleBody': ' ' }` | `null` |
| `{ 'vkm:articleBody': null }` | `null` |
| `{ 'vkm:articleBody': undefined }` (explicit `undefined`) | `null` |
| `{}` (field absent) | `null` |
| `null` | `null` |
| `'string'` (non-object) | `null` |
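
The edge-case table can be checked mechanically with a small table-driven loop. The sketch below copies `extractArticleBody` from the design section so that it runs on its own. The `cases` array and the console output format are illustrative assumptions and are not part of the planned `node:test` suite.

```javascript
// Copied from the design section so this sketch is self-contained.
function extractArticleBody(data) {
  if (!data || typeof data !== 'object') return null;
  const body = data['vkm:articleBody'];
  if (body == null || typeof body !== 'string' || body.trim() === '') return null;
  return body;
}

// One [input, expected] pair per row of the edge-case table.
const cases = [
  [{ 'vkm:articleBody': '<p>Hello</p>' }, '<p>Hello</p>'],
  [{ 'vkm:articleBody': '' }, null],
  [{ 'vkm:articleBody': '   ' }, null],
  [{ 'vkm:articleBody': null }, null],
  [{ 'vkm:articleBody': undefined }, null],
  [{}, null],
  [null, null],
  ['string', null],
];

for (const [input, expected] of cases) {
  const actual = extractArticleBody(input);
  const verdict = actual === expected ? 'ok' : 'MISMATCH';
  console.log(JSON.stringify(input), '->', actual, verdict);
}
```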