feat: content fetch, sitemap fixes, remove oidcAuthFlow

- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
2026-04-23 16:40:06 -05:00
parent d50f041488
commit f840587e5e
29 changed files with 1998 additions and 352 deletions

View File

@@ -0,0 +1,330 @@
# Implementation Plan: KME Article Content Fetch
**Branch**: `003-kme-content-fetch` | **Date**: 2025-07-15 | **Spec**: [spec.md](spec.md)
**Input**: Feature specification from `specs/003-kme-content-fetch/spec.md`
## Summary
Add a new `contentFetchFlow()` to `src/proxyScripts/kmeContentSourceAdapter.js` that handles
requests carrying a `?kmeURL=` query parameter. The flow validates the parameter, obtains an OIDC
token via the existing `getValidToken()`, performs a GET request to the `kmeURL` with
`Authorization: OIDC_id_token {token}`, extracts `vkm:articleBody` from the JSON-LD response, and
returns it as `text/html`. A new pure helper `extractArticleBody(data)` is added to
`src/globalVariables/kmeContentSourceAdapterHelpers.js`. No new files, no new npm dependencies.
## Technical Context
**Language/Version**: Node.js ≥18, ESM (`"type": "module"`)
**Primary Dependencies**: `axios ^1.13` (HTTP client, already in context), `redis ^5` (token cache, injected), `xmlbuilder2 ^4` (sitemap, unrelated to this feature)
**Storage**: Redis (token cache only — managed by `getValidToken()`, not modified by this feature)
**Testing**: Node.js built-in test runner (`node:test`) — `npm run test:unit`, `npm run test:contract`
**Target Platform**: Node.js server (Linux/macOS); proxy script executed inside `vm.createContext` per request
**Project Type**: HTTP proxy adapter — monolithic VM-sandbox architecture
**Performance Goals**: End-to-end response ≤11 s (10 s upstream timeout + 1 s proxy overhead) per SC-001
**Constraints**: Zero new imports/exports in VM sandbox files; no new npm dependencies; no new `src/` files
**Scale/Scope**: Two files modified (`kmeContentSourceAdapter.js`, `kmeContentSourceAdapterHelpers.js`); new unit tests in `tests/unit/proxy.test.js`; new contract tests in `tests/contract/proxy-http.test.js`
## Constitution Check
*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
| Gate | Status | Notes |
|------|--------|-------|
| All business logic stays in `src/proxyScripts/kmeContentSourceAdapter.js` | ✅ PASS | `contentFetchFlow()` lives entirely in the adapter file |
| Zero `import`/`export` in VM sandbox file | ✅ PASS | No imports added; all dependencies via injected context |
| `extractArticleBody` is a pure utility → helper file | ✅ PASS | No state, no API calls — qualifies for `kmeContentSourceAdapterHelpers.js` |
| No new files in `src/` | ✅ PASS | Only two existing files are modified |
| No new npm dependencies | ✅ PASS | `axios` and `URL`/`URLSearchParams` already available in context |
| Helpers file uses literal function body pattern | ✅ PASS | New helper added before existing `return { ... }` block |
| Authentication (`getValidToken`) stays in proxy script (called from adapter, not moved) | ✅ PASS | `getValidToken()` is invoked from `contentFetchFlow()` in the adapter |
**Post-design re-check**: All gates pass. No constitutional violations. No complexity tracking required.
## Project Structure
### Documentation (this feature)
```text
specs/003-kme-content-fetch/
├── plan.md # This file
├── research.md # Phase 0 output
├── data-model.md # Phase 1 output
├── quickstart.md # Phase 1 output
├── contracts/
│ └── http-content-fetch.md # HTTP request/response contract
└── tasks.md # Phase 2 output (/speckit.tasks — NOT created here)
```
### Source Code (repository root)
```text
src/
├── proxyScripts/
│ └── kmeContentSourceAdapter.js # MODIFIED: add contentFetchFlow(), update routing
├── globalVariables/
│ └── kmeContentSourceAdapterHelpers.js # MODIFIED: add extractArticleBody()
├── logger.js # unchanged
└── server.js # unchanged
tests/
├── unit/
│ └── proxy.test.js # MODIFIED: add content-fetch describe blocks
└── contract/
└── proxy-http.test.js # MODIFIED: add content-fetch contract tests
```
**Structure Decision**: Single-project monolith. Two existing source files modified; two existing test
files extended. No new files in `src/`. The VM sandbox pattern and helper file pattern are preserved
exactly as documented in the constitution.
---
## Phase 0: Research Findings → `research.md`
See [research.md](research.md) for full decision log. Summary:
- **URL parameter extraction**: `new URL(req.url, 'http://localhost').searchParams.get('kmeURL')` — confirmed safe from VM context research; `URL` is injected at server.js line 19.
- **URL validation**: `new URL(kmeURL)` + protocol check in try/catch — cleanly handles FR-007/FR-008 in one guard.
- **Axios error handling**: Confirmed `ECONNABORTED`/`ERR_CANCELED` for timeout; `err.response.status` available for all HTTP errors; JSON auto-parsed when `Content-Type: application/json`.
- **JSON-LD parsing**: `response.data` is an object when axios auto-parses; fallback `JSON.parse()` needed for non-JSON content-types; non-object → 502.
- **No unknowns remaining**: All NEEDS CLARIFICATION resolved. Research complete.
---
## Phase 1: Design
### Routing Change (`kmeContentSourceAdapter.js`)
Add a new `else if` branch between the existing sitemap check and the `oidcAuthFlow` fallback:
```javascript
// Entry point — URL routing
try {
if (req.url.endsWith('/sitemap.xml')) {
await sitemapFlow();
} else if (new URL(req.url, 'http://localhost').searchParams.has('kmeURL')) {
await contentFetchFlow(); // ← NEW
} else {
await oidcAuthFlow();
}
} catch (err) { /* existing outer catch → 401 */ }
```
`contentFetchFlow()` is fully self-contained — all errors are caught internally and never propagate to the outer catch.
### `contentFetchFlow()` — Complete Logic
```javascript
async function contentFetchFlow() {
// Step 1: Extract kmeURL (FR-001, FR-002)
const kmeURL = new URL(req.url, 'http://localhost').searchParams.get('kmeURL') ?? '';
// Step 2: Validate — absent / empty (FR-007)
if (!kmeURL.trim()) {
res.writeHead(400, { 'Content-Type': 'text/plain' });
res.end('Bad Request: kmeURL parameter is required');
return;
}
// Step 3: Validate — well-formed absolute http/https URL (FR-008)
try {
const u = new URL(kmeURL);
if (u.protocol !== 'http:' && u.protocol !== 'https:') throw new Error();
} catch {
res.writeHead(400, { 'Content-Type': 'text/plain' });
res.end('Bad Request: kmeURL must be a well-formed absolute http/https URL');
return;
}
// Step 4: Validate OIDC settings (config guard, returns 500 for missing config)
const missingField = kmeContentSourceAdapterHelpers.validateSettings(
kme_CSA_settings,
['tokenUrl', 'username', 'password', 'clientId', 'scope'],
);
if (missingField) {
console.error({ message: 'Content fetch: config error', missingField });
res.writeHead(500, { 'Content-Type': 'text/plain' });
res.end('Configuration error: missing required field: ' + missingField);
return;
}
// Step 5: Obtain OIDC token (FR-003, FR-011)
let token;
try {
token = await kmeContentSourceAdapterHelpers.getValidToken(req.url, req.method);
} catch (tokenErr) {
console.error({ message: 'Content fetch: token acquisition failed', error: tokenErr.message });
res.writeHead(502, { 'Content-Type': 'text/plain' });
res.end('Bad Gateway: token acquisition failed');
return;
}
// Step 6: GET kmeURL verbatim with auth header (FR-002, FR-003, FR-004)
let response;
try {
console.debug({ message: 'Content fetch: fetching article', kmeURL });
response = await axios.get(kmeURL, {
headers: { Authorization: `OIDC_id_token ${token}` },
timeout: 10000,
});
} catch (fetchErr) {
if (fetchErr.code === 'ECONNABORTED' || fetchErr.code === 'ERR_CANCELED') {
console.error({ message: 'Content fetch: upstream timeout', code: fetchErr.code });
res.writeHead(502, { 'Content-Type': 'text/plain' });
res.end('Bad Gateway: upstream request timed out');
} else if (fetchErr.response) {
const status = fetchErr.response.status;
console.error({ message: 'Content fetch: upstream HTTP error', status });
if (status >= 400 && status < 500) {
res.writeHead(404, { 'Content-Type': 'text/plain' });
res.end('Not Found: article not found at upstream');
} else {
res.writeHead(502, { 'Content-Type': 'text/plain' });
res.end('Bad Gateway: upstream error HTTP ' + status);
}
} else {
console.error({ message: 'Content fetch: network error', error: fetchErr.message });
res.writeHead(502, { 'Content-Type': 'text/plain' });
res.end('Bad Gateway: ' + fetchErr.message);
}
return;
}
// Step 7: Parse body — handle non-JSON content-type (FR-005, FR-010)
let data = response.data;
if (typeof data === 'string') {
try {
data = JSON.parse(data);
} catch {
console.error({ message: 'Content fetch: unparseable response body', kmeURL });
res.writeHead(502, { 'Content-Type': 'text/plain' });
res.end('Bad Gateway: unparseable response from upstream');
return;
}
}
if (typeof data !== 'object' || data === null) {
console.error({ message: 'Content fetch: unexpected non-object response', kmeURL });
res.writeHead(502, { 'Content-Type': 'text/plain' });
res.end('Bad Gateway: unexpected response from upstream');
return;
}
// Step 8: Extract vkm:articleBody (FR-005, FR-009)
const articleBody = kmeContentSourceAdapterHelpers.extractArticleBody(data);
if (!articleBody) {
console.error({ message: 'Content fetch: vkm:articleBody absent or empty', kmeURL });
res.writeHead(404, { 'Content-Type': 'text/plain' });
res.end('Not Found: article body not present in upstream response');
return;
}
// Step 9: Return article HTML (FR-006)
console.info({ message: 'Content fetch: article fetched successfully', kmeURL });
res.writeHead(200, { 'Content-Type': 'text/html' });
res.end(articleBody);
}
```
### `extractArticleBody(data)` — New Helper
Add to `kmeContentSourceAdapterHelpers.js` before the existing `return { ... }` block:
```javascript
/**
* Extracts the vkm:articleBody string from a KME Content Service JSON-LD response.
* Returns null if the field is absent, null, not a string, or an empty/whitespace string.
* @param {object} data parsed JSON-LD response from the KME Content Service
* @returns {string|null}
*/
function extractArticleBody(data) {
if (!data || typeof data !== 'object') return null;
const body = data['vkm:articleBody'];
if (body == null || typeof body !== 'string' || body.trim() === '') return null;
return body;
}
```
Update the `return { ... }` at the bottom of the helpers file to export the new function:
```javascript
return {
validateSettings,
extractHydraItems,
buildSitemapXml,
getValidToken,
extractArticleBody, // ← NEW
};
```
### Error Response Matrix
| Condition | HTTP Status | Response Body |
|-----------|-------------|---------------|
| `kmeURL` absent or empty | 400 | `Bad Request: kmeURL parameter is required` |
| `kmeURL` not a well-formed absolute http/https URL | 400 | `Bad Request: kmeURL must be a well-formed absolute http/https URL` |
| Missing OIDC config field | 500 | `Configuration error: missing required field: {field}` |
| Token acquisition failure | 502 | `Bad Gateway: token acquisition failed` |
| Upstream 4xx response | 404 | `Not Found: article not found at upstream` |
| Upstream 5xx response | 502 | `Bad Gateway: upstream error HTTP {status}` |
| Upstream timeout (`ECONNABORTED`/`ERR_CANCELED`) | 502 | `Bad Gateway: upstream request timed out` |
| Network error (no `err.response`) | 502 | `Bad Gateway: {err.message}` |
| Response body unparseable as JSON | 502 | `Bad Gateway: unparseable response from upstream` |
| Non-object response body | 502 | `Bad Gateway: unexpected response from upstream` |
| `vkm:articleBody` absent, null, or empty | 404 | `Not Found: article body not present in upstream response` |
| Success | 200 `text/html` | article body HTML |
### Test Coverage Plan
**Unit tests** (add to `tests/unit/proxy.test.js`):
| Describe block | Test case | Verifies |
|---------------|-----------|---------|
| `US-content-fetch: happy path` | cached token → 200 HTML body | FR-001, FR-005, FR-006 |
| `US-content-fetch: happy path` | cache miss → token fetch → 200 HTML body | FR-003 |
| `US-content-fetch: happy path` | expired token → refresh → 200 HTML body | FR-003 |
| `US-content-fetch: input validation` | no `kmeURL` → oidcAuthFlow (unchanged 200) | FR-012 |
| `US-content-fetch: input validation` | `kmeURL` empty string → 400 | FR-007 |
| `US-content-fetch: input validation` | `kmeURL` whitespace → 400 | FR-007 |
| `US-content-fetch: input validation` | `kmeURL` relative URL → 400 | FR-008 |
| `US-content-fetch: input validation` | `kmeURL` non-http protocol (`ftp:`) → 400 | FR-008 |
| `US-content-fetch: input validation` | `kmeURL` malformed string → 400 | FR-008 |
| `US-content-fetch: token failure` | `getValidToken` throws → 502 | FR-011 |
| `US-content-fetch: upstream errors` | upstream 404 → 404 | FR-009 |
| `US-content-fetch: upstream errors` | upstream 410 → 404 | FR-009 |
| `US-content-fetch: upstream errors` | upstream 503 → 502 | FR-010 |
| `US-content-fetch: upstream errors` | timeout `ECONNABORTED` → 502 | FR-010 |
| `US-content-fetch: upstream errors` | timeout `ERR_CANCELED` → 502 | FR-010 |
| `US-content-fetch: upstream errors` | network error (no `err.response`) → 502 | FR-010 |
| `US-content-fetch: body parsing` | unparseable string body → 502 | FR-010 |
| `US-content-fetch: body parsing` | `vkm:articleBody` absent → 404 | FR-009 |
| `US-content-fetch: body parsing` | `vkm:articleBody` null → 404 | FR-009 |
| `US-content-fetch: body parsing` | `vkm:articleBody` empty string → 404 | FR-009 |
| `US-content-fetch: body parsing` | `vkm:articleBody` whitespace → 404 | FR-009 |
| `US-content-fetch: passthrough preserved` | no `kmeURL`, not sitemap → 200 'Authorized' | FR-012 |
| `extractArticleBody helper` | returns body string | FR-005 |
| `extractArticleBody helper` | null data → null | FR-005 |
| `extractArticleBody helper` | no `vkm:articleBody` field → null | FR-009 |
| `extractArticleBody helper` | empty string → null | FR-009 |
| `extractArticleBody helper` | whitespace string → null | FR-009 |
**Contract tests** (add to `tests/contract/proxy-http.test.js`):
| Test case | Setup | Verifies |
|-----------|-------|---------|
| valid `kmeURL` → real mock HTTP server returning JSON-LD with `vkm:articleBody` → 200 HTML | real HTTP server, real token server | SC-001, FR-006 |
| real mock server returning 404 → proxy returns 404 | real 404 HTTP server | FR-009 |
| real mock server returning 503 → proxy returns 502 | real 503 HTTP server | FR-010 |
| non-responding server → proxy returns 502 within 12 s | real server that never responds | FR-010 |
### `extractArticleBody` — Edge Case Coverage
| Input | Expected output |
|-------|----------------|
| `{ 'vkm:articleBody': '<p>Hello</p>' }` | `'<p>Hello</p>'` |
| `{ 'vkm:articleBody': '' }` | `null` |
| `{ 'vkm:articleBody': ' ' }` | `null` |
| `{ 'vkm:articleBody': null }` | `null` |
| `{ 'vkm:articleBody': undefined }` (field absent) | `null` |
| `{}` (field absent) | `null` |
| `null` | `null` |
| `'string'` (non-object) | `null` |