Peter.Morton f840587e5e feat: content fetch, sitemap fixes, remove oidcAuthFlow
- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-23 16:40:06 -05:00


# Feature Specification: KME Article Content Fetch
**Feature Branch**: `003-kme-content-fetch`
**Created**: 2025-07-15
**Status**: Draft
## User Scenarios & Testing *(mandatory)*
### User Story 1 — Happy Path Article Fetch (Priority: P1)
A downstream consumer (e.g. a CMS or search front-end) sends a request to the proxy with a `kmeURL` query parameter containing the verbatim `vkm:url` value it received from the KME Search API. The proxy authenticates to the KME Content Service, fetches the article, and returns the article's HTML body so the consumer can render it.
**Why this priority**: This is the core business value of the feature. Without a working happy path there is nothing to build on.
**Independent Test**: Issue a GET request to the proxy with a valid, reachable `kmeURL`. Verify that the response has status 200, `Content-Type: text/html`, and a body that matches the `vkm:articleBody` field in the KME Content Service response.
**Acceptance Scenarios**:
1. **Given** the proxy receives a GET request whose URL does **not** end in `/sitemap.xml`, **When** the request contains `?kmeURL=https://content.kme.example/articles/123`, **Then** the proxy fetches that URL from the KME Content Service with `Authorization: OIDC_id_token {token}`, extracts `vkm:articleBody` from the JSON-LD response, and returns it as the HTTP response body with status 200 and `Content-Type: text/html`.
2. **Given** the token cache holds a valid OIDC token, **When** the proxy makes the upstream request, **Then** it uses the cached token without a new token acquisition round-trip.
3. **Given** the token cache has expired, **When** the proxy makes the upstream request, **Then** `getValidToken()` refreshes the token transparently before the upstream call is made.
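
The upstream call described in these scenarios could be configured roughly as follows. This is a minimal sketch, not the adapter's actual code: `buildUpstreamRequestConfig` and `UPSTREAM_TIMEOUT_MS` are hypothetical names, the `Accept` header is taken from the commit note at the top of this file, and the `validateStatus` override is an assumption that lets the caller inspect 4xx/5xx responses instead of having axios throw. A caller would pass the result as the second argument to `axios.get(kmeURL, config)` after obtaining the token from `getValidToken()`.

```javascript
// Hypothetical helper: builds the axios request config for the upstream GET.
const UPSTREAM_TIMEOUT_MS = 10000; // 10-second upstream timeout (FR-004)

function buildUpstreamRequestConfig(token) {
  return {
    timeout: UPSTREAM_TIMEOUT_MS,
    headers: {
      // Header format required by FR-003.
      Authorization: `OIDC_id_token ${token}`,
      // Taken from the commit note; the spec itself does not mandate it.
      Accept: "application/ld+json",
    },
    // Assumption: resolve on every status so the caller can map 4xx/5xx itself.
    validateStatus: () => true,
  };
}
```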
---
### User Story 2 — Missing or Empty kmeURL Parameter (Priority: P2)
A consumer sends a request that matches the content-fetch route (not a sitemap URL) but omits the `kmeURL` parameter or provides it as an empty string. The proxy must reject the request immediately with a clear 400 response rather than making a malformed upstream call.
**Why this priority**: Bad-input rejection prevents meaningless upstream calls and gives consumers a clear, actionable error signal.
**Independent Test**: Send a GET request to the proxy without `kmeURL`, or with `kmeURL=`. Verify a 400 Bad Request response is returned.
**Acceptance Scenarios**:
1. **Given** the proxy receives a request with no `kmeURL` query parameter, **When** the request is processed, **Then** the proxy returns HTTP 400 without making any upstream request.
2. **Given** the proxy receives a request with `?kmeURL=` (empty value), **When** the request is processed, **Then** the proxy returns HTTP 400 without making any upstream request.
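
The rejection rule in this story, combined with the well-formed-URL check listed under Edge Cases, might be expressed as a single predicate. This is a sketch under stated assumptions: `isUsableKmeUrl` is a hypothetical name, and restricting the scheme to `http:` or `https:` goes beyond the spec, which only asks for a well-formed absolute URL. A `false` result corresponds to the 400 response.

```javascript
// Hypothetical helper: true only for a non-blank, well-formed absolute URL.
function isUsableKmeUrl(value) {
  if (typeof value !== "string" || value.trim() === "") {
    return false; // missing, empty, or blank
  }
  try {
    // The URL constructor throws a TypeError for relative or malformed input.
    const parsed = new URL(value);
    // Assumption: only http(s) targets make sense for the content service.
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false;
  }
}
```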
---
### User Story 3 — Upstream Content Fetch Failure or Missing Article Body (Priority: P3)
The KME Content Service is unreachable, returns an HTTP error status, times out, or returns a valid JSON-LD document that does not contain `vkm:articleBody`. The proxy must surface an appropriate error to the consumer.
**Why this priority**: Robust error handling avoids silent failures and lets consumers distinguish between "article not found" and "upstream service error".
**Independent Test**: Simulate or stub each failure mode and verify the correct HTTP error code is returned by the proxy.
**Acceptance Scenarios**:
1. **Given** the KME Content Service returns a 4xx response for the requested URL, **When** the proxy processes the response, **Then** the proxy returns HTTP 404 to the caller.
2. **Given** the KME Content Service returns a 5xx response or the request times out (exceeding 10 seconds), **When** the proxy processes the response, **Then** the proxy returns HTTP 502 to the caller.
3. **Given** the KME Content Service returns a 200 JSON-LD response but the `vkm:articleBody` field is absent or null, **When** the proxy processes the response, **Then** the proxy returns HTTP 404 to the caller.
4. **Given** a network-level error prevents the upstream request from completing, **When** the proxy processes the error, **Then** the proxy returns HTTP 502 to the caller.
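
The four scenarios above reduce to a small status mapping. The sketch below assumes a hypothetical `outcome` object of the form `{ status, articleBody }` for a received response or `{ error }` for a timeout or network failure; neither that shape nor the name `mapUpstreamOutcome` comes from the spec. Upstream 3xx responses are not covered by the spec and are not treated specially here.

```javascript
// Hypothetical helper: maps an upstream outcome to the status the proxy returns.
function mapUpstreamOutcome(outcome) {
  if (outcome.error) {
    return 502; // timeout or network-level error
  }
  const { status, articleBody } = outcome;
  if (status >= 500) {
    return 502; // upstream server error
  }
  if (status >= 400) {
    return 404; // any upstream 4xx is reported as "not found"
  }
  if (typeof articleBody !== "string" || articleBody === "") {
    return 404; // body field absent, null, or empty
  }
  return 200;
}
```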
---
### User Story 4 — Existing Passthrough Behaviour Preserved (Priority: P4)
Requests that do not match the sitemap route and do not carry a `kmeURL` parameter must continue to receive the existing 200 OK response (auth-check passthrough) without any change in behaviour.
**Why this priority**: Non-regression of existing behaviour is required to avoid breaking active consumers that rely on the passthrough route.
**Independent Test**: Send a GET request to the proxy with neither a `/sitemap.xml` suffix nor a `kmeURL` parameter. Verify a 200 OK response is returned, identical to current behaviour.
**Acceptance Scenarios**:
1. **Given** the proxy receives a request with no `kmeURL` parameter and a URL not ending in `/sitemap.xml`, **When** the request is processed, **Then** the proxy returns HTTP 200 (the existing auth-check passthrough).
---
### Edge Cases
- What happens when `kmeURL` contains an already-encoded URL (percent-encoded characters)? The value must be used verbatim; double-encoding must not occur.
- What happens if the JSON-LD response body from the KME Content Service is not valid JSON? The proxy should treat this as a 502 upstream error.
- What happens if the upstream response contains `vkm:articleBody` but its value is an empty string? Treat as absent → return 404.
- What happens if the OIDC token cannot be acquired (e.g. auth service down)? Surface this as a 502 upstream error.
- What happens if `kmeURL` is present but the URL is not a well-formed absolute URL? Return 400 Bad Request (same as missing/empty).
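
The invalid-JSON and empty-body cases above could be handled in one extraction step. The following is a sketch only: the real `extractArticleBody()` helper named in the commit note is not shown in this document, so `extractArticleBodySketch` and its `{ status, body }` return shape are hypothetical. The `articleBody` fallback follows the commit note, and treating a non-object JSON value as a 502 is an assumption.

```javascript
// Hypothetical sketch; the real helper's signature is not given in this spec.
function extractArticleBodySketch(rawBody) {
  let doc = rawBody;
  if (typeof rawBody === "string") {
    try {
      doc = JSON.parse(rawBody);
    } catch {
      return { status: 502, body: null }; // unparseable upstream body
    }
  }
  if (doc === null || typeof doc !== "object") {
    return { status: 502, body: null }; // assumption: not a JSON-LD document
  }
  // Fallback order from the commit note: vkm:articleBody, then articleBody.
  const body = doc["vkm:articleBody"] ?? doc["articleBody"];
  if (typeof body !== "string" || body === "") {
    return { status: 404, body: null }; // absent, null, or empty
  }
  return { status: 200, body };
}
```

Because `??` only falls through on `null` or `undefined`, an empty-string `vkm:articleBody` is not replaced by the fallback field and yields 404, matching the edge case above.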
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: The proxy MUST detect when an incoming request URL does NOT end in `/sitemap.xml` AND contains a non-empty `kmeURL` query parameter, and route such requests through the content-fetch flow.
- **FR-002**: The proxy MUST use the `kmeURL` parameter value exactly as provided — without any manipulation, re-encoding, or URL construction — as the target URL for the upstream GET request.
- **FR-003**: The proxy MUST attach an `Authorization: OIDC_id_token {token}` header to the upstream GET request, obtaining the token via `getValidToken()` from `kmeContentSourceAdapterHelpers`.
- **FR-004**: The upstream GET request MUST have a timeout of 10 seconds.
- **FR-005**: On a successful upstream response, the proxy MUST extract the `vkm:articleBody` field from the JSON-LD response body.
- **FR-006**: The proxy MUST return the `vkm:articleBody` value as the HTTP response body with status 200 and `Content-Type: text/html`.
- **FR-007**: If `kmeURL` is absent, empty, or blank, the proxy MUST return HTTP 400 without making an upstream request.
- **FR-008**: If `kmeURL` is present but not a well-formed absolute URL, the proxy MUST return HTTP 400.
- **FR-009**: If the upstream request results in a 4xx response, or the `vkm:articleBody` field is absent, null, or empty in an otherwise successful response, the proxy MUST return HTTP 404 to the caller.
- **FR-010**: If the upstream request results in a 5xx response, a timeout, a network error, or an unparseable response body, the proxy MUST return HTTP 502 to the caller.
- **FR-011**: If the OIDC token cannot be acquired, the proxy MUST return HTTP 502 to the caller.
- **FR-012**: Requests that neither end in `/sitemap.xml` nor carry a `kmeURL` parameter MUST continue to receive the existing 200 OK passthrough response, unchanged.
- **FR-013**: The content-fetch flow MUST be implemented entirely within the VM sandbox file (`src/proxyScripts/kmeContentSourceAdapter.js`) using only the injected context variables (`axios`, `kmeContentSourceAdapterHelpers`, `kme_CSA_settings`, `console`, `URLSearchParams`, `URL`, `req`, `res`) — no new imports or module-level exports are permitted.
### Key Entities
- **KME Article Content**: Represents a single article fetched from the KME Content Service. Identified by its `vkm:url`. Key field: `vkm:articleBody` (HTML string).
- **OIDC Token**: A short-lived bearer credential used to authenticate requests to the KME Content Service. Managed by `getValidToken()`, which handles caching (Redis) and refresh transparently.
- **Proxy Request**: An incoming HTTP request received by the proxy script, carrying routing signals in the URL path (sitemap detection) and query string (`kmeURL`).
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: A consumer submitting a valid `kmeURL` receives the corresponding article HTML body in under 11 seconds end-to-end (10 s upstream timeout + 1 s proxy overhead) under normal network conditions.
- **SC-002**: 100% of requests with a missing, empty, or malformed `kmeURL` parameter receive a 400 response without triggering any upstream call.
- **SC-003**: 100% of upstream 4xx responses and missing/empty `vkm:articleBody` scenarios result in a 404 response to the caller.
- **SC-004**: 100% of upstream 5xx, timeout, and network-error scenarios result in a 502 response to the caller.
- **SC-005**: All existing proxy routes (sitemap flow and passthrough) continue to behave identically to their pre-feature behaviour — zero regression.
- **SC-006**: The unit test suite for the proxy script achieves ≥90% branch coverage across the content-fetch flow, including all four error paths.
## Assumptions
- The `kmeURL` value provided by callers is the verbatim `vkm:url` value from the KME Search API response. The proxy does not need to validate its domain or path structure beyond confirming that it is a well-formed absolute URL.
- `getValidToken()` is already implemented, tested, and handles all OIDC token edge cases (expiry, Redis connectivity, refresh). This feature does not modify it.
- The `axios` instance injected into the VM context supports a `timeout` configuration option and throws a recognisable error on timeout (following standard axios behaviour).
- The KME Content Service always returns `Content-Type: application/ld+json` (or another JSON media type) for valid article requests; no binary or streaming responses are expected.
- The HTTP method is always GET for the content-fetch flow, and the proxy's inbound side has no authentication or session concept.
- The existing sitemap route detection (URL ends in `/sitemap.xml`) takes priority over the `kmeURL` check — a URL ending in `/sitemap.xml?kmeURL=...` would route to the sitemap flow, not the content-fetch flow.
- Error response bodies are plain text or minimal JSON — no prescribed format is required beyond the correct HTTP status code.
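
The sitemap-priority assumption above can be illustrated with a small routing sketch. It assumes the request target is a path plus query string (as in Node's `req.url`); `selectFlow`, its return labels, and the placeholder base URL are hypothetical. Requests that match neither flow are labelled `"other"` and left to the rules stated elsewhere in this spec.

```javascript
// Hypothetical helper: chooses between the sitemap and content-fetch flows.
function selectFlow(requestUrl) {
  // A placeholder base lets the URL constructor parse a relative request target.
  const parsed = new URL(requestUrl, "http://proxy.invalid");
  if (parsed.pathname.endsWith("/sitemap.xml")) {
    return "sitemap"; // takes priority even when kmeURL is present
  }
  const kmeURL = parsed.searchParams.get("kmeURL");
  if (kmeURL !== null && kmeURL.trim() !== "") {
    return "content-fetch";
  }
  return "other"; // handled by the remaining rules in this spec
}
```

For example, `selectFlow("/sitemap.xml?kmeURL=https://content.kme.example/articles/123")` selects the sitemap flow because the path check runs before the query string is examined.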