feat: content fetch, sitemap fixes, remove oidcAuthFlow

- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-23 16:40:06 -05:00
parent d50f041488
commit f840587e5e
29 changed files with 1998 additions and 352 deletions

@@ -0,0 +1,201 @@
# HTTP Contract: Content Fetch Route

**Feature**: `003-kme-content-fetch`

**File**: `specs/003-kme-content-fetch/contracts/http-content-fetch.md`

This document defines the HTTP request/response contract for the content-fetch route exposed by the
KME Content Adapter proxy.

---
## Route
```
GET {proxy-base-url}?kmeURL={encoded-article-url}
```
The proxy detects the content-fetch route when:
- The incoming URL does **not** end in `/sitemap.xml`, AND
- The query string contains a `kmeURL` parameter (present, regardless of value)

Requests without `kmeURL` (and not a sitemap request) are routed to the existing auth-check
passthrough (returns 200 "Authorized").

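
The routing rule above can be sketched as a small classifier. This is an illustration only, not the proxy's actual code: the name `classifyRoute` and the route labels are hypothetical, and the sketch assumes that "ends in `/sitemap.xml`" refers to the URL path, so sitemap requests carrying query parameters still match.

```javascript
// Hypothetical helper: decide which flow handles an incoming request URL.
// Assumption: the "/sitemap.xml" check applies to the path, not the full URL.
function classifyRoute(rawUrl) {
  // A placeholder base lets us parse origin-relative URLs such as "/?kmeURL=...".
  const url = new URL(rawUrl, "http://placeholder.invalid");

  if (url.pathname.endsWith("/sitemap.xml")) {
    return "sitemap";
  }
  // has() is true even for "?kmeURL=" or "?kmeURL", which matches
  // "present, regardless of value".
  if (url.searchParams.has("kmeURL")) {
    return "content-fetch";
  }
  return "passthrough";
}
```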
---
## Request
### Method
`GET`
### Query Parameters
| Parameter | Required | Description |
|-----------|----------|-------------|
| `kmeURL` | Yes | The verbatim `vkm:url` value from the KME Search API response. Must be a well-formed absolute `http` or `https` URL. Percent-encoded characters are decoded once (standard URL decoding) — double-encoding must not occur. |
### Headers
None required on the inbound request. The proxy adds its own `Authorization` header on the upstream
request.
### Example Request
```
GET /?kmeURL=https%3A%2F%2Fcontent.kme.example%2Farticles%2F123 HTTP/1.1
Host: proxy.example.com
```
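
A caller can produce a request like the one above by percent-encoding the article URL exactly once. The sketch below is illustrative: `buildContentFetchUrl` and the proxy base URL are assumptions, not part of the proxy itself.

```javascript
// Hypothetical client-side helper: build the proxy request URL for an article.
// encodeURIComponent is applied exactly once, so a single decode on the proxy
// side recovers the original vkm:url value.
function buildContentFetchUrl(proxyBaseUrl, articleUrl) {
  return `${proxyBaseUrl}?kmeURL=${encodeURIComponent(articleUrl)}`;
}

const requestUrl = buildContentFetchUrl(
  "https://proxy.example.com/",
  "https://content.kme.example/articles/123"
);
// requestUrl is
// "https://proxy.example.com/?kmeURL=https%3A%2F%2Fcontent.kme.example%2Farticles%2F123"
```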
---
## Responses
### 200 OK — Article HTML Body
The article was successfully fetched and `vkm:articleBody` was extracted.
```
HTTP/1.1 200 OK
Content-Type: text/html

<p>Article content here...</p>
```
| Field | Value |
|-------|-------|
| Status | `200` |
| `Content-Type` | `text/html` |
| Body | Raw HTML string from `vkm:articleBody` field of the KME Content Service JSON-LD response. Not sanitised or transformed. |
---
### 400 Bad Request — Invalid `kmeURL`
Returned when `kmeURL` is absent, empty, whitespace-only, or not a well-formed absolute http/https URL.
No upstream request is made.
```
HTTP/1.1 400 Bad Request
Content-Type: text/plain

Bad Request: kmeURL parameter is required
```
```
HTTP/1.1 400 Bad Request
Content-Type: text/plain

Bad Request: kmeURL must be a well-formed absolute http/https URL
```
| Trigger | Response body |
|---------|---------------|
| `kmeURL` absent, empty, or whitespace | `Bad Request: kmeURL parameter is required` |
| `kmeURL` present but malformed or non-http/https | `Bad Request: kmeURL must be a well-formed absolute http/https URL` |
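
The two 400 cases in the table can be sketched as a single validation step. This is a minimal sketch under stated assumptions: `validateKmeUrl` and its return shape are hypothetical, and it relies on the standard `URL` constructor, whose parser is more lenient than a strict reading of "well-formed" might require.

```javascript
// Hypothetical validator mirroring the 400 rules in the table above.
// Returns { ok: true, url } or { ok: false, status, body }.
function validateKmeUrl(value) {
  const malformed = {
    ok: false,
    status: 400,
    body: "Bad Request: kmeURL must be a well-formed absolute http/https URL",
  };

  if (typeof value !== "string" || value.trim() === "") {
    return { ok: false, status: 400, body: "Bad Request: kmeURL parameter is required" };
  }

  let parsed;
  try {
    parsed = new URL(value); // throws on relative or unparseable input
  } catch {
    return malformed;
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return malformed;
  }
  // Return the original string rather than parsed.href so the upstream URL
  // stays verbatim.
  return { ok: true, url: value };
}
```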
---
### 404 Not Found — Article Not Found
Returned when the upstream KME Content Service returns a 4xx response for the article URL, or when
the upstream response does not contain a non-empty `vkm:articleBody`.
```
HTTP/1.1 404 Not Found
Content-Type: text/plain

Not Found: article not found at upstream
```
```
HTTP/1.1 404 Not Found
Content-Type: text/plain

Not Found: article body not present in upstream response
```
| Trigger | Response body |
|---------|---------------|
| Upstream 4xx HTTP response | `Not Found: article not found at upstream` |
| `vkm:articleBody` absent, null, or empty string | `Not Found: article body not present in upstream response` |
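
The commit message mentions an `extractArticleBody()` helper with a `vkm:articleBody` / `articleBody` fallback, although this contract only names `vkm:articleBody`. The helper itself is not shown here, so the following is only a guess at its shape; the signature and the "non-empty string" rule are assumptions.

```javascript
// Sketch only: the real extractArticleBody() is not shown in this commit view.
// Tries "vkm:articleBody" first, then falls back to "articleBody".
function extractArticleBody(doc) {
  for (const key of ["vkm:articleBody", "articleBody"]) {
    const value = doc[key];
    // Assumption: only a non-empty string counts as a usable body.
    if (typeof value === "string" && value !== "") {
      return value;
    }
  }
  // The caller would map null to the 404 "article body not present" response.
  return null;
}
```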
---
### 500 Internal Server Error — Proxy Configuration Error
Returned when a required OIDC setting is missing from `kme_CSA_settings`. Indicates a proxy
deployment/configuration issue.
```
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain

Configuration error: missing required field: tokenUrl
```
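
A configuration check producing this response might look like the sketch below. The name `findConfigError` is hypothetical, and only `tokenUrl` is named in this contract, so the full list of required fields in `kme_CSA_settings` is left for the caller to supply.

```javascript
// Hypothetical check for required settings. Only "tokenUrl" is named in the
// contract, so it is the sole default; pass the real list of required fields.
function findConfigError(settings, requiredFields = ["tokenUrl"]) {
  for (const field of requiredFields) {
    const value = settings ? settings[field] : undefined;
    // Assumption: undefined, null, and "" all count as missing.
    if (value === undefined || value === null || value === "") {
      return {
        status: 500,
        body: `Configuration error: missing required field: ${field}`,
      };
    }
  }
  return null; // configuration is complete
}
```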
---
### 502 Bad Gateway — Upstream or Token Failure
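Returned when token acquisition fails, or when the upstream request fails for any reason other than a 4xx response: a timeout, a network-level error, a 5xx response, or a response body that is not a JSON object.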
```
HTTP/1.1 502 Bad Gateway
Content-Type: text/plain

Bad Gateway: token acquisition failed
```
| Trigger | Response body |
|---------|---------------|
| OIDC token acquisition failure | `Bad Gateway: token acquisition failed` |
| Upstream request timeout (`ECONNABORTED`/`ERR_CANCELED`) | `Bad Gateway: upstream request timed out` |
| Upstream 5xx HTTP response | `Bad Gateway: upstream error HTTP {status}` |
| Network-level error (no HTTP response) | `Bad Gateway: {error message}` |
| Upstream response body is not valid JSON | `Bad Gateway: unparseable response from upstream` |
| Upstream response body is not an object | `Bad Gateway: unexpected response from upstream` |
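
The upstream rows of the table, together with the upstream 4xx rule from the 404 section, can be sketched as two small helpers. Both function names are hypothetical. The error shape (`err.code`, `err.response.status`, `err.message`) is an assumption modelled on common Node HTTP clients, and rejecting arrays as "not an object" is also an assumption, since the contract does not say how arrays are treated.

```javascript
// Hypothetical mapping from a failed upstream request to a proxy response.
function mapUpstreamFailure(err) {
  if (err.code === "ECONNABORTED" || err.code === "ERR_CANCELED") {
    return { status: 502, body: "Bad Gateway: upstream request timed out" };
  }
  const upstreamStatus = err.response ? err.response.status : undefined;
  if (upstreamStatus >= 400 && upstreamStatus < 500) {
    return { status: 404, body: "Not Found: article not found at upstream" };
  }
  if (upstreamStatus >= 500) {
    return { status: 502, body: `Bad Gateway: upstream error HTTP ${upstreamStatus}` };
  }
  // No usable HTTP status: surface the client's error message.
  return { status: 502, body: `Bad Gateway: ${err.message}` };
}

// Hypothetical parser for the upstream response body.
function parseUpstreamBody(text) {
  let parsed;
  try {
    parsed = JSON.parse(text);
  } catch {
    return { ok: false, status: 502, body: "Bad Gateway: unparseable response from upstream" };
  }
  // Assumption: arrays are rejected along with null and primitives.
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, status: 502, body: "Bad Gateway: unexpected response from upstream" };
  }
  return { ok: true, data: parsed };
}
```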
---
## Upstream Request (Proxy → KME Content Service)
The proxy makes a single GET request to the verbatim `kmeURL` value.
```
GET {kmeURL} HTTP/1.1
Authorization: OIDC_id_token {id_token}
```
| Field | Value |
|-------|-------|
| Method | `GET` |
| URL | Verbatim value of `kmeURL` query parameter — no manipulation, no re-encoding |
| `Authorization` | `OIDC_id_token {id_token}` where `id_token` is from `getValidToken()` |
| Timeout | 10 000 ms (10 seconds) |
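
The upstream request described in the table can be written down as a plain options object. This sketch only builds the description and does not send anything; `buildUpstreamRequest` is a hypothetical name, and the option names (`method`, `url`, `headers`, `timeout`) mirror a typical HTTP client configuration rather than anything confirmed by this contract.

```javascript
// Hypothetical helper: describe the single upstream GET without sending it.
const UPSTREAM_TIMEOUT_MS = 10000; // 10 seconds, per the table above

function buildUpstreamRequest(kmeUrl, idToken) {
  return {
    method: "GET",
    url: kmeUrl, // verbatim: no normalisation, no re-encoding
    headers: { Authorization: `OIDC_id_token ${idToken}` },
    timeout: UPSTREAM_TIMEOUT_MS,
  };
}
```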
---
## Error Mapping Summary
```
kmeURL absent/empty/whitespace    → 400
kmeURL malformed / non-http(s)    → 400
Missing OIDC config               → 500
Token acquisition failure         → 502
Upstream 4xx                      → 404
Upstream 5xx                      → 502
Upstream timeout                  → 502
Network error                     → 502
Unparseable response body         → 502
Response body not an object       → 502
vkm:articleBody absent/null/empty → 404
Success                           → 200 text/html
```
---
## Non-regression: Existing Routes
This feature does not change the behaviour of existing routes:
| Route | Behaviour |
|-------|-----------|
| URL ends in `/sitemap.xml` | Sitemap flow (unchanged) |
| No `kmeURL`, not sitemap | Auth-check passthrough → 200 "Authorized" (unchanged) |