Files
Peter.Morton f840587e5e feat: content fetch, sitemap fixes, remove oidcAuthFlow
- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-23 16:40:06 -05:00

109 lines
9.6 KiB
Markdown

# Feature Specification: Sitemap XML Generation
**Feature Branch**: `002-sitemap-generation`
**Created**: 2025-07-14
**Status**: Draft
## User Scenarios & Testing *(mandatory)*
### User Story 1 — Search Crawler Discovers KME Content (Priority: P1)
A search engine crawler or sitemap consumer sends a `GET` request to the proxy adapter's sitemap endpoint. The adapter fetches all available knowledge items from the KME Knowledge Search Service and returns a standards-compliant `sitemap.xml` document that the crawler can index.
**Why this priority**: This is the core deliverable. Without a valid `sitemap.xml` response, no downstream indexing or content discovery is possible.
**Independent Test**: Can be fully tested by sending `GET /sitemap.xml` to a running adapter instance and verifying the returned XML body and `Content-Type` header, independent of all other routing behaviour.
**Acceptance Scenarios**:
1. **Given** the adapter is running and the KME Knowledge Search Service is available, **When** a consumer sends `GET <proxy-base-url>/sitemap.xml`, **Then** the adapter responds with HTTP 200, `Content-Type: application/xml`, and a body that is a well-formed XML sitemap containing one `<url>/<loc>` entry per knowledge item returned by the search service.
2. **Given** each search result contains a `vkm:url` field, **When** the sitemap is generated, **Then** every `<loc>` value follows the pattern `<proxyBaseUrl>?kmeURL=<vkm:url value>`.
3. **Given** the KME search service returns zero results, **When** the sitemap is generated, **Then** the adapter returns a valid, empty `<urlset>` document (no `<url>` elements) with HTTP 200.
---
### User Story 2 — Non-Sitemap Requests Continue to Use Existing Auth Flow (Priority: P2)
A client sends a request whose URL does *not* end in `/sitemap.xml`. The adapter executes the existing OIDC token-check flow (cache hit/miss, Redis, stampede guard) and responds `200 Authorized` or `401 Unauthorized` exactly as before.
**Why this priority**: Backwards compatibility with the existing OIDC proxy behaviour must be preserved; a regression here would break all current integrations.
**Independent Test**: Can be fully tested by sending any non-sitemap request and confirming the existing `200 Authorized` / `401 Unauthorized` response behaviour is unchanged.
**Acceptance Scenarios**:
1. **Given** a request URL that does not end in `/sitemap.xml`, **When** a valid cached OIDC token exists, **Then** the adapter responds `200 Authorized` with `Content-Type: text/plain`.
2. **Given** a request URL that does not end in `/sitemap.xml`, **When** no cached token exists, **Then** the adapter fetches a fresh OIDC token, caches it, and responds `200 Authorized`.
3. **Given** a request URL that does not end in `/sitemap.xml`, **When** the token service is unreachable, **Then** the adapter responds `401 Unauthorized` as it does today.
---
### User Story 3 — Sitemap Request Fails Gracefully When Search API Is Unavailable (Priority: P3)
When the KME Knowledge Search Service is unreachable or returns an error, the adapter returns a meaningful error response rather than hanging or crashing.
**Why this priority**: Graceful degradation protects the wider proxy from silent failures and aids operator debugging.
**Independent Test**: Can be fully tested by mocking the search API to return an error and confirming the adapter returns a 5xx response with a descriptive message.
**Acceptance Scenarios**:
1. **Given** the Knowledge Search Service returns a non-2xx HTTP status, **When** the sitemap is requested, **Then** the adapter responds with HTTP 502 and a plain-text error message describing the upstream failure.
2. **Given** the Knowledge Search Service connection times out, **When** the sitemap is requested, **Then** the adapter responds with HTTP 504 and a plain-text message indicating a gateway timeout.
---
### Edge Cases
- What happens when the OIDC token is expired at the moment the sitemap request arrives? The same token-refresh logic used by the existing auth flow must be invoked before calling the search API.
- What happens when a knowledge item has a missing or empty `vkm:url` field? That item must be omitted from the sitemap rather than producing a malformed `<loc>` entry.
- What happens when the search API returns a very large number of results? The sitemap should include all returned results; pagination handling is out of scope for v1 (assumption documented below).
- What happens when `searchApiBaseUrl`, `tenant`, or `proxyBaseUrl` are missing from the settings file? The adapter must respond with a `500` error and a descriptive message.
- What happens when `xmlbuilder2` is not available in the VM context? The adapter must respond with a `500` error.
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: The adapter MUST detect whether the incoming request URL ends with `/sitemap.xml` and route accordingly — to the sitemap generation flow or the existing OIDC auth flow.
- **FR-002**: When generating a sitemap, the adapter MUST retrieve knowledge items by calling the KME Knowledge Search Service at `<searchApiBaseUrl>/<tenant>` using a `GET` request.
- **FR-003**: Every Knowledge Search Service request MUST include an `Authorization` header with the value `OIDC_id_token <token>`, where `<token>` is the cached OIDC `id_token` obtained from Redis or refreshed using the existing stampede-guarded fetch logic.
- **FR-004**: The sitemap response MUST be a valid XML Sitemap conforming to the [Sitemaps protocol](https://www.sitemaps.org/protocol.html), with a `<urlset>` root element and one `<url>/<loc>` element per knowledge item.
- **FR-005**: Each `<loc>` value MUST be constructed as `<proxyBaseUrl>?kmeURL=<vkm:url value>`, where `proxyBaseUrl` is taken from `kme_CSA_settings.proxyBaseUrl`.
- **FR-006**: Knowledge items with a missing or empty `vkm:url` field MUST be silently omitted from the sitemap.
- **FR-007**: The sitemap response MUST be returned with the HTTP header `Content-Type: application/xml`.
- **FR-008**: The XML MUST be built using the `xmlbuilder2` utility already available in the VM context — no additional XML libraries may be imported.
- **FR-009**: The proxy script MUST contain zero `import` or `export` statements and MUST NOT reference `config`, `global.config`, or `process.env`.
- **FR-010**: `kme_CSA_settings.json` MUST be extended with three new fields: `searchApiBaseUrl`, `tenant`, and `proxyBaseUrl`.
- **FR-011**: If any required settings field (`searchApiBaseUrl`, `tenant`, `proxyBaseUrl`) is absent at runtime, the adapter MUST respond with HTTP 500 and a descriptive error message.
- **FR-012**: If the Knowledge Search Service responds with a non-2xx status, the adapter MUST respond with HTTP 502 and a plain-text description of the upstream error.
- **FR-013**: If the Knowledge Search Service connection times out, the adapter MUST respond with HTTP 504.
### Key Entities
- **Knowledge Item**: A document stored in KME, identified by a `vkm:url` field in the search result payload. The sitemap `<loc>` is derived from this URL.
- **Sitemap Entry**: A single `<url>/<loc>` element in the generated `sitemap.xml`, representing one indexable knowledge document URL accessible through the proxy adapter.
- **OIDC Token**: The cached `id_token` stored in Redis at `authorization.token`, used to authenticate calls to the Knowledge Search Service.
- **Settings**: Runtime configuration loaded from `kme_CSA_settings.json` and made available to the VM context as the `kme_CSA_settings` variable.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: A consumer requesting `/sitemap.xml` receives a well-formed, valid XML Sitemap document in under 5 seconds under normal network conditions.
- **SC-002**: All knowledge items returned by the search service are represented in the sitemap; zero items are silently dropped unless their `vkm:url` is empty or missing.
- **SC-003**: All existing non-sitemap requests continue to receive the same response behaviour (`200 Authorized` / `401 Unauthorized`) with no change in response time or correctness — zero regressions.
- **SC-004**: The returned `sitemap.xml` passes validation against the [Sitemaps XSD schema](https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd).
- **SC-005**: Error scenarios (upstream timeout, missing settings, unavailable search service) produce an appropriate HTTP error status code and a human-readable message within 10 seconds.
## Assumptions
- The KME Knowledge Search Service returns all relevant knowledge items in a single response for v1; pagination of search results is out of scope.
- The `vkm:url` field is present at the top level of each item object in the search results array; the exact response envelope shape will be confirmed against the live API during implementation.
- The `xmlbuilder2` injected into the VM context exposes a builder API compatible with the existing usage in the project (e.g., `fast-xml-parser` `XMLBuilder` or equivalent).
- No additional `<lastmod>`, `<changefreq>`, or `<priority>` elements are required in sitemap entries for v1; only `<loc>` is mandatory.
- The proxy adapter is deployed behind a reverse proxy or load balancer that handles TLS termination; the `proxyBaseUrl` in settings reflects the externally accessible HTTPS URL.
- A single tenant is configured per adapter deployment; multi-tenant sitemap generation is out of scope.
- Search result items without a `vkm:url` field are considered malformed and are omitted without raising an error — this matches common defensive data-handling practice.
- The request timeout for the Knowledge Search Service call is 10 seconds, consistent with industry-standard defaults for proxy-initiated upstream requests.