# Feature Specification: Sitemap XML Generation

**Feature Branch**: `002-sitemap-generation`
**Created**: 2025-07-14
**Status**: Draft

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Search Crawler Discovers KME Content (Priority: P1)

A search engine crawler or sitemap consumer sends a `GET` request to the proxy adapter's sitemap endpoint. The adapter fetches all available knowledge items from the KME Knowledge Search Service and returns a standards-compliant `sitemap.xml` document that the crawler can index.

**Why this priority**: This is the core deliverable. Without a valid `sitemap.xml` response, no downstream indexing or content discovery is possible.

**Independent Test**: Can be fully tested by sending `GET /sitemap.xml` to a running adapter instance and verifying the returned XML body and `Content-Type` header, independent of all other routing behaviour.

**Acceptance Scenarios**:

1. **Given** the adapter is running and the KME Knowledge Search Service is available, **When** a consumer sends `GET /sitemap.xml`, **Then** the adapter responds with HTTP 200, `Content-Type: application/xml`, and a body that is a well-formed XML sitemap containing one `<url>`/`<loc>` entry per knowledge item returned by the search service.
2. **Given** each search result contains a `vkm:url` field, **When** the sitemap is generated, **Then** every `<loc>` value follows the pattern `<proxyBaseUrl>?kmeURL=<vkm:url>`.
3. **Given** the KME search service returns zero results, **When** the sitemap is generated, **Then** the adapter returns a valid, empty `<urlset>` document (no `<url>` elements) with HTTP 200.

---

### User Story 2 - Non-Sitemap Requests Continue to Use Existing Auth Flow (Priority: P2)

A client sends a request whose URL does *not* end in `/sitemap.xml`. The adapter executes the existing OIDC token-check flow (cache hit/miss, Redis, stampede guard) and responds `200 Authorized` or `401 Unauthorized` exactly as before.

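The routing decision described here can be illustrated with a minimal sketch. The function name `selectFlow` and its return values are hypothetical and not part of the adapter; the sketch applies the "ends in `/sitemap.xml`" rule literally and assumes no URL normalisation, since the specification does not say how query strings or trailing slashes are treated.

```javascript
// Hypothetical helper; the name and the "sitemap"/"auth" labels are
// illustrative only and do not come from the existing proxy script.
function selectFlow(requestUrl) {
  // Literal reading of the rule: the URL must end with "/sitemap.xml".
  // Query strings and trailing slashes are NOT normalised here because
  // the specification does not define their handling (assumption).
  return String(requestUrl).endsWith("/sitemap.xml") ? "sitemap" : "auth";
}
```

Under this literal reading, a request such as `/sitemap.xml?x=1` falls through to the existing auth flow; whether that is the desired behaviour would need to be confirmed during implementation.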

**Why this priority**: Backwards compatibility with the existing OIDC proxy behaviour must be preserved; a regression here would break all current integrations.

**Independent Test**: Can be fully tested by sending any non-sitemap request and confirming the existing `200 Authorized` / `401 Unauthorized` response behaviour is unchanged.

**Acceptance Scenarios**:

1. **Given** a request URL that does not end in `/sitemap.xml`, **When** a valid cached OIDC token exists, **Then** the adapter responds `200 Authorized` with `Content-Type: text/plain`.
2. **Given** a request URL that does not end in `/sitemap.xml`, **When** no cached token exists, **Then** the adapter fetches a fresh OIDC token, caches it, and responds `200 Authorized`.
3. **Given** a request URL that does not end in `/sitemap.xml`, **When** the token service is unreachable, **Then** the adapter responds `401 Unauthorized` as it does today.

---

### User Story 3 - Sitemap Request Fails Gracefully When Search API Is Unavailable (Priority: P3)

When the KME Knowledge Search Service is unreachable or returns an error, the adapter returns a meaningful error response rather than hanging or crashing.

**Why this priority**: Graceful degradation protects the wider proxy from silent failures and aids operator debugging.

**Independent Test**: Can be fully tested by mocking the search API to return an error and confirming the adapter returns a 5xx response with a descriptive message.

**Acceptance Scenarios**:

1. **Given** the Knowledge Search Service returns a non-2xx HTTP status, **When** the sitemap is requested, **Then** the adapter responds with HTTP 502 and a plain-text error message describing the upstream failure.
2. **Given** the Knowledge Search Service connection times out, **When** the sitemap is requested, **Then** the adapter responds with HTTP 504 and a plain-text message indicating a gateway timeout.

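The two failure scenarios above can be sketched as a small mapping function. This is only an illustration under stated assumptions: the helper name `mapUpstreamOutcome`, the shape of its `outcome` argument (`{ timedOut, status }`), and the message wording are all hypothetical, and the way the real HTTP client signals a timeout may differ.

```javascript
// Hypothetical helper: maps the outcome of the Knowledge Search Service call
// to the adapter's error response, or returns null when there is no error.
// Assumption: the caller reports a timeout as { timedOut: true } and a
// completed call as { status: <HTTP status code> }.
function mapUpstreamOutcome(outcome) {
  if (outcome.timedOut) {
    // Scenario 2: upstream connection timed out -> 504 Gateway Timeout.
    return {
      status: 504,
      contentType: "text/plain",
      body: "Gateway Timeout: Knowledge Search Service did not respond in time",
    };
  }
  if (outcome.status < 200 || outcome.status >= 300) {
    // Scenario 1: upstream answered with a non-2xx status -> 502 Bad Gateway.
    return {
      status: 502,
      contentType: "text/plain",
      body: "Bad Gateway: Knowledge Search Service returned HTTP " + outcome.status,
    };
  }
  // 2xx: no error response needed; sitemap generation can continue.
  return null;
}
```

Keeping the mapping in one pure function makes the "Independent Test" for this story straightforward, because the search API does not need to be reachable to exercise each branch.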

---

### Edge Cases

- What happens when the OIDC token is expired at the moment the sitemap request arrives? The same token-refresh logic used by the existing auth flow must be invoked before calling the search API.
- What happens when a knowledge item has a missing or empty `vkm:url` field? That item must be omitted from the sitemap rather than producing a malformed `<url>` entry.
- What happens when the search API returns a very large number of results? The sitemap should include all returned results; pagination handling is out of scope for v1 (assumption documented below).
- What happens when `searchApiBaseUrl`, `tenant`, or `proxyBaseUrl` are missing from the settings file? The adapter must respond with a `500` error and a descriptive message.
- What happens when `xmlBuilder` is not available in the VM context? The adapter must respond with a `500` error.

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: The adapter MUST detect whether the incoming request URL ends with `/sitemap.xml` and route accordingly: to the sitemap generation flow or to the existing OIDC auth flow.
- **FR-002**: When generating a sitemap, the adapter MUST retrieve knowledge items by calling the KME Knowledge Search Service at `/` using a `GET` request.
- **FR-003**: Every Knowledge Search Service request MUST include an `Authorization` header with the value `OIDC_id_token <id_token>`, where `<id_token>` is the cached OIDC `id_token` obtained from Redis or refreshed using the existing stampede-guarded fetch logic.
- **FR-004**: The sitemap response MUST be a valid XML Sitemap conforming to the [Sitemaps protocol](https://www.sitemaps.org/protocol.html), with a `<urlset>` root element and one `<url>`/`<loc>` element per knowledge item.
- **FR-005**: Each `<loc>` value MUST be constructed as `<proxyBaseUrl>?kmeURL=<vkm:url>`, where `proxyBaseUrl` is taken from `kme_CSA_settings.proxyBaseUrl`.
- **FR-006**: Knowledge items with a missing or empty `vkm:url` field MUST be silently omitted from the sitemap.
- **FR-007**: The sitemap response MUST be returned with the HTTP header `Content-Type: application/xml`.
- **FR-008**: The XML MUST be built using the `xmlBuilder` utility already available in the VM context; no additional XML libraries may be imported.
- **FR-009**: The proxy script MUST contain zero `import` or `export` statements and MUST NOT reference `config`, `global.config`, or `process.env`.
- **FR-010**: `kme_CSA_settings.json` MUST be extended with three new fields: `searchApiBaseUrl`, `tenant`, and `proxyBaseUrl`.
- **FR-011**: If any required settings field (`searchApiBaseUrl`, `tenant`, `proxyBaseUrl`) is absent at runtime, the adapter MUST respond with HTTP 500 and a descriptive error message.
- **FR-012**: If the Knowledge Search Service responds with a non-2xx status, the adapter MUST respond with HTTP 502 and a plain-text description of the upstream error.
- **FR-013**: If the Knowledge Search Service connection times out, the adapter MUST respond with HTTP 504.

### Key Entities

- **Knowledge Item**: A document stored in KME, identified by a `vkm:url` field in the search result payload. The sitemap `<loc>` is derived from this URL.
- **Sitemap Entry**: A single `<url>`/`<loc>` element in the generated `sitemap.xml`, representing one indexable knowledge document URL accessible through the proxy adapter.
- **OIDC Token**: The cached `id_token` stored in Redis at `authorization.token`, used to authenticate calls to the Knowledge Search Service.
- **Settings**: Runtime configuration loaded from `kme_CSA_settings.json` and made available to the VM context as the `kme_CSA_settings` variable.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: A consumer requesting `/sitemap.xml` receives a well-formed, valid XML Sitemap document in under 5 seconds under normal network conditions.
- **SC-002**: All knowledge items returned by the search service are represented in the sitemap; zero items are silently dropped unless their `vkm:url` is empty or missing.
- **SC-003**: All existing non-sitemap requests continue to receive the same response behaviour (`200 Authorized` / `401 Unauthorized`) with no change in response time or correctness (zero regressions).
- **SC-004**: The returned `sitemap.xml` passes validation against the [Sitemaps XSD schema](https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd).
- **SC-005**: Error scenarios (upstream timeout, missing settings, unavailable search service) produce an appropriate HTTP error status code and a human-readable message within 10 seconds.

## Assumptions

- The KME Knowledge Search Service returns all relevant knowledge items in a single response for v1; pagination of search results is out of scope.
- The `vkm:url` field is present at the top level of each item object in the search results array; the exact response envelope shape will be confirmed against the live API during implementation.
- The `xmlBuilder` injected into the VM context exposes a builder API compatible with the existing usage in the project (e.g., `fast-xml-parser` `XMLBuilder` or equivalent).
- No additional `<lastmod>`, `<changefreq>`, or `<priority>` elements are required in sitemap entries for v1; only `<loc>` is mandatory.
- The proxy adapter is deployed behind a reverse proxy or load balancer that handles TLS termination; the `proxyBaseUrl` in settings reflects the externally accessible HTTPS URL.
- A single tenant is configured per adapter deployment; multi-tenant sitemap generation is out of scope.
- Search result items without a `vkm:url` field are considered malformed and are omitted without raising an error; this matches common defensive data-handling practice.
- The request timeout for the Knowledge Search Service call is 10 seconds, consistent with industry-standard defaults for proxy-initiated upstream requests.
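
As a non-normative illustration of the `<loc>` construction and omission rules (FR-005, FR-006), the sketch below derives the list of `<loc>` values from a search result array. The helper name `buildSitemapLocs` and the example base URL are hypothetical. It assumes, as stated above, that `vkm:url` sits at the top level of each item, and it concatenates the value without percent-encoding because the specification does not state whether encoding is required. Serialising the values into XML is left to `xmlBuilder`, whose exact API is not shown here.

```javascript
// Hypothetical helper: returns the <loc> strings for the sitemap.
// Items with a missing, non-string, or empty `vkm:url` are skipped (FR-006).
function buildSitemapLocs(items, proxyBaseUrl) {
  // Treat anything that is not an array as "zero results" (assumption).
  var list = Array.isArray(items) ? items : [];
  return list
    .map(function (item) {
      return item ? item["vkm:url"] : undefined;
    })
    .filter(function (url) {
      return typeof url === "string" && url.trim() !== "";
    })
    .map(function (url) {
      // FR-005 pattern: <proxyBaseUrl>?kmeURL=<vkm:url>
      // No percent-encoding is applied; the spec does not define it.
      return proxyBaseUrl + "?kmeURL=" + url;
    });
}

// Example usage with made-up data:
var locs = buildSitemapLocs(
  [{ "vkm:url": "docs/a" }, { "vkm:url": "" }, {}, { "vkm:url": "docs/b" }],
  "https://proxy.example.com/kme"
);
// locs contains two entries, one for "docs/a" and one for "docs/b".
```

An empty input array yields an empty list, which corresponds to the empty `<urlset>` document required when the search service returns zero results.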