kme_content_adapter/specs/002-sitemap-generation/spec.md
Peter.Morton 50b87297d2 feat(002): add sitemap generation feature
- Refactor kmeContentSourceAdapter.js into getValidToken(), oidcAuthFlow(),
  and sitemapFlow(); add sitemap generation using hydra:member response structure
- Add searchApiBaseUrl, tenant, proxyBaseUrl fields to kme_CSA_settings.json
  and kme_CSA_settings.json.example
- Add 17 unit tests for sitemap flow and non-sitemap routing regression
- Add 5 contract tests for sitemap endpoint (proxy-http.test.js)
- Add [Unreleased] sitemap entry to CHANGELOG.md
- Add full specs/002-sitemap-generation/ artifact directory
  (spec, plan, tasks, data-model, contracts, research, quickstart, checklist)
- Update constitution.md: add redis as permitted global, refresh
  kme_CSA_settings references
- Update copilot-instructions.md SPECKIT marker to sitemap plan
2026-04-22 22:08:08 -05:00

Feature Specification: Sitemap XML Generation

Feature Branch: 002-sitemap-generation
Created: 2025-07-14
Status: Draft

User Scenarios & Testing (mandatory)

User Story 1 — Search Crawler Discovers KME Content (Priority: P1)

A search engine crawler or sitemap consumer sends a GET request to the proxy adapter's sitemap endpoint. The adapter fetches all available knowledge items from the KME Knowledge Search Service and returns a standards-compliant sitemap.xml document that the crawler can index.

Why this priority: This is the core deliverable. Without a valid sitemap.xml response, no downstream indexing or content discovery is possible.

Independent Test: Can be fully tested by sending GET /sitemap.xml to a running adapter instance and verifying the returned XML body and Content-Type header, independent of all other routing behaviour.

Acceptance Scenarios:

  1. Given the adapter is running and the KME Knowledge Search Service is available, When a consumer sends GET <proxy-base-url>/sitemap.xml, Then the adapter responds with HTTP 200, Content-Type: application/xml, and a body that is a well-formed XML sitemap containing one <url>/<loc> entry per knowledge item returned by the search service.
  2. Given each search result contains a vkm:url field, When the sitemap is generated, Then every <loc> value follows the pattern <proxyBaseUrl>?kmeURL=<vkm:url value>.
  3. Given the KME search service returns zero results, When the sitemap is generated, Then the adapter returns a valid, empty <urlset> document (no <url> elements) with HTTP 200.
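
The three scenarios above can be illustrated with a minimal sketch. The helper names (buildSitemapLocs, toUrlsetObject) are hypothetical. The sketch assumes that items arrive under a hydra:member array, as the commit message suggests, and that vkm:url is a top-level string on each item. The '@_' attribute prefix follows fast-xml-parser's default naming and is only a guess at how the injected xmlBuilder is configured. The vkm:url value is appended as-is because the spec does not say whether it should be percent-encoded.

```javascript
// Hypothetical helper: derive the <loc> strings for the sitemap.
// Assumes items live under "hydra:member" and carry a top-level "vkm:url".
function buildSitemapLocs(searchResponse, proxyBaseUrl) {
  const members = searchResponse && searchResponse['hydra:member'];
  const items = Array.isArray(members) ? members : [];
  return items
    .map((item) => (item ? item['vkm:url'] : undefined))
    // Items with a missing or empty vkm:url are omitted.
    .filter((url) => typeof url === 'string' && url.trim() !== '')
    // Pattern from the spec: <proxyBaseUrl>?kmeURL=<vkm:url value>
    .map((url) => `${proxyBaseUrl}?kmeURL=${url}`);
}

// Hypothetical helper: the object tree one might hand to an XML builder.
// The '@_xmlns' key is an assumption about the builder's attribute prefix.
function toUrlsetObject(locs) {
  return {
    urlset: {
      '@_xmlns': 'http://www.sitemaps.org/schemas/sitemap/0.9',
      url: locs.map((loc) => ({ loc })),
    },
  };
}
```

With zero results, buildSitemapLocs returns an empty array and toUrlsetObject yields a urlset with no url entries, which corresponds to scenario 3.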

User Story 2 — Non-Sitemap Requests Continue to Use Existing Auth Flow (Priority: P2)

A client sends a request whose URL does not end in /sitemap.xml. The adapter executes the existing OIDC token-check flow (cache hit/miss, Redis, stampede guard) and responds 200 Authorized or 401 Unauthorized exactly as before.

Why this priority: Backwards compatibility with the existing OIDC proxy behaviour must be preserved; a regression here would break all current integrations.

Independent Test: Can be fully tested by sending any non-sitemap request and confirming the existing 200 Authorized / 401 Unauthorized response behaviour is unchanged.

Acceptance Scenarios:

  1. Given a request URL that does not end in /sitemap.xml, When a valid cached OIDC token exists, Then the adapter responds 200 Authorized with Content-Type: text/plain.
  2. Given a request URL that does not end in /sitemap.xml, When no cached token exists, Then the adapter fetches a fresh OIDC token, caches it, and responds 200 Authorized.
  3. Given a request URL that does not end in /sitemap.xml, When the token service is unreachable, Then the adapter responds 401 Unauthorized as it does today.
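
The routing decision behind these scenarios might look like the following sketch. The function name selectFlow and its string return values are hypothetical; the flow names mirror the sitemapFlow() and oidcAuthFlow() functions mentioned in the commit message. The check is a literal suffix test on the URL, as the spec words it, so a URL with a trailing query string would fall through to the auth flow.

```javascript
// Hypothetical router: decides which flow handles the request.
// A strict "ends with /sitemap.xml" test, matching the spec wording.
function selectFlow(requestUrl) {
  const url = typeof requestUrl === 'string' ? requestUrl : '';
  return url.endsWith('/sitemap.xml') ? 'sitemapFlow' : 'oidcAuthFlow';
}
```

Keeping the decision in one small pure function makes the non-sitemap regression easy to cover: every URL that does not match must map to the unchanged auth flow.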

User Story 3 — Sitemap Request Fails Gracefully When Search API Is Unavailable (Priority: P3)

When the KME Knowledge Search Service is unreachable or returns an error, the adapter returns a meaningful error response rather than hanging or crashing.

Why this priority: Graceful degradation protects the wider proxy from silent failures and aids operator debugging.

Independent Test: Can be fully tested by mocking the search API to return a non-2xx status or to time out, and confirming the adapter returns HTTP 502 or HTTP 504 respectively with a descriptive message.

Acceptance Scenarios:

  1. Given the Knowledge Search Service returns a non-2xx HTTP status, When the sitemap is requested, Then the adapter responds with HTTP 502 and a plain-text error message describing the upstream failure.
  2. Given the Knowledge Search Service connection times out, When the sitemap is requested, Then the adapter responds with HTTP 504 and a plain-text message indicating a gateway timeout.
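
One way to express the two failure scenarios is a small mapping function. This is a sketch only: the outcome object shape ({ timedOut, status }), the response object shape, and the message wording are all assumptions made for illustration and are not taken from the adapter.

```javascript
// Hypothetical mapping from an upstream outcome to an error response.
// `outcome` is an assumed shape: { timedOut: boolean, status: number }.
function upstreamFailureResponse(outcome) {
  const headers = { 'Content-Type': 'text/plain' };
  if (outcome.timedOut) {
    return {
      status: 504,
      headers,
      body: 'Gateway Timeout: Knowledge Search Service did not respond in time',
    };
  }
  const ok = outcome.status >= 200 && outcome.status < 300;
  if (!ok) {
    return {
      status: 502,
      headers,
      body: `Bad Gateway: Knowledge Search Service responded with HTTP ${outcome.status}`,
    };
  }
  // A 2xx outcome is not a failure; the caller proceeds to build the sitemap.
  return null;
}
```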

Edge Cases

  • What happens when the OIDC token is expired at the moment the sitemap request arrives? The same token-refresh logic used by the existing auth flow must be invoked before calling the search API.
  • What happens when a knowledge item has a missing or empty vkm:url field? That item must be omitted from the sitemap rather than producing a malformed <loc> entry.
  • What happens when the search API returns a very large number of results? The sitemap should include all returned results; pagination handling is out of scope for v1 (assumption documented below).
  • What happens when searchApiBaseUrl, tenant, or proxyBaseUrl are missing from the settings file? The adapter must respond with a 500 error and a descriptive message.
  • What happens when xmlBuilder is not available in the VM context? The adapter must respond with a 500 error.
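
The missing-settings case above could be checked up front, before any upstream call. The sketch below is illustrative: findMissingSettings and settingsErrorResponse are hypothetical names, treating an empty string as missing is an assumption, and the text/plain content type for the 500 response is also an assumption, since the spec only requires a descriptive message.

```javascript
// The three fields the spec adds to kme_CSA_settings.json.
const REQUIRED_SETTINGS = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl'];

// Hypothetical helper: lists required fields that are absent or blank.
function findMissingSettings(settings) {
  const source = settings || {};
  return REQUIRED_SETTINGS.filter((key) => {
    const value = source[key];
    return typeof value !== 'string' || value.trim() === '';
  });
}

// Hypothetical helper: returns a 500 response, or null when settings are complete.
function settingsErrorResponse(settings) {
  const missing = findMissingSettings(settings);
  if (missing.length === 0) return null;
  return {
    status: 500,
    headers: { 'Content-Type': 'text/plain' },
    body: `Missing required settings: ${missing.join(', ')}`,
  };
}
```

Naming the absent fields in the body gives operators the descriptive message the spec asks for without exposing any setting values.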

Requirements (mandatory)

Functional Requirements

  • FR-001: The adapter MUST detect whether the incoming request URL ends with /sitemap.xml and route accordingly — to the sitemap generation flow or the existing OIDC auth flow.
  • FR-002: When generating a sitemap, the adapter MUST retrieve knowledge items by calling the KME Knowledge Search Service at <searchApiBaseUrl>/<tenant> using a GET request.
  • FR-003: Every Knowledge Search Service request MUST include an Authorization header with the value OIDC_id_token <token>, where <token> is the cached OIDC id_token obtained from Redis or refreshed using the existing stampede-guarded fetch logic.
  • FR-004: The sitemap response MUST be a valid XML Sitemap conforming to the Sitemaps protocol, with a <urlset> root element and one <url>/<loc> element per knowledge item.
  • FR-005: Each <loc> value MUST be constructed as <proxyBaseUrl>?kmeURL=<vkm:url value>, where proxyBaseUrl is taken from kme_CSA_settings.proxyBaseUrl.
  • FR-006: Knowledge items with a missing or empty vkm:url field MUST be silently omitted from the sitemap.
  • FR-007: The sitemap response MUST be returned with the HTTP header Content-Type: application/xml.
  • FR-008: The XML MUST be built using the xmlBuilder utility already available in the VM context — no additional XML libraries may be imported.
  • FR-009: The proxy script MUST contain zero import or export statements and MUST NOT reference config, global.config, or process.env.
  • FR-010: kme_CSA_settings.json MUST be extended with three new fields: searchApiBaseUrl, tenant, and proxyBaseUrl.
  • FR-011: If any required settings field (searchApiBaseUrl, tenant, proxyBaseUrl) is absent at runtime, the adapter MUST respond with HTTP 500 and a descriptive error message.
  • FR-012: If the Knowledge Search Service responds with a non-2xx status, the adapter MUST respond with HTTP 502 and a plain-text description of the upstream error.
  • FR-013: If the Knowledge Search Service connection times out, the adapter MUST respond with HTTP 504.
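
FR-002 and FR-003 together describe the shape of the upstream request. A minimal sketch, assuming a plain request descriptor object (the buildSearchRequest name and the { method, url, headers } shape are hypothetical, and trimming a trailing slash from searchApiBaseUrl is an added convenience not stated in the spec):

```javascript
// Hypothetical helper: describes the Knowledge Search Service call.
// `token` is the cached OIDC id_token; how it is obtained is out of scope here.
function buildSearchRequest(settings, token) {
  // Assumption: tolerate a trailing slash on the configured base URL.
  const base = settings.searchApiBaseUrl.replace(/\/+$/, '');
  return {
    method: 'GET',
    url: `${base}/${settings.tenant}`,
    headers: { Authorization: `OIDC_id_token ${token}` },
  };
}
```

The descriptor is kept separate from the HTTP client so the URL and header format can be asserted in unit tests without a network call.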

Key Entities

  • Knowledge Item: A document stored in KME, identified by a vkm:url field in the search result payload. The sitemap <loc> is derived from this URL.
  • Sitemap Entry: A single <url>/<loc> element in the generated sitemap.xml, representing one indexable knowledge document URL accessible through the proxy adapter.
  • OIDC Token: The cached id_token stored in Redis at authorization.token, used to authenticate calls to the Knowledge Search Service.
  • Settings: Runtime configuration loaded from kme_CSA_settings.json and made available to the VM context as the kme_CSA_settings variable.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: A consumer requesting /sitemap.xml receives a well-formed, valid XML Sitemap document in under 5 seconds under normal network conditions.
  • SC-002: Every knowledge item returned by the search service is represented in the sitemap; an item is omitted only when its vkm:url is empty or missing.
  • SC-003: All existing non-sitemap requests continue to receive the same response behaviour (200 Authorized / 401 Unauthorized) with no change in response time or correctness — zero regressions.
  • SC-004: The returned sitemap.xml passes validation against the Sitemaps XSD schema.
  • SC-005: Error scenarios (upstream timeout, missing settings, unavailable search service) produce an appropriate HTTP error status code and a human-readable message within 10 seconds.

Assumptions

  • The KME Knowledge Search Service returns all relevant knowledge items in a single response for v1; pagination of search results is out of scope.
  • The vkm:url field is present at the top level of each item object in the search results array; the exact response envelope shape will be confirmed against the live API during implementation.
  • The xmlBuilder injected into the VM context exposes a builder API compatible with the existing usage in the project (e.g., fast-xml-parser XMLBuilder or equivalent).
  • No additional <lastmod>, <changefreq>, or <priority> elements are required in sitemap entries for v1; only <loc> is mandatory.
  • The proxy adapter is deployed behind a reverse proxy or load balancer that handles TLS termination; the proxyBaseUrl in settings reflects the externally accessible HTTPS URL.
  • A single tenant is configured per adapter deployment; multi-tenant sitemap generation is out of scope.
  • Search result items without a vkm:url field are considered malformed and are omitted without raising an error, as required by FR-006.
  • The request timeout for the Knowledge Search Service call is 10 seconds; a call that exceeds it produces the HTTP 504 response required by FR-013.
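
The 10-second timeout assumption could be enforced with a small promise wrapper. This is a sketch under stated assumptions: withTimeout and the UPSTREAM_TIMEOUT error code are hypothetical names, and it presumes that setTimeout and Promise are available in the VM context, which the spec does not confirm. The wrapper only stops waiting; it does not cancel the underlying request.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`Upstream call exceeded ${ms} ms`);
      err.code = 'UPSTREAM_TIMEOUT'; // assumed marker the caller can map to HTTP 504
      reject(err);
    }, ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A caller would wrap the search call as withTimeout(searchCall, 10000) and translate an error whose code is UPSTREAM_TIMEOUT into the gateway-timeout response.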