- Refactor kmeContentSourceAdapter.js into getValidToken(), oidcAuthFlow(), and sitemapFlow(); add sitemap generation using hydra:member response structure
- Add searchApiBaseUrl, tenant, proxyBaseUrl fields to kme_CSA_settings.json and kme_CSA_settings.json.example
- Add 17 unit tests for sitemap flow and non-sitemap routing regression
- Add 5 contract tests for sitemap endpoint (proxy-http.test.js)
- Add [Unreleased] sitemap entry to CHANGELOG.md
- Add full specs/002-sitemap-generation/ artifact directory (spec, plan, tasks, data-model, contracts, research, quickstart, checklist)
- Update constitution.md: add redis as permitted global, refresh kme_CSA_settings references
- Update copilot-instructions.md SPECKIT marker to sitemap plan
Feature Specification: Sitemap XML Generation
Feature Branch: 002-sitemap-generation
Created: 2025-07-14
Status: Draft
User Scenarios & Testing (mandatory)
User Story 1 — Search Crawler Discovers KME Content (Priority: P1)
A search engine crawler or sitemap consumer sends a GET request to the proxy adapter's sitemap endpoint. The adapter fetches all available knowledge items from the KME Knowledge Search Service and returns a standards-compliant sitemap.xml document that the crawler can index.
Why this priority: This is the core deliverable. Without a valid sitemap.xml response, no downstream indexing or content discovery is possible.
Independent Test: Can be fully tested by sending GET /sitemap.xml to a running adapter instance and verifying the returned XML body and Content-Type header, independent of all other routing behaviour.
Acceptance Scenarios:
- Given the adapter is running and the KME Knowledge Search Service is available, When a consumer sends GET <proxy-base-url>/sitemap.xml, Then the adapter responds with HTTP 200, Content-Type: application/xml, and a body that is a well-formed XML sitemap containing one <url>/<loc> entry per knowledge item returned by the search service.
- Given each search result contains a vkm:url field, When the sitemap is generated, Then every <loc> value follows the pattern <proxyBaseUrl>?kmeURL=<vkm:url value>.
- Given the KME search service returns zero results, When the sitemap is generated, Then the adapter returns a valid, empty <urlset> document (no <url> elements) with HTTP 200.
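The scenarios above can be illustrated with a minimal sketch. It is not the adapter's implementation: the spec requires the injected xmlBuilder, whose API is not shown in this document, so plain string building stands in for it here. The function names (escapeXml, buildLoc, buildSitemapXml) and the example URLs are hypothetical, and whether the vkm:url value should also be percent-encoded is left open because the spec does not say.

```javascript
// Illustrative only: stands in for the xmlBuilder-based implementation required by the spec.

// Escape the characters that the Sitemaps protocol requires to be entity-escaped.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build one <loc> value following the pattern <proxyBaseUrl>?kmeURL=<vkm:url value>.
// Assumption: the vkm:url value is appended as-is (no percent-encoding).
function buildLoc(proxyBaseUrl, vkmUrl) {
  return proxyBaseUrl + '?kmeURL=' + vkmUrl;
}

// Build the whole document; an empty input yields an empty but valid <urlset>.
function buildSitemapXml(proxyBaseUrl, vkmUrls) {
  const entries = vkmUrls.map(
    (vkmUrl) => '  <url><loc>' + escapeXml(buildLoc(proxyBaseUrl, vkmUrl)) + '</loc></url>'
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}
```

Calling buildSitemapXml('https://proxy.example.com/kme', []) returns a document with a urlset root and no url elements, which corresponds to the zero-results scenario.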
User Story 2 — Non-Sitemap Requests Continue to Use Existing Auth Flow (Priority: P2)
A client sends a request whose URL does not end in /sitemap.xml. The adapter executes the existing OIDC token-check flow (cache hit/miss, Redis, stampede guard) and responds 200 Authorized or 401 Unauthorized exactly as before.
Why this priority: Backwards compatibility with the existing OIDC proxy behaviour must be preserved; a regression here would break all current integrations.
Independent Test: Can be fully tested by sending any non-sitemap request and confirming the existing 200 Authorized / 401 Unauthorized response behaviour is unchanged.
Acceptance Scenarios:
- Given a request URL that does not end in /sitemap.xml, When a valid cached OIDC token exists, Then the adapter responds 200 Authorized with Content-Type: text/plain.
- Given a request URL that does not end in /sitemap.xml, When no cached token exists, Then the adapter fetches a fresh OIDC token, caches it, and responds 200 Authorized.
- Given a request URL that does not end in /sitemap.xml, When the token service is unreachable, Then the adapter responds 401 Unauthorized as it does today.
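The routing decision behind these scenarios could look roughly like the sketch below. The flow names sitemapFlow and oidcAuthFlow come from the refactor described for kmeContentSourceAdapter.js, but their real signatures are not given here, so they are passed in as hypothetical zero-argument callbacks. The check is a strict suffix match on the URL as received; how URLs with a query string should be treated is not specified.

```javascript
// Strict reading of the spec: the request URL must end in /sitemap.xml.
function isSitemapRequest(requestUrl) {
  return typeof requestUrl === 'string' && requestUrl.endsWith('/sitemap.xml');
}

// Hypothetical dispatcher: `flows` supplies the two flows as callbacks.
function route(requestUrl, flows) {
  return isSitemapRequest(requestUrl) ? flows.sitemapFlow() : flows.oidcAuthFlow();
}
```

Keeping the predicate separate from the dispatch makes the non-sitemap regression easy to test in isolation: any URL that fails the suffix check must reach the unchanged OIDC flow.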
User Story 3 — Sitemap Request Fails Gracefully When Search API Is Unavailable (Priority: P3)
When the KME Knowledge Search Service is unreachable or returns an error, the adapter returns a meaningful error response rather than hanging or crashing.
Why this priority: Graceful degradation protects the wider proxy from silent failures and aids operator debugging.
Independent Test: Can be fully tested by mocking the search API to return an error and confirming the adapter returns a 5xx response with a descriptive message.
Acceptance Scenarios:
- Given the Knowledge Search Service returns a non-2xx HTTP status, When the sitemap is requested, Then the adapter responds with HTTP 502 and a plain-text error message describing the upstream failure.
- Given the Knowledge Search Service connection times out, When the sitemap is requested, Then the adapter responds with HTTP 504 and a plain-text message indicating a gateway timeout.
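One way to express the two failure scenarios above is a small mapping function. This is a sketch under stated assumptions: the shape of the outcome object (a timedOut flag and a numeric status) and the exact message wording are hypothetical, since the spec only fixes the status codes and the plain-text requirement.

```javascript
// Map the result of the upstream call to an error response, or null on success.
// `outcome` is a hypothetical shape: { timedOut: boolean, status: number }.
function mapUpstreamFailure(outcome) {
  const headers = { 'Content-Type': 'text/plain' };
  if (outcome.timedOut) {
    return {
      status: 504,
      headers,
      body: 'Gateway timeout: Knowledge Search Service did not respond in time',
    };
  }
  if (outcome.status < 200 || outcome.status >= 300) {
    return {
      status: 502,
      headers,
      body: 'Bad gateway: Knowledge Search Service responded with HTTP ' + outcome.status,
    };
  }
  return null; // 2xx response: nothing to map
}
```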
Edge Cases
- What happens when the OIDC token is expired at the moment the sitemap request arrives? The same token-refresh logic used by the existing auth flow must be invoked before calling the search API.
- What happens when a knowledge item has a missing or empty vkm:url field? That item must be omitted from the sitemap rather than producing a malformed <loc> entry.
- What happens when the search API returns a very large number of results? The sitemap should include all returned results; pagination handling is out of scope for v1 (assumption documented below).
- What happens when searchApiBaseUrl, tenant, or proxyBaseUrl are missing from the settings file? The adapter must respond with a 500 error and a descriptive message.
- What happens when xmlBuilder is not available in the VM context? The adapter must respond with a 500 error.
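The last two edge cases amount to a precondition check before the sitemap flow starts. The sketch below is one possible shape for it, not the adapter's code: the helper names and message texts are hypothetical, and treating an empty or non-string value as "missing" is an assumption that goes slightly beyond the spec's wording.

```javascript
const REQUIRED_SETTINGS = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl'];

// Return the names of required settings that are absent.
// Assumption: empty or non-string values count as missing.
function findMissingSettings(settings) {
  return REQUIRED_SETTINGS.filter((key) => {
    const value = settings ? settings[key] : undefined;
    return typeof value !== 'string' || value.trim() === '';
  });
}

// Return a 500 response describing the first failed precondition, or null if all pass.
function checkPreconditions(settings, xmlBuilder) {
  const missing = findMissingSettings(settings);
  if (missing.length > 0) {
    return { status: 500, body: 'Missing required settings: ' + missing.join(', ') };
  }
  if (xmlBuilder === undefined || xmlBuilder === null) {
    return { status: 500, body: 'xmlBuilder is not available in the VM context' };
  }
  return null;
}
```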
Requirements (mandatory)
Functional Requirements
- FR-001: The adapter MUST detect whether the incoming request URL ends with /sitemap.xml and route accordingly: to the sitemap generation flow or the existing OIDC auth flow.
- FR-002: When generating a sitemap, the adapter MUST retrieve knowledge items by calling the KME Knowledge Search Service at <searchApiBaseUrl>/<tenant> using a GET request.
- FR-003: Every Knowledge Search Service request MUST include an Authorization header with the value OIDC_id_token <token>, where <token> is the cached OIDC id_token obtained from Redis or refreshed using the existing stampede-guarded fetch logic.
- FR-004: The sitemap response MUST be a valid XML Sitemap conforming to the Sitemaps protocol, with a <urlset> root element and one <url>/<loc> element per knowledge item.
- FR-005: Each <loc> value MUST be constructed as <proxyBaseUrl>?kmeURL=<vkm:url value>, where proxyBaseUrl is taken from kme_CSA_settings.proxyBaseUrl.
- FR-006: Knowledge items with a missing or empty vkm:url field MUST be silently omitted from the sitemap.
- FR-007: The sitemap response MUST be returned with the HTTP header Content-Type: application/xml.
- FR-008: The XML MUST be built using the xmlBuilder utility already available in the VM context; no additional XML libraries may be imported.
- FR-009: The proxy script MUST contain zero import or export statements and MUST NOT reference config, global.config, or process.env.
- FR-010: kme_CSA_settings.json MUST be extended with three new fields: searchApiBaseUrl, tenant, and proxyBaseUrl.
- FR-011: If any required settings field (searchApiBaseUrl, tenant, proxyBaseUrl) is absent at runtime, the adapter MUST respond with HTTP 500 and a descriptive error message.
- FR-012: If the Knowledge Search Service responds with a non-2xx status, the adapter MUST respond with HTTP 502 and a plain-text description of the upstream error.
- FR-013: If the Knowledge Search Service connection times out, the adapter MUST respond with HTTP 504.
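FR-002 and FR-003 together describe the upstream request. A minimal sketch of assembling it follows; the function name buildSearchRequest and the returned object shape are hypothetical, and stripping trailing slashes from searchApiBaseUrl is an added assumption to avoid a doubled slash. How the request is actually sent depends on the HTTP client available in the VM context, which this document does not describe.

```javascript
// Assemble the Knowledge Search Service request described by FR-002 and FR-003.
// Returns a plain description of the request; sending it is out of scope for this sketch.
function buildSearchRequest(settings, idToken) {
  // Assumption: tolerate a trailing slash on the configured base URL.
  const base = settings.searchApiBaseUrl.replace(/\/+$/, '');
  return {
    method: 'GET',
    url: base + '/' + settings.tenant,
    headers: { Authorization: 'OIDC_id_token ' + idToken },
  };
}
```

Note that the scheme in the Authorization value is the literal string OIDC_id_token followed by a space and the token, as stated in FR-003, rather than the more common Bearer scheme.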
Key Entities
- Knowledge Item: A document stored in KME, identified by a vkm:url field in the search result payload. The sitemap <loc> is derived from this URL.
- Sitemap Entry: A single <url>/<loc> element in the generated sitemap.xml, representing one indexable knowledge document URL accessible through the proxy adapter.
- OIDC Token: The cached id_token stored in Redis at authorization.token, used to authenticate calls to the Knowledge Search Service.
- Settings: Runtime configuration loaded from kme_CSA_settings.json and made available to the VM context as the kme_CSA_settings variable.
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: A consumer requesting /sitemap.xml receives a well-formed, valid XML Sitemap document in under 5 seconds under normal network conditions.
- SC-002: All knowledge items returned by the search service are represented in the sitemap; zero items are silently dropped unless their vkm:url is empty or missing.
- SC-003: All existing non-sitemap requests continue to receive the same response behaviour (200 Authorized / 401 Unauthorized) with no change in response time or correctness, i.e. zero regressions.
- SC-004: The returned sitemap.xml passes validation against the Sitemaps XSD schema.
- SC-005: Error scenarios (upstream timeout, missing settings, unavailable search service) produce an appropriate HTTP error status code and a human-readable message within 10 seconds.
Assumptions
- The KME Knowledge Search Service returns all relevant knowledge items in a single response for v1; pagination of search results is out of scope.
- The vkm:url field is present at the top level of each item object in the search results array; the exact response envelope shape will be confirmed against the live API during implementation.
- The xmlBuilder injected into the VM context exposes a builder API compatible with the existing usage in the project (e.g., fast-xml-parser XMLBuilder or equivalent).
- No additional <lastmod>, <changefreq>, or <priority> elements are required in sitemap entries for v1; only <loc> is mandatory.
- The proxy adapter is deployed behind a reverse proxy or load balancer that handles TLS termination; the proxyBaseUrl in settings reflects the externally accessible HTTPS URL.
- A single tenant is configured per adapter deployment; multi-tenant sitemap generation is out of scope.
- Search result items without a vkm:url field are considered malformed and are omitted without raising an error; this matches common defensive data-handling practice.
- The request timeout for the Knowledge Search Service call is 10 seconds, consistent with industry-standard defaults for proxy-initiated upstream requests.
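The assumptions about the response shape can be made concrete with a short sketch. It assumes the items arrive in a hydra:member array (the structure named in the accompanying change description) with vkm:url at the top level of each item; as noted above, the real envelope still has to be confirmed against the live API. The function name extractVkmUrls is hypothetical.

```javascript
// Pull usable vkm:url values out of a search response.
// Assumption: items live in `hydra:member`; anything else yields an empty list.
function extractVkmUrls(searchResponse) {
  const members =
    searchResponse && Array.isArray(searchResponse['hydra:member'])
      ? searchResponse['hydra:member']
      : [];
  return members
    .map((item) => (item ? item['vkm:url'] : undefined))
    // Items with a missing or empty vkm:url are omitted without raising an error.
    .filter((url) => typeof url === 'string' && url.trim() !== '');
}
```

Because malformed items are filtered rather than rejected, a response in which every item lacks vkm:url produces the same empty list as a response with zero results.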