feat(002): add sitemap generation feature

- Refactor kmeContentSourceAdapter.js into getValidToken(), oidcAuthFlow(), and sitemapFlow(); add sitemap generation using hydra:member response structure
- Add searchApiBaseUrl, tenant, proxyBaseUrl fields to kme_CSA_settings.json and kme_CSA_settings.json.example
- Add 17 unit tests for sitemap flow and non-sitemap routing regression
- Add 5 contract tests for sitemap endpoint (proxy-http.test.js)
- Add [Unreleased] sitemap entry to CHANGELOG.md
- Add full specs/002-sitemap-generation/ artifact directory (spec, plan, tasks, data-model, contracts, research, quickstart, checklist)
- Update constitution.md: add redis as permitted global, refresh kme_CSA_settings references
- Update copilot-instructions.md SPECKIT marker to sitemap plan

108
specs/002-sitemap-generation/spec.md
Normal file
@@ -0,0 +1,108 @@
# Feature Specification: Sitemap XML Generation

**Feature Branch**: `002-sitemap-generation`
**Created**: 2025-07-14
**Status**: Draft

## User Scenarios & Testing *(mandatory)*

### User Story 1 — Search Crawler Discovers KME Content (Priority: P1)

A search engine crawler or sitemap consumer sends a `GET` request to the proxy adapter's sitemap endpoint. The adapter fetches all available knowledge items from the KME Knowledge Search Service and returns a standards-compliant `sitemap.xml` document that the crawler can index.

**Why this priority**: This is the core deliverable. Without a valid `sitemap.xml` response, no downstream indexing or content discovery is possible.

**Independent Test**: Can be fully tested by sending `GET /sitemap.xml` to a running adapter instance and verifying the returned XML body and `Content-Type` header, independent of all other routing behaviour.

**Acceptance Scenarios**:

1. **Given** the adapter is running and the KME Knowledge Search Service is available, **When** a consumer sends `GET <proxy-base-url>/sitemap.xml`, **Then** the adapter responds with HTTP 200, `Content-Type: application/xml`, and a body that is a well-formed XML sitemap containing one `<url>/<loc>` entry per knowledge item returned by the search service.
2. **Given** each search result contains a `vkm:url` field, **When** the sitemap is generated, **Then** every `<loc>` value follows the pattern `<proxyBaseUrl>?kmeURL=<vkm:url value>`.
3. **Given** the KME search service returns zero results, **When** the sitemap is generated, **Then** the adapter returns a valid, empty `<urlset>` document (no `<url>` elements) with HTTP 200.

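
As a rough illustration of the three scenarios above, the sketch below builds the sitemap body with plain string concatenation. It deliberately does not use `xmlBuilder`, because that utility's API is not described in this spec. The names `escapeXml` and `buildSitemapXml` are hypothetical. XML-escaping the `<loc>` value is an assumption made to keep the document well-formed; whether the `vkm:url` value should also be percent-encoded is not stated here, so the value is passed through unchanged.

```javascript
// Hypothetical helpers; plain strings stand in for the xmlBuilder utility.
const SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

// Escape the five XML-reserved characters so each <loc> stays well-formed.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build the sitemap body from an array of search result items.
function buildSitemapXml(items, proxyBaseUrl) {
  const entries = items
    .map((item) => item && item["vkm:url"])
    // Items with a missing or empty vkm:url are omitted.
    .filter((url) => typeof url === "string" && url.trim() !== "")
    .map((url) => `  <url><loc>${escapeXml(`${proxyBaseUrl}?kmeURL=${url}`)}</loc></url>`);

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<urlset xmlns="${SITEMAP_NS}">`,
    ...entries,
    "</urlset>",
  ].join("\n");
}
```

An empty `items` array yields a `<urlset>` element with no `<url>` children, which corresponds to scenario 3.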

---

### User Story 2 — Non-Sitemap Requests Continue to Use Existing Auth Flow (Priority: P2)

A client sends a request whose URL does *not* end in `/sitemap.xml`. The adapter executes the existing OIDC token-check flow (cache hit/miss, Redis, stampede guard) and responds `200 Authorized` or `401 Unauthorized` exactly as before.

**Why this priority**: Backwards compatibility with the existing OIDC proxy behaviour must be preserved; a regression here would break all current integrations.

**Independent Test**: Can be fully tested by sending any non-sitemap request and confirming the existing `200 Authorized` / `401 Unauthorized` response behaviour is unchanged.

**Acceptance Scenarios**:

1. **Given** a request URL that does not end in `/sitemap.xml`, **When** a valid cached OIDC token exists, **Then** the adapter responds `200 Authorized` with `Content-Type: text/plain`.
2. **Given** a request URL that does not end in `/sitemap.xml`, **When** no cached token exists, **Then** the adapter fetches a fresh OIDC token, caches it, and responds `200 Authorized`.
3. **Given** a request URL that does not end in `/sitemap.xml`, **When** the token service is unreachable, **Then** the adapter responds `401 Unauthorized` as it does today.

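
The routing decision behind this story could look roughly like the sketch below. The helper names `isSitemapRequest` and `selectFlow` are hypothetical, and ignoring the query string and fragment when matching is an assumption; the spec only says the URL "ends in `/sitemap.xml`". The flow names `sitemapFlow` and `oidcAuthFlow` are taken from the commit message and are returned here as strings purely for illustration.

```javascript
// Hypothetical helper: true when the request path ends with /sitemap.xml.
function isSitemapRequest(requestUrl) {
  // Assumption: the query string and fragment are ignored when matching.
  const path = String(requestUrl).split(/[?#]/)[0];
  return path.endsWith("/sitemap.xml");
}

// Returns the name of the flow that should handle the request.
// A real adapter would invoke the flow rather than return its name.
function selectFlow(requestUrl) {
  return isSitemapRequest(requestUrl) ? "sitemapFlow" : "oidcAuthFlow";
}
```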

---

### User Story 3 — Sitemap Request Fails Gracefully When Search API Is Unavailable (Priority: P3)

When the KME Knowledge Search Service is unreachable or returns an error, the adapter returns a meaningful error response rather than hanging or crashing.

**Why this priority**: Graceful degradation protects the wider proxy from silent failures and aids operator debugging.

**Independent Test**: Can be fully tested by mocking the search API to return an error and confirming the adapter returns a 5xx response with a descriptive message.

**Acceptance Scenarios**:

1. **Given** the Knowledge Search Service returns a non-2xx HTTP status, **When** the sitemap is requested, **Then** the adapter responds with HTTP 502 and a plain-text error message describing the upstream failure.
2. **Given** the Knowledge Search Service connection times out, **When** the sitemap is requested, **Then** the adapter responds with HTTP 504 and a plain-text message indicating a gateway timeout.

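
One way to express the two failure scenarios above is a small mapping function. This is only a sketch: the input shape `{ timedOut, status }`, the function name `mapUpstreamFailure`, and the exact message wording are all assumptions, since the spec fixes only the status codes and the plain-text requirement.

```javascript
// Hypothetical input shape: { timedOut: boolean, status: number }.
// Returns an error response descriptor, or null when the upstream call succeeded.
function mapUpstreamFailure(result) {
  if (result.timedOut) {
    return {
      status: 504,
      contentType: "text/plain",
      body: "Gateway timeout: Knowledge Search Service did not respond in time",
    };
  }
  if (result.status < 200 || result.status >= 300) {
    return {
      status: 502,
      contentType: "text/plain",
      body: `Bad gateway: Knowledge Search Service returned HTTP ${result.status}`,
    };
  }
  return null; // 2xx response: no error to report.
}
```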

---

### Edge Cases

- What happens when the OIDC token is expired at the moment the sitemap request arrives? The same token-refresh logic used by the existing auth flow must be invoked before calling the search API.
- What happens when a knowledge item has a missing or empty `vkm:url` field? That item must be omitted from the sitemap rather than producing a malformed `<loc>` entry.
- What happens when the search API returns a very large number of results? The sitemap should include all returned results; pagination handling is out of scope for v1 (assumption documented below).
- What happens when `searchApiBaseUrl`, `tenant`, or `proxyBaseUrl` are missing from the settings file? The adapter must respond with a `500` error and a descriptive message.
- What happens when `xmlBuilder` is not available in the VM context? The adapter must respond with a `500` error.

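
The missing-settings case above might be checked along these lines. The helper names `findMissingSettings` and `settingsErrorResponse` are hypothetical, and treating an empty or non-string value as missing is an assumption; the spec only speaks of fields being absent.

```javascript
const REQUIRED_SETTINGS = ["searchApiBaseUrl", "tenant", "proxyBaseUrl"];

// Return the names of required fields that are absent.
// Assumption: empty strings and non-string values also count as missing.
function findMissingSettings(settings) {
  return REQUIRED_SETTINGS.filter((key) => {
    const value = settings ? settings[key] : undefined;
    return typeof value !== "string" || value.trim() === "";
  });
}

// Build a 500 response descriptor, or null when all settings are present.
function settingsErrorResponse(settings) {
  const missing = findMissingSettings(settings);
  if (missing.length === 0) return null;
  return {
    status: 500,
    contentType: "text/plain",
    body: `Missing required settings: ${missing.join(", ")}`,
  };
}
```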
## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: The adapter MUST detect whether the incoming request URL ends with `/sitemap.xml` and route accordingly — to the sitemap generation flow or the existing OIDC auth flow.
- **FR-002**: When generating a sitemap, the adapter MUST retrieve knowledge items by calling the KME Knowledge Search Service at `<searchApiBaseUrl>/<tenant>` using a `GET` request.
- **FR-003**: Every Knowledge Search Service request MUST include an `Authorization` header with the value `OIDC_id_token <token>`, where `<token>` is the cached OIDC `id_token` obtained from Redis or refreshed using the existing stampede-guarded fetch logic.
- **FR-004**: The sitemap response MUST be a valid XML Sitemap conforming to the [Sitemaps protocol](https://www.sitemaps.org/protocol.html), with a `<urlset>` root element and one `<url>/<loc>` element per knowledge item.
- **FR-005**: Each `<loc>` value MUST be constructed as `<proxyBaseUrl>?kmeURL=<vkm:url value>`, where `proxyBaseUrl` is taken from `kme_CSA_settings.proxyBaseUrl`.
- **FR-006**: Knowledge items with a missing or empty `vkm:url` field MUST be silently omitted from the sitemap.
- **FR-007**: The sitemap response MUST be returned with the HTTP header `Content-Type: application/xml`.
- **FR-008**: The XML MUST be built using the `xmlBuilder` utility already available in the VM context — no additional XML libraries may be imported.
- **FR-009**: The proxy script MUST contain zero `import` or `export` statements and MUST NOT reference `config`, `global.config`, or `process.env`.
- **FR-010**: `kme_CSA_settings.json` MUST be extended with three new fields: `searchApiBaseUrl`, `tenant`, and `proxyBaseUrl`.
- **FR-011**: If any required settings field (`searchApiBaseUrl`, `tenant`, `proxyBaseUrl`) is absent at runtime, the adapter MUST respond with HTTP 500 and a descriptive error message.
- **FR-012**: If the Knowledge Search Service responds with a non-2xx status, the adapter MUST respond with HTTP 502 and a plain-text description of the upstream error.
- **FR-013**: If the Knowledge Search Service connection times out, the adapter MUST respond with HTTP 504.

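
To make FR-002 and FR-003 concrete, the sketch below assembles the search request as a plain descriptor object. The function name `buildSearchRequest` and the returned object shape are hypothetical, and trimming a trailing slash from `searchApiBaseUrl` is an assumption added to avoid a doubled `/`. How the request is actually sent depends on the HTTP client available in the VM context, which this spec does not name.

```javascript
// Hypothetical helper: describe the GET request to the Knowledge Search Service.
function buildSearchRequest(settings, idToken) {
  // Assumption: trailing slashes on the base URL are trimmed before joining.
  const base = settings.searchApiBaseUrl.replace(/\/+$/, "");
  return {
    method: "GET",
    url: `${base}/${settings.tenant}`,
    // FR-003: the header value is the literal scheme name followed by the token.
    headers: { Authorization: `OIDC_id_token ${idToken}` },
  };
}
```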
### Key Entities

- **Knowledge Item**: A document stored in KME, identified by a `vkm:url` field in the search result payload. The sitemap `<loc>` is derived from this URL.
- **Sitemap Entry**: A single `<url>/<loc>` element in the generated `sitemap.xml`, representing one indexable knowledge document URL accessible through the proxy adapter.
- **OIDC Token**: The cached `id_token` stored in Redis at `authorization.token`, used to authenticate calls to the Knowledge Search Service.
- **Settings**: Runtime configuration loaded from `kme_CSA_settings.json` and made available to the VM context as the `kme_CSA_settings` variable.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: A consumer requesting `/sitemap.xml` receives a well-formed, valid XML Sitemap document in under 5 seconds under normal network conditions.
- **SC-002**: All knowledge items returned by the search service are represented in the sitemap; zero items are silently dropped unless their `vkm:url` is empty or missing.
- **SC-003**: All existing non-sitemap requests continue to receive the same response behaviour (`200 Authorized` / `401 Unauthorized`) with no change in response time or correctness — zero regressions.
- **SC-004**: The returned `sitemap.xml` passes validation against the [Sitemaps XSD schema](https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd).
- **SC-005**: Error scenarios (upstream timeout, missing settings, unavailable search service) produce an appropriate HTTP error status code and a human-readable message within 10 seconds.

## Assumptions

- The KME Knowledge Search Service returns all relevant knowledge items in a single response for v1; pagination of search results is out of scope.
- The `vkm:url` field is present at the top level of each item object in the search results array; the exact response envelope shape will be confirmed against the live API during implementation.
- The `xmlBuilder` injected into the VM context exposes a builder API compatible with the existing usage in the project (e.g., `fast-xml-parser` `XMLBuilder` or equivalent).
- No additional `<lastmod>`, `<changefreq>`, or `<priority>` elements are required in sitemap entries for v1; only `<loc>` is mandatory.
- The proxy adapter is deployed behind a reverse proxy or load balancer that handles TLS termination; the `proxyBaseUrl` in settings reflects the externally accessible HTTPS URL.
- A single tenant is configured per adapter deployment; multi-tenant sitemap generation is out of scope.
- Search result items without a `vkm:url` field are considered malformed and are omitted without raising an error — this matches common defensive data-handling practice.
- The request timeout for the Knowledge Search Service call is 10 seconds, consistent with industry-standard defaults for proxy-initiated upstream requests.

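
The second assumption leaves the response envelope open. The commit message mentions a `hydra:member` response structure, so a defensive reader for that shape might look like the sketch below. The function name `extractItems` is hypothetical, and returning an empty array when the key is missing or not an array is an assumption rather than confirmed API behaviour.

```javascript
// Hypothetical helper: pull the item array out of a hydra-style envelope.
function extractItems(payload) {
  const members = payload && payload["hydra:member"];
  // Assumption: anything other than an array is treated as "no results".
  return Array.isArray(members) ? members : [];
}
```

Under this sketch an unexpected envelope produces an empty sitemap instead of an error; whether that is the desired behaviour is a design decision this spec does not settle.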