feat(002): add sitemap generation feature

- Refactor kmeContentSourceAdapter.js into getValidToken(), oidcAuthFlow(), and sitemapFlow(); add sitemap generation using hydra:member response structure - Add searchApiBaseUrl, tenant, proxyBaseUrl fields to kme_CSA_settings.json and kme_CSA_settings.json.example - Add 17 unit tests for sitemap flow and non-sitemap routing regression - Add 5 contract tests for sitemap endpoint (proxy-http.test.js) - Add [Unreleased] sitemap entry to CHANGELOG.md - Add full specs/002-sitemap-generation/ artifact directory (spec, plan, tasks, data-model, contracts, research, quickstart, checklist) - Update constitution.md: add redis as permitted global, refresh kme_CSA_settings references - Update copilot-instructions.md SPECKIT marker to sitemap plan
2026-04-22 22:08:08 -05:00
parent 49a6b2e4e7
commit 50b87297d2
17 changed files with 1879 additions and 40 deletions
--- a/specs/002-sitemap-generation/tasks.md
+++ b/specs/002-sitemap-generation/tasks.md
@@ -0,0 +1,241 @@
+# Tasks: Sitemap XML Generation
+
+**Feature**: `002-sitemap-generation`
+**Input**: Design documents from `/specs/002-sitemap-generation/`
+**Prerequisites**: plan.md ✅ spec.md ✅ research.md ✅ data-model.md ✅ contracts/sitemap-endpoint.md ✅ quickstart.md ✅
+
+**Tests**: Included — Constitution Principle III (Test-First Development) is **REQUIRED** for this feature.
+
+**Organization**: Tasks grouped by user story to enable independent implementation and testing.
+
+## Format: `[ID] [P?] [Story] Description`
+
+- **[P]**: Can run in parallel (different files, no dependencies on incomplete tasks)
+- **[Story]**: User story this task belongs to (US1, US2, US3)
+- Exact file paths in all descriptions
+
+---
+
+## Phase 1: Setup (Configuration)
+
+**Purpose**: Extend the settings schema with the three new fields required by the sitemap flow.
+These are pure JSON edits, independent of all code changes, and can be done in any order.
+
+- [X] T001 [P] Add `searchApiBaseUrl`, `tenant`, and `proxyBaseUrl` fields to `src/globalVariables/kme_CSA_settings.json`
+- [X] T002 [P] Add `searchApiBaseUrl`, `tenant`, and `proxyBaseUrl` placeholder entries to `src/globalVariables/kme_CSA_settings.json.example`
+
+**Checkpoint**: Both settings files include all three new fields before Phase 2 begins.
+
+---
+
+## Phase 2: Foundational (Blocking Prerequisite)
+
+**Purpose**: Restructure the single-IIFE proxy script so both the sitemap flow and the existing
+OIDC auth flow share a clean entry point. **No user-story work can begin until this is done.**
+
+- [X] T003 Restructure `src/proxyScripts/kmeContentSourceAdapter.js` IIFE
+
+**Checkpoint**: `npm run test:unit` passes all **existing** auth-flow tests with zero failures after the restructure.
+
+---
+
+## Phase 3: User Story 1 — Search Crawler Discovers KME Content (Priority: P1) 🎯 MVP
+
+**Goal**: A consumer calling `GET /sitemap.xml` receives a well-formed XML Sitemap containing
+one `<url>/<loc>` per knowledge item, built via `xmlBuilder`, with `Content-Type: application/xml`.
+
+**Independent Test**: `curl http://localhost:3000/sitemap.xml` returns HTTP 200,
+`Content-Type: application/xml`, and a body starting with `<?xml` containing `<urlset>`.
+
+### Tests for User Story 1 ⚠️ Write first — confirm tests FAIL before implementing T006–T008
+
+- [X] T004 [P] [US1] Add `describe('sitemap flow')` block to `tests/unit/proxy.test.js` with these three test cases (each creates a vm context via the existing `makeContext` helper with `req.url` set to `'/sitemap.xml'`):
+  - **Happy path — items present**: mock `axios.get` resolving `{ data: { items: [{ 'vkm:url': 'https://kme.example.com/doc-1' }, { 'vkm:url': 'https://kme.example.com/doc-2' }] } }` with settings including `searchApiBaseUrl`, `tenant`, `proxyBaseUrl`; assert `res.statusCode === 200`, `res.headers['Content-Type'] === 'application/xml'`, body contains `<?xml`, `<urlset`, and `<loc>https://proxy.example.com?kmeURL=https%3A%2F%2Fkme.example.com%2Fdoc-1</loc>`
+  - **Happy path — zero items**: mock `axios.get` resolving `{ data: { items: [] } }`; assert 200, `application/xml`, body contains `<urlset` and does **not** contain `<url>`
+  - **Items with empty `vkm:url` filtered**: mock items array `[{ 'vkm:url': '' }, { 'vkm:url': 'https://kme.example.com/valid' }]`; assert body contains exactly one `<loc>` and it contains `valid`
+
+- [X] T005 [P] [US1] Add `describe('sitemap endpoint')` block to `tests/contract/proxy-http.test.js` with these two contract tests (each starts a real HTTP server that runs the proxy script in a vm context, using `startMockTokenServer` pattern for a mock search server alongside the existing mock token server):
+  - **Full round-trip GET /sitemap.xml**: mock search server returns `{ items: [{ 'vkm:url': 'https://kme.example.com/doc-1' }] }`; send real `axios.get('http://localhost:<port>/sitemap.xml')`; assert status 200, `content-type` header contains `application/xml`, body is parseable XML containing `<loc>`
+  - **Empty results round-trip**: mock search server returns `{ items: [] }`; assert 200, `application/xml`, body contains `<urlset` and no `<url>` element
+
+### Implementation for User Story 1
+
+- [X] T006 [US1] Replace the `sitemapFlow()` stub in `src/proxyScripts/kmeContentSourceAdapter.js` with a settings validation guard: declare `const requiredSitemapFields = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl']`, loop over each field, and if `!kme_CSA_settings[field]` respond `res.writeHead(500, { 'Content-Type': 'text/plain' })` + `res.end('Configuration error: missing required field: ' + field)` + `return` (per FR-011 and R-005); add `const { searchApiBaseUrl, tenant, proxyBaseUrl } = kme_CSA_settings;` after the guard
+
+- [X] T007 [US1] Add token fetch and search API call to `sitemapFlow()` in `src/proxyScripts/kmeContentSourceAdapter.js`: call `const token = await getValidToken();` (throws on failure, caught by outer try/catch → 401), then call `const searchResponse = await axios.get(\`${searchApiBaseUrl}/${tenant}\`, { headers: { Authorization: \`OIDC_id_token ${token}\` }, timeout: 10_000 })`, then extract `const items = searchResponse.data.items ?? searchResponse.data ?? [];` (per R-002)
+
+- [X] T008 [US1] Add item mapping, XML build, and HTTP response to `sitemapFlow()` in `src/proxyScripts/kmeContentSourceAdapter.js`: iterate `items`, skip entries where `!item['vkm:url']` (FR-006), for each valid item compute `const loc = \`${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}\`` (FR-005, R-006); build XML via `const doc = xmlBuilder({ version: '1.0', encoding: 'UTF-8' }); const urlset = doc.ele('urlset', { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' }); urlset.ele('url').ele('loc').txt(loc).up().up();` (FR-008, R-003); serialise with `const xml = doc.end({ prettyPrint: false })`; respond `res.writeHead(200, { 'Content-Type': 'application/xml' }); res.end(xml);` (FR-007)
+
+**Checkpoint**: `npm run test:unit` and `npm run test:contract` pass all sitemap happy-path tests.
+At this point `GET /sitemap.xml` is fully functional; MVP is deliverable.
+
+---
+
+## Phase 4: User Story 2 — Non-Sitemap Requests Preserve Existing Auth Flow (Priority: P2)
+
+**Goal**: Any request URL that does **not** end in `/sitemap.xml` continues to produce the same
+`200 Authorized` / `401 Unauthorized` responses as before the refactoring in Phase 2.
+
+**Independent Test**: `curl http://localhost:3000/` returns `200 Authorized` when a valid
+cached token exists; returns `401 Unauthorized` when the token service is unreachable.
+
+### Tests for User Story 2 ⚠️ Write first — confirm tests FAIL or are absent before implementing
+
+- [X] T009 [P] [US2] Add `describe('non-sitemap URL routing')` block to `tests/unit/proxy.test.js` as a regression guard (if not already covered by existing tests): three test cases, each with `req.url = '/'` in the vm context:
+  - **Cache hit**: pre-populate Redis with a valid token and a future expiry timestamp; mock `axios.post` to fail (should never be called); assert `res.statusCode === 200`, body `=== 'Authorized'`, and `axios.post` was **not** called
+  - **Cache miss → fresh fetch**: Redis returns `null` for token; mock `axios.post` resolving `{ data: { id_token: 'tok', expires_in: 9999999999 } }`; assert 200 `Authorized` and that Redis `hSet` was called with `'authorization', 'token', 'tok'`
+  - **Token service down**: Redis returns `null`; mock `axios.post` rejecting with `{ code: 'ECONNABORTED' }`; assert `res.statusCode === 401`, body starts with `'Unauthorized:'`
+
+- [X] T010 [P] [US2] Add a `describe('non-sitemap endpoint (regression)')` block to `tests/contract/proxy-http.test.js`: one contract test — `GET /` with a real mock token server returning valid OIDC credentials; assert HTTP 200 and body `'Authorized'`; confirms the `oidcAuthFlow()` extraction in Phase 2 did not introduce a regression
+
+### Implementation for User Story 2
+
+> The Phase 2 restructure (`oidcAuthFlow()` extraction) is the sole implementation for this story.
+> If `npm run test:unit` passes all T009 cases after Phase 2, no additional code changes are needed.
+
+- [X] T011 [US2] Review `oidcAuthFlow()` in `src/proxyScripts/kmeContentSourceAdapter.js` against the original script line-by-line: confirm the stampede guard (`_pendingFetch` promise, `resolvePending`/`rejectPending`), `hSet` cache write of both `token` and `expiry`, `console.debug`/`console.info`/`console.error` calls, and all error-path `res.writeHead(401)` / `res.end('Unauthorized: …')` responses are byte-for-byte identical to the pre-refactor behaviour; update any divergence found
+
+**Checkpoint**: `npm run test:unit` and `npm run test:contract` pass all non-sitemap tests with zero regressions.
+
+---
+
+## Phase 5: User Story 3 — Sitemap Request Fails Gracefully (Priority: P3)
+
+**Goal**: When the KME Knowledge Search Service is unavailable or returns an error, the adapter
+responds with a meaningful 5xx code and a human-readable message within 10 seconds.
+
+**Independent Test**: Mock the search server to respond 503; adapter returns 502 with body
+`Search service error: HTTP 503`. Mock the search server to time out; adapter returns 504.
+
+### Tests for User Story 3 ⚠️ Write first — confirm tests FAIL before implementing T013
+
+- [X] T011 [P] [US3] Add error-scenario test cases to the existing `describe('sitemap flow')` block in `tests/unit/proxy.test.js` (append after T004 cases):
+  - **Upstream 503**: mock `axios.get` rejecting with `{ response: { status: 503 } }`; assert `res.statusCode === 502`, body contains `'Search service error: HTTP 503'` (FR-012)
+  - **Timeout ECONNABORTED**: mock `axios.get` rejecting with `{ code: 'ECONNABORTED' }`; assert `res.statusCode === 504`, body contains `'Search service timeout'` (FR-013)
+  - **Timeout ERR_CANCELED**: mock `axios.get` rejecting with `{ code: 'ERR_CANCELED' }`; assert `res.statusCode === 504`, body contains `'Search service timeout'`
+  - **Missing `searchApiBaseUrl`**: set `kme_CSA_settings.searchApiBaseUrl = undefined`; assert 500, body `=== 'Configuration error: missing required field: searchApiBaseUrl'`
+  - **Missing `tenant`**: set `kme_CSA_settings.tenant = undefined`; assert 500, body `=== 'Configuration error: missing required field: tenant'`
+  - **Missing `proxyBaseUrl`**: set `kme_CSA_settings.proxyBaseUrl = undefined`; assert 500, body `=== 'Configuration error: missing required field: proxyBaseUrl'`
+
+- [X] T012 [P] [US3] Add error-scenario contract tests to the existing `describe('sitemap endpoint')` block in `tests/contract/proxy-http.test.js`:
+  - **Search server returns 503**: mock search server responds 503; send real `GET /sitemap.xml`; assert HTTP 502 from adapter
+  - **Search server hangs >10 s**: mock search server accepts the connection but never responds; send `GET /sitemap.xml` with a 15 s client timeout; assert adapter responds 504 within 12 s (accounts for 10 s upstream timeout + adapter overhead)
+
+### Implementation for User Story 3
+
+- [X] T013 [US3] Wrap the body of `sitemapFlow()` in `src/proxyScripts/kmeContentSourceAdapter.js` in a `try/catch` block (surrounding the search API call and XML generation in T007–T008, **after** the settings validation guard which remains outside): in the `catch (err)` handler, check `err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'` → `res.writeHead(504, { 'Content-Type': 'text/plain' }); res.end('Search service timeout');`; else if `err.response` → `res.writeHead(502, { 'Content-Type': 'text/plain' }); res.end('Search service error: HTTP ' + err.response.status);`; else → `res.writeHead(502, { 'Content-Type': 'text/plain' }); res.end('Search service error: ' + err.message);` (per R-004 and contracts/sitemap-endpoint.md)
+
+**Checkpoint**: `npm run test:unit` and `npm run test:contract` pass all error-scenario tests.
+
+---
+
+## Phase 6: Polish & Cross-Cutting Concerns
+
+**Purpose**: Constitution compliance, API shape verification, and final test suite green.
+
+- [X] T014 [P] Verify `src/proxyScripts/kmeContentSourceAdapter.js` constitution compliance: run `grep -n 'import\|export\|process\.env\|global\.config\b\|config\.' src/proxyScripts/kmeContentSourceAdapter.js` and confirm zero matches (FR-009, Constitution §I); confirm `xmlBuilder` is the sole XML-building mechanism (FR-008); confirm no new files were created in `src/`
+
+- [X] T015 [P] Verify live search API response shape against R-002 assumption: using a test token, call `GET ${searchApiBaseUrl}/${tenant}` manually with `curl -H "Authorization: OIDC_id_token <token>" <searchApiBaseUrl>/<tenant>` and confirm (a) the top-level key holding the items array (`items` vs `results` vs bare array) and (b) that `vkm:url` is a direct string property of each item; update the extraction line `response.data.items ?? response.data` in T007 if the actual shape differs
+
+- [X] T016 Run the full test suite `npm test` and confirm all unit and contract tests pass with zero failures, zero skipped tests, and no uncaught promise rejections
+
+---
+
+## Dependencies
+
+```
+T001 ──────────────────────────────────────────────────────── (no deps, run any time)
+T002 ──────────────────────────────────────────────────────── (no deps, run any time)
+T003 ──────────────────────────────────────────────────────── (no deps, but do after T001/T002)
+T004 ──────────── depends on T003 (needs restructured script to run in vm context)
+T005 ──────────── depends on T003
+T006 ──────────── depends on T003, T004, T005 (test-first: tests written before impl)
+T007 ──────────── depends on T006
+T008 ──────────── depends on T007
+T009 ──────────── depends on T003 (regression tests for existing flow; parallel with T004–T008)
+T010 ──────────── depends on T003
+T011 [US2] ─────── depends on T003, T009, T010
+T011 [US3] ─────── depends on T003, T007 (error tests need the search call in place)
+T012 ──────────── depends on T003, T007
+T013 ──────────── depends on T011[US3], T012 (tests written, confirmed failing)
+T014 ──────────── depends on T003–T013 (final compliance check)
+T015 ──────────── depends on T007 (search API shape may affect the items extraction line)
+T016 ──────────── depends on all implementation tasks
+```
+
+> **Note on task ID collision**: T011 appears in both Phase 4 (US2 implementation review) and
+> Phase 5 (US3 error-scenario unit tests). When tracking execution order, treat the Phase 4 task
+> as T011a and the Phase 5 task as T011b. Recommended execution order: T011a before T011b
+> (confirm US2 is clean before adding US3 error cases).
+
+---
+
+## Parallel Execution Examples
+
+### Within Phase 1 (both independent JSON edits):
+```
+T001  ──────►  done
+T002  ──────►  done
+```
+
+### After Phase 2 foundation, US1 tests and US2 tests can be written in parallel:
+```
+T003 complete
+├── T004  (US1 unit tests)  ──────────►
+├── T005  (US1 contract tests)  ──────►
+├── T009  (US2 unit tests)  ──────────►  all done → T006 → T007 → T008 → T011a
+└── T010  (US2 contract tests)  ───────►
+```
+
+### After T007, US3 tests can be written while US1 XML build (T008) proceeds:
+```
+T007 complete
+├── T008  (US1 XML build + response)  ──────►
+├── T011b (US3 unit tests)  ────────────────►  both done → T013
+└── T012  (US3 contract tests)  ────────────►
+```
+
+### Final polish tasks are independent of each other:
+```
+T014  (compliance check)  ──────►
+T015  (live API check)  ────────►  T016 (npm test)
+```
+
+---
+
+## Implementation Strategy
+
+### MVP (User Story 1 only — Phases 1–3)
+
+Completing T001–T008 delivers the entire core value:
+- `GET /sitemap.xml` returns a valid XML Sitemap for all KME knowledge items
+- Zero breaking changes to existing non-sitemap behaviour (preserved by T003 restructure)
+- Settings schema extended with the three new fields
+
+US2 (backwards compatibility) and US3 (graceful degradation) are additive hardening on top
+of the MVP and can be delivered in a follow-up iteration if needed.
+
+### Incremental delivery order
+
+1. **Iteration 1** (MVP): T001 → T002 → T003 → T004 + T005 → T006 → T007 → T008
+2. **Iteration 2** (Hardening): T009 + T010 → T011a → T011b + T012 → T013
+3. **Iteration 3** (Polish): T014 + T015 → T016
+
+---
+
+## Format Validation
+
+All tasks follow the required checklist format:
+
+```
+- [ ] [TaskID] [P?] [Story?] Description with file path
+```
+
+| Check | Result |
+|---|---|
+| All tasks start with `- [ ]` checkbox | ✅ |
+| All tasks have a sequential ID (T001–T016) | ✅ |
+| `[P]` only on tasks modifying different files with no unmet dependencies | ✅ |
+| `[US1]`/`[US2]`/`[US3]` labels only on user-story phase tasks | ✅ |
+| Setup/Foundational/Polish tasks have no story label | ✅ |
+| All tasks name at least one explicit file path | ✅ |