Files
kme_content_adapter/specs/002-sitemap-generation/tasks.md
Peter.Morton f840587e5e feat: content fetch, sitemap fixes, remove oidcAuthFlow
- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-23 16:40:06 -05:00

242 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Tasks: Sitemap XML Generation
**Feature**: `002-sitemap-generation`
**Input**: Design documents from `/specs/002-sitemap-generation/`
**Prerequisites**: plan.md ✅ spec.md ✅ research.md ✅ data-model.md ✅ contracts/sitemap-endpoint.md ✅ quickstart.md ✅
**Tests**: Included — Constitution Principle III (Test-First Development) is **REQUIRED** for this feature.
**Organization**: Tasks grouped by user story to enable independent implementation and testing.
## Format: `[ID] [P?] [Story] Description`
- **[P]**: Can run in parallel (different files, no dependencies on incomplete tasks)
- **[Story]**: User story this task belongs to (US1, US2, US3)
- Exact file paths in all descriptions
---
## Phase 1: Setup (Configuration)
**Purpose**: Extend the settings schema with the three new fields required by the sitemap flow.
These are pure JSON edits, independent of all code changes, and can be done in any order.
- [X] T001 [P] Add `searchApiBaseUrl`, `tenant`, and `proxyBaseUrl` fields to `src/globalVariables/kme_CSA_settings.json`
- [X] T002 [P] Add `searchApiBaseUrl`, `tenant`, and `proxyBaseUrl` placeholder entries to `src/globalVariables/kme_CSA_settings.json.example`
**Checkpoint**: Both settings files include all three new fields before Phase 2 begins.
---
## Phase 2: Foundational (Blocking Prerequisite)
**Purpose**: Restructure the single-IIFE proxy script so both the sitemap flow and the existing
OIDC auth flow share a clean entry point. **No user-story work can begin until this is done.**
- [X] T003 Restructure `src/proxyScripts/kmeContentSourceAdapter.js` IIFE
**Checkpoint**: `npm run test:unit` passes all **existing** auth-flow tests with zero failures after the restructure.
---
## Phase 3: User Story 1 — Search Crawler Discovers KME Content (Priority: P1) 🎯 MVP
**Goal**: A consumer calling `GET /sitemap.xml` receives a well-formed XML Sitemap containing
one `<url>/<loc>` per knowledge item, built via `xmlbuilder2`, with `Content-Type: application/xml`.
**Independent Test**: `curl http://localhost:3000/sitemap.xml` returns HTTP 200,
`Content-Type: application/xml`, and a body starting with `<?xml` containing `<urlset>`.
### Tests for User Story 1 ⚠️ Write first — confirm tests FAIL before implementing T006T008
- [X] T004 [P] [US1] Add `describe('sitemap flow')` block to `tests/unit/proxy.test.js` with these three test cases (each creates a vm context via the existing `makeContext` helper with `req.url` set to `'/sitemap.xml'`):
- **Happy path — items present**: mock `axios.get` resolving `{ data: { items: [{ 'vkm:url': 'https://kme.example.com/doc-1' }, { 'vkm:url': 'https://kme.example.com/doc-2' }] } }` with settings including `searchApiBaseUrl`, `tenant`, `proxyBaseUrl`; assert `res.statusCode === 200`, `res.headers['Content-Type'] === 'application/xml'`, body contains `<?xml`, `<urlset`, and `<loc>https://proxy.example.com?kmeURL=https%3A%2F%2Fkme.example.com%2Fdoc-1</loc>`
- **Happy path — zero items**: mock `axios.get` resolving `{ data: { items: [] } }`; assert 200, `application/xml`, body contains `<urlset` and does **not** contain `<url>`
- **Items with empty `vkm:url` filtered**: mock items array `[{ 'vkm:url': '' }, { 'vkm:url': 'https://kme.example.com/valid' }]`; assert body contains exactly one `<loc>` and it contains `valid`
- [X] T005 [P] [US1] Add `describe('sitemap endpoint')` block to `tests/contract/proxy-http.test.js` with these two contract tests (each starts a real HTTP server that runs the proxy script in a vm context, using `startMockTokenServer` pattern for a mock search server alongside the existing mock token server):
- **Full round-trip GET /sitemap.xml**: mock search server returns `{ items: [{ 'vkm:url': 'https://kme.example.com/doc-1' }] }`; send real `axios.get('http://localhost:<port>/sitemap.xml')`; assert status 200, `content-type` header contains `application/xml`, body is parseable XML containing `<loc>`
- **Empty results round-trip**: mock search server returns `{ items: [] }`; assert 200, `application/xml`, body contains `<urlset` and no `<url>` element
### Implementation for User Story 1
- [X] T006 [US1] Replace the `sitemapFlow()` stub in `src/proxyScripts/kmeContentSourceAdapter.js` with a settings validation guard: declare `const requiredSitemapFields = ['searchApiBaseUrl', 'tenant', 'proxyBaseUrl']`, loop over each field, and if `!kme_CSA_settings[field]` respond `res.writeHead(500, { 'Content-Type': 'text/plain' })` + `res.end('Configuration error: missing required field: ' + field)` + `return` (per FR-011 and R-005); add `const { searchApiBaseUrl, tenant, proxyBaseUrl } = kme_CSA_settings;` after the guard
- [X] T007 [US1] Add token fetch and search API call to `sitemapFlow()` in `src/proxyScripts/kmeContentSourceAdapter.js`: call `const token = await getValidToken();` (throws on failure, caught by outer try/catch → 401), then call `const searchResponse = await axios.get(\`${searchApiBaseUrl}/${tenant}\`, { headers: { Authorization: \`OIDC_id_token ${token}\` }, timeout: 10_000 })`, then extract `const items = searchResponse.data.items ?? searchResponse.data ?? [];` (per R-002)
- [X] T008 [US1] Add item mapping, XML build, and HTTP response to `sitemapFlow()` in `src/proxyScripts/kmeContentSourceAdapter.js`: iterate `items`, skip entries where `!item['vkm:url']` (FR-006), for each valid item compute `const loc = \`${proxyBaseUrl}?kmeURL=${encodeURIComponent(item['vkm:url'])}\`` (FR-005, R-006); build XML via `const doc = xmlbuilder2({ version: '1.0', encoding: 'UTF-8' }); const urlset = doc.ele('urlset', { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' }); urlset.ele('url').ele('loc').txt(loc).up().up();` (FR-008, R-003); serialise with `const xml = doc.end({ prettyPrint: false })`; respond `res.writeHead(200, { 'Content-Type': 'application/xml' }); res.end(xml);` (FR-007)
**Checkpoint**: `npm run test:unit` and `npm run test:contract` pass all sitemap happy-path tests.
At this point `GET /sitemap.xml` is fully functional; MVP is deliverable.
---
## Phase 4: User Story 2 — Non-Sitemap Requests Preserve Existing Auth Flow (Priority: P2)
**Goal**: Any request URL that does **not** end in `/sitemap.xml` continues to produce the same
`200 Authorized` / `401 Unauthorized` responses as before the refactoring in Phase 2.
**Independent Test**: `curl http://localhost:3000/` returns `200 Authorized` when a valid
cached token exists; returns `401 Unauthorized` when the token service is unreachable.
### Tests for User Story 2 ⚠️ Write first — confirm tests FAIL or are absent before implementing
- [X] T009 [P] [US2] Add `describe('non-sitemap URL routing')` block to `tests/unit/proxy.test.js` as a regression guard (if not already covered by existing tests): three test cases, each with `req.url = '/'` in the vm context:
- **Cache hit**: pre-populate Redis with a valid token and a future expiry timestamp; mock `axios.post` to fail (should never be called); assert `res.statusCode === 200`, body `=== 'Authorized'`, and `axios.post` was **not** called
- **Cache miss → fresh fetch**: Redis returns `null` for token; mock `axios.post` resolving `{ data: { id_token: 'tok', expires_in: 9999999999 } }`; assert 200 `Authorized` and that Redis `hSet` was called with `'authorization', 'token', 'tok'`
- **Token service down**: Redis returns `null`; mock `axios.post` rejecting with `{ code: 'ECONNABORTED' }`; assert `res.statusCode === 401`, body starts with `'Unauthorized:'`
- [X] T010 [P] [US2] Add a `describe('non-sitemap endpoint (regression)')` block to `tests/contract/proxy-http.test.js`: one contract test — `GET /` with a real mock token server returning valid OIDC credentials; assert HTTP 200 and body `'Authorized'`; confirms the `oidcAuthFlow()` extraction in Phase 2 did not introduce a regression
### Implementation for User Story 2
> The Phase 2 restructure (`oidcAuthFlow()` extraction) is the sole implementation for this story.
> If `npm run test:unit` passes all T009 cases after Phase 2, no additional code changes are needed.
- [X] T011 [US2] Review `oidcAuthFlow()` in `src/proxyScripts/kmeContentSourceAdapter.js` against the original script line-by-line: confirm the stampede guard (`_pendingFetch` promise, `resolvePending`/`rejectPending`), `hSet` cache write of both `token` and `expiry`, `console.debug`/`console.info`/`console.error` calls, and all error-path `res.writeHead(401)` / `res.end('Unauthorized: …')` responses are byte-for-byte identical to the pre-refactor behaviour; update any divergence found
**Checkpoint**: `npm run test:unit` and `npm run test:contract` pass all non-sitemap tests with zero regressions.
---
## Phase 5: User Story 3 — Sitemap Request Fails Gracefully (Priority: P3)
**Goal**: When the KME Knowledge Search Service is unavailable or returns an error, the adapter
responds with a meaningful 5xx code and a human-readable message within 10 seconds.
**Independent Test**: Mock the search server to respond 503; adapter returns 502 with body
`Search service error: HTTP 503`. Mock the search server to time out; adapter returns 504.
### Tests for User Story 3 ⚠️ Write first — confirm tests FAIL before implementing T013
- [X] T011 [P] [US3] Add error-scenario test cases to the existing `describe('sitemap flow')` block in `tests/unit/proxy.test.js` (append after T004 cases):
- **Upstream 503**: mock `axios.get` rejecting with `{ response: { status: 503 } }`; assert `res.statusCode === 502`, body contains `'Search service error: HTTP 503'` (FR-012)
- **Timeout ECONNABORTED**: mock `axios.get` rejecting with `{ code: 'ECONNABORTED' }`; assert `res.statusCode === 504`, body contains `'Search service timeout'` (FR-013)
- **Timeout ERR_CANCELED**: mock `axios.get` rejecting with `{ code: 'ERR_CANCELED' }`; assert `res.statusCode === 504`, body contains `'Search service timeout'`
- **Missing `searchApiBaseUrl`**: set `kme_CSA_settings.searchApiBaseUrl = undefined`; assert 500, body `=== 'Configuration error: missing required field: searchApiBaseUrl'`
- **Missing `tenant`**: set `kme_CSA_settings.tenant = undefined`; assert 500, body `=== 'Configuration error: missing required field: tenant'`
- **Missing `proxyBaseUrl`**: set `kme_CSA_settings.proxyBaseUrl = undefined`; assert 500, body `=== 'Configuration error: missing required field: proxyBaseUrl'`
- [X] T012 [P] [US3] Add error-scenario contract tests to the existing `describe('sitemap endpoint')` block in `tests/contract/proxy-http.test.js`:
- **Search server returns 503**: mock search server responds 503; send real `GET /sitemap.xml`; assert HTTP 502 from adapter
- **Search server hangs >10 s**: mock search server accepts the connection but never responds; send `GET /sitemap.xml` with a 15 s client timeout; assert adapter responds 504 within 12 s (accounts for 10 s upstream timeout + adapter overhead)
### Implementation for User Story 3
- [X] T013 [US3] Wrap the body of `sitemapFlow()` in `src/proxyScripts/kmeContentSourceAdapter.js` in a `try/catch` block (surrounding the search API call and XML generation in T007T008, **after** the settings validation guard which remains outside): in the `catch (err)` handler, check `err.code === 'ECONNABORTED' || err.code === 'ERR_CANCELED'``res.writeHead(504, { 'Content-Type': 'text/plain' }); res.end('Search service timeout');`; else if `err.response``res.writeHead(502, { 'Content-Type': 'text/plain' }); res.end('Search service error: HTTP ' + err.response.status);`; else → `res.writeHead(502, { 'Content-Type': 'text/plain' }); res.end('Search service error: ' + err.message);` (per R-004 and contracts/sitemap-endpoint.md)
**Checkpoint**: `npm run test:unit` and `npm run test:contract` pass all error-scenario tests.
---
## Phase 6: Polish & Cross-Cutting Concerns
**Purpose**: Constitution compliance, API shape verification, and final test suite green.
- [X] T014 [P] Verify `src/proxyScripts/kmeContentSourceAdapter.js` constitution compliance: run `grep -n 'import\|export\|process\.env\|global\.config\b\|config\.' src/proxyScripts/kmeContentSourceAdapter.js` and confirm zero matches (FR-009, Constitution §I); confirm `xmlbuilder2` is the sole XML-building mechanism (FR-008); confirm no new files were created in `src/`
- [X] T015 [P] Verify live search API response shape against R-002 assumption: using a test token, call `GET ${searchApiBaseUrl}/${tenant}` manually with `curl -H "Authorization: OIDC_id_token <token>" <searchApiBaseUrl>/<tenant>` and confirm (a) the top-level key holding the items array (`items` vs `results` vs bare array) and (b) that `vkm:url` is a direct string property of each item; update the extraction line `response.data.items ?? response.data` in T007 if the actual shape differs
- [X] T016 Run the full test suite `npm test` and confirm all unit and contract tests pass with zero failures, zero skipped tests, and no uncaught promise rejections
---
## Dependencies
```
T001 ──────────────────────────────────────────────────────── (no deps, run any time)
T002 ──────────────────────────────────────────────────────── (no deps, run any time)
T003 ──────────────────────────────────────────────────────── (no deps, but do after T001/T002)
T004 ──────────── depends on T003 (needs restructured script to run in vm context)
T005 ──────────── depends on T003
T006 ──────────── depends on T003, T004, T005 (test-first: tests written before impl)
T007 ──────────── depends on T006
T008 ──────────── depends on T007
T009 ──────────── depends on T003 (regression tests for existing flow; parallel with T004T008)
T010 ──────────── depends on T003
T011 [US2] ─────── depends on T003, T009, T010
T011 [US3] ─────── depends on T003, T007 (error tests need the search call in place)
T012 ──────────── depends on T003, T007
T013 ──────────── depends on T011[US3], T012 (tests written, confirmed failing)
T014 ──────────── depends on T003T013 (final compliance check)
T015 ──────────── depends on T007 (search API shape may affect the items extraction line)
T016 ──────────── depends on all implementation tasks
```
> **Note on task ID collision**: T011 appears in both Phase 4 (US2 implementation review) and
> Phase 5 (US3 error-scenario unit tests). When tracking execution order, treat the Phase 4 task
> as T011a and the Phase 5 task as T011b. Recommended execution order: T011a before T011b
> (confirm US2 is clean before adding US3 error cases).
---
## Parallel Execution Examples
### Within Phase 1 (both independent JSON edits):
```
T001 ──────► done
T002 ──────► done
```
### After Phase 2 foundation, US1 tests and US2 tests can be written in parallel:
```
T003 complete
├── T004 (US1 unit tests) ──────────►
├── T005 (US1 contract tests) ──────►
├── T009 (US2 unit tests) ──────────► all done → T006 → T007 → T008 → T011a
└── T010 (US2 contract tests) ───────►
```
### After T007, US3 tests can be written while US1 XML build (T008) proceeds:
```
T007 complete
├── T008 (US1 XML build + response) ──────►
├── T011b (US3 unit tests) ────────────────► both done → T013
└── T012 (US3 contract tests) ────────────►
```
### Final polish tasks are independent of each other:
```
T014 (compliance check) ──────►
T015 (live API check) ────────► T016 (npm test)
```
---
## Implementation Strategy
### MVP (User Story 1 only — Phases 13)
Completing T001T008 delivers the entire core value:
- `GET /sitemap.xml` returns a valid XML Sitemap for all KME knowledge items
- Zero breaking changes to existing non-sitemap behaviour (preserved by T003 restructure)
- Settings schema extended with the three new fields
US2 (backwards compatibility) and US3 (graceful degradation) are additive hardening on top
of the MVP and can be delivered in a follow-up iteration if needed.
### Incremental delivery order
1. **Iteration 1** (MVP): T001 → T002 → T003 → T004 + T005 → T006 → T007 → T008
2. **Iteration 2** (Hardening): T009 + T010 → T011a → T011b + T012 → T013
3. **Iteration 3** (Polish): T014 + T015 → T016
---
## Format Validation
All tasks follow the required checklist format:
```
- [ ] [TaskID] [P?] [Story?] Description with file path
```
| Check | Result |
|---|---|
| All tasks start with `- [ ]` checkbox | ✅ |
| All tasks have a sequential ID (T001T016) | ✅ |
| `[P]` only on tasks modifying different files with no unmet dependencies | ✅ |
| `[US1]`/`[US2]`/`[US3]` labels only on user-story phase tasks | ✅ |
| Setup/Foundational/Polish tasks have no story label | ✅ |
| All tasks name at least one explicit file path | ✅ |