Peter.Morton f840587e5e feat: content fetch, sitemap fixes, remove oidcAuthFlow
- Add contentFetchFlow() to proxy (FR-001 through FR-012)
- Add extractArticleBody() helper with vkm:articleBody / articleBody fallback
- Dynamic proxyBaseUrl derivation from x-forwarded-proto/host headers
- Forward query/size/category params on /sitemap.xml requests
- Add Accept: application/ld+json header to content API calls
- Remove oidcAuthFlow() - unmatched requests now return 404 Not Found
- Fix xmlbuilder2 import: default import, call as xmlbuilder2.create(...)
- Version bump 0.2.0 → 0.3.0
- 45/45 tests passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-23 16:40:06 -05:00

Quickstart: Sitemap XML Generation

Feature: 002-sitemap-generation
Branch: 002-sitemap-generation


What This Feature Does

Adds a GET /sitemap.xml endpoint to the kme-content-adapter proxy. When a crawler or sitemap consumer requests this URL, the adapter:

  1. Obtains a valid OIDC id_token from the Redis cache (refreshing if expired).
  2. Calls the KME Knowledge Search Service to retrieve all knowledge items.
  3. Builds a standards-compliant XML Sitemap (urlset) with one <loc> per item.
  4. Returns the sitemap as application/xml with HTTP 200.

All other requests continue to use the existing OIDC auth flow without modification.
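
Steps 3 and 4 above can be sketched without any dependencies. This is only an illustration, not the adapter's code: the real adapter uses xmlbuilder2, and the helper names below (escapeXml, buildLoc, buildSitemap) are hypothetical. It also assumes each search item exposes its URL as vkm:url and that the item URL is passed to the adapter in a kmeURL query parameter, as in the sample response in this guide.

```javascript
// Escape the five XML special characters so a URL is safe inside <loc>.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build the adapter URL for one knowledge item.
// Assumption: the item's source URL travels in a `kmeURL` query parameter.
function buildLoc(proxyBaseUrl, itemUrl) {
  return `${proxyBaseUrl}?kmeURL=${encodeURIComponent(itemUrl)}`;
}

// Turn search results into a sitemap document.
// Assumption: each item exposes its URL as item["vkm:url"].
function buildSitemap(items, proxyBaseUrl) {
  const entries = items
    .map((item) => item["vkm:url"])
    // Items without a usable URL are skipped rather than reported.
    .filter((url) => typeof url === "string" && url.trim() !== "")
    .map(
      (url) =>
        `  <url>\n    <loc>${escapeXml(buildLoc(proxyBaseUrl, url))}</loc>\n  </url>`
    );

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
  ].join("\n");
}

const xml = buildSitemap(
  [{ "vkm:url": "https://kme.example.com/doc-1" }, { "vkm:url": "" }],
  "https://proxy.example.com"
);
console.log(xml);
```

The second sample item has an empty vkm:url, so the output contains a single url entry.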


Setup

1. Add the new settings fields

Open src/globalVariables/kme_CSA_settings.json and add the three new fields (searchApiBaseUrl, tenant, and proxyBaseUrl) alongside the existing OIDC settings:

{
  "tokenUrl": "https://<your-oidc-host>/token",
  "username": "apiclient",
  "password": "<your-password>",
  "clientId": "<your-client-id>",
  "scope": "openid ...",
  "searchApiBaseUrl": "https://<kme-search-host>/api/search",
  "tenant": "<your-tenant-id>",
  "proxyBaseUrl": "https://<your-adapter-external-url>"
}
| Field | Description | Example |
| --- | --- | --- |
| searchApiBaseUrl | Base URL of the KME Knowledge Search Service | https://kme-qa.example.com/search |
| tenant | Tenant identifier appended to the search URL path | my-org |
| proxyBaseUrl | Externally accessible HTTPS URL of this adapter | https://proxy.example.com |

The adapter will call GET {searchApiBaseUrl}/{tenant} to retrieve knowledge items.
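
As a rough sketch of how that URL could be assembled from the settings file: buildSearchUrl is a hypothetical helper, and trimming trailing slashes and URL-encoding the tenant are assumptions added here for safety, not behavior confirmed by the adapter.

```javascript
// Join the base URL and tenant into the search endpoint URL.
// Trailing slashes on the base are trimmed so the path never contains "//".
function buildSearchUrl(settings) {
  const base = settings.searchApiBaseUrl.replace(/\/+$/, "");
  return `${base}/${encodeURIComponent(settings.tenant)}`;
}

const searchUrl = buildSearchUrl({
  searchApiBaseUrl: "https://kme-qa.example.com/search/",
  tenant: "my-org",
});
console.log(searchUrl); // https://kme-qa.example.com/search/my-org
```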

2. Start the adapter

npm run dev    # development (auto-restart on changes)
npm start      # production

Redis must be running and accessible (default: redis://localhost:6379).


Usage

Request the sitemap

curl -v http://localhost:3000/sitemap.xml

Expected response:

HTTP/1.1 200 OK
Content-Type: application/xml

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://proxy.example.com?kmeURL=https%3A%2F%2Fkme.example.com%2Fdoc-1</loc>
  </url>
  ...
</urlset>
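
Each <loc> carries the original knowledge item URL, percent-encoded, in the kmeURL query parameter. A consumer could recover it with the standard URL API, as in this small sketch (extractKmeUrl is a hypothetical helper name):

```javascript
// Recover the original knowledge item URL from a sitemap <loc> value.
// URLSearchParams decodes the percent-encoded value automatically.
function extractKmeUrl(loc) {
  return new URL(loc).searchParams.get("kmeURL");
}

const original = extractKmeUrl(
  "https://proxy.example.com?kmeURL=https%3A%2F%2Fkme.example.com%2Fdoc-1"
);
console.log(original); // https://kme.example.com/doc-1
```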

Validate the sitemap against the Sitemaps XSD

# Using xmllint (libxml2). Download the XSD first: xmllint cannot fetch schemas over HTTPS.
curl -sO https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd
curl -s http://localhost:3000/sitemap.xml | \
  xmllint --schema sitemap.xsd --noout -

Running the Tests

npm run test:unit      # unit tests (VM context mocking, no network)
npm run test:contract  # contract tests (real HTTP, mock token/search servers)
npm test               # all tests

Unit tests live in tests/unit/proxy.test.js. Contract tests live in tests/contract/proxy-http.test.js.


Error Scenarios

| Scenario | How to reproduce | Expected response |
| --- | --- | --- |
| Missing searchApiBaseUrl | Remove field from kme_CSA_settings.json, restart | 500 Configuration error: missing required field: searchApiBaseUrl |
| Search service down | Point searchApiBaseUrl to an unreachable host | 502 Search service error: HTTP <status> or 504 Search service timeout |
| Zero results | Search service returns empty items array | 200 OK with empty <urlset/> |
| Items with empty vkm:url | (covered by unit tests) | Items silently omitted from sitemap |
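
The first two scenarios could be produced by logic along these lines. This is a sketch under stated assumptions, not the adapter's actual code: checkSettings and classifySearchError are hypothetical names, treating tenant and proxyBaseUrl as required is a guess (only searchApiBaseUrl is documented above), and the error shape assumes axios conventions (err.response for HTTP errors, err.code set to ECONNABORTED on timeout).

```javascript
// Assumption: all three new settings are required; only searchApiBaseUrl is documented.
const REQUIRED_FIELDS = ["searchApiBaseUrl", "tenant", "proxyBaseUrl"];

// Return an error descriptor for the first missing setting, or null if all are present.
function checkSettings(settings) {
  const missing = REQUIRED_FIELDS.find((name) => !settings[name]);
  return missing
    ? { status: 500, message: `Configuration error: missing required field: ${missing}` }
    : null;
}

// Map a failed search call to the responses listed in the table.
// Assumption: errors follow the axios shape described in the lead-in.
function classifySearchError(err) {
  if (err.code === "ECONNABORTED") {
    return { status: 504, message: "Search service timeout" };
  }
  // No response object means the request never got an HTTP status back.
  const upstream = err.response ? err.response.status : "unknown";
  return { status: 502, message: `Search service error: HTTP ${upstream}` };
}

console.log(checkSettings({ tenant: "my-org", proxyBaseUrl: "https://proxy.example.com" }));
console.log(classifySearchError({ response: { status: 503 } }));
```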

Architecture Notes

  • No new files: All new logic is added directly to src/proxyScripts/kmeContentSourceAdapter.js (monolithic architecture constraint).
  • No new dependencies: xmlbuilder2 is already in package.json and injected into the VM context as xmlbuilder2.
  • Token reuse: The sitemap flow reuses the existing Redis hGet/token-refresh pattern — no separate auth logic.
  • VM isolation: The proxy script runs in a vm.createContext sandbox. It has access only to the injected globals listed in src/server.js (axios, redis, xmlbuilder2, kme_CSA_settings, req, res, console, URLSearchParams, URL, crypto).
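
The VM isolation point can be illustrated with Node's built-in vm module. This is a minimal CommonJS sketch, not a copy of src/server.js: the sandbox objects are hypothetical stand-ins, and only three of the injected globals are modeled.

```javascript
const vm = require("node:vm");

// Hypothetical stand-ins for the real injected objects.
const captured = {};
const sandbox = {
  kme_CSA_settings: { tenant: "my-org" },
  res: { end: (body) => { captured.body = body; } },
  console,
};

// Only the properties of `sandbox` are visible as globals inside the script.
const context = vm.createContext(sandbox);
const script = new vm.Script(`
  const hasRequire = typeof require !== "undefined";
  res.end("tenant=" + kme_CSA_settings.tenant + " require=" + hasRequire);
`);
script.runInContext(context);

console.log(captured.body); // tenant=my-org require=false
```

Because require is not among the injected globals, the script cannot load modules on its own. Anything it needs has to be passed in through the context object.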