kme_content_adapter/README.md
Peter.Morton 7d6637effa docs: update README for v0.4.0
- Add Endpoints section documenting /sitemap.xml, /?kmeURL=, and 404 fallback
- Expand settings table with searchApiBaseUrl and tenant fields
- Update file tree to reflect kmeContentSourceAdapterHelpers.js
- Add Helpers section documenting each exported function
- Expand VM context globals table with helpers and correct xmlbuilder2 usage
- Note dynamic proxyBaseUrl derivation from request headers
- Add stampede guard detail to Token Caching section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-23 19:11:36 -05:00


# kme-content-adapter
An HTTP proxy adapter that searches and fetches content from KME (Knowledge Management Engine) and exposes it as a Sitemaps-compliant XML feed and individual HTML article pages. Business logic runs in an isolated Node.js VM sandbox, mirroring the IVA Studio proxy script execution environment.
## Requirements
- Node.js ≥ 18
- Redis (used for token caching)
- `jq` (optional — used by `npm start` for log pretty-printing)
## Setup
```bash
npm install
cp src/globalVariables/kme_CSA_settings.json.example src/globalVariables/kme_CSA_settings.json
# Edit kme_CSA_settings.json with real credentials
```
## Configuration
### `src/globalVariables/kme_CSA_settings.json`
Credentials and API settings — **never commit this file**.
```json
{
  "tokenUrl": "https://<host>/oidc-token-service/<env>/token",
  "username": "<username>",
  "password": "<password>",
  "clientId": "default",
  "scope": "openid tags content_entitlements",
  "searchApiBaseUrl": "https://<host>/km-search-service",
  "tenant": "<env>"
}
```
| Field | Description |
|---|---|
| `tokenUrl` | OIDC token endpoint |
| `username` / `password` | KME credentials |
| `clientId` | OAuth client ID (usually `default`) |
| `scope` | OAuth scopes |
| `searchApiBaseUrl` | KME Knowledge Search Service base URL |
| `tenant` | KME tenant/environment path segment (e.g. `qa`) |
### `config/default.json`
Infrastructure settings (port, host, log level). Override with environment variables:
| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | HTTP server port |
| `HOST` | `0.0.0.0` | Bind address |
| `LOG_LEVEL` | `debug` | Log level: `DEBUG`, `INFO`, `WARN`, `ERROR` |
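
As a rough illustration of how the environment overrides above could be resolved, here is a minimal sketch. The `defaults` object, the `resolveConfig` name, and the lowercase normalization of the log level are assumptions for this example, not taken from `config/default.json` or `server.js`.

```javascript
// Hypothetical defaults mirroring the table above; config/default.json may differ.
const defaults = { port: 3000, host: '0.0.0.0', logLevel: 'debug' };

// Resolve infrastructure settings, letting environment variables win over defaults.
function resolveConfig(env, base = defaults) {
  const port = Number.parseInt(env.PORT ?? '', 10);
  return {
    // Fall back to the default when PORT is unset or not a positive integer.
    port: Number.isInteger(port) && port > 0 ? port : base.port,
    host: env.HOST || base.host,
    // Assumption: level names are treated case-insensitively.
    logLevel: (env.LOG_LEVEL || base.logLevel).toLowerCase(),
  };
}

console.log(resolveConfig({ PORT: '8080', LOG_LEVEL: 'WARN' }));
```

In a real process the first argument would be `process.env`.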
## Endpoints
### `GET /sitemap.xml`
Returns a [Sitemaps protocol 0.9](https://www.sitemaps.org/protocol.html) XML document. Each `<loc>` points back to this adapter's content fetch endpoint so crawlers can retrieve individual articles.
**Query parameters** (all optional):
| Parameter | Default | Description |
|---|---|---|
| `query` | `*` | KME search query string |
| `size` | `100` | Max results per search page |
| `category` | `vkm:ArticleCategory` | KME category filter |
The adapter follows search result pages automatically, using `hydra:view['hydra:last']` to detect the final page. The response is capped at **50,000 URLs**, the per-file limit set by the Sitemaps protocol.
```
GET /sitemap.xml?query=temple&size=50&category=vkm:ArticleCategory
```
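
The pagination and cap described above might look roughly like the sketch below. It assumes each search response carries its items in `hydra:member` and that `hydra:view` exposes `@id` and `hydra:next` alongside `hydra:last`. `fetchPage` and `collectAllItems` are hypothetical names, and the adapter's actual loop may differ.

```javascript
const MAX_SITEMAP_URLS = 50000; // per-file limit from the Sitemaps protocol

// fetchPage(url) is a hypothetical function resolving to one parsed search response.
async function collectAllItems(firstUrl, fetchPage, cap = MAX_SITEMAP_URLS) {
  const items = [];
  const seen = new Set(); // guards against a view that loops back on itself
  let url = firstUrl;
  while (url && !seen.has(url) && items.length < cap) {
    seen.add(url);
    const page = await fetchPage(url);
    items.push(...(page['hydra:member'] ?? []));
    const view = page['hydra:view'] ?? {};
    // Stop on the last page (or when no view is present); otherwise follow hydra:next.
    const isLast = view['@id'] === view['hydra:last'];
    url = isLast ? null : (view['hydra:next'] ?? null);
  }
  return items.slice(0, cap);
}
```

Passing the page fetcher in as an argument keeps the loop testable without a live KME search service.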
### `GET /?kmeURL=<upstream-article-url>`
Fetches a single KME article by its upstream URL and returns it as a full HTML document.
```
GET /?kmeURL=https%3A%2F%2F<kme-host>%2Fkm-content-service%2F...
```
**Response:** `200 text/html; charset=utf-8` — a complete HTML document:
```html
<!DOCTYPE html>
<html>
<head><title>Article Title from vkm:name</title></head>
<body>
<!-- vkm:articleBody content verbatim -->
</body>
</html>
```
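
A minimal sketch of how such a document could be assembled from a content API response is shown here. It assumes the response is a flat object keyed by `vkm:name` and `vkm:articleBody`. Escaping the title is also an assumption of this example, and `renderArticleHtml` and `escapeHtml` are illustrative names rather than the adapter's own.

```javascript
// Escape text for safe use inside <title> (an assumption; the adapter may not escape).
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Build the full HTML document; the article body is inserted verbatim.
function renderArticleHtml(article) {
  const title = escapeHtml(article['vkm:name'] ?? '');
  const body = article['vkm:articleBody'] ?? article.articleBody ?? '';
  return [
    '<!DOCTYPE html>',
    '<html>',
    `<head><title>${title}</title></head>`,
    '<body>',
    body,
    '</body>',
    '</html>',
  ].join('\n');
}

console.log(renderArticleHtml({ 'vkm:name': 'Reset a password', 'vkm:articleBody': '<p>Steps</p>' }));
```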
**Error responses:**
| Status | Cause |
|---|---|
| `400` | `kmeURL` missing, blank, malformed, or non-http/https |
| `404` | Upstream returned 4xx, or article body absent in response |
| `502` | Token acquisition failed, upstream 5xx, network error, or timeout |
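
The status mapping in the table can be sketched as two small functions. Both names (`parseKmeUrl`, `mapUpstreamStatus`) are hypothetical, and the real adapter also returns `404` for an absent article body and `502` for token, network, and timeout failures, which this sketch does not cover.

```javascript
// Returns the parsed URL, or null when the request should be rejected with 400.
function parseKmeUrl(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') return null; // missing or blank
  let parsed;
  try {
    parsed = new URL(raw.trim());
  } catch {
    return null; // malformed
  }
  // Only http and https upstream URLs are accepted.
  return parsed.protocol === 'http:' || parsed.protocol === 'https:' ? parsed : null;
}

// Maps an upstream HTTP status to the status this adapter reports.
function mapUpstreamStatus(status) {
  if (status >= 400 && status < 500) return 404;
  if (status >= 500) return 502;
  return 200;
}

console.log(parseKmeUrl('ftp://example.com/a'), mapUpstreamStatus(503));
```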
### `GET /*` (anything else)
Returns `404 Not Found`.
---
## Running
```bash
npm run dev # Development — auto-restart on file changes
npm start # Production — logs piped through jq
```
## Testing
```bash
npm test # All tests
npm run test:unit # Unit tests only
npm run test:integration # Integration tests only
npm run test:contract # Contract tests only
# Single test file
node --test tests/unit/proxy.test.js
```
Tests use the Node.js built-in `node:test` runner; no external test framework is required.
## Architecture
The server loads `src/proxyScripts/kmeContentSourceAdapter.js` once at startup via `vm.Script`, then executes it in a **fresh isolated VM context per request** via `vm.createContext`.
```
src/
├── proxyScripts/
│   └── kmeContentSourceAdapter.js        # All business logic (zero imports/exports)
├── globalVariables/
│   ├── kme_CSA_settings.json             # Credentials & API config (gitignored)
│   ├── kme_CSA_settings.json.example     # Template for version control
│   └── kmeContentSourceAdapterHelpers.js # Pure utilities (literal function body)
├── logger.js                             # Structured JSON logger
└── server.js                             # HTTP server bootstrap only
config/
└── default.json                          # Infrastructure settings
```
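
The compile-once, run-per-request pattern described in this section can be sketched with the built-in `node:vm` module. The script body here is a tiny stand-in, not the real adapter, and `handle` is a hypothetical name. The real script is presumably asynchronous and receives many more globals.

```javascript
const vm = require('node:vm');

// Stand-in for the contents of kmeContentSourceAdapter.js (hypothetical body).
const source = 'globalThis.hits = (globalThis.hits || 0) + 1; res.hits = hits; res.path = req.url;';

// Compile once at startup.
const script = new vm.Script(source, { filename: 'kmeContentSourceAdapter.js' });

// Run per request in a fresh context so globals cannot leak between requests.
function handle(req) {
  const res = {};
  const context = vm.createContext({ req, res });
  script.runInContext(context);
  return res;
}

console.log(handle({ url: '/sitemap.xml' }), handle({ url: '/' }));
```

Each call reports `hits` as `1`, because the counter lives on a context that is discarded after the request.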
### VM Context Globals
All dependencies are injected into each request's sandbox:
| Variable | Source |
|---|---|
| `console` | Structured logger |
| `crypto` | Node.js Web Crypto API |
| `axios` | HTTP client |
| `jwt` | `jsonwebtoken` |
| `uuidv4` | UUID v4 generator |
| `xmlbuilder2` | `xmlbuilder2` default export (call as `xmlbuilder2.create(...)`) |
| `redis` | Connected Redis client |
| `URLSearchParams`, `URL` | Node.js globals |
| `kme_CSA_settings` | Loaded from `src/globalVariables/kme_CSA_settings.json` |
| `kmeContentSourceAdapterHelpers` | Loaded from `src/globalVariables/kmeContentSourceAdapterHelpers.js` |
| `req`, `res` | Node.js HTTP request/response |
### Key Constraints for `kmeContentSourceAdapter.js`
- **Zero `import`/`export`** — runs in a VM with no module system
- **No `config`, `global.config`, or `process.env`** — use injected globals only
- Routing metadata is available via `req.params` (set by `server.js`)
- `proxyBaseUrl` is derived dynamically from request headers (`x-forwarded-proto`, `x-forwarded-host`, `host`) — not read from settings
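
One way the `proxyBaseUrl` derivation in the last bullet might work is sketched below. The precedence order, the `http` default, and the handling of comma-separated header values are assumptions, and `deriveProxyBaseUrl` is an illustrative name.

```javascript
// Derive the adapter's public base URL from request headers.
function deriveProxyBaseUrl(headers) {
  // Proxies may send comma-separated lists; take the first entry.
  const first = (value) => (value ? String(value).split(',')[0].trim() : '');
  const proto = first(headers['x-forwarded-proto']) || 'http'; // assumed default
  const host = first(headers['x-forwarded-host']) || first(headers.host);
  return host ? `${proto}://${host}` : null;
}

console.log(deriveProxyBaseUrl({ 'x-forwarded-proto': 'https', host: 'adapter.example.com' }));
```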
## Token Caching
OIDC tokens are cached in Redis under the hash key `authorization` (fields `token` and `expiry`). The cache survives adapter restarts. Token expiry is stored as an absolute Unix epoch timestamp. A stampede guard ensures only one token fetch is in flight at a time when multiple concurrent requests encounter a cache miss.
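
The caching and stampede guard described above could be sketched as follows. The in-memory client is a stand-in whose `hGetAll`/`hSet` method names follow node-redis v4, which is an assumption about the injected client. Storing `expiry` in seconds and the `fetchToken` signature are also assumptions of this example.

```javascript
// In-memory stand-in for a Redis client; method names follow node-redis v4 (assumption).
function createFakeRedis() {
  const hashes = new Map();
  return {
    async hGetAll(key) { return { ...(hashes.get(key) ?? {}) }; },
    async hSet(key, fields) { hashes.set(key, { ...(hashes.get(key) ?? {}), ...fields }); },
  };
}

let inFlight = null; // shared promise: the stampede guard

async function getToken(redis, fetchToken, now = () => Math.floor(Date.now() / 1000)) {
  const cached = await redis.hGetAll('authorization');
  if (cached.token && Number(cached.expiry) > now()) return cached.token;
  if (!inFlight) {
    // Only the first caller to miss the cache starts a fetch; the rest share its promise.
    inFlight = (async () => {
      try {
        const { token, expiresIn } = await fetchToken(); // hypothetical fetcher
        await redis.hSet('authorization', { token, expiry: String(now() + expiresIn) });
        return token;
      } finally {
        inFlight = null; // allow a new fetch after success or failure
      }
    })();
  }
  return inFlight;
}
```

Because the guard is a module-level promise, it only deduplicates fetches within a single process.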
## Helpers (`kmeContentSourceAdapterHelpers.js`)
A pure-utility module injected into the VM context. Key functions:
| Function | Description |
|---|---|
| `getValidToken(reqUrl, reqMethod)` | Returns a cached or freshly-fetched OIDC `id_token`; throws on failure |
| `extractHydraItems(data)` | Extracts one fragment per `SearchResultItem` — the one with the latest `vkm:datePublished` |
| `buildSitemapXml(items, proxyBaseUrl)` | Builds Sitemaps 0.9 XML from an array of fragments |
| `extractArticleBody(data)` | Returns `vkm:articleBody` (or `articleBody` fallback) from a content API response |
| `validateSettings(settings, fields)` | Returns the first missing required field name, or `null` |
> **Note:** This file is a literal function body — `server.js` wraps it as `(function() { <file> })()`. It must end with a bare `return { ... }` and contain zero `import`/`export` statements.
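
To make the "literal function body" convention concrete, the sketch below wraps a miniature helpers body in an IIFE and evaluates it. The use of `vm.runInNewContext` and this `validateSettings` body, including its treatment of empty strings as missing, are assumptions for illustration. `server.js` may load the file differently.

```javascript
const vm = require('node:vm');

// Hypothetical miniature of the helpers file: a bare function body ending in `return { ... }`.
const helperSource = `
  function validateSettings(settings, fields) {
    for (const field of fields) {
      const value = settings ? settings[field] : undefined;
      if (value === undefined || value === null || value === '') return field;
    }
    return null;
  }
  return { validateSettings };
`;

// Wrap the body in an IIFE, as the note describes, and evaluate it to obtain the helpers object.
const helpers = vm.runInNewContext(`(function() { ${helperSource} })()`);

console.log(helpers.validateSettings({ tokenUrl: 'x' }, ['tokenUrl', 'tenant']));
```

The bare `return` is only legal because of the wrapper, which is why the file cannot be loaded with `require` or `import` as-is.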
## Changelog
See [CHANGELOG.md](CHANGELOG.md).