Initial Version of sitemap.xml spec

2026-03-06 23:34:00 -06:00
parent fec5bfa5c7
commit e9495f65b5
41 changed files with 10665 additions and 35 deletions

# API Contract: Sitemap Endpoint
**Feature**: 001-drive-proxy-adapter
**Date**: 2026-03-07
**Phase**: 1 - Design & Contracts
**Endpoint**: `GET /sitemap.xml`
## Overview
The `/sitemap.xml` endpoint returns an XML sitemap listing all Google Drive documents accessible to the Service Account. This is the only endpoint exposed by the adapter.

---
## Endpoint Definition
### URL
```
GET /sitemap.xml
```
### Authentication
- **Method**: None (endpoint is public)
- **Backend Authentication**: Service Account JWT to Google Drive API (transparent to client)
- **Credentials**: Loaded from `GOOGLE_SERVICE_ACCOUNT_KEY` environment variable
### Request
**Method**: `GET`
**Headers**:
- None required
**Query Parameters**:
- None supported
**Request Body**:
- None (GET request)
**Example Request**:
```http
GET /sitemap.xml HTTP/1.1
Host: adapter.example.com
User-Agent: Mozilla/5.0
```
---
## Response Specifications
### Success Response (200 OK)
**Status Code**: `200 OK`
**Headers**:
- `Content-Type: application/xml`
- `Content-Length: {size_in_bytes}`
**Body**: Valid XML sitemap conforming to sitemap protocol
**XML Schema**:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://adapter.example.com/documents/{documentId}</loc>
<lastmod>2026-03-06T10:30:00.000Z</lastmod>
</url>
<!-- Additional <url> entries (up to 50,000) -->
</urlset>
```
**Field Descriptions**:
- `<urlset>`: Root element with sitemap namespace
- `<url>`: Individual URL entry (0 to 50,000 entries)
- `<loc>`: Absolute URL to document using RESTful format `/documents/{documentId}`
- `<lastmod>`: ISO 8601 timestamp of last document modification
**Constraints**:
- Maximum 50,000 `<url>` entries (sitemap protocol limit per spec.md FR-015)
- Maximum 50MB uncompressed (protocol limit, not enforced)
- All `<loc>` URLs use the same base URL (configured via the `BASE_URL` environment variable)
- All `<loc>` URLs use RESTful path format: `/documents/{documentId}`
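
The fields and constraints above can be sketched as a small builder. This is a minimal illustration, not the adapter's implementation: `build_sitemap`, `format_lastmod`, and the `(document_id, modified_time)` input shape are hypothetical names chosen for the example, and raising on more than 50,000 entries is an assumption, since this contract does not say how overflow is handled.

```python
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # sitemap protocol limit on <url> entries


def format_lastmod(moment):
    """Format a timezone-aware datetime as UTC, e.g. 2026-03-06T10:30:00.000Z."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def build_sitemap(base_url, documents):
    """Render (document_id, modified_time) pairs as sitemap XML bytes."""
    if len(documents) > MAX_URLS:
        # Assumption: the overflow policy is not defined in this contract.
        raise ValueError(f"sitemap is limited to {MAX_URLS} <url> entries")
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for document_id, modified_time in documents:
        url = ET.SubElement(urlset, "url")
        loc = f"{base_url.rstrip('/')}/documents/{document_id}"
        ET.SubElement(url, "loc").text = loc  # ElementTree escapes text on output
        ET.SubElement(url, "lastmod").text = format_lastmod(modified_time)
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    modified = datetime(2026, 3, 6, 10, 30, tzinfo=timezone.utc)
    print(build_sitemap("https://adapter.example.com", [("abc_123", modified)]).decode())
```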
**Example Response**:
```http
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 4582
```
**Performance Targets** (from spec.md success criteria):
- Response time: < 5 seconds for up to 10,000 documents
- Memory usage: < 256MB under normal load
- Concurrent requests: Support 10 concurrent requests without degradation
---
### Not Found Response (404)
**Status Code**: `404 Not Found`
**Headers**: None
**Body**: Empty (per spec.md clarification: "HTTP status code only, no error response body")
**When Returned**:
- Any path other than `/sitemap.xml` (per spec.md FR-007)
**Example Response**:
```http
HTTP/1.1 404 Not Found
```
---
### Unauthorized Response (401)
**Status Code**: `401 Unauthorized`
**Headers**: None
**Body**: Empty (per spec.md clarification: "HTTP status code only, no error response body")
**When Returned**:
- Service Account JWT authentication failed (per spec.md FR-010)
- OAuth token refresh failed
- Invalid Service Account credentials
**Example Response**:
```http
HTTP/1.1 401 Unauthorized
```
**Client Action**: Check Service Account credentials in `GOOGLE_SERVICE_ACCOUNT_KEY` environment variable

---
### Rate Limited Response (429)
**Status Code**: `429 Too Many Requests`
**Headers**:
- `Retry-After: {seconds}` (integer, seconds until retry allowed)
**Body**: Empty (per spec.md clarification: "HTTP status code only, no error response body")
**When Returned**:
- Google Drive API rate limit exceeded (per spec.md FR-013)
- Quota exhausted for Service Account
**Example Response**:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
```
**Client Action**: Wait `Retry-After` seconds before retrying request
**Retry-After Values**:
- Derived from Google Drive API `Retry-After` header if available
- Default: 60 seconds if not specified by Drive API
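
The two rules above could look roughly like the following. It is a sketch under assumptions: `retry_after_seconds` is a hypothetical helper, the upstream headers are assumed to arrive as a plain dict keyed by `Retry-After`, and an HTTP-date value (which `Retry-After` also allows) is treated here as "not specified" and falls back to the default.

```python
DEFAULT_RETRY_AFTER_SECONDS = 60  # default when the Drive API gives no value


def retry_after_seconds(upstream_headers):
    """Return the delay to send in the adapter's own Retry-After header."""
    raw = (upstream_headers.get("Retry-After") or "").strip()
    # Only plain integer seconds are passed through; anything else
    # (missing header, HTTP-date, garbage) uses the default.
    if raw.isascii() and raw.isdigit():
        return int(raw)
    return DEFAULT_RETRY_AFTER_SECONDS


if __name__ == "__main__":
    print(retry_after_seconds({"Retry-After": "30"}))
    print(retry_after_seconds({}))
```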
---
### Internal Server Error (500)
**Status Code**: `500 Internal Server Error`
**Headers**: None
**Body**: Empty (per spec.md clarification: "HTTP status code only, no error response body")
**When Returned**:
- Unexpected server error (per spec.md FR-008)
- Configuration error (missing environment variables)
- XML generation failure
**Example Response**:
```http
HTTP/1.1 500 Internal Server Error
```
**Client Action**: Report error to adapter administrator
**Server Logging**: All 500 errors logged with stack trace to stderr (per spec.md FR-012)

---
### Service Unavailable Response (503)
**Status Code**: `503 Service Unavailable`
**Headers**: None
**Body**: Empty (per spec.md clarification: "HTTP status code only, no error response body")
**When Returned**:
- Google Drive API unavailable (per spec.md FR-017)
- Drive API returns 503 status (no retries per spec clarification)
**Example Response**:
```http
HTTP/1.1 503 Service Unavailable
```
**Client Action**: Retry request later (Drive API temporarily unavailable)
**Retry Behavior**: Adapter does NOT retry Drive API 503 errors; immediately returns 503 to client (per spec.md FR-017 clarification)

---
## Error Handling Specification
### Error Response Format
**All error responses follow the same pattern**:
- Status code indicates error type
- No response body (per spec.md clarification)
- Minimal headers (only `Retry-After` for 429)
**Rationale**: Simplicity, consistency, fail-fast approach
### Error Status Code Matrix
| Error Condition | Status Code | Headers | Body | Retry? |
|----------------|-------------|---------|------|--------|
| Authentication failed | 401 | None | Empty | No (fix credentials) |
| Rate limit exceeded | 429 | `Retry-After` | Empty | Yes (after delay) |
| Drive API unavailable | 503 | None | Empty | Yes (later) |
| Internal error | 500 | None | Empty | No (report to admin) |
| Path not found | 404 | None | Empty | No |
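
One way to encode the matrix is a lookup table plus a helper that always returns an empty body. The `Failure` enum and `error_response` function are hypothetical names for this sketch; only the status codes, the `Retry-After` header, and the empty body come from the table.

```python
from enum import Enum


class Failure(Enum):
    AUTH_FAILED = "authentication failed"
    RATE_LIMITED = "rate limit exceeded"
    DRIVE_UNAVAILABLE = "drive api unavailable"
    INTERNAL = "internal error"
    NOT_FOUND = "path not found"


STATUS_BY_FAILURE = {
    Failure.AUTH_FAILED: 401,
    Failure.RATE_LIMITED: 429,
    Failure.DRIVE_UNAVAILABLE: 503,
    Failure.INTERNAL: 500,
    Failure.NOT_FOUND: 404,
}


def error_response(failure, retry_after=60):
    """Return (status, headers, body); the body is always empty."""
    status = STATUS_BY_FAILURE[failure]
    # Retry-After is the only header the matrix attaches to an error.
    headers = {"Retry-After": str(retry_after)} if failure is Failure.RATE_LIMITED else {}
    return status, headers, b""
```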
---
## Logging Specification
### Request Logging (stdout)
**All requests logged with**:
- Timestamp (ISO 8601)
- HTTP method and path
- Response status code
- Response time (milliseconds)
**Example**:
```
[2026-03-07T14:30:15.456Z] GET /sitemap.xml -> 200 (1234ms)
[2026-03-07T14:30:20.789Z] GET /sitemap.xml -> 429 (234ms)
[2026-03-07T14:30:25.012Z] GET /invalid.xml -> 404 (1ms)
```
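
A line in the format shown above could be produced as follows. `format_request_log` is a hypothetical helper, and it assumes the caller supplies a timezone-aware timestamp and an already measured duration in whole milliseconds.

```python
from datetime import datetime, timezone


def format_request_log(when, method, path, status, elapsed_ms):
    """Build one request log line: [timestamp] METHOD path -> status (Nms)."""
    stamp = (
        when.astimezone(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return f"[{stamp}] {method} {path} -> {status} ({elapsed_ms}ms)"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(format_request_log(now, "GET", "/sitemap.xml", 200, 1234))
```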
### Error Logging (stderr)
**All errors logged with**:
- Timestamp (ISO 8601)
- Request ID (for correlation)
- Error message
- Stack trace (for 500 errors)
**Example**:
```
[2026-03-07T14:30:20.789Z] [ERROR] Rate limit exceeded: Drive API quota exhausted
[2026-03-07T14:30:25.012Z] [ERROR] Authentication failed: Invalid Service Account key
[2026-03-07T14:30:30.345Z] [ERROR] Drive API unavailable: Connection timeout
```
---
## Contract Tests
### Test Scenarios
1. **Successful sitemap generation**
- Request: `GET /sitemap.xml`
- Expected: 200 status, valid XML, `Content-Type: application/xml`
2. **Not found for other paths**
- Request: `GET /invalid.xml`
- Expected: 404 status, empty body
3. **Rate limiting**
- Simulate Drive API 429 response
- Expected: 429 status, `Retry-After` header, empty body
4. **Authentication failure**
- Simulate invalid credentials
- Expected: 401 status, empty body
5. **Service unavailable**
- Simulate Drive API 503 response
- Expected: 503 status, empty body (no retries)
6. **XML schema validation**
- Request: `GET /sitemap.xml`
- Validate XML against sitemap protocol schema
7. **URL format validation**
- Request: `GET /sitemap.xml`
- Verify all `<loc>` URLs use `/documents/{documentId}` format
### Test Assertions
**XML Schema Validation**:
- Root element: `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">`
- Each `<url>` has required `<loc>` child
- Each `<lastmod>` is valid ISO 8601 timestamp
- Maximum 50,000 `<url>` entries
**URL Format Validation**:
- All `<loc>` URLs are absolute (start with http:// or https://)
- All `<loc>` URLs use RESTful format: `{baseUrl}/documents/{documentId}`
- Document IDs match regex: `^[a-zA-Z0-9_-]+$`
**Header Validation**:
- 200 responses include `Content-Type: application/xml`
- 429 responses include `Retry-After` header with integer value
- All error responses have empty body
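
The XML and URL assertions above might be checked with a helper along these lines. It is a sketch, not the project's test suite: `check_sitemap` and `is_iso8601` are hypothetical names, the timestamp check is deliberately loose (it accepts anything `datetime.fromisoformat` parses), and full validation against the sitemap XSD is out of scope here.

```python
import re
from datetime import datetime
from xml.etree import ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
DOCUMENT_ID = re.compile(r"^[a-zA-Z0-9_-]+$")
MAX_URLS = 50_000


def is_iso8601(text):
    """Loose check: accept what datetime.fromisoformat can parse."""
    try:
        datetime.fromisoformat(text.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def check_sitemap(xml_bytes, base_url):
    """Return a list of violations; an empty list means every check passed."""
    problems = []
    root = ET.fromstring(xml_bytes)
    if root.tag != NS + "urlset":
        problems.append("root element is not <urlset> in the sitemap namespace")
    urls = root.findall(NS + "url")
    if len(urls) > MAX_URLS:
        problems.append(f"more than {MAX_URLS} <url> entries")
    prefix = base_url.rstrip("/") + "/documents/"
    for position, url in enumerate(urls, start=1):
        loc = (url.findtext(NS + "loc") or "").strip()
        if not loc.startswith(("http://", "https://")):
            problems.append(f"entry {position}: <loc> is missing or not absolute")
        elif not loc.startswith(prefix) or not DOCUMENT_ID.fullmatch(loc[len(prefix):]):
            problems.append(f"entry {position}: <loc> does not match {prefix}{{documentId}}")
        lastmod = url.findtext(NS + "lastmod")
        if lastmod is not None and not is_iso8601(lastmod.strip()):
            problems.append(f"entry {position}: <lastmod> is not an ISO 8601 timestamp")
    return problems
```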
---
## Configuration
### Environment Variables
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `GOOGLE_SERVICE_ACCOUNT_KEY` | Yes | None | Inline JSON of Service Account key file |
| `BASE_URL` | Yes | None | Base URL for sitemap links (e.g., `https://adapter.example.com`) |
| `PORT` | No | 3000 | HTTP server port |
**Example .env**:
```bash
GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account","project_id":"...","private_key":"-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n","client_email":"...@developer.gserviceaccount.com",...}'
BASE_URL=https://adapter.example.com
PORT=3000
```
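
Reading these variables at startup could be sketched as below. `Config` and `load_config` are hypothetical names; the sketch assumes the key is inline JSON as described in the table, and it reports only the names of missing variables so that credential values never reach an error message.

```python
import json
import os
from dataclasses import dataclass

REQUIRED = ("GOOGLE_SERVICE_ACCOUNT_KEY", "BASE_URL")


@dataclass(frozen=True)
class Config:
    service_account: dict
    base_url: str
    port: int


def load_config(env=None):
    """Build a Config from environment variables, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        # Names only: never echo credential values into errors or logs.
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    return Config(
        service_account=json.loads(env["GOOGLE_SERVICE_ACCOUNT_KEY"]),
        base_url=env["BASE_URL"].rstrip("/"),
        port=int(env.get("PORT", "3000")),  # documented default
    )
```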
---
## Compatibility
### Sitemap Protocol Compliance
**Protocol**: https://www.sitemaps.org/protocol.html
**Compliance**:
- ✅ Valid XML with namespace
- ✅ `<loc>` with absolute URLs
- ✅ `<lastmod>` with W3C Datetime format (ISO 8601)
- ✅ Maximum 50,000 URLs
- ✅ Maximum 50MB uncompressed size (protocol limit; not enforced by the adapter)
**Optional Elements Not Used**:
- `<changefreq>`: Not applicable (no historical change data)
- `<priority>`: Not applicable (all documents equal priority)
### HTTP Compliance
**HTTP Version**: HTTP/1.1
**Methods Supported**: `GET` only
**Status Codes Used**: 200, 401, 404, 429, 500, 503
**Headers Used**:
- Response: `Content-Type`, `Content-Length`, `Retry-After`
- Request: Standard HTTP headers accepted, none required
---
## Security Considerations
### Authentication
- Service Account credentials secured in environment variable (not in code or config files)
- Credentials never logged or exposed in error messages
- Read-only Drive scope (`drive.readonly`) - no write permissions
### Rate Limiting
- Transparent propagation of Drive API rate limits to client
- No internal rate limiting (rely on Drive API limits)
### Input Validation
- Path validation: Only `/sitemap.xml` accepted
- Method validation: Only `GET` accepted
- No query parameters processed (rejection not required, just ignored)
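
The three checks above reduce to a single predicate. This is a sketch with a hypothetical name, `is_sitemap_request`; it only decides whether a request is for the sitemap, and leaves the choice of response for everything else to the caller.

```python
from urllib.parse import urlsplit


def is_sitemap_request(method, target):
    """True only for GET requests whose path is exactly /sitemap.xml.

    The query string is dropped before comparison, so parameters are
    ignored rather than rejected.
    """
    return method == "GET" and urlsplit(target).path == "/sitemap.xml"


if __name__ == "__main__":
    print(is_sitemap_request("GET", "/sitemap.xml?utm_source=test"))
    print(is_sitemap_request("GET", "/invalid.xml"))
```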
### Output Sanitization
- All URLs XML-escaped to prevent injection
- All timestamps XML-escaped (though ISO 8601 format doesn't contain XML special chars)
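
If the XML is assembled from string templates rather than an XML library, escaping could be applied as in this sketch. `loc_element` is a hypothetical helper; `xml.sax.saxutils.escape` from the Python standard library replaces `&`, `<`, and `>`.

```python
from xml.sax.saxutils import escape


def loc_element(url):
    """Wrap a URL in <loc>, escaping XML special characters in the text."""
    return f"<loc>{escape(url)}</loc>"


if __name__ == "__main__":
    print(loc_element("https://adapter.example.com/documents/a&b"))
```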
---
## Versioning
**Current Version**: 1.0.0 (initial implementation)
**Future Changes**:
- Breaking changes (new required parameters): Major version bump (2.0.0)
- Backward-compatible additions (query parameters): Minor version bump (1.1.0)
- Bug fixes: Patch version bump (1.0.1)
**Deprecation Policy**:
- Breaking changes include migration guide
- Deprecated features supported for at least one minor version
---
## References
- Feature Specification: `/specs/001-drive-proxy-adapter/spec.md`
- Data Model: `/specs/001-drive-proxy-adapter/data-model.md`
- Research Document: `/specs/001-drive-proxy-adapter/research.md`
- Sitemap Protocol: https://www.sitemaps.org/protocol.html
- Google Drive API v3: https://developers.google.com/drive/api/v3/reference