google-drive-content-adapter/specs/001-sitemap/spec.md

# Feature Specification: Google Drive HTTP Proxy Adapter

**Feature Branch**: `001-drive-proxy-adapter`
**Created**: 2026-03-06
**Updated**: 2026-03-07
**Status**: Draft
**Input**: User description: "I want to build a node.js application that provides an http proxy adapter to search and export documents from Google Drive. HTTP requests to 'sitemap.xml' should use a query to list documents in Google Drive. The links returned in the 'sitemap.xml' should link back to this adapter with a document id."

**Scope Change (2026-03-07)**: Simplified to only handle sitemap.xml generation. Document export functionality removed from scope.

## Clarifications

### Session 2026-03-06

- Q: Architecture approach - format conversion vs metadata-only vs hybrid? → A: Use metadata exportLinks to fetch and stream files through adapter (hybrid: metadata discovery + content streaming)
- Q: How to handle Markdown format (not in Drive API exportLinks)? → A: Check exportLinks for text/x-markdown; if unavailable, convert from HTML export
- Q: What error response format (JSON/text/status-only)? → A: HTTP status code only, no error response body
- Q: Rate limiting behavior when Drive API limits hit? → A: Return 429 with Retry-After header indicating seconds until retry
- Q: Maximum document size limit for streaming? → A: Stream up to 20MB maximum; return 413 Payload Too Large for larger documents

### Session 2026-03-07

- **SCOPE CHANGE**: Removed all document export functionality. System now only generates sitemap.xml with document IDs. The links in the sitemap point back to the adapter with document IDs, but the adapter does not implement the document retrieval endpoints.
- Q: Authentication method for Google Drive API? → A: Service Account with JSON key file (JWT-based, server-to-server authentication)
- Q: Sitemap URL format for document links? → A: /documents/{documentId} (RESTful, clear resource path)
- Q: Retry behavior when Drive API returns 503? → A: No retries, immediately return 503 to client
- Q: Service account credentials storage method? → A: Inline JSON in env var (GOOGLE_SERVICE_ACCOUNT_KEY)
- Q: Logging output destination? → A: stdout/stderr only (console logging, no files)

### Session 3 (2026-03-07)

- Q: How should the system handle cases where >50,000 documents exist in Google Drive (exceeding sitemap protocol limit)? → A: Return 413 error if >50k documents exist
- Q: How should the system handle fatal errors (e.g., invalid service account credentials, unable to bind to port)? → A: Log critical error + crash with exit code 1
- Q: How should the system handle concurrent requests to /sitemap.xml? → A: Queue requests, process one at a time (FIFO)
- Q: What format should be used for log messages? → A: Plain text logging format [timestamp] [level] message
- Q: Should the Drive API query filter be hardcoded or configurable? → A: Drive API filter should be configurable in config/settings.js file (not hardcoded)
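
The plain text log format agreed above could be produced by a small helper like the one below. This is a minimal sketch, not the adapter's actual logger: the function names `formatLogLine` and `log`, the ISO 8601 timestamp, the upper-cased level, and the choice of which levels go to stderr are all assumptions that the clarifications do not fix.

```javascript
// Formats one log line as "[timestamp] [level] message".
// Assumption: ISO 8601 timestamps and upper-case level names.
function formatLogLine(level, message, now = new Date()) {
  return `[${now.toISOString()}] [${String(level).toUpperCase()}] ${message}`;
}

// Writes to the console only (no log files).
// Assumption: warnings and errors go to stderr, everything else to stdout.
function log(level, message) {
  const line = formatLogLine(level, message);
  if (level === 'warn' || level === 'error' || level === 'critical') {
    console.error(line);
  } else {
    console.log(line);
  }
}

log('info', 'GET /sitemap.xml 200');
```

Keeping the formatting in a pure function separate from the console call makes the format easy to check without capturing stdout or stderr.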

## User Scenarios & Testing _(mandatory)_

### User Story 1 - Generate Sitemap of Available Documents (Priority: P1)

A user makes an HTTP GET request to `/sitemap.xml` and receives a valid XML sitemap listing all accessible Google Drive documents with links back to the adapter (document IDs only, no export functionality).

**Why this priority**: This is the core and only functionality. Enables document discovery and generates a sitemap with links containing document IDs. This makes the adapter useful for indexing scenarios (e.g., search engines, content aggregators).

**Independent Test**: Can be tested by making a GET request to `/sitemap.xml` and verifying that the response: (1) is a valid XML sitemap, (2) contains URLs pointing to adapter endpoints with document IDs, (3) reflects the documents accessible to the service account in Google Drive.

**Acceptance Scenarios**:

1. **Given** the service account has access to Google Drive documents, **When** a user requests `/sitemap.xml`, **Then** system returns 200 status with valid XML sitemap
2. **Given** sitemap is generated, **When** examining the XML, **Then** each `<url>` entry contains a `<loc>` pointing to the adapter using RESTful format (e.g., `http://adapter-host/documents/{documentId}`)
3. **Given** multiple documents in Google Drive, **When** sitemap is generated, **Then** all accessible documents are included in the sitemap
4. **Given** the service account lacks permission to certain documents, **When** sitemap is generated, **Then** those documents are excluded from the sitemap
5. **Given** the adapter receives a sitemap request at any path, **When** sitemap is generated, **Then** all URLs use the base URL derived from the incoming request (protocol, host, and path up to but not including sitemap.xml)
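
The scenarios above can be illustrated with a small sitemap builder. This is a sketch under stated assumptions: `buildSitemapXml` and `escapeXml` are hypothetical names, and the input shape `{ id, modifiedTime }` is modelled on Drive file metadata rather than taken from this specification. The 50,000-entry check mirrors the sitemap protocol limit; the caller would be expected to turn the thrown error into a 413 response.

```javascript
const MAX_SITEMAP_URLS = 50000; // sitemap protocol limit per sitemap file

// Escapes the five XML special characters so values are safe inside elements.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// files: array of { id, modifiedTime } (hypothetical shape).
// baseUrl: e.g. "https://example.com/api/v1", without a trailing slash.
function buildSitemapXml(files, baseUrl) {
  if (files.length > MAX_SITEMAP_URLS) {
    // The caller is expected to translate this into a 413 response.
    throw new RangeError(`Sitemap limit exceeded: ${files.length} documents`);
  }
  const entries = files.map((file) => {
    const loc = `${baseUrl}/documents/${encodeURIComponent(file.id)}`;
    // <lastmod> is optional and only emitted when a timestamp is available.
    const lastmod = file.modifiedTime
      ? `\n    <lastmod>${escapeXml(file.modifiedTime)}</lastmod>`
      : '';
    return `  <url>\n    <loc>${escapeXml(loc)}</loc>${lastmod}\n  </url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}

console.log(
  buildSitemapXml(
    [{ id: 'abc123', modifiedTime: '2026-03-06T10:00:00.000Z' }],
    'http://adapter-host'
  )
);
```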

---

### Edge Cases

- What happens when Google Drive API is unavailable or rate-limited? → Return 503 Service Unavailable immediately without retries if API returns 503; return 429 Too Many Requests with Retry-After header if rate limited
- What happens when the Service Account access token expires during a request? → Attempt to obtain a new access token; if that fails, return 401 Unauthorized
- How are shared drive documents handled? → Treat the same as My Drive documents if the service account has access
- What happens with password-protected or restricted documents? → Exclude from sitemap (filter out documents the service account cannot read)
- How are document updates reflected in sitemap? → Each sitemap request fetches current list; no caching
- What if there are more than 50,000 documents (sitemap limit)? → Return 413 Payload Too Large error (enforces sitemap protocol limit)
- How are non-document files handled (images, videos, etc.)? → Include all files in sitemap regardless of type
- What happens if no documents are accessible? → Return valid sitemap XML with no URL entries
- What happens when multiple /sitemap.xml requests arrive simultaneously? → Requests are queued and processed sequentially in FIFO order (one at a time)
- What happens when service account credentials are invalid or missing at startup? → Log critical error to stderr and crash with exit code 1
- How are Drive API query filters customized? → Configure filters in config/settings.js file (not hardcoded)
- What happens if config/settings.js is missing or malformed? → Log critical error to stderr and crash with exit code 1
- How is the base URL determined for sitemap links? → Extracted from incoming request including protocol, host, and path prefix (e.g., request to `/api/v1/sitemap.xml` generates URLs like `https://example.com/api/v1/documents/{id}`)
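
The base URL rule in the last edge case can be sketched as a pure function. This is an illustration only: `deriveBaseUrl` is a hypothetical name, and the request is modelled as a plain object `{ headers, url, encrypted }` (with `node:http`, the TLS flag would typically come from `req.socket.encrypted`). Taking the first value of a comma-separated forwarded header and falling back to `localhost` when no host is present are assumptions, not requirements from this specification.

```javascript
const SITEMAP_SUFFIX = '/sitemap.xml';

// req: minimal request-like object { headers, url, encrypted } (hypothetical shape).
function deriveBaseUrl(req) {
  const headers = req.headers || {};
  // Forwarded headers may carry a comma-separated chain; use the first value.
  const first = (value) => (value ? String(value).split(',')[0].trim() : '');
  const proto = first(headers['x-forwarded-proto']) || (req.encrypted ? 'https' : 'http');
  const host = first(headers['x-forwarded-host']) || headers.host || 'localhost';
  // Drop the query string, then strip the trailing "/sitemap.xml" to keep any path prefix.
  const path = String(req.url || '/').split('?')[0];
  const prefix = path.endsWith(SITEMAP_SUFFIX)
    ? path.slice(0, path.length - SITEMAP_SUFFIX.length)
    : '';
  return `${proto}://${host}${prefix}`;
}

console.log(
  deriveBaseUrl({
    headers: { host: 'internal:3000', 'x-forwarded-proto': 'https', 'x-forwarded-host': 'example.com' },
    url: '/api/v1/sitemap.xml',
  })
);
```

Because the forwarded headers are client-controllable unless a trusted reverse proxy overwrites them, a deployment would likely want to accept them only from known proxies; that policy is outside what this sketch covers.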

## Requirements _(mandatory)_

### Functional Requirements

- **FR-001**: System MUST provide an HTTP server that listens for incoming requests
- **FR-002**: System MUST authenticate with Google Drive API using Service Account with JSON key file (JWT-based, server-to-server authentication)
- **FR-003**: System MUST handle GET requests to `/sitemap.xml` endpoint
- **FR-004**: System MUST query Google Drive API to retrieve list of accessible documents for sitemap generation
- **FR-005**: System MUST generate valid XML sitemap conforming to sitemap protocol (https://www.sitemaps.org/protocol.html)
- **FR-006**: System MUST include document metadata in sitemap (URL with RESTful path format `/documents/{documentId}`, last modified date if available)
- **FR-007**: System MUST return HTTP 404 Not Found for any request whose path does not end in `/sitemap.xml` (a path prefix before `/sitemap.xml` is allowed, e.g., `/api/v1/sitemap.xml`)
- **FR-008**: System MUST return appropriate HTTP status codes (200 OK, 401 Unauthorized, 413 Payload Too Large, 429 Too Many Requests, 500 Internal Server Error, 503 Service Unavailable)
- **FR-009**: System MUST include Content-Type: application/xml header for sitemap responses
- **FR-010**: System MUST obtain a new Service Account access token (via JWT) when the current token expires
- **FR-011**: System MUST log all incoming requests to stdout/stderr using plain text format: [timestamp] [level] message (includes endpoint and response status)
- **FR-012**: System MUST log errors to stdout/stderr using plain text format: [timestamp] [level] message (includes request ID and error message for debugging)
- **FR-013**: System MUST handle Google Drive API rate limiting gracefully by returning 429 status with Retry-After header indicating seconds until retry
- **FR-014**: System MUST derive the base URL from the incoming HTTP request including the full path (using X-Forwarded-Proto and X-Forwarded-Host headers if present, otherwise using request protocol and host, plus the path up to but not including sitemap.xml)
- **FR-015**: System MUST return 413 Payload Too Large if Google Drive contains more than 50,000 documents (enforces sitemap protocol limit)
- **FR-016**: System MUST exclude from the sitemap any documents the service account lacks read access to
- **FR-017**: System MUST NOT retry when Google Drive API returns 503; instead immediately return 503 to client
- **FR-018**: System MUST load Service Account credentials from environment variable GOOGLE_SERVICE_ACCOUNT_KEY containing inline JSON key file content
- **FR-019**: System MUST process /sitemap.xml requests sequentially using a FIFO queue (one request at a time to prevent concurrent Drive API operations)
- **FR-020**: System MUST crash with exit code 1 after logging critical errors (e.g., invalid service account credentials, unable to bind to port, missing required configuration)
- **FR-021**: System MUST load Drive API query filter configuration from config/settings.js file (not hardcoded in source)
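
One way to satisfy the sequential-processing requirement (FR-019) is to chain each sitemap request onto a single promise. The class below is a minimal sketch of that idea; `FifoQueue` and `enqueue` are hypothetical names, and the specification does not prescribe this particular mechanism.

```javascript
// Serializes async tasks: each task starts only after the previous one settles.
class FifoQueue {
  constructor() {
    this.tail = Promise.resolve();
  }

  // task: a function returning a promise.
  // The returned promise resolves or rejects with the task's own outcome.
  enqueue(task) {
    const result = this.tail.then(() => task());
    // Swallow failures on the internal chain so a failed request
    // does not block the requests queued behind it.
    this.tail = result.catch(() => {});
    return result;
  }
}

// Usage: two tasks enqueued back to back run strictly one after the other.
const demoQueue = new FifoQueue();
demoQueue.enqueue(async () => 'first').then((value) => console.log(value));
demoQueue.enqueue(async () => 'second').then((value) => console.log(value));
```

Promise chaining preserves arrival order without an explicit array of pending requests, which keeps the sketch short; a real implementation might also want a cap on queue length so that a slow Drive API cannot cause unbounded buildup.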

### Key Entities

- **Document**: Represents a file in Google Drive. Key attributes include: document ID (unique identifier), title, MIME type, last modified timestamp, permissions status
- **Sitemap Entry**: Represents a document listing in the sitemap XML. Attributes include: location URL (RESTful path `/documents/{documentId}`), last modified date
- **HTTP Request Context**: Represents an incoming request. Attributes include: request ID (for tracing), Service Account JWT token, requested endpoint, client IP
- **Service Account Credentials**: Represents JWT-based authentication state. Attributes include: client email, private key (from JSON key file), access token (generated via JWT), token expiry time, scopes granted
- **Configuration**: Represents application settings. Attributes include: Drive API query filter (loaded from config/settings.js), server port, request queue (FIFO for /sitemap.xml requests)
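
For illustration, the Configuration entity might look like the following `config/settings.js` sketch. Every field name here (`driveQuery`, `port`), the `PORT` environment variable, and the `validateSettings` helper are assumptions introduced for the example; the specification only states that the Drive API query filter lives in this file and that the default port is 3000. The default filter `trashed = false` is likewise just one plausible choice.

```javascript
// config/settings.js (sketch; all field names are hypothetical)
const settings = {
  // Search query intended for the `q` parameter of the Drive API files.list call.
  driveQuery: 'trashed = false',
  // Port for the HTTP server; falls back to 3000 when PORT is unset or invalid.
  port: Number(process.env.PORT) || 3000,
};

// Throws on malformed settings so startup code can log the error and exit with code 1.
function validateSettings(candidate) {
  if (!candidate || typeof candidate.driveQuery !== 'string' || candidate.driveQuery.trim() === '') {
    throw new Error('settings.driveQuery must be a non-empty string');
  }
  if (!Number.isInteger(candidate.port) || candidate.port < 1 || candidate.port > 65535) {
    throw new Error('settings.port must be an integer between 1 and 65535');
  }
  return candidate;
}

validateSettings(settings);
```

In an ES module project the file would end with an export of `settings`; that line is omitted here so the sketch runs under either module system.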

## Success Criteria _(mandatory)_

### Measurable Outcomes

- **SC-001**: Users can request `/sitemap.xml` and receive a valid XML sitemap within 5 seconds for drives containing up to 10,000 documents
- **SC-002**: System successfully handles at least 10 concurrent sitemap requests without errors (queued and processed sequentially in FIFO order)
- **SC-003**: 95% of sitemap requests complete successfully (200 status code)
- **SC-004**: System responds to invalid endpoint requests (404) within 1 second
- **SC-005**: System gracefully handles Google Drive API rate limits without crashing, returning 429 status codes with Retry-After headers
- **SC-006**: When a Service Account access token expires, automatic generation of a new token via JWT succeeds in >99% of cases
- **SC-007**: System startup time from cold start to accepting requests is under 10 seconds
- **SC-008**: System memory usage remains under 256MB under normal load (< 10 concurrent requests)
- **SC-009**: Sitemap includes all accessible documents (100% coverage for documents with read permission)
- **SC-010**: Generated sitemap XML validates against sitemap protocol schema
- **SC-011**: All logs output to stdout/stderr only using plain text format [timestamp] [level] message; no log files created on disk
- **SC-012**: System returns 413 Payload Too Large when Drive contains >50,000 documents (prevents oversized sitemap generation)
- **SC-013**: System terminates with exit code 1 within 5 seconds of encountering fatal configuration or startup errors
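
The error-related outcomes above (SC-005 and the status codes listed in FR-008) imply a fixed mapping from Drive API failures to adapter responses. The function below sketches that mapping under assumptions: `mapUpstreamFailure` is a hypothetical name, the upstream failure is reduced to a status code plus an optional retry hint in seconds, and the 60-second fallback for `Retry-After` is an invented default that the specification does not define.

```javascript
// Assumption: fallback used when the upstream response carries no retry hint.
const DEFAULT_RETRY_AFTER_SECONDS = 60;

// Maps an upstream Drive API status to the adapter's response.
// Returns { status, headers }; no response body is implied.
function mapUpstreamFailure(upstreamStatus, retryAfterSeconds) {
  switch (upstreamStatus) {
    case 401:
      return { status: 401, headers: {} };
    case 429: {
      const hinted = Number.isFinite(retryAfterSeconds) && retryAfterSeconds > 0;
      const seconds = hinted ? Math.ceil(retryAfterSeconds) : DEFAULT_RETRY_AFTER_SECONDS;
      return { status: 429, headers: { 'Retry-After': String(seconds) } };
    }
    case 503:
      // No retries: the 503 is passed straight through to the client.
      return { status: 503, headers: {} };
    default:
      return { status: 500, headers: {} };
  }
}

console.log(mapUpstreamFailure(429, 12.3));
```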

## Assumptions _(optional)_

- Service Account has valid JSON key file credentials configured for Google Drive access
- The adapter runs as a trusted application with appropriate scopes for Google Drive access (read-only, https://www.googleapis.com/auth/drive.readonly)
- Service Account JSON key file is provided via GOOGLE_SERVICE_ACCOUNT_KEY environment variable as inline JSON string
- Network connectivity to Google Drive API (https://www.googleapis.com/drive/v3/) is available
- Document IDs in sitemap URLs are Google Drive file IDs, not custom identifiers
- Sitemap URLs use RESTful path format: `/documents/{documentId}`
- Sitemap generation queries "My Drive" and shared drives to which the service account has access
- Default port is 3000 unless configured otherwise
- System runs on Node.js LTS version (v18 or later)
- Environment supports async/await and ES modules
- Sitemap URLs are constructed dynamically from incoming request headers and path (X-Forwarded-Proto/Host for reverse proxy scenarios, otherwise direct request protocol/host, plus path prefix before sitemap.xml)
- Drive API query filter is configured in config/settings.js file (allows customization without code changes)
- System processes sitemap requests sequentially to avoid concurrent Drive API query conflicts
- Fatal errors (invalid credentials, port binding failure, missing configuration) cause immediate termination with exit code 1
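
The credential and fatal-error assumptions above can be combined into a startup check along these lines. This is a sketch, not the adapter's actual startup code: `loadServiceAccountKey` and `loadKeyOrExit` are hypothetical names, and the `env` and `exit` parameters exist only so the behaviour can be exercised without terminating the process. The `client_email` and `private_key` fields are the ones present in a Google service account JSON key file.

```javascript
// Parses the inline JSON key from GOOGLE_SERVICE_ACCOUNT_KEY and checks required fields.
function loadServiceAccountKey(env = process.env) {
  const raw = env.GOOGLE_SERVICE_ACCOUNT_KEY;
  if (!raw) {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not set');
  }
  let key;
  try {
    key = JSON.parse(raw);
  } catch (err) {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not valid JSON');
  }
  if (key === null || typeof key !== 'object') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY must contain a JSON object');
  }
  for (const field of ['client_email', 'private_key']) {
    if (typeof key[field] !== 'string' || key[field] === '') {
      throw new Error(`GOOGLE_SERVICE_ACCOUNT_KEY is missing "${field}"`);
    }
  }
  return key;
}

// Startup wrapper: logs a critical error to stderr and exits with code 1 on failure.
function loadKeyOrExit(env = process.env, exit = process.exit) {
  try {
    return loadServiceAccountKey(env);
  } catch (err) {
    console.error(`[${new Date().toISOString()}] [CRITICAL] ${err.message}`);
    exit(1);
    return undefined;
  }
}
```

At startup the adapter would call `loadKeyOrExit()` before binding to the port, so that invalid credentials stop the process before it accepts any requests.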

## Out of Scope _(optional)_

- Document export functionality (Markdown, HTML, PDF) - removed from original scope
- Document editing or creation capabilities
- Document content retrieval or streaming
- User authentication/authorization beyond Google Service Account (JWT-based)
- Document caching or local storage (always fetch fresh list from Google Drive)
- Automatic retry logic for Drive API 503 errors (fail immediately instead)
- File-based logging (logs output to console only)
- Custom domain mapping or URL shortening
- Analytics or usage tracking
- Document versioning or revision history access
- Folder hierarchy preservation in sitemap (flat list of documents)
- Batch operations
- WebSocket or Server-Sent Events for real-time updates
- Admin interface or dashboard
- Health check endpoint (only /sitemap.xml is supported)