# Research: Google Drive HTTP Proxy Adapter

**Feature**: 001-drive-proxy-adapter

**Phase**: 0 - Outline & Research

**Date**: 2026-03-07

## Overview

This research document consolidates findings from all clarification sessions (10 Q&A pairs across 3 sessions) and investigates technical decisions for building a Node.js HTTP proxy adapter that generates XML sitemaps from Google Drive documents using Service Account authentication.

## Research Areas

### 1. Google Drive API Service Account Authentication

**Decision**: Use Service Account with JWT-based authentication (server-to-server, no user interaction)

**Rationale**:
- Service Account provides server-to-server authentication without a user login flow
- JWTs are generated programmatically from the JSON key file (`client_email` + `private_key`)
- Ideal for proxy/adapter scenarios where the application acts on behalf of domain users
- Tokens auto-refresh via the googleapis SDK (expiry is handled transparently)

**Implementation Approach**:
- Load the JSON key file from environment variable `GOOGLE_SERVICE_ACCOUNT_KEY` (inline JSON string)
- Use the `googleapis` npm package's `google.auth.GoogleAuth` class with JWT configuration
- Set scope to `https://www.googleapis.com/auth/drive.readonly` (read-only access)
- SDK automatically manages the token lifecycle (generation, refresh, caching)

**Alternatives Considered**:
- ❌ OAuth 2.0 user flow - Requires interactive browser login, unsuitable for a proxy adapter
- ❌ API key authentication - Works only for publicly shared files; private Drive content requires OAuth 2.0 credentials
- ❌ Manual JWT implementation - Complex signing/token exchange, googleapis SDK already provides this

**References**:
- [Google Service Account Documentation](https://cloud.google.com/iam/docs/service-accounts)
- [googleapis Node.js Client](https://github.com/googleapis/google-api-nodejs-client)

---

### 2. XML Sitemap Generation (Sitemap Protocol)

**Decision**: Generate XML sitemap conforming to sitemaps.org protocol, enforce 50,000 URL limit

**Rationale**:
- Sitemap protocol specifies a maximum of 50,000 URLs per sitemap file
- Each URL entry requires `<loc>` (required), optional `<lastmod>` (from Drive `modifiedTime`)
- Must use proper XML namespace: `xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"`
- URLs must be absolute (include base URL prefix)

**Implementation Approach**:
- Query Drive API: `drive.files.list()` with fields `files(id, name, mimeType, modifiedTime)`
- Count results - if >50,000, return HTTP 413 Payload Too Large immediately
- Build XML using template literals (Node.js native approach) or a minimal XML library
- Format URLs as RESTful paths: `{baseUrl}/documents/{documentId}`
- Include `<lastmod>` using ISO 8601 format from the Drive API `modifiedTime` field

**Alternatives Considered**:
- ❌ Sitemap index with multiple sitemaps - Over-engineering for initial requirement (YAGNI)
- ❌ Paginated sitemaps - Not requested in spec, adds complexity
- ✅ Node.js built-in XML generation (template literals) - Simple for flat structure
- ⚠️ `xmlbuilder2` npm package - Consider if XML escaping becomes complex (acceptable dependency per constitution if justified)

**References**:
- [Sitemaps.org Protocol](https://www.sitemaps.org/protocol.html)
- [Google Sitemap Guidelines](https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap)

---

### 3. Concurrency Control - FIFO Request Queue

**Decision**: Implement FIFO queue for `/sitemap.xml` requests, process one at a time

**Rationale** (from Session 3 clarification):
- Prevents concurrent Drive API queries that could cause rate limiting issues
- Ensures predictable resource usage (single Drive API operation at a time)
- Simple queue semantics: first request in, first request served
- If a request fails, continue to the next in queue (no retry per spec)

**Implementation Approach**:
- Use Node.js EventEmitter pattern for queue implementation (built-in module)
- Maintain array of pending request handlers (FIFO array: push to end, shift from start)
- Check queue state before processing:
  - If queue idle: start processing immediately
  - If queue busy: add request to pending array
- Emit 'complete' event to trigger next request processing

**Code Pattern**:

```javascript
import { EventEmitter } from 'events';

class SitemapQueue extends EventEmitter {
  constructor() {
    super();
    this.processing = false;
    this.queue = [];
    // Each 'complete' event starts the next pending request.
    this.on('complete', () => this.processNext());
  }

  // Enqueue a handler; the returned promise settles with the handler's outcome.
  process(handler) {
    return new Promise((resolve, reject) => {
      this.queue.push({ handler, resolve, reject });
      if (!this.processing) this.processNext();
    });
  }

  async processNext() {
    if (this.queue.length === 0) {
      this.processing = false;
      return;
    }
    this.processing = true;
    const { handler, resolve, reject } = this.queue.shift();
    try {
      resolve(await handler());
    } catch (error) {
      reject(error); // No retry per spec; the next request still runs
    } finally {
      this.emit('complete'); // Process next in queue
    }
  }
}
```

**Alternatives Considered**:
- ❌ Concurrent processing with rate limiting - More complex, not required per clarification
- ❌ External queue (Redis, RabbitMQ) - Over-engineering for single-server deployment
- ❌ Worker pool - Unnecessary complexity for sequential processing requirement

---

### 4. Error Handling Strategy

**Decision**: Status-code-only errors (no response body), crash on fatal errors, immediate 503 passthrough

**Rationale** (consolidated from all 3 sessions):
- **Clarification**: HTTP status code only, no error response body (Session 1)
- **Clarification**: Return 429 with `Retry-After` header for rate limiting (Session 1)
- **Clarification**: No retries on Drive API 503, immediately return 503 to client (Session 2)
- **Clarification**: Crash with exit code 1 on fatal errors (invalid credentials, port binding failure) (Session 3)
- **Clarification**: Return 413 for >50k documents (Session 3)

**Error Scenarios**:

| Scenario | HTTP Status | Response Body | Retry-After Header | Action |
|----------|-------------|---------------|-------------------|--------|
| Successful sitemap | 200 OK | XML sitemap | N/A | Return sitemap |
| Invalid endpoint | 404 Not Found | Empty | N/A | Status only |
| >50k documents | 413 Payload Too Large | Empty | N/A | Status only |
| Drive API rate limit | 429 Too Many Requests | Empty | Seconds until retry | Status + header |
| OAuth token expired | 401 Unauthorized | Empty | N/A | Token refresh failed |
| Drive API unavailable (503) | 503 Service Unavailable | Empty | N/A | No retry, immediate passthrough |
| Internal error | 500 Internal Server Error | Empty | N/A | Log error, return status |
| Fatal startup error | N/A | N/A | N/A | Log to stderr, exit(1) |

**Implementation Approach**:
- Use try-catch blocks in request handler
- Map googleapis SDK errors to HTTP status codes
- Set `Retry-After` header by extracting the value from the Drive API error response
- Detect fatal errors during startup (invalid credentials, port EADDRINUSE)
- Use `logger.error()` for stderr logging before `process.exit(1)`

---

### 5. Logging Format and Destination

**Decision**: Plain text logging to stdout/stderr with format `[timestamp] [level] message`

**Rationale** (from Session 3 clarification):
- Simple, human-readable format for container/cloud environments
- stdout for informational logs (info, debug)
- stderr for errors (error level)
- No file-based logging (per constitution: "stdout/stderr only")
- Timestamp helps with debugging time-sequence issues

**Implementation Approach** (already exists in codebase):

```javascript
// src/logger.js (aliased as console.js per constitution)
const formatMessage = (level, message) => {
  const timestamp = new Date().toISOString();
  return `[${timestamp}] [${level.toUpperCase()}] ${message}`;
};

export const logger = {
  log: (msg) => console.log(formatMessage('info', msg)),
  info: (msg) => console.log(formatMessage('info', msg)),
  debug: (msg) => console.log(formatMessage('debug', msg)),
  error: (msg) => console.error(formatMessage('error', msg))
};
```

**Log Events to Capture**:
- Server startup: port, base URL configuration
- Incoming request: method, endpoint, client IP
- Request completion: status code, response time
- Drive API interaction: query start, document count, completion time
- Errors: error type, message, stack trace (if available)
- Fatal errors: critical error message before crash

**Alternatives Considered**:
- ❌ JSON structured logging - Over-engineering for initial requirement, plain text is simpler
- ❌ File-based logging - Explicitly rejected in constitution and clarifications
- ❌ External logging service (Sentry, LogDNA) - Not required, adds dependency

---

### 6. Configuration Management

**Decision**: Split configuration between server settings (config/config.js) and Drive API filter (config/settings.js), load credentials from environment variable

**Rationale** (from Sessions 2 & 3 clarifications):
- **Clarification**: Service Account credentials in env var `GOOGLE_SERVICE_ACCOUNT_KEY` (Session 2)
- **Clarification**: Drive API filter configurable in `config/settings.js` (Session 3)
- Server configuration (port, base URL) in `config/config.js` (per constitution)
- settings.js loaded into global `settings` variable (per constitution)

**Configuration Schema**:

`config/config.js`:

```javascript
export default {
  server: {
    port: process.env.PORT || 3000,
    baseUrl: process.env.BASE_URL || 'http://localhost:3000'
  }
};
```

`config/settings.js`:

```javascript
export default {
  drive: {
    // Drive API query filter (q parameter)
    // Default: all files excluding trashed
    query: process.env.DRIVE_QUERY || "trashed = false",
    // Fields to retrieve; nextPageToken must be listed explicitly,
    // otherwise it is omitted from the response and pagination cannot continue
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    // Results per page (1000 is the Drive API maximum)
    pageSize: 1000
  }
};
```

**Environment Variables**:
- `GOOGLE_SERVICE_ACCOUNT_KEY` (required): JSON key file content (inline string)
- `PORT` (optional): Server port (default: 3000)
- `BASE_URL` (optional): Base URL for sitemap URLs (default: http://localhost:3000)
- `DRIVE_QUERY` (optional): Drive API query filter (default: "trashed = false")

**Startup Validation**:
- Check `GOOGLE_SERVICE_ACCOUNT_KEY` is present and valid JSON
- Validate JSON contains required fields: `client_email`, `private_key`
- If validation fails: log critical error to stderr, exit(1)
- Check port is available (catch EADDRINUSE error), exit(1) if unavailable

**Alternatives Considered**:
- ❌ Credentials file on disk - Environment variable approach is more secure and container-friendly
- ❌ Hardcoded Drive query - Explicitly rejected in Session 3 clarification
- ❌ Database configuration storage - Over-engineering for simple key-value config

---

## Technology Stack Validation

### Core Dependencies

| Package | Version | Justification | Constitution Compliance |
|---------|---------|---------------|------------------------|
| `googleapis` | ^140.0.0 | Official Google SDK, handles OAuth2/JWT complexity, implements Drive API v3 protocol. Alternative (manual implementation) would take >2 days and risk protocol errors. | ✅ APPROVED (documented in plan.md) |

### Node.js Built-ins Used

- `http` - HTTP server
- `fs` - Configuration file loading
- `path` - File path utilities
- `events` - FIFO queue implementation (EventEmitter)
- `url` - URL parsing for request routing

**No additional external dependencies required** - All other functionality (XML generation, logging, queue) implemented using Node.js built-ins.

---

## Best Practices Research

### 1. Service Account Security

- **Never log credentials**: Filter `private_key` from logs
- **Validate JSON structure**: Check required fields before use
- **Scope restriction**: Use minimal scope (readonly)
- **Token lifecycle**: Let googleapis SDK manage refresh automatically

### 2. HTTP Server Best Practices

- **Graceful shutdown**: Handle SIGTERM/SIGINT for cleanup
- **Request timeout**: Set reasonable timeout (30-60 seconds for Drive API calls)
- **Error boundaries**: Catch all errors to prevent crashes (except fatal startup errors)
- **Content-Type headers**: Always set appropriate headers (application/xml for sitemap)

### 3. Google Drive API Best Practices

- **Pagination**: Follow `nextPageToken` via the `pageToken` parameter when results exceed one page (Drive API page size: default 100, maximum 1000)
- **Field filtering**: Request only needed fields to reduce payload size
- **Rate limiting**: Handle 429 errors gracefully (already in spec)
- **Exponential backoff**: NOT required per spec (no retries on 503)

### 4. Sitemap Generation Best Practices

- **XML escaping**: Escape special characters in URLs (&, <, >, ", ')
- **Absolute URLs**: Always use full URLs with protocol and domain
- **Date format**: Use ISO 8601 format for lastmod (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00)
- **URL encoding**: Encode document IDs if they contain special characters

---

## Integration Patterns

### Request Flow

```
Client Request → HTTP Server → FIFO Queue → Drive API Query → XML Generation → Response
                                    ↓
                          (Sequential Processing)
```

### Authentication Flow

```
Startup → Load GOOGLE_SERVICE_ACCOUNT_KEY → Parse JSON → Create GoogleAuth Client
                                                               ↓
Request → Check Token Expiry → Auto-Refresh (if needed) → Use Token for Drive API
```

### Error Flow

```
Error Occurs → Map to HTTP Status → Set Headers (Retry-After if 429) → Return Status Code (no body)
      ↓
Log Error (stderr) → Include context (request ID, error message)
```

---

## Open Questions & Assumptions

### Resolved via Clarifications (All 3 Sessions)

- ✅ Authentication method → Service Account with JWT
- ✅ URL format → `/documents/{documentId}` (RESTful)
- ✅ Error response format → Status code only, no body
- ✅ Rate limiting behavior → 429 with Retry-After header
- ✅ Drive API 503 handling → No retries, immediate passthrough
- ✅ Credentials storage → Inline JSON in env var
- ✅ Logging destination → stdout/stderr only
- ✅ >50k documents handling → 413 error
- ✅ Fatal error handling → Crash with exit code 1
- ✅ Concurrent requests → FIFO queue, sequential processing
- ✅ Log format → Plain text `[timestamp] [level] message`
- ✅ Drive query filter → Configurable in config/settings.js

### Assumptions (from spec.md)

- Service Account has domain-wide delegation if accessing user drives
- Base URL configured correctly for production environment
- Node.js v18+ LTS available on deployment platform
- Network connectivity to googleapis.com available

---

## Summary

All technical unknowns from the specification have been resolved through 3 clarification sessions (10 Q&A pairs total). Key research findings:

1. **Authentication**: googleapis SDK with Service Account JWT (load from env var)
2. **Sitemap Protocol**: Enforce 50k limit, use standard XML namespace, include lastmod
3. **Concurrency**: FIFO queue using Node.js EventEmitter (sequential processing)
4. **Error Handling**: Status-only responses, crash on fatal errors, no retries on 503
5. **Logging**: Plain text format to stdout/stderr (no files)
6. **Configuration**: Split between config.js (server) and settings.js (Drive query filter)

**No remaining NEEDS CLARIFICATION items** - Ready to proceed to Phase 1 design.
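
To make finding 2 (Sitemap Protocol) concrete, the following is a minimal sketch of template-literal sitemap generation, assuming the file list has already been fetched from `drive.files.list()`. The names `escapeXml` and `buildSitemap` and the `{ status, body }` return shape are hypothetical and chosen only for illustration; the namespace, the 50,000 URL limit, the 413 status, and the `/documents/{documentId}` path come from the decisions recorded in this document.

```javascript
// Hypothetical helper names; protocol constants follow sitemaps.org.
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';
const MAX_URLS = 50000;

// Escape the five XML special characters so values are safe inside elements.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// files: array of { id, modifiedTime } objects (shape assumed from the
// fields requested in config/settings.js).
// Returns { status, body }; body is null when the URL limit is exceeded.
function buildSitemap(files, baseUrl) {
  if (files.length > MAX_URLS) {
    return { status: 413, body: null };
  }
  const entries = files.map((file) => {
    // URL-encode the document ID first, then XML-escape the whole URL.
    const loc = `${baseUrl}/documents/${encodeURIComponent(file.id)}`;
    const lastmod = file.modifiedTime
      ? `\n    <lastmod>${escapeXml(file.modifiedTime)}</lastmod>`
      : '';
    return `  <url>\n    <loc>${escapeXml(loc)}</loc>${lastmod}\n  </url>`;
  });
  const body = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<urlset xmlns="${SITEMAP_NS}">`,
    ...entries,
    '</urlset>',
  ].join('\n');
  return { status: 200, body };
}

// Example usage with a single, made-up document entry.
const example = buildSitemap(
  [{ id: 'abc123', modifiedTime: '2026-03-01T12:00:00.000Z' }],
  'http://localhost:3000'
);
console.log(example.status);
console.log(example.body);
```

Returning a status alongside the body keeps the size check next to the generation logic, so the request handler only has to forward the result. Whether the limit is better enforced here or while paginating Drive results is a design choice left to Phase 1.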
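
The startup validation described for `GOOGLE_SERVICE_ACCOUNT_KEY` (present, valid JSON, contains `client_email` and `private_key`, otherwise log to stderr and exit with code 1) could be sketched roughly as follows. The function names `parseServiceAccountKey` and `loadCredentialsOrExit` are hypothetical, and the error messages are illustrative; the sketch deliberately reports only field names so the private key never reaches the logs.

```javascript
// Hypothetical helper: parse and validate the inline Service Account key.
// Throws an Error whose message names the problem but never echoes key material.
function parseServiceAccountKey(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not set');
  }
  let key;
  try {
    key = JSON.parse(raw);
  } catch {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not valid JSON');
  }
  if (key === null || typeof key !== 'object') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY must be a JSON object');
  }
  // Both fields are required by the startup validation rules.
  const missing = ['client_email', 'private_key'].filter(
    (field) => typeof key[field] !== 'string' || key[field] === ''
  );
  if (missing.length > 0) {
    throw new Error(
      `GOOGLE_SERVICE_ACCOUNT_KEY is missing required fields: ${missing.join(', ')}`
    );
  }
  return key;
}

// Hypothetical startup wiring: fatal errors go to stderr, then exit(1).
// Uses the documented `[timestamp] [level] message` format directly so the
// sketch stays self-contained; real code would call logger.error().
function loadCredentialsOrExit(env = process.env) {
  try {
    return parseServiceAccountKey(env.GOOGLE_SERVICE_ACCOUNT_KEY);
  } catch (error) {
    console.error(`[${new Date().toISOString()}] [ERROR] ${error.message}`);
    process.exit(1);
  }
}
```

Separating the pure validation function from the exit behavior keeps the check unit-testable without terminating the test process.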