Research: Google Drive HTTP Proxy Adapter
Feature: 001-drive-proxy-adapter
Phase: 0 - Outline & Research
Date: 2026-03-07
Overview
This research document consolidates findings from all clarification sessions (10 Q&A pairs across 3 sessions) and investigates technical decisions for building a Node.js HTTP proxy adapter that generates XML sitemaps from Google Drive documents using Service Account authentication.
Research Areas
1. Google Drive API Service Account Authentication
Decision: Use Service Account with JWT-based authentication (server-to-server, no user interaction)
Rationale:
- Service Account provides server-to-server authentication without user login flow
- JWT tokens generated programmatically from JSON key file (client_email + private_key)
- Ideal for proxy/adapter scenarios where application acts on behalf of domain users
- Tokens auto-refresh via googleapis SDK (handles expiry transparently)
Implementation Approach:
- Load JSON key file from environment variable GOOGLE_SERVICE_ACCOUNT_KEY (inline JSON string)
- Use the googleapis npm package's google.auth.GoogleAuth class with JWT configuration
- Set scope to https://www.googleapis.com/auth/drive.readonly (read-only access)
- SDK automatically manages token lifecycle (generation, refresh, caching)
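The approach above could be wired up roughly as follows. This is a minimal sketch, not the project's actual module: the function name createDriveClient is hypothetical, and the google argument is assumed to be the named export of the googleapis package (for example from import { google } from 'googleapis'), passed in by the caller so the sketch itself has no dependencies.

```javascript
// Read-only scope named in the decision above.
const DRIVE_READONLY_SCOPE = 'https://www.googleapis.com/auth/drive.readonly';

// Hypothetical helper. `google` is assumed to be the `google` export of the
// googleapis package; `rawKey` is the content of GOOGLE_SERVICE_ACCOUNT_KEY.
function createDriveClient(google, rawKey) {
  // The env var holds the whole JSON key file as a single string.
  const { client_email, private_key } = JSON.parse(rawKey);

  // GoogleAuth builds a JWT-based client from the service account
  // credentials and takes care of fetching and refreshing access tokens.
  const auth = new google.auth.GoogleAuth({
    credentials: { client_email, private_key },
    scopes: [DRIVE_READONLY_SCOPE],
  });

  // Drive API v3 client bound to that auth object.
  return google.drive({ version: 'v3', auth });
}
```

At startup this would be called once, for example createDriveClient(google, process.env.GOOGLE_SERVICE_ACCOUNT_KEY), and the returned client reused for every request.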
Alternatives Considered:
- ❌ OAuth 2.0 user flow - Requires interactive browser login, unsuitable for proxy adapter
- ❌ API key authentication - API keys only reach publicly shared files; private Drive content requires OAuth 2.0 credentials
- ❌ Manual JWT implementation - Complex signing/token exchange, googleapis SDK already provides this
2. XML Sitemap Generation (Sitemap Protocol)
Decision: Generate XML sitemap conforming to sitemaps.org protocol, enforce 50,000 URL limit
Rationale:
- Sitemap protocol specifies max 50,000 URLs per sitemap file
- Each URL entry requires <loc> (required), optional <lastmod> (from Drive modifiedTime)
- Must use proper XML namespace: xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
- URLs must be absolute (include base URL prefix)
Implementation Approach:
- Query Drive API: drive.files.list() with fields nextPageToken, files(id, name, mimeType, modifiedTime); nextPageToken must be requested explicitly, otherwise it is omitted from the response and pagination cannot continue
- Count results - if >50,000, return HTTP 413 Payload Too Large immediately
- Build XML using template literals (Node.js native approach) or minimal XML library
- Format URLs as RESTful paths: {baseUrl}/documents/{documentId}
- Include <lastmod> using ISO 8601 format from Drive API modifiedTime field
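The template-literal approach could look roughly like this. It is a sketch under assumptions: the helper names escapeXml and buildSitemap are hypothetical, and files is assumed to be an array of Drive file objects carrying at least id and, optionally, modifiedTime.

```javascript
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';

// Escape the five characters that are special in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build a sitemap document from Drive file metadata.
function buildSitemap(files, baseUrl) {
  const root = baseUrl.replace(/\/+$/, ''); // avoid a double slash before /documents
  const entries = files.map((file) => {
    const loc = `${root}/documents/${encodeURIComponent(file.id)}`;
    // <lastmod> is optional, so it is only emitted when Drive supplied a value.
    const lastmod = file.modifiedTime
      ? `\n    <lastmod>${escapeXml(file.modifiedTime)}</lastmod>`
      : '';
    return `  <url>\n    <loc>${escapeXml(loc)}</loc>${lastmod}\n  </url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<urlset xmlns="${SITEMAP_NS}">`,
    ...entries,
    '</urlset>',
  ].join('\n');
}
```

Escaping is applied after the URL is assembled, so an ampersand in the base URL is also covered. If escaping needs grow beyond this, that would be the point to reconsider xmlbuilder2.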
Alternatives Considered:
- ❌ Sitemap index with multiple sitemaps - Over-engineering for initial requirement (YAGNI)
- ❌ Paginated sitemaps - Not requested in spec, adds complexity
- ✅ Node.js built-in XML generation (template literals) - Simple for flat structure
- ⚠️ xmlbuilder2 npm package - Consider if XML escaping becomes complex (acceptable dependency per constitution if justified)
3. Concurrency Control - FIFO Request Queue
Decision: Implement FIFO queue for /sitemap.xml requests, process one at a time
Rationale (from Session 3 clarification):
- Prevents concurrent Drive API queries that could cause rate limiting issues
- Ensures predictable resource usage (single Drive API operation at a time)
- Simple queue semantics: first request in, first request served
- If request fails, continue to next in queue (no retry per spec)
Implementation Approach:
- Use Node.js EventEmitter pattern for queue implementation (built-in module)
- Maintain array of pending request handlers (FIFO array: push to end, shift from start)
- Check queue state before processing:
- If queue empty: start processing immediately
- If queue busy: add request to pending array
- Emit 'complete' event to trigger next request processing
Code Pattern:
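import { EventEmitter } from 'events';

class SitemapQueue extends EventEmitter {
  constructor() {
    super();
    this.processing = false; // true while a handler is running
    this.queue = [];         // pending { handler, resolve, reject } entries, oldest first
    // Each finished request emits 'complete', which starts the next one.
    this.on('complete', () => this.processNext());
  }

  // Enqueue a handler; the returned promise settles with the handler's outcome.
  process(handler) {
    return new Promise((resolve, reject) => {
      this.queue.push({ handler, resolve, reject });
      if (!this.processing) this.processNext();
    });
  }

  async processNext() {
    if (this.queue.length === 0) {
      this.processing = false;
      return;
    }
    this.processing = true;
    const { handler, resolve, reject } = this.queue.shift();
    try {
      resolve(await handler());
    } catch (error) {
      reject(error); // a failed request does not block the queue (no retry)
    } finally {
      this.emit('complete'); // Process next in queue
    }
  }
}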
Alternatives Considered:
- ❌ Concurrent processing with rate limiting - More complex, not required per clarification
- ❌ External queue (Redis, RabbitMQ) - Over-engineering for single-server deployment
- ❌ Worker pool - Unnecessary complexity for sequential processing requirement
4. Error Handling Strategy
Decision: Status-code-only errors (no response body), crash on fatal errors, immediate 503 passthrough
Rationale (consolidated from all 3 sessions):
- Clarification: HTTP status code only, no error response body (Session 1)
- Clarification: Return 429 with Retry-After header for rate limiting (Session 1)
- Clarification: No retries on Drive API 503, immediately return 503 to client (Session 2)
- Clarification: Crash with exit code 1 on fatal errors (invalid credentials, port binding failure) (Session 3)
- Clarification: Return 413 for >50k documents (Session 3)
Error Scenarios:
| Scenario | HTTP Status | Response Body | Retry-After Header | Action |
|---|---|---|---|---|
| Successful sitemap | 200 OK | XML sitemap | N/A | Return sitemap |
| Invalid endpoint | 404 Not Found | Empty | N/A | Status only |
| >50k documents | 413 Payload Too Large | Empty | N/A | Status only |
| Drive API rate limit | 429 Too Many Requests | Empty | Seconds until retry | Status + header |
| OAuth token expired | 401 Unauthorized | Empty | N/A | Token refresh failed |
| Drive API unavailable (503) | 503 Service Unavailable | Empty | N/A | No retry, immediate passthrough |
| Internal error | 500 Internal Server Error | Empty | N/A | Log error, return status |
| Fatal startup error | N/A | N/A | N/A | Log to stderr, exit(1) |
Implementation Approach:
- Use try-catch blocks in request handler
- Map googleapis SDK errors to HTTP status codes
- Set Retry-After header by extracting from Drive API error response
- Detect fatal errors during startup (invalid credentials, port EADDRINUSE)
- Use logger.error() for stderr logging before process.exit(1)
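The mapping from SDK errors to status-only responses might be sketched as below. The error shape is an assumption: the sketch expects the upstream HTTP status in error.response.status (falling back to error.code) and the upstream headers in error.response.headers, which should be verified against the installed googleapis version. The names mapDriveError and sendStatus are hypothetical.

```javascript
// Upstream statuses that the error table passes through unchanged.
const PASSTHROUGH = new Set([401, 429, 503]);

// Translate a Drive API failure into { status, headers } for the client.
// Anything unrecognised (network errors, bugs) becomes 500.
function mapDriveError(error) {
  const upstream = Number(error?.response?.status ?? error?.code);
  const status = PASSTHROUGH.has(upstream) ? upstream : 500;
  const headers = {};
  if (status === 429) {
    // Forward Retry-After only when Drive actually supplied one.
    const retryAfter = error?.response?.headers?.['retry-after'];
    if (retryAfter !== undefined) headers['Retry-After'] = String(retryAfter);
  }
  return { status, headers };
}

// Write a status-only response: headers, no body.
function sendStatus(res, { status, headers }) {
  res.writeHead(status, headers);
  res.end();
}
```

In the request handler, the catch block would log the error and then call sendStatus(res, mapDriveError(error)). The 413 case is raised by the adapter itself rather than by Drive, so it would be handled before this mapping.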
5. Logging Format and Destination
Decision: Plain text logging to stdout/stderr with format [timestamp] [level] message
Rationale (from Session 3 clarification):
- Simple, human-readable format for container/cloud environments
- stdout for informational logs (info, debug)
- stderr for errors (error level)
- No file-based logging (per constitution: "stdout/stderr only")
- Timestamp helps with debugging time-sequence issues
Implementation Approach (already exists in codebase):
// src/logger.js (aliased as console.js per constitution)
const formatMessage = (level, message) => {
  const timestamp = new Date().toISOString();
  return `[${timestamp}] [${level.toUpperCase()}] ${message}`;
};

export const logger = {
  log: (msg) => console.log(formatMessage('info', msg)),
  info: (msg) => console.log(formatMessage('info', msg)),
  debug: (msg) => console.log(formatMessage('debug', msg)),
  error: (msg) => console.error(formatMessage('error', msg))
};
Log Events to Capture:
- Server startup: port, base URL configuration
- Incoming request: method, endpoint, client IP
- Request completion: status code, response time
- Drive API interaction: query start, document count, completion time
- Errors: error type, message, stack trace (if available)
- Fatal errors: critical error message before crash
Alternatives Considered:
- ❌ JSON structured logging - Over-engineering for initial requirement, plain text is simpler
- ❌ File-based logging - Explicitly rejected in constitution and clarifications
- ❌ External logging service (Sentry, LogDNA) - Not required, adds dependency
6. Configuration Management
Decision: Split configuration between server settings (config/config.js) and Drive API filter (config/settings.js), load credentials from environment variable
Rationale (from Sessions 2 & 3 clarifications):
- Clarification: Service Account credentials in env var GOOGLE_SERVICE_ACCOUNT_KEY (Session 2)
- Clarification: Drive API filter configurable in config/settings.js (Session 3)
- Server configuration (port, base URL) in config/config.js (per constitution)
- settings.js loaded into global settings variable (per constitution)
Configuration Schema:
config/config.js:
export default {
  server: {
    port: process.env.PORT || 3000,
    baseUrl: process.env.BASE_URL || 'http://localhost:3000'
  }
};
config/settings.js:
export default {
  drive: {
    // Drive API query filter (q parameter)
    // Default: all files excluding trashed
    query: process.env.DRIVE_QUERY || "trashed = false",
    // Fields to retrieve (nextPageToken is required to page past the first page)
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    // Maximum results per page
    pageSize: 1000
  }
};
Environment Variables:
- GOOGLE_SERVICE_ACCOUNT_KEY (required): JSON key file content (inline string)
- PORT (optional): Server port (default: 3000)
- BASE_URL (optional): Base URL for sitemap URLs (default: http://localhost:3000)
- DRIVE_QUERY (optional): Drive API query filter (default: "trashed = false")
Startup Validation:
- Check GOOGLE_SERVICE_ACCOUNT_KEY is present and valid JSON
- Validate JSON contains required fields: client_email, private_key
- If validation fails: log critical error to stderr, exit(1)
- Check port is available (catch EADDRINUSE error), exit(1) if unavailable
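The credential checks above could be sketched as follows. The function names parseServiceAccountKey and loadKeyOrExit are hypothetical, and the log line only imitates the [timestamp] [level] message format; the real code would presumably call the project's logger instead.

```javascript
const REQUIRED_KEY_FIELDS = ['client_email', 'private_key'];

// Parse and validate the inline JSON key; throws with a safe message on failure.
function parseServiceAccountKey(raw) {
  if (!raw) throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not set');
  let key;
  try {
    key = JSON.parse(raw);
  } catch {
    // The parser message is deliberately dropped: it could echo parts of the secret.
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not valid JSON');
  }
  const missing = REQUIRED_KEY_FIELDS.filter(
    (field) => typeof key?.[field] !== 'string' || key[field] === ''
  );
  if (missing.length > 0) {
    throw new Error(`GOOGLE_SERVICE_ACCOUNT_KEY is missing: ${missing.join(', ')}`);
  }
  return key;
}

// Startup wrapper: fatal configuration errors go to stderr, then exit code 1.
function loadKeyOrExit(env = process.env) {
  try {
    return parseServiceAccountKey(env.GOOGLE_SERVICE_ACCOUNT_KEY);
  } catch (error) {
    console.error(`[${new Date().toISOString()}] [ERROR] ${error.message}`);
    process.exit(1);
  }
}
```

Only field names appear in the error messages, never field values, which keeps private_key out of the logs. The port check is separate: it would hang off the server's 'error' event and exit(1) when error.code is EADDRINUSE.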
Alternatives Considered:
- ❌ Credentials file on disk - Environment variable approach is more secure and container-friendly
- ❌ Hardcoded Drive query - Explicitly rejected in Session 3 clarification
- ❌ Database configuration storage - Over-engineering for simple key-value config
Technology Stack Validation
Core Dependencies
| Package | Version | Justification | Constitution Compliance |
|---|---|---|---|
| googleapis | ^140.0.0 | Official Google SDK, handles OAuth2/JWT complexity, implements Drive API v3 protocol. Alternative (manual implementation) would take >2 days and risk protocol errors. | ✅ APPROVED (documented in plan.md) |
Node.js Built-ins Used
- http - HTTP server
- fs - Configuration file loading
- path - File path utilities
- events - FIFO queue implementation (EventEmitter)
- url - URL parsing for request routing
No additional external dependencies required - All other functionality (XML generation, logging, queue) implemented using Node.js built-ins.
Best Practices Research
1. Service Account Security
- Never log credentials: Filter private_key from logs
- Validate JSON structure: Check required fields before use
- Scope restriction: Use minimal scope (readonly)
- Token lifecycle: Let googleapis SDK manage refresh automatically
2. HTTP Server Best Practices
- Graceful shutdown: Handle SIGTERM/SIGINT for cleanup
- Request timeout: Set reasonable timeout (30-60 seconds for Drive API calls)
- Error boundaries: Catch all errors to prevent crashes (except fatal startup errors)
- Content-Type headers: Always set appropriate headers (application/xml for sitemap)
3. Google Drive API Best Practices
- Pagination: Follow nextPageToken whenever results span more than one page (pageSize defaults to 100 and is capped at 1000)
- Field filtering: Request only needed fields to reduce payload size
- Rate limiting: Handle 429 errors gracefully (already in spec)
- Exponential backoff: NOT required per spec (no retries on 503)
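Pagination combined with the 50,000-document cap could be sketched like this. The function name listAllFiles and the status property on the thrown error are hypothetical. The drive argument is assumed to be a Drive v3 client from the googleapis SDK, and fields is assumed to include nextPageToken, since the token is only returned when requested.

```javascript
// Collect every matching file, page by page, and stop as soon as the limit is exceeded.
async function listAllFiles(drive, { query, fields, pageSize = 1000, limit = 50000 }) {
  const files = [];
  let pageToken; // undefined on the first call, then the token from the previous page
  do {
    const { data } = await drive.files.list({ q: query, fields, pageSize, pageToken });
    files.push(...(data.files ?? []));
    if (files.length > limit) {
      // Abort early rather than fetching pages that will be discarded anyway.
      const error = new Error(`More than ${limit} documents matched`);
      error.status = 413; // hypothetical marker for the handler to turn into HTTP 413
      throw error;
    }
    pageToken = data.nextPageToken;
  } while (pageToken);
  return files;
}
```

Errors thrown by drive.files.list() propagate unchanged, so there is no retry or backoff here, in line with the spec.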
4. Sitemap Generation Best Practices
- XML escaping: Escape special characters in URLs (&, <, >, ", ')
- Absolute URLs: Always use full URLs with protocol and domain
- Date format: Use W3C Datetime (an ISO 8601 profile) for lastmod (YYYY-MM-DD or YYYY-MM-DDThh:mm:ss+00:00)
- URL encoding: Encode document IDs if they contain special characters
Integration Patterns
Request Flow
Client Request → HTTP Server → FIFO Queue → Drive API Query → XML Generation → Response
↓
(Sequential Processing)
Authentication Flow
Startup → Load GOOGLE_SERVICE_ACCOUNT_KEY → Parse JSON → Create GoogleAuth Client
↓
Request → Check Token Expiry → Auto-Refresh (if needed) → Use Token for Drive API
Error Flow
Error Occurs → Map to HTTP Status → Set Headers (Retry-After if 429) → Return Status Code (no body)
↓
Log Error (stderr) → Include context (request ID, error message)
Open Questions & Assumptions
Resolved via Clarifications (All 3 Sessions)
✅ Authentication method → Service Account with JWT
✅ URL format → /documents/{documentId} (RESTful)
✅ Error response format → Status code only, no body
✅ Rate limiting behavior → 429 with Retry-After header
✅ Drive API 503 handling → No retries, immediate passthrough
✅ Credentials storage → Inline JSON in env var
✅ Logging destination → stdout/stderr only
✅ >50k documents handling → 413 error
✅ Fatal error handling → Crash with exit code 1
✅ Concurrent requests → FIFO queue, sequential processing
✅ Log format → Plain text [timestamp] [level] message
✅ Drive query filter → Configurable in config/settings.js
Assumptions (from spec.md)
- Service Account has domain-wide delegation if accessing user drives
- Base URL configured correctly for production environment
- Node.js v18+ LTS available on deployment platform
- Network connectivity to googleapis.com available
Summary
All technical unknowns from the specification have been resolved through 3 clarification sessions (10 Q&A pairs total). Key research findings:
- Authentication: googleapis SDK with Service Account JWT (load from env var)
- Sitemap Protocol: Enforce 50k limit, use standard XML namespace, include lastmod
- Concurrency: FIFO queue using Node.js EventEmitter (sequential processing)
- Error Handling: Status-only responses, crash on fatal errors, no retries on 503
- Logging: Plain text format to stdout/stderr (no files)
- Configuration: Split between config.js (server) and settings.js (Drive query filter)
No remaining NEEDS CLARIFICATION items - Ready to proceed to Phase 1 design.