Research: Google Drive HTTP Proxy Adapter

Feature: 001-drive-proxy-adapter
Phase: 0 - Outline & Research
Date: 2026-03-07

Overview

This research document consolidates findings from all clarification sessions (10 Q&A pairs across 3 sessions) and investigates technical decisions for building a Node.js HTTP proxy adapter that generates XML sitemaps from Google Drive documents using Service Account authentication.

Research Areas

1. Google Drive API Service Account Authentication

Decision: Use Service Account with JWT-based authentication (server-to-server, no user interaction)

Rationale:

  • Service Account provides server-to-server authentication without user login flow
  • JWT tokens generated programmatically from JSON key file (client_email + private_key)
  • Ideal for proxy/adapter scenarios where the application authenticates as itself; acting on behalf of domain users additionally requires domain-wide delegation
  • Tokens auto-refresh via googleapis SDK (handles expiry transparently)

Implementation Approach:

  • Load JSON key file from environment variable GOOGLE_SERVICE_ACCOUNT_KEY (inline JSON string)
  • Use the google.auth.GoogleAuth class from the googleapis npm package, passing the parsed key as credentials (it uses the JWT flow for Service Account keys)
  • Set scope to https://www.googleapis.com/auth/drive.readonly (read-only access)
  • SDK automatically manages token lifecycle (generation, refresh, caching)
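
The steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name createDriveClient is hypothetical, and the googleapis `google` export is passed in as a parameter (an assumption made so the sketch can be exercised with a stub and without network access).

```javascript
// Hypothetical helper: builds a read-only Drive v3 client from the inline JSON key.
// `google` is the named export of the googleapis package, e.g. obtained via
//   import { google } from 'googleapis';
const DRIVE_READONLY_SCOPE = 'https://www.googleapis.com/auth/drive.readonly';

function createDriveClient(google, rawKey = process.env.GOOGLE_SERVICE_ACCOUNT_KEY) {
  // The environment variable holds the whole key file as one JSON string.
  const credentials = JSON.parse(rawKey);

  // GoogleAuth obtains and refreshes access tokens when requests are made,
  // so no manual JWT signing or token caching is needed here.
  const auth = new google.auth.GoogleAuth({
    credentials,
    scopes: [DRIVE_READONLY_SCOPE],
  });

  return google.drive({ version: 'v3', auth });
}
```

Validation of the key's contents is intentionally left out of this sketch; it belongs to the startup checks described under Configuration Management.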

Alternatives Considered:

  • OAuth 2.0 user flow - Requires interactive browser login, unsuitable for proxy adapter
  • API key authentication - Only works for publicly shared files; private Drive content requires OAuth 2.0 credentials
  • Manual JWT implementation - Complex signing/token exchange, googleapis SDK already provides this

2. XML Sitemap Generation (Sitemap Protocol)

Decision: Generate XML sitemap conforming to sitemaps.org protocol, enforce 50,000 URL limit

Rationale:

  • Sitemap protocol specifies max 50,000 URLs per sitemap file
  • Each URL entry requires <loc> (required), optional <lastmod> (from Drive modifiedTime)
  • Must use proper XML namespace: xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  • URLs must be absolute (include base URL prefix)

Implementation Approach:

  • Query Drive API: drive.files.list() with fields nextPageToken, files(id, name, mimeType, modifiedTime) (nextPageToken must be requested explicitly, otherwise results cannot be paged)
  • Count results - if >50,000, return HTTP 413 Payload Too Large immediately
  • Build XML using template literals (Node.js native approach) or minimal XML library
  • Format URLs as RESTful paths: {baseUrl}/documents/{documentId}
  • Include <lastmod> using ISO 8601 format from Drive API modifiedTime field
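
A minimal sketch of the template-literal approach, assuming the file list has already been fetched. The function names buildSitemap and escapeXml and the statusCode property on the thrown error are illustrative choices, not part of the spec.

```javascript
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';
const MAX_SITEMAP_URLS = 50000;

// Escape the five characters that are special in XML content.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: `files` is an array shaped like Drive files.list items.
function buildSitemap(files, baseUrl) {
  if (files.length > MAX_SITEMAP_URLS) {
    // The caller is expected to translate this into HTTP 413.
    const error = new Error(`Sitemap limit exceeded: ${files.length} documents`);
    error.statusCode = 413;
    throw error;
  }

  const base = baseUrl.replace(/\/+$/, ''); // avoid a double slash before /documents
  const entries = files.map((file) => {
    const loc = `${base}/documents/${encodeURIComponent(file.id)}`;
    // <lastmod> is optional, so it is only emitted when Drive supplied a value.
    const lastmod = file.modifiedTime
      ? `\n    <lastmod>${escapeXml(file.modifiedTime)}</lastmod>`
      : '';
    return `  <url>\n    <loc>${escapeXml(loc)}</loc>${lastmod}\n  </url>`;
  });

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<urlset xmlns="${SITEMAP_NS}">`,
    ...entries,
    '</urlset>',
  ].join('\n');
}
```

Because the structure is flat (one <url> element per document), string building plus a single escaping function is enough here; a library such as xmlbuilder2 would only pay off if the output gained nested or attribute-heavy elements.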

Alternatives Considered:

  • Sitemap index with multiple sitemaps - Over-engineering for initial requirement (YAGNI)
  • Paginated sitemaps - Not requested in spec, adds complexity
  • Node.js built-in XML generation (template literals) - Simple for flat structure; this is the preferred starting point
  • ⚠️ xmlbuilder2 npm package - Consider if XML escaping becomes complex (acceptable dependency per constitution if justified)

3. Concurrency Control - FIFO Request Queue

Decision: Implement FIFO queue for /sitemap.xml requests, process one at a time

Rationale (from Session 3 clarification):

  • Prevents concurrent Drive API queries that could cause rate limiting issues
  • Ensures predictable resource usage (single Drive API operation at a time)
  • Simple queue semantics: first request in, first request served
  • If request fails, continue to next in queue (no retry per spec)

Implementation Approach:

  • Use Node.js EventEmitter pattern for queue implementation (built-in module)
  • Maintain array of pending request handlers (FIFO array: push to end, shift from start)
  • Check queue state before processing:
    • If queue empty: start processing immediately
    • If queue busy: add request to pending array
  • Emit 'complete' event to trigger next request processing

Code Pattern:

import { EventEmitter } from 'events';

// FIFO queue: runs one handler at a time, in arrival order.
class SitemapQueue extends EventEmitter {
  constructor() {
    super();
    this.processing = false;
    this.queue = [];
    // Each finished handler emits 'complete', which starts the next one.
    this.on('complete', () => this.processNext());
  }

  // Enqueue a handler; the returned promise settles with the handler's outcome.
  process(handler) {
    return new Promise((resolve, reject) => {
      this.queue.push({ handler, resolve, reject });
      if (!this.processing) this.processNext();
    });
  }

  async processNext() {
    if (this.queue.length === 0) {
      this.processing = false;
      return;
    }
    this.processing = true;
    const { handler, resolve, reject } = this.queue.shift();
    try {
      resolve(await handler());
    } catch (error) {
      reject(error); // A failed request does not block the queue (no retry)
    } finally {
      this.emit('complete'); // Process next in queue
    }
  }
}

Alternatives Considered:

  • Concurrent processing with rate limiting - More complex, not required per clarification
  • External queue (Redis, RabbitMQ) - Over-engineering for single-server deployment
  • Worker pool - Unnecessary complexity for sequential processing requirement

4. Error Handling Strategy

Decision: Status-code-only errors (no response body), crash on fatal errors, immediate 503 passthrough

Rationale (consolidated from all 3 sessions):

  • Clarification: HTTP status code only, no error response body (Session 1)
  • Clarification: Return 429 with Retry-After header for rate limiting (Session 1)
  • Clarification: No retries on Drive API 503, immediately return 503 to client (Session 2)
  • Clarification: Crash with exit code 1 on fatal errors (invalid credentials, port binding failure) (Session 3)
  • Clarification: Return 413 for >50k documents (Session 3)

Error Scenarios:

| Scenario | HTTP Status | Response Body | Retry-After Header | Action |
|---|---|---|---|---|
| Successful sitemap | 200 OK | XML sitemap | N/A | Return sitemap |
| Invalid endpoint | 404 Not Found | Empty | N/A | Status only |
| >50k documents | 413 Payload Too Large | Empty | N/A | Status only |
| Drive API rate limit | 429 Too Many Requests | Empty | Seconds until retry | Status + header |
| OAuth token expired | 401 Unauthorized | Empty | N/A | Token refresh failed |
| Drive API unavailable (503) | 503 Service Unavailable | Empty | N/A | No retry, immediate passthrough |
| Internal error | 500 Internal Server Error | Empty | N/A | Log error, return status |
| Fatal startup error | N/A | N/A | N/A | Log to stderr, exit(1) |

Implementation Approach:

  • Use try-catch blocks in request handler
  • Map googleapis SDK errors to HTTP status codes
  • Set Retry-After header by extracting from Drive API error response
  • Detect fatal errors during startup (invalid credentials, port EADDRINUSE)
  • Use logger.error() for stderr logging before process.exit(1)
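
The mapping step could look like the sketch below. The function names are hypothetical, and the error shape is an assumption: it reads the upstream status from error.response.status with error.code as a fallback, which should be checked against the installed googleapis version.

```javascript
// Hypothetical mapping from a googleapis error to a status-only response.
// Assumption: the upstream HTTP status is exposed as error.response.status,
// with error.code as a fallback.
function mapDriveError(error) {
  const upstream = Number(error?.response?.status ?? error?.code);
  const headers = {};

  switch (upstream) {
    case 401:
      return { status: 401, headers }; // token refresh failed
    case 429: {
      const retryAfter = readHeader(error.response?.headers, 'retry-after');
      if (retryAfter) headers['Retry-After'] = String(retryAfter);
      return { status: 429, headers };
    }
    case 503:
      return { status: 503, headers }; // immediate passthrough, no retry
    default:
      return { status: 500, headers }; // anything else is an internal error
  }
}

// Header containers may be a plain object or expose get(), depending on the
// HTTP client version; both shapes are handled here.
function readHeader(headers, name) {
  if (!headers) return undefined;
  return typeof headers.get === 'function' ? headers.get(name) : headers[name];
}

// Status-only reply: headers if any, never a body.
function sendStatus(res, { status, headers }) {
  res.writeHead(status, headers);
  res.end();
}
```

Keeping the mapping in a pure function separates the "which status" decision from the HTTP response object, so the table of error scenarios can be checked case by case.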

5. Logging Format and Destination

Decision: Plain text logging to stdout/stderr with format [timestamp] [level] message

Rationale (from Session 3 clarification):

  • Simple, human-readable format for container/cloud environments
  • stdout for informational logs (info, debug)
  • stderr for errors (error level)
  • No file-based logging (per constitution: "stdout/stderr only")
  • Timestamp helps with debugging time-sequence issues

Implementation Approach (already exists in codebase):

// src/logger.js (aliased as console.js per constitution)
const formatMessage = (level, message) => {
  const timestamp = new Date().toISOString();
  return `[${timestamp}] [${level.toUpperCase()}] ${message}`;
};

export const logger = {
  log: (msg) => console.log(formatMessage('info', msg)),
  info: (msg) => console.log(formatMessage('info', msg)),
  debug: (msg) => console.log(formatMessage('debug', msg)),
  error: (msg) => console.error(formatMessage('error', msg))
};

Log Events to Capture:

  • Server startup: port, base URL configuration
  • Incoming request: method, endpoint, client IP
  • Request completion: status code, response time
  • Drive API interaction: query start, document count, completion time
  • Errors: error type, message, stack trace (if available)
  • Fatal errors: critical error message before crash

Alternatives Considered:

  • JSON structured logging - Over-engineering for initial requirement, plain text is simpler
  • File-based logging - Explicitly rejected in constitution and clarifications
  • External logging service (Sentry, LogDNA) - Not required, adds dependency

6. Configuration Management

Decision: Split configuration between server settings (config/config.js) and Drive API filter (config/settings.js), load credentials from environment variable

Rationale (from Sessions 2 & 3 clarifications):

  • Clarification: Service Account credentials in env var GOOGLE_SERVICE_ACCOUNT_KEY (Session 2)
  • Clarification: Drive API filter configurable in config/settings.js (Session 3)
  • Server configuration (port, base URL) in config/config.js (per constitution)
  • settings.js loaded into global settings variable (per constitution)

Configuration Schema:

config/config.js:

export default {
  server: {
    port: process.env.PORT || 3000,
    baseUrl: process.env.BASE_URL || 'http://localhost:3000'
  }
};

config/settings.js:

export default {
  drive: {
    // Drive API query filter (q parameter)
    // Default: all files excluding trashed
    query: process.env.DRIVE_QUERY || "trashed = false",
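    // Fields to retrieve (nextPageToken is required to page past the first result set)
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    // Results per page (1000 is the Drive API maximum)
    pageSize: 1000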
  }
};

Environment Variables:

  • GOOGLE_SERVICE_ACCOUNT_KEY (required): JSON key file content (inline string)
  • PORT (optional): Server port (default: 3000)
  • BASE_URL (optional): Base URL for sitemap URLs (default: http://localhost:3000)
  • DRIVE_QUERY (optional): Drive API query filter (default: "trashed = false")

Startup Validation:

  • Check GOOGLE_SERVICE_ACCOUNT_KEY is present and valid JSON
  • Validate JSON contains required fields: client_email, private_key
  • If validation fails: log critical error to stderr, exit(1)
  • Check port is available (catch EADDRINUSE error), exit(1) if unavailable
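
A minimal sketch of these startup checks. The helper names parseServiceAccountKey and exitOnFatal are hypothetical; the log line follows the [timestamp] [level] message format chosen in the logging section.

```javascript
// Hypothetical validator: returns the parsed key or throws with a loggable message.
function parseServiceAccountKey(raw) {
  if (!raw) {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not set');
  }
  let key;
  try {
    key = JSON.parse(raw);
  } catch {
    // Deliberately omit the raw value: it contains the private key.
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not valid JSON');
  }
  for (const field of ['client_email', 'private_key']) {
    if (typeof key?.[field] !== 'string' || key[field] === '') {
      throw new Error(`GOOGLE_SERVICE_ACCOUNT_KEY is missing "${field}"`);
    }
  }
  return key;
}

// Fatal-error path: log to stderr, then exit with code 1.
// `proc` defaults to the real process and is injectable for testing.
function exitOnFatal(message, proc = process) {
  console.error(`[${new Date().toISOString()}] [ERROR] ${message}`);
  proc.exit(1);
}

// Attach to an http.Server before calling listen(); a port conflict surfaces
// as an 'error' event with code EADDRINUSE.
function exitOnListenError(server, port, proc = process) {
  server.on('error', (error) => {
    const reason = error.code === 'EADDRINUSE' ? `port ${port} is already in use` : error.message;
    exitOnFatal(`Server failed to start: ${reason}`, proc);
  });
}
```

At startup the caller would wrap parseServiceAccountKey(process.env.GOOGLE_SERVICE_ACCOUNT_KEY) in try/catch and pass the caught message to exitOnFatal, so a bad key never reaches the Drive client.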

Alternatives Considered:

  • Credentials file on disk - Environment variable approach is more secure and container-friendly
  • Hardcoded Drive query - Explicitly rejected in Session 3 clarification
  • Database configuration storage - Over-engineering for simple key-value config

Technology Stack Validation

Core Dependencies

| Package | Version | Justification | Constitution Compliance |
|---|---|---|---|
| googleapis | ^140.0.0 | Official Google SDK, handles OAuth2/JWT complexity, implements Drive API v3 protocol. Alternative (manual implementation) would take >2 days and risk protocol errors. | APPROVED (documented in plan.md) |

Node.js Built-ins Used

  • http - HTTP server
  • fs - Configuration file loading
  • path - File path utilities
  • events - FIFO queue implementation (EventEmitter)
  • url - URL parsing for request routing

No additional external dependencies required - All other functionality (XML generation, logging, queue) implemented using Node.js built-ins.


Best Practices Research

1. Service Account Security

  • Never log credentials: Filter private_key from logs
  • Validate JSON structure: Check required fields before use
  • Scope restriction: Use minimal scope (readonly)
  • Token lifecycle: Let googleapis SDK manage refresh automatically

2. HTTP Server Best Practices

  • Graceful shutdown: Handle SIGTERM/SIGINT for cleanup
  • Request timeout: Set reasonable timeout (30-60 seconds for Drive API calls)
  • Error boundaries: Catch all errors to prevent crashes (except fatal startup errors)
  • Content-Type headers: Always set appropriate headers (application/xml for sitemap)
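
As an illustration of the shutdown and timeout points, here is a minimal sketch. The helper name installShutdownHandlers and its options are hypothetical; `server` is assumed to be an http.Server created elsewhere. Note that server.setTimeout() is a socket inactivity timeout, which only approximates a per-request limit on Drive API calls.

```javascript
// Hypothetical helper: `server` is an http.Server; `proc` defaults to the
// real process and is injectable for testing.
function installShutdownHandlers(server, { timeoutMs = 60000, proc = process } = {}) {
  // Sockets with no activity for timeoutMs are closed by Node.js.
  server.setTimeout(timeoutMs);

  const shutdown = (signal) => {
    console.log(`[${new Date().toISOString()}] [INFO] Received ${signal}, closing server`);
    // Stop accepting new connections; exit once open connections have ended.
    server.close(() => proc.exit(0));
  };

  proc.on('SIGTERM', () => shutdown('SIGTERM'));
  proc.on('SIGINT', () => shutdown('SIGINT'));
}
```

A clean exit code of 0 is used here because a signal-driven shutdown is not one of the fatal errors that the spec maps to exit code 1.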

3. Google Drive API Best Practices

  • Pagination: Follow nextPageToken/pageToken until no token is returned (files.list returns 100 results per page by default, 1000 at most)
  • Field filtering: Request only needed fields to reduce payload size
  • Rate limiting: Handle 429 errors gracefully (already in spec)
  • Exponential backoff: NOT required per spec (no retries on 503)
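
The pagination and field-filtering points can be sketched as a single loop. The helper name listAllFiles is hypothetical, `drive` is assumed to be a Drive v3 client from googleapis, and the `fields` value must include nextPageToken for the loop to see later pages. Stopping as soon as the limit is exceeded is a design choice of this sketch: it avoids fetching further pages once a 413 is already certain.

```javascript
// Hypothetical helper: collects every matching file across pages.
// `drive` is a Drive v3 client; `fields` must include nextPageToken.
async function listAllFiles(drive, { query, fields, pageSize }, limit = 50000) {
  const files = [];
  let pageToken; // undefined requests the first page

  do {
    const { data } = await drive.files.list({ q: query, fields, pageSize, pageToken });
    files.push(...(data.files ?? []));

    if (files.length > limit) {
      // Fail fast: the caller is expected to translate this into HTTP 413.
      const error = new Error(`More than ${limit} documents match the query`);
      error.statusCode = 413;
      throw error;
    }

    pageToken = data.nextPageToken; // absent on the last page
  } while (pageToken);

  return files;
}
```

No retry or backoff is included, in line with the decision to pass Drive API failures straight through to the client.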

4. Sitemap Generation Best Practices

  • XML escaping: Escape special characters in URLs (&, <, >, ", ')
  • Absolute URLs: Always use full URLs with protocol and domain
  • Date format: Use W3C Datetime (a profile of ISO 8601) for lastmod (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00)
  • URL encoding: Encode document IDs if they contain special characters

Integration Patterns

Request Flow

Client Request → HTTP Server → FIFO Queue → Drive API Query → XML Generation → Response
                                    ↓
                              (Sequential Processing)
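
The flow above could be wired together roughly as follows. This is a sketch under assumptions: createRequestHandler and its injected parameters are hypothetical names, `queue` is expected to expose process(handler) like the FIFO queue pattern in section 3, `generateSitemap` resolves to the XML string, and `toStatus` maps an error to an HTTP status code.

```javascript
// Hypothetical wiring of the request flow; all collaborators are injected.
function createRequestHandler({ queue, generateSitemap, toStatus = () => 500 }) {
  return async function handleRequest(req, res) {
    // req.url is a path, so a placeholder origin is needed to parse it.
    const { pathname } = new URL(req.url, 'http://localhost');

    if (pathname !== '/sitemap.xml') {
      res.writeHead(404); // invalid endpoint: status only, no body
      res.end();
      return;
    }

    try {
      // The queue serializes Drive API work: one sitemap build at a time.
      const xml = await queue.process(generateSitemap);
      res.writeHead(200, { 'Content-Type': 'application/xml; charset=utf-8' });
      res.end(xml);
    } catch (error) {
      res.writeHead(toStatus(error)); // status only, no body
      res.end();
    }
  };
}
```

The returned function has the (req, res) signature that http.createServer() expects, so it can be passed to it directly.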

Authentication Flow

Startup → Load GOOGLE_SERVICE_ACCOUNT_KEY → Parse JSON → Create GoogleAuth Client
            ↓
Request → Check Token Expiry → Auto-Refresh (if needed) → Use Token for Drive API

Error Flow

Error Occurs → Map to HTTP Status → Set Headers (Retry-After if 429) → Return Status Code (no body)
     ↓
  Log Error (stderr) → Include context (request ID, error message)

Open Questions & Assumptions

Resolved via Clarifications (All 3 Sessions)

Authentication method → Service Account with JWT
URL format → /documents/{documentId} (RESTful)
Error response format → Status code only, no body
Rate limiting behavior → 429 with Retry-After header
Drive API 503 handling → No retries, immediate passthrough
Credentials storage → Inline JSON in env var
Logging destination → stdout/stderr only
>50k documents handling → 413 error
Fatal error handling → Crash with exit code 1
Concurrent requests → FIFO queue, sequential processing
Log format → Plain text [timestamp] [level] message
Drive query filter → Configurable in config/settings.js

Assumptions (from spec.md)

  • Service Account has domain-wide delegation if accessing user drives
  • Base URL configured correctly for production environment
  • Node.js v18+ LTS available on deployment platform
  • Network connectivity to googleapis.com available

Summary

All technical unknowns from the specification have been resolved through 3 clarification sessions (10 Q&A pairs total). Key research findings:

  1. Authentication: googleapis SDK with Service Account JWT (load from env var)
  2. Sitemap Protocol: Enforce 50k limit, use standard XML namespace, include lastmod
  3. Concurrency: FIFO queue using Node.js EventEmitter (sequential processing)
  4. Error Handling: Status-only responses, crash on fatal errors, no retries on 503
  5. Logging: Plain text format to stdout/stderr (no files)
  6. Configuration: Split between config.js (server) and settings.js (Drive query filter)

No remaining NEEDS CLARIFICATION items - Ready to proceed to Phase 1 design.