verint.com/google-drive-content-adapter

Peter.Morton 2acb04ad76 Added new feature for document export

2026-03-10 16:25:05 -05:00

Research: Google Drive HTTP Proxy Adapter

Feature: 001-drive-proxy-adapter
Phase: 0 - Outline & Research
Date: 2026-03-07

Overview

This research document consolidates findings from all clarification sessions (10 Q&A pairs across 3 sessions) and investigates technical decisions for building a Node.js HTTP proxy adapter that generates XML sitemaps from Google Drive documents using Service Account authentication.

Research Areas

1. Google Drive API Service Account Authentication

Decision: Use Service Account with JWT-based authentication (server-to-server, no user interaction)

Rationale:

Service Account provides server-to-server authentication without user login flow
JWT tokens generated programmatically from JSON key file (client_email + private_key)
Ideal for proxy/adapter scenarios where the application accesses Drive under its own identity (or, with domain-wide delegation, on behalf of domain users)
Tokens auto-refresh via googleapis SDK (handles expiry transparently)

Implementation Approach:

Load JSON key file from environment variable GOOGLE_SERVICE_ACCOUNT_KEY (inline JSON string)
Use googleapis npm package google.auth.GoogleAuth class with JWT configuration
Set scope to https://www.googleapis.com/auth/drive.readonly (read-only access)
SDK automatically manages token lifecycle (generation, refresh, caching)
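
The approach above could be wired up roughly as follows. This is a minimal sketch, not the project's actual code: createDriveClient is a hypothetical helper, and the google object is passed in as a parameter (in the adapter it would come from import { google } from 'googleapis') so the sketch can be exercised without network access.

```javascript
// Hypothetical helper: builds a read-only Drive v3 client from the inline
// JSON key held in GOOGLE_SERVICE_ACCOUNT_KEY.
// `google` is the object exported by the googleapis package.
function createDriveClient(google, rawKey) {
  const credentials = JSON.parse(rawKey); // throws on malformed JSON

  // GoogleAuth accepts the parsed key directly and manages token refresh.
  const auth = new google.auth.GoogleAuth({
    credentials,
    scopes: ['https://www.googleapis.com/auth/drive.readonly'],
  });

  return google.drive({ version: 'v3', auth });
}
```

A call site would then look like createDriveClient(google, process.env.GOOGLE_SERVICE_ACCOUNT_KEY), performed once at startup so that a malformed key surfaces as a fatal error.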

Alternatives Considered:

❌ OAuth 2.0 user flow - Requires interactive browser login, unsuitable for proxy adapter
❌ API key authentication - Only grants access to publicly shared files; private Drive content requires OAuth credentials
❌ Manual JWT implementation - Complex signing/token exchange, googleapis SDK already provides this

2. XML Sitemap Generation (Sitemap Protocol)

Decision: Generate XML sitemap conforming to sitemaps.org protocol, enforce 50,000 URL limit

Rationale:

Sitemap protocol specifies max 50,000 URLs per sitemap file
Each URL entry requires <loc> (required), optional <lastmod> (from Drive modifiedTime)
Must use proper XML namespace: xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
URLs must be absolute (include base URL prefix)

Implementation Approach:

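Query Drive API: drive.files.list() with fields nextPageToken, files(id, name, mimeType, modifiedTime) (nextPageToken must be requested explicitly, otherwise results beyond the first page cannot be fetched)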
Count results - if >50,000, return HTTP 413 Payload Too Large immediately
Build XML using template literals (Node.js native approach) or minimal XML library
Format URLs as RESTful paths: {baseUrl}/documents/{documentId}
Include <lastmod> using ISO 8601 format from Drive API modifiedTime field
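
The steps above can be sketched with template literals as follows. This is an illustrative sketch under stated assumptions: escapeXml and buildSitemap are hypothetical names, and returning null to signal the 413 case is one possible convention, not something the spec prescribes.

```javascript
const MAX_SITEMAP_URLS = 50000; // limit from the sitemaps.org protocol

// Escape the five XML special characters so values are safe inside elements.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: turns Drive file records ({ id, modifiedTime })
// into a sitemap document. Returns null when the 50,000 URL limit is
// exceeded so the caller can answer with HTTP 413.
function buildSitemap(files, baseUrl) {
  if (files.length > MAX_SITEMAP_URLS) return null;

  const base = baseUrl.replace(/\/+$/, ''); // avoid a double slash
  const entries = files.map((file) => {
    const loc = `${base}/documents/${encodeURIComponent(file.id)}`;
    const lastmod = file.modifiedTime
      ? `\n    <lastmod>${escapeXml(file.modifiedTime)}</lastmod>`
      : '';
    return `  <url>\n    <loc>${escapeXml(loc)}</loc>${lastmod}\n  </url>`;
  });

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}
```

The Drive modifiedTime value is passed through unchanged because it is already an ISO 8601 timestamp; the document ID is percent-encoded before the whole URL is XML-escaped.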

Alternatives Considered:

❌ Sitemap index with multiple sitemaps - Over-engineering for initial requirement (YAGNI)
❌ Paginated sitemaps - Not requested in spec, adds complexity
✅ Node.js built-in XML generation (template literals) - Simple for flat structure
⚠️ xmlbuilder2 npm package - Consider if XML escaping becomes complex (acceptable dependency per constitution if justified)

3. Concurrency Control - FIFO Request Queue

Decision: Implement FIFO queue for /sitemap.xml requests, process one at a time

Rationale (from Session 3 clarification):

Prevents concurrent Drive API queries that could cause rate limiting issues
Ensures predictable resource usage (single Drive API operation at a time)
Simple queue semantics: first request in, first request served
If request fails, continue to next in queue (no retry per spec)

Implementation Approach:

Use Node.js EventEmitter pattern for queue implementation (built-in module)
Maintain array of pending request handlers (FIFO array: push to end, shift from start)
Check queue state when a request arrives:
- If no request is in progress: start processing immediately
- If a request is in progress: append the new request to the pending array
Emit 'complete' event to trigger next request processing

Code Pattern:

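import { EventEmitter } from 'events';

class SitemapQueue extends EventEmitter {
  constructor() {
    super();
    this.processing = false; // true while a handler is running
    this.queue = [];         // pending { handler, resolve, reject } entries, FIFO order
    // Each finished request triggers the next one in line
    this.on('complete', () => this.processNext());
  }

  // Enqueue a handler; the returned promise settles with the handler's outcome
  process(handler) {
    return new Promise((resolve, reject) => {
      this.queue.push({ handler, resolve, reject });
      if (!this.processing) this.processNext();
    });
  }

  async processNext() {
    if (this.queue.length === 0) {
      this.processing = false;
      return;
    }
    this.processing = true;
    const { handler, resolve, reject } = this.queue.shift();
    try {
      const result = await handler();
      resolve(result);
    } catch (error) {
      reject(error); // No retry per spec; only this caller sees the failure
    } finally {
      this.emit('complete'); // Process next in queue
    }
  }
}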
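
For comparison, the same first-in, first-out guarantee can be obtained without EventEmitter by chaining promises. This variant is not the pattern chosen above, only a sketch of an alternative for readers weighing the two; createSerialQueue is a hypothetical name.

```javascript
// Hypothetical alternative: serialize async handlers by chaining each one
// onto the tail of the previous one.
function createSerialQueue() {
  let tail = Promise.resolve();

  return function enqueue(handler) {
    // Run the handler once everything queued before it has settled.
    const result = tail.then(() => handler());
    // Swallow failures on the chain itself so one failed request does not
    // block the next; the caller still sees the rejection through `result`.
    tail = result.catch(() => {});
    return result;
  };
}
```

A request handler would call enqueue(() => generateSitemap()) (generateSitemap being a placeholder for the Drive query plus XML step) and await the returned promise. The EventEmitter version keeps an explicit pending array, which makes queue length easy to log; the chained version trades that visibility for fewer lines.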

Alternatives Considered:

❌ Concurrent processing with rate limiting - More complex, not required per clarification
❌ External queue (Redis, RabbitMQ) - Over-engineering for single-server deployment
❌ Worker pool - Unnecessary complexity for sequential processing requirement

4. Error Handling Strategy

Decision: Status-code-only errors (no response body), crash on fatal errors, immediate 503 passthrough

Rationale (consolidated from all 3 sessions):

Clarification: HTTP status code only, no error response body (Session 1)
Clarification: Return 429 with Retry-After header for rate limiting (Session 1)
Clarification: No retries on Drive API 503, immediately return 503 to client (Session 2)
Clarification: Crash with exit code 1 on fatal errors (invalid credentials, port binding failure) (Session 3)
Clarification: Return 413 for >50k documents (Session 3)

Error Scenarios:

Scenario	HTTP Status	Response Body	Retry-After Header	Action
Successful sitemap	200 OK	XML sitemap	N/A	Return sitemap
Invalid endpoint	404 Not Found	Empty	N/A	Status only
>50k documents	413 Payload Too Large	Empty	N/A	Status only
Drive API rate limit	429 Too Many Requests	Empty	Seconds until retry	Status + header
OAuth token expired	401 Unauthorized	Empty	N/A	Token refresh failed
Drive API unavailable (503)	503 Service Unavailable	Empty	N/A	No retry, immediate passthrough
Internal error	500 Internal Server Error	Empty	N/A	Log error, return status
Fatal startup error	N/A	N/A	N/A	Log to stderr, exit(1)

Implementation Approach:

Use try-catch blocks in request handler
Map googleapis SDK errors to HTTP status codes
Set the Retry-After header from the Drive API error response when it provides one
Detect fatal errors during startup (invalid credentials, port EADDRINUSE)
Use logger.error() for stderr logging before process.exit(1)
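
The mapping in the table above might be expressed as a single function. This is a sketch under assumptions: mapDriveError is a hypothetical name, and the error shape (an upstream status in response.status, or a numeric code, with headers in response.headers as either a plain object or a Headers-like object) is assumed rather than confirmed against the SDK version in use.

```javascript
// Hypothetical mapping from a failed Drive call to the adapter's response.
// Assumes the error carries the upstream status in `response.status`
// (falling back to a numeric `code`) and headers in `response.headers`.
function mapDriveError(err) {
  const upstream =
    err?.response?.status ?? (typeof err?.code === 'number' ? err.code : undefined);
  const headers = {};

  switch (upstream) {
    case 429: {
      // Pass Retry-After through when the upstream response supplied one.
      const h = err.response?.headers;
      const retryAfter =
        typeof h?.get === 'function' ? h.get('retry-after') : h?.['retry-after'];
      if (retryAfter != null) headers['Retry-After'] = String(retryAfter);
      return { status: 429, headers };
    }
    case 401:
      return { status: 401, headers }; // token refresh failed
    case 503:
      return { status: 503, headers }; // no retry, immediate passthrough
    default:
      return { status: 500, headers }; // anything else is an internal error
  }
}
```

In the request handler's catch block the result would be applied with res.writeHead(status, headers) followed by res.end() with no arguments, which keeps the response body empty as the clarifications require.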

5. Logging Format and Destination

Decision: Plain text logging to stdout/stderr with format [timestamp] [level] message

Rationale (from Session 3 clarification):

Simple, human-readable format for container/cloud environments
stdout for informational logs (info, debug)
stderr for errors (error level)
No file-based logging (per constitution: "stdout/stderr only")
Timestamp helps with debugging time-sequence issues

Implementation Approach (already exists in codebase):

// src/logger.js (aliased as console.js per constitution)
const formatMessage = (level, message) => {
  const timestamp = new Date().toISOString();
  return `[${timestamp}] [${level.toUpperCase()}] ${message}`;
};

export const logger = {
  log: (msg) => console.log(formatMessage('info', msg)),
  info: (msg) => console.log(formatMessage('info', msg)),
  debug: (msg) => console.log(formatMessage('debug', msg)),
  error: (msg) => console.error(formatMessage('error', msg))
};

Log Events to Capture:

Server startup: port, base URL configuration
Incoming request: method, endpoint, client IP
Request completion: status code, response time
Drive API interaction: query start, document count, completion time
Errors: error type, message, stack trace (if available)
Fatal errors: critical error message before crash

Alternatives Considered:

❌ JSON structured logging - Over-engineering for initial requirement, plain text is simpler
❌ File-based logging - Explicitly rejected in constitution and clarifications
❌ External logging service (Sentry, LogDNA) - Not required, adds dependency

6. Configuration Management

Decision: Split configuration between server settings (config/config.js) and Drive API filter (config/settings.js), load credentials from environment variable

Rationale (from Sessions 2 & 3 clarifications):

Clarification: Service Account credentials in env var GOOGLE_SERVICE_ACCOUNT_KEY (Session 2)
Clarification: Drive API filter configurable in config/settings.js (Session 3)
Server configuration (port, base URL) in config/config.js (per constitution)
settings.js loaded into global settings variable (per constitution)

Configuration Schema:

config/config.js:

export default {
  server: {
    port: Number(process.env.PORT) || 3000, // env vars are strings; fall back to 3000 if unset or non-numeric
    baseUrl: process.env.BASE_URL || 'http://localhost:3000'
  }
};

config/settings.js:

export default {
  drive: {
    // Drive API query filter (q parameter)
    // Default: all files excluding trashed
    query: process.env.DRIVE_QUERY || "trashed = false",
    // Fields to retrieve (nextPageToken is required for pagination)
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    // Maximum results per page
    pageSize: 1000
  }
};

Environment Variables:

GOOGLE_SERVICE_ACCOUNT_KEY (required): JSON key file content (inline string)
PORT (optional): Server port (default: 3000)
BASE_URL (optional): Base URL for sitemap URLs (default: http://localhost:3000)
DRIVE_QUERY (optional): Drive API query filter (default: "trashed = false")

Startup Validation:

Check GOOGLE_SERVICE_ACCOUNT_KEY is present and valid JSON
Validate JSON contains required fields: client_email, private_key
If validation fails: log critical error to stderr, exit(1)
Check port is available (catch EADDRINUSE error), exit(1) if unavailable
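
The credential checks above could look like the following. This is a minimal sketch, assuming hypothetical helper names (parseServiceAccountKey, loadKeyOrExit) and illustrative error messages; the log line imitates the [timestamp] [level] message format rather than calling the project's logger.

```javascript
// Hypothetical startup check for GOOGLE_SERVICE_ACCOUNT_KEY.
// Returns the parsed key, or throws an Error describing what is wrong.
function parseServiceAccountKey(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not set');
  }

  let key;
  try {
    key = JSON.parse(raw);
  } catch {
    // Do not echo the raw value: it may contain the private key.
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not valid JSON');
  }

  if (key === null || typeof key !== 'object') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY must be a JSON object');
  }
  for (const field of ['client_email', 'private_key']) {
    if (typeof key[field] !== 'string' || key[field] === '') {
      throw new Error(`GOOGLE_SERVICE_ACCOUNT_KEY is missing "${field}"`);
    }
  }
  return key;
}

// Fatal-error wrapper: log to stderr and exit with code 1, per the spec.
function loadKeyOrExit(env = process.env) {
  try {
    return parseServiceAccountKey(env.GOOGLE_SERVICE_ACCOUNT_KEY);
  } catch (error) {
    console.error(`[${new Date().toISOString()}] [ERROR] ${error.message}`);
    process.exit(1);
  }
}
```

Keeping the parsing step free of process.exit() makes it testable in isolation; only the thin wrapper terminates the process.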

Alternatives Considered:

❌ Credentials file on disk - Environment variable approach is more secure and container-friendly
❌ Hardcoded Drive query - Explicitly rejected in Session 3 clarification
❌ Database configuration storage - Over-engineering for simple key-value config

Technology Stack Validation

Core Dependencies

Package	Version	Justification	Constitution Compliance
`googleapis`	^140.0.0	Official Google SDK, handles OAuth2/JWT complexity, implements Drive API v3 protocol. Alternative (manual implementation) would take >2 days and risk protocol errors.	✅ APPROVED (documented in plan.md)

Node.js Built-ins Used

http - HTTP server
fs - Configuration file loading
path - File path utilities
events - FIFO queue implementation (EventEmitter)
url - URL parsing for request routing

No additional external dependencies required - All other functionality (XML generation, logging, queue) implemented using Node.js built-ins.

Best Practices Research

1. Service Account Security

Never log credentials: Filter private_key from logs
Validate JSON structure: Check required fields before use
Scope restriction: Use minimal scope (readonly)
Token lifecycle: Let googleapis SDK manage refresh automatically
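
One way to keep private_key out of log output is a JSON.stringify replacer. The sketch below is illustrative only; safeStringify is a hypothetical helper name and the [REDACTED] marker is an arbitrary choice.

```javascript
// Hypothetical helper: serializes a value for logging while masking any
// property named private_key, at any nesting depth.
function safeStringify(value) {
  return JSON.stringify(value, (key, val) =>
    key === 'private_key' ? '[REDACTED]' : val
  );
}
```

Logging safeStringify(credentials) instead of the raw object means an accidental debug statement cannot leak the key.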

2. HTTP Server Best Practices

Graceful shutdown: Handle SIGTERM/SIGINT for cleanup
Request timeout: Set reasonable timeout (30-60 seconds for Drive API calls)
Error boundaries: Catch all errors to prevent crashes (except fatal startup errors)
Content-Type headers: Always set appropriate headers (application/xml for sitemap)
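
The shutdown and timeout points above might be combined as follows. This is a sketch under assumptions: installShutdownHandlers is a hypothetical name, the 60-second default is taken from the upper end of the range mentioned above, and the proc parameter exists only so the function can be exercised without touching the real process object.

```javascript
// Hypothetical helper: applies a socket timeout and stops the server
// cleanly on SIGTERM/SIGINT. `server` is a Node.js http.Server.
function installShutdownHandlers(server, { timeoutMs = 60000, proc = process } = {}) {
  // Time out sockets that stay idle longer than a Drive call should take.
  server.setTimeout(timeoutMs);

  const shutdown = (signal) => {
    console.log(`[${new Date().toISOString()}] [INFO] ${signal} received, closing server`);
    // close() stops accepting connections and waits for in-flight requests.
    server.close(() => proc.exit(0));
  };

  for (const signal of ['SIGTERM', 'SIGINT']) {
    proc.once(signal, () => shutdown(signal));
  }
}
```

It would be called once, right after server.listen(), for example installShutdownHandlers(server, { timeoutMs: 30000 }).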

3. Google Drive API Best Practices

Pagination: Follow nextPageToken/pageToken until all pages are fetched (Drive API v3 files.list returns 100 results per page by default, 1000 maximum)
Field filtering: Request only needed fields to reduce payload size
Rate limiting: Handle 429 errors gracefully (already in spec)
Exponential backoff: NOT required per spec (no retries on 503)
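
The pagination and field-filtering points can be sketched together. The following is illustrative, not the project's implementation: listAllFiles is a hypothetical helper, drive is assumed to be a googleapis Drive v3 client, and returning null on overflow is one possible way to signal the 413 case.

```javascript
// Hypothetical helper: collects every file matching `query`, following
// nextPageToken until Drive reports no further pages. Stops early and
// returns null once more than `limit` files have been seen, so the
// caller can respond with 413 without fetching the rest.
async function listAllFiles(drive, { query, pageSize = 1000, limit = 50000 } = {}) {
  const files = [];
  let pageToken;

  do {
    const res = await drive.files.list({
      q: query,
      pageSize,
      pageToken,
      // nextPageToken must be requested explicitly when `fields` is set.
      fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    });
    files.push(...(res.data.files ?? []));
    if (files.length > limit) return null;
    pageToken = res.data.nextPageToken;
  } while (pageToken);

  return files;
}
```

Errors thrown by drive.files.list() are deliberately not caught here, so a 429 or 503 from Drive propagates to the request handler without any retry.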

4. Sitemap Generation Best Practices

XML escaping: Escape special characters in URLs (&, <, >, ", ')
Absolute URLs: Always use full URLs with protocol and domain
Date format: Use ISO 8601 format for lastmod (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00)
URL encoding: Encode document IDs if they contain special characters

Integration Patterns

Request Flow

Client Request → HTTP Server → FIFO Queue → Drive API Query → XML Generation → Response
                                    ↓
                              (Sequential Processing)

Authentication Flow

Startup → Load GOOGLE_SERVICE_ACCOUNT_KEY → Parse JSON → Create GoogleAuth Client
            ↓
Request → Check Token Expiry → Auto-Refresh (if needed) → Use Token for Drive API

Error Flow

Error Occurs → Map to HTTP Status → Set Headers (Retry-After if 429) → Return Status Code (no body)
     ↓
  Log Error (stderr) → Include context (request ID, error message)

Open Questions & Assumptions

Resolved via Clarifications (All 3 Sessions)

✅ Authentication method → Service Account with JWT
✅ URL format → /documents/{documentId} (RESTful)
✅ Error response format → Status code only, no body
✅ Rate limiting behavior → 429 with Retry-After header
✅ Drive API 503 handling → No retries, immediate passthrough
✅ Credentials storage → Inline JSON in env var
✅ Logging destination → stdout/stderr only
✅ >50k documents handling → 413 error
✅ Fatal error handling → Crash with exit code 1
✅ Concurrent requests → FIFO queue, sequential processing
✅ Log format → Plain text [timestamp] [level] message
✅ Drive query filter → Configurable in config/settings.js

Assumptions (from spec.md)

Service Account has domain-wide delegation if accessing user drives
Base URL configured correctly for production environment
Node.js v18+ LTS available on deployment platform
Network connectivity to googleapis.com available

Summary

All technical unknowns from the specification have been resolved through 3 clarification sessions (10 Q&A pairs total). Key research findings:

Authentication: googleapis SDK with Service Account JWT (load from env var)
Sitemap Protocol: Enforce 50k limit, use standard XML namespace, include lastmod
Concurrency: FIFO queue using Node.js EventEmitter (sequential processing)
Error Handling: Status-only responses, crash on fatal errors, no retries on 503
Logging: Plain text format to stdout/stderr (no files)
Configuration: Split between config.js (server) and settings.js (Drive query filter)

No remaining NEEDS CLARIFICATION items - Ready to proceed to Phase 1 design.
