Initial Version of sitemap.xml spec
368
specs/001-drive-proxy-adapter/research.md
Normal file
@@ -0,0 +1,368 @@
# Research: Google Drive HTTP Proxy Adapter

**Feature**: 001-drive-proxy-adapter
**Phase**: 0 - Outline & Research
**Date**: 2026-03-07

## Overview

This research document consolidates findings from all clarification sessions (10 Q&A pairs across 3 sessions) and investigates technical decisions for building a Node.js HTTP proxy adapter that generates XML sitemaps from Google Drive documents using Service Account authentication.

## Research Areas

### 1. Google Drive API Service Account Authentication

**Decision**: Use Service Account with JWT-based authentication (server-to-server, no user interaction)

**Rationale**:
- Service Account provides server-to-server authentication without a user login flow
- JWT tokens are generated programmatically from the JSON key file (`client_email` + `private_key`)
- Ideal for proxy/adapter scenarios where the application acts on behalf of domain users
- Tokens auto-refresh via the googleapis SDK (expiry is handled transparently)

**Implementation Approach**:
- Load the JSON key from the environment variable `GOOGLE_SERVICE_ACCOUNT_KEY` (inline JSON string)
- Use the `google.auth.GoogleAuth` class from the `googleapis` npm package with JWT configuration
- Set scope to `https://www.googleapis.com/auth/drive.readonly` (read-only access)
- SDK automatically manages the token lifecycle (generation, refresh, caching)

**Alternatives Considered**:
- ❌ OAuth 2.0 user flow - Requires interactive browser login, unsuitable for proxy adapter
- ❌ API key authentication - Only covers publicly shared files; private Drive content requires OAuth 2.0 credentials
- ❌ Manual JWT implementation - Complex signing/token exchange, googleapis SDK already provides this

**References**:
- [Google Service Account Documentation](https://cloud.google.com/iam/docs/service-accounts)
- [googleapis Node.js Client](https://github.com/googleapis/google-api-nodejs-client)

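The implementation approach above could be wired up roughly as follows. This is a minimal sketch, not the adapter's actual code: `createDriveClient` is a hypothetical helper name, and the `google` object is passed in as a parameter (an assumption made here so the helper can be exercised with a stand-in). In the adapter it would be the `google` export of the `googleapis` package.

```javascript
const DRIVE_READONLY_SCOPE = 'https://www.googleapis.com/auth/drive.readonly';

// Hypothetical helper: builds a Drive v3 client from the inline JSON key.
// `google` is the export of the `googleapis` package, injected by the caller.
function createDriveClient(google, rawKey) {
  // The env var holds the whole JSON key file as a single string.
  const { client_email, private_key } = JSON.parse(rawKey);

  // GoogleAuth signs the JWT and refreshes access tokens on its own.
  const auth = new google.auth.GoogleAuth({
    credentials: { client_email, private_key },
    scopes: [DRIVE_READONLY_SCOPE],
  });

  return google.drive({ version: 'v3', auth });
}

// Intended usage in the adapter (not executed here):
//   import { google } from 'googleapis';
//   const drive = createDriveClient(google, process.env.GOOGLE_SERVICE_ACCOUNT_KEY);
```

Only `client_email` and `private_key` are forwarded, so other fields of the key file never reach the auth client configuration.
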
---

### 2. XML Sitemap Generation (Sitemap Protocol)

**Decision**: Generate XML sitemap conforming to sitemaps.org protocol, enforce 50,000 URL limit

**Rationale**:
- Sitemap protocol specifies max 50,000 URLs per sitemap file
- Each URL entry has a required `<loc>` and an optional `<lastmod>` (from Drive `modifiedTime`)
- Must use proper XML namespace: `xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"`
- URLs must be absolute (include base URL prefix)

**Implementation Approach**:
- Query Drive API: `drive.files.list()` with fields `nextPageToken, files(id, name, mimeType, modifiedTime)` (`nextPageToken` must be part of the field mask, otherwise the response omits it and pagination cannot continue)
- Count results - if >50,000, return HTTP 413 Payload Too Large immediately
- Build XML using template literals (Node.js native approach) or minimal XML library
- Format URLs as RESTful paths: `{baseUrl}/documents/{documentId}`
- Include `<lastmod>` using ISO 8601 format from Drive API `modifiedTime` field

**Alternatives Considered**:
- ❌ Sitemap index with multiple sitemaps - Over-engineering for initial requirement (YAGNI)
- ❌ Paginated sitemaps - Not requested in spec, adds complexity
- ✅ Node.js built-in XML generation (template literals) - Simple for flat structure
- ⚠️ `xmlbuilder2` npm package - Consider if XML escaping becomes complex (acceptable dependency per constitution if justified)

**References**:
- [Sitemaps.org Protocol](https://www.sitemaps.org/protocol.html)
- [Google Sitemap Guidelines](https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap)

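As an illustration of the template-literal approach described above, the sketch below builds the sitemap from a list of Drive file records. The names `buildSitemap`, `escapeXml`, and `MAX_URLS` are hypothetical. Returning `null` when the limit is exceeded is an assumption of this sketch; the caller would translate it into the HTTP 413 response.

```javascript
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';
const MAX_URLS = 50000;

// Escape the five characters that are special in XML text.
function escapeXml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => ({
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;',
  }[ch]));
}

// files: array of { id, modifiedTime } records as returned by files.list.
// Returns the XML string, or null when the 50,000 URL limit is exceeded.
function buildSitemap(files, baseUrl) {
  if (files.length > MAX_URLS) return null;

  const base = baseUrl.replace(/\/+$/, ''); // avoid a double slash in URLs
  const entries = files.map((file) => {
    const loc = escapeXml(`${base}/documents/${encodeURIComponent(file.id)}`);
    // <lastmod> is optional, so it is only emitted when a timestamp exists.
    const lastmod = file.modifiedTime
      ? `\n    <lastmod>${escapeXml(file.modifiedTime)}</lastmod>`
      : '';
    return `  <url>\n    <loc>${loc}</loc>${lastmod}\n  </url>`;
  });

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<urlset xmlns="${SITEMAP_NS}">`,
    ...entries,
    '</urlset>',
  ].join('\n');
}
```

Document IDs are URL-encoded first and the full URL is XML-escaped afterwards, so both layers of escaping are applied in the order a parser will undo them.
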
---

### 3. Concurrency Control - FIFO Request Queue

**Decision**: Implement FIFO queue for `/sitemap.xml` requests, process one at a time

**Rationale** (from Session 3 clarification):
- Prevents concurrent Drive API queries that could cause rate limiting issues
- Ensures predictable resource usage (single Drive API operation at a time)
- Simple queue semantics: first request in, first request served
- If request fails, continue to next in queue (no retry per spec)

**Implementation Approach**:
- Use Node.js EventEmitter pattern for queue implementation (built-in module)
- Maintain array of pending request handlers (FIFO array: push to end, shift from start)
- Check queue state before processing:
  - If queue empty: start processing immediately
  - If queue busy: add request to pending array
- Emit 'complete' event to trigger next request processing

**Code Pattern**:
```javascript
import { EventEmitter } from 'events';

class SitemapQueue extends EventEmitter {
  constructor() {
    super();
    this.processing = false; // true while a handler is running
    this.queue = [];         // pending { handler, resolve, reject }, oldest first
    // Each finished request emits 'complete', which starts the next one.
    this.on('complete', () => this.processNext());
  }

  // Enqueue a handler; the returned promise settles with the handler's outcome.
  process(handler) {
    return new Promise((resolve, reject) => {
      this.queue.push({ handler, resolve, reject });
      if (!this.processing) this.processNext();
    });
  }

  async processNext() {
    if (this.queue.length === 0) {
      this.processing = false;
      return;
    }
    this.processing = true;
    const { handler, resolve, reject } = this.queue.shift();
    try {
      const result = await handler();
      resolve(result);
    } catch (error) {
      reject(error); // No retry: the failure is reported to this caller only
    } finally {
      this.emit('complete'); // Process next in queue
    }
  }
}
```

**Alternatives Considered**:
- ❌ Concurrent processing with rate limiting - More complex, not required per clarification
- ❌ External queue (Redis, RabbitMQ) - Over-engineering for single-server deployment
- ❌ Worker pool - Unnecessary complexity for sequential processing requirement

---

### 4. Error Handling Strategy

**Decision**: Status-code-only errors (no response body), crash on fatal errors, immediate 503 passthrough

**Rationale** (consolidated from all 3 sessions):
- **Clarification**: HTTP status code only, no error response body (Session 1)
- **Clarification**: Return 429 with `Retry-After` header for rate limiting (Session 1)
- **Clarification**: No retries on Drive API 503, immediately return 503 to client (Session 2)
- **Clarification**: Crash with exit code 1 on fatal errors (invalid credentials, port binding failure) (Session 3)
- **Clarification**: Return 413 for >50k documents (Session 3)

**Error Scenarios**:

| Scenario | HTTP Status | Response Body | Retry-After Header | Action |
|----------|-------------|---------------|-------------------|--------|
| Successful sitemap | 200 OK | XML sitemap | N/A | Return sitemap |
| Invalid endpoint | 404 Not Found | Empty | N/A | Status only |
| >50k documents | 413 Payload Too Large | Empty | N/A | Status only |
| Drive API rate limit | 429 Too Many Requests | Empty | Seconds until retry | Status + header |
| OAuth token expired | 401 Unauthorized | Empty | N/A | Token refresh failed |
| Drive API unavailable (503) | 503 Service Unavailable | Empty | N/A | No retry, immediate passthrough |
| Internal error | 500 Internal Server Error | Empty | N/A | Log error, return status |
| Fatal startup error | N/A | N/A | N/A | Log to stderr, exit(1) |

**Implementation Approach**:
- Use try-catch blocks in request handler
- Map googleapis SDK errors to HTTP status codes
- Set `Retry-After` header by extracting from Drive API error response
- Detect fatal errors during startup (invalid credentials, port EADDRINUSE)
- Use `logger.error()` for stderr logging before `process.exit(1)`

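The mapping from SDK errors to status-only responses could look roughly like the sketch below. `mapDriveError`, `readHeader`, and `PASSTHROUGH` are hypothetical names. The sketch assumes that errors thrown by Drive calls expose the upstream HTTP status at `err.response.status` and the upstream headers at `err.response.headers`; that shape should be checked against the installed SDK version before relying on it.

```javascript
// Upstream statuses passed through unchanged; anything else becomes 500.
const PASSTHROUGH = new Set([401, 429, 503]);

// Read a header from either a plain object or a Headers-like object.
function readHeader(headers, name) {
  if (!headers) return undefined;
  if (typeof headers.get === 'function') return headers.get(name) ?? undefined;
  return headers[name];
}

// err: error thrown by a Drive API call.
// ASSUMPTION: status at err.response.status, headers at err.response.headers.
function mapDriveError(err) {
  const upstream = err && err.response ? err.response.status : undefined;
  const status = PASSTHROUGH.has(upstream) ? upstream : 500;

  const headers = {};
  if (status === 429) {
    // Forward the upstream Retry-After value when one is present.
    const retryAfter = readHeader(err.response.headers, 'retry-after');
    if (retryAfter !== undefined) headers['Retry-After'] = String(retryAfter);
  }
  return { status, headers }; // no body: responses are status-code-only
}
```

A request handler would then call `res.writeHead(status, headers)` followed by `res.end()` with no payload, matching the "Empty" response body column in the table above.
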
---

### 5. Logging Format and Destination

**Decision**: Plain text logging to stdout/stderr with format `[timestamp] [level] message`

**Rationale** (from Session 3 clarification):
- Simple, human-readable format for container/cloud environments
- stdout for informational logs (info, debug)
- stderr for errors (error level)
- No file-based logging (per constitution: "stdout/stderr only")
- Timestamp helps with debugging time-sequence issues

**Implementation Approach** (already exists in codebase):
```javascript
// src/logger.js (aliased as console.js per constitution)
const formatMessage = (level, message) => {
  const timestamp = new Date().toISOString();
  return `[${timestamp}] [${level.toUpperCase()}] ${message}`;
};

export const logger = {
  log: (msg) => console.log(formatMessage('info', msg)),
  info: (msg) => console.log(formatMessage('info', msg)),
  debug: (msg) => console.log(formatMessage('debug', msg)),
  error: (msg) => console.error(formatMessage('error', msg))
};
```

**Log Events to Capture**:
- Server startup: port, base URL configuration
- Incoming request: method, endpoint, client IP
- Request completion: status code, response time
- Drive API interaction: query start, document count, completion time
- Errors: error type, message, stack trace (if available)
- Fatal errors: critical error message before crash

**Alternatives Considered**:
- ❌ JSON structured logging - Over-engineering for initial requirement, plain text is simpler
- ❌ File-based logging - Explicitly rejected in constitution and clarifications
- ❌ External logging service (Sentry, LogDNA) - Not required, adds dependency

---

### 6. Configuration Management

**Decision**: Split configuration between server settings (config/config.js) and Drive API filter (config/settings.js), load credentials from environment variable

**Rationale** (from Sessions 2 & 3 clarifications):
- **Clarification**: Service Account credentials in env var `GOOGLE_SERVICE_ACCOUNT_KEY` (Session 2)
- **Clarification**: Drive API filter configurable in `config/settings.js` (Session 3)
- Server configuration (port, base URL) in `config/config.js` (per constitution)
- settings.js loaded into global `settings` variable (per constitution)

**Configuration Schema**:

`config/config.js`:
```javascript
export default {
  server: {
    port: process.env.PORT || 3000,
    baseUrl: process.env.BASE_URL || 'http://localhost:3000'
  }
};
```

`config/settings.js`:
```javascript
export default {
  drive: {
    // Drive API query filter (q parameter)
    // Default: all files excluding trashed
    query: process.env.DRIVE_QUERY || "trashed = false",
    // Fields to retrieve (nextPageToken is needed to page past the first result set)
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    // Maximum results per page
    pageSize: 1000
  }
};
```

**Environment Variables**:
- `GOOGLE_SERVICE_ACCOUNT_KEY` (required): JSON key file content (inline string)
- `PORT` (optional): Server port (default: 3000)
- `BASE_URL` (optional): Base URL for sitemap URLs (default: http://localhost:3000)
- `DRIVE_QUERY` (optional): Drive API query filter (default: "trashed = false")

**Startup Validation**:
- Check `GOOGLE_SERVICE_ACCOUNT_KEY` is present and valid JSON
- Validate JSON contains required fields: `client_email`, `private_key`
- If validation fails: log critical error to stderr, exit(1)
- Check port is available (catch EADDRINUSE error), exit(1) if unavailable

**Alternatives Considered**:
- ❌ Credentials file on disk - Environment variable approach is more secure and container-friendly
- ❌ Hardcoded Drive query - Explicitly rejected in Session 3 clarification
- ❌ Database configuration storage - Over-engineering for simple key-value config

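The credential checks listed under Startup Validation could be sketched as below. `parseServiceAccountKey` and `loadKeyOrExit` are hypothetical names, and the error messages are illustrative. The logging and exit functions are passed in as parameters (an assumption of this sketch) so the logic can be exercised without terminating the process; at startup they would be `logger.error` and `process.exit`.

```javascript
// Returns the parsed key, or throws an Error describing what is wrong.
function parseServiceAccountKey(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not set');
  }
  let key;
  try {
    key = JSON.parse(raw);
  } catch {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not valid JSON');
  }
  if (key === null || typeof key !== 'object') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not a JSON object');
  }
  for (const field of ['client_email', 'private_key']) {
    if (typeof key[field] !== 'string' || key[field] === '') {
      throw new Error(`GOOGLE_SERVICE_ACCOUNT_KEY is missing "${field}"`);
    }
  }
  return key;
}

// Startup wiring: report the problem on stderr, then exit with code 1.
function loadKeyOrExit(env, logError, exit) {
  try {
    return parseServiceAccountKey(env.GOOGLE_SERVICE_ACCOUNT_KEY);
  } catch (error) {
    // The message names the problem but never echoes the key material.
    logError(`Fatal: ${error.message}`);
    exit(1);
    return undefined;
  }
}

// Intended usage at startup (not executed here):
//   const key = loadKeyOrExit(process.env, logger.error, process.exit);
```
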
---

## Technology Stack Validation

### Core Dependencies

| Package | Version | Justification | Constitution Compliance |
|---------|---------|---------------|------------------------|
| `googleapis` | ^140.0.0 | Official Google SDK, handles OAuth2/JWT complexity, implements Drive API v3 protocol. Alternative (manual implementation) would take >2 days and risk protocol errors. | ✅ APPROVED (documented in plan.md) |

### Node.js Built-ins Used
- `http` - HTTP server
- `fs` - Configuration file loading
- `path` - File path utilities
- `events` - FIFO queue implementation (EventEmitter)
- `url` - URL parsing for request routing

**No additional external dependencies required** - All other functionality (XML generation, logging, queue) implemented using Node.js built-ins.

---

## Best Practices Research

### 1. Service Account Security
- **Never log credentials**: Filter `private_key` from logs
- **Validate JSON structure**: Check required fields before use
- **Scope restriction**: Use minimal scope (readonly)
- **Token lifecycle**: Let googleapis SDK manage refresh automatically

### 2. HTTP Server Best Practices
- **Graceful shutdown**: Handle SIGTERM/SIGINT for cleanup
- **Request timeout**: Set reasonable timeout (30-60 seconds for Drive API calls)
- **Error boundaries**: Catch all errors to prevent crashes (except fatal startup errors)
- **Content-Type headers**: Always set appropriate headers (application/xml for sitemap)

### 3. Google Drive API Best Practices
- **Pagination**: Use `pageToken` to fetch further pages when there are more than 1000 results (1000 is the maximum `pageSize` for `files.list`; the default is 100)
- **Field filtering**: Request only needed fields to reduce payload size
- **Rate limiting**: Handle 429 errors gracefully (already in spec)
- **Exponential backoff**: NOT required per spec (no retries on 503)

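The pagination and field-filtering points above can be combined into one loop, sketched below. `listAllFiles` is a hypothetical helper, and the early stop once the count exceeds `limit` is an assumption of this sketch: it avoids fetching further pages when the request will end in a 413 anyway. The `drive` argument is expected to be a Drive v3 client, whose `files.list` call resolves to an object carrying the payload under `data`.

```javascript
// drive: a Drive v3 client (or a stand-in with the same files.list shape).
// Collects every page; stops early once more than `limit` files were seen.
async function listAllFiles(drive, { query, limit = 50000 } = {}) {
  const files = [];
  let pageToken; // undefined requests the first page

  do {
    const res = await drive.files.list({
      q: query,
      pageSize: 1000, // maximum page size accepted by files.list
      pageToken,
      // nextPageToken must be requested explicitly when a field mask is set.
      fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    });
    files.push(...(res.data.files ?? []));
    pageToken = res.data.nextPageToken;
  } while (pageToken && files.length <= limit);

  return files;
}
```

The caller would compare the returned length against 50,000 and answer with 413 when it is exceeded; with the early stop, at most one extra page beyond the limit is fetched.
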
### 4. Sitemap Generation Best Practices
- **XML escaping**: Escape special characters in URLs (&, <, >, ", ')
- **Absolute URLs**: Always use full URLs with protocol and domain
- **Date format**: Use ISO 8601 format for lastmod (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00)
- **URL encoding**: Encode document IDs if they contain special characters

---

## Integration Patterns

### Request Flow
```
Client Request → HTTP Server → FIFO Queue → Drive API Query → XML Generation → Response
                                   ↓
                        (Sequential Processing)
```

### Authentication Flow
```
Startup → Load GOOGLE_SERVICE_ACCOUNT_KEY → Parse JSON → Create GoogleAuth Client
                                                                   ↓
Request → Check Token Expiry → Auto-Refresh (if needed) → Use Token for Drive API
```

### Error Flow
```
Error Occurs → Map to HTTP Status → Set Headers (Retry-After if 429) → Return Status Code (no body)
     ↓
Log Error (stderr) → Include context (request ID, error message)
```

---

## Open Questions & Assumptions

### Resolved via Clarifications (All 3 Sessions)
- ✅ Authentication method → Service Account with JWT
- ✅ URL format → `/documents/{documentId}` (RESTful)
- ✅ Error response format → Status code only, no body
- ✅ Rate limiting behavior → 429 with Retry-After header
- ✅ Drive API 503 handling → No retries, immediate passthrough
- ✅ Credentials storage → Inline JSON in env var
- ✅ Logging destination → stdout/stderr only
- ✅ >50k documents handling → 413 error
- ✅ Fatal error handling → Crash with exit code 1
- ✅ Concurrent requests → FIFO queue, sequential processing
- ✅ Log format → Plain text `[timestamp] [level] message`
- ✅ Drive query filter → Configurable in config/settings.js

### Assumptions (from spec.md)
- Service Account has domain-wide delegation if accessing user drives
- Base URL configured correctly for production environment
- Node.js v18+ LTS available on deployment platform
- Network connectivity to googleapis.com available

---

## Summary

All technical unknowns from the specification have been resolved through 3 clarification sessions (10 Q&A pairs total). Key research findings:

1. **Authentication**: googleapis SDK with Service Account JWT (load from env var)
2. **Sitemap Protocol**: Enforce 50k limit, use standard XML namespace, include lastmod
3. **Concurrency**: FIFO queue using Node.js EventEmitter (sequential processing)
4. **Error Handling**: Status-only responses, crash on fatal errors, no retries on 503
5. **Logging**: Plain text format to stdout/stderr (no files)
6. **Configuration**: Split between config.js (server) and settings.js (Drive query filter)

**No remaining NEEDS CLARIFICATION items** - Ready to proceed to Phase 1 design.