Added new feature for document export
493
specs/001-sitemap/data-model.md
Normal file
@@ -0,0 +1,493 @@
# Data Model: Google Drive HTTP Proxy Adapter
|
||||
|
||||
**Feature**: 001-drive-proxy-adapter
|
||||
**Phase**: 1 - Design & Contracts
|
||||
**Date**: 2026-03-07
|
||||
|
||||
## Overview
|
||||
|
||||
This document defines the data structures, entities, and their relationships for the Google Drive HTTP Proxy Adapter. The system is stateless (no persistence layer) with all entities representing runtime state or API payloads.
|
||||
|
||||
---
|
||||
|
||||
## Core Entities
|
||||
|
||||
### 1. Document
|
||||
|
||||
Represents a file in Google Drive. Extracted from Drive API response.
|
||||
|
||||
**JSDoc Type Definition**:
|
||||
```javascript
/**
 * @typedef {Object} Document
 * @property {string} id - Google Drive file ID (unique identifier)
 * @property {string} name - Document title/filename
 * @property {string} mimeType - MIME type (e.g., 'application/pdf', 'text/plain')
 * @property {string} [modifiedTime] - ISO 8601 timestamp of last modification (optional)
 */
```
|
||||
|
||||
**Validation Rules**:
|
||||
- `id`: REQUIRED, non-empty string
|
||||
- `name`: REQUIRED, non-empty string
|
||||
- `mimeType`: REQUIRED, non-empty string
|
||||
- `modifiedTime`: OPTIONAL, must be valid ISO 8601 format if present
|
||||
|
||||
**Source**: Drive API `files.list()` response with fields: `files(id, name, mimeType, modifiedTime)`
|
||||
|
||||
**Usage**:
|
||||
- Retrieved during sitemap generation
|
||||
- Transformed into SitemapEntry for XML output
|
||||
- No filtering by mimeType (all file types included per spec)
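The validation rules above could be checked with a small helper along these lines. This is a minimal sketch, not part of the spec: the name `validateDocument` and the choice to return a list of messages (rather than throw) are assumptions, and the timestamp pattern assumes Drive returns RFC 3339 values such as `2026-03-07T10:00:00.000Z`.

```javascript
// Hypothetical helper illustrating the Document validation rules.
// Returns an array of error messages; an empty array means the document is valid.
const ISO_TIMESTAMP = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

function validateDocument(doc) {
  const errors = [];

  // id, name and mimeType are all REQUIRED, non-empty strings.
  for (const field of ['id', 'name', 'mimeType']) {
    if (typeof doc[field] !== 'string' || doc[field].length === 0) {
      errors.push(`${field} must be a non-empty string`);
    }
  }

  // modifiedTime is OPTIONAL, but must be a parseable ISO 8601 timestamp if present.
  if (doc.modifiedTime !== undefined) {
    const ok =
      typeof doc.modifiedTime === 'string' &&
      ISO_TIMESTAMP.test(doc.modifiedTime) &&
      !Number.isNaN(Date.parse(doc.modifiedTime));
    if (!ok) {
      errors.push('modifiedTime must be an ISO 8601 timestamp');
    }
  }

  return errors;
}
```

Returning messages instead of throwing lets a caller decide whether to skip a malformed document or fail the whole request; the spec does not say which is intended.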
|
||||
|
||||
---
|
||||
|
||||
### 2. SitemapEntry
|
||||
|
||||
Represents a single URL entry in the XML sitemap.
|
||||
|
||||
**JSDoc Type Definition**:
|
||||
```javascript
/**
 * @typedef {Object} SitemapEntry
 * @property {string} loc - Absolute URL to document (RESTful format: /documents/{id})
 * @property {string} [lastmod] - ISO 8601 date of last modification (YYYY-MM-DD format)
 */
```
|
||||
|
||||
**Validation Rules**:
|
||||
- `loc`: REQUIRED, must be absolute URL (http:// or https://), properly escaped XML special chars
|
||||
- `lastmod`: OPTIONAL, must be ISO 8601 date format (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00)
|
||||
|
||||
**Transformation from Document**:
|
||||
```javascript
/**
 * Transform Document to SitemapEntry
 * @param {Document} doc - Source document from Drive API
 * @param {string} baseUrl - Base URL for sitemap (from config)
 * @returns {SitemapEntry}
 */
function toSitemapEntry(doc, baseUrl) {
  return {
    loc: `${baseUrl}/documents/${encodeURIComponent(doc.id)}`,
    lastmod: doc.modifiedTime ? new Date(doc.modifiedTime).toISOString().split('T')[0] : undefined
  };
}
```
|
||||
|
||||
**Usage**:
|
||||
- Generated during XML sitemap construction
|
||||
- Each entry becomes `<url><loc>...</loc><lastmod>...</lastmod></url>` in XML
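To make the XML mapping concrete, the sketch below renders one entry and escapes the five XML special characters named in the validation rules. The function names `escapeXml` and `renderUrlEntry` are illustrative assumptions, not identifiers from the spec.

```javascript
// Hypothetical helpers: escape XML special characters and render one <url> element.
function escapeXml(value) {
  // '&' must be replaced first so already-produced entities are not escaped twice.
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

function renderUrlEntry(entry) {
  // lastmod is optional: omit the tag entirely when it is undefined.
  const lastmod = entry.lastmod ? `<lastmod>${escapeXml(entry.lastmod)}</lastmod>` : '';
  return `<url><loc>${escapeXml(entry.loc)}</loc>${lastmod}</url>`;
}
```

For example, `renderUrlEntry({ loc: 'http://example.com/documents/1A2B3C4D', lastmod: '2026-03-07' })` yields a single `<url>` element with both child tags, while an entry without `lastmod` yields only `<loc>`.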
|
||||
|
||||
---
|
||||
|
||||
### 3. HTTPRequestContext
|
||||
|
||||
Represents the context for an incoming HTTP request.
|
||||
|
||||
**JSDoc Type Definition**:
|
||||
```javascript
/**
 * @typedef {Object} HTTPRequestContext
 * @property {string} requestId - Unique identifier for request tracing (UUID)
 * @property {string} method - HTTP method (e.g., 'GET')
 * @property {string} path - Request path (e.g., '/sitemap.xml')
 * @property {string} clientIp - Client IP address
 * @property {number} timestamp - Request start time (Unix timestamp in ms)
 */
```
|
||||
|
||||
**Validation Rules**:
|
||||
- `requestId`: REQUIRED, unique per request (generated via crypto.randomUUID())
|
||||
- `method`: REQUIRED, HTTP method string
|
||||
- `path`: REQUIRED, URL path string
|
||||
- `clientIp`: REQUIRED, IP address string
|
||||
- `timestamp`: REQUIRED, positive integer
|
||||
|
||||
**Generation**:
|
||||
```javascript
import { randomUUID } from 'crypto';

/**
 * Create request context from incoming HTTP request
 * @param {http.IncomingMessage} req - Node.js HTTP request object
 * @returns {HTTPRequestContext}
 */
function createRequestContext(req) {
  return {
    requestId: randomUUID(),
    method: req.method,
    path: req.url,
    clientIp: req.socket.remoteAddress,
    timestamp: Date.now()
  };
}
```
|
||||
|
||||
**Usage**:
|
||||
- Created at request entry point
|
||||
- Used for logging (trace requests through logs)
|
||||
- Passed to queue for processing
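As an illustration of the logging use, a context could be turned into one structured log line per response. This is only a sketch: the spec does not define a log format, so the JSON shape, the `formatLogLine` name, and the `durationMs` field are all assumptions.

```javascript
// Hypothetical log formatter: one JSON line per completed request.
// `now` is injectable so the duration can be computed deterministically in tests.
function formatLogLine(ctx, statusCode, now = Date.now()) {
  return JSON.stringify({
    requestId: ctx.requestId, // ties together all log lines of one request
    method: ctx.method,
    path: ctx.path,
    clientIp: ctx.clientIp,
    status: statusCode,
    durationMs: now - ctx.timestamp // elapsed time since the context was created
  });
}
```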
|
||||
|
||||
---
|
||||
|
||||
### 4. ServiceAccountCredentials
|
||||
|
||||
Represents Google Service Account JWT authentication credentials.
|
||||
|
||||
**JSDoc Type Definition**:
|
||||
```javascript
/**
 * @typedef {Object} ServiceAccountCredentials
 * @property {string} client_email - Service Account email address
 * @property {string} private_key - RSA private key (PEM format)
 * @property {string} project_id - Google Cloud project ID
 * @property {string} [token_uri] - OAuth token endpoint (default: googleapis.com)
 */
```
|
||||
|
||||
**Validation Rules**:
|
||||
- `client_email`: REQUIRED, valid email format ending with `.gserviceaccount.com`
|
||||
- `private_key`: REQUIRED, must start with `-----BEGIN PRIVATE KEY-----`
|
||||
- `project_id`: REQUIRED, non-empty string
|
||||
- `token_uri`: OPTIONAL, defaults to Google's OAuth endpoint
|
||||
|
||||
**Source**: Loaded from `GOOGLE_SERVICE_ACCOUNT_KEY` environment variable (inline JSON)
|
||||
|
||||
**Validation Function**:
|
||||
```javascript
/**
 * Validate Service Account credentials structure
 * @param {Object} creds - Parsed JSON credentials
 * @throws {Error} If validation fails
 */
function validateCredentials(creds) {
  if (!creds.client_email || !creds.client_email.endsWith('.gserviceaccount.com')) {
    throw new Error('Invalid client_email in Service Account credentials');
  }
  if (!creds.private_key || !creds.private_key.startsWith('-----BEGIN PRIVATE KEY-----')) {
    throw new Error('Invalid private_key in Service Account credentials');
  }
  if (!creds.project_id) {
    throw new Error('Missing project_id in Service Account credentials');
  }
}
```
|
||||
|
||||
**Security**:
|
||||
- NEVER log `private_key` field
|
||||
- Mask in logs: `client_email: xxx***@project.iam.gserviceaccount.com`
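The masking rule could be implemented roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and keeping the first three characters of the local part is inferred from the `xxx***@...` pattern above rather than stated by the spec.

```javascript
// Hypothetical helper: keep up to three leading characters of the local part,
// replace the rest with '***', and keep the domain unchanged.
function maskClientEmail(email) {
  const at = email.indexOf('@');
  if (at === -1) return '***'; // not an email address: reveal nothing
  const visible = email.slice(0, Math.min(3, at));
  return `${visible}***${email.slice(at)}`;
}

// Hypothetical helper: build a log-safe view of the credentials.
// private_key is deliberately never copied into the result.
function describeCredentialsForLog(creds) {
  return {
    client_email: maskClientEmail(creds.client_email),
    project_id: creds.project_id
  };
}
```

Building a separate log-safe object, instead of deleting fields from the real credentials, avoids any chance of mutating the object the auth client uses.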
|
||||
|
||||
---
|
||||
|
||||
### 5. Configuration
|
||||
|
||||
Represents application runtime configuration.
|
||||
|
||||
**JSDoc Type Definition**:
|
||||
```javascript
/**
 * @typedef {Object} ServerConfig
 * @property {number} port - HTTP server port
 * @property {string} baseUrl - Base URL for sitemap links (absolute URL)
 */

/**
 * @typedef {Object} DriveConfig
 * @property {string} query - Drive API query filter (q parameter)
 * @property {string} fields - Fields to retrieve from Drive API
 * @property {number} pageSize - Maximum results per page (Drive API pagination)
 * @property {string} scope - OAuth scope for Drive access
 */

/**
 * @typedef {Object} Configuration
 * @property {ServerConfig} server - HTTP server configuration
 * @property {DriveConfig} drive - Google Drive API configuration
 */
```
|
||||
|
||||
**Default Values**:
|
||||
```javascript
const DEFAULT_CONFIG = {
  server: {
    port: 3000,
    baseUrl: 'http://localhost:3000'
  },
  drive: {
    query: 'trashed = false',
    // nextPageToken must be requested explicitly when `fields` is set;
    // without it the response carries no token and pagination cannot continue.
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    pageSize: 1000,
    scope: 'https://www.googleapis.com/auth/drive.readonly'
  }
};
```
|
||||
|
||||
**Loading**:
|
||||
- `config/config.js`: Exports server configuration (port, baseUrl from env vars)
|
||||
- `config/settings.js`: Exports Drive configuration (query from env var, loaded into global `settings`)
|
||||
|
||||
**Validation**:
|
||||
- `port`: Must be 1-65535
|
||||
- `baseUrl`: Must be valid absolute URL (http:// or https://)
|
||||
- `query`: Non-empty string (Drive API query syntax)
|
||||
- `pageSize`: 1-1000 (Drive API limit)
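These configuration checks might look like the following. It is a minimal sketch, assuming a `validateConfig` helper (a hypothetical name) that receives the `Configuration` object and reports all problems at once rather than stopping at the first.

```javascript
// Hypothetical validator for the Configuration object described above.
// Returns an array of error messages; empty means the configuration is valid.
function validateConfig(config) {
  const errors = [];
  const { port, baseUrl } = config.server;
  const { query, pageSize } = config.drive;

  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push('port must be an integer between 1 and 65535');
  }

  // The WHATWG URL constructor throws on strings that are not absolute URLs.
  let parsed = null;
  try {
    parsed = new URL(baseUrl);
  } catch {
    parsed = null;
  }
  if (!parsed || (parsed.protocol !== 'http:' && parsed.protocol !== 'https:')) {
    errors.push('baseUrl must be an absolute http:// or https:// URL');
  }

  if (typeof query !== 'string' || query.trim() === '') {
    errors.push('query must be a non-empty string');
  }

  if (!Number.isInteger(pageSize) || pageSize < 1 || pageSize > 1000) {
    errors.push('pageSize must be an integer between 1 and 1000');
  }

  return errors;
}
```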
|
||||
|
||||
---
|
||||
|
||||
### 6. RequestQueue
|
||||
|
||||
Represents the FIFO queue for /sitemap.xml requests.
|
||||
|
||||
**JSDoc Type Definition**:
|
||||
```javascript
/**
 * @typedef {Object} QueuedRequest
 * @property {Function} handler - Async function to execute (returns Promise)
 * @property {Function} resolve - Promise resolve callback
 * @property {Function} reject - Promise reject callback
 */

/**
 * @typedef {Object} RequestQueue
 * @property {boolean} processing - Whether a request is currently being processed
 * @property {QueuedRequest[]} queue - Array of pending requests (FIFO)
 */
```
|
||||
|
||||
**State Transitions**:
|
||||
```
IDLE (processing: false, queue: [])
  ↓ New request arrives
PROCESSING (processing: true, queue: [])
  ↓ New request arrives while processing
PROCESSING (processing: true, queue: [req1])
  ↓ Current request completes
PROCESSING (processing: true, queue: []) → Process req1
  ↓ req1 completes, queue empty
IDLE (processing: false, queue: [])
```
|
||||
|
||||
**Operations**:
|
||||
- `enqueue(handler)`: Add request to queue, start processing if idle
|
||||
- `processNext()`: Process next request in FIFO order, recursively call until queue empty
|
||||
|
||||
**Implementation**: See research.md Section 3 for EventEmitter-based code pattern
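For readers without research.md at hand, the two operations can be sketched with plain Promises. Note that this is not the EventEmitter-based pattern referenced above; it is a simplified stand-in, and `createRequestQueue` is a hypothetical name.

```javascript
// Minimal Promise-based FIFO queue sketch (an assumption, not the research.md code).
// At most one handler runs at a time; others wait in arrival order.
function createRequestQueue() {
  const state = { processing: false, queue: [] };

  async function processNext() {
    const next = state.queue.shift();
    if (!next) {
      state.processing = false; // queue drained: back to IDLE
      return;
    }
    state.processing = true;
    try {
      // Run the handler and settle the caller's promise with its outcome.
      next.resolve(await next.handler());
    } catch (err) {
      next.reject(err);
    }
    processNext(); // continue with the next queued request, if any
  }

  function enqueue(handler) {
    return new Promise((resolve, reject) => {
      state.queue.push({ handler, resolve, reject });
      if (!state.processing) processNext(); // start immediately when idle
    });
  }

  return { enqueue, state };
}
```

A failing handler rejects only its own caller's promise; the queue keeps draining, so one failed sitemap request does not block the requests behind it.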
|
||||
|
||||
---
|
||||
|
||||
## State Machines
|
||||
|
||||
### Authentication State
|
||||
|
||||
```
UNINITIALIZED
  ↓ Load credentials from env var
VALIDATING
  ↓ Parse JSON, validate structure
  ├─ Success → AUTHENTICATED
  └─ Failure → FATAL_ERROR (exit(1))

AUTHENTICATED
  ↓ Token expiry detected during request
REFRESHING
  ├─ Success → AUTHENTICATED
  └─ Failure → UNAUTHORIZED (return 401)
```
|
||||
|
||||
**Note**: googleapis SDK manages token refresh automatically. Our code only handles:
|
||||
1. Initial credential loading/validation (startup)
|
||||
2. Error mapping (401 if refresh fails during request)
|
||||
|
||||
---
|
||||
|
||||
### Request Processing State
|
||||
|
||||
```
RECEIVED
  ↓ Create RequestContext, log request
QUEUED
  ↓ Wait for queue availability (FIFO)
PROCESSING
  ↓ Query Drive API
  ├─ Success (≤50k docs) → GENERATING_XML
  ├─ Error (>50k docs) → PAYLOAD_TOO_LARGE (413)
  ├─ Error (Rate limit) → RATE_LIMITED (429 + Retry-After)
  ├─ Error (503) → SERVICE_UNAVAILABLE (503, no retry)
  └─ Error (Other) → INTERNAL_ERROR (500)

GENERATING_XML
  ↓ Build sitemap XML from documents
  ├─ Success → COMPLETED (200 + XML)
  └─ Error → INTERNAL_ERROR (500)

COMPLETED
  ↓ Log response, return to client
```
|
||||
|
||||
---
|
||||
|
||||
## Data Flow Diagrams
|
||||
|
||||
### Sitemap Generation Flow
|
||||
|
||||
```
[Client] --GET /sitemap.xml--> [Server]
                                  ↓
                       [Create RequestContext]
                                  ↓
                      [Enqueue in RequestQueue]
                                  ↓
                     [Wait for queue slot (FIFO)]
                                  ↓
                    [Query Drive API files.list()]
                                  ↓
                      [Paginate through results]
                                  ↓
                        [Check count ≤ 50,000]
                                  ↓
                      YES ←───────┴───────→ NO
                       ↓                     ↓
             [Transform Documents]      [Return 413]
               to SitemapEntries
                       ↓
              [Generate XML string]
                       ↓
               [Return 200 + XML]
```
|
||||
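The pagination and count-check steps of this flow could be sketched as below. The Drive call is abstracted behind a hypothetical `listPage(pageToken)` function that resolves to `{ files, nextPageToken }`, so the sketch makes no claim about the googleapis client's exact call shape; `collectDocuments` and `TooManyDocumentsError` are also illustrative names.

```javascript
const MAX_SITEMAP_DOCUMENTS = 50000;

// Hypothetical error type that a caller could map to HTTP 413.
class TooManyDocumentsError extends Error {}

// Collect all pages into a single array, enforcing the document limit.
// listPage is an assumed abstraction: (pageToken) => Promise<{ files, nextPageToken }>.
async function collectDocuments(listPage, limit = MAX_SITEMAP_DOCUMENTS) {
  const documents = [];
  let pageToken; // undefined requests the first page

  do {
    const page = await listPage(pageToken);
    documents.push(...(page.files || []));

    // Stop as soon as the limit is exceeded instead of fetching the remaining pages.
    if (documents.length > limit) {
      throw new TooManyDocumentsError(`more than ${limit} documents`);
    }

    pageToken = page.nextPageToken; // absent on the last page, which ends the loop
  } while (pageToken);

  return documents;
}
```

The diagram checks the count after pagination; checking after each page, as here, gives the same outcome while avoiding extra API calls once the limit is already exceeded. That early exit is a design choice of this sketch.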
|
||||
### Error Handling Flow
|
||||
|
||||
```
[Error Occurs]
  ↓
[Identify Error Type]
  ↓
  ├─ Drive API 429 → Extract rate limit info → Set Retry-After → 429
  ├─ Drive API 503 → No retry → 503
  ├─ Document count > 50k → 413
  ├─ Token refresh failed → 401
  ├─ Invalid endpoint → 404
  └─ Unknown error → Log stack → 500
  ↓
[Set status code, NO response body]
  ↓
[Log error to stderr with context]
  ↓
[Return response to client]
```
|
||||
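The "identify error type" step could be expressed as one mapping function. The error shape used here (`kind`, `upstreamStatus`, `retryAfterSeconds`) is hypothetical, as is the 60-second fallback for `Retry-After`; the spec only fixes the resulting status codes.

```javascript
// Hypothetical mapping from an internal error to an HTTP status and headers.
// Assumed error fields: kind, upstreamStatus, retryAfterSeconds.
function mapErrorToResponse(err) {
  const headers = {};
  let status = 500; // unknown errors fall through to 500

  if (err.kind === 'too_many_documents') {
    status = 413;
  } else if (err.kind === 'token_refresh_failed') {
    status = 401;
  } else if (err.kind === 'not_found') {
    status = 404;
  } else if (err.upstreamStatus === 429) {
    status = 429;
    // 60 seconds is an assumed default for when Drive gives no usable hint.
    headers['Retry-After'] = String(err.retryAfterSeconds ?? 60);
  } else if (err.upstreamStatus === 503) {
    status = 503; // passed through without retrying
  }

  return { status, headers };
}
```

A caller would then write the status and headers and end the response without a body, for example `res.writeHead(status, headers); res.end();` on a Node.js `http.ServerResponse`.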
|
||||
---
|
||||
|
||||
## API Response Formats
|
||||
|
||||
### Successful Sitemap Response (200 OK)
|
||||
|
||||
**Headers**:
|
||||
```
Content-Type: application/xml; charset=utf-8
Content-Length: {size}
```
|
||||
|
||||
**Body** (XML):
|
||||
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/documents/1A2B3C4D</loc>
    <lastmod>2026-03-07</lastmod>
  </url>
  <url>
    <loc>http://example.com/documents/5E6F7G8H</loc>
    <lastmod>2026-03-06</lastmod>
  </url>
</urlset>
```
|
||||
|
||||
### Error Responses (4xx/5xx)
|
||||
|
||||
**All error responses**:
|
||||
- **Headers**: No Content-Type (empty body)
|
||||
- **Body**: Empty (per spec: status code only, no body)
|
||||
- **Special case**: 429 includes `Retry-After: {seconds}` header
|
||||
|
||||
**Status codes**:
|
||||
- 401 Unauthorized: Token refresh failed
- 404 Not Found: Invalid endpoint
- 413 Payload Too Large: >50,000 documents
- 429 Too Many Requests: Drive API rate limit (includes Retry-After header)
- 500 Internal Server Error: Unexpected error
- 503 Service Unavailable: Drive API unavailable (no retry)
|
||||
|
||||
---
|
||||
|
||||
## Validation Rules Summary
|
||||
|
||||
### Input Validation
|
||||
- Environment variables:
|
||||
- `GOOGLE_SERVICE_ACCOUNT_KEY`: Required, valid JSON with client_email/private_key
|
||||
- `PORT`: Optional, 1-65535
|
||||
- `BASE_URL`: Optional, valid absolute URL
|
||||
- `DRIVE_QUERY`: Optional, non-empty string
|
||||
|
||||
### Output Validation
|
||||
- Sitemap XML:
|
||||
- Valid XML structure (well-formed)
|
||||
- Proper namespace declaration
|
||||
- All URLs properly escaped (XML entities: &, <, >, ", ')
|
||||
- All URLs absolute (include protocol + domain)
|
||||
- Document count ≤ 50,000
|
||||
|
||||
### Runtime Validation
|
||||
- HTTP requests:
|
||||
- Only GET method for /sitemap.xml (others return 404)
|
||||
- Only /sitemap.xml path supported (others return 404)
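Both runtime rules reduce to a single routing check. The sketch below assumes that a query string on `/sitemap.xml` is ignored when matching the path, which the spec does not state; `isSitemapRequest` is a hypothetical name.

```javascript
// Hypothetical routing check: true only for GET requests to /sitemap.xml.
// Assumption: a query string (e.g. /sitemap.xml?x=1) does not change the match.
function isSitemapRequest(method, url) {
  const path = url.split('?')[0]; // drop the query string, if any
  return method === 'GET' && path === '/sitemap.xml';
}
```

Any request for which this returns `false` would receive a 404 with an empty body, per the rules above.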
|
||||
|
||||
---
|
||||
|
||||
## Edge Cases & Error Handling
|
||||
|
||||
| Scenario | Data Impact | Response |
|----------|-------------|----------|
| Empty Drive (0 documents) | Empty urlset in XML | 200 OK with empty sitemap |
| Exactly 50,000 documents | Valid sitemap | 200 OK |
| 50,001 documents | Abort XML generation | 413 Payload Too Large |
| Drive API pagination (>1000 docs) | Multiple API calls, single result set | 200 OK after all pages collected |
| Document with special chars in ID | URL-encode document ID | Properly encoded loc URL |
| Document with no modifiedTime | SitemapEntry.lastmod undefined | Omit `<lastmod>` tag from XML |
| Concurrent requests | Queue up to N requests | Process sequentially (FIFO) |
| Request while processing | Add to queue array | Wait for turn, then process |
| Fatal error (invalid creds) | Cannot initialize auth client | Log error, exit(1) |
| Port already in use | Cannot bind server | Log error, exit(1) |
|
||||
|
||||
---
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Memory Usage
|
||||
- **Document array**: ~100 bytes per document × 50k max = ~5MB peak
- **XML string**: ~200 bytes per entry × 50k max = ~10MB peak
- **Total estimated**: ~20MB for max load (the two items above sum to ~15MB; the rest is allowance for overhead), within the 256MB constraint
|
||||
|
||||
### API Call Efficiency
|
||||
- Use `fields` parameter to request only needed data (reduces payload size)
|
||||
- Pagination: 1000 documents per page (Drive API limit)
|
||||
- For 50k documents: ~50 API calls (sequential, within same request processing)
|
||||
|
||||
### Caching Strategy
|
||||
- **NO CACHING**: Per spec requirement "each sitemap request fetches current list"
|
||||
- Fresh data on every request (trade-off: latency vs. freshness)
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
This data model provides:
|
||||
1. **Clear entity definitions** with JSDoc type annotations (per constitution: JavaScript + JSDoc)
|
||||
2. **Validation rules** for all inputs and outputs
|
||||
3. **State machines** for authentication and request processing
|
||||
4. **Data flow diagrams** showing transformation pipelines
|
||||
5. **Error handling patterns** for all edge cases
|
||||
6. **Performance constraints** aligned with success criteria (<256MB memory, <5s response time)
|
||||
|
||||
All entities are stateless runtime structures - no persistence layer required.
|
||||