# Google Drive Sitemap Adapter

HTTP service that generates XML sitemaps listing all accessible documents in a Google Drive account. Uses Service Account authentication for secure, automated access.

## Features

- **Sitemap Generation**: XML sitemap at `/sitemap.xml` listing all accessible Google Drive documents
- **RESTful URLs**: Document links in the format `/documents/{documentId}`, per the sitemap protocol
- **Service Account Auth**: JWT-based authentication using Google Service Account credentials
- **Pagination Support**: Handles large document sets (up to 50,000 URLs, the sitemap protocol limit)
- **50k Limit Enforcement**: Returns a 413 error if the document count exceeds the sitemap protocol limit
- **FIFO Request Queue**: Concurrent requests are processed sequentially (one at a time)
- **Rate Limit Handling**: Returns 429 with a `Retry-After` header when the Drive API rate-limits requests
- **No Retry on 503**: Fails immediately when the Drive API is unavailable (per spec)
- **Minimal Dependencies**: Only the `googleapis` package is required

## Quick Start

### Prerequisites

- Node.js v18.x or later
- Google Cloud Project with Drive API enabled
- Service Account credentials with Drive API access

### Setup

1. **Install dependencies**:

   ```bash
   npm install
   ```

2. **Configure Service Account** (see `specs/001-drive-proxy-adapter/quickstart.md` for detailed steps):
   - Create a Service Account in Google Cloud Console
   - Download the service account key JSON file
   - Share Drive files/folders with the service account email
   - Place the key file at `config/service-account-key.json`

3. **Configure environment**:

   ```bash
   cp .env.example .env
   # Edit .env with your service account email
   ```

4. **Start the server**:

   ```bash
   npm start
   # or for development with auto-reload:
   npm run dev
   ```

5. **Generate sitemap**:

   ```bash
   curl http://localhost:3000/sitemap.xml
   ```

### Usage Examples

```bash
# Get sitemap of all documents
curl http://localhost:3000/sitemap.xml

# Verify XML format
curl http://localhost:3000/sitemap.xml | xmllint --noout -

# Count documents in sitemap (one <loc> element per URL)
curl -s http://localhost:3000/sitemap.xml | grep -o '<loc>' | wc -l
```

## Architecture

### Monolithic Design

This project follows a **monolithic architecture** as specified in the project constitution:

- **Single Route File**: ALL routing, business logic, and Drive API integration in `src/proxy.js` (~350 LOC)
- **Utility Modules**: Separate files for auth, logging, XML utils (constitution-compliant separation of concerns)
- **Configuration as Data**: JSON configuration in `config/default.json` loaded into `global.config` at startup
- **Minimal Dependencies**: Only the `googleapis` package for Drive API integration

### Why Monolithic?

Rationale defined in the constitution:

1. **Simplicity**: Easy to understand, debug, and maintain
2. **Direct Code Flow**: No dependency injection, no framework magic
3. **YAGNI Principle**: No premature abstraction for a focused service

### Structure

```
src/
├── server.js              # HTTP server, config loader, validation
├── proxy.js               # Request handler with FIFO queue integration
├── drive-client.js        # Drive API integration with 50k limit enforcement
├── sitemap-generator.js   # Sitemap XML generation with RESTful URLs
├── queue.js               # FIFO request queue (sequential processing)
├── auth.js                # Service Account authentication
├── logger.js              # Structured logging utility
├── utils.js               # Request ID, validation
└── xml-utils.js           # XML escaping
```

## Testing

### Test Structure

Tests follow a **TDD workflow** with real assertions:

```
tests/
├── contract/       # API contract tests (HTTP interface)
├── integration/    # Drive API integration tests
└── unit/           # Pure function unit tests
```

### Running Tests

```bash
# All tests
npm test

# Specific test suites
npm run test:unit
npm run test:integration
npm run test:contract
```

### Coverage Requirements

- **Minimum**: 80% code coverage (enforced)
- **Tests Written First**: TDD mandatory per constitution
- **Real Assertions**: No placeholder tests

## Configuration

Configuration is loaded from `config/default.json` and merged with environment variables:

```json
{
  "server": {
    "port": 3000,
    "host": "0.0.0.0",
    "baseUrl": "http://localhost:3000"
  },
  "google": {
    "serviceAccountEmail": "service@project.iam.gserviceaccount.com",
    "serviceAccountKeyPath": "./config/service-account-key.json",
    "scopes": ["https://www.googleapis.com/auth/drive.readonly"]
  },
  "sitemap": {
    "maxUrls": 50000
  },
  "logging": {
    "level": "info"
  }
}
```

Environment variables override JSON config (e.g., `PORT`, `GOOGLE_SERVICE_ACCOUNT_EMAIL`).
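
The override behavior described above could look roughly like the following. This is a minimal sketch, not the code in `src/server.js`: the helper name `applyEnvOverrides` is hypothetical, and the mapping of `PORT` to `server.port` and `GOOGLE_SERVICE_ACCOUNT_EMAIL` to `google.serviceAccountEmail` is an assumption based on the variable names.

```javascript
// Hypothetical helper (not part of this repository): overlays environment
// variables on the object parsed from config/default.json.
// Assumption: PORT maps to server.port and GOOGLE_SERVICE_ACCOUNT_EMAIL maps
// to google.serviceAccountEmail.
function applyEnvOverrides(config, env) {
  // Copy the sections that may change so the input object is not mutated.
  const merged = {
    ...config,
    server: { ...config.server },
    google: { ...config.google },
  };

  if (env.PORT !== undefined && env.PORT !== '') {
    const port = Number(env.PORT);
    // Reject non-numeric or out-of-range values instead of silently ignoring them.
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      throw new Error(`Invalid PORT value: ${env.PORT}`);
    }
    merged.server.port = port;
  }

  if (env.GOOGLE_SERVICE_ACCOUNT_EMAIL) {
    merged.google.serviceAccountEmail = env.GOOGLE_SERVICE_ACCOUNT_EMAIL;
  }

  return merged;
}

// Example defaults, shaped like the JSON shown above.
const defaults = {
  server: { port: 3000, host: '0.0.0.0', baseUrl: 'http://localhost:3000' },
  google: { serviceAccountEmail: 'service@project.iam.gserviceaccount.com' },
};

// In a real process the second argument would be process.env.
const config = applyEnvOverrides(defaults, {
  PORT: '8080',
  GOOGLE_SERVICE_ACCOUNT_EMAIL: 'crawler@example.iam.gserviceaccount.com',
});

console.log(config.server.port); // 8080
console.log(config.google.serviceAccountEmail); // crawler@example.iam.gserviceaccount.com
```

Validating `PORT` at load time fits the startup validation that `src/server.js` is described as performing; whether the actual implementation rejects or ignores bad values is not stated here.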

## API Documentation

### Endpoints

- `GET /sitemap.xml` - XML sitemap of all accessible documents (200 OK with XML body)
- `GET /*` - All other paths return 404 Not Found (empty body)

### Response Headers

Successful sitemap response (200 OK):

- `Content-Type: application/xml; charset=utf-8`
- `X-Request-Id` - Request tracing ID (values begin with `req_`)
- `X-Document-Count` - Number of documents in sitemap

### Error Responses

All errors return **HTTP status code only** with **no response body** (per specification):

- `401 Unauthorized` - Service account authentication failed
- `404 Not Found` - Path is not `/sitemap.xml`
- `413 Payload Too Large` - Document count exceeds 50,000 (sitemap protocol limit)
- `429 Too Many Requests` - Drive API rate limit exceeded (includes a `Retry-After` header, in seconds)
- `500 Internal Server Error` - Server error
- `503 Service Unavailable` - Drive API unavailable (NO RETRY per specification)

## Performance Characteristics

- **Cold Start**: < 10 seconds to accepting requests
- **Sitemap Generation**: < 5 seconds for 10,000 documents
- **Concurrent Requests**: 10+ without degradation
- **Memory Usage**: < 256MB under normal load

## Development

### Project Structure

```
google-drive-content-adapter/
├── config/
│   └── default.json              # Configuration
├── src/
│   ├── server.js                 # HTTP server
│   ├── proxy.js                  # Request handler (monolithic)
│   ├── auth.js                   # Service Account auth
│   ├── logger.js                 # Structured logging
│   ├── utils.js                  # Utilities
│   └── xml-utils.js              # XML escaping
├── tests/
│   ├── contract/                 # API contract tests
│   ├── integration/              # Integration tests
│   └── unit/                     # Unit tests
├── specs/
│   └── 001-drive-proxy-adapter/  # Feature spec, plan, tasks
├── .env.example                  # Environment template
├── package.json                  # Dependencies and scripts
└── README.md                     # This file
```

### Development Workflow

1. **Write Tests First** (TDD)
2. **Implement Minimum Code**
3. **Run Tests**: `npm test`
4. **Run in Development**: `npm run dev`

## Deployment

### Docker

```dockerfile
FROM node:18-alpine
WORKDIR /app

# Install production dependencies only (--omit=dev replaces the deprecated --production flag)
COPY package*.json ./
RUN npm ci --omit=dev

COPY src/ ./src/
COPY config/ ./config/

EXPOSE 3000
CMD ["node", "src/server.js"]
```

```bash
docker build -t drive-sitemap-adapter .
# Mount the local config directory so the service account key is available in the container
docker run -p 3000:3000 -v "$(pwd)/config:/app/config" drive-sitemap-adapter
```

### Direct Node.js

```bash
NODE_ENV=production npm start
```

## Troubleshooting

### Authentication Failed (401)

- Verify the service account key file exists at `config/service-account-key.json`
- Check that the service account email matches the configuration
- Ensure the Drive API is enabled in the Google Cloud project

### Empty Sitemap

- The service account needs access to Drive files
- Share files/folders with the service account email
- Check that the service account has "Viewer" permission

### Rate Limit (429)

- Wait for the time specified in the `Retry-After` header
- Reduce the frequency of sitemap requests
- Check Google Cloud Console quotas

## License

ISC

## Documentation

For detailed setup and usage instructions, see:

- [Quick Start Guide](specs/001-drive-proxy-adapter/quickstart.md)
- [Feature Specification](specs/001-drive-proxy-adapter/spec.md)
- [Implementation Plan](specs/001-drive-proxy-adapter/plan.md)
- [Data Model](specs/001-drive-proxy-adapter/data-model.md)