Initial Version of sitemap.xml spec
specs/001-drive-proxy-adapter/quickstart.md
# Quickstart Guide: Google Drive HTTP Proxy Adapter

**Feature**: 001-drive-proxy-adapter
**Date**: 2026-03-07
**Version**: 1.0.0

---

## Overview

The Google Drive HTTP Proxy Adapter is a Node.js application that generates XML sitemaps of Google Drive documents. It exposes a single HTTP endpoint (`/sitemap.xml`) that queries the Google Drive API and returns a sitemap listing every accessible document as a RESTful link of the form `{BASE_URL}/documents/{id}`.

**Key Features**:
- Service Account authentication (JWT-based, no user interaction)
- Sitemap protocol compliant (50,000 URL limit enforced)
- FIFO request queuing (sequential processing)
- Configurable Drive API filters
- Plain text logging to stdout/stderr

---

## Prerequisites

1. **Node.js**: v18.0.0 or later (LTS version recommended)
2. **Google Cloud Project**: With Drive API enabled
3. **Service Account**: JSON key file with Drive API access
4. **Network Access**: Connectivity to googleapis.com

---

## Installation

### 1. Clone Repository

```bash
git clone <repository-url>
cd google-drive-content-adapter
```

### 2. Install Dependencies

```bash
npm install
```

**Dependencies**:
- `googleapis@^140.0.0` - Official Google API client for Node.js

---

## Configuration

### 1. Service Account Setup

**Create Service Account** (Google Cloud Console):
1. Navigate to [IAM & Admin > Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccounts)
2. Click "Create Service Account"
3. Name: `drive-sitemap-adapter` (or your choice)
4. Grant role: None required if accessing the service account's own Drive
5. Click "Create Key" → Choose JSON format → Download key file

**Enable Drive API**:
1. Navigate to [APIs & Services > Library](https://console.cloud.google.com/apis/library)
2. Search for "Google Drive API"
3. Click "Enable"

**Grant Access** (if accessing user drives):
- Share Drive folders/files with the Service Account email (`xxx@project.iam.gserviceaccount.com`)
- OR configure domain-wide delegation (for Google Workspace, formerly G Suite, organizations)

---

### 2. Environment Variables

Create a `.env` file in the project root (or set the environment variables directly):

```bash
# REQUIRED: Service Account credentials (inline JSON)
GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account","project_id":"your-project","private_key_id":"...","private_key":"-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n","client_email":"xxx@project.iam.gserviceaccount.com","client_id":"...","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_x509_cert_url":"..."}'

# OPTIONAL: Server configuration
PORT=3000                           # Default: 3000
BASE_URL=http://localhost:3000      # Default: http://localhost:3000

# OPTIONAL: Drive API query filter
DRIVE_QUERY="trashed = false"       # Default: "trashed = false"
```

**Important Notes**:
- `GOOGLE_SERVICE_ACCOUNT_KEY` must be a single-line JSON string (newlines inside `private_key` must be escaped as `\n`)
- `BASE_URL` should match your production domain, because it is the prefix of every sitemap URL
- `DRIVE_QUERY` supports Drive API query syntax ([docs](https://developers.google.com/drive/api/guides/search-files))

---

### 3. Configuration Files

**config/config.js**: Server settings (auto-generated from env vars)
```javascript
export default {
  server: {
    // Environment variables are strings, so convert PORT to a number.
    port: Number(process.env.PORT) || 3000,
    baseUrl: process.env.BASE_URL || 'http://localhost:3000'
  }
};
```

**config/settings.js**: Drive API configuration
```javascript
export default {
  drive: {
    query: process.env.DRIVE_QUERY || "trashed = false",
    // nextPageToken must be requested, otherwise pagination stops after the first page.
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    pageSize: 1000,
    scope: 'https://www.googleapis.com/auth/drive.readonly'
  }
};
```

**To customize Drive API filter**, edit `config/settings.js` or set `DRIVE_QUERY` env var.
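
A minimal sketch of how these settings might drive a paginated `files.list` call. It assumes `drive` is a Drive v3 client such as the one returned by `google.drive({ version: 'v3', auth })` in the `googleapis` package. The function name `listAllFiles` and the `SITEMAP_LIMIT_EXCEEDED` error code are hypothetical and not taken from the adapter's source.

```javascript
// Sitemap protocol limit: at most 50,000 URLs per sitemap file.
const MAX_SITEMAP_URLS = 50000;

// Collects every file matching settings.query, following nextPageToken
// until the Drive API reports no further pages.
async function listAllFiles(drive, settings, limit = MAX_SITEMAP_URLS) {
  // The page token is only returned if it is part of the `fields` mask.
  const fields = settings.fields.includes('nextPageToken')
    ? settings.fields
    : `nextPageToken, ${settings.fields}`;

  const files = [];
  let pageToken;
  do {
    const res = await drive.files.list({
      q: settings.query,
      fields,
      pageSize: settings.pageSize,
      pageToken,
    });
    files.push(...(res.data.files ?? []));

    // Stop early instead of downloading a listing that cannot be served.
    if (files.length > limit) {
      const err = new Error(`More than ${limit} documents matched the query`);
      err.code = 'SITEMAP_LIMIT_EXCEEDED'; // hypothetical error code
      throw err;
    }
    pageToken = res.data.nextPageToken;
  } while (pageToken);

  return files;
}
```

Passing the client in as a parameter keeps the function easy to exercise with a fake client in unit tests.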

---

## Usage

### Start Server (Development)

```bash
npm run dev
```

**Output**:
```
[2026-03-07T10:00:00.000Z] [INFO] Server configuration loaded: port=3000, baseUrl=http://localhost:3000
[2026-03-07T10:00:00.100Z] [INFO] Service Account authenticated: xxx***@project.iam.gserviceaccount.com
[2026-03-07T10:00:00.200Z] [INFO] HTTP server listening on port 3000
```
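
The log lines above follow a `[timestamp] [LEVEL] message` layout. The sketch below shows one way such lines could be produced. The function names and the masking rule (keep the first three characters of the local part) are assumptions for illustration, not the adapter's actual `logger.js`.

```javascript
// Builds one plain-text log line: [ISO timestamp] [LEVEL] message
function formatLogLine(level, message, now = new Date()) {
  return `[${now.toISOString()}] [${level}] ${message}`;
}

// Sends ERROR lines to stderr and everything else to stdout.
function log(level, message) {
  const line = formatLogLine(level, message);
  if (level === 'ERROR') {
    console.error(line);
  } else {
    console.log(line);
  }
}

// Hides most of the local part of an email address before logging it.
function maskEmail(email) {
  const at = email.indexOf('@');
  if (at < 0) return '***';
  return `${email.slice(0, Math.min(3, at))}***${email.slice(at)}`;
}
```

For example, `log('INFO', 'Service Account authenticated: ' + maskEmail(clientEmail))` would write a line in the same shape as the second entry of the sample output.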

---

### Start Server (Production)

```bash
npm start
```

---

### Request Sitemap

**Using curl**:
```bash
curl http://localhost:3000/sitemap.xml
```

**Expected Response** (200 OK):
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost:3000/documents/1A2B3C4D5E6F7G8H</loc>
    <lastmod>2026-03-07</lastmod>
  </url>
  <url>
    <loc>http://localhost:3000/documents/9I0J1K2L3M4N5O6P</loc>
    <lastmod>2026-03-05</lastmod>
  </url>
</urlset>
```
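
A response of this shape could be assembled from Drive file metadata roughly as follows. This is a sketch under stated assumptions: `buildSitemap` and `escapeXml` are hypothetical names, each file is assumed to carry the `id` and `modifiedTime` fields requested in `config/settings.js`, and `<lastmod>` is assumed to be the UTC date portion of `modifiedTime`.

```javascript
// Escapes the five characters that are special in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Turns a list of Drive files into a sitemap document string.
function buildSitemap(files, baseUrl) {
  const root = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  const entries = files.map((file) => {
    const loc = `${root}/documents/${encodeURIComponent(file.id)}`;
    // Keep only the YYYY-MM-DD part of the ISO timestamp (UTC).
    const lastmod = new Date(file.modifiedTime).toISOString().slice(0, 10);
    return [
      '  <url>',
      `    <loc>${escapeXml(loc)}</loc>`,
      `    <lastmod>${lastmod}</lastmod>`,
      '  </url>',
    ].join('\n');
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}
```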

---

## Testing

### Run All Tests

```bash
npm test
```

**Test Suites**:
- `tests/unit/` - Unit tests for Drive client, auth, sitemap generator, queue
- `tests/integration/` - End-to-end endpoint tests for /sitemap.xml
- `tests/contract/` - XML sitemap schema validation tests

---

### Run Specific Test Suite

```bash
npm run test:unit          # Unit tests only
npm run test:integration   # Integration tests only
npm run test:contract      # Contract tests only
```

---

## API Reference

### Endpoint: `GET /sitemap.xml`

**Description**: Generate XML sitemap of all accessible Google Drive documents.

**Request**:
```http
GET /sitemap.xml HTTP/1.1
Host: example.com
```

**Success Response** (200 OK):
```http
HTTP/1.1 200 OK
Content-Type: application/xml; charset=utf-8
Content-Length: {size}

```

**Error Responses**:
- `404 Not Found` - Invalid endpoint (only /sitemap.xml supported)
- `413 Payload Too Large` - More than 50,000 documents in Drive
- `429 Too Many Requests` - Rate limit exceeded (includes `Retry-After` header)
- `401 Unauthorized` - Authentication failed
- `503 Service Unavailable` - Drive API unavailable
- `500 Internal Server Error` - Unexpected error

**Note**: All error responses have an **empty body** (status code only).

See [contracts/sitemap-xml-schema.md](./contracts/sitemap-xml-schema.md) for full API contract.
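
The error contract above could be implemented with a small mapping step, sketched here. It assumes the upstream HTTP status is exposed on the thrown error as `err.status` or `err.response.status`, and that the 50,000-document check throws an error tagged with a hypothetical `SITEMAP_LIMIT_EXCEEDED` code. The real classification rules may differ, so treat the contract document as authoritative.

```javascript
// Maps an internal or upstream error to one of the documented status codes.
function statusForError(err) {
  if (err && err.code === 'SITEMAP_LIMIT_EXCEEDED') return 413; // hypothetical code
  const upstream = Number(err?.status ?? err?.response?.status);
  if (upstream === 401) return 401;
  if (upstream === 429) return 429;
  if (upstream === 503) return 503;
  return 500; // anything unrecognised is an unexpected error
}

// Writes a status-only response: no body, optional Retry-After for 429.
// `res` is expected to behave like Node's http.ServerResponse.
function sendError(res, status, retryAfterSeconds) {
  const headers = { 'Content-Length': 0 };
  if (status === 429 && retryAfterSeconds != null) {
    headers['Retry-After'] = String(retryAfterSeconds);
  }
  res.writeHead(status, headers);
  res.end();
}
```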

---

## Architecture

### Project Structure

```
google-drive-content-adapter/
├── src/
│   ├── server.js             # HTTP server entry point
│   ├── proxy.js              # Monolithic route handler (sitemap logic)
│   ├── logger.js             # Logging module (console.js alias)
│   ├── auth.js               # Service Account JWT authentication
│   └── xml-utils.js          # XML generation utilities
├── config/
│   ├── config.js             # Server configuration (port, baseUrl)
│   └── settings.js           # Drive API filter configuration
├── tests/
│   ├── unit/                 # Unit tests
│   ├── integration/          # Integration tests
│   └── contract/             # Contract tests
├── specs/                    # Feature specifications and planning docs
│   └── 001-drive-proxy-adapter/
│       ├── spec.md
│       ├── plan.md
│       ├── research.md
│       ├── data-model.md
│       ├── quickstart.md (this file)
│       └── contracts/
│           └── sitemap-xml-schema.md
├── package.json
└── README.md
```

---

### Request Flow

```
1. Client → GET /sitemap.xml
2. Server → Create RequestContext (ID, timestamp)
3. Server → Enqueue request (FIFO queue)
4. Queue → Process request (sequential, one at a time)
5. Proxy → Authenticate with Service Account JWT
6. Proxy → Query Drive API files.list() (paginate if >1000 docs)
7. Proxy → Check count ≤ 50,000
8. Proxy → Transform Documents to SitemapEntries
9. Proxy → Generate XML sitemap
10. Server → Return 200 + XML (or error status)
11. Queue → Process next request
```
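
Steps 3, 4, and 11 describe a FIFO queue that runs one request at a time. One compact way to get that behaviour is to chain each task onto the previous one with promises, as sketched below. The class name `RequestQueue` is hypothetical, and the adapter's actual queue may be implemented differently.

```javascript
// Runs async tasks strictly one after another, in the order they were enqueued.
class RequestQueue {
  constructor() {
    // Tail of the chain: resolves when the most recently enqueued task settles.
    this.tail = Promise.resolve();
  }

  // `task` is a function returning a promise. The returned promise settles
  // with the task's own result or error.
  enqueue(task) {
    const run = this.tail.then(() => task());
    // Swallow the error on the chain only, so one failed request
    // does not block the requests queued behind it.
    this.tail = run.catch(() => {});
    return run;
  }
}
```

A request handler would then wrap its work as `queue.enqueue(() => handleSitemapRequest(req, res))`, where `handleSitemapRequest` stands in for steps 5 to 10.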

---

## Troubleshooting

### 1. Fatal Error: Invalid Service Account Credentials

**Error**:
```
[2026-03-07T10:00:00.000Z] [ERROR] FATAL: Invalid client_email in Service Account credentials
```

**Solution**:
- Check `GOOGLE_SERVICE_ACCOUNT_KEY` env var is valid JSON
- Ensure `client_email` field ends with `.gserviceaccount.com`
- Ensure `private_key` field starts with `-----BEGIN PRIVATE KEY-----`
- Verify no extra escaping/quotes in JSON string
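
The checks in this list can be reproduced locally before starting the server. The sketch below is an illustration only: `parseServiceAccountKey` is a hypothetical helper, and apart from the `client_email` message shown above, the error messages are invented for this example.

```javascript
// Parses and sanity-checks the inline Service Account JSON.
// Throws an Error describing the first problem it finds.
function parseServiceAccountKey(raw) {
  if (!raw) {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not set');
  }
  let key;
  try {
    key = JSON.parse(raw);
  } catch (err) {
    throw new Error(`GOOGLE_SERVICE_ACCOUNT_KEY is not valid JSON: ${err.message}`);
  }
  if (key === null || typeof key !== 'object') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY must be a JSON object');
  }
  if (typeof key.client_email !== 'string' ||
      !key.client_email.endsWith('.gserviceaccount.com')) {
    throw new Error('Invalid client_email in Service Account credentials');
  }
  if (typeof key.private_key !== 'string' ||
      !key.private_key.startsWith('-----BEGIN PRIVATE KEY-----')) {
    throw new Error('Invalid private_key in Service Account credentials');
  }
  return key;
}
```

Calling `parseServiceAccountKey(process.env.GOOGLE_SERVICE_ACCOUNT_KEY)` in a one-off script should point at the offending field without printing the key itself.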

---

### 2. Fatal Error: Port Already in Use

**Error**:
```
[2026-03-07T10:00:00.000Z] [ERROR] FATAL: Unable to bind to port 3000 (EADDRINUSE)
```

**Solution**:
- Change `PORT` env var to different port (e.g., 8080)
- OR stop other process using port 3000: `lsof -ti:3000 | xargs kill`

---

### 3. 401 Unauthorized Response

**Cause**: Service Account token refresh failed

**Solution**:
- Verify Service Account has Drive API access (share folders with service account email)
- Check Drive API is enabled in Google Cloud Console
- Ensure scope is correct: `https://www.googleapis.com/auth/drive.readonly`

---

### 4. 413 Payload Too Large Response

**Cause**: Google Drive contains more than 50,000 documents

**Solution**:
- Adjust `DRIVE_QUERY` to filter documents (e.g., by folder, date, file type)
- Example: `DRIVE_QUERY="'folder-id' in parents and trashed = false"`

---

### 5. 429 Too Many Requests Response

**Cause**: Drive API rate limit exceeded

**Solution**:
- Wait for time specified in `Retry-After` response header (seconds)
- Reduce request frequency
- Consider Drive API quota limits ([docs](https://developers.google.com/drive/api/guides/limits))
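
On the client side, honouring `Retry-After` might look like the sketch below. It assumes the header carries a number of seconds, as stated above, and relies on the global `fetch` available in Node.js 18+. The function names, the retry count, and the one-second fallback are illustrative choices, not part of the adapter.

```javascript
// Converts a Retry-After value in seconds to milliseconds.
// Falls back to `fallbackMs` when the header is missing or not a number.
function retryAfterMs(headerValue, fallbackMs = 1000) {
  if (headerValue == null || headerValue === '') return fallbackMs;
  const seconds = Number(headerValue);
  if (!Number.isFinite(seconds) || seconds < 0) return fallbackMs;
  return seconds * 1000;
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Requests the sitemap and waits as instructed whenever the server answers 429.
// Returns the last response received, whatever its status.
async function fetchSitemapWithRetry(
  url,
  { attempts = 3, fetchImpl = fetch, sleep = defaultSleep } = {}
) {
  let res;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    res = await fetchImpl(url);
    if (res.status !== 429 || attempt === attempts) break;
    await sleep(retryAfterMs(res.headers.get('retry-after')));
  }
  return res;
}
```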

---

### 6. 503 Service Unavailable Response

**Cause**: Google Drive API is temporarily unavailable

**Solution**:
- Wait and retry manually (no automatic retries per spec)
- Check [Google Workspace Status Dashboard](https://www.google.com/appsstatus)

---

## Performance Tips

### 1. Optimize Drive Query Filter

**Default** (all files):
```bash
DRIVE_QUERY="trashed = false"
```

**Filter by folder**:
```bash
DRIVE_QUERY="'folder-id' in parents and trashed = false"
```

**Filter by date**:
```bash
DRIVE_QUERY="modifiedTime > '2026-01-01T00:00:00' and trashed = false"
```

**Filter by MIME type**:
```bash
DRIVE_QUERY="mimeType = 'application/pdf' and trashed = false"
```

See [Drive API search query syntax](https://developers.google.com/drive/api/guides/search-files) for more options.
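
When the filter is assembled in code rather than typed by hand, a small builder helps keep quoting consistent. The sketch below is hypothetical (`buildDriveQuery` is not part of the adapter). It escapes single quotes with a backslash as the Drive search documentation describes; escaping backslashes as well is a defensive assumption.

```javascript
// Wraps a value in single quotes for use in a Drive `q` expression.
// Backslashes are escaped first so the escapes added for quotes are not doubled.
function quote(value) {
  return `'${String(value).replace(/\\/g, '\\\\').replace(/'/g, "\\'")}'`;
}

// Combines the optional filters with "and" and always excludes trashed files.
function buildDriveQuery({ folderId, modifiedAfter, mimeType } = {}) {
  const clauses = [];
  if (folderId) clauses.push(`${quote(folderId)} in parents`);
  if (modifiedAfter) clauses.push(`modifiedTime > ${quote(modifiedAfter)}`);
  if (mimeType) clauses.push(`mimeType = ${quote(mimeType)}`);
  clauses.push('trashed = false');
  return clauses.join(' and ');
}
```

The resulting string can be exported as `DRIVE_QUERY` before the server starts.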

---

### 2. Adjust BASE_URL for Production

**Development**:
```
BASE_URL=http://localhost:3000
```

**Production**:
```
BASE_URL=https://your-domain.com
```

This ensures sitemap URLs point to the correct domain.

---

### 3. Monitor Memory Usage

**Check memory usage** (production):
```bash
node --inspect src/server.js
# Open chrome://inspect in Chrome DevTools
```

**Expected**: <256MB under normal load (<10 concurrent requests)
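
As a lighter-weight alternative to attaching the inspector, the process can report its own resident set size using Node's built-in `process.memoryUsage()`. The sketch below is an assumption-laden illustration: the 256 MB budget mirrors the expectation above, while `checkMemory`, the `[WARN]` level, and the message text are hypothetical.

```javascript
const MEMORY_BUDGET_MB = 256;

// Converts the resident set size reported by Node.js to whole megabytes.
function rssMegabytes(usage = process.memoryUsage()) {
  return Math.round(usage.rss / (1024 * 1024));
}

// Logs a warning when the process exceeds the budget and returns the RSS in MB.
function checkMemory(log = console.error, usage = process.memoryUsage()) {
  const rss = rssMegabytes(usage);
  if (rss > MEMORY_BUDGET_MB) {
    log(`[WARN] Memory usage ${rss}MB exceeds expected ${MEMORY_BUDGET_MB}MB`);
  }
  return rss;
}
```

A server could call `checkMemory()` from a `setInterval` timer (with `.unref()` so the timer does not keep the process alive).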

---

## Security Best Practices

1. **Never commit** Service Account JSON key file to version control
2. **Use environment variables** for all sensitive configuration
3. **Restrict Service Account permissions** to minimum required (readonly scope)
4. **Monitor logs** for unauthorized access attempts
5. **Use HTTPS** in production (configure reverse proxy like nginx)
6. **Filter credentials from logs** (private_key field never logged)
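
Item 6 can be enforced by redacting the credentials object before it ever reaches a log call. The sketch below is illustrative only: `redactCredentials` is a hypothetical helper, and treating `private_key_id` as sensitive in addition to `private_key` is an assumption.

```javascript
// Fields that must never appear in log output.
const SENSITIVE_FIELDS = new Set(['private_key', 'private_key_id']);

// Returns a shallow copy of the credentials with sensitive values replaced.
// The original object is left untouched.
function redactCredentials(credentials) {
  return Object.fromEntries(
    Object.entries(credentials).map(([field, value]) => [
      field,
      SENSITIVE_FIELDS.has(field) ? '[REDACTED]' : value,
    ])
  );
}
```

Logging `JSON.stringify(redactCredentials(key))` instead of the raw key keeps diagnostic fields such as `project_id` visible without exposing the private key.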

---

## Deployment

### Docker (Recommended)

**Dockerfile**:
```dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["npm", "start"]
```

**Build and run**:
```bash
docker build -t drive-sitemap-adapter .
docker run -p 3000:3000 \
  -e GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account",...}' \
  -e BASE_URL=https://your-domain.com \
  drive-sitemap-adapter
```

---

### Cloud Platforms

**Google Cloud Run**:
```bash
gcloud run deploy drive-sitemap-adapter \
  --source . \
  --set-env-vars BASE_URL=https://your-domain.com \
  --set-secrets GOOGLE_SERVICE_ACCOUNT_KEY=service-account-key:latest
```

**AWS ECS / Fargate**: Use environment variables in task definition

**Heroku**: Set environment variables via Heroku CLI or dashboard

---

## Additional Resources

- **Feature Specification**: [specs/001-drive-proxy-adapter/spec.md](./spec.md)
- **Implementation Plan**: [specs/001-drive-proxy-adapter/plan.md](./plan.md)
- **Research Document**: [specs/001-drive-proxy-adapter/research.md](./research.md)
- **Data Model**: [specs/001-drive-proxy-adapter/data-model.md](./data-model.md)
- **API Contract**: [specs/001-drive-proxy-adapter/contracts/sitemap-xml-schema.md](./contracts/sitemap-xml-schema.md)
- **Google Drive API Docs**: [https://developers.google.com/drive/api/v3/reference](https://developers.google.com/drive/api/v3/reference)
- **Sitemap Protocol**: [https://www.sitemaps.org/protocol.html](https://www.sitemaps.org/protocol.html)

---

## Support

For issues or questions, refer to:
1. This quickstart guide
2. Feature specification (spec.md) for requirements
3. Research document (research.md) for technical decisions
4. Contract documentation (contracts/) for API details

---

## Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2026-03-07 | Initial quickstart guide |