Added new feature for document export

2026-03-10 16:25:05 -05:00
parent d477367256
commit 2acb04ad76
11 changed files with 0 additions and 0 deletions

# Quickstart Guide: Google Drive HTTP Proxy Adapter
- **Feature**: 001-drive-proxy-adapter
- **Date**: 2026-03-07
- **Version**: 1.0.0

---
## Overview
The Google Drive HTTP Proxy Adapter is a Node.js application that generates XML sitemaps of Google Drive documents. It exposes a single HTTP endpoint (`/sitemap.xml`) that queries the Google Drive API and returns a sitemap listing every accessible document, with each link in the RESTful form `{BASE_URL}/documents/{documentId}`.
**Key Features**:
- Service Account authentication (JWT-based, no user interaction)
- Sitemap protocol compliant (50,000 URL limit enforced)
- FIFO request queuing (sequential processing)
- Configurable Drive API filters
- Plain text logging to stdout/stderr
---
## Prerequisites
1. **Node.js**: v18.0.0 or later (LTS version recommended)
2. **Google Cloud Project**: With Drive API enabled
3. **Service Account**: JSON key file with Drive API access
4. **Network Access**: Connectivity to googleapis.com
---
## Installation
### 1. Clone Repository
```bash
git clone <repository-url>
cd google-drive-content-adapter
```
### 2. Install Dependencies
```bash
npm install
```
**Dependencies**:
- `googleapis@^140.0.0` - Official Google API client for Node.js
---
## Configuration
### 1. Service Account Setup
**Create Service Account** (Google Cloud Console):
1. Navigate to [IAM & Admin > Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccounts)
2. Click "Create Service Account"
3. Name: `drive-sitemap-adapter` (or your choice)
4. Grant role: None required if accessing service account's own Drive
5. Click "Create Key" → Choose JSON format → Download key file
**Enable Drive API**:
1. Navigate to [APIs & Services > Library](https://console.cloud.google.com/apis/library)
2. Search for "Google Drive API"
3. Click "Enable"
**Grant Access** (if accessing user drives):
- Share Drive folders/files with Service Account email (`xxx@project.iam.gserviceaccount.com`)
- OR configure domain-wide delegation (for Google Workspace organizations, formerly G Suite)
---
### 2. Environment Variables
Create `.env` file in project root (or set environment variables):
```bash
# REQUIRED: Service Account credentials (inline JSON)
GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account","project_id":"your-project","private_key_id":"...","private_key":"-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n","client_email":"xxx@project.iam.gserviceaccount.com","client_id":"...","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_x509_cert_url":"..."}'
# OPTIONAL: Server configuration
PORT=3000 # Default: 3000
BASE_URL=http://localhost:3000 # Default: http://localhost:3000
# OPTIONAL: Drive API query filter
DRIVE_QUERY="trashed = false" # Default: "trashed = false"
```
**Important Notes**:
- `GOOGLE_SERVICE_ACCOUNT_KEY` must be a single-line JSON string; keep the newlines inside `private_key` as escaped `\n` sequences rather than literal line breaks
- `BASE_URL` should match your production domain for sitemap URLs
- `DRIVE_QUERY` supports Drive API query syntax ([docs](https://developers.google.com/drive/api/guides/search-files))
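
One way to produce the single-line value from a downloaded key file is to parse and re-serialize it. The sketch below is illustrative only: `toSingleLineKey` is a hypothetical helper that is not part of this project, and the key shown is a shortened fake. Re-serializing with `JSON.stringify` keeps the newlines in `private_key` as `\n` escapes, which is the form the note above asks for.

```javascript
// Hypothetical helper (not part of the adapter): collapse the contents of a
// downloaded key file into the single-line form used by GOOGLE_SERVICE_ACCOUNT_KEY.
function toSingleLineKey(keyFileText) {
  // JSON.parse turns the "\n" escapes in private_key into real newlines;
  // JSON.stringify without an indent argument writes them back as "\n"
  // and emits the whole object on one line.
  return JSON.stringify(JSON.parse(keyFileText));
}

// Example with a shortened, fake key (pretty-printed, as downloaded).
const prettyKey = `{
  "type": "service_account",
  "private_key": "-----BEGIN PRIVATE KEY-----\\nabc\\n-----END PRIVATE KEY-----\\n"
}`;

console.log(toSingleLineKey(prettyKey));
```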
---
### 3. Configuration Files
**config/config.js**: Server settings (read from environment variables, with defaults)
```javascript
export default {
  server: {
    port: process.env.PORT || 3000,
    baseUrl: process.env.BASE_URL || 'http://localhost:3000'
  }
};
**config/settings.js**: Drive API configuration
```javascript
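export default {
  drive: {
    query: process.env.DRIVE_QUERY || "trashed = false",
    // nextPageToken must be requested explicitly, otherwise pagination
    // beyond the first page cannot work when a fields mask is set.
    fields: 'nextPageToken, files(id, name, mimeType, modifiedTime)',
    pageSize: 1000,
    scope: 'https://www.googleapis.com/auth/drive.readonly'
  }
};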
**To customize Drive API filter**, edit `config/settings.js` or set `DRIVE_QUERY` env var.
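
To illustrate how these settings could drive a paginated `files.list()` call, here is a minimal sketch, not the adapter's actual code. It assumes `drive` is a Drive v3 client (for example one created with `google.drive({ version: 'v3', auth })` from `googleapis`) and that `settings` has the shape of the `drive` object above. The `listAllFiles` name, the `maxFiles` parameter, and the `SITEMAP_LIMIT_EXCEEDED` error code are hypothetical.

```javascript
// Sketch: collect every page of files.list() results.
// `drive` is assumed to be a Drive v3 client; `settings` mirrors the
// `drive` object in config/settings.js. Names here are illustrative.
async function listAllFiles(drive, settings, maxFiles = 50000) {
  const files = [];
  let pageToken;
  // When a fields mask is set, nextPageToken is only returned if requested.
  const fields = settings.fields.includes('nextPageToken')
    ? settings.fields
    : `nextPageToken, ${settings.fields}`;
  do {
    const res = await drive.files.list({
      q: settings.query,
      fields,
      pageSize: settings.pageSize,
      pageToken,
    });
    files.push(...(res.data.files || []));
    if (files.length > maxFiles) {
      // Hypothetical error shape; a caller could map this to 413.
      const err = new Error('Too many documents for a single sitemap');
      err.code = 'SITEMAP_LIMIT_EXCEEDED';
      throw err;
    }
    pageToken = res.data.nextPageToken;
  } while (pageToken);
  return files;
}
```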
---
## Usage
### Start Server (Development)
```bash
npm run dev
```
**Output**:
```
[2026-03-07T10:00:00.000Z] [INFO] Server configuration loaded: port=3000, baseUrl=http://localhost:3000
[2026-03-07T10:00:00.100Z] [INFO] Service Account authenticated: xxx***@project.iam.gserviceaccount.com
[2026-03-07T10:00:00.200Z] [INFO] HTTP server listening on port 3000
```
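
The log lines above follow a `[timestamp] [LEVEL] message` pattern. A minimal sketch of a formatter that produces this shape is shown below. `formatLogLine` and `log` are hypothetical names, and routing `ERROR` to stderr is an assumption based on the "stdout/stderr" feature description rather than something this guide states explicitly.

```javascript
// Sketch of the plain text log format shown above (names are illustrative).
function formatLogLine(level, message, now = new Date()) {
  // toISOString() yields e.g. 2026-03-07T10:00:00.000Z
  return `[${now.toISOString()}] [${level}] ${message}`;
}

function log(level, message) {
  // Assumption: errors go to stderr, everything else to stdout.
  const stream = level === 'ERROR' ? process.stderr : process.stdout;
  stream.write(formatLogLine(level, message) + '\n');
}

log('INFO', 'HTTP server listening on port 3000');
```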
---
### Start Server (Production)
```bash
npm start
```
---
### Request Sitemap
**Using curl**:
```bash
curl http://localhost:3000/sitemap.xml
```
**Expected Response** (200 OK):
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost:3000/documents/1A2B3C4D5E6F7G8H</loc>
    <lastmod>2026-03-07</lastmod>
  </url>
  <url>
    <loc>http://localhost:3000/documents/9I0J1K2L3M4N5O6P</loc>
    <lastmod>2026-03-05</lastmod>
  </url>
</urlset>
```
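
For orientation, the sketch below shows one way a response of this shape could be generated from Drive file metadata. It is not the project's `xml-utils.js`. `escapeXml` and `buildSitemap` are hypothetical names, and the input is assumed to be an array of objects with `id` and `modifiedTime` (an RFC 3339 timestamp, as returned by the Drive API).

```javascript
// Sketch: build a sitemap document from Drive file metadata.
// Function names and the input shape are assumptions for illustration.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

function buildSitemap(files, baseUrl) {
  const entries = files.map((file) => {
    const loc = `${baseUrl}/documents/${encodeURIComponent(file.id)}`;
    // Keep only the YYYY-MM-DD part of the RFC 3339 timestamp.
    const lastmod = file.modifiedTime.slice(0, 10);
    return [
      '  <url>',
      `    <loc>${escapeXml(loc)}</loc>`,
      `    <lastmod>${lastmod}</lastmod>`,
      '  </url>',
    ].join('\n');
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}

console.log(
  buildSitemap(
    [{ id: '1A2B3C4D5E6F7G8H', modifiedTime: '2026-03-07T09:15:00.000Z' }],
    'http://localhost:3000'
  )
);
```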
---
## Testing
### Run All Tests
```bash
npm test
```
**Test Suites**:
- `tests/unit/` - Unit tests for Drive client, auth, sitemap generator, queue
- `tests/integration/` - End-to-end endpoint tests for /sitemap.xml
- `tests/contract/` - XML sitemap schema validation tests
---
### Run Specific Test Suite
```bash
npm run test:unit # Unit tests only
npm run test:integration # Integration tests only
npm run test:contract # Contract tests only
```
---
## API Reference
### Endpoint: `GET /sitemap.xml`
**Description**: Generate XML sitemap of all accessible Google Drive documents.
**Request**:
```http
GET /sitemap.xml HTTP/1.1
Host: example.com
```
**Success Response** (200 OK):
```http
HTTP/1.1 200 OK
Content-Type: application/xml; charset=utf-8
Content-Length: {size}
```
**Error Responses**:
- `404 Not Found` - Invalid endpoint (only /sitemap.xml supported)
- `413 Payload Too Large` - More than 50,000 documents match the Drive query
- `429 Too Many Requests` - Rate limit exceeded (includes `Retry-After` header)
- `401 Unauthorized` - Authentication failed
- `503 Service Unavailable` - Drive API unavailable
- `500 Internal Server Error` - Unexpected error
**Note**: All error responses have **empty body** (status code only).
See [contracts/sitemap-xml-schema.md](./contracts/sitemap-xml-schema.md) for full API contract.
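
The status codes above could be produced by a small mapping step like the one sketched here. This is an assumption about structure, not the adapter's actual logic: `statusForError` and `sendError` are hypothetical, the `SITEMAP_LIMIT_EXCEEDED` code is invented for illustration, and the upstream HTTP status is assumed to be available as `err.response.status`. `res` is a standard Node.js `http.ServerResponse`.

```javascript
// Sketch: map a failure to one of the documented status codes.
// Error shapes here are assumptions, not the adapter's real types.
function statusForError(err) {
  if (err && err.code === 'SITEMAP_LIMIT_EXCEEDED') return 413;
  const upstream = err && err.response ? err.response.status : undefined;
  if (upstream === 401) return 401;
  if (upstream === 429) return 429;
  if (upstream >= 500) return 503; // Drive API unavailable
  return 500; // anything unexpected
}

// Send a status-only response with an empty body.
function sendError(res, status, retryAfterSeconds) {
  const headers = { 'Content-Length': '0' };
  if (status === 429 && retryAfterSeconds !== undefined) {
    headers['Retry-After'] = String(retryAfterSeconds);
  }
  res.writeHead(status, headers);
  res.end();
}
```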
---
## Architecture
### Project Structure
```
google-drive-content-adapter/
├── src/
│   ├── server.js       # HTTP server entry point
│   ├── proxy.js        # Monolithic route handler (sitemap logic)
│   ├── logger.js       # Logging module (console.js alias)
│   ├── auth.js         # Service Account JWT authentication
│   └── xml-utils.js    # XML generation utilities
├── config/
│   ├── config.js       # Server configuration (port, baseUrl)
│   └── settings.js     # Drive API filter configuration
├── tests/
│   ├── unit/           # Unit tests
│   ├── integration/    # Integration tests
│   └── contract/       # Contract tests
├── specs/              # Feature specifications and planning docs
│   └── 001-drive-proxy-adapter/
│       ├── spec.md
│       ├── plan.md
│       ├── research.md
│       ├── data-model.md
│       ├── quickstart.md (this file)
│       └── contracts/
│           └── sitemap-xml-schema.md
├── package.json
└── README.md
```
---
### Request Flow
```
1. Client → GET /sitemap.xml
2. Server → Create RequestContext (ID, timestamp)
3. Server → Enqueue request (FIFO queue)
4. Queue → Process request (sequential, one at a time)
5. Proxy → Authenticate with Service Account JWT
6. Proxy → Query Drive API files.list() (paginate if >1000 docs)
7. Proxy → Check count ≤ 50,000
8. Proxy → Transform Documents to SitemapEntries
9. Proxy → Generate XML sitemap
10. Server → Return 200 + XML (or error status)
11. Queue → Process next request
```
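
Steps 3, 4, and 11 describe a FIFO queue that processes one request at a time. One common way to get that behavior in Node.js is to chain each task onto the previous task's promise, as in the sketch below. `FifoQueue` is a hypothetical name and this is not necessarily how the adapter implements its queue. A failed task rejects for its own caller but does not block later tasks.

```javascript
// Sketch: sequential FIFO processing by chaining promises.
class FifoQueue {
  constructor() {
    // Resolves when the most recently enqueued task has settled.
    this.tail = Promise.resolve();
  }

  // `task` is a function returning a promise; tasks run strictly in
  // the order they were enqueued, never concurrently.
  enqueue(task) {
    const run = this.tail.then(() => task());
    // Swallow errors on the chain so one failure does not stall the queue;
    // the caller still sees the rejection through `run`.
    this.tail = run.catch(() => {});
    return run;
  }
}

const queue = new FifoQueue();
queue.enqueue(async () => 'first').then(console.log);
queue.enqueue(async () => 'second').then(console.log);
```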
---
## Troubleshooting
### 1. Fatal Error: Invalid Service Account Credentials
**Error**:
```
[2026-03-07T10:00:00.000Z] [ERROR] FATAL: Invalid client_email in Service Account credentials
```
**Solution**:
- Check `GOOGLE_SERVICE_ACCOUNT_KEY` env var is valid JSON
- Ensure `client_email` field ends with `.gserviceaccount.com`
- Ensure `private_key` field starts with `-----BEGIN PRIVATE KEY-----`
- Verify no extra escaping/quotes in JSON string
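
The checks listed above can be run locally before starting the server. The following sketch mirrors them. `validateServiceAccountKey` is a hypothetical helper, and its error messages are illustrative apart from the `client_email` message, which echoes the log line shown above.

```javascript
// Sketch: pre-flight validation of GOOGLE_SERVICE_ACCOUNT_KEY.
// Mirrors the checklist above; not the adapter's actual startup code.
function validateServiceAccountKey(raw) {
  let key;
  try {
    key = JSON.parse(raw);
  } catch {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY is not valid JSON');
  }
  if (key === null || typeof key !== 'object') {
    throw new Error('GOOGLE_SERVICE_ACCOUNT_KEY must be a JSON object');
  }
  if (typeof key.client_email !== 'string' ||
      !key.client_email.endsWith('.gserviceaccount.com')) {
    throw new Error('Invalid client_email in Service Account credentials');
  }
  if (typeof key.private_key !== 'string' ||
      !key.private_key.startsWith('-----BEGIN PRIVATE KEY-----')) {
    throw new Error('Invalid private_key in Service Account credentials');
  }
  return key;
}

// Usage: validate the value from the environment, if one is set.
if (process.env.GOOGLE_SERVICE_ACCOUNT_KEY) {
  const key = validateServiceAccountKey(process.env.GOOGLE_SERVICE_ACCOUNT_KEY);
  console.log(`Credentials look well-formed for project ${key.project_id}`);
}
```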
---
### 2. Fatal Error: Port Already in Use
**Error**:
```
[2026-03-07T10:00:00.000Z] [ERROR] FATAL: Unable to bind to port 3000 (EADDRINUSE)
```
**Solution**:
- Change `PORT` env var to different port (e.g., 8080)
- OR stop other process using port 3000: `lsof -ti:3000 | xargs kill`
---
### 3. 401 Unauthorized Response
**Cause**: Service Account token refresh failed
**Solution**:
- Verify Service Account has Drive API access (share folders with service account email)
- Check Drive API is enabled in Google Cloud Console
- Ensure scope is correct: `https://www.googleapis.com/auth/drive.readonly`
---
### 4. 413 Payload Too Large Response
**Cause**: More than 50,000 documents match the current `DRIVE_QUERY`
**Solution**:
- Adjust `DRIVE_QUERY` to filter documents (e.g., by folder, date, file type)
- Example: `DRIVE_QUERY="'folder-id' in parents and trashed = false"`
---
### 5. 429 Too Many Requests Response
**Cause**: Drive API rate limit exceeded
**Solution**:
- Wait for time specified in `Retry-After` response header (seconds)
- Reduce request frequency
- Consider Drive API quota limits ([docs](https://developers.google.com/drive/api/guides/limits))
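
On the client side, honoring `Retry-After` might look like the sketch below. It assumes the header carries a number of seconds, as described above, and falls back to a default delay otherwise. `retryDelayMs` and `fetchSitemap` are hypothetical names, and the global `fetch` used here is available in Node.js 18 and later.

```javascript
// Sketch: client-side handling of 429 responses (names are illustrative).
function retryDelayMs(retryAfter, fallbackMs = 1000) {
  if (retryAfter == null || String(retryAfter).trim() === '') return fallbackMs;
  const seconds = Number(retryAfter);
  // Only the seconds form is handled; anything else uses the fallback.
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : fallbackMs;
}

async function fetchSitemap(url, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;
    if (attempt < maxAttempts) {
      const delay = retryDelayMs(res.headers.get('retry-after'));
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error(`Still rate limited after ${maxAttempts} attempts`);
}
```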
---
### 6. 503 Service Unavailable Response
**Cause**: Google Drive API is temporarily unavailable
**Solution**:
- Wait and retry manually (no automatic retries per spec)
- Check [Google Workspace Status Dashboard](https://www.google.com/appsstatus)
---
## Performance Tips
### 1. Optimize Drive Query Filter
**Default** (all files):
```bash
DRIVE_QUERY="trashed = false"
```
**Filter by folder**:
```bash
DRIVE_QUERY="'folder-id' in parents and trashed = false"
```
**Filter by date**:
```bash
DRIVE_QUERY="modifiedTime > '2026-01-01T00:00:00' and trashed = false"
```
**Filter by MIME type**:
```bash
DRIVE_QUERY="mimeType = 'application/pdf' and trashed = false"
```
See [Drive API search query syntax](https://developers.google.com/drive/api/guides/search-files) for more options.
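
If the filters above need to be combined programmatically, a small builder can assemble the clauses. This is only a sketch: `buildDriveQuery` and its option names are hypothetical, and the only escaping applied is the single-quote escaping (`\'`) described in the Drive search documentation.

```javascript
// Sketch: compose a DRIVE_QUERY string from optional filters.
// Option names are illustrative, not part of the adapter.
function escapeQueryValue(value) {
  // Single quotes inside a quoted value are escaped as \'
  return String(value).replace(/'/g, "\\'");
}

function buildDriveQuery({ folderId, mimeType, modifiedAfter } = {}) {
  const clauses = [];
  if (folderId) clauses.push(`'${escapeQueryValue(folderId)}' in parents`);
  if (mimeType) clauses.push(`mimeType = '${escapeQueryValue(mimeType)}'`);
  if (modifiedAfter) clauses.push(`modifiedTime > '${escapeQueryValue(modifiedAfter)}'`);
  clauses.push('trashed = false');
  return clauses.join(' and ');
}

console.log(buildDriveQuery({ folderId: 'folder-id', mimeType: 'application/pdf' }));
```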
---
### 2. Adjust BASE_URL for Production
**Development**:
```
BASE_URL=http://localhost:3000
```
**Production**:
```
BASE_URL=https://your-domain.com
```
This ensures sitemap URLs point to the correct domain.
---
### 3. Monitor Memory Usage
**Check memory usage** (production):
```bash
node --inspect src/server.js
# Open chrome://inspect in Chrome DevTools
```
**Expected**: <256MB under normal load (<10 concurrent requests)
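
As a lighter-weight alternative to attaching the inspector, resident memory can be read from inside the process with `process.memoryUsage()`. The helper below is a hypothetical sketch for comparing the process against the 256MB expectation. It is not part of the adapter.

```javascript
// Sketch: report resident set size in megabytes (one decimal place).
function rssMegabytes(usage = process.memoryUsage()) {
  return Math.round((usage.rss / (1024 * 1024)) * 10) / 10;
}

console.log(`[INFO] Memory usage: rss=${rssMegabytes()}MB`);
```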
---
## Security Best Practices
1. **Never commit** Service Account JSON key file to version control
2. **Use environment variables** for all sensitive configuration
3. **Restrict Service Account permissions** to minimum required (readonly scope)
4. **Monitor logs** for unauthorized access attempts
5. **Use HTTPS** in production (configure reverse proxy like nginx)
6. **Filter credentials from logs** (private_key field never logged)
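
Point 6 can be illustrated with a small redaction step applied before anything credential-related is logged. The masking rule below (first three characters of the local part, then `***`) is an assumption chosen to match the shape of the startup log example in this guide. `maskEmail` and `describeKeyForLog` are hypothetical names.

```javascript
// Sketch: keep secrets out of log output (names are illustrative).
function maskEmail(email) {
  const at = email.indexOf('@');
  if (at < 0) return '***';
  // Keep at most three characters of the local part, plus the domain.
  return `${email.slice(0, Math.min(3, at))}***${email.slice(at)}`;
}

function describeKeyForLog(key) {
  // Copy only fields that are safe to print; private_key is never included.
  return {
    project_id: key.project_id,
    client_email: maskEmail(key.client_email),
  };
}

console.log(describeKeyForLog({
  project_id: 'your-project',
  client_email: 'drive-sitemap-adapter@project.iam.gserviceaccount.com',
  private_key: 'fake-secret-for-illustration',
}));
```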
---
## Deployment
### Docker (Recommended)
**Dockerfile**:
```dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["npm", "start"]
```
**Build and run**:
```bash
docker build -t drive-sitemap-adapter .
docker run -p 3000:3000 \
-e GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account",...}' \
-e BASE_URL=https://your-domain.com \
drive-sitemap-adapter
```
---
### Cloud Platforms
**Google Cloud Run**:
```bash
gcloud run deploy drive-sitemap-adapter \
--source . \
--set-env-vars BASE_URL=https://your-domain.com \
--set-secrets GOOGLE_SERVICE_ACCOUNT_KEY=service-account-key:latest
```
- **AWS ECS / Fargate**: Set the environment variables in the task definition
- **Heroku**: Set the environment variables via the Heroku CLI or dashboard
---
## Additional Resources
- **Feature Specification**: [specs/001-drive-proxy-adapter/spec.md](./spec.md)
- **Implementation Plan**: [specs/001-drive-proxy-adapter/plan.md](./plan.md)
- **Research Document**: [specs/001-drive-proxy-adapter/research.md](./research.md)
- **Data Model**: [specs/001-drive-proxy-adapter/data-model.md](./data-model.md)
- **API Contract**: [specs/001-drive-proxy-adapter/contracts/sitemap-xml-schema.md](./contracts/sitemap-xml-schema.md)
- **Google Drive API Docs**: [https://developers.google.com/drive/api/v3/reference](https://developers.google.com/drive/api/v3/reference)
- **Sitemap Protocol**: [https://www.sitemaps.org/protocol.html](https://www.sitemaps.org/protocol.html)
---
## Support
For issues or questions, refer to:
1. This quickstart guide
2. Feature specification (spec.md) for requirements
3. Research document (research.md) for technical decisions
4. Contract documentation (contracts/) for API details
---
## Version History
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2026-03-07 | Initial quickstart guide |