# Google Drive Sitemap Adapter

HTTP service that generates XML sitemaps listing all accessible documents in a Google Drive account. It authenticates with a Google Service Account for secure, automated access.

## Features

- **Sitemap Generation**: XML sitemap at `/sitemap.xml` listing all accessible Google Drive documents
- **Document Export**: Exports Google Drive documents with original source URL tracking
- **Source URL Header**: `X-Verint-KAB-Original-URL` response header for content traceability
- **RESTful URLs**: Document links in the format `/documents/{documentId}`, per the sitemap protocol
- **Service Account Auth**: JWT-based authentication using Google Service Account credentials
- **Pagination Support**: Handles large document sets (up to the sitemap protocol's limit of 50,000 URLs)
- **50k Limit Enforcement**: Returns a 413 error if the document count exceeds the sitemap protocol limit
- **FIFO Request Queue**: Concurrent requests are processed sequentially (one at a time)
- **Rate Limit Handling**: Returns 429 with a `Retry-After` header when the Drive API rate-limits requests
- **No Retry on 503**: Fails immediately when the Drive API is unavailable (per spec)
- **Minimal Dependencies**: Only the `googleapis` package is required
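
The sitemap-related features above can be illustrated with a minimal sketch. This is not the adapter's actual code: `escapeXml`, `buildSitemap`, and the `statusCode` property on the thrown error are hypothetical names chosen for illustration. It shows one way to emit `/documents/{documentId}` links and to refuse document sets larger than 50,000 URLs.

```javascript
// Hypothetical sketch, not the adapter's real implementation.
const MAX_URLS = 50000; // sitemap protocol limit for a single sitemap file

// Escape the five characters that are special in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build the sitemap XML, or throw an error tagged with status 413
// when the document count exceeds the protocol limit.
function buildSitemap(baseUrl, documentIds) {
  if (documentIds.length > MAX_URLS) {
    const err = new Error(`Document count ${documentIds.length} exceeds ${MAX_URLS}`);
    err.statusCode = 413;
    throw err;
  }
  const entries = documentIds.map((id) => {
    const loc = `${baseUrl}/documents/${encodeURIComponent(id)}`;
    return `  <url><loc>${escapeXml(loc)}</loc></url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}
```

Calling `buildSitemap('http://localhost:3000', ['abc123'])` would yield a `<urlset>` with a single `<loc>` pointing at `/documents/abc123`.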

## Quick Start

### Prerequisites

- Node.js v18.x or later
- Google Cloud Project with Drive API enabled
- Service Account credentials with Drive API access

### Setup

1. **Install dependencies**:
   ```bash
   npm install
   ```

2. **Configure Service Account** (see `specs/001-drive-proxy-adapter/quickstart.md` for detailed steps):
   - Create Service Account in Google Cloud Console
   - Download service account key JSON file
   - Share Drive files/folders with service account email
   - Place key file at `config/service-account-key.json`

3. **Configure environment**:
   ```bash
   cp .env.example .env
   # Edit .env with your service account email
   ```

4. **Start the server**:
   ```bash
   npm start
   # or for development with auto-reload:
   npm run dev
   ```

5. **Generate sitemap**:
   ```bash
   curl http://localhost:3000/sitemap.xml
   ```

### Usage Examples

```bash
# Get sitemap of all documents
curl http://localhost:3000/sitemap.xml

# Verify XML format
curl -s http://localhost:3000/sitemap.xml | xmllint --noout -

# Count documents in sitemap (counts every <loc>, even if the XML is on one line)
curl -s http://localhost:3000/sitemap.xml | grep -o '<loc>' | wc -l

# Export a document and view the response headers only.
# Replace {documentId} with a real Drive file ID. -D - prints the headers of a
# GET request; -o /dev/null discards the exported body.
curl -s -D - -o /dev/null http://localhost:3000/documents/{documentId}

# Export document and extract original Google Drive URL
curl -s -D - -o /dev/null http://localhost:3000/documents/{documentId} | grep -i 'X-Verint-KAB-Original-URL'
```
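
The same checks can be scripted from Node.js. The sketch below is illustrative only: `extractLocs` and `fetchSitemapLocs` are hypothetical helper names, and the regular expression assumes the sitemap's `<loc>` elements carry no attributes or nested markup. It relies on the global `fetch` available in Node.js 18.

```javascript
// Hypothetical client-side helpers, not part of the adapter.

// Return the text of every <loc> element in a sitemap document.
// Note: values are returned as-is, without decoding XML entities.
function extractLocs(xml) {
  return Array.from(xml.matchAll(/<loc>([^<]*)<\/loc>/g), (match) => match[1]);
}

// Fetch a sitemap and list its document URLs.
async function fetchSitemapLocs(sitemapUrl) {
  const res = await fetch(sitemapUrl);
  if (!res.ok) {
    throw new Error(`Sitemap request failed with status ${res.status}`);
  }
  return extractLocs(await res.text());
}
```

For example, `fetchSitemapLocs('http://localhost:3000/sitemap.xml')` resolves to an array whose length equals the document count.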

## Architecture

### Monolithic Design

This project follows a **monolithic architecture** as specified in the project constitution:

- **Single Route File**: ALL routing, business logic, and Drive API integration in `src/proxy.js` (~350 LOC)
- **Utility Modules**: Separate files for auth, logging, XML utils (constitution-compliant separation of concerns)
- **Configuration as Data**: JSON configuration in `config/default.json` loaded into `global.config` at startup
- **Minimal Dependencies**: Only `googleapis` package for Drive API integration

### Why Monolithic?

Rationale defined in constitution:

1. **Simplicity**: Easy to understand, debug, and maintain
2. **Direct Code Flow**: No dependency injection, no framework magic
3. **YAGNI Principle**: No premature abstraction for a focused service

### Structure

```
src/
├── server.js             # HTTP server, config loader, validation
├── proxy.js              # Request handler with FIFO queue integration
├── drive-client.js       # Drive API integration with 50k limit enforcement
├── sitemap-generator.js  # Sitemap XML generation with RESTful URLs
├── queue.js              # FIFO request queue (sequential processing)
├── auth.js               # Service Account authentication
├── logger.js             # Structured logging utility
├── utils.js              # Request ID, validation
└── xml-utils.js          # XML escaping
```
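
As an illustration of the sequential processing that `queue.js` is described as providing, here is a minimal sketch of a promise-chained FIFO queue. The class name `FifoQueue` and its `enqueue` method are assumptions made for this example; the real module may be structured differently.

```javascript
// Hypothetical sketch of a FIFO request queue, not the actual queue.js.
class FifoQueue {
  constructor() {
    // Tail of the promise chain; each new task waits for this to settle.
    this.tail = Promise.resolve();
  }

  // Run `task` (a function returning a value or promise) after all
  // previously enqueued tasks have finished. Resolves or rejects with
  // the task's own outcome.
  enqueue(task) {
    const result = this.tail.then(() => task());
    // Swallow failures on the chain itself so one failed task
    // does not block the tasks queued behind it.
    this.tail = result.catch(() => {});
    return result;
  }
}
```

A request handler could wrap each sitemap generation in `queue.enqueue(() => generateSitemap())`, where `generateSitemap` stands in for whatever work must not run concurrently.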

## Testing

### Test Structure

Tests follow **TDD workflow** with real assertions:

```
tests/
├── contract/      # API contract tests (HTTP interface)
├── integration/   # Drive API integration tests
└── unit/          # Pure function unit tests
```

### Running Tests

```bash
# All tests
npm test

# Specific test suites
npm run test:unit
npm run test:integration
npm run test:contract
```

### Coverage Requirements

- **Minimum**: 80% code coverage (enforced)
- **Tests Written First**: TDD mandatory per constitution
- **Real Assertions**: No placeholder tests

## Configuration

Configuration is loaded from `config/default.json` and merged with environment variables:

```json
{
  "server": {
    "port": 3000,
    "host": "0.0.0.0",
    "baseUrl": "http://localhost:3000"
  },
  "google": {
    "serviceAccountEmail": "service@project.iam.gserviceaccount.com",
    "serviceAccountKeyPath": "./config/service-account-key.json",
    "scopes": ["https://www.googleapis.com/auth/drive.readonly"]
  },
  "sitemap": {
    "maxUrls": 50000
  },
  "logging": {
    "level": "info"
  }
}
```

Environment variables override JSON config (e.g., `PORT`, `GOOGLE_SERVICE_ACCOUNT_EMAIL`).
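
One way to picture the override behavior is the sketch below. The function name `applyEnvOverrides` is hypothetical, and the mapping of `PORT` to `server.port` and `GOOGLE_SERVICE_ACCOUNT_EMAIL` to `google.serviceAccountEmail` is an assumption inferred from the key names; the actual loader may map variables differently or support more of them.

```javascript
// Hypothetical sketch of merging environment variables over JSON config.
function applyEnvOverrides(fileConfig, env) {
  // Deep-copy the JSON config so the loaded defaults are never mutated.
  const config = JSON.parse(JSON.stringify(fileConfig));

  if (env.PORT !== undefined) {
    const port = Number(env.PORT);
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      throw new Error(`Invalid PORT: ${env.PORT}`);
    }
    config.server.port = port; // assumed mapping: PORT -> server.port
  }

  if (env.GOOGLE_SERVICE_ACCOUNT_EMAIL) {
    // assumed mapping: GOOGLE_SERVICE_ACCOUNT_EMAIL -> google.serviceAccountEmail
    config.google.serviceAccountEmail = env.GOOGLE_SERVICE_ACCOUNT_EMAIL;
  }

  return config;
}
```

At startup this would be called as `applyEnvOverrides(defaults, process.env)`, with `defaults` parsed from `config/default.json`.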

## API Documentation

### Endpoints

- `GET /sitemap.xml` - XML sitemap of all accessible documents (200 OK with XML body)
- `GET /documents/{documentId}` - Export Google Drive document with source URL tracking
- `GET /*` - All other paths return 404 Not Found (empty body)
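
The routing table above is small enough to express as a single pure function. The following is a sketch under stated assumptions: `matchRoute` and the shape of its return value are hypothetical, and treating non-`GET` methods as 404 is a guess, since this document only specifies `GET` behavior.

```javascript
// Hypothetical routing sketch, not the adapter's actual request handler.
const DOCUMENT_PATH = /^\/documents\/([^/]+)$/;

function matchRoute(method, url) {
  const path = url.split('?')[0]; // ignore any query string

  if (method === 'GET' && path === '/sitemap.xml') {
    return { route: 'sitemap' };
  }

  const match = method === 'GET' ? DOCUMENT_PATH.exec(path) : null;
  if (match) {
    return { route: 'document', documentId: match[1] };
  }

  // Everything else: 404 with an empty body.
  return { route: 'notFound', status: 404 };
}
```

For instance, `matchRoute('GET', '/documents/abc123')` identifies a document export for `abc123`, while `matchRoute('GET', '/other')` falls through to the 404 case.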

### Response Headers

Successful sitemap response (200 OK):

- `Content-Type: application/xml; charset=utf-8`
- `X-Request-Id: req_<uuid>` - Request tracing ID
- `X-Document-Count: <number>` - Number of documents in sitemap

Successful document export response (200 OK):

- `Content-Type: application/pdf` (or appropriate MIME type)
- `X-Request-Id: req_<uuid>` - Request tracing ID
- `X-Verint-KAB-Original-URL: https://drive.google.com/file/d/{fileId}` - Original Google Drive URL for content traceability
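
To show how the export headers fit together, here is a small sketch that assembles them from a file ID. The helper name `buildExportHeaders` and its parameter object are hypothetical; only the header names and the URL pattern come from the list above.

```javascript
// Hypothetical helper, not the adapter's actual code.
function buildExportHeaders({ fileId, mimeType, requestId }) {
  return {
    'Content-Type': mimeType,
    'X-Request-Id': requestId,
    // Original Google Drive URL, following the documented pattern.
    'X-Verint-KAB-Original-URL': `https://drive.google.com/file/d/${encodeURIComponent(fileId)}`,
  };
}
```

The resulting object can be passed directly to `res.writeHead(200, headers)` on a Node.js `http.ServerResponse`.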

### Error Responses

All errors return **HTTP status code only** with **no response body** (per specification):

- `401 Unauthorized` - Service account authentication failed
- `404 Not Found` - Path is neither `/sitemap.xml` nor `/documents/{documentId}`
- `413 Payload Too Large` - Document count exceeds 50,000 (sitemap protocol limit)
- `429 Too Many Requests` - Drive API rate limit exceeded (includes `Retry-After` header in seconds)
- `500 Internal Server Error` - Server error
- `503 Service Unavailable` - Drive API unavailable (NO RETRY per specification)
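
The mapping from an internal failure to one of these responses might look like the sketch below. Everything here is an assumption made for illustration: the function name `toErrorResponse`, the `status` and `retryAfterSeconds` properties on the error object, and the 60-second default are not taken from the adapter or from the `googleapis` library.

```javascript
// Hypothetical error-mapping sketch, not the adapter's actual code.
const KNOWN_STATUSES = new Set([401, 413, 429, 503]);

function toErrorResponse(err) {
  // Anything not explicitly recognized is reported as a generic 500.
  const status = KNOWN_STATUSES.has(err?.status) ? err.status : 500;
  const headers = {};

  if (status === 429) {
    // Retry-After is expressed in seconds; 60 is an assumed fallback.
    headers['Retry-After'] = String(err.retryAfterSeconds ?? 60);
  }

  // Errors carry a status code (and possibly headers) but never a body.
  return { status, headers, body: '' };
}
```

A 503 from the Drive API passes straight through with no retry attempt, which matches the behavior described above.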

## Performance Characteristics

- **Cold Start**: < 10 seconds to accepting requests
- **Sitemap Generation**: < 5 seconds for 10,000 documents
- **Concurrent Requests**: 10+ without degradation
- **Memory Usage**: < 256 MB under normal load

## Development

### Project Structure

```
google-drive-content-adapter/
├── config/
│   └── default.json              # Configuration
├── src/
│   ├── server.js                 # HTTP server
│   ├── proxy.js                  # Request handler (monolithic)
│   ├── drive-client.js           # Drive API integration
│   ├── sitemap-generator.js      # Sitemap XML generation
│   ├── queue.js                  # FIFO request queue
│   ├── auth.js                   # Service Account auth
│   ├── logger.js                 # Structured logging
│   ├── utils.js                  # Utilities
│   └── xml-utils.js              # XML escaping
├── tests/
│   ├── contract/                 # API contract tests
│   ├── integration/              # Integration tests
│   └── unit/                     # Unit tests
├── specs/
│   ├── 001-drive-proxy-adapter/  # Sitemap feature spec, plan, tasks
│   └── 001-gdrive-url-header/    # Source URL header spec, plan, API contract
├── .env.example                  # Environment template
├── package.json                  # Dependencies and scripts
└── README.md                     # This file
```

### Development Workflow

1. **Write Tests First** (TDD)
2. **Implement Minimum Code**
3. **Run Tests**: `npm test`
4. **Run in Development**: `npm run dev`

## Deployment

### Docker

```dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
# Install production dependencies only
RUN npm ci --omit=dev
COPY src/ ./src/
COPY config/ ./config/
EXPOSE 3000
CMD ["node", "src/server.js"]
```

```bash
docker build -t drive-sitemap-adapter .
docker run -p 3000:3000 -v "$(pwd)/config:/app/config" drive-sitemap-adapter
```

### Direct Node.js

```bash
NODE_ENV=production npm start
```

## Troubleshooting

### Authentication Failed (401)

- Verify service account key file exists at `config/service-account-key.json`
- Check service account email matches configuration
- Ensure Drive API is enabled in Google Cloud project

### Empty Sitemap

- Service account needs access to Drive files
- Share files/folders with service account email
- Check service account has "Viewer" permission

### Rate Limit (429)

- Wait for time specified in `Retry-After` header
- Reduce frequency of sitemap requests
- Check Google Cloud Console quotas
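
A client that calls the adapter on a schedule can honor `Retry-After` automatically. The sketch below is one possible approach, not part of this project: `retryAfterMs` and `fetchWithRetry` are hypothetical names, the one-second fallback and three-attempt cap are arbitrary choices, and the header is assumed to contain a number of seconds as documented under Error Responses.

```javascript
// Hypothetical client-side sketch for handling 429 responses.

// Convert a Retry-After header value (seconds) into milliseconds,
// falling back to a default when the header is missing or malformed.
function retryAfterMs(headerValue, fallbackMs = 1000) {
  const text = headerValue == null ? '' : String(headerValue).trim();
  const seconds = Number(text);
  if (text === '' || !Number.isFinite(seconds) || seconds < 0) {
    return fallbackMs;
  }
  return seconds * 1000;
}

// Fetch a URL, waiting and retrying while the server answers 429.
// Uses the global fetch available in Node.js 18.
async function fetchWithRetry(url, maxAttempts = 3) {
  for (let attempt = 1; ; attempt += 1) {
    const res = await fetch(url);
    if (res.status !== 429 || attempt >= maxAttempts) {
      return res;
    }
    const delay = retryAfterMs(res.headers.get('Retry-After'));
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

After the final attempt the last response is returned as-is, so callers still need to check `res.status` themselves.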

## License

ISC

## Documentation

For detailed setup and usage instructions, see:

### Sitemap Feature

- [Quick Start Guide](specs/001-drive-proxy-adapter/quickstart.md)
- [Feature Specification](specs/001-drive-proxy-adapter/spec.md)
- [Implementation Plan](specs/001-drive-proxy-adapter/plan.md)
- [Data Model](specs/001-drive-proxy-adapter/data-model.md)

### Source URL Header Feature

- [Quick Start Guide](specs/001-gdrive-url-header/quickstart.md)
- [Feature Specification](specs/001-gdrive-url-header/spec.md)
- [API Contract](specs/001-gdrive-url-header/contracts/response-headers.md)
- [Implementation Plan](specs/001-gdrive-url-header/plan.md)