# Google Drive Sitemap Adapter
HTTP service that generates XML sitemaps listing all accessible documents in a Google Drive account. Uses Service Account authentication for secure, automated access.
## Features
- **Sitemap Generation**: XML sitemap at `/sitemap.xml` listing all accessible Google Drive documents
- **Document Export**: Export Google Drive documents with original source URL tracking
- **Source URL Header**: X-Verint-KAB-Original-URL response header for content traceability
- **RESTful URLs**: Document links in format `/documents/{documentId}` per sitemap protocol
- **Service Account Auth**: JWT-based authentication using Google Service Account credentials
- **Pagination Support**: Handles large document sets (up to 50,000 URLs per sitemap protocol)
- **50k Limit Enforcement**: Returns 413 error if document count exceeds sitemap protocol limit
- **FIFO Request Queue**: Concurrent requests processed sequentially (one at a time)
- **Rate Limit Handling**: Returns 429 with a `Retry-After` header when the Drive API rate-limits requests
- **No Retry on 503**: Fails immediately on Drive API unavailability (per spec)
- **Minimal Dependencies**: Only `googleapis` package required
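The FIFO request queue listed above can be pictured as a promise chain. The following is a minimal sketch of that idea, not the code in this repository; `createFifoQueue` and `enqueue` are hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: a FIFO queue that runs one async task at a time.
function createFifoQueue() {
  // Tail of the chain: settles when the most recently queued task settles.
  let tail = Promise.resolve();

  return function enqueue(task) {
    // Start this task only after every earlier task has settled.
    const result = tail.then(() => task());
    // Swallow rejections on the chain itself so one failure
    // does not block the tasks queued behind it.
    tail = result.catch(() => {});
    // The caller still sees the task's own result or rejection.
    return result;
  };
}
```

A request handler could wrap its work as `enqueue(() => handleRequest(req, res))`, where `handleRequest` is likewise a placeholder name, so that concurrent requests are served strictly in arrival order.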
## Quick Start
### Prerequisites
- Node.js v18.x or later
- Google Cloud Project with Drive API enabled
- Service Account credentials with Drive API access
### Setup
1. **Install dependencies**:
```bash
npm install
```
2. **Configure Service Account** (see `specs/001-drive-proxy-adapter/quickstart.md` for detailed steps):
   - Create a Service Account in Google Cloud Console
   - Download the service account key JSON file
   - Share Drive files/folders with the service account email
   - Place the key file at `config/service-account-key.json`
3. **Configure environment**:
```bash
cp .env.example .env
# Edit .env with your service account email
```
4. **Start the server**:
```bash
npm start
# or for development with auto-reload:
npm run dev
```
5. **Generate sitemap**:
```bash
curl http://localhost:3000/sitemap.xml
```
### Usage Examples
```bash
# Get sitemap of all documents
curl http://localhost:3000/sitemap.xml
# Verify XML format
curl http://localhost:3000/sitemap.xml | xmllint --noout -
# Count documents in sitemap
curl http://localhost:3000/sitemap.xml | grep -c '<loc>'
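# Export a document and view only the response headers
# (GET request, body discarded; replace <documentId> with a real Drive file ID)
curl -s -D - -o /dev/null "http://localhost:3000/documents/<documentId>"
# Export a document and extract the original Google Drive URL
curl -s -D - -o /dev/null "http://localhost:3000/documents/<documentId>" | grep -i X-Verint-KAB-Original-URL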
```
## Architecture
### Monolithic Design
This project follows a **monolithic architecture** as specified in the project constitution:
- **Single Route File**: ALL routing, business logic, and Drive API integration in `src/proxyScripts/proxy.js` (~350 LOC)
- **Utility Modules**: Separate files for auth, logging, XML utils (constitution-compliant separation of concerns)
- **Configuration as Data**: JSON configuration in `config/default.json` loaded into `global.config` at startup
- **Minimal Dependencies**: Only `googleapis` package for Drive API integration
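To illustrate what "all routing in one file" can look like without a framework, here is a small sketch of a single dispatch function. It is not taken from the route file: `matchRoute` and the returned route names are hypothetical, the path is assumed to arrive without a query string, and treating non-GET methods as "not found" is an assumption.

```javascript
// Hypothetical single-function router in the monolithic style described above.
function matchRoute(method, path) {
  // Assumption: only GET is served; everything else falls through to 404.
  if (method !== "GET") return { name: "notFound" };
  if (path === "/sitemap.xml") return { name: "sitemap" };

  // /documents/{documentId}: exactly one non-empty segment after /documents/.
  const match = /^\/documents\/([^/]+)$/.exec(path);
  if (match) return { name: "document", documentId: match[1] };

  return { name: "notFound" };
}
```

The handler can then `switch` on the returned `name`, keeping the whole request flow readable top to bottom in one place.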
### Why Monolithic?
Rationale defined in constitution:
1. **Simplicity**: Easy to understand, debug, and maintain
2. **Direct Code Flow**: No dependency injection, no framework magic
3. **YAGNI Principle**: No premature abstraction for a focused service
## Testing
### Test Structure
Tests follow a **TDD workflow** with real assertions:
```
tests/
├── contract/ # API contract tests (HTTP interface)
├── integration/ # Drive API integration tests
└── unit/ # Pure function unit tests
```
### Running Tests
```bash
# All tests
npm test
# Specific test suites
npm run test:unit
npm run test:integration
npm run test:contract
```
### Coverage Requirements
- **Minimum**: 80% code coverage (enforced)
- **Tests Written First**: TDD mandatory per constitution
- **Real Assertions**: No placeholder tests
## Configuration
Configuration is loaded from `config/default.json` and merged with environment variables:
```json
{
  "server": {
    "port": 3000,
    "host": "0.0.0.0",
    "baseUrl": "http://localhost:3000"
  },
  "google": {
    "serviceAccountEmail": "service@project.iam.gserviceaccount.com",
    "serviceAccountKeyPath": "./config/service-account-key.json",
    "scopes": ["https://www.googleapis.com/auth/drive.readonly"]
  },
  "sitemap": {
    "maxUrls": 50000
  },
  "logging": {
    "level": "info"
  }
}
```
Environment variables override JSON config (e.g., `PORT`, `GOOGLE_SERVICE_ACCOUNT_EMAIL`).
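The override step might look roughly like the sketch below. It covers only the two variables named above; `applyEnvOverrides` is a hypothetical helper, and the mapping of `PORT` to `server.port` and `GOOGLE_SERVICE_ACCOUNT_EMAIL` to `google.serviceAccountEmail` is an assumption based on the key names.

```javascript
// Hypothetical merge of environment variables over the JSON config.
function applyEnvOverrides(config, env) {
  // Copy the nested objects that may change so the input stays untouched.
  const merged = {
    ...config,
    server: { ...config.server },
    google: { ...config.google },
  };

  if (env.PORT !== undefined) {
    // Number() rejects trailing garbage such as "3000abc" (yields NaN).
    const port = Number(env.PORT);
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      throw new Error(`Invalid PORT: ${env.PORT}`);
    }
    merged.server.port = port;
  }
  if (env.GOOGLE_SERVICE_ACCOUNT_EMAIL) {
    merged.google.serviceAccountEmail = env.GOOGLE_SERVICE_ACCOUNT_EMAIL;
  }
  return merged;
}
```

Called as `applyEnvOverrides(jsonConfig, process.env)`, this leaves every key without a matching variable at its JSON value.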
## API Documentation
### Endpoints
- `GET /sitemap.xml` - XML sitemap of all accessible documents (200 OK with XML body)
- `GET /documents/{documentId}` - Export Google Drive document with source URL tracking
- `GET /*` - All other paths return 404 Not Found (empty body)
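For orientation, the sketch below shows one way the body served at `/sitemap.xml` could be assembled from a list of document IDs. It is an illustration under assumptions, not the project's implementation: `buildSitemapXml`, `escapeXml`, and `MAX_URLS` are hypothetical names, and the output contains only `<loc>` elements.

```javascript
// Hypothetical sitemap builder; the limit mirrors the 50,000-URL protocol cap.
const MAX_URLS = 50000;

// Escape the five XML special characters.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildSitemapXml(baseUrl, documentIds) {
  if (documentIds.length > MAX_URLS) {
    // A caller would translate this into a 413 response.
    throw new RangeError(`Sitemap limit of ${MAX_URLS} URLs exceeded`);
  }
  const entries = documentIds.map((id) => {
    const loc = `${baseUrl}/documents/${encodeURIComponent(id)}`;
    return `  <url><loc>${escapeXml(loc)}</loc></url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
  ].join("\n");
}
```

Each `<loc>` uses the `/documents/{documentId}` form, so every URL in the sitemap resolves to the export endpoint listed above.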
### Response Headers
Successful sitemap response (200 OK):
- `Content-Type: application/xml; charset=utf-8`
- `X-Request-Id: req_<uuid>` - Request tracing ID
- `X-Document-Count: <number>` - Number of documents in sitemap
Successful document export response (200 OK):
- `Content-Type: application/pdf` (or appropriate MIME type)
- `X-Request-Id: req_<uuid>` - Request tracing ID
- `X-Verint-KAB-Original-URL: https://drive.google.com/file/d/{fileId}` - Original Google Drive URL for content traceability
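As a rough illustration of how the export headers above fit together, the following sketch builds them from a Drive file ID and a MIME type. `buildExportHeaders` is a hypothetical helper, and generating the request ID with `randomUUID` from Node's `crypto` module is an assumption about how the `req_<uuid>` value is produced.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical helper that assembles the export response headers listed above.
function buildExportHeaders(fileId, mimeType) {
  return {
    "Content-Type": mimeType,
    // Assumption: the request ID is "req_" followed by a random UUID.
    "X-Request-Id": `req_${randomUUID()}`,
    "X-Verint-KAB-Original-URL": `https://drive.google.com/file/d/${encodeURIComponent(fileId)}`,
  };
}
```

The returned object can be passed directly to `res.writeHead(200, headers)` on a Node.js `http.ServerResponse`.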
### Error Responses
All errors return **HTTP status code only** with **no response body** (per specification):
- `401 Unauthorized` - Service account authentication failed
- `404 Not Found` - Path is neither `/sitemap.xml` nor `/documents/{documentId}`
- `413 Payload Too Large` - Document count exceeds 50,000 (sitemap protocol limit)
- `429 Too Many Requests` - Drive API rate limit exceeded (includes `Retry-After` header in seconds)
- `500 Internal Server Error` - Server error
- `503 Service Unavailable` - Drive API unavailable (NO RETRY per specification)
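One way to picture the error contract above is a small mapping function from an upstream failure to a body-less response. This is a sketch under assumptions: `toErrorResponse` and the shape of its `upstream` argument (`status`, `retryAfterSeconds`) are hypothetical, as is the 60-second fallback when no retry hint is available.

```javascript
// Hypothetical mapping from a Drive API failure to the responses listed above.
function toErrorResponse(upstream) {
  // Statuses passed through as-is; anything else becomes a 500.
  const passthrough = new Set([401, 429, 503]);
  const status = passthrough.has(upstream.status) ? upstream.status : 500;

  const headers = {};
  if (status === 429) {
    // Assumption: fall back to 60 seconds when no usable hint is present.
    const seconds = Number(upstream.retryAfterSeconds);
    const valid = Number.isFinite(seconds) && seconds > 0;
    headers["Retry-After"] = String(valid ? Math.ceil(seconds) : 60);
  }
  // Errors carry a status code and headers only, never a body.
  return { status, headers, body: "" };
}
```

A 503 is returned immediately rather than retried, matching the "no retry" rule in the list above.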
## Performance Characteristics
- **Cold Start**: < 10 seconds to accepting requests
- **Sitemap Generation**: < 5 seconds for 10,000 documents
- **Concurrent Requests**: 10+ without degradation
- **Memory Usage**: < 256MB under normal load
## Development
### Project Structure
```
google-drive-content-adapter/
├── .github/
│ ├── agents/ # Specify workflow agent definitions
│ └── prompts/ # Agent prompt templates
├── .specify/
│ ├── memory/
│ │ └── constitution.md # Project principles and standards
│ ├── scripts/
│ │ └── bash/ # Specify workflow helper scripts
│ └── templates/ # Templates for spec, plan, tasks
├── .vscode/
│ └── settings.json # VS Code configuration
├── config/
│ └── default.json # Application configuration
├── specs/
│ ├── 001-gdrive-url-header/ # Source URL header feature
│ │ ├── checklists/
│ │ ├── contracts/
│ │ ├── spec.md
│ │ ├── plan.md
│ │ ├── tasks.md
│ │ └── ...
│ ├── 001-sitemap/ # Sitemap generation feature
│ └── 002-document-export/ # Document export feature
├── src/
│ ├── globalVariables/
│ │ ├── google_drive_settings.json
│ │ └── googleDriveAdapterHelper.js
│ ├── proxyScripts/
│ │ └── proxy.js # Main request handler (monolithic)
│ ├── logger.js # Structured logging utility
│ └── server.js # HTTP server entry point
├── tests/
│ ├── contract/ # API contract tests
│ ├── integration/ # Integration tests
│ └── unit/ # Unit tests
├── package.json # Dependencies and scripts
└── README.md # This file
```
### Development Workflow
1. **Write Tests First** (TDD)
2. **Implement Minimum Code**
3. **Run Tests**: `npm test`
4. **Run in Development**: `npm run dev`
## Deployment
### Docker
```dockerfile
FROM node:18-alpine
WORKDIR /app
# Install production dependencies only, using the lockfile
COPY package*.json ./
RUN npm ci --omit=dev
COPY src/ ./src/
COPY config/ ./config/
EXPOSE 3000
CMD ["node", "src/server.js"]
```
```bash
docker build -t drive-sitemap-adapter .
docker run -p 3000:3000 -v "$(pwd)/config:/app/config" drive-sitemap-adapter
```
### Direct Node.js
```bash
NODE_ENV=production npm start
```
## Troubleshooting
### Authentication Failed (401)
- Verify service account key file exists at `config/service-account-key.json`
- Check service account email matches configuration
- Ensure Drive API is enabled in Google Cloud project
### Empty Sitemap
- Service account needs access to Drive files
- Share files/folders with service account email
- Check service account has "Viewer" permission
### Rate Limit (429)
- Wait for time specified in `Retry-After` header
- Reduce frequency of sitemap requests
- Check Google Cloud Console quotas
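On the client side, honoring `Retry-After` can be as small as the helper sketched below. `retryAfterMs` is a hypothetical name, it assumes the header carries a number of seconds as described in this README, and the one-second fallback is an arbitrary choice.

```javascript
// Hypothetical client helper: convert a Retry-After value (seconds) to milliseconds.
function retryAfterMs(headerValue, fallbackMs = 1000) {
  // Missing or empty header: use the fallback delay.
  if (headerValue === null || headerValue === undefined || headerValue === "") {
    return fallbackMs;
  }
  const seconds = Number(headerValue);
  // Non-numeric or negative values are ignored in favor of the fallback.
  if (!Number.isFinite(seconds) || seconds < 0) return fallbackMs;
  return seconds * 1000;
}
```

With the `fetch` built into Node.js 18, a caller could compute the delay as `retryAfterMs(response.headers.get("retry-after"))` and wait that long before requesting the sitemap again.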
## License
ISC
## Documentation
For detailed setup and usage instructions, see:
### Sitemap Feature
- [Quick Start Guide](specs/001-drive-proxy-adapter/quickstart.md)
- [Feature Specification](specs/001-drive-proxy-adapter/spec.md)
- [Implementation Plan](specs/001-drive-proxy-adapter/plan.md)
- [Data Model](specs/001-drive-proxy-adapter/data-model.md)
### Source URL Header Feature
- [Quick Start Guide](specs/001-gdrive-url-header/quickstart.md)
- [Feature Specification](specs/001-gdrive-url-header/spec.md)
- [API Contract](specs/001-gdrive-url-header/contracts/response-headers.md)
- [Implementation Plan](specs/001-gdrive-url-header/plan.md)