# Source Stats

Weekly statistics about Source Cooperative's data storage and usage, generated from S3 inventory reports.

## Available Data

This repository contains three types of CSV reports, updated weekly:

### Account Statistics (`accounts/`)
Storage metrics grouped by account (data contributor).

**Filename format**: `accounts/YYYYMMDD.csv`

**Columns**:
- `account` - Account identifier
- `repositories` - Number of repositories in the account
- `objects` - Total file count
- `storage_gb` - Storage used in gigabytes
- `avg_object_size_mb` - Average file size in megabytes
- `oldest_file` - Timestamp of oldest file
- `newest_file` - Timestamp of newest file

### Repository Statistics (`repositories/`)
Detailed breakdown by individual repository.

**Filename format**: `repositories/YYYYMMDD.csv`

**Columns**:
- `account` - Account identifier
- `repository` - Repository name
- `objects` - Total file count
- `storage_gb` - Storage used in gigabytes
- `avg_object_size_mb` - Average file size in megabytes
- `oldest_file` - Timestamp of oldest file
- `newest_file` - Timestamp of newest file
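
The repository-level columns can be rolled up by account, which is a quick way to cross-check a repository report against the account report for the same date. The following is a minimal sketch using only the standard library. It assumes account totals are simple sums of repository rows; the sample rows, account names, and timestamp format are hypothetical and are not taken from a real report.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the repositories/ column layout (values are made up).
SAMPLE_REPOSITORIES_CSV = """account,repository,objects,storage_gb,avg_object_size_mb,oldest_file,newest_file
alice,roads,100,2.0,20.0,2023-01-01 00:00:00,2024-01-01 00:00:00
alice,rivers,300,6.0,20.0,2022-06-01 00:00:00,2023-12-01 00:00:00
bob,imagery,50,10.0,200.0,2024-02-01 00:00:00,2024-03-01 00:00:00
"""


def rollup_by_account(csv_text):
    """Sum repository rows into per-account totals (assumes totals are plain sums)."""
    totals = defaultdict(lambda: {"repositories": 0, "objects": 0, "storage_gb": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["account"]]
        entry["repositories"] += 1
        entry["objects"] += int(row["objects"])
        entry["storage_gb"] += float(row["storage_gb"])
    return dict(totals)


account_totals = rollup_by_account(SAMPLE_REPOSITORIES_CSV)
```

With a downloaded file, the same function can be called on the file's text content instead of the sample string.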

### Platform Summary (`source/`)
High-level metrics for the entire Source Cooperative platform.

**Filename format**: `source/YYYYMMDD.csv`

**Columns**:
- `metric` - Metric name
- `value` - Metric value

**Metrics included**:
- Total Accounts
- Total Repositories
- Total Objects
- Total Storage (TB)
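
Because the summary file is in long format (one `metric`/`value` pair per row), it is often convenient to turn it into a lookup table. The sketch below assumes every value parses as a number and that the metric names match the list above exactly; the sample values are hypothetical.

```python
import csv
import io

# Hypothetical sample in the source/ column layout (values are made up).
SAMPLE_SUMMARY_CSV = """metric,value
Total Accounts,12
Total Repositories,34
Total Objects,56789
Total Storage (TB),1.5
"""


def summary_to_dict(csv_text):
    """Map each metric name to its value, parsed as a float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["metric"]: float(row["value"]) for row in reader}


summary = summary_to_dict(SAMPLE_SUMMARY_CSV)
total_storage_tb = summary["Total Storage (TB)"]
```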

## Data Notes

- **Update Frequency**: Weekly; reports are typically generated once new S3 inventory reports become available
- **Date Format**: Files are named using `YYYYMMDD.csv` format based on the report generation date
- **Data Scope**: Covers AWS S3 storage only (excludes Microsoft Azure hosted datasets)
- **Structure**: Analyzes data following Source's `[account]/[repository]` folder structure
- **Deduplication**: Inventory records are automatically deduplicated to ensure accurate counts
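
Since filenames encode the report generation date as `YYYYMMDD`, the most recent report can be picked from a listing by parsing the filename. The helper names below (`report_date`, `latest_report`) are hypothetical, and the sketch assumes the listing has already been collected into a list of keys.

```python
from datetime import datetime
from pathlib import PurePosixPath


def report_date(key):
    """Return the date encoded in a `.../YYYYMMDD.csv` key, or None if it does not match."""
    path = PurePosixPath(key)
    stem = path.stem
    if path.suffix != ".csv" or len(stem) != 8 or not stem.isdigit():
        return None
    try:
        return datetime.strptime(stem, "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None


def latest_report(keys):
    """Return the key with the most recent date, ignoring keys that do not match."""
    dated = [(report_date(key), key) for key in keys]
    dated = [(day, key) for day, key in dated if day is not None]
    return max(dated)[1] if dated else None


# Example listing with made-up dates and two non-matching entries.
example_keys = [
    "accounts/20240101.csv",
    "accounts/20240108.csv",
    "accounts/README.md",
    "accounts/notadate.csv",
]
newest = latest_report(example_keys)
```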

## Usage Examples

### List Available Files
```bash
# List all account statistics files
aws s3 ls s3://us-west-2.opendata.source.coop/source/source-stats/accounts/ --no-sign-request

# List all repository statistics files
aws s3 ls s3://us-west-2.opendata.source.coop/source/source-stats/repositories/ --no-sign-request

# List all platform summary files
aws s3 ls s3://us-west-2.opendata.source.coop/source/source-stats/source/ --no-sign-request
```

### Download Files
```bash
# Download account statistics for one report date (replace YYYYMMDD, e.g. with the newest date from the listing)
aws s3 cp s3://us-west-2.opendata.source.coop/source/source-stats/accounts/YYYYMMDD.csv . --no-sign-request

# Download via direct URL
curl -O https://us-west-2.opendata.source.coop/source/source-stats/accounts/YYYYMMDD.csv
```

### Programmatic Access
```python
import pandas as pd

# Load account statistics for one report date (replace YYYYMMDD with an actual report date)
df = pd.read_csv('https://us-west-2.opendata.source.coop/source/source-stats/accounts/YYYYMMDD.csv')

# Load the platform summary (replace YYYYMMDD with an actual report date)
summary = pd.read_csv('https://us-west-2.opendata.source.coop/source/source-stats/source/YYYYMMDD.csv')
```
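
To avoid editing the `YYYYMMDD` placeholder by hand, the URL can be built from a date object. This is a small sketch; `report_url` is a hypothetical helper, the base URL is copied from the examples above, and the function does not check that a report actually exists for the given date.

```python
from datetime import date

# Base URL as used in the examples above.
BASE_URL = "https://us-west-2.opendata.source.coop/source/source-stats"
# The three report folders described in this README.
REPORT_TYPES = ("accounts", "repositories", "source")


def report_url(report_type, report_date):
    """Build the URL of a report for the given type and generation date."""
    if report_type not in REPORT_TYPES:
        raise ValueError(f"unknown report type: {report_type!r}")
    return f"{BASE_URL}/{report_type}/{report_date:%Y%m%d}.csv"


# Made-up date, for illustration only.
example_url = report_url("accounts", date(2024, 1, 8))
```

The resulting string can be passed to `pd.read_csv` in the same way as the hard-coded URLs.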

## Data Generation

These statistics are generated automatically from S3 inventory reports using Amazon Athena queries. The source code for the generation process is available at [github.com/source-cooperative/source-stats](https://github.com/source-cooperative/source-stats).

## Questions or Issues

For questions about this data or to report issues, please contact Source Cooperative support or create an issue in the [source-stats repository](https://github.com/source-cooperative/source-stats/issues).