# GBIF Occurrence Data & Derived Products

Processed GBIF occurrence data indexed to H3 hexagons at resolutions 0-10, plus derived products.

GBIF releases new snapshots periodically. Releases are versioned by `year-month` (e.g., `2026-06`) so multiple versions can coexist in the bucket during transitions. **Always use the most recent release for new work.**

## Current Release: 2026-06

**Path:** `s3://public-gbif/2026-06/hex/`

Global GBIF occurrences (GBIF 2026-06-01 snapshot, **3,499,090,951 records**) partitioned by H3 resolution-0 cell — **122 h0 partitions, one `data_0.parquet` file each**. One row per occurrence; deduplicated by `gbifid`.

### Schema

- Identifier and taxonomic columns: `gbifid`, `datasetkey`, `occurrenceid`, `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`, `infraspecificepithet`, `taxonrank`, `scientificname`, `verbatimscientificname`, `verbatimscientificnameauthorship`
- Occurrence, spatial, and record-level columns: `countrycode`, `locality`, `stateprovince`, `occurrencestatus`, `individualcount`, `decimallatitude`, `decimallongitude`, `coordinateuncertaintyinmeters`, `coordinateprecision`, `elevation`, `depth`, `eventdate`, `day`, `month`, `year`, `taxonkey`, `specieskey`, `basisofrecord`, `institutioncode`, `collectioncode`, `catalognumber`, `recordnumber`, `identifiedby`, `dateidentified`, `license`, `rightsholder`, `recordedby`, `typestatus`, `establishmentmeans`, `lastinterpreted`, `mediatype`, `issue`
- H3 hex columns: `h0` (INT64/UBIGINT, partition key), `h1`–`h10` (UBIGINT). Native resolution is `h10` (each point falls in one cell of roughly 15,000 m², about 0.015 km²); `h0`–`h9` are parent rollups.
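
To illustrate how the hex columns relate, here is a minimal sketch using the community `h3` extension for DuckDB. It indexes one point at resolution 10 and derives two of its parents. The coordinates are an arbitrary illustration, and the casts to `UBIGINT` are only there to match the column type listed above; this is not necessarily how the dataset itself was built.

```sql
INSTALL h3 FROM community; LOAD h3;

-- Index an example point (latitude, longitude) at resolution 10,
-- then roll it up to coarser parent cells.
WITH pt AS (
    SELECT h3_latlng_to_cell(37.8716, -122.2727, 10) AS cell
)
SELECT
    cell::UBIGINT                                AS h10,
    h3_cell_to_parent(cell, 8)::UBIGINT          AS h8,
    h3_cell_to_parent(cell, 0)::UBIGINT          AS h0,
    h3_h3_to_string(h3_cell_to_parent(cell, 0))  AS h0_hex  -- familiar hex-string form
FROM pt;
```

The decimal `h0` value is what appears in the `h0=` partition directory names, so a lookup like this can tell you which partition covers a location of interest.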

### DuckDB Example

```sql
INSTALL h3 FROM community; LOAD h3;
INSTALL httpfs; LOAD httpfs;
SET s3_endpoint = 's3-west.nrp-nautilus.io';
SET s3_url_style = 'path';

-- Count occurrences by family in a single h0 cell
SELECT family, COUNT(*) AS n
FROM read_parquet('s3://public-gbif/2026-06/hex/h0=577762574070710271/data_0.parquet')
GROUP BY family
ORDER BY n DESC
LIMIT 20;
```
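
Because every row already carries `h1`–`h10`, coarser summaries need no re-indexing. The sketch below, which assumes the extensions and S3 settings from the example above are already in place, aggregates the same partition at resolution 5 and attaches cell centers for quick mapping. The choice of `h5` and the `LIMIT` are arbitrary.

```sql
-- Occurrence and species counts per resolution-5 cell, with cell centers
SELECT
    h5,
    h3_cell_to_lat(h5)          AS lat,
    h3_cell_to_lng(h5)          AS lng,
    COUNT(*)                    AS n_occurrences,
    COUNT(DISTINCT specieskey)  AS n_species
FROM read_parquet('s3://public-gbif/2026-06/hex/h0=577762574070710271/data_0.parquet')
GROUP BY h5
ORDER BY n_occurrences DESC
LIMIT 20;
```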

### Fast single-species reads (sidecar)

The hex dataset is partitioned by `h0` (geography), not by species, so a bare `WHERE specieskey=…` filter scans all 122 files. The small **`species-h0-index.parquet`** sidecar (columns `specieskey`, `species`, `h0`; ~22 MB) lists the h0 partitions in which each species occurs. Used as a subquery, it lets DuckDB apply a dynamic partition filter and read only those h0 files (median 1 of 122 partitions; ~2 s):

```sql
SELECT h8, COUNT(*) AS n
FROM read_parquet('s3://public-gbif/2026-06/hex/h0=*/data_0.parquet', hive_partitioning=true)
WHERE h0 IN (
    SELECT h0 FROM read_parquet('s3://public-gbif/2026-06/species-h0-index.parquet')
    WHERE specieskey = 2969135
)
AND specieskey = 2969135
GROUP BY h8;
```

Records identified only to genus or a coarser rank have a NULL `species` and are excluded from the sidecar, which includes only rows where `specieskey IS NOT NULL`.
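
Since the sidecar also carries the `species` name, it can serve as a lookup when you know a name but not its `specieskey`. A minimal sketch, assuming the S3 settings from the DuckDB example are in place; the species name is only an illustration and may need adjusting to match GBIF's spelling:

```sql
-- Find the specieskey for a name and how many h0 partitions it spans
SELECT specieskey, species, COUNT(DISTINCT h0) AS n_partitions
FROM read_parquet('s3://public-gbif/2026-06/species-h0-index.parquet')
WHERE species = 'Quercus agrifolia'   -- example name, not taken from this README
GROUP BY specieskey, species;
```

The returned `specieskey` can then replace `2969135` in the query above.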

## Deprecated: Legacy Hex (pre-2025) and 2025-06

- `s3://public-gbif/hex/` — original VARCHAR-keyed hex, **deprecated**.
- `s3://public-gbif/2025-06/hex/` — superseded by 2026-06; retained temporarily for backward compatibility and scheduled for removal once dependent applications have been repointed to 2026-06.

## Other Assets

### Taxonomic Aggregations by H3 Hexagon
**Prefix:** `s3://public-gbif/2026-06/taxonomy/` (partitioned by `h0`)

Aggregated counts of taxa within H3 resolution-0 hexagons.

**Schema:** `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`, `infraspecificepithet`, `taxonrank`, `scientificname`, `verbatimscientificname`, `verbatimscientificnameauthorship`, `n` (count), `h0`
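
A hedged sketch of querying this asset follows. The file layout under the prefix is not documented here, so the glob below assumes Hive-style `h0=…/` directories containing Parquet files, mirroring the hex dataset; adjust it if the actual layout differs. The kingdom filter is an arbitrary example.

```sql
-- Total occurrences per class, summed over all h0 partitions
-- ASSUMPTION: files live at taxonomy/h0=<id>/<name>.parquet
SELECT "class", SUM(n) AS occurrences
FROM read_parquet('s3://public-gbif/2026-06/taxonomy/h0=*/*.parquet', hive_partitioning=true)
WHERE kingdom = 'Animalia'
GROUP BY "class"
ORDER BY occurrences DESC
LIMIT 10;
```

Column names such as `order` collide with SQL keywords, so quoting them (`"order"`) avoids parse errors.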

### GBIF Occurrences in Redlined Cities
**File:** `s3://public-gbif/redlined_cities_gbif.parquet`

Spatial join of GBIF occurrences with "Mapping Inequality" (Redlining) polygons for US cities.

**Schema:** `gbifid`, `scientificname`, `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`, `recordedby`, `date`, `coordinateuncertaintyinmeters`, `city`, `state`, `grade` (A-D), `residential`, `commercial`, `industrial`
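
As an illustration of how this table might be used, the sketch below compares record counts and distinct species across redlining grades. It relies only on the columns listed above and assumes the S3 settings from the DuckDB example; raw counts are not corrected for area or sampling effort.

```sql
-- Records and distinct species per redlining grade
SELECT
    grade,
    COUNT(*)                 AS n_records,
    COUNT(DISTINCT species)  AS n_species
FROM read_parquet('s3://public-gbif/redlined_cities_gbif.parquet')
WHERE grade IS NOT NULL
GROUP BY grade
ORDER BY grade;
```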

### Taxa List
**File:** `s3://public-gbif/taxa.parquet`

Reference list of all taxa found in the dataset.
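
The schema of this file is not listed here, so one way to inspect it before querying is to ask DuckDB to describe it (again assuming the S3 settings from the DuckDB example):

```sql
-- Show column names and types without reading the full file
DESCRIBE SELECT * FROM read_parquet('s3://public-gbif/taxa.parquet');
```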

## Source & Citation

- **Producer:** Global Biodiversity Information Facility (GBIF)
- GBIF.org (2026). GBIF Occurrence Download. <https://www.gbif.org/>
- Nelson, R. K., et al. (2023). Mapping Inequality. (for redlined cities asset)

## License

- **GBIF Data:** See [GBIF Data Use Agreement](https://www.gbif.org/terms). Individual records may carry their own licenses (CC0, CC-BY, CC-BY-NC); the aggregate is treated as **NonCommercial**.
- **Redlining Data:** CC-BY-NC-SA 4.0.
