# WaterNet Outputs and Code

## Introduction
This data repository contains global raster and vector outputs for the WaterNet model at 20-meter resolution, and it links to the code for three Python modules used to generate this data. This research was supported by [Bridges to Prosperity](https://bridgestoprosperity.org/) and conducted by [Better Planet Laboratory](https://betterplanetlab.com/). If you use this data, please cite the original WaterNet paper:

> Pierson, Matthew; Fankhauser, Katie; and Mehrabi, Zia. 2026. Mapping waterways worldwide with deep learning. arXiv. https://doi.org/10.48550/arXiv.2412.00050

## Technical Details
**Projection:** All files use EPSG:4326.

## Raster Data
A live demo of the raster data can be found at [The Fika Map Website](https://map.fikamap.com/waternetraster#9.13/5.5938/35.6743/-10.9/25). A STAC catalog is coming soon!

### Naming Convention
The raster outputs are in the raster folder. File names follow the format `{xtile}_{ytile}.tif`, and each file corresponds to one zoom level 6 XYZ map tile. A geojson file in the root directory named `waterway_model_outputs_20m_raster_overview.geojson` contains the bounding box of each raster file. The raster files are Cloud-Optimized GeoTIFFs (COGs).
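
As a rough illustration of the naming convention, the sketch below computes which zoom level 6 tile contains a given longitude/latitude and builds the matching file name. It assumes the tiles follow the standard Web Mercator XYZ ("slippy map") indexing with the origin at the top-left; the README does not state this explicitly, so checking a result against the bounding boxes in the overview geojson is advisable. The function names are hypothetical.

```python
import math

ZOOM = 6  # zoom level stated in the naming convention


def lonlat_to_tile(lon: float, lat: float, zoom: int = ZOOM) -> tuple[int, int]:
    """Return (xtile, ytile) of the XYZ tile containing a lon/lat point.

    Assumes standard Web Mercator XYZ indexing (an assumption, not
    confirmed by the dataset documentation).
    """
    n = 2 ** zoom
    # Web Mercator tiles are undefined beyond roughly +/-85.0511 degrees.
    lat = max(min(lat, 85.05112878), -85.05112878)
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or the latitude limits stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def raster_filename(lon: float, lat: float) -> str:
    """Build the `{xtile}_{ytile}.tif` name for the tile containing a point."""
    x, y = lonlat_to_tile(lon, lat)
    return f"{x}_{y}.tif"
```

For example, `raster_filename(0.0, 0.0)` yields `32_32.tif` under these assumptions.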

### Raster Data Types
- Unsigned 8-bit integer datatype
- 0 = low probability of waterway
- 255 = high probability of waterway
- Vectorization thresholds:
  - < 255*0.1: not a waterway
  - \> 255*0.5: waterway
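
The thresholds above can be sketched as a small classification helper. The two cut-offs come from the list above; the label for values between them (`"intermediate"`) is an assumption made for illustration, since this README does not say how the vectorization treats that range.

```python
LOW_THRESHOLD = 255 * 0.1   # values below this are not a waterway
HIGH_THRESHOLD = 255 * 0.5  # values above this are a waterway


def pixel_probability(value: int) -> float:
    """Rescale an unsigned 8-bit pixel value to the range 0.0 to 1.0."""
    if not 0 <= value <= 255:
        raise ValueError(f"expected a value in 0..255, got {value}")
    return value / 255


def classify_pixel(value: int) -> str:
    """Classify a pixel value using the documented vectorization thresholds."""
    if not 0 <= value <= 255:
        raise ValueError(f"expected a value in 0..255, got {value}")
    if value < LOW_THRESHOLD:
        return "not_waterway"
    if value > HIGH_THRESHOLD:
        return "waterway"
    # Assumption: the README does not describe this range, so it is only
    # labeled here rather than assigned to either class.
    return "intermediate"
```

Because the pixel values are integers, the cut-offs fall between 25 and 26 and between 127 and 128.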

### Known Issues
The raster product has a few known issues. Accuracy is lower in swampy areas and deserts, although alternative feature weightings of the model can improve the capture of swamps, and our vectorization process aims to resolve the issues in deserts. Noise is higher in areas where it is difficult to create cloud-free composites (coastal areas and areas near the equator); future integration of SAR data may help alleviate this, particularly in a near-real-time deployment. There is also an artifact in Greenland (missing cells) that we expect is due to the Sentinel-2 feature inputs, and that will likely require further investigation and backfilling.

## Vector Data

Vector data has been made available as a series of geoparquet files as well as a set of pmtiles files. A geojson file in the root directory named `waterway_model_outputs_vector_overview.geojson` contains the bounding boxes for each parquet vector file. The individual geoparquet files are available in the vector folder and the pmtiles files are available in the pmtiles folder. There is also a joined pmtiles file named `waterway_model_outputs_20m_vector.pmtiles`. The pmtiles files offer live previews of the data if you click on an individual file. A live demo of the vector data can be found at [The Fika Map Website](https://map.fikamap.com/waternetvector).
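
One way to use an overview file is to find which data files cover a point of interest. The sketch below is a minimal example under two assumptions: the overview is a GeoJSON FeatureCollection whose features are Polygon or MultiPolygon bounding boxes, and the file name is stored somewhere in each feature's properties. Since the property names are not documented here, the function returns the whole properties dictionary. All function names are hypothetical.

```python
import json


def load_overview(path):
    """Read an overview geojson file into a Python dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def geometry_bounds(geometry):
    """Return (min_lon, min_lat, max_lon, max_lat) of a Polygon or MultiPolygon."""
    coords = geometry["coordinates"]
    if geometry["type"] == "Polygon":
        rings = coords
    elif geometry["type"] == "MultiPolygon":
        rings = [ring for polygon in coords for ring in polygon]
    else:
        raise ValueError(f"unsupported geometry type: {geometry['type']}")
    lons = [point[0] for ring in rings for point in ring]
    lats = [point[1] for ring in rings for point in ring]
    return min(lons), min(lats), max(lons), max(lats)


def overview_hits(collection, lon, lat):
    """Return the properties of every feature whose bounds contain the point."""
    hits = []
    for feature in collection["features"]:
        min_lon, min_lat, max_lon, max_lat = geometry_bounds(feature["geometry"])
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            hits.append(feature.get("properties", {}))
    return hits


# Example usage (path as named in this README):
# overview = load_overview("waterway_model_outputs_vector_overview.geojson")
# print(overview_hits(overview, 35.6743, 5.5938))
```

The test uses bounding boxes only, so a hit means the file may cover the point, not that it contains a waterway there.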


### Naming Convention
Files can be found in the vector folder and follow the naming format `{hydrobasins level 2 id}_{part}.parquet`. Each file is a geoparquet file that corresponds to part of a HydroBASINS level 2 basin. The level 2 HydroBASINS can be obtained from [their website](https://www.hydrosheds.org/products/hydrobasins).

This data set is also available on [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YY2XMG) for academic purposes. Because many of the files were larger than 2.5 GB, they were split into parts to satisfy the Harvard Dataverse individual file size limit.
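
Because a basin may be spread over several part files, it can help to group file names by basin before loading them. The following sketch splits each name on its last underscore, which avoids assuming anything about the exact form of the basin ID or the part label. The function names and the example file names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def split_vector_name(filename):
    """Split `{hydrobasins level 2 id}_{part}.parquet` into (basin_id, part)."""
    path = PurePosixPath(filename)
    if path.suffix != ".parquet":
        raise ValueError(f"not a parquet file: {filename}")
    # Split on the last underscore so the basin ID is kept whole.
    basin_id, separator, part = path.stem.rpartition("_")
    if not separator or not basin_id or not part:
        raise ValueError(f"unexpected file name: {filename}")
    return basin_id, part


def group_parts_by_basin(filenames):
    """Map each basin ID to the sorted list of its part files."""
    groups = defaultdict(list)
    for name in filenames:
        basin_id, _ = split_vector_name(name)
        groups[basin_id].append(name)
    return {basin_id: sorted(names) for basin_id, names in groups.items()}


# Example with made-up file names:
# group_parts_by_basin(["vector/1020000010_1.parquet", "vector/1020000010_0.parquet"])
```

All parts of a basin need to be read to obtain the full set of stream segments for that basin.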

### Data Attributes

| Column | Data Type | Description |
|--------|-----------|-------------|
| stream_id | int | Unique ID for the stream segment |
| target_stream_id | int | Stream ID of the next adjacent downstream stream |
| source_stream_ids | list(int) | List of IDs for all adjacent streams flowing into this stream |
| stream_order | int | [Strahler stream order](https://en.wikipedia.org/wiki/Strahler_number) of this stream segment |
| from_tdx | bool | True if stream segment appears in [TDX-Hydro](https://earth-info.nga.mil/index.php?dir=geosci&action=geosciences) |
| tdx_stream_id | int | Corresponding streamID in [TDX-Hydro](https://earth-info.nga.mil/index.php?dir=geosci&action=geosciences) drainage basins dataset |
| intersects_lake | bool | True if geometry intersects a lake in [HydroLakes v1.0 Dataset](https://gee-community-catalog.org/projects/hydrolakes/) |
| length_m | float | Length of geometry computed using [pyproj.Geod.geometry_length](https://pyproj4.github.io/pyproj/stable/api/geod.html#pyproj.Geod.geometry_length) |
| geometry | LineString | Geometry of the segment |
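
The `stream_id` and `target_stream_id` columns together describe the stream network as a directed graph. As a minimal sketch, the function below follows `target_stream_id` links downstream from a starting segment. It takes a plain dictionary mapping each `stream_id` to its `target_stream_id`; how that dictionary is built from the parquet files is left open. The value used to mark a segment with no downstream neighbor is not documented here, so the walk simply stops when the target is missing from the dictionary (an assumption), which also happens when the target lives in a different part file.

```python
def trace_downstream(stream_id, target_of):
    """Follow target_stream_id links from one segment and return the path.

    `target_of` maps stream_id -> target_stream_id. The walk stops when the
    next ID is not a key of `target_of` or has already been visited.
    """
    path = [stream_id]
    visited = {stream_id}
    current = stream_id
    while True:
        next_id = target_of.get(current)
        # Stop at outlets, at segments stored elsewhere, and on cycles.
        if next_id is None or next_id not in target_of or next_id in visited:
            break
        path.append(next_id)
        visited.add(next_id)
        current = next_id
    return path
```

With a toy network `{1: 3, 2: 3, 3: 4, 4: -1}`, tracing from segment 1 gives `[1, 3, 4]`.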

### Known Issues
Vectorization of any large body of water (lakes, swamps, wide rivers, etc.) can produce multiple geometries where there should be a single geometry. One method for identifying such artifacts is to search for streams of order 1 whose target streams are high in order. Since these artifacts often occur in lakes, we have included the column `intersects_lake` in the output, which indicates whether the geometry intersects a lake in HydroLakes. We have also largely addressed this issue with post-processing; both the vectors and the codebase have been updated to reflect these improvements in the latest version of the dataset.
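
The search for order 1 streams that flow into high-order streams can be sketched as follows. Rows are represented as plain dictionaries keyed by the column names from the Data Attributes table. What counts as "high in order" is not specified in this README, so `min_target_order` is a hypothetical parameter with an arbitrary default, as is the `lakes_only` option.

```python
def flag_possible_artifacts(rows, min_target_order=4, lakes_only=False):
    """Return stream_ids of order 1 segments whose target has a high order.

    `min_target_order` is an illustrative cut-off, not a documented value.
    With `lakes_only=True`, only segments with intersects_lake set are kept.
    """
    order_of = {row["stream_id"]: row["stream_order"] for row in rows}
    flagged = []
    for row in rows:
        if row["stream_order"] != 1:
            continue
        if lakes_only and not row["intersects_lake"]:
            continue
        # The target may be absent, for example when it is in another file.
        target_order = order_of.get(row["target_stream_id"])
        if target_order is not None and target_order >= min_target_order:
            flagged.append(row["stream_id"])
    return flagged
```

The result is a list of candidates for manual review rather than a definitive list of artifacts, since a genuine headwater stream can also join a large river directly.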

## WaterNet Models and Code
Code for model creation, training, and vectorization is linked in the GitHub repositories below. Global inference code has not been included, as the large amount of data required makes such an effort difficult to generalize across compute hardware.

- [WaterNet](https://github.com/Better-Planet-Laboratory/WaterNet)
- [WaterNet Training and Evaluation](https://github.com/Better-Planet-Laboratory/WaterNet_training_and_evaluation)
- [WaterNet Vectorization](https://github.com/Better-Planet-Laboratory/WaterNet_vectorize)

The research behind the models is detailed in the following papers:
- [Mapping waterways worldwide with deep learning - https://doi.org/10.48550/arXiv.2412.00050](https://doi.org/10.48550/arXiv.2412.00050)
- [Deep learning waterways for rural infrastructure development - https://doi.org/10.48550/arXiv.2411.13590](https://doi.org/10.48550/arXiv.2411.13590)

Previous versions of the data are also available on Harvard Dataverse with a permanent DOI: https://doi.org/10.7910/DVN/YY2XMG

## Acknowledgments
This research was funded by [Bridges to Prosperity](https://bridgestoprosperity.org/) and carried out by [Better Planet Laboratory](https://betterplanetlab.com/).

## Data Sources
- https://doi.org/10.5270/S2_-6eb6imz
- https://doi.org/10.5270/ESA-c5d3d65
- https://doi.org/10.22541/essoar.171629686.65893579/v1
- https://doi.org/10.1029/2008eo100001

## License
This data is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/). Please cite the original WaterNet paper and Bridges to Prosperity if you use this data.