# Energy Performance Certificates (England and Wales)
## Description
This dataset is a copy of the [Energy Performance Certificates](https://epc.opendatacommunities.org/) for England and Wales made available in a [cloud-native geospatial](https://cloudnativegeo.org) geoparquet format. The original dataset is distributed as a zipped collection CSVs, and is available for download from [here](https://epc.opendatacommunities.org/).

## Accessing the Data
The data has been made available as admin-partitioned geoparquet.You can query the entire dataset using DuckDB like so.
```bash
duckdb -c "Select count(*) from read_parquet('s3://us-west-2.opendata.source.coop/addresscloud/epc/geoparquet-local-authority/*.parquet');" 
```
Alternatively, you can use the aws cli to access the data directly from the S3 bucket:
```bash
aws s3 --no-sign-request ls s3://us-west-2.opendata.source.coop/addresscloud/epc/
```
Or if you want the data for just one local authority then choose one from the browse section of this page.  

The geoparquet format is compliant with GDAL 3.5 onwards and readable in QGIS and many other platforms. See the [geoparquet website]() for more info on this  

## Data extraction process
These steps should help you emulate the process of extracting the EPC The data source data exists [here](https://epc.opendatacommunities.org/). You will need to sign up. Once doing this you should find the data avaible to download as a ZIP.
This ZIP contains a directory for each local authortiy in England and Wales. I've created this script to extract just the certificates.csv from each directory.
```bash
#!/bin/bash
# Zip file name in the current directory
zip_file="all-domestic-certificates.zip"
# Loop through the zip file contents and extract certificates.csv
unzip -l "$zip_file" | grep certificates.csv | while read -r line; do
file_name=$(echo "$line" | awk '{print $NF}')
folder_name=$(dirname "$file_name")
# Extract the specific file and save it with the folder name as filename
unzip -p "$zip_file" "$file_name" > "epc_csv/${folder_name//\//-}.csv"
done
```
## Data conversion
The next steps involve using the DuckDB client to load the data, geocoding each row by joining to the [OS Open UPRN dataset](https://www.ordnancesurvey.co.uk/products/os-open-uprn) and then exporting the entire dataset to admin-partitioned parquet files. Then we'll need to convert these to geoparquet using gpq tool.
There are 347 CSVs totalling 23.4 GB in size. You can convert the data to parquet using various tools. Once you've done that the total combined file size of the Parquet files is only 3.7 GB.
The steps below
```bash
$ duckdb
```
```sql
create or replace view epc_certs as select * from read_parquet('epc/*.parquet');
```
```sql
create or replace view opuprn as select * from read_parquet('opuprn.parquet');
```
```sql
create or replace view epc as
select epc_certs.*,
opuprn.LATITUDE as lat,
opuprn.LONGITUDE as lon
from epc_certs
join opuprn ON epc_certs.uprn = opuprn.uprn::text;
```
Once all the views have been created we can then export to partionied parquet files. The partition used here is the local authority name.
```sql
install spatial;
load spatial;
COPY (Select UPRN as uprn,
LOCAL_AUTHORITY as local_authority,
LOCAL_AUTHORITY_LABEL as local_authority_label,
CURRENT_ENERGY_RATING as current_energy_rating,
POTENTIAL_ENERGY_RATING as potential_rating,
INSPECTION_DATE as inspection_date,
ST_AsText(ST_Point(lon::double,lat::double)) as geometry
from epc)
TO 'epc_partitioned' (FORMAT PARQUET, PARTITION_BY (LOCAL_AUTHORITY_LABEL));
```
## Converting to Geoparquet
It is currently not possible to export to geoparquet using DuckDB so to convert all the outputted parquet to geoparquet and remove the parquet files I created this script
```bash
#!/bin/bash
shopt -s globstar
for d in ./**/*.parquet
do
dir="${d%/*}" # Strip the *.parquet pathname back to the containing directory
#echo $dir
parquet="${d##*/}" # Strip the *.parquet pathname back to just the filename
new_name=$(echo "$dir" | tr ' ' _ |cut -d'=' -f2)
#echo "Converted $parquet to geoparquet" >&2
(
cd "$dir" || exit
find . -type f -name "*.parquet" -size -100b -delete
if [ -f $parquet ]
then
gpq convert $parquet $new_name.parquet
fi
)
done
for d in ./**/*.parquet
do find . -type f -name "data*" -delete;
done
```
## Licence
Date is licensed under the Open Government Licence v3.0. More details can be found [here](https://epc.opendatacommunities.org/docs/copyright).
## Contact
This is an experimental dataset and I hope to add more attributes in the future. If you have any questions about the process you can contact me using matt@addresscloud.com