The WorldStrat Dataset

ML Research, ESA funded, Satellite Imagery

What is WorldStrat?

It’s a free dataset of nearly 10,000 km² of high-resolution satellite imagery with matched low-resolution imagery, supported by the European Space Agency’s Phi-Lab as part of the ESA-funded QueryPlanet project.

You can download it from Kaggle or Zenodo. It comes with:

The team consisted of Julien Cornebise, myself and Freddie Kalaitzis, and it’s been one of the most interesting things I’ve worked on.

I loved working in a small team because it gave me the chance to learn and do so much research and engineering: designing the dataset, building the pipeline, coding the super-resolution model, and writing the datasheet and paper.

Why make it?

The goal of the dataset is to foster broad-spectrum applications of ML to satellite imagery and, possibly, to get from free, public, low-resolution Sentinel-2 imagery the same power of analysis that costly private high-resolution imagery allows.

With that goal in mind, we also trained and released several highly compute-efficient baselines on the task of Multi-Frame Super-Resolution.

Capturing an entire planet of diversity using only 10,000 km²

This was probably the most interesting part of the project for me. 10,000 km² might sound like a lot, but it’s just a tiny fraction of Earth’s 510.1 million km² surface area.

That’s ~0.002% of the entire planet. And since we’re trying to enable broad-spectrum applications, we have to make sure we represent it as best as possible.
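As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. The constant names are mine; the two values are the ones quoted above.

```python
# Values quoted in the text; the variable names are illustrative only.
EARTH_SURFACE_KM2 = 510.1e6   # total surface area of Earth
DATASET_KM2 = 10_000          # approximate area covered by the dataset

fraction = DATASET_KM2 / EARTH_SURFACE_KM2
print(f"{fraction:.4%} of the planet")  # prints "0.0020% of the planet"
```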

Imagine trying to capture as much of humanity’s diversity as possible with a sample of only 156,000 people. It’s definitely not impossible, but it’s an interesting problem.

How do we do it?

A natural idea would be: can we just randomly sample 10,000 km² from across the entire planet?
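A naive version of that idea might look like the sketch below. It is not the pipeline we actually used, only an illustration: it treats the Earth as a perfect sphere, and the function name `sample_uniform_points` is made up for this example. The one subtlety is that latitude cannot be drawn uniformly, or the poles end up oversampled.

```python
import math
import random

def sample_uniform_points(n, seed=0):
    """Return n (lat, lon) pairs in degrees, uniform by surface area on a sphere."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        # Drawing sin(latitude) uniformly in [-1, 1] gives equal density per unit area.
        # Drawing latitude itself uniformly would oversample the polar regions.
        lat = math.degrees(math.asin(rng.uniform(-1.0, 1.0)))
        lon = rng.uniform(-180.0, 180.0)
        points.append((lat, lon))
    return points

points = sample_uniform_points(10_000)
```

With this scheme about half of the points fall between 30°S and 30°N, because that band covers half of a sphere’s surface.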

We sure can, but let’s take a closer look at our planet:

  • 71% of Earth’s surface is water-covered, and the oceans make up about 96.5% of all Earth’s water.
  • Antarctica is the fifth-largest continent in the world with a size of 14.2 million km².
  • 7% of the Earth’s surface is covered by wetlands, with Canada alone having about 1.29 million km² of wetlands.

So if we just randomly sampled it, we would have:

  • 7,100 km² of water.
  • About 1.4 times as much of Antarctica as of Europe.
  • 700 km² of wetland/moss.
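These numbers follow from multiplying each share by the 10,000 km² budget. Here is a rough sketch of that arithmetic. The area of Europe (about 10.18 million km²) is not given above and is my own assumption, so treat the ratio as approximate.

```python
SAMPLE_KM2 = 10_000
EARTH_SURFACE_KM2 = 510.1e6
ANTARCTICA_KM2 = 14.2e6
EUROPE_KM2 = 10.18e6  # assumption: commonly cited figure, not stated in the text

# Expected area of each category in a uniformly random 10,000 km² sample.
water_km2 = 0.71 * SAMPLE_KM2
wetland_km2 = 0.07 * SAMPLE_KM2
antarctica_km2 = ANTARCTICA_KM2 / EARTH_SURFACE_KM2 * SAMPLE_KM2
europe_km2 = EUROPE_KM2 / EARTH_SURFACE_KM2 * SAMPLE_KM2

print(f"water: {water_km2:.0f} km², wetland: {wetland_km2:.0f} km²")
print(f"Antarctica / Europe: {antarctica_km2 / europe_km2:.2f}")
```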

A sample like that would accurately represent the Earth statistics-wise, and it would make some wetland and moss researchers very happy, but it wouldn’t be very useful for broad-spectrum applications.

But then again, how do we decide how much of Antarctica to keep versus built-up cities, tropical forests, deserts?

Stratifying the sampled points

Whatever distribution we use, whether hand-picked or randomly selected, it won’t be perfect, but there are ways of making it fairer.

We can stratify the points we’ve sampled using a number of different stratifying datasets:

  • ESA Climate Change Initiative Land Cover (ESA CCI LC) dataset: 34 classes ranging from Agriculture and Forests to Permanent Ice and Snow.
  • Intergovernmental Panel on Climate Change (IPCC) stratification: 6 broad classes (Agriculture, Forest, Grassland, Wetland, Settlement, Other).
  • Global Human Settlement Layer Settlement Model grid (GHSL S-MOD) settlement stratification: offers three density levels (Dense, Semi-dense, Low density) across three settlement types (Urban, Suburban, Rural).

Using the above-mentioned datasets, we can stratify our randomly sampled points to get a better understanding of what kinds of areas they cover.
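To make this concrete, here is a minimal sketch of tallying what share of the sampled points falls into each class. It assumes each point has already been given a class label by looking it up in one of the stratifying datasets. The helper `class_shares` and the example labels are hypothetical; only the IPCC class names come from the list above.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of sampled points that falls into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Made-up labels for ten sampled points, using the IPCC class names.
labels = ["Other"] * 6 + ["Forest"] * 2 + ["Agriculture"] + ["Settlement"]
shares = class_shares(labels)
print(shares)
```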

With that insight, we can bump up the share of certain areas, like built-up land, and reduce the share of others, like permanent ice and snow or water bodies.
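One simple way to do that kind of rebalancing is to give each class a quota and draw that many points from it. The sketch below illustrates the idea; it is not the exact procedure we used. The function `rebalance`, the target shares, and the point IDs are all hypothetical.

```python
import random

def rebalance(points, labels, target_shares, n, seed=0):
    """Pick up to n points so that classes roughly follow target_shares."""
    rng = random.Random(seed)
    # Group the candidate points by their class label.
    by_class = {}
    for point, label in zip(points, labels):
        by_class.setdefault(label, []).append(point)
    selected = []
    for cls, share in target_shares.items():
        pool = by_class.get(cls, [])
        # We cannot draw more points from a class than we sampled for it.
        quota = min(round(share * n), len(pool))
        selected.extend((p, cls) for p in rng.sample(pool, quota))
    return selected

# 100 made-up candidate points, dominated by the "Other" class.
points = list(range(100))
labels = ["Other"] * 80 + ["Settlement"] * 15 + ["Forest"] * 5
targets = {"Settlement": 0.5, "Forest": 0.3, "Other": 0.2}
selected = rebalance(points, labels, targets, n=20)
```

Here "Forest" has a quota of 6 but only 5 candidates, so the result comes up one point short. In practice that is a signal to sample more candidate points before rebalancing.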

Asking expert partners to choose locations

Even stratified and sub-stratified random sampling can miss many of the locations that experts would find most interesting.

So we asked our partners from organisations like Amnesty International, the United Nations High Commissioner for Refugees (UNHCR), and the European Space Agency’s Artisanal Mining ASMSpotter project for locations they’d like to see.

We also talked to other experts, such as geographers, to validate our process of stratification and random sampling.

This is what the end result looks like: