Data principles

While clear titles and descriptions are important, your data is what users are ultimately looking for. To help you provide data that is easy and reliable to use, we have a few guiding principles.

  1. Ensure machine readability

Data should be in a format a computer can understand. This means relevant fields can be extracted and parsed without human input. As we are an open data collective, any data you upload must also be in a non-proprietary file format.
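As a rough illustration of what "extracted and parsed without human input" can mean, the sketch below reads a small CSV (a common non-proprietary format) with Python's standard csv module. The column names and values are made up for the example, and parse_rows is a hypothetical helper rather than part of any platform tooling.

```python
import csv
import io

# Hypothetical sample data; in practice this would be the contents of a CSV file.
raw_csv = "name,height_cm,age\nAlice,162.5,34\nBob,175.0,41\n"


def parse_rows(text):
    """Parse CSV text into dictionaries with typed values, with no manual cleanup."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "name": row["name"],
            "height_cm": float(row["height_cm"]),  # fails loudly if not numeric
            "age": int(row["age"]),
        }
        for row in reader
    ]


parsed = parse_rows(raw_csv)
print(parsed[0]["height_cm"])
```

If a value such as "162.5 cm (approx.)" appeared in height_cm, the float conversion would raise an error, which is a quick sign that the field is not yet machine readable.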

  2. Check for errors and inconsistencies

Your data should be free from errors and inconsistencies. This goes a long way towards establishing trust, protecting the integrity of your data, and allowing multiple datasets to be combined or used together, even across agencies.

We strongly encourage correcting any issues that negatively affect data quality, such as:

  • Null values in columns. If all values in a column are null, consider removing the column.

  • Duplicate rows

  • Outliers. While having many outliers is not necessarily a cause for concern, impossible values should not appear. For example, negative values should not appear in a column that can only contain positive ones.

  • Inconsistent capitalisation in datasets. For example, if a value in a column is entered in all uppercase, all other values in that same column should also be entered in all uppercase.

    Consistency in capitalisation is important, as the same value appearing more than once in different casing may be considered different values altogether. For simplicity, we suggest sticking to one capitalisation format.

  • Inconsistent spacing in column values. For example, entering “hello” without a trailing space in one row and “hello ” with a trailing space in another.
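The checks listed above can be automated before upload. The following is a minimal sketch using only Python's standard library; the sample data, the column names, and the find_issues helper are all assumptions made for illustration, and the choice of which columns must be non-negative is left to the caller.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample with one deliberate instance of each problem.
sample_csv = (
    "id,region,count,notes\n"
    "1,NORTH,10,\n"
    "2,north,-3,\n"
    "3,SOUTH ,7,\n"
    "1,NORTH,10,\n"
)


def find_issues(text, non_negative=("count",)):
    """Return a dictionary describing common data quality problems."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    issues = {}

    # Columns in which every value is empty (treated here as null).
    issues["all_null_columns"] = [
        c for c in columns if all(r[c] == "" for r in rows)
    ]

    # Rows that appear more than once.
    seen = Counter(tuple(r[c] for c in columns) for r in rows)
    issues["duplicate_rows"] = [list(k) for k, n in seen.items() if n > 1]

    # Impossible values: negatives in columns that should be non-negative.
    issues["negative_values"] = [
        (i, c, r[c])
        for i, r in enumerate(rows, start=1)
        for c in non_negative
        if r[c] != "" and float(r[c]) < 0
    ]

    # Values with leading or trailing whitespace.
    issues["stray_whitespace"] = [
        (i, c, r[c])
        for i, r in enumerate(rows, start=1)
        for c in columns
        if r[c] != r[c].strip()
    ]

    # The same value written with different capitalisation.
    issues["mixed_casing"] = {}
    for c in columns:
        variants = defaultdict(set)
        for r in rows:
            value = r[c].strip()
            variants[value.casefold()].add(value)
        mixed = sorted(v for group in variants.values() if len(group) > 1 for v in group)
        if mixed:
            issues["mixed_casing"][c] = mixed

    return issues


issues = find_issues(sample_csv)
for name, found in issues.items():
    print(name, found)
```

Row numbers in the results count data rows from 1 and exclude the header. A report like this only flags candidates; whether a flagged value is really an error still needs a human decision.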

Additionally, your data should be kept tidy, a concept introduced in Hadley Wickham’s “Tidy Data”, published in the Journal of Statistical Software. Ultimately, clearer organisation makes it easier for users to understand and use your data.

Drawing from these principles, you should ensure:

  • All values within a column use the same unit of measurement

  • Each row represents one observation, that is, all the variable information collected on a single subject or participant. For example, if your dataset looks at people, one row might contain a single person’s height, weight, and age. While the actual measurements of different people may vary from row to row, what you are measuring in each row shouldn’t.

  • Each table has only one type of observation unit. If your data looks at the average height of a population, one observation unit could be the overall population, and another could be the population broken down by gender. In that case, you might have one table for the overall average height, and another for the average height of each gender.
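To make the last point concrete, here is a small sketch of splitting a table that mixes two observation units into two separate tables. The figures are invented, and split_by_observation_unit and the "overall" label are hypothetical names chosen for this example.

```python
# Hypothetical mixed table: overall and per-gender averages share one table.
mixed_table = [
    {"group": "overall", "avg_height_cm": 168.0},
    {"group": "female", "avg_height_cm": 162.0},
    {"group": "male", "avg_height_cm": 174.0},
]


def split_by_observation_unit(rows, overall_label="overall"):
    """Separate whole-population rows from per-gender rows."""
    overall = [
        {"avg_height_cm": r["avg_height_cm"]}
        for r in rows
        if r["group"] == overall_label
    ]
    by_gender = [
        {"gender": r["group"], "avg_height_cm": r["avg_height_cm"]}
        for r in rows
        if r["group"] != overall_label
    ]
    return overall, by_gender


overall_table, gender_table = split_by_observation_unit(mixed_table)
print(overall_table)
print(gender_table)
```

After the split, every row in each table describes the same kind of thing, so users do not need to filter out summary rows before analysing the data.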

  3. Granularity and precision

As far as possible, data should be raw and granular rather than aggregated or processed. For example, avoid publishing only percentages when the underlying figures are available.

Where totals and sub-totals are needed, they should be in separate tables. This applies, for example, in cases where aggregate numbers, such as totals and indices, cannot be derived from the granular data points.
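One reason to prefer raw figures over percentages is that users can always compute a percentage from counts, but cannot reliably go the other way. The sketch below uses invented survey counts to show this; the field names and the percentage_employed helper are assumptions for illustration only.

```python
# Hypothetical raw counts per district.
raw_counts = [
    {"district": "A", "respondents": 200, "employed": 150},
    {"district": "B", "respondents": 50, "employed": 20},
]


def percentage_employed(rows):
    """Compute the overall employment percentage from raw counts."""
    total = sum(r["respondents"] for r in rows)
    employed = sum(r["employed"] for r in rows)
    return 100 * employed / total


# Correct overall figure, derived from the raw counts.
overall_pct = percentage_employed(raw_counts)

# What a user would be left with if only percentages had been published.
district_pcts = [100 * r["employed"] / r["respondents"] for r in raw_counts]
naive_pct = sum(district_pcts) / len(district_pcts)

print(overall_pct, district_pcts, naive_pct)
```

With these numbers the overall percentage is 68.0, while averaging the two district percentages (75.0 and 40.0) gives 57.5, because the districts differ in size. Publishing the counts lets users arrive at the correct figure themselves.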

Get in touch

Our guidelines are always up for review to give our users the best experience possible. Have more feedback or questions? Contact us and we will reach out to you as soon as we can.
