Content quality guidelines

As an open data collective, we’re focused on providing quality data that users can not only explore, analyse or use for development, but derive value and meaning from. If you are uploading data, we’ve created a guide to help you get it ready for publishing.

Title of Dataset

Your title is what first attracts users to your data. To help users find the data they need, it is important that your title is not only accurate but easy to understand.

We recommend that titles:

  • Stick only to what is essential. As each title is accompanied by a description, your title should only convey what a user needs to know at a glance.

    • For collections, titles can give a broad picture of the datasets contained. You might consider adding slightly more details to the dataset files themselves.

    • You might title a collection, “Annual pet registration”, with the datasets, “Pet registration by type”, “Pet registration by location”, and “Pet registration by month”.

    • For individual datasets, while titles should be broad thematically, they should still be specific to the data being shared.

    • You might title an individual dataset, “Monthly pet registration by location”.

  • Avoid repeating information that already appears as separate data points.

    • Coverage years: Your chosen coverage period will be shown below your title. Unless differentiating between datasets in a collection, try to avoid adding these in your titles.

    • Agency name: While your agency will be shown as the data provider, you might consider adding this in only if you believe it is beneficial for users to know. This could be to avoid confusion if you are publishing data similar to another agency, for example.

      While it is possible to use the abbreviated form of your agency name, this should ideally be consistent across the datasets and collections you upload.

  • Be consistent. For example, if you add coverage years to dataset titles within a collection, do try to add years for all the datasets.

Description of your dataset

Your dataset description gives users a comprehensive look into what your data offers. Here are our suggested prompts:

  • What’s the dataset about? Give a short description of what your dataset contains using relevant keywords. As we use your description to provide search results to users, your dataset may be shown if a user searches for a keyword included in your description.

    You should also share anything you think is applicable, from variables and figures, to timeframes of the data.

  • Is there any context I need to provide? Briefly break down any relevant technical or industry-specific terms a user may not be familiar with. If needed, do provide reference links.

  • How was this data collected?

    Describe your data collection methodology, if possible. This helps users verify the quality of your data sources.

  • What story does this data tell? Lastly, do share what makes this data meaningful, from depicting trends to creating transparency! This will help users understand how better to leverage the data.

Column descriptions

Your column descriptions help users understand what exactly your data is about. You can think of these as definitions of your column headers. For users to make the most of your data, descriptions should clearly convey what each column represents. Here are our suggested prompts:

  • What does the column header mean within the context of the data?

    Describe any context that a user should know about the column data. Think about what the column values mean in relation to what the data looks at specifically or how your agency defines certain terms.

    For example, if the column represents “Languages spoken”, you might specify which languages your data looks at specifically. If you define “Languages spoken” as those that are spoken fluently only, you could also provide that information.

    It may also be important to share how you collected the data, especially if the column is derived from another data source.

  • Does your column header need additional definitions?

    Straightforward column headers may not need context specific to your organisation or data collection methods. However, it could still be helpful to generally define your header.

    For example, a column on Unique Entity Numbers (UEN) may be distinct enough for you to simply provide a definition for a UEN.

  • Do your column values need additional definitions?

    You might also consider providing extra information on your column values, especially if you are using abbreviations and contractions, or different formats.

    For example, if you have a column header “MRT stations”, you might use “AL”, “BE”, and “CA”, to represent “Aljunied”, “Bedok”, and “City Hall” respectively. It would be helpful to define each abbreviation in the description, especially if they are uncommon.

  • Do the values in your column have any known caveats?

    If there are known issues or caveats with the column, it would be helpful to share them. These could be anything of note, from external changes to discontinued variables, and more.

    For example, for a column header “Towns”, a caveat could be that Ang Mo Kio was split into Bishan, Toa Payoh, and Marymount.
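One way to confirm that every abbreviation in a column is covered by its description is to compare the column values against the definitions you plan to publish. The snippet below is a minimal sketch of that idea: the `STATION_CODES` mapping reuses the “MRT stations” example above, while the function name `undefined_codes` and the extra value “DG” are hypothetical and only there for illustration.

```python
# Hypothetical definitions taken from a column description:
# each abbreviation used in the column and what it stands for.
STATION_CODES = {
    "AL": "Aljunied",
    "BE": "Bedok",
    "CA": "City Hall",
}


def undefined_codes(values, definitions):
    """Return the sorted, de-duplicated values that have no entry in `definitions`."""
    return sorted({value for value in values if value not in definitions})


# Example column values; "DG" is an assumed value that was never defined.
column_values = ["AL", "BE", "CA", "BE", "DG"]
print(undefined_codes(column_values, STATION_CODES))  # prints ['DG']
```

Any value the check returns is a candidate for an extra definition in your column description.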

Data principles

While clear titles and descriptions are important, your data lies at the core of what users are looking for. To provide users with data that is easy and reliable to use, we have a few guiding principles.

  • Ensure machine readability

    Data should be in a format a computer can understand. This means relevant fields can be extracted and parsed without human input. As we are an open data collective, any data you upload must also be in a non-proprietary file format.

  • Check for errors and inconsistencies

    Your data should be free from errors and inconsistencies! This will go a long way in establishing trust, protecting the integrity of your data, and allowing users to combine multiple datasets, even across agencies.

    We strongly encourage correcting any issues that negatively affect data quality, such as:

    • Null values in columns. If all values in a column are null, consider removing the column.

    • Duplicate rows

    • Outliers. While having many outliers is not necessarily a cause for concern, impossible values should not appear. For example, there should be no negative values in a column that should only contain positive ones.

    • Inconsistent capitalisation in datasets. For example, if a value in a column is entered in all uppercase, all other values in that same column should also be entered in all uppercase.

      Consistency in capitalisation is important, as the same value appearing more than once in different casing may be considered different values altogether. For simplicity, we suggest sticking to one capitalisation format.

    • Inconsistent spacing in column values. This could be entering “hello” without a space at the end, and “hello ” with a space at the end, for example.

    Additionally, your data should be kept tidy, a concept introduced in Hadley Wickham’s “Tidy Data”, published in the Journal of Statistical Software. Ultimately, clearer organisation makes it easier for users to understand and use your data.

    Drawing from these principles, you should ensure:

    • All values within a column use the same unit of measurement

    • Each row makes one observation. This is all the variable information collected on a single subject or participant. For example, if your dataset looks at people, one row might contain a single person’s height, weight, and age. While the actual measurements of different people may vary from row to row, what you are measuring in each row shouldn’t.

    • Each table only has one type of observation unit. If your data looks at the average height of a population, you could have one observation unit as the overall population, and another as the population of different genders. This means you might have one table for overall average height, and one for average height of different genders.

  • Granularity and precision

    As far as possible, data should be raw and granular rather than aggregated or processed. For example, share the underlying counts instead of percentages.

    If totals and sub-totals are needed, they should be in separate tables. This applies, for example, where aggregate numbers, such as totals and indices, cannot be derived from the granular data points.
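The checks listed under “Check for errors and inconsistencies” can be run before you upload. The script below is a rough sketch rather than an official validation tool: it uses only the Python standard library, and the function name `find_quality_issues`, the `non_negative_columns` parameter, and the sample data are all assumptions made for illustration. It treats empty strings as null values.

```python
import csv
import io
from collections import Counter, defaultdict


def find_quality_issues(rows, non_negative_columns=()):
    """Return a list of human-readable issues found in `rows` (a list of dicts)."""
    issues = []
    if not rows:
        return issues
    columns = list(rows[0].keys())

    # Columns where every value is empty (treated here as null).
    for col in columns:
        if all((row[col] or "").strip() == "" for row in rows):
            issues.append(f"column '{col}' is entirely null")

    # Exact duplicate rows.
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    for values, n in counts.items():
        if n > 1:
            issues.append(f"row {values} appears {n} times")

    # Impossible values: negatives in columns that should never be negative.
    for col in non_negative_columns:
        for i, row in enumerate(rows, start=1):
            try:
                if float(row[col]) < 0:
                    issues.append(f"row {i}: negative value {row[col]} in '{col}'")
            except (TypeError, ValueError):
                pass  # empty or non-numeric values are not covered by this check

    # Values that differ only by capitalisation or surrounding spaces.
    for col in columns:
        variants = defaultdict(set)
        for row in rows:
            value = row[col] or ""
            variants[value.strip().lower()].add(value)
        for key, seen in variants.items():
            if key and len(seen) > 1:
                issues.append(f"column '{col}' has inconsistent forms: {sorted(seen)}")

    return issues


# Made-up sample data containing one instance of each issue.
SAMPLE_CSV = """town,pet_type,registrations,remarks
Bedok,dog,12,
bedok,cat,7,
Bedok ,bird,-3,
Bedok,dog,12,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for issue in find_quality_issues(rows, non_negative_columns=["registrations"]):
    print(issue)
```

On the sample data, this reports the empty “remarks” column, the repeated row, the negative registration count, and the three different spellings of “Bedok”. Statistical outliers that are unusual but possible are deliberately left out, as they need judgement rather than a fixed rule.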

Changelog

Updates to your data matter to users! While regular changes to data—such as additions, removals, and updates—are automatically logged, we’ve introduced a changelog for you to share changes to your dataset schemas and collection methodologies.

When logging changes, do be as specific as possible. A clear picture of your updates can help users make proper use of your data!

Get in touch

Our guidelines are always up for review to give our users the best experience possible. Have more feedback or questions? Contact us and we will reach out to you as soon as we can.
