Datasets

An Open Data Blend Dataset is a collection of analytics-ready data files packaged with rich metadata.

You can access Open Data Blend Datasets in two ways:

  • Open Data Blend Dataset UI:

    • Targets the broader community of information workers

    • For ad-hoc dataset acquisition and evaluation

    • Reading the documentation is optional

  • Open Data Blend Dataset API:

    • Targets technical individuals who are comfortable with code

    • For integrating datasets with a broader solution

    • Reading the documentation is recommended

Data Files

To maximise the accessibility and usefulness of the data that we publish, we support the three most popular open data file formats:

  • Compressed (Gzip) CSV

  • Apache ORC

  • Apache Parquet

Uncompressed CSV data files are available only as previews containing the top 100 rows. The full versions of CSV data files are always Gzip compressed to reduce download times and save disk space.
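As a minimal sketch, a Gzip-compressed CSV data file can be read with the Python standard library without first decompressing it to disk. The file name, the column names, and the `read_preview` helper below are hypothetical; the sample file is written locally only so that the example runs on its own.

```python
import csv
import gzip
import os
import tempfile


def read_preview(path, limit=100):
    """Return up to `limit` data rows from a Gzip-compressed CSV as dicts."""
    rows = []
    # "rt" reads the compressed stream as text; newline="" is what the csv module expects.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            if index >= limit:
                break
            rows.append(row)
    return rows


# Build a small stand-in file (hypothetical name and columns) so the sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "example_data_file.csv.gz")
with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "name"])
    for i in range(250):
        writer.writerow([i, f"item-{i}"])

preview = read_preview(sample_path)
print(len(preview), preview[0])
```

Because the rows are streamed, only the requested number of rows is held in memory, which can help when a full data file is large.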

Which data file format you choose will often depend on the tools and platforms that you intend to use it with.

A Gzip CSV or ORC data file is typically 40-50% smaller than the corresponding Parquet version. Only the ORC and Parquet data files are optimal for interactive analytical workloads.
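To make the size comparison concrete, the sketch below computes how much smaller one data file is than another. The `percent_smaller` helper is hypothetical, and the byte counts are illustrative values, not measurements of real Open Data Blend data files.

```python
def percent_smaller(size_bytes, reference_bytes):
    """Return how much smaller `size_bytes` is than `reference_bytes`, as a percentage."""
    if reference_bytes <= 0:
        raise ValueError("reference_bytes must be positive")
    return (reference_bytes - size_bytes) / reference_bytes * 100


# Illustrative sizes only (assumed, not measured).
parquet_bytes = 200_000_000
gzip_csv_bytes = 110_000_000

reduction = percent_smaller(gzip_csv_bytes, parquet_bytes)
print(f"Gzip CSV is {reduction:.0f}% smaller than Parquet")
```

For files that you have already downloaded, `os.path.getsize` can supply the real byte counts.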

Below are some examples of format choices for common platforms and tools:

Platform or Tool  | Supported Formats  | Chosen Format
Apache Spark      | CSV, ORC, Parquet  | Parquet
Apache Hive       | CSV, ORC, Parquet  | ORC
Presto            | CSV, ORC, Parquet  | ORC
Power BI Desktop  | CSV, Parquet       | CSV
Python            | CSV, ORC, Parquet  | Parquet
R                 | CSV, Parquet       | Parquet
Tableau Desktop   | CSV                | CSV

In the table above, 'CSV' refers to both uncompressed and compressed (Gzip) CSV data files.
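If the format choice needs to be made in code, the examples in the table above can be captured as a simple lookup. This is only a sketch: the `FORMAT_CHOICES` mapping and the two helper functions are hypothetical names, and the values are transcribed from the table rather than taken from any API.

```python
# Supported and chosen formats per platform or tool, transcribed from the table above.
FORMAT_CHOICES = {
    "Apache Spark": (("CSV", "ORC", "Parquet"), "Parquet"),
    "Apache Hive": (("CSV", "ORC", "Parquet"), "ORC"),
    "Presto": (("CSV", "ORC", "Parquet"), "ORC"),
    "Power BI Desktop": (("CSV", "Parquet"), "CSV"),
    "Python": (("CSV", "ORC", "Parquet"), "Parquet"),
    "R": (("CSV", "Parquet"), "Parquet"),
    "Tableau Desktop": (("CSV",), "CSV"),
}


def chosen_format(tool):
    """Return the example format choice for a platform or tool."""
    _supported, chosen = FORMAT_CHOICES[tool]
    return chosen


def tools_supporting(fmt):
    """Return the platforms and tools that list `fmt` as supported, sorted by name."""
    return sorted(
        tool for tool, (supported, _chosen) in FORMAT_CHOICES.items() if fmt in supported
    )


print(chosen_format("Apache Hive"))
print(tools_supporting("ORC"))
```

Keeping the supported formats alongside the chosen one makes it straightforward to fall back to another format when the preferred one is not available.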
