Datasets
An Open Data Blend Dataset is a collection of analytics-ready data files packaged with rich metadata.
You can access Open Data Blend Datasets in two ways:
Open Data Blend Dataset UI:
Targets the broader community of information workers
For ad-hoc dataset acquisition and evaluation
Reading the documentation is optional
Open Data Blend Dataset API:
Targets technical individuals who are comfortable with code
For integrating datasets with a broader solution
Reading the documentation is recommended
Data Files
To maximise the accessibility and usefulness of the data that we publish, we support the three most popular open data file formats:
Compressed (Gzip) CSV
Apache ORC
Apache Parquet
Uncompressed CSV data files are only available as previews containing the top 100 rows. Full CSV data files are always Gzip compressed to reduce download times and save disk space.
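As a rough illustration of working with the compressed CSV files, the sketch below reads a Gzip CSV using only the Python standard library. The helper name `read_gzip_csv`, the sample columns, and the UTF-8 encoding are assumptions made for this example; they are not taken from any Open Data Blend data file.

```python
import csv
import gzip
import io


def read_gzip_csv(path_or_file, limit=None):
    """Return rows from a Gzip-compressed CSV as a list of dicts.

    Hypothetical helper: accepts a file path or a binary file object.
    UTF-8 encoding is assumed here; adjust it if the data file differs.
    """
    # Mode "rt" decompresses and decodes to text in one step.
    with gzip.open(path_or_file, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        rows = []
        for index, row in enumerate(reader):
            # Stop early when only a preview of the first rows is wanted.
            if limit is not None and index >= limit:
                break
            rows.append(row)
        return rows


# Build a tiny in-memory Gzip CSV so the sketch runs without a download.
sample_csv = "id,name\n1,alpha\n2,beta\n3,gamma\n"
buffer = io.BytesIO(gzip.compress(sample_csv.encode("utf-8")))

preview = read_gzip_csv(buffer, limit=2)
print(preview)
```

Passing `limit=100` would mimic the size of the uncompressed previews, while omitting `limit` reads the whole file.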
Which data file format you use will often be driven by the tools and platforms that you intend to use it with.
A Gzip CSV or ORC data file is typically 40-50% smaller than the corresponding Parquet version. Only the ORC and Parquet data files are optimal for interactive analytical workloads.
Below are some examples of file format choices for common platforms and tools:
| Platform or Tool | Supported Formats | Chosen Format |
| --- | --- | --- |
| Apache Spark | CSV, ORC, Parquet | Parquet |
| Apache Hive | CSV, ORC, Parquet | ORC |
| Presto | CSV, ORC, Parquet | ORC |
| Power BI Desktop | CSV, Parquet | CSV |
| Python | CSV, ORC, Parquet | Parquet |
| R | CSV, Parquet | Parquet |
| Tableau Desktop | CSV | CSV |
In the table above, 'CSV' refers to both uncompressed and compressed (Gzip) CSV data files.
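For anyone scripting dataset downloads, the example choices in the table above can be captured as a small lookup. This is only a sketch: `FORMAT_CHOICES`, `choose_format`, and `tools_supporting` are hypothetical names introduced here, and the values are transcribed from the table, not taken from any Open Data Blend API.

```python
# Supported formats and the example choice per tool, transcribed from the table.
FORMAT_CHOICES = {
    "Apache Spark": (("CSV", "ORC", "Parquet"), "Parquet"),
    "Apache Hive": (("CSV", "ORC", "Parquet"), "ORC"),
    "Presto": (("CSV", "ORC", "Parquet"), "ORC"),
    "Power BI Desktop": (("CSV", "Parquet"), "CSV"),
    "Python": (("CSV", "ORC", "Parquet"), "Parquet"),
    "R": (("CSV", "Parquet"), "Parquet"),
    "Tableau Desktop": (("CSV",), "CSV"),
}


def choose_format(tool):
    """Return the example format choice for a tool (hypothetical helper)."""
    if tool not in FORMAT_CHOICES:
        raise ValueError(f"No example format choice recorded for {tool!r}")
    _supported, chosen = FORMAT_CHOICES[tool]
    return chosen


def tools_supporting(file_format):
    """Return the tools from the table that list the given format as supported."""
    return sorted(
        tool
        for tool, (supported, _chosen) in FORMAT_CHOICES.items()
        if file_format in supported
    )


print(choose_format("Apache Hive"))
print(tools_supporting("ORC"))
```

The choices are examples only; a tool that supports several formats may still be better served by a different one, depending on the workload.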